{"id":118,"date":"2014-10-16T21:26:25","date_gmt":"2014-10-16T21:26:25","guid":{"rendered":"http:\/\/opentextbc.ca\/introductorybusinessstatistics\/?post_type=chapter&#038;p=118"},"modified":"2019-06-06T20:32:59","modified_gmt":"2019-06-06T20:32:59","slug":"regression-basics-2","status":"publish","type":"chapter","link":"https:\/\/opentextbc.ca\/introductorybusinessstatistics\/chapter\/regression-basics-2\/","title":{"raw":"Chapter 8. Regression Basics","rendered":"Chapter 8. Regression Basics"},"content":{"raw":"Regression analysis, like most multivariate statistics, allows you to infer that there is a relationship between two or more variables. These relationships are seldom exact because there is variation caused by many variables, not just the variables being studied.\r\n\r\nIf you say that students who study more make better grades, you are really hypothesizing that there is a positive relationship between one variable, studying, and another variable, grades. You could then complete your inference and test your hypothesis by gathering a sample of (amount studied, grades) data from some students and using regression to see if the relationship in the sample is strong enough to safely infer that there is a relationship in the population. Notice that even if students who study more make better grades, the relationship in the population would not be perfect; the same amount of studying will not result in the same grades for every student (or for one student every time). Some students are taking harder courses, like chemistry or statistics; some are smarter; some study effectively; and some get lucky and find that the professor has asked them exactly what they understood best. For each level of amount studied, there will be a distribution of grades. 
If there is a relationship between studying and grades, the location of that distribution of grades will change in an orderly manner as you move from lower to higher levels of studying.\r\n\r\nRegression analysis is one of the most used and most powerful multivariate statistical techniques, for it infers the existence and form of a functional relationship in a population. Once you learn how to use regression, you will be able to estimate the parameters \u2014 the slope and intercept \u2014 of the function that\u00a0links two or more variables. With that estimated function, you will be able to infer or forecast things like unit costs, interest rates, or sales over a wide range of conditions. Though the simplest regression techniques seem limited in their applications, statisticians have developed a number of variations on regression that\u00a0greatly expand the usefulness of the technique. In this chapter, the basics will be discussed. Once again, the t-distribution and F-distribution will be used to test hypotheses.\r\n<h1>What is regression?<\/h1>\r\nBefore starting to learn about regression, go back to algebra and review what a function is. The definition of a function can be formal, like the one in my freshman calculus text: \"A function is a set of ordered pairs of numbers (<em>x<\/em>,<em>y<\/em>) such that to each value of the first variable (<em>x<\/em>) there corresponds a unique value of the second variable (<em>y<\/em>)\" (Thomas, 1960).[footnote]Thomas, G.B. (1960). <em>Calculus and analytical geometry<\/em> (3rd ed.). Boston, MA: Addison-Wesley.[\/footnote]\u00a0More intuitively, if there is a regular relationship between two variables, there is usually a function that describes the relationship. Functions are written in a number of forms. The most general is <strong><em>y<\/em> = f(<em>x<\/em>)<\/strong>, which simply says that the value of <em>y<\/em> depends on the value of <em>x<\/em> in some regular fashion, though the form of the relationship is not specified. 
The simplest functional form is the linear function where:\r\n\r\n<a href=\"http:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2014\/10\/100000000000004900000015489223001.png\"><img src=\"http:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2014\/10\/100000000000004900000015489223001.png\" class=\"wp-image-105 size-full alignnone\" alt=\"Linear function\" height=\"21\" width=\"73\" \/><\/a>\r\n\r\n<em>\u03b1<\/em> and <em>\u03b2<\/em> are parameters, remaining constant as <em>x<\/em> and <em>y<\/em> change. <em>\u03b1<\/em> is the intercept and <em>\u03b2<\/em> is the slope. If the values of\u00a0<em>\u03b1<\/em> and <em>\u03b2<\/em> are known, you can find the <em>y<\/em> that goes with any <em>x<\/em> by putting the <em>x<\/em> into the equation and solving. There can be functions where one variable depends on the values of two or more other variables, such as <em>y<\/em> = f(<em>x<sub>1<\/sub><\/em>, <em>x<sub>2<\/sub><\/em>), where\u00a0<em>x<sub>1<\/sub><\/em>\u00a0and\u00a0<em>x<sub>2<\/sub><\/em>\u00a0together determine the value of <em>y<\/em>. There can also be non-linear functions, where the value of the dependent variable (<em><strong>y<\/strong><\/em> in all of the examples we have used so far) depends on the values of one or more other variables, but the values of the other variables are squared, or taken to some other power or root or multiplied together, before the value of the dependent variable is determined. Regression allows you to estimate directly the parameters in linear functions only, though there are tricks that\u00a0allow many non-linear functional forms to be estimated indirectly. Regression also allows you to test to see if there is a functional relationship between the variables, by testing the hypothesis that each of the slopes has a value of zero.\r\n\r\nFirst, let us consider the simple case of a two-variable function. 
You believe that <em>y<\/em>, the dependent variable, is a linear function of <em>x<\/em>, the independent variable \u2014 <em>y<\/em> depends on <em>x<\/em>. Collect a sample of (<em>x<\/em>, <em>y<\/em>) pairs, and plot them on a set of <em>x<\/em>, <em>y<\/em> axes. The basic idea behind regression is to find the equation of the straight line that comes as close as possible to as many of the points as possible. The parameters of the line drawn through the sample are unbiased estimators of the parameters of the line that would come as close as possible to as many of the points as possible in the population, if the population had been gathered and plotted. In keeping with the convention of using Greek letters for population values and Roman letters for sample values, the line drawn through a population is:\r\n\r\n<a href=\"http:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2014\/10\/100000000000004900000015489223001.png\"><img src=\"http:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2014\/10\/100000000000004900000015489223001.png\" class=\"alignnone wp-image-105 size-full\" alt=\"Linear function\" height=\"21\" width=\"73\" \/><\/a>\r\n\r\nwhile the line drawn through a sample is:\r\n\r\n<em>y<\/em> = <em>a<\/em> + <em>bx<\/em>\r\n\r\nIn most cases, even if the whole population had been gathered, the regression line would not go through every point. Most of the phenomena that business researchers deal with are not perfectly deterministic, so no function will perfectly predict or explain every observation.\r\n\r\nImagine that you wanted to study the estimated price for a one-bedroom apartment in Nelson, BC. You decide to estimate the price\u00a0as a function of its location in relation to downtown. If you collected 12 sample pairs, you would find different apartments located at the same distance from downtown. 
In other words, you might draw a distribution of prices for apartments located at each given distance from downtown. When you use regression to estimate the parameters of price = f(distance), you are estimating the parameters of the line that connects the mean price\u00a0at each location. Because the best that can be expected is to predict the mean price\u00a0for a certain location, researchers often write their regression models with an extra term, the <strong>error term<\/strong>, which notes that many of the members of the population of (location, price of apartment) pairs will not have exactly the predicted price\u00a0because many of the points do not lie directly on the regression line. The error term is usually denoted as <strong><em>\u03b5<\/em><\/strong>, or <strong>epsilon<\/strong>, and you often see regression equations written:\r\n\r\n<a href=\"http:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2014\/10\/1000000000000061000000157E0FBF2F1.png\"><img src=\"http:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2014\/10\/1000000000000061000000157E0FBF2F1.png\" alt=\"Regression equation\" class=\" wp-image-107 size-full alignnone\" height=\"21\" width=\"97\" \/><\/a>\r\n\r\nStrictly, the distribution of <em>\u03b5<\/em> at each location\u00a0must be normal, and the distributions of <em>\u03b5<\/em> for all the locations\u00a0must have the same variance (this is known as homoscedasticity to statisticians).\r\n<h1>Simple regression and least squares method<\/h1>\r\nIn estimating the unknown parameters of the population for the regression line, we need to apply a method by which the vertical distances between the yet-to-be estimated regression line and the observed values in our sample are minimized. 
This minimized distance is called <em>sample error,<\/em> though it is more commonly referred to as <em>residual<\/em> and denoted by <em>e.\u00a0<\/em>In more mathematical form, the residual in each pair of observations for <em>x<\/em> and <em>y<\/em> is the difference between the observed <em>y<\/em> and its predicted value. Obviously, some of these residuals will be positive (above the estimated line) and others will be negative\u00a0(below the line). If we square each of these residuals, so that the positive and negative values cannot cancel each other out, and then add them up over the sample, we can write the following criterion for our minimization problem:\r\n\r\n<a href=\"http:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/03\/image71.png\"><img src=\"http:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/03\/image71.png\" alt=\"image7\" class=\" wp-image-320 alignnone\" height=\"45\" width=\"139\" \/><\/a>\r\n\r\n<em>S<\/em> is the sum of squares of the residuals. 
By minimizing <em>S<\/em> over any given set of observations for <em>x<\/em>\u00a0and <em>y<\/em>, we will get the following useful formula:\r\n\r\n$latex b=\\frac{\\sum{(x-\\bar{x})(y-\\bar{y})}}{\\sum{(x-\\bar{x})^2}}$\r\n\r\nAfter computing the value of <em>b<\/em> from the above formula using our sample data and the means of <em>x<\/em>\u00a0and <em>y<\/em>, one can simply recover the intercept of the estimated line using the following equation:\r\n\r\n<a href=\"http:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/03\/image9.png\"><img src=\"http:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/03\/image9.png\" alt=\"image9\" class=\" size-full wp-image-323 alignnone\" height=\"32\" width=\"140\" \/><\/a>\r\n\r\nFor the sample data, and given the estimated intercept and slope, for each observation we can define a residual as:\r\n\r\n<a href=\"http:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/03\/image111.png\"><img src=\"http:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/03\/image111.png\" alt=\"image11\" class=\" size-full wp-image-326 alignnone\" height=\"32\" width=\"253\" \/><\/a>\r\n\r\nWith these estimated values for intercept and slope, we can draw the estimated line along with all the sample data in a <em>y<\/em>-<em>x<\/em> panel. Such graphs are known as scatter diagrams. Consider our analysis of the price of one-bedroom apartments in Nelson, BC. 
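The two estimation formulas above translate directly into a short computation. The following Python sketch (the function name and toy data are mine, for illustration only) estimates the slope and intercept for any sample of (x, y) pairs:

```python
# Least-squares estimates of the slope (b) and intercept (a) for paired data.
# A minimal sketch of the two formulas above; names are illustrative only.

def least_squares(x, y):
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # b = sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2)
    num = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    den = sum((xi - x_bar) ** 2 for xi in x)
    b = num / den
    # a = y_bar - b * x_bar
    a = y_bar - b * x_bar
    return a, b

# A perfectly linear toy sample recovers its own line exactly:
a, b = least_squares([1, 2, 3], [2, 4, 6])
```

For this toy sample the estimates are a = 0 and b = 2; real data, like the apartment sample that follows, will not fit any line perfectly.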
We would collect data for <em>y=<\/em>price of one bedroom apartment, <em>x<sub>1<\/sub><\/em>=its associated distance from downtown, and <em>x<sub>2<\/sub><\/em>=the size of the apartment, as shown in Table 8.1.\r\n<table><caption>Table 8.1 Data for Price, Size, and Distance of Apartments in Nelson, BC<\/caption>\r\n<tbody>\r\n<tr>\r\n<td colspan=\"3\"><em>y<\/em> = price of apartments in $1000\r\n<em>x<sub>1<\/sub><\/em> = distance of each apartment from downtown in kilometres\r\n<em>x<sub>2<\/sub><\/em> = size of the apartment in square feet<\/td>\r\n<\/tr>\r\n<tr>\r\n<td><strong>y<\/strong><\/td>\r\n<td><strong>x<sub>1<\/sub><\/strong><\/td>\r\n<td><strong>x<sub>2<\/sub><\/strong><\/td>\r\n<\/tr>\r\n<tr>\r\n<td>55<\/td>\r\n<td>1.5<\/td>\r\n<td>350<\/td>\r\n<\/tr>\r\n<tr>\r\n<td>51<\/td>\r\n<td>3<\/td>\r\n<td>450<\/td>\r\n<\/tr>\r\n<tr>\r\n<td>60<\/td>\r\n<td>1.75<\/td>\r\n<td>300<\/td>\r\n<\/tr>\r\n<tr>\r\n<td>75<\/td>\r\n<td>1<\/td>\r\n<td>450<\/td>\r\n<\/tr>\r\n<tr>\r\n<td>55.5<\/td>\r\n<td>3.1<\/td>\r\n<td>385<\/td>\r\n<\/tr>\r\n<tr>\r\n<td>49<\/td>\r\n<td>1.6<\/td>\r\n<td>210<\/td>\r\n<\/tr>\r\n<tr>\r\n<td>65<\/td>\r\n<td>2.3<\/td>\r\n<td>380<\/td>\r\n<\/tr>\r\n<tr>\r\n<td>61.5<\/td>\r\n<td>2<\/td>\r\n<td>600<\/td>\r\n<\/tr>\r\n<tr>\r\n<td>55<\/td>\r\n<td>4<\/td>\r\n<td>450<\/td>\r\n<\/tr>\r\n<tr>\r\n<td>45<\/td>\r\n<td>5<\/td>\r\n<td>325<\/td>\r\n<\/tr>\r\n<tr>\r\n<td>75<\/td>\r\n<td>0.65<\/td>\r\n<td>424<\/td>\r\n<\/tr>\r\n<tr>\r\n<td>65<\/td>\r\n<td>2<\/td>\r\n<td>285<\/td>\r\n<\/tr>\r\n<\/tbody>\r\n<\/table>\r\nThe graph (shown in Figure 8.1) is a scatter plot of the prices of the apartments and their distances from downtown, along with a proposed regression line.\r\n\r\n[caption id=\"attachment_1235\" align=\"aligncenter\" width=\"730\"]<a href=\"http:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/11\/Figure8-1.png\"><img 
src=\"http:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/11\/Figure8-1.png\" alt=\"Figure8-1\" class=\"wp-image-1235\" height=\"350\" width=\"730\" \/><\/a> Figure 8.1 Scatter Plot of Price, Distance from Downtown, along with a Proposed Regression Line[\/caption]\r\n\r\nIn order to plot such a scatter diagram, you can use many available statistical software packages including Excel, SAS, and Minitab.\u00a0In this scatter diagram, the fitted simple regression line has a negative slope. The estimated equation for this scatter diagram from Excel is:\r\n\r\n<a href=\"http:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/03\/image131.png\"><img src=\"http:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/03\/image131.png\" alt=\"image13\" class=\" size-full wp-image-329 alignnone\" height=\"30\" width=\"144\" \/><\/a>\r\n\r\nwhere <em>a<\/em>=71.84 and\u00a0<em>b<\/em>=-5.38. In other words, for every additional kilometre from downtown an apartment is located, the price of the apartment is estimated to be $5380 cheaper, i.e.\u00a05.38*$1000=$5380. One might also be curious about the fitted values out of this estimated model. You can simply plug the actual value for <em>x<\/em> into the estimated line, and find the fitted values for the prices of the apartments. 
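Finding those fitted values is just a matter of plugging each distance into the estimated line. A quick sketch using the Excel estimates quoted above and the distances from Table 8.1:

```python
# Fitted prices from the estimated line: price-hat = 71.84 - 5.38 * distance.
# Coefficients are the Excel estimates quoted above; distances are from Table 8.1.
a, b = 71.84, -5.38
distances = [1.5, 3, 1.75, 1, 3.1, 1.6, 2.3, 2, 4, 5, 0.65, 2]
fitted = [a + b * x for x in distances]
# e.g. an apartment 1.5 km from downtown has a fitted price of
# 71.84 - 5.38 * 1.5 = 63.77, i.e. about $63,770.
```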
The residuals for all 12 observations are shown in Figure 8.2.\r\n\r\n[caption id=\"attachment_845\" align=\"aligncenter\" width=\"107\"]<a href=\"http:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/07\/Residuals_Simple-Regression.png\"><img src=\"http:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/07\/Residuals_Simple-Regression.png\" alt=\"Residuals_Simple Regression\" class=\"wp-image-845 size-full\" height=\"547\" width=\"107\" \/><\/a> Figure 8.2[\/caption]\r\n\r\nYou should also notice\u00a0that by minimizing the errors, you have not eliminated them; rather, the method of least squares only guarantees the <em>best-fitting<\/em> estimated regression line for the sample data.\r\n\r\nThe remaining errors are a reminder that there are other relevant factors that have not been included in our regression model and that are responsible for the leftover fluctuations. By adding these excluded but relevant factors to the model, we would expect the remaining errors to shrink. In determining the price of these apartments, the missing factors may include age of the apartment, size, etc. Because this type of regression model uses only one explanatory variable and assumes a linear relationship, it is known as a simple linear regression model.\r\n<h2>Testing your regression: does <em>y<\/em> really depend on <em>x<\/em>?<\/h2>\r\nUnderstanding that there is a distribution of <em>y<\/em> (apartment price) values at each <em>x<\/em> (distance) is the key for understanding how regression results from a sample can be used to test the hypothesis that there is (or is not) a relationship between <em>x<\/em> and <em>y<\/em>. 
When you hypothesize that <em>y<\/em> = f(<em>x<\/em>), you hypothesize that the slope of the line (<em>\u03b2<\/em> in <em>y<\/em> = <em>\u03b1<\/em> + <em>\u03b2x<\/em> + <em>\u03b5<\/em>) is not equal to zero. If <em>\u03b2<\/em> were equal to zero, changes in <em>x<\/em> would not cause any change in <em>y<\/em>. Choosing a sample of apartments, and finding each apartment\u2019s distance to downtown, gives you a sample of (<em>x<\/em>, <em>y<\/em>). Finding the equation of the line that best fits the sample will give you a sample intercept, <em>a<\/em>, and a sample slope, <em>b<\/em>. These sample statistics are unbiased estimators of the population intercept, <em>\u03b1<\/em>, and slope, <em>\u03b2<\/em>. If another sample of the same size is taken, another sample equation could be generated. If many samples are taken, a sampling distribution of sample <em>b<\/em>\u2019s, the slopes of the sample lines, will be generated. Statisticians know that this sampling distribution of <em>b<\/em>\u2019s will be normal with a mean equal to <em>\u03b2<\/em>, the population slope. Because the standard deviation of this sampling distribution is seldom known, statisticians developed a method to estimate it from a single sample. 
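The idea of a sampling distribution of slopes can be illustrated with a small simulation (a sketch with made-up population parameters, not data from the text): draw many samples from a population with a known β, estimate b for each, and watch the b's centre on β.

```python
# Simulate many samples from a population where y = 10 + 2x + epsilon,
# fit a least-squares slope to each sample, and check that the sample
# b's centre on the true population slope beta = 2.
import random

random.seed(1)
alpha_true, beta_true = 10.0, 2.0
x = list(range(30))                 # fixed x values for every sample
x_bar = sum(x) / len(x)
den = sum((xi - x_bar) ** 2 for xi in x)

slopes = []
for _ in range(500):
    # each sample gets fresh normal errors (mean 0, sd 1)
    y = [alpha_true + beta_true * xi + random.gauss(0, 1) for xi in x]
    y_bar = sum(y) / len(y)
    b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / den
    slopes.append(b)

mean_b = sum(slopes) / len(slopes)
# mean_b comes out very close to beta_true = 2
```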
With this estimated <em>s<sub>b<\/sub><\/em>, a t-statistic for each sample can be computed:\r\n\r\n<a href=\"http:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2014\/10\/100000000000009F0000002BE45013371.png\"><img src=\"http:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2014\/10\/100000000000009F0000002BE45013371.png\" alt=\"T-statistic\" class=\"alignnone wp-image-108 size-full\" height=\"43\" width=\"159\" \/><\/a>\r\n\r\nwhere <em>n<\/em> = sample size\r\n\r\n<em>m<\/em> = number of explanatory (<em>x<\/em>) variables\r\n\r\n<em>b<\/em> = sample slope\r\n\r\n<em>\u03b2<\/em>= population slope\r\n\r\n<em>s<sub>b<\/sub><\/em> = estimated standard deviation of b\u2019s, often called the <strong>standard error<\/strong>\r\n\r\nThese <em>t<\/em>\u2019s follow the t-distribution in the tables with <em>n<\/em>-<em>m<\/em>-1 df.\r\n\r\nComputing <em>s<sub>b<\/sub><\/em> is tedious, and is almost always left to a computer, especially when there is more than one explanatory variable. The estimate is based on how much the sample points vary from the regression line. If the points in the sample are not very close to the sample regression line, it seems reasonable that the population points are also widely scattered around the population regression line and different samples could easily produce lines with quite varied slopes. Though there are other factors involved, in general when the points in the sample are farther from the regression line, <em>s<sub>b<\/sub><\/em> is greater. Rather than learn how to compute <em>s<sub>b<\/sub><\/em>, it is more useful for you to learn how to find it on the regression results that you get from statistical software. It is often called the standard error and there is one for each independent variable. 
The printout in Figure 8.3 is typical.\r\n\r\n[caption id=\"attachment_350\" align=\"aligncenter\" width=\"608\"]<a href=\"http:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/03\/Simple_Regression.png\"><img src=\"http:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/03\/Simple_Regression.png\" alt=\"Simple_Regression\" class=\" wp-image-350\" height=\"312\" width=\"608\" \/><\/a> Figure 8.3 Typical Statistical Package Output for Linear Simple Regression Model[\/caption]\r\n\r\nYou will need these standard errors in order to test to see if <em>y<\/em> depends on <em>x<\/em> or not. You want to test to see if the slope of the line in the population, <em>\u03b2<\/em>, is equal to zero or not. If the slope equals zero, then changes in <em>x<\/em> do not result in any change in <em>y<\/em>. Formally, for each independent variable, you will have a test of the hypotheses:\r\n\r\n$latex H_o: \\beta = 0 $\r\n\r\n$latex H_a: \\beta \\neq 0 $\r\n\r\nIf the t-score is large (either negative or positive), then the sample <em>b<\/em> is far from zero (the hypothesized <em>\u03b2<\/em>), and <em>H<sub>a<\/sub><\/em> should be accepted. Substitute zero for <em>\u03b2<\/em> into the t-score equation, and if the t-score is small, <em>b<\/em> is close enough to zero to accept <em>H<sub>o<\/sub><\/em>. 
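For the distance-only apartment model, every quantity in the t-score can be rebuilt from the Table 8.1 data. The sketch below does the arithmetic that a statistical package would report in a printout like Figure 8.3 (variable names are mine):

```python
# t-score for H0: beta = 0 in the simple regression of price on distance.
# Data from Table 8.1; t = (b - 0) / s_b, where
# s_b = sqrt( SSE / (n - m - 1) / sum((x - x_bar)^2) ).
x = [1.5, 3, 1.75, 1, 3.1, 1.6, 2.3, 2, 4, 5, 0.65, 2]    # distance (km)
y = [55, 51, 60, 75, 55.5, 49, 65, 61.5, 55, 45, 75, 65]  # price ($1000)
n, m = len(x), 1

x_bar, y_bar = sum(x) / n, sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
a = y_bar - b * x_bar

sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))  # residual SS
s_b = (sse / (n - m - 1) / sxx) ** 0.5                       # standard error
t = (b - 0) / s_b
# |t| comfortably exceeds 2.228, the two-tail critical value for
# alpha = .05 with n - m - 1 = 10 df, so H0: beta = 0 is rejected.
```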
To find out what t-value separates \"close to zero\" from \"far from zero\", choose an alpha, find the degrees of freedom, and use a t-table from any textbook, or simply use the interactive Excel template from <a href=\"http:\/\/opentextbc.ca\/introductorybusinessstatistics\/chapter\/making-estimates-2\/\">Chapter 3<\/a>, which is shown again in Figure 8.4.\r\n\r\n<iframe src=\"https:\/\/onedrive.live.com\/embed?cid=0B27F889BFE551E2&amp;resid=B27F889BFE551E2%21192&amp;authkey=AEj42yeNIcfbMB0&amp;em=2&amp;wdAllowInteractivity=False&amp;AllowTyping=True&amp;wdHideGridlines=True&amp;wdHideHeaders=True\" width=\"100%\" height=\"600\"><\/iframe>\r\nFigure 8.4 Interactive Excel Template for Determining t-Value from the t-Table - see Appendix 8.\r\n\r\nRemember to halve alpha\u00a0when conducting a two-tail test like this. The degrees of freedom equal <em>n - m<\/em> -1, where <em>n<\/em> is the size of the sample and <em>m<\/em> is the number of independent <em>x<\/em> variables. There is a separate hypothesis test for each independent variable. This means you test to see if <em>y<\/em> is a function of each <em>x<\/em> separately. You can also test to see if <em>\u03b2<\/em> &gt; 0 (or <em>\u03b2<\/em> &lt; 0) rather than <em>\u03b2<\/em> \u2260 0 by using a one-tail test, or test to see if <em>\u03b2<\/em> equals a particular value by substituting that value for <em>\u03b2<\/em> when computing the sample t-score.\r\n<h2>Testing your regression: does this equation really help predict?<\/h2>\r\nTo test to see if the regression equation really helps, see how much of the error that would be made using the mean of all of the <em>y<\/em>\u2019s to predict is eliminated by using the regression equation to predict. 
By testing to see if the regression helps predict, you are testing to see if there is a functional relationship in the population.\r\n\r\nImagine that you have found the mean price of the apartments\u00a0in our\u00a0sample, and for each apartment, you have made the simple prediction that price of apartment\u00a0will be equal to the sample mean, <span style=\"border-top: 1px; border-left: 0px; border-right: 0px; border-bottom: 0px; border-style: solid;\"><em>y<\/em><\/span>. This is not a very sophisticated prediction technique, but remember that the sample mean is an unbiased estimator of population mean, so <strong>on average<\/strong> you will be right. For each apartment, you could compute your <strong>error<\/strong> by finding the difference between your prediction (the sample mean, <span style=\"border-top: 1px; border-left: 0px; border-right: 0px; border-bottom: 0px; border-style: solid;\"><em>y<\/em><\/span>) and the actual price\u00a0of an apartment.\r\n\r\nAs an alternative way to predict the price, you can have a computer find the intercept, <em>a<\/em>, and slope, <em>b<\/em>, of the sample regression line. Now, you can make another prediction of how much each apartment\u00a0in the sample may be worth\u00a0by computing:\r\n\r\n$latex \\hat{y} = a + b(distance)$\r\n\r\nOnce again, you can find the error made for each apartment\u00a0by finding the difference between the price of apartments\u00a0predicted using the regression equation <em>\u0177<\/em>, and the observed price, <em>y<\/em>. Finally, find how much using the regression improves your prediction by finding the difference between the price\u00a0predicted using the mean, <span style=\"border-top: 1px; border-left: 0px; border-right: 0px; border-bottom: 0px; border-style: solid;\"><em>y<\/em><\/span>, and the price\u00a0predicted using regression, \u0177. 
Notice that the measures of these differences could be positive or negative numbers, but that error or <strong>improvement<\/strong> implies a positive distance.\r\n<h2>Coefficient of Determination<\/h2>\r\nIf you use the sample mean to predict the amount of the price of\u00a0each apartment, your error is (<em>y<\/em>-<span style=\"border-top: 1px; border-left: 0px; border-right: 0px; border-bottom: 0px; border-style: solid;\"><em>y<\/em><\/span>) for each apartment. Squaring each error so that worries about signs are overcome, and then adding the squared errors together, gives you a measure of the total mistake you make if you want to predict <em>y<\/em>. Your total mistake is \u03a3(<em>y<\/em>-<span style=\"border-top: 1px; border-left: 0px; border-right: 0px; border-bottom: 0px; border-style: solid;\"><em>y<\/em><\/span>)<sup>2<\/sup>. The total mistake you make using the regression model would be \u03a3(<em>y-\u0177<\/em>)<sup>2<\/sup>. The difference between the mistakes, a raw measure of how much your prediction has improved, is \u03a3(<em>\u0177<\/em>-<span style=\"border-top: 1px; border-left: 0px; border-right: 0px; border-bottom: 0px; border-style: solid;\"><em>y<\/em><\/span>)<sup>2<\/sup>. To make this raw measure of the improvement meaningful, you need to compare it to one of the two measures of the total mistake. This means that there are two measures of \"how good\" your regression equation is. One compares the improvement to the mistakes still made with regression. The other compares the improvement to the mistakes that would be made if the mean was used to predict. 
The first is called an F-score because the sampling distribution of these measures follows the F-distribution seen in <a href=\"http:\/\/opentextbc.ca\/introductorybusinessstatistics\/chapter\/f-test-and-one-way-anova-2\/\">Chapter 6<\/a>, \"F-test and One-Way ANOVA\".\u00a0The second is called <em>R<sup>2<\/sup><\/em>, or the <strong>coefficient of determination<\/strong>.\r\n\r\nAll of these mistakes and improvements have names, and talking about them will be easier once you know those names. The total mistake made using the sample mean to predict, \u03a3(<em>y<\/em>-<span style=\"border-top: 1px; border-left: 0px; border-right: 0px; border-bottom: 0px; border-style: solid;\"><em>y<\/em><\/span>)<sup>2<\/sup>, is called the <strong>sum of squares, total<\/strong>. The total mistake made using the regression, \u03a3(<em>y-\u0177<\/em>)<sup>2<\/sup>, is called the <strong>sum of squares, error (residual)<\/strong>. The general improvement made by using regression, \u03a3(<em>\u0177<\/em>-<span style=\"border-top: 1px; border-left: 0px; border-right: 0px; border-bottom: 0px; border-style: solid;\"><em>y<\/em><\/span>)<sup>2<\/sup>, is called the <strong>sum of squares, regression<\/strong> or <strong>sum of squares, model<\/strong>. You should be able to see that:\r\n\r\nsum of squares, total = sum of squares, regression + sum of squares, error (residual)\r\n\r\n$latex \\sum{(y-\\bar{y})^2} = \\sum{(\\hat{y}-\\bar{y})^2} + \\sum{(y-\\hat{y})^2}$\r\n\r\nIn other words, the total variations in <em>y<\/em>\u00a0can be partitioned into two sources: the explained variations and the unexplained variations. 
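This partition can be checked numerically for the Table 8.1 data; the sketch below computes each sum of squares for the price-on-distance line and verifies that they add up (names are mine):

```python
# Partition of total variation: SST = SSR + SSE, for the least-squares
# regression of price on distance, data from Table 8.1.
x = [1.5, 3, 1.75, 1, 3.1, 1.6, 2.3, 2, 4, 5, 0.65, 2]
y = [55, 51, 60, 75, 55.5, 49, 65, 61.5, 55, 45, 75, 65]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) \
    / sum((xi - x_bar) ** 2 for xi in x)
a = y_bar - b * x_bar
y_hat = [a + b * xi for xi in x]

sst = sum((yi - y_bar) ** 2 for yi in y)               # total
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)           # regression (explained)
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))  # residual (unexplained)
# sst equals ssr + sse (up to floating-point rounding), and the explained
# proportion ssr / sst is about 0.50 for this sample.
```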
Further, we can rewrite the above equation as:\r\n\r\n<a href=\"http:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/03\/image17.png\"><img src=\"http:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/03\/image17.png\" alt=\"image17\" class=\"size-full wp-image-339 alignnone\" height=\"30\" width=\"131\" \/><\/a>\r\n\r\nwhere SST stands for sum of squares due to total variations, SSR measures the sum of squares due to the estimated regression model that is explained by variable <em>x<\/em>, and SSE measures all the variations due to other factors excluded from the estimated model.\r\n\r\nGoing back to the idea of goodness of fit, one should be able to easily calculate the percentage of each variation with respect to the total variations. In particular, the strength of the estimated regression model can now be measured. Since we are interested in the explained part of the variations by the estimated model, we simply divide both sides of the above equation by SST, and we get:\r\n\r\n<a href=\"http:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/03\/image18.png\"><img src=\"http:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/03\/image18.png\" alt=\"image18\" class=\" size-full wp-image-340 alignnone\" height=\"30\" width=\"221\" \/><\/a>\r\n\r\nWe then\u00a0solve this equation for the explained proportion, also known as <em>R<\/em>-square:\r\n\r\n<a href=\"http:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/03\/image192.png\"><img src=\"http:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/03\/image192.png\" alt=\"image19\" class=\" wp-image-343 alignnone\" height=\"49\" width=\"139\" \/><\/a>\r\n\r\nOnly in cases where an intercept is included in a simple regression model will the value of <em>R<sup>2<\/sup><\/em>\u00a0be 
bounded between zero and one. The closer <em>R<sup>2<\/sup><\/em> is to one, the stronger the model is. Alternatively, <em>R<sup>2<\/sup><\/em> is also found by:\r\n\r\n<a href=\"http:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/03\/R-Square.png\"><img src=\"http:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/03\/R-Square.png\" alt=\"R-Square\" class=\" size-full wp-image-344 alignnone\" height=\"50\" width=\"341\" \/><\/a>\r\n\r\nThis is the ratio of the improvement made using the regression to the mistakes made using the mean.\u00a0The numerator is the improvement regression makes over using the mean to predict; the denominator is the mistakes (errors) made using the mean. Thus <em>R<sup>2<\/sup><\/em> simply shows what proportion of the mistakes made using the mean are eliminated by using regression.\r\n\r\nIn the case of the market for one-bedroom apartments in Nelson, BC, the proportion of the variation in apartment prices explained by the model is estimated to be around 50%. This indicates that only half of the fluctuations in apartment prices with respect to the average price can be explained by the apartments\u2019 distance from downtown. The other 50% is unexplained and is subject to further research. One typical approach is to add more relevant factors to the simple regression model. In this case, the estimated model is referred to as a multiple regression model.\r\n\r\nWhile <em>R<sup>2<\/sup><\/em> is not used to test hypotheses, it has a more intuitive meaning than the F-score.\u00a0The F-score is the measure usually used in a hypothesis test to see if the regression made a significant improvement over using the mean.\u00a0It is used because the sampling distribution of F-scores that it follows is printed in the tables at the back of most statistics books, so that it can be used for hypothesis testing. 
It works no matter how many explanatory variables are used. More formally, consider a population of multivariate observations, (<em>y, x<sub>1<\/sub>, x<sub>2<\/sub>, ..., x<sub>m<\/sub><\/em>), where\u00a0there is no linear relationship between <em>y<\/em> and the <em>x<\/em>\u2019s, so that <em>y<\/em> \u2260 f(<em>x<sub>1<\/sub>, x<sub>2<\/sub>, ..., x<sub>m<\/sub><\/em>). If samples of <em>n<\/em> observations are taken, a regression equation estimated for each sample, and a statistic, F, found for each sample regression, then those F\u2019s will be distributed like those shown in Figure 8.5, the\u00a0F-table with (<em>m<\/em>, <em>n<\/em>-<em>m<\/em>-1) df.\r\n\r\n<iframe src=\"https:\/\/onedrive.live.com\/embed?cid=0B27F889BFE551E2&amp;resid=B27F889BFE551E2%21194&amp;authkey=ACbx6NPdd4cpRL8&amp;em=2&amp;wdAllowInteractivity=False&amp;AllowTyping=True&amp;wdHideGridlines=True&amp;wdHideHeaders=True\" width=\"100%\" height=\"600\"><\/iframe>\r\nFigure 8.5 Interactive Excel Template of an F-Table - see Appendix 8.\r\n\r\nThe value of F can be calculated as:\r\n\r\n<a href=\"http:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2014\/10\/10000000000000C700000052A5D78DF11.png\"><img src=\"http:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2014\/10\/10000000000000C700000052A5D78DF11.png\" alt=\"Sum of squares regression \/ sum of squares residual\" class=\" wp-image-112 size-full alignnone\" height=\"82\" width=\"199\" \/><\/a>\r\n\r\n<a href=\"http:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2014\/10\/10000000000000910000004E1831B8BA1.png\"><img src=\"http:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2014\/10\/10000000000000910000004E1831B8BA1.png\" alt=\"Improvement made \/ mistakes still made\" class=\"wp-image-113 size-full alignnone\" height=\"78\" width=\"145\" \/><\/a>\r\n\r\n<a 
href=\"http:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2014\/10\/100000000000006E00000065597DBFCE1.png\"><img src=\"http:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2014\/10\/100000000000006E00000065597DBFCE1.png\" alt=\"Value of F\" class=\" wp-image-114 size-full alignnone\" height=\"101\" width=\"110\" \/><\/a>\r\n\r\nwhere\u00a0<em>n<\/em> is the size of the sample, and\u00a0<em>m<\/em> is the number of explanatory variables (how many <em>x<\/em>\u2019s there are in the regression equation).\r\n\r\nIf \u03a3(<em>\u0177<\/em>-<span style=\"border-top: 1px; border-left: 0px; border-right: 0px; border-bottom: 0px; border-style: solid;\"><em>y<\/em><\/span>)<sup>2<\/sup> the sum of squares regression (the improvement), is large relative to \u03a3(<em>\u0177<\/em>-<span style=\"border-top: 1px; border-left: 0px; border-right: 0px; border-bottom: 0px; border-style: solid;\"><em>y<\/em><\/span>)<sup>3<\/sup>, the sum of squares residual (the mistakes still made), then the F-score will be large. In a population where there is no functional relationship between y and the <em>x<\/em>\u2019s, the regression line will have a slope of zero (it will be flat), and the <em>\u0177<\/em> will be close to y. As a result very few samples from such populations will have a large sum of squares regression and large F-scores. Because this F-score is distributed like the one in the F-tables, the tables can tell you whether the F-score a sample regression equation produces is large enough to be judged unlikely to occur if <em>y<\/em> \u2260 f(<em>y, x<sub>1<\/sub>, x<sub>2<\/sub>, ..., x<sub>m<\/sub><\/em>). The sum of squares regression is divided by the number of explanatory variables to account for the fact that it always decreases when more variables are added. You can also look at this as finding the improvement per explanatory variable. 
The sum of squares residual is divided by a number very close to the number of observations because it always increases if more observations are added. You can also look at this as the approximate mistake per observation.\r\n\r\nTo test to see if a regression equation was worth estimating, test to see if there seems to be a functional relationship:\r\n\r\n$latex H_0: y \\neq f(x_1,x_2,\\cdots,x_m)$\r\n\r\n$latex H_a: y = f(x_1,x_2,\\cdots,x_m)$\r\n\r\nThis might look like a two-tailed test since <em>H<sub>a<\/sub><\/em> has an equal sign. But, by looking at the equation for the F-score you should be able to see that the data support <em>H<sub>a<\/sub><\/em> only if the F-score is large. This is because the data support the existence of a functional relationship if the sum of squares regression is large relative to the sum of squares residual. Since <a href=\"http:\/\/www.stat.purdue.edu\/~jtroisi\/STAT350Spring2015\/tables\/FTable.pdf\">F-tables<\/a> are usually one-tail tables, choose an <em>\u03b1<\/em>, go to the <a href=\"http:\/\/www.stat.purdue.edu\/~jtroisi\/STAT350Spring2015\/tables\/FTable.pdf\">F-tables<\/a> for that <em>\u03b1<\/em> and (<em>m<\/em>, <em>n<\/em>-<em>m<\/em>-1) df, and find the table F. If the computed F is greater than the table F, then the computed F is unlikely to have occurred if <em>H<sub>o<\/sub><\/em> is true, and you can safely decide that the data support <em>H<sub>a<\/sub><\/em>. There is a functional relationship in the population.\r\n\r\nNow that you have learned all the necessary steps in estimating a simple regression model, you may take some time to re-estimate the Nelson apartment model or any other simple regression model, using the interactive Excel template shown in Figure 8.6. Like all other interactive templates in this textbook, you can change the values in the yellow cells only. The result will be shown automatically within this template. 
For this template, you can only estimate simple regression models with 30 observations. You use <em>special paste\/values<\/em> when you paste your data from other spreadsheets. The first step is to enter your data under independent and dependent variables. Next, select your alpha level. Check your results in terms of both individual and overall significance. Once the model has passed all these requirements, you can select an appropriate value for the independent variable, which in this example is the distance to downtown, to estimate both the confidence intervals for the average price of such an apartment, and the prediction intervals for the selected distance. Both these intervals are discussed later in this chapter. Remember that by changing any of the values in the yellow areas in this template, all calculations will be updated, including the tests of significance and the values for both confidence and prediction intervals.\r\n\r\n<iframe src=\"https:\/\/onedrive.live.com\/embed?cid=0B27F889BFE551E2&amp;resid=B27F889BFE551E2%21195&amp;authkey=ADs_Wx8MoXAAfYw&amp;em=2&amp;wdAllowInteractivity=False&amp;AllowTyping=True&amp;wdHideGridlines=True&amp;wdHideHeaders=True\" width=\"100%\" height=\"600\"><\/iframe>\r\nFigure 8.6 Interactive Excel Template for Simple Regression - see Appendix 8.\r\n<h2>Multiple Regression Analysis<\/h2>\r\nWhen we add more explanatory variables to our simple regression model to strengthen its ability to explain real-world data, we in fact convert a simple regression model into a multiple regression model. The least squares approach we used in the case of simple regression can still\u00a0be used for multiple regression analysis.\r\n\r\nAs per our discussion in the simple regression model section, our low estimated <em>R<sup>2<\/sup><\/em>\u00a0indicated that only 50% of the variations in the price of apartments in Nelson, BC, was explained by their distance from downtown. 
Obviously, there should be more relevant factors that can be added into this model to make it stronger. Let\u2019s add a second explanatory factor to this model. We collected data for the area of each apartment in square feet\u00a0(i.e., \u00a0<em>x<sub>2<\/sub><\/em>). If we go back to Excel and estimate our model including the newly added variable, we will see the printout shown in Figure 8.7.\r\n\r\n[caption id=\"attachment_399\" align=\"aligncenter\" width=\"656\"]<a href=\"http:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/03\/multiple-regression.png\"><img src=\"http:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/03\/multiple-regression.png\" alt=\"multiple regression\" class=\"wp-image-399 size-full\" height=\"376\" width=\"656\" \/><\/a> Figure 8.7 Excel Printout[\/caption]\r\n\r\nThe estimated equation of the regression model is:\r\n\r\npredicted price of apartments = 60.041 - 5.393*distance + .03*area\r\n\r\nThis is the equation for a plane, the three-dimensional equivalent of a straight line. It is still a linear function because neither the <em>x<\/em>\u2019s nor <em>y<\/em> is raised to a power or taken to a root, and the <em>x<\/em>\u2019s are not multiplied together. You can have even more independent variables, and as long as the function is linear, you can estimate the slope, <em>\u03b2<\/em>, for each independent variable.\r\n\r\nBefore using this estimated model for prediction and decision-making purposes, we should test three hypotheses. First, we\u00a0can use the F-score to test to see if the regression model improves our\u00a0ability to predict the price\u00a0of apartments. In other words, we test the <em>overall<\/em> significance of the estimated model. Second and third, we\u00a0can use the t-scores to test to see if the slopes of distance\u00a0and area\u00a0are different from zero. 
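\r\n\r\nAs a rough sketch (using the coefficients exactly as printed above, so treat the numbers as purely illustrative), the estimated plane can be evaluated at any (distance, area) pair, and the adjusted <em>R<sup>2<\/sup><\/em> formula discussed later in this section can be reproduced from <em>R<sup>2<\/sup><\/em> = .612 with <em>n<\/em> = 12 observations and two independent variables:

```python
# Sketch: evaluate the estimated plane from the text and reproduce the
# adjusted R-squared; coefficients are as printed, so purely illustrative.

def predicted_price(distance_km, area_sqft):
    """Predicted apartment price in $1000s from the estimated plane."""
    return 60.041 - 5.393 * distance_km + 0.03 * area_sqft

price = predicted_price(3, 300)          # e.g. 3 km out, 300 square feet

# Adjusted R-squared: penalize R-squared for extra independent variables.
r_squared, n, k = 0.612, 12, 2           # k = number of independent variables
adj_r_squared = 1 - (1 - r_squared) * (n - 1) / (n - k - 1)
print(round(price, 3), round(adj_r_squared, 3))
```

Note how the adjusted value (.526) is noticeably lower than the raw .612, which is the point of the adjustment.\r\n\r\n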
These two t-tests are also known as <em>individual<\/em> tests of significance.\r\n\r\nTo conduct the first test, we\u00a0choose an <em>\u03b1<\/em> = .05. The F-score is the regression or model mean square over the residual or error mean square, so the df for the F-statistic are first the df for the regression model and, second, the df for the error. There are 2 and 9\u00a0df for the F-test. According to this <a href=\"http:\/\/www.stat.purdue.edu\/~jtroisi\/STAT350Spring2015\/tables\/FTable.pdf\">F-table<\/a>, with 2\u00a0and 9\u00a0df, the critical F-score for <em>\u03b1<\/em> = .05 is 4.26.\r\n\r\nThe hypotheses are:\r\n\r\n<em>H<sub>0<\/sub><\/em>: price\u00a0\u2260 f(distance, area)\r\n\r\n<em>H<sub>a<\/sub><\/em>: price = f(distance, area)\r\n\r\nBecause the F-score from the regression, 6.812, is greater than the critical F-score, 4.26, we decide that the data support <em>H<sub>a<\/sub><\/em> and conclude that the model helps us predict the price of\u00a0apartments. Alternatively, we say there is such a functional relationship in the population.\r\n\r\nNow, we move to the individual tests of significance. We can test to see if price\u00a0depends on distance and area.\u00a0There are (<em>n-m<\/em>-1)=(12-2-1)=9\u00a0df. There are two sets of hypotheses, one set for <em>\u03b2<sub>1<\/sub><\/em>, the slope for distance, and one set for <em>\u03b2<sub>2<\/sub><\/em>, the slope for area. For a small town, one may\u00a0expect that <em>\u03b2<sub>1<\/sub><\/em>, the slope for distance, will be negative, and\u00a0expect that <em>\u03b2<sub>2<\/sub><\/em> will be positive. Therefore, we\u00a0will use a one-tail test on <em>\u03b2<sub>1<\/sub><\/em>, as well as for <em>\u03b2<sub>2<\/sub><\/em>:\r\n\r\n$latex H_a: \\beta _1 &lt;0 \\qquad H_a:\\beta _2&gt;0$\r\n\r\nSince we have\u00a0two one-tail tests, the t-values we\u00a0choose from the t-table will have the same magnitude\u00a0for the two tests. 
Using <em>\u03b1<\/em> = .05 and 9 df, and placing all of <em>\u03b1<\/em> in one tail for each one-tail test, we come up with a critical t-score of 1.833.\u00a0Looking back at our Excel\u00a0printout and checking the <em>t<\/em>-scores, we\u00a0decide that distance does affect the price of apartments, but area is not a significant factor in explaining the price of apartments.\u00a0Notice that the printout also gives a <em>t<\/em>-score for the intercept, so we\u00a0could test to see if the intercept equals zero or not.\r\n\r\nAlternatively, one may go ahead and compare directly the <em>p<\/em>-values out of the Excel printout against the assumed level of significance\u00a0(i.e., <em>\u03b1<\/em> = .05). We can easily see that the\u00a0<em>p-<\/em>values associated with the intercept and distance are both less than <em>alpha<\/em>, and as a result we reject the hypothesis that the associated\u00a0coefficients\u00a0are zero\u00a0(i.e., both are\u00a0significant).\u00a0However, area is not a significant factor since its associated\u00a0<em>p-<\/em>value\u00a0is greater than <em>alpha<\/em>.\r\n\r\nWhile there are other required assumptions and conditions in both simple and multiple regression models (we encourage students to consult\u00a0an intermediate business statistics open textbook\u00a0for more detailed discussions), here we only\u00a0focus on two relevant points about the use and applications of multiple regression.\r\n\r\nThe first point is related to the interpretation of the estimated coefficients in a multiple regression model. You should be careful to note\u00a0that in a simple regression model, the estimated coefficient of our independent variable is simply the slope of the line and can be interpreted directly. It refers to the response of the dependent variable to a one-unit change in the independent variable. However, this interpretation in a multiple regression model should be adjusted slightly. 
The estimated coefficients under multiple regression analysis are the response of the dependent variable to a one-unit change in one of the independent variables when the levels of all other independent variables are kept constant. In our example, the estimated coefficient of distance indicates that, for a given size of apartment, the price of an apartment in Nelson, BC, will drop by 5.393*1000=$5,393 for every one kilometre that the apartment is farther away from downtown.\r\n\r\nThe second point is about the use of <em>R<sup>2<\/sup><\/em> in multiple regression analysis. Technically, adding more independent variables to the model will increase the value of <em>R<sup>2<\/sup><\/em>, regardless of whether\u00a0the added variables are relevant or irrelevant in explaining the variation in the dependent variable. In order to <em>adjust<\/em> the inflated <em>R<sup>2<\/sup><\/em> due to the\u00a0irrelevant variables\u00a0added to the model, the following formula is recommended in the case of multiple regression:\r\n\r\n<a href=\"http:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/06\/Screen-Shot-2015-06-16-at-8.36.58-AM.png\"><img src=\"http:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/06\/Screen-Shot-2015-06-16-at-8.36.58-AM-300x62.png\" class=\"alignnone wp-image-664 \" height=\"39\" width=\"181\" \/><\/a>\r\n\r\nwhere <em>n <\/em>is the sample size, and <em>k<\/em>\u00a0is the number of independent variables in our model.\r\n\r\nGoing back to our earlier Excel results for the\u00a0multiple regression model estimated for the apartment example, we can see that while the <em>R<sup>2<\/sup><\/em> has been inflated\u00a0from .504 to .612 due to the newly added factor, apartment size, \u00a0the adjusted <em>R<sup>2<\/sup><\/em> has dropped the inflated value to .526. To understand this better, you should pay attention to the associated <em>p<\/em>-value for the newly added factor. 
Since this value is more than .05, we cannot reject the hypothesis that the true coefficient of apartment size (area) is zero. In other words, in its current situation, apartment size is not a significant factor, yet the value of <em>R<sup>2<\/sup><\/em> has been inflated!\r\n\r\nFurthermore, the <em>R<sup>2<\/sup><\/em> indicates that only 61.2% of variations in the price of one-bedroom apartments in Nelson, BC, can be explained by their locations and sizes. Almost 40% of the variations of the price still\u00a0cannot be explained by these two factors. One may seek\u00a0to improve this model by searching for more relevant factors, such as style of the apartment, year built, etc., and adding them to this model.\r\n\r\nUsing the interactive Excel template shown in Figure 8.8, you can estimate a multiple regression model. Again, enter your data into the yellow cells only. For this template you are allowed to use up to 50 observations for each column. Like all other interactive templates in this textbook, you use <em>special paste\/values<\/em> when you paste your data from other spreadsheets. Specifically, if you have fewer than 50 data entries, you must also fill out the rest of the empty yellow cells under X1, X2, and Y with zeros. Now, select your alpha level. By clicking <em>enter<\/em>, you will not only have all your estimated coefficients along with their t-values, etc., you will also be guided as to whether the model is significant both overall and individually. If the p-value associated with the F-value within the ANOVA table is not less than the selected alpha level, you will see a message indicating that your estimated model is not overall significant, and as a result, no values for C.I. and P.I. will be shown. 
By either changing the alpha level and\/or adding more accurate data, it is possible\u00a0to estimate a more significant multiple regression model.\r\n\r\n<iframe src=\"https:\/\/onedrive.live.com\/embed?cid=0B27F889BFE551E2&amp;resid=B27F889BFE551E2%21196&amp;authkey=AMQIOCaItKS3dy8&amp;em=2&amp;wdAllowInteractivity=False&amp;AllowTyping=True&amp;wdHideGridlines=True&amp;wdHideHeaders=True\" width=\"100%\" height=\"600\"><\/iframe>\r\nFigure 8.8 Interactive Excel Template for Multiple Regression Model - see Appendix 8.\r\n\r\nOne more point is about the format of your assumed multiple regression model. You can see that the nature of the associations between the dependent variable and all the independent variables may not always be linear. In reality, you will face cases where such relationships may be better formed by a nonlinear model. Without going into the details of such a non-linear model, just to give you an idea, you should be able to transform your selected data for X1, X2, and Y before estimating your model. For instance, one possible multiple regression non-linear model may be a model in which both the dependent and independent variables have been transformed to natural logarithms rather than levels. In order to estimate such a model within Figure 8.8, all you need to do is transform the data in all three columns in a separate sheet from levels to logarithms. In doing this, simply use =LN(A1), where cell A1 holds the first observation of X1, then =LN(B1), and so on. Finally, simply cut and <em>special paste\/value<\/em> into the yellow columns within the template. Now you have estimated a multiple regression model with both sides in a non-linear form\u00a0(i.e., log form).\r\n<h2><strong>Predictions using the estimated simple regression<\/strong><\/h2>\r\nIf the estimated regression line fits well into the data, the model can then be used for predictions. 
Using the above estimated simple regression model, we can predict the price of an apartment at a <em>given<\/em> distance from downtown. An interval for this prediction is known as the prediction interval or P.I. Alternatively, we may predict the <em>mean price<\/em> of such apartments; an interval for this mean value is known as the confidence interval or C.I.\r\n\r\nTo predict the price of an apartment that is six\u00a0kilometres away from downtown, we simply set <em>x<\/em>=6 and substitute it back into the estimated equation:\r\n\r\n$latex \\hat{y}=71.84-5.38\\times 6 = 39.56$\r\n\r\nYou should pay attention to the scale of the data. In this case, the dependent variable is measured in $1000s. Therefore, the predicted value for an apartment six\u00a0kilometres from downtown is 39.56*1000=$39,560. This value is known as the\u00a0<em>point estimate<\/em> of the prediction and is not reliable by itself, as we do not know how close it is to the true value in the population.\r\n\r\nA more reliable estimate can be constructed by setting up an <em>interval<\/em> around the point estimate. This can be done in two ways. We can estimate the expected value (mean) of <em>y<\/em>\u00a0for a given value of <em>x<\/em>, or we can predict the particular value of <em>y<\/em>\u00a0for a given value of <em>x<\/em>. 
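\r\n\r\nThe point estimate above is simple to reproduce (a sketch using the slope and intercept quoted in the text; remember the price is in $1000s):

```python
# Point estimate from the estimated simple regression line
# (intercept and slope as quoted in the text; price is in $1000s).
intercept, slope = 71.84, -5.38
distance_km = 6

predicted = round(intercept + slope * distance_km, 2)  # in $1000s
dollars = round(predicted * 1000)                      # rescale to dollars
print(predicted, dollars)                              # 39.56 and 39560
```
\r\n\r\n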
For the expected value (mean) of <em>y<\/em>, we use the following formula for the interval:\r\n\r\n<a href=\"http:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/03\/image21.png\"><img src=\"http:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/03\/image21.png\" alt=\"image21\" class=\" size-full wp-image-359 alignnone\" height=\"36\" width=\"246\" \/><\/a>\r\n\r\nwhere the standard error, S.E., of the estimate is calculated based on the following formula:\r\n\r\n<a href=\"http:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/03\/image22.png\"><img src=\"http:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/03\/image22.png\" alt=\"image22\" class=\" size-full wp-image-360 alignnone\" height=\"65\" width=\"334\" \/><\/a>\r\n\r\nIn this equation, <em>x<sup>*<\/sup><\/em>\u00a0is the particular value of the independent variable, which in our case is 6, and\u00a0<em>s\u00a0<\/em>is the standard\u00a0error of the regression, calculated as:\r\n\r\n<a href=\"http:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/03\/image23.png\"><img src=\"http:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/03\/image23.png\" alt=\"image23\" class=\" size-full wp-image-361 alignnone\" height=\"65\" width=\"82\" \/><\/a>\r\n\r\nFrom the Excel printout for the simple regression model, this standard error is estimated as 7.02.\r\n\r\nThe sum of squares of the independent variable,\r\n\r\n<a href=\"http:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/03\/sum-of-Sq-of-indep2.png\"><img src=\"http:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/03\/sum-of-Sq-of-indep2.png\" alt=\"sum of Sq of indep\" class=\" size-full wp-image-365 alignnone\" height=\"50\" width=\"133\" 
\/><\/a>\r\n\r\ncan also be calculated as shown in Figure 8.9.\r\n\r\n[caption id=\"attachment_366\" align=\"aligncenter\" width=\"148\"]<a href=\"http:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/03\/image24.png\"><img src=\"http:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/03\/image24.png\" alt=\"image24\" class=\"wp-image-366 size-full\" height=\"325\" width=\"148\" \/><\/a> Figure 8.9[\/caption]\r\n\r\nAll these calculated values can be substituted back into the formula for the S.E. of the estimate:\r\n\r\n<a href=\"http:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/11\/SEofCI1.png\"><img src=\"http:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/11\/SEofCI1.png\" alt=\"C.I.\" class=\"alignnone size-full wp-image-1242\" height=\"73\" width=\"327\" \/><\/a>\r\n\r\nNow that the S.E. of the confidence interval has been calculated, you can pick up the cut-off point from the <em>t<\/em>-table. Given the degrees of freedom 12-2=10, the appropriate value from the <em>t<\/em>-table is 2.23. You use this information to calculate the <em>margin of error <\/em>as 6.52*2.23=14.54. Finally, construct the confidence interval for the average price of an apartment located six\u00a0kilometres away from downtown as:\r\n\r\n<a href=\"http:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/07\/C.I._VALUES.png\"><img src=\"http:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/07\/C.I._VALUES.png\" alt=\"C.I._VALUES\" class=\"alignnone size-full wp-image-862\" height=\"40\" width=\"103\" \/><\/a>\r\n\r\nThis is a compact version of the confidence interval. 
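\r\n\r\nPutting the quoted pieces together (a sketch using the point estimate 39.56, the S.E. of 6.52, and the t cut-off of 2.23; the limits differ from the dollar figures quoted below only because of rounding):

```python
# Interval from the numbers quoted in the text: point estimate 39.56
# (in $1000s), S.E. 6.52, and t cut-off 2.23 for 10 df.
point_estimate, se, t_cutoff = 39.56, 6.52, 2.23

margin = se * t_cutoff                       # margin of error, about 14.54
lower = round(point_estimate - margin, 2)
upper = round(point_estimate + margin, 2)
print(lower, upper)                          # roughly 25.02 to 54.1 ($1000s)
```
\r\n\r\n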
For a\u00a0more general version of any confidence interval\u00a0for any given significance level <em>alpha<\/em>, we can write:\r\n\r\n<a href=\"http:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/03\/Confidence-Interval1.png\"><img src=\"http:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/03\/Confidence-Interval1.png\" alt=\"Confidence Interval\" class=\" size-full wp-image-372 alignnone\" height=\"36\" width=\"544\" \/><\/a>\r\n\r\nIntuitively, for say a .05 level of significance, we are\u00a095% confident that the\u00a0true parameter of the population will be within these two lower and upper limits:\r\n\r\n<a href=\"http:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/07\/Confidence-Interval1.png\"><img src=\"http:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/07\/Confidence-Interval1.png\" alt=\"Confidence Interval\" class=\"alignnone size-full wp-image-864\" height=\"40\" width=\"476\" \/><\/a>\r\n\r\nBased on\u00a0our simple regression model, which\u00a0includes only distance as a significant factor in predicting the price of an apartment, and for apartments six kilometres away from downtown, we are 95% confident that the true average price of such apartments in Nelson, BC, is between $25,037 and $54,096, a width of $29,059. One should not be surprised that the interval is so wide, given the fact that the coefficient of determination of this model was only 50%, and the fact that we have selected a distance far away from the mean distance from downtown.\u00a0We can always improve these numbers by adding more explanatory variables to our simple regression model. 
Alternatively, we can restrict our predictions to distances as close as possible to the mean distance in our sample.\r\n\r\nNow we predict the particular value of <em>y<\/em>\u00a0for a given value of <em>x<\/em>,\u00a0the so-called prediction interval. The process of constructing the interval is very similar to the previous case, except we use a new formula for the S.E. and, of course, we set up the interval around the same point estimate (i.e., 39.56).\r\n\r\n<a href=\"http:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/11\/SEofPI.png\"><img src=\"http:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/11\/SEofPI.png\" alt=\"P.I.\" class=\"size-full wp-image-1244\" height=\"70\" width=\"356\" \/><\/a>\r\n\r\nYou should be very careful\u00a0to note\u00a0the difference between this formula and the one introduced earlier for the S.E. for\u00a0estimating the mean value of <em>y<\/em>\u00a0for a given value of <em>x.\u00a0<\/em>They look\u00a0very\u00a0similar,\u00a0but\u00a0this formula comes with an extra 1\u00a0inside the radical!\r\n\r\nThe margin of error is then calculated by multiplying the S.E. of the prediction by the t cut-off point. We use this to set up directly the lower and upper limits of the estimates:<a href=\"http:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/07\/P.I._VALUES.png\">\r\n<img src=\"http:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/07\/P.I._VALUES.png\" alt=\"P.I._VALUES\" class=\"alignnone size-full wp-image-867\" height=\"40\" width=\"480\" \/><\/a>\r\n\r\nThus, for the price of a\u00a0<em>particular<\/em>\u00a0apartment located in Nelson, BC, six kilometres away from downtown, we are 95% confident that the price\u00a0will be between $18,200 and $60,920, with a width of $42,720. Compared with the earlier width for the C.I., it is obvious that we are less confident in predicting the price of an individual apartment. 
The reason is that the S.E. for the prediction is always larger than the S.E. for the confidence interval.\r\n\r\nThis process can be repeated for all different\u00a0levels of\u00a0<em>x<\/em> to calculate the associated confidence and prediction intervals. By doing this, we will have a range of lower and upper limits for both P.I.s and C.I.s. All these numbers can be reproduced within the interactive Excel template shown in Figure 8.8. If you use statistical software such as Minitab, you can directly plot a scatter diagram with all P.I.s and C.I.s as well as the estimated linear regression line, all in one diagram. Figure 8.10 shows such a diagram from Minitab for our\u00a0example.\r\n\r\n[caption id=\"attachment_377\" align=\"aligncenter\" width=\"666\"]<a href=\"http:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/03\/image351.png\"><img src=\"http:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/03\/image351.png\" alt=\"image35\" class=\" wp-image-377\" height=\"444\" width=\"666\" \/><\/a> Figure 8.10 Minitab Plot for C.I. and P.I.[\/caption]\r\n\r\nFigure 8.10 indicates that predictions are more reliable the closer\u00a0<i>x<\/i> is to the mean of our observations. In this graph, the widths of both intervals are narrowest near\u00a0the means of <em>x<\/em> and <em>y<\/em>.\r\n\r\nYou should be careful to note that Figure 8.10 provides the predicted intervals only for the case of a simple regression model. For the multiple regression model, you may use other statistical software packages, such as SAS, SPSS, etc., to estimate both P.I. and C.I. For instance, by selecting <em>x<sub>1<\/sub><\/em>=3, and <em>x<sub>2<\/sub><\/em>=300, and coding these figures into Minitab, you will\u00a0see the results as shown in Figure 8.11. 
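\r\n\r\nThe funnel shape in Figure 8.10 follows directly from the two standard-error formulas. The sketch below uses <em>s<\/em> = 7.02 and <em>n<\/em> = 12 from the text but made-up values for the mean and sum of squares of <em>x<\/em>; it shows that both standard errors grow as <em>x<sup>*<\/sup><\/em> moves away from the mean, and that the S.E. for a particular value always carries the extra 1:

```python
import math

# Sketch of the two standard errors behind the C.I./P.I. funnel.
# s and n come from the text; xbar and sxx are made-up summary numbers.
s, n = 7.02, 12
xbar, sxx = 5.0, 100.0        # hypothetical mean and sum of squares of x

def se_mean(x_star):
    """S.E. for the mean of y at x_star (confidence interval)."""
    return s * math.sqrt(1 / n + (x_star - xbar) ** 2 / sxx)

def se_particular(x_star):
    """S.E. for a particular y at x_star (prediction interval)."""
    return s * math.sqrt(1 + 1 / n + (x_star - xbar) ** 2 / sxx)

for x_star in (2, 5, 8):      # widths are smallest at x_star = xbar
    print(x_star, round(se_mean(x_star), 2), round(se_particular(x_star), 2))
```

Multiplying either standard error by the t cut-off at each <em>x<sup>*<\/sup><\/em> traces out the curved C.I. and P.I. bands around the regression line.\r\n\r\n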
Alternatively, you may use the interactive Excel template provided in Figure 8.8 to estimate your multiple regression model, and to check for the significance of the estimated parameters. This template can also be used to construct both the P.I. and C.I. for the given values of <em>x<sub>1<\/sub><\/em>=3, and <em>x<sub>2<\/sub><\/em>=300 or any other values of your choice. Furthermore, this template enables you to test if the estimated multiple regression model is overall significant. When the estimated multiple regression model is not overall significant, this template will not provide the P.I. and C.I. To practice this case, you may want to change the yellow columns of <em>x<sub>1<\/sub><\/em> and <em>x<sub>2<\/sub><\/em> with different random numbers that are not correlated with the dependent variable. Once the estimated model is not overall significant, no prediction values will be provided.\r\n\r\n[caption id=\"attachment_402\" align=\"aligncenter\" width=\"417\"]<a href=\"http:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/03\/prediction-intervals-multiple-regression1.png\"><img src=\"http:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/03\/prediction-intervals-multiple-regression1.png\" alt=\"prediction intervals multiple regression\" class=\"wp-image-402 size-full\" height=\"207\" width=\"417\" \/><\/a> Figure 8.11[\/caption]\r\n\r\nThe 95% C.I., and P.I. figures in the brackets are the lower and upper limits of the intervals given the specific values for distance and size of apartments. The fitted value of the price of apartment, as well as the standard error of this value, are also estimated.\r\n\r\nWe have just given you some rough ideas about how the basic regression calculations are done. 
We have purposely left out the steps needed to calculate more detailed regression results by hand, for you will never compute a regression without a computer (or a high-end calculator) in your working years. However, by working with these interactive templates, you will have a much better chance to play around with any data to see how the outcomes can be altered, and to observe their implications for the real-world business decision-making process.\r\n<h1>Correlation and covariance<\/h1>\r\nThe correlation between two variables is important in statistics, and it is commonly reported. What is correlation? The meaning of correlation can be discovered by looking closely at the word\u2014it is almost co-relation, and that is what it means: how two variables are co-related. Correlation is also closely related to regression. The covariance between two variables is also important in statistics, but it is seldom reported. Its meaning can also be discovered by looking closely at the word\u2014it is co-variance, how two variables vary together. Covariance plays a behind-the-scenes role in multivariate statistics. Though you will not see covariance reported very often, understanding it will help you understand multivariate statistics like understanding variance helps you understand univariate statistics.\r\n\r\nThere are two ways to look at correlation. The first flows directly from regression and the second from covariance. Since you just learned about regression, it makes sense to start with that approach.\r\n\r\nCorrelation is measured with a number between -1 and +1 called the correlation coefficient. The population correlation coefficient is usually written as the Greek <strong>rho<\/strong>, <em>\u03c1<\/em>, and the sample correlation coefficient as <em>r<\/em>. 
If you have a linear regression equation with only one explanatory variable, the sign of the correlation coefficient shows whether the slope of the regression line is positive or negative, while the absolute value of the coefficient shows how close to the regression line the points lie. If <em>\u03c1<\/em> is +.95, then the regression line has a positive slope and the points in the population are very close to the regression line. If <em>r<\/em> is -.13 then the regression line has a negative slope and the points in the sample are scattered far from the regression line. If you square <em>r<\/em>, you will get <em>R<sup>2<\/sup><\/em>, which is higher if the points in the sample lie very close to the regression line so that the sum of squares regression is close to the sum of squares total.\r\n\r\nThe other approach to explaining correlation requires understanding covariance, how two variables vary together. Because covariance is a multivariate statistic, it measures something about a sample or population of observations where each observation has two or more variables. Think of a population of (<em>x<\/em>,<em>y<\/em>) pairs. First find the mean of the <em>x<\/em>\u2019s and the mean of the <em>y<\/em>\u2019s, <em>\u03bc<sub>x<\/sub><\/em> and <em>\u03bc<sub>y<\/sub><\/em>. Then for each observation, find (<em>x<\/em> - <em>\u03bc<sub>x<\/sub><\/em>)(<em>y<\/em> - <em>\u03bc<sub>y<\/sub><\/em>). If the <em>x<\/em> and the <em>y<\/em> in this observation are both far above their means, then this number will be large and positive. If both are far below their means, it will also be large and positive. 
If you found \u03a3(<em>x<\/em> - <em>\u03bc<sub>x<\/sub><\/em>)(<em>y<\/em> - <em>\u03bc<sub>y<\/sub><\/em>), it would be large and positive if <em>x<\/em> and <em>y<\/em> move up and down together, so that large <em>x<\/em>\u2019s go with large <em>y<\/em>\u2019s, small <em>x<\/em>\u2019s go with small <em>y<\/em>\u2019s, and medium <em>x<\/em>\u2019s go with medium <em>y<\/em>\u2019s. However, if some of the large <em>x<\/em>\u2019s go with medium <em>y<\/em>\u2019s, etc., then the sum will be smaller, though probably still positive. A positive \u03a3(<em>x<\/em> - <em>\u03bc<sub>x<\/sub><\/em>)(<em>y<\/em> - <em>\u03bc<sub>y<\/sub><\/em>) implies that <em>x<\/em>\u2019s above <em>\u03bc<sub>x<\/sub><\/em> are generally paired with <em>y<\/em>\u2019s above <em>\u03bc<sub>y<\/sub><\/em>, and those <em>x<\/em>\u2019s below their mean are generally paired with <em>y<\/em>\u2019s below their mean. As you can see, the sum is a measure of how <em>x<\/em> and <em>y<\/em> vary together. The more often similar <em>x<\/em>\u2019s are paired with similar <em>y<\/em>\u2019s, the more <em>x<\/em> and <em>y<\/em> vary together and the larger the sum and the covariance. The term for a single observation, (<em>x<\/em> - <em>\u03bc<sub>x<\/sub><\/em>)(<em>y<\/em> - <em>\u03bc<sub>y<\/sub><\/em>), will be negative when the <em>x<\/em> and <em>y<\/em> are on opposite sides of their means. If large <em>x<\/em>\u2019s are usually paired with small <em>y<\/em>\u2019s, and vice versa, most of the terms will be negative and the sum will be negative. If the largest <em>x<\/em>\u2019s are paired with the smallest <em>y<\/em>\u2019s and the smallest <em>x<\/em>\u2019s with the largest <em>y<\/em>\u2019s, then many of the (<em>x<\/em> - <em>\u03bc<sub>x<\/sub><\/em>)(<em>y<\/em> - <em>\u03bc<sub>y<\/sub><\/em>) will be large and negative and so will the sum. 
A population with more members will have a larger sum simply because there are more terms to be added together, so you divide the sum by the number of observations to get the final measure, the covariance, or cov:\r\n\r\n<a href=\"http:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2014\/10\/10000000000000DF0000002E78028BB81.png\"><img src=\"http:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2014\/10\/10000000000000DF0000002E78028BB81.png\" alt=\"Population covariance\" class=\" wp-image-117 size-full alignnone\" height=\"46\" width=\"223\" \/><\/a>\r\n\r\nThe maximum for the covariance is the product of the standard deviations of the <em>x<\/em> values and the <em>y<\/em> values, <em>\u03c3<sub>x<\/sub><\/em><em>\u03c3<sub>y<\/sub><\/em>. While proving that the maximum is exactly equal to the product of the standard deviations is complicated, you should be able to see that the more spread out the points are, the greater the covariance can be. By now you should understand that a larger standard deviation means that the points are more spread out, so you should understand that a larger <em>\u03c3<sub>x<\/sub><\/em> or a larger <em>\u03c3<sub>y<\/sub><\/em> will allow for a greater covariance.\r\n\r\nSample covariance is measured similarly, except the sum is divided by <em>n<\/em>-1 so that sample covariance is an unbiased estimator of population covariance:\r\n\r\n$latex sample \\ cov= \\frac{\\sum{(x-\\bar{x})(y-\\bar{y})}}{(n-1)}$\r\n\r\nCorrelation simply compares the covariance to the standard deviations of the two variables. 
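To make these formulas concrete, here is a short Python sketch; the (x, y) data are made up and chosen only for easy arithmetic. It computes the sum of cross-products, the population and sample covariances, and then the correlation coefficient that scales the covariance by the two standard deviations:

```python
# Illustrative sketch with hypothetical data: covariance from its definition,
# then correlation as covariance over the product of the standard deviations.
import math

x = [1, 2, 3, 4, 5]
y = [2, 3, 5, 4, 6]   # y tends to rise with x, so the covariance should be positive

n = len(x)
mu_x = sum(x) / n
mu_y = sum(y) / n

# Sum of cross-products of deviations: positive when x and y move together
cross = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y))

pop_cov    = cross / n          # divide by N when the data are the population
sample_cov = cross / (n - 1)    # divide by n-1 for an unbiased sample estimate

sigma_x = math.sqrt(sum((a - mu_x) ** 2 for a in x) / n)
sigma_y = math.sqrt(sum((b - mu_y) ** 2 for b in y) / n)
r = pop_cov / (sigma_x * sigma_y)   # always falls between -1 and +1
```

Because the absolute value of the covariance can never exceed <em>σ<sub>x</sub></em><em>σ<sub>y</sub></em>, <em>r</em> is guaranteed to lie between -1 and +1; with these co-moving data it comes out close to +1.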
Using the formula for population correlation:\r\n\r\n<a href=\"http:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/07\/Screen-Shot-2015-07-29-at-3.12.39-PM.png\"><img src=\"http:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/07\/Screen-Shot-2015-07-29-at-3.12.39-PM.png\" class=\"alignnone wp-image-943 \" height=\"71\" width=\"92\" \/><\/a>\r\n\r\nor\r\n<a href=\"http:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/07\/Screen-Shot-2015-07-29-at-3.10.09-PM.png\"><img src=\"http:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/07\/Screen-Shot-2015-07-29-at-3.10.09-PM-300x71.png\" alt=\"Screen Shot 2015-07-29 at 3.10.09 PM\" class=\"alignnone wp-image-940\" height=\"74\" width=\"313\" \/><\/a>\r\n\r\nAt its maximum, the absolute value of the covariance equals the product of the standard deviations, so at its maximum, the absolute value of <em>r<\/em> will be 1. Since the covariance can be negative or positive while standard deviations are always positive, <em>r<\/em> can be either negative or positive. Putting these two facts together, you can see that <em>r<\/em> will be between -1 and +1. The sign depends on the sign of the covariance and the absolute value depends on how close the covariance is to its maximum. The covariance rises as the relationship between <em>x<\/em> and <em>y<\/em> grows stronger, so a strong relationship between <em>x<\/em> and <em>y<\/em> will result in <em>r<\/em> having a value close to -1 or +1.\r\n<h1>Covariance, correlation, and regression<\/h1>\r\nNow it is time to think about how all of this fits together and to see how the two approaches to correlation are related. Start by assuming that you have a population of (<em>x<\/em>, <em>y<\/em>) which covers a wide range of <em>y<\/em>-values, but only a narrow range of <em>x<\/em>-values. 
This means that <em>\u03c3<sub>y<\/sub><\/em> is large while <em>\u03c3<sub>x<\/sub><\/em> is small. Assume that you graph the (<em>x<\/em>, <em>y<\/em>) points and find that they all lie in a narrow band stretched linearly from bottom left to top right, so that the largest <em>y<\/em>\u2019s are paired with the largest <em>x<\/em>\u2019s and the smallest <em>y<\/em>\u2019s with the smallest <em>x<\/em>\u2019s. This means both that the covariance is large and that a good regression line, one that comes very close to almost all the points, can easily be drawn. The correlation coefficient will also be very high (close to +1). An example will show why all of these things happen together.\r\n\r\nImagine that the equation for the regression line is <em>y<\/em>=3+4<em>x<\/em>, <em>\u03bc<sub>y<\/sub><\/em> = 31, and <em>\u03bc<sub>x<\/sub><\/em> = 7, and the two points farthest to the top right, (10, 43) and (12, 51), lie exactly on the regression line. These two points together contribute \u2211(<em>x<\/em>-<em>\u03bc<sub>x<\/sub><\/em>)(<em>y<\/em>-<em>\u03bc<sub>y<\/sub><\/em>) = (10-7)(43-31)+(12-7)(51-31) = 136 to the numerator of the covariance. If we switched the <em>x<\/em>\u2019s and <em>y<\/em>\u2019s of these two points, moving them off the regression line so that they became (10, 51) and (12, 43), then <em>\u03bc<sub>x<\/sub><\/em>, <em>\u03bc<sub>y<\/sub><\/em>, <em>\u03c3<sub>x<\/sub><\/em>, and <em>\u03c3<sub>y<\/sub><\/em> would remain the same, but these points would only contribute (10-7)(51-31)+(12-7)(43-31) = 120 to the numerator. As you can see, covariance is at its greatest, given the distributions of the <em>x<\/em>\u2019s and <em>y<\/em>\u2019s, when the (<em>x<\/em>, <em>y<\/em>) points lie on a straight line. Given that correlation, <em>r<\/em>, equals 1 when the covariance is maximized, you can see that <em>r<\/em>=+1 when the points lie exactly on a straight line (with a positive slope). 
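This arithmetic is easy to check; the short Python sketch below verifies the two contributions directly:

```python
# Verify the worked example: each pair contributes (x - mu_x)(y - mu_y)
# to the numerator of the covariance, with mu_x = 7 and mu_y = 31.
mu_x, mu_y = 7, 31

def contribution(points):
    return sum((x - mu_x) * (y - mu_y) for x, y in points)

on_line = contribution([(10, 43), (12, 51)])   # both points on y = 3 + 4x
swapped = contribution([(10, 51), (12, 43)])   # same values, re-paired off the line
```

Re-pairing the same x's and y's moves the points off the line and shrinks their contribution from 136 to 120, exactly as claimed.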
The closer the points lie to a straight line, the closer the covariance is to its maximum, and the greater the correlation.\r\n\r\nAs the example in Figure 8.12 shows, the closer the points lie to a straight line, the higher the correlation. Regression finds the straight line that comes as close to the points as possible, so it should not be surprising that correlation and regression are related. One of the ways the <strong>goodness of fit<\/strong> of a regression line can be measured is by <em>R<sup>2<\/sup><\/em>. For the simple two-variable case, <em>R<sup>2<\/sup><\/em> is simply the correlation coefficient\u00a0<em>r<\/em>, squared.\r\n\r\n[caption id=\"attachment_277\" align=\"aligncenter\" width=\"655\"]<a href=\"http:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/03\/Screen-Shot-2015-03-19-at-3.12.14-PM.png\"><img src=\"http:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/03\/Screen-Shot-2015-03-19-at-3.12.14-PM-1024x618.png\" alt=\"Screen Shot 2015-03-19 at 3.12.14 PM\" class=\"wp-image-277\" height=\"395\" width=\"655\" \/><\/a> Figure 8.12 Plot of Initial Population[\/caption]\r\n\r\nCorrelation does not tell us anything about how steep or flat the regression line is, though it does tell us if the slope is positive or negative. If we took the initial population shown in Figure 8.12, and stretched it both left and right horizontally so that each point\u2019s <em>x<\/em>-value changed, but its <em>y<\/em>-value stayed the same, <em>\u03c3<sub>x<\/sub><\/em> would grow while <em>\u03c3<sub>y<\/sub><\/em> stayed the same. If you pulled equally to the right and to the left, both <em>\u03bc<sub>x<\/sub><\/em> and <em>\u03bc<sub>y<\/sub><\/em> would stay the same. 
The covariance would certainly grow, since the (<em>x<\/em>-<em>\u03bc<sub>x<\/sub><\/em>) that goes with each point would be larger in absolute value while the (<em>y<\/em>-<em>\u03bc<sub>y<\/sub><\/em>)\u2019s would stay the same. The equation of the regression line would change, with the slope <em>b<\/em> becoming smaller, but the correlation coefficient would be the same because the points would be just as close to the regression line as before. Once again, notice that correlation tells you how well the line fits the points, but it does not tell you anything about the slope other than whether it is positive or negative. If the points are stretched out horizontally, the slope changes but the correlation does not. Also notice that though the covariance increases, the correlation does not, because <em>\u03c3<sub>x<\/sub><\/em> increases, causing the denominator in the equation for finding <em>r<\/em> to increase as much as the covariance, the numerator.\r\n\r\nThe regression line and covariance approaches to understanding correlation are obviously related. If the points in the population lie very close to the regression line, the covariance will be large in absolute value, since the <em>x<\/em>\u2019s that are far from their mean will be paired with <em>y<\/em>\u2019s that\u00a0are far from theirs. A positive regression slope means that <em>x<\/em> and <em>y<\/em> rise and fall together, which also means that the covariance will be positive. A negative regression slope means that <em>x<\/em> and <em>y<\/em> move in opposite directions, which means a negative covariance.\r\n<h1>Summary<\/h1>\r\nLinear regression allows researchers to estimate the parameters \u2014 the intercept and slopes \u2014 of linear equations connecting two or more variables. 
Knowing that a dependent variable is functionally related to one or more independent or explanatory variables, and having an estimate of the parameters of that function, greatly improves the ability of a researcher to predict the values the dependent variable will take under many conditions. Being able to estimate the effect that one independent variable has on the value of the dependent variable in isolation from changes in other independent variables can be a powerful aid in decision-making and policy design. Being able to test for the existence of individual effects of a number of independent variables helps decision-makers, researchers, and policy-makers identify what variables are most important. Regression is a very powerful statistical tool in many ways.\r\n\r\nThe idea behind regression is simple: it is simply finding the equation of the line that comes as close as possible to as many of the points as possible. The mathematics of regression are not so simple, however. Instead of trying to learn the math, most researchers use computers to find regression equations, so this chapter stressed reading computer printouts rather than the mathematics of regression.\r\n\r\nTwo other topics, which are related to each other and to regression, were also covered: correlation and covariance.\r\n\r\nSomething as powerful as linear regression must have limitations and problems. There is a whole subject, econometrics, which deals with identifying and overcoming the limitations and problems of regression.","rendered":"<p>Regression analysis, like most multivariate statistics, allows you to infer that there is a relationship between two or more variables. These relationships are seldom exact because there is variation caused by many variables, not just the variables being studied.<\/p>\n<p>If you say that students who study more make better grades, you are really hypothesizing that there is a positive relationship between one variable, studying, and another variable, grades. 
You could then complete your inference and test your hypothesis by gathering a sample of (amount studied, grades) data from some students and use regression to see if the relationship in the sample is strong enough to safely infer that there is a relationship in the population. Notice that even if students who study more make better grades, the relationship in the population would not be perfect; the same amount of studying will not result in the same grades for every student (or for one student every time). Some students are taking harder courses, like chemistry or statistics; some are smarter; some study effectively; and some get lucky and find that the professor has asked them exactly what they understood best. For each level of amount studied, there will be a distribution of grades. If there is a relationship between studying and grades, the location of that distribution of grades will change in an orderly manner as you move from lower to higher levels of studying.<\/p>\n<p>Regression analysis is one of the most used and most powerful multivariate statistical techniques for it infers the existence and form of a functional relationship in a population. Once you learn how to use regression, you will be able to estimate the parameters \u2014 the slope and intercept \u2014 of the function that\u00a0links two or more variables. With that estimated function, you will be able to infer or forecast things like unit costs, interest rates, or sales over a wide range of conditions. Though the simplest regression techniques seem limited in their applications, statisticians have developed a number of variations on regression that\u00a0greatly expand the usefulness of the technique. In this chapter, the basics will be discussed. Once again, the t-distribution and F-distribution will be used to test hypotheses.<\/p>\n<h1>What is regression?<\/h1>\n<p>Before starting to learn about regression, go back to algebra and review what a function is. 
The definition of a function can be formal, like the one in my freshman calculus text: &#8220;A function is a set of ordered pairs of numbers (<em>x<\/em>,<em>y<\/em>) such that to each value of the first variable (<em>x<\/em>) there corresponds a unique value of the second variable (<em>y<\/em>)&#8221; (Thomas, 1960).<a class=\"footnote\" title=\"Thomas, G.B. (1960). Calculus and analytical geometry (3rd ed.). Boston, MA: Addison-Wesley.\" id=\"return-footnote-118-1\" href=\"#footnote-118-1\" aria-label=\"Footnote 1\"><sup class=\"footnote\">[1]<\/sup><\/a>.\u00a0More intuitively, if there is a regular relationship between two variables, there is usually a function that describes the relationship. Functions are written in a number of forms. The most general is <strong><em>y<\/em> = f(<em>x<\/em>)<\/strong>, which simply says that the value of y depends on the value of x in some regular fashion, though the form of the relationship is not specified. The simplest functional form is the linear function where:<\/p>\n<p><a href=\"http:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2014\/10\/100000000000004900000015489223001.png\"><img loading=\"lazy\" decoding=\"async\" src=\"http:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2014\/10\/100000000000004900000015489223001.png\" class=\"wp-image-105 size-full alignnone\" alt=\"Linear function\" height=\"21\" width=\"73\" srcset=\"https:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2014\/10\/100000000000004900000015489223001.png 73w, https:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2014\/10\/100000000000004900000015489223001-65x18.png 65w\" sizes=\"auto, (max-width: 73px) 100vw, 73px\" \/><\/a><\/p>\n<p><em>\u03b1<\/em> and <em>\u03b2<\/em> are parameters, remaining constant as <em>x<\/em> and <em>y<\/em> change. 
<em>\u03b1<\/em> is the intercept and <em>\u03b2<\/em> is the slope. If the values of\u00a0<em>\u03b1<\/em> and <em>\u03b2<\/em> are known, you can find the <em>y<\/em> that goes with any <em>x<\/em> by putting the <em>x<\/em> into the equation and solving. There can be functions where one variable depends on the values of two or more other variables, where\u00a0<em>x<sub>1<\/sub><\/em>\u00a0and\u00a0<em>x<sub>2<\/sub><\/em>\u00a0together determine the value of <em>y<\/em>. There can also be non-linear functions, where the value of the dependent variable (<em><strong>y<\/strong><\/em> in all of the examples we have used so far) depends on the values of one or more other variables, but the values of the other variables are squared, or taken to some other power or root, or multiplied together, before the value of the dependent variable is determined. Regression allows you to estimate directly the parameters in linear functions only, though there are tricks that\u00a0allow many non-linear functional forms to be estimated indirectly. Regression also allows you to test whether there is a functional relationship between the variables, by testing the hypothesis that each of the slopes has a value of zero.<\/p>\n<p>First, let us consider the simple case of a two-variable function. You believe that <em>y<\/em>, the dependent variable, is a linear function of <em>x<\/em>, the independent variable \u2014 <em>y<\/em> depends on <em>x<\/em>. Collect a sample of (<em>x<\/em>, <em>y<\/em>) pairs, and plot them on a set of <em>x<\/em>, <em>y<\/em> axes. The basic idea behind regression is to find the equation of the straight line that comes as close as possible to as many of the points as possible. The parameters of the line drawn through the sample are unbiased estimators of the parameters of the line that would come as close as possible to as many of the points as possible in the population, if the population had been gathered and plotted. 
In keeping with the convention of using Greek letters for population values and Roman letters for sample values, the line drawn through a population is:<\/p>\n<p><a href=\"http:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2014\/10\/100000000000004900000015489223001.png\"><img loading=\"lazy\" decoding=\"async\" src=\"http:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2014\/10\/100000000000004900000015489223001.png\" class=\"alignnone wp-image-105 size-full\" alt=\"Linear function\" height=\"21\" width=\"73\" srcset=\"https:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2014\/10\/100000000000004900000015489223001.png 73w, https:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2014\/10\/100000000000004900000015489223001-65x18.png 65w\" sizes=\"auto, (max-width: 73px) 100vw, 73px\" \/><\/a><\/p>\n<p>while the line drawn through a sample is:<\/p>\n<p><em>y<\/em> = <em>a<\/em> + <em>bx<\/em><\/p>\n<p>In most cases, even if the whole population had been gathered, the regression line would not go through every point. Most of the phenomena that business researchers deal with are not perfectly deterministic, so no function will perfectly predict or explain every observation.<\/p>\n<p>Imagine that you wanted to study the estimated price for a one-bedroom apartment in Nelson, BC. You decide to estimate the price\u00a0as a function of its location in relation to downtown. If you collected 12 sample pairs, you would find different apartments located within the same distance from downtown. In other words, you might draw a distribution of prices for apartments located at the same distance from downtown or away from downtown. When you use regression to estimate the parameters of price = f(distance), you are estimating the parameters of the line that connects the mean price\u00a0at each location. 
Because the best that can be expected is to predict the mean price\u00a0for a certain location, researchers often write their regression models with an extra term, the <strong>error term<\/strong>, which notes that many of the members of the population of (location, price of apartment) pairs will not have exactly the predicted price\u00a0because many of the points do not lie directly on the regression line. The error term is usually denoted as <strong><em>\u03b5<\/em><\/strong>, or <strong>epsilon<\/strong>, and you often see regression equations written:<\/p>\n<p><a href=\"http:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2014\/10\/1000000000000061000000157E0FBF2F1.png\"><img loading=\"lazy\" decoding=\"async\" src=\"http:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2014\/10\/1000000000000061000000157E0FBF2F1.png\" alt=\"Regression equation\" class=\"wp-image-107 size-full alignnone\" height=\"21\" width=\"97\" srcset=\"https:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2014\/10\/1000000000000061000000157E0FBF2F1.png 97w, https:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2014\/10\/1000000000000061000000157E0FBF2F1-65x14.png 65w\" sizes=\"auto, (max-width: 97px) 100vw, 97px\" \/><\/a><\/p>\n<p>Strictly, the distribution of <em>\u03b5<\/em> at each location\u00a0must be normal, and the distributions of <em>\u03b5<\/em> for all the locations\u00a0must have the same variance (this is known as homoscedasticity to statisticians).<\/p>\n<h1>Simple regression and least squares method<\/h1>\n<p>In estimating the unknown parameters of the population for the regression line, we need to apply a method by which the vertical distances between the yet-to-be estimated regression line and the observed values in our sample are minimized. 
Each of these distances is called a <em>sample error<\/em>, though it is more commonly referred to as a <em>residual<\/em> and denoted by <em>e<\/em>.\u00a0In more mathematical form, the residual in each pair of observations for <em>x<\/em> and <em>y<\/em> is the difference between the observed <em>y<\/em> and its predicted value. Obviously, some of these residuals will be positive (above the estimated line) and others will be negative\u00a0(below the line). If we square each of these residuals, so that positive and negative residuals cannot cancel each other out, and then add them up over the whole sample, we can write the following criterion for our minimization problem:<\/p>\n<p><a href=\"http:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/03\/image71.png\"><img loading=\"lazy\" decoding=\"async\" src=\"http:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/03\/image71.png\" alt=\"image7\" class=\"wp-image-320 alignnone\" height=\"45\" width=\"139\" srcset=\"https:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/03\/image71.png 205w, https:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/03\/image71-65x21.png 65w\" sizes=\"auto, (max-width: 139px) 100vw, 139px\" \/><\/a><\/p>\n<p><em>S<\/em> is the sum of squares of the residuals. 
By minimizing <em>S<\/em> over any given set of observations for <em>x<\/em>\u00a0and <em>y<\/em>, we will get the following useful formula:<\/p>\n<p>[latex]b=\\frac{\\sum{(x-\\bar{x})(y-\\bar{y})}}{\\sum{(x-\\bar{x})^2}}[\/latex]<\/p>\n<p>After computing the value of <em>b<\/em> from the above formula out of our sample data, and the means of the two series of data on<em> x\u00a0<\/em>and <em>y<\/em>, one can simply recover the intercept of the estimated line using the following equation:<\/p>\n<p><a href=\"http:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/03\/image9.png\"><img loading=\"lazy\" decoding=\"async\" src=\"http:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/03\/image9.png\" alt=\"image9\" class=\"size-full wp-image-323 alignnone\" height=\"32\" width=\"140\" srcset=\"https:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/03\/image9.png 140w, https:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/03\/image9-65x15.png 65w\" sizes=\"auto, (max-width: 140px) 100vw, 140px\" \/><\/a><\/p>\n<p>For the sample data, and given the estimated intercept and slope, for each observation we can define a residual as:<\/p>\n<p><a href=\"http:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/03\/image111.png\"><img loading=\"lazy\" decoding=\"async\" src=\"http:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/03\/image111.png\" alt=\"image11\" class=\"size-full wp-image-326 alignnone\" height=\"32\" width=\"253\" srcset=\"https:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/03\/image111.png 253w, https:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/03\/image111-65x8.png 65w, 
https:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/03\/image111-225x28.png 225w\" sizes=\"auto, (max-width: 253px) 100vw, 253px\" \/><\/a><\/p>\n<p>Depending on the estimated values for intercept and slope, we can draw the estimated line along with all sample data in a <em>y<\/em>&#8211;<em>x<\/em> panel. Such graphs are known as scatter diagrams. Consider our analysis of the price of one-bedroom apartments in Nelson, BC. We would collect data for <em>y=<\/em>price of one bedroom apartment, <em>x<sub>1<\/sub><\/em>=its associated distance from downtown, and <em>x<sub>2<\/sub><\/em>=the size of the apartment, as shown in Table 8.1.<\/p>\n<table>\n<caption>Table 8.1 Data for Price, Size, and Distance of Apartments in Nelson, BC<\/caption>\n<tbody>\n<tr>\n<td colspan=\"3\"><em>y<\/em> = price of apartments in $1000<br \/>\n<em>x<sub>1<\/sub><\/em> = distance of each apartment from downtown in kilometres<br \/>\n<em>x<sub>2<\/sub><\/em> = size of the apartment in square feet<\/td>\n<\/tr>\n<tr>\n<td><strong>y<\/strong><\/td>\n<td><strong>x<sub>1<\/sub><\/strong><\/td>\n<td><strong>x<sub>2<\/sub><\/strong><\/td>\n<\/tr>\n<tr>\n<td>55<\/td>\n<td>1.5<\/td>\n<td>350<\/td>\n<\/tr>\n<tr>\n<td>51<\/td>\n<td>3<\/td>\n<td>450<\/td>\n<\/tr>\n<tr>\n<td>60<\/td>\n<td>1.75<\/td>\n<td>300<\/td>\n<\/tr>\n<tr>\n<td>75<\/td>\n<td>1<\/td>\n<td>450<\/td>\n<\/tr>\n<tr>\n<td>55.5<\/td>\n<td>3.1<\/td>\n<td>385<\/td>\n<\/tr>\n<tr>\n<td>49<\/td>\n<td>1.6<\/td>\n<td>210<\/td>\n<\/tr>\n<tr>\n<td>65<\/td>\n<td>2.3<\/td>\n<td>380<\/td>\n<\/tr>\n<tr>\n<td>61.5<\/td>\n<td>2<\/td>\n<td>600<\/td>\n<\/tr>\n<tr>\n<td>55<\/td>\n<td>4<\/td>\n<td>450<\/td>\n<\/tr>\n<tr>\n<td>45<\/td>\n<td>5<\/td>\n<td>325<\/td>\n<\/tr>\n<tr>\n<td>75<\/td>\n<td>0.65<\/td>\n<td>424<\/td>\n<\/tr>\n<tr>\n<td>65<\/td>\n<td>2<\/td>\n<td>285<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>The graph (shown in Figure 8.1) is a scatter plot of the prices of the apartments and their distances from 
downtown, along with a proposed regression line.<\/p>\n<figure id=\"attachment_1235\" aria-describedby=\"caption-attachment-1235\" style=\"width: 730px\" class=\"wp-caption aligncenter\"><a href=\"http:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/11\/Figure8-1.png\"><img loading=\"lazy\" decoding=\"async\" src=\"http:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/11\/Figure8-1.png\" alt=\"Figure8-1\" class=\"wp-image-1235\" height=\"350\" width=\"730\" \/><\/a><figcaption id=\"caption-attachment-1235\" class=\"wp-caption-text\">Figure 8.1 Scatter Plot of Price, Distance from Downtown, along with a Proposed Regression Line<\/figcaption><\/figure>\n<p>In order to plot such a scatter diagram, you can use many of the available statistical software packages, including Excel, SAS, and Minitab.\u00a0In this scatter diagram, a simple regression line with a negative slope has been drawn. The estimated equation for this scatter diagram from Excel is:<\/p>\n<p><a href=\"http:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/03\/image131.png\"><img loading=\"lazy\" decoding=\"async\" src=\"http:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/03\/image131.png\" alt=\"image13\" class=\"size-full wp-image-329 alignnone\" height=\"30\" width=\"144\" srcset=\"https:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/03\/image131.png 144w, https:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/03\/image131-65x14.png 65w\" sizes=\"auto, (max-width: 144px) 100vw, 144px\" \/><\/a><\/p>\n<p>where <em>a<\/em>=71.84 and\u00a0<em>b<\/em>=-5.38. In other words, for every additional kilometre an apartment is located from downtown, its price is estimated to be $5,380 lower, i.e.\u00a05.38*$1000=$5380. 
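These Excel estimates can be reproduced by hand with the least-squares formulas from the previous section. Here is a minimal Python sketch using the Table 8.1 data (the variable names are ours); it also confirms that, for the two-variable case, R² is just the squared correlation coefficient:

```python
# Reproduce the simple regression of price on distance for the Table 8.1 data:
# b = sum((x - xbar)(y - ybar)) / sum((x - xbar)^2), then a = ybar - b * xbar.
import math

price    = [55, 51, 60, 75, 55.5, 49, 65, 61.5, 55, 45, 75, 65]   # y, $1000s
distance = [1.5, 3, 1.75, 1, 3.1, 1.6, 2.3, 2, 4, 5, 0.65, 2]     # x, km

n = len(price)
xbar = sum(distance) / n
ybar = sum(price) / n

sxy = sum((x - xbar) * (y - ybar) for x, y in zip(distance, price))
sxx = sum((x - xbar) ** 2 for x in distance)
syy = sum((y - ybar) ** 2 for y in price)

b = sxy / sxx          # close to -5.38: each extra km lowers the price ~$5,380
a = ybar - b * xbar    # close to 71.84

# Goodness of fit: R^2 computed from the residuals equals the correlation squared
resid = [y - (a + b * x) for x, y in zip(distance, price)]
r = sxy / math.sqrt(sxx * syy)
r_squared = 1 - sum(e ** 2 for e in resid) / syy
```

Running this recovers the intercept and slope quoted above to within rounding, and shows the negative correlation between price and distance.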
One might also be curious about the fitted values out of this estimated model. You can simply plug the actual value for <em>x<\/em> into the estimated line, and find the fitted values for the prices of the apartments. The residuals for all 12 observations are shown in Figure 8.2.<\/p>\n<figure id=\"attachment_845\" aria-describedby=\"caption-attachment-845\" style=\"width: 107px\" class=\"wp-caption aligncenter\"><a href=\"http:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/07\/Residuals_Simple-Regression.png\"><img loading=\"lazy\" decoding=\"async\" src=\"http:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/07\/Residuals_Simple-Regression.png\" alt=\"Residuals_Simple Regression\" class=\"wp-image-845 size-full\" height=\"547\" width=\"107\" srcset=\"https:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/07\/Residuals_Simple-Regression.png 107w, https:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/07\/Residuals_Simple-Regression-65x332.png 65w\" sizes=\"auto, (max-width: 107px) 100vw, 107px\" \/><\/a><figcaption id=\"caption-attachment-845\" class=\"wp-caption-text\">Figure 8.2<\/figcaption><\/figure>\n<p>You should also notice\u00a0that by minimizing errors, you have not eliminated them; rather, this method of least squares only guarantees the <em>best fitted<\/em> estimated regression line out of the sample data.<\/p>\n<p>In the presence of the remaining errors, one should be aware of the fact that there are still other factors that might not have been included in our regression model and\u00a0are responsible for the fluctuations in the remaining errors. By adding these excluded but relevant factors to the model, we probably expect the remaining error will show less meaningful fluctuations. In determining the price of these apartments, the missing factors may include age of the apartment, size, etc. 
Because this type of regression model does not include many relevant factors and assumes only a linear relationship, it is known as a simple linear regression model.<\/p>\n<h2>Testing your regression: does <em>y<\/em> really depend on <em>x<\/em>?<\/h2>\n<p>Understanding that there is a distribution of <em>y<\/em> (apartment price) values at each <em>x<\/em> (distance) is the key for understanding how regression results from a sample can be used to test the hypothesis that there is (or is not) a relationship between <em>x<\/em> and <em>y<\/em>. When you hypothesize that <em>y<\/em> = f(<em>x<\/em>), you hypothesize that the slope of the line (<em>\u03b2<\/em> in <em>y<\/em> = <em>\u03b1<\/em> + <em>\u03b2x<\/em> + <em>\u03b5<\/em>) is not equal to zero. If <em>\u03b2<\/em> were equal to zero, changes in <em>x<\/em> would not cause any change in <em>y<\/em>. Choosing a sample of apartments, and finding each apartment\u2019s distance to downtown and its price, gives you a sample of (<em>x<\/em>, <em>y<\/em>). Finding the equation of the line that best fits the sample will give you a sample intercept, <em>a<\/em>, and a sample slope, <em>b<\/em>. These sample statistics are unbiased estimators of the population intercept, <em>\u03b1<\/em>, and slope, <em>\u03b2<\/em>. If another sample of the same size is taken, another sample equation could be generated. If many samples are taken, a sampling distribution of sample <em>b<\/em>\u2019s, the slopes of the sample lines, will be generated. Statisticians know that this sampling distribution of <em>b<\/em>\u2019s will be normal with a mean equal to <em>\u03b2<\/em>, the population slope. Because the standard deviation of this sampling distribution is seldom known, statisticians developed a method to estimate it from a single sample.
With this estimated <em>s<sub>b<\/sub><\/em>, a t-statistic for each sample can be computed:<\/p>\n<p><a href=\"http:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2014\/10\/100000000000009F0000002BE45013371.png\"><img loading=\"lazy\" decoding=\"async\" src=\"http:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2014\/10\/100000000000009F0000002BE45013371.png\" alt=\"T-statistic\" class=\"alignnone wp-image-108 size-full\" height=\"43\" width=\"159\" srcset=\"https:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2014\/10\/100000000000009F0000002BE45013371.png 159w, https:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2014\/10\/100000000000009F0000002BE45013371-65x17.png 65w\" sizes=\"auto, (max-width: 159px) 100vw, 159px\" \/><\/a><\/p>\n<p>where <em>n<\/em> = sample size<\/p>\n<p><em>m<\/em> = number of explanatory (<em>x<\/em>) variables<\/p>\n<p><em>b<\/em> = sample slope<\/p>\n<p><em>\u03b2<\/em>= population slope<\/p>\n<p><em>s<sub>b<\/sub><\/em> = estimated standard deviation of b\u2019s, often called the <strong>standard error<\/strong><\/p>\n<p>These <em>t<\/em>\u2019s follow the t-distribution in the tables with <em>n<\/em>&#8211;<em>m<\/em>-1 df.<\/p>\n<p>Computing <em>s<sub>b<\/sub><\/em> is tedious, and is almost always left to a computer, especially when there is more than one explanatory variable. The estimate is based on how much the sample points vary from the regression line. If the points in the sample are not very close to the sample regression line, it seems reasonable that the population points are also widely scattered around the population regression line and different samples could easily produce lines with quite varied slopes. Though there are other factors involved, in general when the points in the sample are farther from the regression line, <em>s<sub>b<\/sub><\/em> is greater. 
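<\/p>
<p>Although the computation is normally left to software, the arithmetic behind <em>s<sub>b<\/sub><\/em> and the t-score can be sketched in a few lines. The sample below is hypothetical, not the textbook\u2019s data.<\/p>

```python
import math

# Estimate the standard error of the slope, s_b, and the t-score for
# testing H0: beta = 0.  The sample below is hypothetical.
x = [0.5, 1.0, 1.5, 2.0, 3.0, 4.0]   # distance (km)
y = [58, 55, 54, 50, 48, 40]         # price ($1,000s)

n, m = len(x), 1                     # m = number of explanatory variables
x_bar, y_bar = sum(x) / n, sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
a = y_bar - b * x_bar

sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))  # sum of squared residuals
s_b = math.sqrt(sse / (n - m - 1)) / math.sqrt(s_xx)  # standard error of the slope
t = (b - 0) / s_b                    # compare with the t-table at n - m - 1 df
```

<p>Notice that <em>s<sub>b<\/sub><\/em> grows with the scatter of the points around the line and shrinks as the <em>x<\/em> values spread out, matching the intuition above.<\/p>
<p>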
Rather than learn how to compute <em>s<sub>b<\/sub><\/em>, it is more useful for you to learn how to find it on the regression results that you get from statistical software. It is often called the standard error and there is one for each independent variable. The printout in Figure 8.3 is typical.<\/p>\n<figure id=\"attachment_350\" aria-describedby=\"caption-attachment-350\" style=\"width: 608px\" class=\"wp-caption aligncenter\"><a href=\"http:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/03\/Simple_Regression.png\"><img loading=\"lazy\" decoding=\"async\" src=\"http:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/03\/Simple_Regression.png\" alt=\"Simple_Regression\" class=\"wp-image-350\" height=\"312\" width=\"608\" srcset=\"https:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/03\/Simple_Regression.png 741w, https:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/03\/Simple_Regression-300x154.png 300w, https:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/03\/Simple_Regression-65x33.png 65w, https:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/03\/Simple_Regression-225x115.png 225w, https:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/03\/Simple_Regression-350x179.png 350w\" sizes=\"auto, (max-width: 608px) 100vw, 608px\" \/><\/a><figcaption id=\"caption-attachment-350\" class=\"wp-caption-text\">Figure 8.3 Typical Statistical Package Output for Linear Simple Regression Model<\/figcaption><\/figure>\n<p>You will need these standard errors in order to test to see if <em>y<\/em> depends on <em>x<\/em> or not. You want to test to see if the slope of the line in the population, <em>\u03b2<\/em>, is equal to zero or not. 
If the slope equals zero, then changes in <em>x<\/em> do not result in any change in <em>y<\/em>. Formally, for each independent variable, you will have a test of the hypotheses:<\/p>\n<p>[latex]H_o: \\beta = 0[\/latex]<\/p>\n<p>[latex]H_a: \\beta \\neq 0[\/latex]<\/p>\n<p>If the t-score is large (either negative or positive), then the sample <em>b<\/em> is far from zero (the hypothesized <em>\u03b2<\/em>), and <em>H<sub>a<\/sub><\/em> should be accepted. Substitute zero for <em>\u03b2<\/em> into the t-score equation, and if the t-score is small, <em>b<\/em> is close enough to zero to accept <em>H<sub>o<\/sub><\/em>. To find out what t-value separates &#8220;close to zero&#8221; from &#8220;far from zero&#8221;, choose an alpha, find the degrees of freedom, and use a t-table from any textbook, or simply use the interactive Excel template from <a href=\"http:\/\/opentextbc.ca\/introductorybusinessstatistics\/chapter\/making-estimates-2\/\">Chapter 3<\/a>, which is shown again in Figure 8.4.<\/p>\n<p><iframe loading=\"lazy\" src=\"https:\/\/onedrive.live.com\/embed?cid=0B27F889BFE551E2&amp;resid=B27F889BFE551E2%21192&amp;authkey=AEj42yeNIcfbMB0&amp;em=2&amp;wdAllowInteractivity=False&amp;AllowTyping=True&amp;wdHideGridlines=True&amp;wdHideHeaders=True\" width=\"100%\" height=\"600\"><\/iframe><br \/>\nFigure 8.4 Interactive Excel Template for Determining t-Value from the t-Table &#8211; see Appendix 8.<\/p>\n<p>Remember to halve alpha\u00a0when conducting a two-tail test like this. The degrees of freedom equal <em>n<\/em> &#8211; <em>m<\/em> &#8211; 1, where <em>n<\/em> is the size of the sample and <em>m<\/em> is the number of independent <em>x<\/em> variables. There is a separate hypothesis test for each independent variable. This means you test to see if <em>y<\/em> is a function of each <em>x<\/em> separately.
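<\/p>
<p>The decision rule itself is a single comparison. In this sketch, both the t-score and the table value are hypothetical inputs; in practice the t-score comes from the regression printout and the table value from the t-table.<\/p>

```python
# Two-tail test of H0: beta = 0 against Ha: beta != 0.
# Hypothetical inputs: with n = 12 and m = 1 there are 10 df, and a
# two-tail alpha of .05 gives a table t of about 2.228.
def slope_test(t_score, table_t):
    # a large |t| puts the sample b "far from zero", so H0 is rejected
    if abs(t_score) > table_t:
        return "reject H0: y depends on x"
    return "fail to reject H0"

print(slope_test(-10.6, 2.228))
print(slope_test(0.9, 2.228))
```

<p>With 10 df and a two-tail alpha of .05, the table value is about 2.228, so a t-score of -10.6 is judged far from zero while 0.9 is not.<\/p>
<p>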
You can also test to see if <em>\u03b2<\/em> &gt; 0 (or <em>\u03b2<\/em> &lt; 0) rather than <em>\u03b2<\/em> \u2260 0 by using a one-tail test, or test to see if <em>\u03b2<\/em> equals a particular value by substituting that value for <em>\u03b2<\/em> when computing the sample t-score.<\/p>\n<h2>Testing your regression: does this equation really help predict?<\/h2>\n<p>To test to see if the regression equation really helps, see how much of the error that would be made using the mean of all of the <em>y<\/em>\u2019s to predict is eliminated by using the regression equation to predict. By testing to see if the regression helps predict, you are testing to see if there is a functional relationship in the population.<\/p>\n<p>Imagine that you have found the mean price of the apartments\u00a0in our\u00a0sample, and for each apartment, you have made the simple prediction that the price of the apartment\u00a0will be equal to the sample mean, <span style=\"border-top: 1px; border-left: 0px; border-right: 0px; border-bottom: 0px; border-style: solid;\"><em>y<\/em><\/span>. This is not a very sophisticated prediction technique, but remember that the sample mean is an unbiased estimator of the population mean, so <strong>on average<\/strong> you will be right. For each apartment, you could compute your <strong>error<\/strong> by finding the difference between your prediction (the sample mean, <span style=\"border-top: 1px; border-left: 0px; border-right: 0px; border-bottom: 0px; border-style: solid;\"><em>y<\/em><\/span>) and the actual price\u00a0of an apartment.<\/p>\n<p>As an alternative way to predict the price, you can have a computer find the intercept, <em>a<\/em>, and slope, <em>b<\/em>, of the sample regression line.
Now, you can make another prediction of how much each apartment\u00a0in the sample may be worth\u00a0by computing:<\/p>\n<p>[latex]\\hat{y} = a + b(distance)[\/latex]<\/p>\n<p>Once again, you can find the error made for each apartment\u00a0by finding the difference between the price of apartments\u00a0predicted using the regression equation, <em>\u0177<\/em>, and the observed price, <em>y<\/em>. Finally, find how much using the regression improves your prediction by finding the difference between the price\u00a0predicted using the mean, <span style=\"border-top: 1px; border-left: 0px; border-right: 0px; border-bottom: 0px; border-style: solid;\"><em>y<\/em><\/span>, and the price\u00a0predicted using regression, \u0177. Notice that the measures of these differences could be positive or negative numbers, but that error or <strong>improvement<\/strong> implies a positive distance.<\/p>\n<h2>Coefficient of Determination<\/h2>\n<p>If you use the sample mean to predict the price of\u00a0each apartment, your error is (<em>y<\/em>&#8211;<span style=\"border-top: 1px; border-left: 0px; border-right: 0px; border-bottom: 0px; border-style: solid;\"><em>y<\/em><\/span>) for each apartment. Squaring each error so that worries about signs are overcome, and then adding the squared errors together, gives you a measure of the total mistake you make if you want to predict <em>y<\/em>. Your total mistake is \u03a3(<em>y<\/em>&#8211;<span style=\"border-top: 1px; border-left: 0px; border-right: 0px; border-bottom: 0px; border-style: solid;\"><em>y<\/em><\/span>)<sup>2<\/sup>. The total mistake you make using the regression model would be \u03a3(<em>y-\u0177<\/em>)<sup>2<\/sup>.
The difference between the mistakes, a raw measure of how much your prediction has improved, is \u03a3(<em>\u0177<\/em>&#8211;<span style=\"border-top: 1px; border-left: 0px; border-right: 0px; border-bottom: 0px; border-style: solid;\"><em>y<\/em><\/span>)<sup>2<\/sup>. To make this raw measure of the improvement meaningful, you need to compare it to one of the two measures of the total mistake. This means that there are two measures of &#8220;how good&#8221; your regression equation is. One compares the improvement to the mistakes still made with regression. The other compares the improvement to the mistakes that would be made if the mean was used to predict. The first is called an F-score because the sampling distribution of these measures follows the F-distribution seen in <a href=\"http:\/\/opentextbc.ca\/introductorybusinessstatistics\/chapter\/f-test-and-one-way-anova-2\/\">Chapter 6<\/a>, &#8220;F-test and One-Way ANOVA&#8221;.\u00a0The second is called <em>R<sup>2<\/sup><\/em>, or the <strong>coefficient of determination<\/strong>.<\/p>\n<p>All of these mistakes and improvements have names, and talking about them will be easier once you know those names. The total mistake made using the sample mean to predict, \u03a3(<em>y<\/em>&#8211;<span style=\"border-top: 1px; border-left: 0px; border-right: 0px; border-bottom: 0px; border-style: solid;\"><em>y<\/em><\/span>)<sup>2<\/sup>, is called the <strong>sum of squares, total<\/strong>. The total mistake made using the regression, \u03a3(<em>y-\u0177<\/em>)<sup>2<\/sup>, is called the <strong>sum of squares, error (residual)<\/strong>. The general improvement made by using regression, \u03a3(<em>\u0177<\/em>&#8211;<span style=\"border-top: 1px; border-left: 0px; border-right: 0px; border-bottom: 0px; border-style: solid;\"><em>y<\/em><\/span>)<sup>2<\/sup> is called the <strong>sum of squares, regression<\/strong> or <strong>sum of squares, model<\/strong>. 
You should be able to see that:<\/p>\n<p>sum of squares, total = sum of squares, regression + sum of squares, error (residual)<\/p>\n<p>[latex]\\sum{(y-\\bar{y})^2} = \\sum{(\\hat{y}-\\bar{y})^2} + \\sum{(y-\\hat{y})^2}[\/latex]<\/p>\n<p>In other words, the total variations in <em>y<\/em>\u00a0can be partitioned into two sources: the explained variations and the unexplained variations. Further, we can rewrite the above equation as:<\/p>\n<p><a href=\"http:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/03\/image17.png\"><img loading=\"lazy\" decoding=\"async\" src=\"http:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/03\/image17.png\" alt=\"image17\" class=\"size-full wp-image-339 alignnone\" height=\"30\" width=\"131\" srcset=\"https:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/03\/image17.png 131w, https:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/03\/image17-65x15.png 65w\" sizes=\"auto, (max-width: 131px) 100vw, 131px\" \/><\/a><\/p>\n<p>where SST stands for sum of squares due to total variations, SSR measures the sum of squares due to the estimated regression model that is explained by variable <em>x<\/em>, and SSE measures all the variations due to other factors excluded from the estimated model.<\/p>\n<p>Going back to the idea of goodness of fit, one should be able to easily calculate the percentage of each variation with respect to the total variations. In particular, the strength of the estimated regression model can now be measured.
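<\/p>
<p>This partition can be checked numerically. The data below are hypothetical, not the Nelson sample; the check holds for any least-squares line fitted with an intercept.<\/p>

```python
# Check the sum-of-squares partition SST = SSR + SSE on hypothetical data.
x = [0.5, 1.0, 1.5, 2.0, 3.0, 4.0]
y = [58, 55, 54, 50, 48, 40]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) \
    / sum((xi - x_bar) ** 2 for xi in x)
a = y_bar - b * x_bar
y_hat = [a + b * xi for xi in x]

sst = sum((yi - y_bar) ** 2 for yi in y)               # total mistake using the mean
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))  # mistakes still made by regression
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)           # improvement from regression

assert abs(sst - (ssr + sse)) < 1e-9  # the partition holds
```

<p>The assertion at the end confirms that the improvement (SSR) and the remaining mistakes (SSE) add up to the total mistake (SST).<\/p>
<p>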
Since we are interested in the explained part of the variations by the estimated model, we simply divide both sides of the above equation by SST, and we get:<\/p>\n<p><a href=\"http:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/03\/image18.png\"><img loading=\"lazy\" decoding=\"async\" src=\"http:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/03\/image18.png\" alt=\"image18\" class=\"size-full wp-image-340 alignnone\" height=\"30\" width=\"221\" srcset=\"https:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/03\/image18.png 221w, https:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/03\/image18-65x9.png 65w\" sizes=\"auto, (max-width: 221px) 100vw, 221px\" \/><\/a><\/p>\n<p>We then\u00a0isolate this equation for the explained proportion, also known as <em>R<\/em>-square:<\/p>\n<p><a href=\"http:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/03\/image192.png\"><img loading=\"lazy\" decoding=\"async\" src=\"http:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/03\/image192.png\" alt=\"image19\" class=\"wp-image-343 alignnone\" height=\"49\" width=\"139\" srcset=\"https:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/03\/image192.png 138w, https:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/03\/image192-65x23.png 65w\" sizes=\"auto, (max-width: 139px) 100vw, 139px\" \/><\/a><\/p>\n<p>Only in cases where an intercept is included in a simple regression model will the value of <em>R<sup>2<\/sup><\/em>\u00a0be bounded between zero and one. The closer <em>R<sup>2<\/sup><\/em> is to one, the stronger the model is. 
Alternatively, <em>R<sup>2<\/sup><\/em> is also found by:<\/p>\n<p><a href=\"http:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/03\/R-Square.png\"><img loading=\"lazy\" decoding=\"async\" src=\"http:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/03\/R-Square.png\" alt=\"R-Square\" class=\"size-full wp-image-344 alignnone\" height=\"50\" width=\"341\" srcset=\"https:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/03\/R-Square.png 341w, https:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/03\/R-Square-300x44.png 300w, https:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/03\/R-Square-65x10.png 65w, https:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/03\/R-Square-225x33.png 225w\" sizes=\"auto, (max-width: 341px) 100vw, 341px\" \/><\/a><\/p>\n<p>This is the ratio of the improvement made using the regression to the mistakes made using the mean.\u00a0The numerator is the improvement regression makes over using the mean to predict; the denominator is the mistakes (errors) made using the mean. Thus <em>R<sup>2<\/sup><\/em> simply shows what proportion of the mistakes made using the mean is eliminated by using regression.<\/p>\n<p>In the case of the market for one-bedroom apartments in Nelson, BC, the proportion of the variation in apartment prices explained by the estimated model is around 50%. This indicates that only half of the fluctuations in apartment prices with respect to the average price can be explained by the apartments\u2019 distance from downtown. The other 50% is unexplained (that is, not controlled for) and is subject to further research. One typical approach is to add more relevant factors to the simple regression model.
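<\/p>
<p>A quick numerical check of the two equivalent formulas for <em>R<sup>2<\/sup><\/em>, using hypothetical sums of squares chosen to mirror the 50% result:<\/p>

```python
# R-squared computed two equivalent ways.  The sums of squares are
# hypothetical, chosen so the answer mirrors the roughly 50% in the text.
sst = 236.0   # total: sum of (y - y_bar)^2
sse = 118.0   # residual: sum of (y - y_hat)^2
ssr = sst - sse

r2_from_improvement = ssr / sst   # improvement over the mean
r2_from_errors = 1 - sse / sst    # errors eliminated by regression
print(r2_from_improvement, r2_from_errors)  # prints 0.5 0.5
```

<p>Both expressions return 0.5: the regression eliminates half of the mistakes that would be made using the mean alone.<\/p>
<p>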
In this case, the estimated model is referred to as a multiple regression model.<\/p>\n<p>While <em>R<sup>2<\/sup><\/em> is not used to test hypotheses, it has a more intuitive meaning than the F-score.\u00a0The F-score is the measure usually used in a hypothesis test to see if the regression made a significant improvement over using the mean.\u00a0It is used because the sampling distribution of F-scores that it follows is printed in the tables at the back of most statistics books, so that it can be used for hypothesis testing. It works no matter how many explanatory variables are used. More formally, consider a population of multivariate observations, (<em>y, x<sub>1<\/sub>, x<sub>2<\/sub>, &#8230;, x<sub>m<\/sub><\/em>), where\u00a0there is no linear relationship between <em>y<\/em> and the <em>x<\/em>\u2019s, so that <em>y<\/em> \u2260 f(<em>x<sub>1<\/sub>, x<sub>2<\/sub>, &#8230;, x<sub>m<\/sub><\/em>). If samples of <em>n<\/em> observations are taken, a regression equation estimated for each sample, and a statistic, F, found for each sample regression, then those F\u2019s will be distributed like those shown in Figure 8.5, the\u00a0F-table with (<em>m<\/em>, <em>n<\/em>&#8211;<em>m<\/em>-1) df.<\/p>\n<p><iframe loading=\"lazy\" src=\"https:\/\/onedrive.live.com\/embed?cid=0B27F889BFE551E2&amp;resid=B27F889BFE551E2%21194&amp;authkey=ACbx6NPdd4cpRL8&amp;em=2&amp;wdAllowInteractivity=False&amp;AllowTyping=True&amp;wdHideGridlines=True&amp;wdHideHeaders=True\" width=\"100%\" height=\"600\"><\/iframe><br \/>\nFigure 8.5 Interactive Excel Template of an F-Table &#8211; see Appendix 8.<\/p>\n<p>The value of F can be calculated as:<\/p>\n<p><a href=\"http:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2014\/10\/10000000000000C700000052A5D78DF11.png\"><img loading=\"lazy\" decoding=\"async\" 
src=\"http:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2014\/10\/10000000000000C700000052A5D78DF11.png\" alt=\"Sum of squares regression \/ sum of squares residual\" class=\"wp-image-112 size-full alignnone\" height=\"82\" width=\"199\" srcset=\"https:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2014\/10\/10000000000000C700000052A5D78DF11.png 199w, https:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2014\/10\/10000000000000C700000052A5D78DF11-65x26.png 65w\" sizes=\"auto, (max-width: 199px) 100vw, 199px\" \/><\/a><\/p>\n<p><a href=\"http:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2014\/10\/10000000000000910000004E1831B8BA1.png\"><img loading=\"lazy\" decoding=\"async\" src=\"http:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2014\/10\/10000000000000910000004E1831B8BA1.png\" alt=\"Improvement made \/ mistakes still made\" class=\"wp-image-113 size-full alignnone\" height=\"78\" width=\"145\" srcset=\"https:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2014\/10\/10000000000000910000004E1831B8BA1.png 145w, https:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2014\/10\/10000000000000910000004E1831B8BA1-65x34.png 65w\" sizes=\"auto, (max-width: 145px) 100vw, 145px\" \/><\/a><\/p>\n<p><a href=\"http:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2014\/10\/100000000000006E00000065597DBFCE1.png\"><img loading=\"lazy\" decoding=\"async\" src=\"http:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2014\/10\/100000000000006E00000065597DBFCE1.png\" alt=\"Value of F\" class=\"wp-image-114 size-full alignnone\" height=\"101\" width=\"110\" 
srcset=\"https:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2014\/10\/100000000000006E00000065597DBFCE1.png 110w, https:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2014\/10\/100000000000006E00000065597DBFCE1-65x59.png 65w\" sizes=\"auto, (max-width: 110px) 100vw, 110px\" \/><\/a><\/p>\n<p>where\u00a0<em>n<\/em> is the size of the sample, and\u00a0<em>m<\/em> is the number of explanatory variables (how many <em>x<\/em>\u2019s there are in the regression equation).<\/p>\n<p>If \u03a3(<em>\u0177<\/em>&#8211;<span style=\"border-top: 1px; border-left: 0px; border-right: 0px; border-bottom: 0px; border-style: solid;\"><em>y<\/em><\/span>)<sup>2<\/sup>, the sum of squares regression (the improvement), is large relative to \u03a3(<em>y-\u0177<\/em>)<sup>2<\/sup>, the sum of squares residual (the mistakes still made), then the F-score will be large. In a population where there is no functional relationship between <em>y<\/em> and the <em>x<\/em>\u2019s, the regression line will have a slope of zero (it will be flat), and the <em>\u0177<\/em> will be close to <span style=\"border-top: 1px; border-left: 0px; border-right: 0px; border-bottom: 0px; border-style: solid;\"><em>y<\/em><\/span>. As a result very few samples from such populations will have a large sum of squares regression and large F-scores. Because this F-score is distributed like the one in the F-tables, the tables can tell you whether the F-score a sample regression equation produces is large enough to be judged unlikely to occur if <em>y<\/em> \u2260 f(<em>x<sub>1<\/sub>, x<sub>2<\/sub>, &#8230;, x<sub>m<\/sub><\/em>). The sum of squares regression is divided by the number of explanatory variables to account for the fact that it always increases when more variables are added. You can also look at this as finding the improvement per explanatory variable. 
The sum of squares residual is divided by a number very close to the number of observations because it always increases if more observations are added. You can also look at this as the approximate mistake per observation.<\/p>\n<p>To test to see if a regression equation was worth estimating, test to see if there seems to be a functional relationship:<\/p>\n<p>[latex]H_0: y \\neq f(x_1,x_2,\\cdots,x_m)[\/latex]<\/p>\n<p>[latex]H_a: y = f(x_1,x_2,\\cdots,x_m)[\/latex]<\/p>\n<p>This might look like a two-tailed test since <em>H<sub>a<\/sub><\/em> has an equal sign. But, by looking at the equation for the F-score you should be able to see that the data support <em>H<sub>a<\/sub><\/em> only if the F-score is large. This is because the data support the existence of a functional relationship if the sum of squares regression is large relative to the sum of squares residual. Since <a href=\"http:\/\/www.stat.purdue.edu\/~jtroisi\/STAT350Spring2015\/tables\/FTable.pdf\">F-tables<\/a> are usually one-tail tables, choose an <em>\u03b1<\/em>, go to the <a href=\"http:\/\/www.stat.purdue.edu\/~jtroisi\/STAT350Spring2015\/tables\/FTable.pdf\">F-tables<\/a> for that <em>\u03b1<\/em> and (<em>m<\/em>, <em>n<\/em>&#8211;<em>m<\/em>-1) df, and find the table F. If the computed F is greater than the table F, then the computed F is unlikely to have occurred if <em>H<sub>o<\/sub><\/em> is true, and you can safely decide that the data support <em>H<sub>a<\/sub><\/em>. There is a functional relationship in the population.<\/p>\n<p>Now that you have learned all the necessary steps in estimating a simple regression model, you may take some time to re-estimate the Nelson apartment model or any other simple regression model, using the interactive Excel template shown in Figure 8.6. Like all other interactive templates in this textbook, you can change the values in the yellow cells only. The result will be shown automatically within this template. 
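<\/p>
<p>The F-score arithmetic just described can be sketched as follows. The sums of squares and sample sizes are hypothetical inputs; the table F of 4.96 is the critical value for <em>\u03b1<\/em> = .05 with (1, 10) df.<\/p>

```python
# F-score for the overall test: F = (SSR / m) / (SSE / (n - m - 1)).
# All inputs below are hypothetical.
def f_score(ssr, sse, n, m):
    msr = ssr / m             # improvement per explanatory variable
    mse = sse / (n - m - 1)   # approximate mistake per observation
    return msr / mse

f = f_score(ssr=118.0, sse=118.0, n=12, m=1)
# table F for alpha = .05 and (1, 10) df is about 4.96
print(f > 4.96)  # prints True
```

<p>Because the computed F exceeds the table F, such a sample supports the existence of a functional relationship.<\/p>
<p>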
For this template, you can only estimate simple regression models with 30 observations. You use <em>special paste\/values<\/em> when you paste your data from other spreadsheets. The first step is to enter your data under independent and dependent variables. Next, select your alpha level. Check your results in terms of both individual and overall significance. Once the model has passed all these requirements, you can select an appropriate value for the independent variable, which in this example is the distance to downtown, to estimate both the confidence intervals for the average price of such an apartment, and the prediction intervals for the selected distance. Both these intervals are discussed later in this chapter. Remember that by changing any of the values in the yellow areas in this template, all calculations will be updated, including the tests of significance and the values for both confidence and prediction intervals.<\/p>\n<p><iframe loading=\"lazy\" src=\"https:\/\/onedrive.live.com\/embed?cid=0B27F889BFE551E2&amp;resid=B27F889BFE551E2%21195&amp;authkey=ADs_Wx8MoXAAfYw&amp;em=2&amp;wdAllowInteractivity=False&amp;AllowTyping=True&amp;wdHideGridlines=True&amp;wdHideHeaders=True\" width=\"100%\" height=\"600\"><\/iframe><br \/>\nFigure 8.6 Interactive Excel Template for Simple Regression &#8211; see Appendix 8.<\/p>\n<h2>Multiple Regression Analysis<\/h2>\n<p>When we add more explanatory variables to our simple regression model to strengthen its ability to explain real-world data, we in fact convert a simple regression model into a multiple regression model. The least squares approach we used in the case of simple regression can still\u00a0be used for multiple regression analysis.<\/p>\n<p>As per our discussion in the simple regression model section, our low estimated <em>R<sup>2<\/sup><\/em>\u00a0indicated that only 50% of the variations in the price of apartments in Nelson, BC, was explained by their distance from downtown. 
Obviously, there should be more relevant factors that can be added into this model to make it stronger. Let\u2019s add a second explanatory factor to this model. We collected data for the area of each apartment in square feet\u00a0(i.e., \u00a0<em>x<sub>2<\/sub><\/em>). If we go back to Excel and estimate our model including the newly added variable, we will see the printout shown in Figure 8.7.<\/p>\n<figure id=\"attachment_399\" aria-describedby=\"caption-attachment-399\" style=\"width: 656px\" class=\"wp-caption aligncenter\"><a href=\"http:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/03\/multiple-regression.png\"><img loading=\"lazy\" decoding=\"async\" src=\"http:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/03\/multiple-regression.png\" alt=\"multiple regression\" class=\"wp-image-399 size-full\" height=\"376\" width=\"656\" srcset=\"https:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/03\/multiple-regression.png 656w, https:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/03\/multiple-regression-300x172.png 300w, https:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/03\/multiple-regression-65x37.png 65w, https:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/03\/multiple-regression-225x129.png 225w, https:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/03\/multiple-regression-350x201.png 350w\" sizes=\"auto, (max-width: 656px) 100vw, 656px\" \/><\/a><figcaption id=\"caption-attachment-399\" class=\"wp-caption-text\">Figure 8.7 Excel Printout<\/figcaption><\/figure>\n<p>The estimated equation of the regression model is:<\/p>\n<p>predicted price\u00a0of apartments = 60.041 &#8211; 5.393*distance\u00a0+ .03*area<\/p>\n<p>This is the equation for a plane, the three-dimensional
equivalent of a straight line. It is still a linear function because neither of the <em>x<\/em>\u2019s nor <em>y<\/em> is raised to a power nor taken to some root nor are the <em>x<\/em>\u2019s multiplied together. You can have even more independent variables, and as long as the function is linear, you can estimate the slope, <em>\u03b2<\/em>, for each independent variable.<\/p>\n<p>Before using this estimated model for prediction and decision-making purposes, we should test three hypotheses. First, we\u00a0can use the F-score to test to see if the regression model improves our\u00a0ability to predict price\u00a0of apartments. In other words, we test the <em>overall<\/em> significance of the estimated model. Second and third, we\u00a0can use the t-scores to test to see if the slopes of distance\u00a0and area\u00a0are different from zero. These two t-tests are also known as <em>individual<\/em> tests of significance.<\/p>\n<p>To conduct the first test, we\u00a0choose an <em>\u03b1<\/em> = .05. The F-score is the regression or model mean square over the residual or error mean square, so the df for the F-statistic are first the df for the regression model and, second, the df for the error. There are 2 and 9\u00a0df for the F-test. According to this <a href=\"http:\/\/www.stat.purdue.edu\/~jtroisi\/STAT350Spring2015\/tables\/FTable.pdf\">F-table<\/a>, with 2\u00a0and 9\u00a0df, the critical F-score for <em>\u03b1<\/em> = .05 is 4.26.<\/p>\n<p>The hypotheses are:<\/p>\n<p><em>H<sub>0<\/sub><\/em>: price\u00a0\u2260 f(distance, area)<\/p>\n<p><em>H<sub>a<\/sub><\/em>: price = f(distance, area)<\/p>\n<p>Because the F-score from the regression, 6.812, is greater than the critical F-score, 4.26, we decide that the data support <em>H<sub>a<\/sub><\/em> and conclude that the model helps us predict price of\u00a0apartments. Alternatively, we say there is such a functional relationship in the population.<\/p>\n<p>Now, we move to the individual test of significance. 
We can test to see if price\u00a0depends on distance and area.\u00a0There are (<em>n-m<\/em>-1)=(12-2-1)=9\u00a0df. There are two sets of hypotheses, one set for <em>\u03b2<sub>1<\/sub><\/em>, the slope for distance, and one set for <em>\u03b2<sub>2<\/sub><\/em>, the slope for area. For a small town, one may\u00a0expect that <em>\u03b2<sub>1<\/sub><\/em>, the slope for distance, will be negative, and that <em>\u03b2<sub>2<\/sub><\/em> will be positive. Therefore, we\u00a0will use a one-tail test on <em>\u03b2<sub>1<\/sub><\/em>, as well as on <em>\u03b2<sub>2<\/sub><\/em>:<\/p>\n<p>[latex]H_0: \\beta _1 \\geq 0 \\qquad H_a:\\beta _1 <0[\/latex]<\/p>\n<p>[latex]H_0: \\beta _2 \\leq 0 \\qquad H_a:\\beta _2 >0[\/latex]\n\nSince we have\u00a0two one-tail tests, the t-value we\u00a0choose from the t-table will be the same\u00a0(in absolute value) for the two tests. Using <em>\u03b1<\/em> = .05 and 9 df, the critical t-score for a one-tail test is 1.833.\u00a0Looking back at our Excel\u00a0printout and checking the <em>t<\/em>-scores, we\u00a0decide that distance does affect the price of apartments, but area is not a significant factor in explaining the price of apartments.\u00a0Notice that the printout also gives a <em>t<\/em>-score for the intercept, so we\u00a0could test to see if the intercept equals zero or not.<\/p>\n<p>Alternatively, one may compare the <em>p<\/em>-values from the Excel printout directly against the assumed level of significance\u00a0(i.e., <em>\u03b1<\/em> = .05). 
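<\/p>
<p>These individual tests can also be sketched in Python. Again, the data below are <em>hypothetical<\/em>, not the textbook\u2019s sample; the point is how the t-scores on the Excel printout are produced.<\/p>

```python
import numpy as np

# Hypothetical data for 12 apartments (NOT the textbook's sample)
price    = np.array([55, 48, 60, 35, 41, 70, 38, 52, 65, 44, 58, 33], dtype=float)
distance = np.array([2, 4, 1, 6, 5, 1, 6, 3, 1, 4, 2, 7], dtype=float)
area     = np.array([650, 600, 700, 550, 580, 760, 540, 620, 720, 590, 680, 530], dtype=float)

n = len(price)
X = np.column_stack([np.ones(n), distance, area])
xtx_inv = np.linalg.inv(X.T @ X)
beta = xtx_inv @ X.T @ price                 # least-squares estimates

resid = price - X @ beta
df = n - 2 - 1                               # (n - m - 1) = 9
mse = resid @ resid / df                     # error mean square
se_beta = np.sqrt(mse * np.diag(xtx_inv))    # standard errors of the estimates

t_scores = beta / se_beta                    # for the intercept, distance, and area
# For a one-tail test with alpha = .05 and 9 df, the critical t is 1.833;
# compare each slope's t-score against it, with the expected sign.
print(t_scores.round(3))
```

<p>Each t-score is simply the estimated coefficient divided by its standard error; Excel\u2019s printout also converts these into two-tail <em>p<\/em>-values.<\/p>
<p>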
We can easily see that the\u00a0<em>p-<\/em>values associated with the intercept and distance are both less than <em>\u03b1<\/em>, and as a result we reject the hypothesis that the associated\u00a0coefficients\u00a0are zero\u00a0(i.e., both are\u00a0significant).\u00a0However, area is not a significant factor, since its associated\u00a0<em>p-<\/em>value\u00a0is greater than <em>\u03b1<\/em>.<\/p>\n<p>While there are other required assumptions and conditions in both simple and multiple regression models (we encourage students to consult\u00a0an intermediate business statistics open textbook\u00a0for more detailed discussions), here we\u00a0focus on only two relevant points about the use and application of multiple regression.<\/p>\n<p>The first point is the interpretation of the estimated coefficients in a multiple regression model. In a simple regression model, the estimated coefficient of the independent variable is simply the slope of the line: it is the response of the dependent variable to a one-unit change in the independent variable. In a multiple regression model, this interpretation must be adjusted slightly. Each estimated coefficient is the response of the dependent variable to a one-unit change in one of the independent variables when the levels of all other independent variables are kept constant. In our example, the estimated coefficient of distance indicates that, for a given size of apartment, the price of an apartment in Nelson, BC, will drop by\u00a05.393*1000=$5,393 for every additional kilometre that the apartment is away from downtown.<\/p>\n<p>The second point is about the use of <em>R<sup>2<\/sup><\/em> in multiple regression analysis. 
Technically, adding more independent variables to the model will increase the value of <em>R<sup>2<\/sup><\/em>, regardless of whether\u00a0the added variables are relevant or irrelevant in explaining the variation in the dependent variable. In order to <em>adjust<\/em> the <em>R<sup>2<\/sup><\/em> inflated by\u00a0irrelevant variables\u00a0added to the model, the following formula is recommended in the case of multiple regression:<\/p>\n<p><a href=\"http:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/06\/Screen-Shot-2015-06-16-at-8.36.58-AM.png\"><img loading=\"lazy\" decoding=\"async\" src=\"http:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/06\/Screen-Shot-2015-06-16-at-8.36.58-AM-300x62.png\" class=\"alignnone wp-image-664\" height=\"39\" width=\"181\" alt=\"image\" srcset=\"https:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/06\/Screen-Shot-2015-06-16-at-8.36.58-AM-65x13.png 65w, https:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/06\/Screen-Shot-2015-06-16-at-8.36.58-AM-225x47.png 225w\" sizes=\"auto, (max-width: 181px) 100vw, 181px\" \/><\/a><\/p>\n<p>where <em>n <\/em>is the sample size, and <em>k<\/em>\u00a0is the number of estimated parameters in our model.<\/p>\n<p>Going back to our earlier Excel results for the\u00a0multiple regression model estimated for the apartment example, we can see that while the <em>R<sup>2<\/sup><\/em> was inflated\u00a0from .504 to .612 by the newly added factor, apartment size, the adjusted <em>R<sup>2<\/sup><\/em> brings this inflated value back down to .526. To understand why, pay attention to the <em>p<\/em>-value associated with the newly added factor. Since this value is greater than .05, we cannot reject the hypothesis that the true coefficient of apartment size (area) is zero. 
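<\/p>
<p>The adjusted <em>R<sup>2<\/sup><\/em> arithmetic is easy to verify; here <em>n<\/em> = 12 and the model has two independent variables, as in the apartment example.<\/p>

```python
n, k = 12, 2                # sample size and number of independent variables
r2 = 0.612                  # R-squared from the Excel printout
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
print(round(adj_r2, 3))     # 0.526, matching the printout
```

<p>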
In other words, in its current situation, apartment size is not a significant factor, yet the value of <em>R<sup>2<\/sup><\/em> has been inflated!<\/p>\n<p>Furthermore, the <em>R<sup>2<\/sup><\/em> indicates that only 61.2% of the variation in the price of one-bedroom apartments in Nelson, BC, can be explained by their locations and sizes. Almost 40% of the variation in price still\u00a0cannot be explained by these two factors. One may seek\u00a0to improve this model by searching for more relevant factors, such as style of apartment, year built, etc., and adding them to the model.<\/p>\n<p>Using the interactive Excel template shown in Figure 8.8, you can estimate a multiple regression model. Again, enter your data into the yellow cells only. For this template you are allowed up to 50 observations for each column. As with all other interactive templates in this textbook, use <em>special paste\/values<\/em> when you paste your data from other spreadsheets. If you have fewer than 50 data entries, you must also fill the rest of the empty yellow cells under X1, X2, and Y with zeros. Now, select your alpha level. By clicking <em>enter<\/em>, you will not only have all your estimated coefficients along with their t-values, etc., you will also be guided as to whether the model is significant both overall and individually. If the p-value associated with the F-value within the ANOVA table is not less than the selected alpha level, you will see a message indicating that your estimated model is not overall significant, and as a result, no values for C.I. and P.I. will be shown. 
By either changing the alpha level and\/or adding more accurate data, it is possible\u00a0to estimate a more significant multiple regression model.<\/p>\n<p><iframe loading=\"lazy\" src=\"https:\/\/onedrive.live.com\/embed?cid=0B27F889BFE551E2&amp;resid=B27F889BFE551E2%21196&amp;authkey=AMQIOCaItKS3dy8&amp;em=2&amp;wdAllowInteractivity=False&amp;AllowTyping=True&amp;wdHideGridlines=True&amp;wdHideHeaders=True\" width=\"100%\" height=\"600\"><\/iframe><br \/>\nFigure 8.8 Interactive Excel Template for Multiple Regression Model - see Appendix 8.<\/p>\n<p>One more point is about the format of your assumed multiple regression model. The associations between the dependent variable and the independent variables may not always be linear. In reality, you will face cases where such relationships are better captured by a nonlinear model. Without going into the details of such non-linear models, just to give you an idea, you can transform your selected data for X1, X2, and Y before estimating your model. For instance, one possible non-linear multiple regression model is one in which both the dependent and independent variables have been transformed to natural logarithms rather than levels. In order to estimate such a model within Figure 8.8, all you need to do is transform the data in all three columns, in a separate sheet, from levels to logarithms. To do this, simply use =LN(A1), where cell A1 contains the first observation of X1, and do the same for the other columns. Finally, simply cut and <em>special paste\/value<\/em> the transformed data into the yellow columns within the template. Now you have estimated a multiple regression model with both sides in a non-linear form\u00a0(i.e., log form).<\/p>\n<h2><strong>Predictions using the estimated simple regression<\/strong><\/h2>\n<p>If the estimated regression line fits the data well, the model can then be used for predictions. 
Using the above estimated simple regression model, we can predict the price of an apartment at a <em>given<\/em> distance from downtown. This is known as the prediction interval or P.I. Alternatively, we may predict the <em>mean price<\/em> of such apartments, also known as the confidence interval or C.I. for the mean value.<\/p>\n<p>To predict the price of an apartment that is six\u00a0kilometres away from downtown, we simply set <em>x<\/em>=6 and substitute it back into the estimated equation:<\/p>\n<p>[latex]y=71.84-5.38\\times 6 = \\$39.56[\/latex]<\/p>\n<p>You should pay attention to the scale of the data. In this case, the dependent variable is measured in $1000s. Therefore, the predicted value for an apartment six\u00a0kilometres from downtown is 39.56*1000=$39,560. This value is known as the\u00a0<em>point estimate<\/em> of the prediction and, by itself, is not very reliable, as we do not know how close it is to the true value in the population.<\/p>\n<p>A more reliable estimate can be constructed by setting up an <em>interval<\/em> around the point estimate. This can be done in two ways. We can estimate the expected value (mean) of <em>y<\/em>\u00a0for a given value of <em>x<\/em>, or we can predict the particular value of <em>y<\/em>\u00a0for that value of <em>x<\/em>. 
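<\/p>
<p>Before constructing the intervals, the point estimate itself is one line of arithmetic; a quick check in Python:<\/p>

```python
b0, b1 = 71.84, -5.38        # estimated intercept and slope (price in $1000s)
x_star = 6                   # distance from downtown, in km
y_hat = b0 + b1 * x_star     # point estimate of the price, in $1000s
print(round(y_hat, 2), round(y_hat * 1000))  # 39.56, i.e. $39,560
```

<p>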
For the expected value (mean) of <em>y<\/em>, we use the following formula for the interval:<\/p>\n<p><a href=\"http:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/03\/image21.png\"><img loading=\"lazy\" decoding=\"async\" src=\"http:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/03\/image21.png\" alt=\"image21\" class=\"size-full wp-image-359 alignnone\" height=\"36\" width=\"246\" srcset=\"https:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/03\/image21.png 246w, https:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/03\/image21-65x10.png 65w, https:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/03\/image21-225x33.png 225w\" sizes=\"auto, (max-width: 246px) 100vw, 246px\" \/><\/a><\/p>\n<p>where the standard error, S.E., of the estimate is calculated based on the following formula:<\/p>\n<p><a href=\"http:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/03\/image22.png\"><img loading=\"lazy\" decoding=\"async\" src=\"http:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/03\/image22.png\" alt=\"image22\" class=\"size-full wp-image-360 alignnone\" height=\"65\" width=\"334\" srcset=\"https:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/03\/image22.png 334w, https:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/03\/image22-300x58.png 300w, https:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/03\/image22-65x13.png 65w, https:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/03\/image22-225x44.png 225w\" sizes=\"auto, (max-width: 334px) 100vw, 334px\" \/><\/a><\/p>\n<p>In this equation, <em>x<sup>*<\/sup><\/em>\u00a0is the 
particular value of the independent variable, which in our case is 6, and\u00a0<em>s\u00a0<\/em>is the standard\u00a0error of the regression, calculated as:<\/p>\n<p><a href=\"http:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/03\/image23.png\"><img loading=\"lazy\" decoding=\"async\" src=\"http:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/03\/image23.png\" alt=\"image23\" class=\"size-full wp-image-361 alignnone\" height=\"65\" width=\"82\" srcset=\"https:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/03\/image23.png 82w, https:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/03\/image23-65x52.png 65w\" sizes=\"auto, (max-width: 82px) 100vw, 82px\" \/><\/a><\/p>\n<p>From the Excel printout for the simple regression model, this standard error is estimated as 7.02.<\/p>\n<p>The sum of squares of the independent variable,<\/p>\n<p><a href=\"http:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/03\/sum-of-Sq-of-indep2.png\"><img loading=\"lazy\" decoding=\"async\" src=\"http:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/03\/sum-of-Sq-of-indep2.png\" alt=\"sum of Sq of indep\" class=\"size-full wp-image-365 alignnone\" height=\"50\" width=\"133\" srcset=\"https:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/03\/sum-of-Sq-of-indep2.png 133w, https:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/03\/sum-of-Sq-of-indep2-65x24.png 65w\" sizes=\"auto, (max-width: 133px) 100vw, 133px\" \/><\/a><\/p>\n<p>can also be calculated as shown in Figure 8.9.<\/p>\n<figure id=\"attachment_366\" aria-describedby=\"caption-attachment-366\" style=\"width: 148px\" class=\"wp-caption aligncenter\"><a 
href=\"http:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/03\/image24.png\"><img loading=\"lazy\" decoding=\"async\" src=\"http:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/03\/image24.png\" alt=\"image24\" class=\"wp-image-366 size-full\" height=\"325\" width=\"148\" srcset=\"https:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/03\/image24.png 148w, https:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/03\/image24-137x300.png 137w, https:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/03\/image24-65x143.png 65w\" sizes=\"auto, (max-width: 148px) 100vw, 148px\" \/><\/a><figcaption id=\"caption-attachment-366\" class=\"wp-caption-text\">Figure 8.9<\/figcaption><\/figure>\n<p>All these calculated values can be substituted back into the formula for the S.E. of the prediction:<\/p>\n<p><a href=\"http:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/11\/SEofCI1.png\"><img loading=\"lazy\" decoding=\"async\" src=\"http:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/11\/SEofCI1.png\" alt=\"C.I.\" class=\"alignnone size-full wp-image-1242\" height=\"73\" width=\"327\" srcset=\"https:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/11\/SEofCI1.png 327w, https:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/11\/SEofCI1-300x67.png 300w, https:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/11\/SEofCI1-65x15.png 65w, https:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/11\/SEofCI1-225x50.png 225w\" sizes=\"auto, (max-width: 327px) 100vw, 327px\" \/><\/a><\/p>\n<p>Now that the S.E. 
of the estimate has been calculated, you can pick the cut-off point from the <em>t<\/em>-table. Given the degrees of freedom 12-2=10, the appropriate value from the <em>t<\/em>-table is 2.23. You use this information to calculate the <em>margin of error <\/em>as 6.52*2.23=14.54. Finally, construct the confidence interval for the mean price of apartments located six\u00a0kilometres away from downtown as:<\/p>\n<p><a href=\"http:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/07\/C.I._VALUES.png\"><img loading=\"lazy\" decoding=\"async\" src=\"http:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/07\/C.I._VALUES.png\" alt=\"C.I._VALUES\" class=\"alignnone size-full wp-image-862\" height=\"40\" width=\"103\" srcset=\"https:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/07\/C.I._VALUES.png 103w, https:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/07\/C.I._VALUES-65x25.png 65w\" sizes=\"auto, (max-width: 103px) 100vw, 103px\" \/><\/a><\/p>\n<p>This is a compact version of the confidence interval. 
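<\/p>
<p>The interval arithmetic above can be reproduced directly; because the inputs here are the rounded values quoted in the text, the computed limits may differ by a dollar or two from those shown with full precision.<\/p>

```python
y_hat = 39.56     # point estimate at x* = 6, in $1000s
se = 6.52         # standard error from the formula above (rounded)
t_crit = 2.23     # t cut-off for alpha = .05 with 10 df

margin = t_crit * se
lower, upper = y_hat - margin, y_hat + margin
print(round(margin, 2), round(lower, 2), round(upper, 2))
```

<p>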
For a\u00a0more general version of such a confidence interval\u00a0at any given level of significance <em>\u03b1<\/em>, we can write:<\/p>\n<p><a href=\"http:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/03\/Confidence-Interval1.png\"><img loading=\"lazy\" decoding=\"async\" src=\"http:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/03\/Confidence-Interval1.png\" alt=\"Confidence Interval\" class=\"size-full wp-image-372 alignnone\" height=\"36\" width=\"544\" srcset=\"https:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/03\/Confidence-Interval1.png 544w, https:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/03\/Confidence-Interval1-300x20.png 300w, https:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/03\/Confidence-Interval1-65x4.png 65w, https:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/03\/Confidence-Interval1-225x15.png 225w, https:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/03\/Confidence-Interval1-350x23.png 350w\" sizes=\"auto, (max-width: 544px) 100vw, 544px\" \/><\/a><\/p>\n<p>Intuitively, for say <em>\u03b1<\/em> = .05, we are\u00a095% confident that the\u00a0true parameter of the population will be within these two lower and upper limits:<\/p>\n<p><a href=\"http:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/07\/Confidence-Interval1.png\"><img loading=\"lazy\" decoding=\"async\" src=\"http:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/07\/Confidence-Interval1.png\" alt=\"Confidence Interval\" class=\"alignnone size-full wp-image-864\" height=\"40\" width=\"476\" 
srcset=\"https:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/07\/Confidence-Interval1.png 476w, https:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/07\/Confidence-Interval1-300x25.png 300w, https:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/07\/Confidence-Interval1-65x5.png 65w, https:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/07\/Confidence-Interval1-225x19.png 225w, https:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/07\/Confidence-Interval1-350x29.png 350w\" sizes=\"auto, (max-width: 476px) 100vw, 476px\" \/><\/a><\/p>\n<p>Based on\u00a0our simple regression model, which\u00a0includes only distance as a significant factor in predicting the price of an apartment, and for apartments six kilometres away from downtown, we are 95% confident that the true mean price of such apartments in Nelson, BC, is between $25,037 and $54,096, a width of $29,059. One should not be surprised at such a wide interval, given that the coefficient of determination of this model was only about 50%, and that we have selected a distance far away from the mean distance from downtown.\u00a0We can always improve these numbers by adding more explanatory variables to our simple regression model. Alternatively, we can restrict our predictions to distances closer to the mean distance in our data.<\/p>\n<p>Now we predict the <em>particular<\/em> value of <em>y<\/em>\u00a0for a given value of <em>x<\/em>,\u00a0the so-called prediction interval. The process of constructing the interval is very similar to the previous case, except we use a new formula for S.E. and, of course, we set up the interval around the same point estimate, 39.56.<\/p>\n<p><a href=\"http:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/11\/SEofPI.png\"><img loading=\"lazy\" decoding=\"async\" src=\"http:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/11\/SEofPI.png\" alt=\"P.I.\" class=\"size-full wp-image-1244\" height=\"70\" width=\"356\" srcset=\"https:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/11\/SEofPI.png 356w, https:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/11\/SEofPI-300x59.png 300w, https:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/11\/SEofPI-65x13.png 65w, https:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/11\/SEofPI-225x44.png 225w, https:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/11\/SEofPI-350x69.png 350w\" sizes=\"auto, (max-width: 356px) 100vw, 356px\" \/><\/a><\/p>\n<p>You should be very careful\u00a0to note\u00a0the difference between this formula and the one introduced earlier for the S.E. of\u00a0the mean of <em>y<\/em>\u00a0for a given value of <em>x<\/em>. They look\u00a0very\u00a0similar,\u00a0but\u00a0this formula comes with an extra 1\u00a0inside the radical! That extra 1 raises the S.E. from 6.52 to 9.58.<\/p>\n<p>The margin of error is then calculated as 2.23*9.58=21.36. 
We use this to set up directly the lower and upper limits of the estimates:<a href=\"http:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/07\/P.I._VALUES.png\"><br \/>\n<img loading=\"lazy\" decoding=\"async\" src=\"http:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/07\/P.I._VALUES.png\" alt=\"P.I._VALUES\" class=\"alignnone size-full wp-image-867\" height=\"40\" width=\"480\" srcset=\"https:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/07\/P.I._VALUES.png 480w, https:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/07\/P.I._VALUES-300x25.png 300w, https:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/07\/P.I._VALUES-65x5.png 65w, https:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/07\/P.I._VALUES-225x19.png 225w, https:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/07\/P.I._VALUES-350x29.png 350w\" sizes=\"auto, (max-width: 480px) 100vw, 480px\" \/><\/a><\/p>\n<p>Thus, for the price of a\u00a0<em>particular<\/em>\u00a0apartment located in Nelson, BC, six kilometres away from downtown, we are 95% confident that this price\u00a0will be between $18,200 and $60,920, a width of $42,720. Compared with the earlier width of the C.I., this prediction interval is clearly wider: while both intervals carry 95% confidence, there is more uncertainty about the price of a single apartment than about the mean price, and so the S.E. for predicting a particular value is always larger than the S.E. for estimating the mean.<\/p>\n<p>This process can be repeated for all different\u00a0levels of\u00a0<em>x<\/em> to calculate the associated confidence and prediction intervals. By doing this, we will have a range of lower and upper limits for both P.I.s and C.I.s. All these numbers can be reproduced within the interactive Excel template shown in Figure 8.8. 
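<\/p>
<p>The repeated interval calculations can be wrapped in one small function. This is only a sketch: <em>s<\/em> = 7.02 and <em>n<\/em> = 12 come from the example, while the mean of <em>x<\/em>, the sum of squares of <em>x<\/em>, and the other inputs shown below are hypothetical placeholders for your own data.<\/p>

```python
import math

def interval(y_hat, x_star, s, n, x_bar, ss_x, t_crit, particular=True):
    # particular=True  -> prediction interval for one y (extra 1 in the radical)
    # particular=False -> confidence interval for the mean of y
    extra = 1.0 if particular else 0.0
    se = s * math.sqrt(extra + 1.0 / n + (x_star - x_bar) ** 2 / ss_x)
    margin = t_crit * se
    return y_hat - margin, y_hat + margin

# Illustrative call; x_bar and ss_x here are made-up numbers
pi = interval(39.56, 6, 7.02, 12, 2.0, 30.0, 2.23, particular=True)
ci = interval(39.56, 6, 7.02, 12, 2.0, 30.0, 2.23, particular=False)
print(pi, ci)   # the P.I. is always wider than the C.I.
```

<p>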
If you use statistical software such as Minitab, you can directly plot a scatter diagram with all the P.I.s and C.I.s, as well as the estimated linear regression line, in one diagram. Figure 8.10 shows such a diagram from Minitab for our\u00a0example.<\/p>\n<figure id=\"attachment_377\" aria-describedby=\"caption-attachment-377\" style=\"width: 666px\" class=\"wp-caption aligncenter\"><a href=\"http:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/03\/image351.png\"><img loading=\"lazy\" decoding=\"async\" src=\"http:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/03\/image351.png\" alt=\"image35\" class=\"wp-image-377\" height=\"444\" width=\"666\" srcset=\"https:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/03\/image351.png 900w, https:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/03\/image351-300x200.png 300w, https:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/03\/image351-65x43.png 65w, https:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/03\/image351-225x150.png 225w, https:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/03\/image351-350x233.png 350w\" sizes=\"auto, (max-width: 666px) 100vw, 666px\" \/><\/a><figcaption id=\"caption-attachment-377\" class=\"wp-caption-text\">Figure 8.10 Minitab Plot for C.I. and P.I.<\/figcaption><\/figure>\n<p>Figure 8.10 indicates that a more reliable prediction can\u00a0be made as close as possible to the mean of our observations for\u00a0<i>x<\/i>. In this graph, the widths of both intervals are at their lowest\u00a0levels near the means of <em>x<\/em> and <em>y<\/em>.<\/p>\n<p>You should be careful to note that Figure 8.10 provides the predicted intervals only for the case of a simple regression model. 
For the multiple regression model, you may use other statistical software packages, such as SAS, SPSS, etc., to estimate both the P.I. and C.I. For instance, by selecting <em>x<sub>1<\/sub><\/em>=3 and <em>x<sub>2<\/sub><\/em>=300 and entering these values into Minitab, you will\u00a0see the results shown in Figure 8.11. Alternatively, you may use the interactive Excel template provided in Figure 8.8 to estimate your multiple regression model, and to check the significance of the estimated parameters. This template can also be used to construct both the P.I. and C.I. for the given values of <em>x<sub>1<\/sub><\/em>=3 and <em>x<sub>2<\/sub><\/em>=300, or any other values of your choice. Furthermore, this template enables you to test whether the estimated multiple regression model is overall significant. When the estimated multiple regression model is not overall significant, this template will not provide the P.I. and C.I. To practice this case, you may want to fill the yellow columns of <em>x<sub>1<\/sub><\/em> and <em>x<sub>2<\/sub><\/em> with random numbers that are not correlated with the dependent variable. 
If the estimated model is not overall significant, no prediction values will be provided.<\/p>\n<figure id=\"attachment_402\" aria-describedby=\"caption-attachment-402\" style=\"width: 417px\" class=\"wp-caption aligncenter\"><a href=\"http:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/03\/prediction-intervals-multiple-regression1.png\"><img loading=\"lazy\" decoding=\"async\" src=\"http:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/03\/prediction-intervals-multiple-regression1.png\" alt=\"prediction intervals multiple regression\" class=\"wp-image-402 size-full\" height=\"207\" width=\"417\" srcset=\"https:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/03\/prediction-intervals-multiple-regression1.png 417w, https:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/03\/prediction-intervals-multiple-regression1-300x149.png 300w, https:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/03\/prediction-intervals-multiple-regression1-65x32.png 65w, https:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/03\/prediction-intervals-multiple-regression1-225x112.png 225w, https:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/03\/prediction-intervals-multiple-regression1-350x174.png 350w\" sizes=\"auto, (max-width: 417px) 100vw, 417px\" \/><\/a><figcaption id=\"caption-attachment-402\" class=\"wp-caption-text\">Figure 8.11<\/figcaption><\/figure>\n<p>The 95% C.I. and P.I. figures in the brackets are the lower and upper limits of the intervals, given the specific values for distance and size of apartment. 
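<\/p>
<p>As a quick check, the fitted value itself can be recovered by evaluating the estimated equation from Figure 8.7 at these two values; the rounded coefficients give a result close to, but not exactly matching, the software\u2019s full-precision output.<\/p>

```python
# Coefficients from the estimated multiple regression (Figure 8.7)
b0, b_distance, b_area = 60.041, -5.393, 0.03

distance, area = 3, 300                              # km from downtown, square feet
price = b0 + b_distance * distance + b_area * area   # fitted price, in $1000s
print(round(price, 3))                               # about 52.862, i.e. roughly $52,862
```

<p>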
The fitted value of the price of the apartment, as well as the standard error of this value, is also reported.<\/p>\n<p>We have just given you some rough ideas about how the basic regression calculations are done. We have purposely left out the other steps needed to calculate more detailed regression results without a computer, for you will never compute a regression without a computer (or a high-end calculator) in all of your working years. However, by working with these interactive templates, you will have a much better chance to play around with any data, to see how the outcomes can be altered, and to observe their implications for real-world business decision-making.<\/p>\n<h1>Correlation and covariance<\/h1>\n<p>The correlation between two variables is important in statistics, and it is commonly reported. What is correlation? The meaning of correlation can be discovered by looking closely at the word\u2014it is almost co-relation, and that is what it means: how two variables are co-related. Correlation is also closely related to regression. The covariance between two variables is also important in statistics, but it is seldom reported. Its meaning can also be discovered by looking closely at the word\u2014it is co-variance, how two variables vary together. Covariance plays a behind-the-scenes role in multivariate statistics. Though you will not see covariance reported very often, understanding it will help you understand multivariate statistics, just as understanding variance helps you understand univariate statistics.<\/p>\n<p>There are two ways to look at correlation. The first flows directly from regression and the second from covariance. Since you just learned about regression, it makes sense to start with that approach.<\/p>\n<p>Correlation is measured with a number between -1 and +1 called the correlation coefficient. 
The population correlation coefficient is usually written as the Greek <strong>rho<\/strong>, <em>\u03c1<\/em>, and the sample correlation coefficient as <em>r<\/em>. If you have a linear regression equation with only one explanatory variable, the sign of the correlation coefficient shows whether the slope of the regression line is positive or negative, while the absolute value of the coefficient shows how close to the regression line the points lie. If <em>\u03c1<\/em> is +.95, then the regression line has a positive slope and the points in the population are very close to the regression line. If <em>r<\/em> is -.13 then the regression line has a negative slope and the points in the sample are scattered far from the regression line. If you square <em>r<\/em>, you will get <em>R<sup>2<\/sup><\/em>, which is higher if the points in the sample lie very close to the regression line so that the sum of squares regression is close to the sum of squares total.<\/p>\n<p>The other approach to explaining correlation requires understanding covariance, how two variables vary together. Because covariance is a multivariate statistic, it measures something about a sample or population of observations where each observation has two or more variables. Think of a population of (<em>x<\/em>,<em>y<\/em>) pairs. First find the mean of the <em>x<\/em>\u2019s and the mean of the <em>y<\/em>\u2019s, <em>\u03bc<sub>x<\/sub><\/em> and <em>\u03bc<sub>y<\/sub><\/em>. Then for each observation, find (<em>x<\/em> - <em>\u03bc<sub>x<\/sub><\/em>)(<em>y<\/em> - <em>\u03bc<sub>y<\/sub><\/em>). If the <em>x<\/em> and the <em>y<\/em> in this observation are both far above their means, then this number will be large and positive. If both are far below their means, it will also be large and positive. 
If you found \u03a3(<em>x<\/em> - <em>\u03bc<sub>x<\/sub><\/em>)(<em>y<\/em> - <em>\u03bc<sub>y<\/sub><\/em>), it would be large and positive if <em>x<\/em> and <em>y<\/em> move up and down together, so that large <em>x<\/em>\u2019s go with large <em>y<\/em>\u2019s, small <em>x<\/em>\u2019s go with small <em>y<\/em>\u2019s, and medium <em>x<\/em>\u2019s go with medium <em>y<\/em>\u2019s. However, if some of the large <em>x<\/em>\u2019s go with medium <em>y<\/em>\u2019s, and so on, then the sum will be smaller, though probably still positive. A large, positive \u03a3(<em>x<\/em> - <em>\u03bc<sub>x<\/sub><\/em>)(<em>y<\/em> - <em>\u03bc<sub>y<\/sub><\/em>) implies that <em>x<\/em>\u2019s above <em>\u03bc<sub>x<\/sub><\/em> are generally paired with <em>y<\/em>\u2019s above <em>\u03bc<sub>y<\/sub><\/em>, and that <em>x<\/em>\u2019s below their mean are generally paired with <em>y<\/em>\u2019s below their mean. As you can see, the sum is a measure of how <em>x<\/em> and <em>y<\/em> vary together. The more often similar <em>x<\/em>\u2019s are paired with similar <em>y<\/em>\u2019s, the more <em>x<\/em> and <em>y<\/em> vary together and the larger the sum and the covariance. The term for a single observation, (<em>x<\/em> - <em>\u03bc<sub>x<\/sub><\/em>)(<em>y<\/em> - <em>\u03bc<sub>y<\/sub><\/em>), will be negative when <em>x<\/em> and <em>y<\/em> are on opposite sides of their means. If large <em>x<\/em>\u2019s are usually paired with small <em>y<\/em>\u2019s, and vice versa, most of the terms will be negative and the sum will be negative. If the largest <em>x<\/em>\u2019s are paired with the smallest <em>y<\/em>\u2019s and the smallest <em>x<\/em>\u2019s with the largest <em>y<\/em>\u2019s, then many of the (<em>x<\/em> - <em>\u03bc<sub>x<\/sub><\/em>)(<em>y<\/em> - <em>\u03bc<sub>y<\/sub><\/em>) terms will be large and negative, and so will the sum. 
A population with more members will have a larger sum simply because there are more terms to be added together, so you divide the sum by the number of observations to get the final measure, the covariance, or cov:<\/p>\n<p><a href=\"http:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2014\/10\/10000000000000DF0000002E78028BB81.png\"><img loading=\"lazy\" decoding=\"async\" src=\"http:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2014\/10\/10000000000000DF0000002E78028BB81.png\" alt=\"Population covariance\" class=\"wp-image-117 size-full alignnone\" height=\"46\" width=\"223\" srcset=\"https:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2014\/10\/10000000000000DF0000002E78028BB81.png 223w, https:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2014\/10\/10000000000000DF0000002E78028BB81-65x13.png 65w\" sizes=\"auto, (max-width: 223px) 100vw, 223px\" \/><\/a><\/p>\n<p>The maximum for the covariance is the product of the standard deviations of the <em>x<\/em> values and the <em>y<\/em> values, <em>\u03c3<sub>x<\/sub><\/em><em>\u03c3<sub>y<\/sub><\/em>. While proving that the maximum is exactly equal to the product of the standard deviations is complicated, you should be able to see that the more spread out the points are, the greater the covariance can be. 
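<\/p>
<p>A short sketch (again with hypothetical data) computes the population covariance exactly as the formula above describes, and checks it against its upper bound:<\/p>

```python
import math

# Hypothetical population of (x, y) pairs.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 6, 8]
n = len(xs)

mu_x, mu_y = sum(xs) / n, sum(ys) / n

# Population covariance: the mean cross-product of deviations.
cov = sum((x - mu_x) * (y - mu_y) for x, y in zip(xs, ys)) / n   # 2.8

# Population standard deviations.
sd_x = math.sqrt(sum((x - mu_x) ** 2 for x in xs) / n)
sd_y = math.sqrt(sum((y - mu_y) ** 2 for y in ys) / n)

# |cov| can never exceed the product of the standard deviations.
assert abs(cov) <= sd_x * sd_y
```

<p>Here the bound is nearly attained (2.8 versus about 2.83) because these points lie close to a straight line.<\/p>
<p>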
By now you should understand that a larger standard deviation means that the points are more spread out, so you should understand that a larger <em>\u03c3<sub>x<\/sub><\/em> or a larger <em>\u03c3<sub>y<\/sub><\/em> will allow for a greater covariance.<\/p>\n<p>Sample covariance is measured similarly, except the sum is divided by <em>n<\/em>-1 so that sample covariance is an unbiased estimator of population covariance:<\/p>\n<p>[latex]sample \\ cov= \\frac{\\sum{(x-\\bar{x})(y-\\bar{y})}}{(n-1)}[\/latex]<\/p>\n<p>Correlation simply compares the covariance to the standard deviations of the two variables. Using the formula for population correlation:<\/p>\n<p><a href=\"http:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/07\/Screen-Shot-2015-07-29-at-3.12.39-PM.png\"><img loading=\"lazy\" decoding=\"async\" src=\"http:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/07\/Screen-Shot-2015-07-29-at-3.12.39-PM.png\" class=\"alignnone wp-image-943\" height=\"71\" width=\"92\" alt=\"image\" srcset=\"https:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/07\/Screen-Shot-2015-07-29-at-3.12.39-PM.png 132w, https:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/07\/Screen-Shot-2015-07-29-at-3.12.39-PM-65x50.png 65w\" sizes=\"auto, (max-width: 92px) 100vw, 92px\" \/><\/a><\/p>\n<p>or<br \/>\n<a href=\"http:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/07\/Screen-Shot-2015-07-29-at-3.10.09-PM.png\"><img loading=\"lazy\" decoding=\"async\" src=\"http:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/07\/Screen-Shot-2015-07-29-at-3.10.09-PM-300x71.png\" alt=\"Screen Shot 2015-07-29 at 3.10.09 PM\" class=\"alignnone wp-image-940\" height=\"74\" width=\"313\" 
srcset=\"https:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/07\/Screen-Shot-2015-07-29-at-3.10.09-PM-300x71.png 300w, https:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/07\/Screen-Shot-2015-07-29-at-3.10.09-PM-65x15.png 65w, https:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/07\/Screen-Shot-2015-07-29-at-3.10.09-PM-225x53.png 225w, https:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/07\/Screen-Shot-2015-07-29-at-3.10.09-PM-350x83.png 350w, https:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/07\/Screen-Shot-2015-07-29-at-3.10.09-PM.png 490w\" sizes=\"auto, (max-width: 313px) 100vw, 313px\" \/><\/a><\/p>\n<p>At its maximum, the absolute value of the covariance equals the product of the standard deviations, so at its maximum, the absolute value of <em>r<\/em> will be 1. Since the covariance can be negative or positive while standard deviations are always positive, <em>r<\/em> can be either negative or positive. Putting these two facts together, you can see that <em>r<\/em> will be between -1 and +1. The sign depends on the sign of the covariance and the absolute value depends on how close the covariance is to its maximum. The covariance rises as the relationship between <em>x<\/em> and <em>y<\/em> grows stronger, so a strong relationship between <em>x<\/em> and <em>y<\/em> will result in <em>r<\/em> having a value close to -1 or +1.<\/p>\n<h1>Covariance, correlation, and regression<\/h1>\n<p>Now it is time to think about how all of this fits together and to see how the two approaches to correlation are related. Start by assuming that you have a population of (<em>x<\/em>, <em>y<\/em>) which covers a wide range of <em>y<\/em>-values, but only a narrow range of <em>x<\/em>-values. 
This means that <em>\u03c3<sub>y<\/sub><\/em> is large while <em>\u03c3<sub>x<\/sub><\/em> is small. Assume that you graph the (<em>x<\/em>, <em>y<\/em>) points and find that they all lie in a narrow band stretched linearly from bottom left to top right, so that the largest <em>y<\/em>\u2019s are paired with the largest <em>x<\/em>\u2019s and the smallest <em>y<\/em>\u2019s with the smallest <em>x<\/em>\u2019s. This means both that the covariance is large and that a regression line coming very close to almost all of the points is easily drawn. The correlation coefficient will also be very high (close to +1). An example will show why all of these happen together.<\/p>\n<p>Imagine that the equation for the regression line is <em>y<\/em> = 3 + 4<em>x<\/em>, <em>\u03bc<sub>y<\/sub><\/em> = 31, and <em>\u03bc<sub>x<\/sub><\/em> = 7, and that the two points farthest to the top right, (10, 43) and (12, 51), lie exactly on the regression line. These two points together contribute \u03a3(<em>x<\/em>-<em>\u03bc<sub>x<\/sub><\/em>)(<em>y<\/em>-<em>\u03bc<sub>y<\/sub><\/em>) = (10-7)(43-31) + (12-7)(51-31) = 136 to the numerator of the covariance. If we swapped the <em>y<\/em>\u2019s of these two points, moving them off the regression line so that they became (10, 51) and (12, 43), then <em>\u03bc<sub>x<\/sub><\/em>, <em>\u03bc<sub>y<\/sub><\/em>, <em>\u03c3<sub>x<\/sub><\/em>, and <em>\u03c3<sub>y<\/sub><\/em> would remain the same, but these points would contribute only (10-7)(51-31) + (12-7)(43-31) = 120 to the numerator. As you can see, covariance is at its greatest, given the distributions of the <em>x<\/em>\u2019s and <em>y<\/em>\u2019s, when the (<em>x<\/em>, <em>y<\/em>) points lie on a straight line. Given that the correlation, <em>r<\/em>, equals 1 when the covariance is maximized, you can see that <em>r<\/em> = +1 when the points lie exactly on a straight line (with a positive slope). 
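<\/p>
<p>The arithmetic in this example is easy to verify; a minimal sketch:<\/p>

```python
mu_x, mu_y = 7, 31

def contribution(points):
    # Sum of (x - mu_x)(y - mu_y) over the given points.
    return sum((x - mu_x) * (y - mu_y) for x, y in points)

on_line = [(10, 43), (12, 51)]   # both lie exactly on y = 3 + 4x
swapped = [(10, 51), (12, 43)]   # same x's and y's, moved off the line

print(contribution(on_line))   # 136
print(contribution(swapped))   # 120
```

<p>Pairing the same <em>x<\/em>\u2019s and <em>y<\/em>\u2019s off the line shrinks the sum, just as described above.<\/p>
<p>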
The closer the points lie to a straight line, the closer the covariance is to its maximum, and the greater the correlation.<\/p>\n<p>As the example in Figure 8.12 shows, the closer the points lie to a straight line, the higher the correlation. Regression finds the straight line that comes as close to the points as possible, so it should not be surprising that correlation and regression are related. One of the ways the <strong>goodness of fit<\/strong> of a regression line can be measured is by <em>R<sup>2<\/sup><\/em>. For the simple two-variable case, <em>R<sup>2<\/sup><\/em> is simply the correlation coefficient\u00a0<em>r<\/em>, squared.<\/p>\n<figure id=\"attachment_277\" aria-describedby=\"caption-attachment-277\" style=\"width: 655px\" class=\"wp-caption aligncenter\"><a href=\"http:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/03\/Screen-Shot-2015-03-19-at-3.12.14-PM.png\"><img loading=\"lazy\" decoding=\"async\" src=\"http:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/03\/Screen-Shot-2015-03-19-at-3.12.14-PM-1024x618.png\" alt=\"Screen Shot 2015-03-19 at 3.12.14 PM\" class=\"wp-image-277\" height=\"395\" width=\"655\" srcset=\"https:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/03\/Screen-Shot-2015-03-19-at-3.12.14-PM-1024x618.png 1024w, https:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/03\/Screen-Shot-2015-03-19-at-3.12.14-PM-300x181.png 300w, https:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/03\/Screen-Shot-2015-03-19-at-3.12.14-PM-65x39.png 65w, https:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/03\/Screen-Shot-2015-03-19-at-3.12.14-PM-225x136.png 225w, 
https:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/03\/Screen-Shot-2015-03-19-at-3.12.14-PM-350x211.png 350w, https:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-content\/uploads\/sites\/45\/2015\/03\/Screen-Shot-2015-03-19-at-3.12.14-PM.png 1074w\" sizes=\"auto, (max-width: 655px) 100vw, 655px\" \/><\/a><figcaption id=\"caption-attachment-277\" class=\"wp-caption-text\">Figure 8.12 Plot of Initial Population<\/figcaption><\/figure>\n<p>Correlation does not tell us anything about how steep or flat the regression line is, though it does tell us whether the slope is positive or negative. If we took the initial population shown in Figure 8.12 and stretched it both left and right horizontally so that each point\u2019s <em>x<\/em>-value changed but its <em>y<\/em>-value stayed the same, <em>\u03c3<sub>x<\/sub><\/em> would grow while <em>\u03c3<sub>y<\/sub><\/em> stayed the same. If you pulled equally to the right and to the left, both <em>\u03bc<sub>x<\/sub><\/em> and <em>\u03bc<sub>y<\/sub><\/em> would stay the same. The covariance would certainly grow, since the (<em>x<\/em>-<em>\u03bc<sub>x<\/sub><\/em>) that goes with each point would be larger in absolute value while the (<em>y<\/em>-<em>\u03bc<sub>y<\/sub><\/em>)\u2019s would stay the same. The equation of the regression line would change, with the slope <em>b<\/em> becoming smaller, but the correlation coefficient would be the same because the points would be just as close to the regression line as before. Once again, notice that correlation tells you how well the line fits the points, but it does not tell you anything about the slope other than whether it is positive or negative. If the points are stretched out horizontally, the slope changes but the correlation does not. 
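<\/p>
<p>A sketch with hypothetical data makes the stretching argument concrete: doubling each point\u2019s horizontal distance from the mean doubles the covariance and <em>\u03c3<sub>x<\/sub><\/em>, halves the slope, and leaves the correlation unchanged.<\/p>

```python
import math

def describe(xs, ys):
    # Return (covariance, regression slope, correlation) for a population.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    var_x = sum((x - mx) ** 2 for x in xs) / n
    sd_x = math.sqrt(var_x)
    sd_y = math.sqrt(sum((y - my) ** 2 for y in ys) / n)
    return cov, cov / var_x, cov / (sd_x * sd_y)

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 6, 8]

# Stretch horizontally: double each x's distance from the mean,
# leaving every y-value alone.
mx = sum(xs) / len(xs)
stretched = [mx + 2 * (x - mx) for x in xs]

cov1, b1, r1 = describe(xs, ys)
cov2, b2, r2 = describe(stretched, ys)
# cov2 is twice cov1, b2 is half of b1, and r2 equals r1.
```

<p>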
Also notice that though the covariance increases, the correlation does not, because <em>\u03c3<sub>x<\/sub><\/em> increases, causing the denominator in the equation for finding <em>r<\/em> to increase just as much as the covariance, the numerator.<\/p>\n<p>The regression line and covariance approaches to understanding correlation are obviously related. If the points in the population lie very close to the regression line, the covariance will be large in absolute value since the <em>x<\/em>\u2019s that are far from their mean will be paired with <em>y<\/em>\u2019s that\u00a0are far from theirs. A positive regression slope means that <em>x<\/em> and <em>y<\/em> rise and fall together, which also means that the covariance will be positive. A negative regression slope means that <em>x<\/em> and <em>y<\/em> move in opposite directions, which means a negative covariance.<\/p>\n<h1>Summary<\/h1>\n<p>Linear regression allows researchers to estimate the parameters \u2014 the intercept and slopes \u2014 of linear equations connecting two or more variables. Knowing that a dependent variable is functionally related to one or more independent or explanatory variables, and having an estimate of the parameters of that function, greatly improves the ability of a researcher to predict the values the dependent variable will take under many conditions. Being able to estimate the effect that one independent variable has on the value of the dependent variable in isolation from changes in other independent variables can be a powerful aid in decision-making and policy design. Being able to test the existence of individual effects of a number of independent variables helps decision-makers, researchers, and policy-makers identify which variables are most important. Regression is a very powerful statistical tool in many ways.<\/p>\n<p>The idea behind regression is simple: it is just the equation of the line that comes as close as possible to as many of the points as possible. 
The mathematics of regression are not so simple, however. Instead of trying to learn the math, most researchers use computers to find regression equations, so this chapter stressed reading computer printouts rather than the mathematics of regression.<\/p>\n<p>Two other topics, which are related to each other and to regression, were also covered: correlation and covariance.<\/p>\n<p>Something as powerful as linear regression must have limitations and problems. There is a whole subject, econometrics, which deals with identifying and overcoming the limitations and problems of regression.<\/p>\n<hr class=\"before-footnotes clear\" \/><div class=\"footnotes\"><ol><li id=\"footnote-118-1\">Thomas, G.B. (1960). <em>Calculus and analytical geometry<\/em> (3rd ed.). Boston, MA: Addison-Wesley. <a href=\"#return-footnote-118-1\" class=\"return-footnote\" aria-label=\"Return to footnote 1\">&crarr;<\/a><\/li><\/ol><\/div>","protected":false},"author":17,"menu_order":8,"template":"","meta":{"pb_show_title":"on","pb_short_title":"","pb_subtitle":"","pb_authors":[],"pb_section_license":""},"chapter-type":[],"contributor":[],"license":[],"class_list":["post-118","chapter","type-chapter","status-publish","hentry"],"part":3,"_links":{"self":[{"href":"https:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-json\/pressbooks\/v2\/chapters\/118","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-json\/pressbooks\/v2\/chapters"}],"about":[{"href":"https:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-json\/wp\/v2\/types\/chapter"}],"author":[{"embeddable":true,"href":"https:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-json\/wp\/v2\/users\/17"}],"version-history":[{"count":29,"href":"https:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-json\/pressbooks\/v2\/chapters\/118\/revisions"}],"predecessor-version":[{"id":1515,"href":"https:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-json\/pressbo
oks\/v2\/chapters\/118\/revisions\/1515"}],"part":[{"href":"https:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-json\/pressbooks\/v2\/parts\/3"}],"metadata":[{"href":"https:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-json\/pressbooks\/v2\/chapters\/118\/metadata\/"}],"wp:attachment":[{"href":"https:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-json\/wp\/v2\/media?parent=118"}],"wp:term":[{"taxonomy":"chapter-type","embeddable":true,"href":"https:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-json\/pressbooks\/v2\/chapter-type?post=118"},{"taxonomy":"contributor","embeddable":true,"href":"https:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-json\/wp\/v2\/contributor?post=118"},{"taxonomy":"license","embeddable":true,"href":"https:\/\/opentextbc.ca\/introductorybusinessstatistics\/wp-json\/wp\/v2\/license?post=118"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}