ARTICLE

Dominik M. Marciniak, Andrzej Dryś, Janusz Pluta

Data Analysis. Part 2: Linear regression and correlation
2008-04-22

Linear regression attempts to model the relationship between two variables by fitting a linear equation to observed data. One variable is considered to be an explanatory variable, and the other is considered to be a dependent variable. For example, a modeler might want to relate the weights of individuals to their heights using a linear regression model. The most common method for fitting a regression line is the method of least squares. This method calculates the best-fitting line for the observed data by minimizing the sum of the squares of the vertical deviations from each data point to the line (if a point lies exactly on the fitted line, its vertical deviation is 0). Because the deviations are first squared and then summed, positive and negative values do not cancel out.

The strength of the linear association between two variables is quantified by Pearson's correlation coefficient. The correlation coefficient always takes a value between -1 and 1, with 1 or -1 indicating perfect correlation (all points then lie along a straight line). A positive correlation indicates a positive association between the variables (increasing values in one variable correspond to increasing values in the other), while a negative correlation indicates a negative association (increasing values in one variable correspond to decreasing values in the other). A correlation value close to 0 indicates no linear association between the variables.

Pearson's correlation coefficient is a parametric statistic, and when distributions are not normal it may be less useful than non-parametric correlation methods such as Spearman's ρ and Kendall's τ. These are somewhat less powerful than parametric methods when the assumptions underlying the latter are met, but are less likely to give distorted results when those assumptions fail. Spearman's rank correlation provides a distribution-free test of independence between two variables. It is, however, insensitive to some types of dependence. Kendall's rank correlation gives a better measure of correlation and is also a better two-sided test for independence.
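As an illustration of the least-squares fit described above, the following Python sketch computes the slope and intercept of the regression line directly from the closed-form formulas; the height and weight values are invented for the example, not taken from the article.

import numpy as np

# Illustrative data: heights (explanatory variable) and weights
# (dependent variable).
heights = np.array([150.0, 160.0, 165.0, 170.0, 175.0, 180.0, 185.0])
weights = np.array([52.0, 58.0, 63.0, 66.0, 72.0, 75.0, 82.0])

# Least-squares estimates: the slope b and intercept a minimize the
# sum of squared vertical deviations sum((y_i - (a + b*x_i))**2).
x_mean, y_mean = heights.mean(), weights.mean()
b = np.sum((heights - x_mean) * (weights - y_mean)) / np.sum((heights - x_mean) ** 2)
a = y_mean - b * x_mean

print(f"fitted line: weight = {a:.2f} + {b:.2f} * height")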
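Pearson's coefficient can likewise be computed directly from its definition, the covariance of the two variables divided by the product of their standard deviations; the pearson_r function below and its sample data are illustrative only.

import numpy as np

def pearson_r(x, y):
    """Pearson's correlation coefficient: covariance of x and y
    divided by the product of their standard deviations; the
    result always lies in [-1, 1]."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    xc, yc = x - x.mean(), y - y.mean()
    return np.sum(xc * yc) / np.sqrt(np.sum(xc ** 2) * np.sum(yc ** 2))

print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))   #  1.0 (perfect positive correlation)
print(pearson_r([1, 2, 3, 4], [8, 6, 4, 2]))   # -1.0 (perfect negative correlation)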
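For the rank-based coefficients, a short sketch using SciPy's standard spearmanr and kendalltau functions shows how Spearman's ρ and Kendall's τ report a perfect monotonic association on data where Pearson's r does not, because the relationship is monotonic but not linear; the data are again synthetic.

import numpy as np
from scipy import stats

# Monotonic but non-linear relationship: y = exp(x).
x = np.arange(1.0, 11.0)
y = np.exp(x)

# Pearson measures linear association only; Spearman's rho and
# Kendall's tau operate on ranks, so both detect the perfect
# monotonic association.
r, _ = stats.pearsonr(x, y)
rho, _ = stats.spearmanr(x, y)
tau, _ = stats.kendalltau(x, y)
print(f"Pearson r     = {r:.3f}")    # well below 1: relationship is not linear
print(f"Spearman rho  = {rho:.3f}")  # 1.0
print(f"Kendall tau   = {tau:.3f}")  # 1.0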