Regression by the method of least squares: examples of problem solving
Linear regression is widely used in econometrics because its parameters admit a clear economic interpretation.
Linear regression amounts to finding an equation of the form

ŷ = a + bx.

An equation of this type allows one, for given values of the factor x, to obtain theoretical values of the dependent variable by substituting the actual values of x into it.
Building a linear regression comes down to estimating its parameters a and b. The linear regression parameter estimates can be found by different methods.
The classical approach to estimating linear regression parameters is based on the method of least squares (OLS).
OLS yields parameter estimates a and b for which the sum of squared deviations of the actual values of the dependent variable y from the calculated (theoretical) values ŷ is minimal:

∑(y − ŷ)² → min.
To find the minimum of a function, it is necessary to calculate the partial derivatives with respect to each of the parameters a and b and equate them to zero.
Denote this sum by S; then:

S = ∑(y − a − bx)² → min.

Transforming the expressions for the partial derivatives, we obtain the following system of normal equations for estimating the parameters a and b:

na + b∑x = ∑y,
a∑x + b∑x² = ∑xy.

Solving this system of normal equations either by successive elimination of variables or by the method of determinants, we find the desired estimates of the parameters a and b.
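As an illustration, the system of normal equations can be solved directly in code. This is a minimal sketch: the data and the function name `ols_line` are our own, not from the text.

```python
def ols_line(xs, ys):
    """Estimate (a, b) in y = a + b*x from the normal equations:
       n*a + b*sum(x)        = sum(y)
       a*sum(x) + b*sum(x^2) = sum(x*y)
    """
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    # Solve the 2x2 system by determinants (Cramer's rule)
    b = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    a = (sy - b * sx) / n
    return a, b

# Hypothetical data lying exactly on y = 1 + 2x
a, b = ols_line([0, 1, 2, 3], [1, 3, 5, 7])
print(a, b)  # → 1.0 2.0
```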
The parameter b is called the regression coefficient. Its value shows the average change in the result when the factor changes by one unit.
The regression equation is always supplemented with an indicator of the closeness of the relationship. With linear regression, this indicator is the linear correlation coefficient r. There are various modifications of the formula for the linear correlation coefficient.
As is well known, the linear correlation coefficient lies within the limits −1 ≤ r ≤ 1.
To assess the quality of fit of the linear function, the square of the linear correlation coefficient, called the coefficient of determination r², is calculated. The coefficient of determination characterizes the proportion of the variance of the dependent variable y that is explained by the regression in the total variance of y.
Accordingly, the value 1 − r² characterizes the proportion of the variance of y caused by the influence of other factors not taken into account in the model.
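For completeness, a small sketch of computing r and r² from data (the values and the name `linear_r` are hypothetical, not from the text):

```python
from math import sqrt

def linear_r(xs, ys):
    """Linear correlation coefficient r = cov(x, y) / (sigma_x * sigma_y)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    var_x = sum((x - mx) ** 2 for x in xs) / n
    var_y = sum((y - my) ** 2 for y in ys) / n
    return cov / sqrt(var_x * var_y)

r = linear_r([1, 2, 3, 4], [2.1, 3.9, 6.2, 7.8])
r2 = r ** 2   # coefficient of determination: share of variance explained
# 1 - r2 is the share of the variance of y due to factors outside the model
```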
Questions for self-control
1. What is the essence of the method of least squares?
2. How many variables does pairwise regression involve?
3. What coefficient determines the closeness of the relationship between the changes?
4. Within what limits does the coefficient of determination lie?
5. How is the parameter b estimated in correlation-regression analysis?
Nonlinear economic models. Nonlinear regression models. Transformation of variables.
Nonlinear economic models.
Transformation of variables.
The elasticity coefficient.
If there are non-linear relationships between economic phenomena, they are expressed using the corresponding non-linear functions: for example, the equilateral hyperbola y = a + b/x, the second-degree parabola y = a + bx + cx², etc.
There are two classes of non-linear regressions:
1. Regressions that are non-linear with respect to the explanatory variables included in the analysis, but linear with respect to the estimated parameters, for example:
Polynomials of various degrees: y = a + bx + cx², y = a + bx + cx² + dx³;
The equilateral hyperbola: y = a + b/x;
The semilogarithmic function: y = a + b·ln x.
2. Regressions that are non-linear in the estimated parameters, for example:
Power: y = a·xᵇ;
Exponential: y = a·bˣ;
Exponential in e: y = e^(a + bx).
The total sum of squared deviations of the individual values of the dependent variable y from the mean value ȳ is caused by the influence of many factors. We conditionally divide the entire set of causes into two groups: the studied factor x and all other factors.
If the factor does not affect the result, then the regression line on the graph is parallel to the x-axis and ŷ = ȳ.
Then the entire variance of the dependent variable is due to the influence of other factors, and the total sum of squared deviations coincides with the residual sum. If the other factors do not affect the result, then y is related to x functionally, and the residual sum of squares is zero. In this case, the sum of squared deviations explained by the regression coincides with the total sum of squares.
Since not all points of the correlation field lie on the regression line, their scatter always occurs both due to the influence of the factor x (i.e. the regression of y on x) and due to the action of other causes (unexplained variation). The suitability of the regression line for forecasting depends on what part of the total variation of y is accounted for by the explained variation.
Obviously, if the sum of squared deviations due to the regression is greater than the residual sum of squares, then the regression equation is statistically significant and the factor x has a significant impact on the result y.
The significance of the regression equation as a whole is assessed with the help of Fisher's F-test. A null hypothesis is put forward that the regression coefficient is equal to zero, i.e. b = 0, and hence the factor x does not affect the result y.
The direct calculation of the F-test is preceded by an analysis of variance. Central to it is the decomposition of the total sum of squared deviations of the variable y from the mean value ȳ into two parts, the "explained" and the "unexplained":

∑(y − ȳ)² = ∑(ŷ − ȳ)² + ∑(y − ŷ)²,

where ∑(y − ȳ)² is the total sum of squared deviations, ∑(ŷ − ȳ)² is the sum of squared deviations explained by the regression, and ∑(y − ŷ)² is the residual sum of squared deviations.
Any sum of squared deviations is related to its number of degrees of freedom, i.e. the number of independent variations of the feature. The number of degrees of freedom is related to the number of population units n and to the number of constants determined from it. In relation to the problem under study, the number of degrees of freedom should show how many independent deviations out of the n possible are required to form a given sum of squares.
The variance per one degree of freedom is D:

D_fact = ∑(ŷ − ȳ)² / 1,  D_resid = ∑(y − ŷ)² / (n − 2).

The F-ratio (F-test):

F = D_fact / D_resid.
If the null hypothesis is true, then the factor and residual variances do not differ from one another. To refute H0, the factor variance must exceed the residual variance several times over. The English statistician Snedecor developed tables of critical values of the F-ratio at different significance levels of the null hypothesis and different numbers of degrees of freedom. The tabular value of the F-test is the maximum value of the ratio of variances that can occur by chance at a given probability level under the null hypothesis. The computed F-ratio is recognized as reliable if it is greater than the tabular one.
In this case, the null hypothesis of no relationship between the features is rejected and a conclusion is made about the significance of this relationship: if F_fact > F_table, then H0 is rejected.
If the value is less than the tabular one (F_fact < F_table), then the probability of the null hypothesis is higher than the given level, and it cannot be rejected without serious risk of drawing a wrong conclusion about the presence of a relationship. In this case, the regression equation is considered statistically insignificant; H0 is not rejected.
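The variance decomposition above can be sketched in code. The data and the function name `f_statistic` are our own invention; 10.13 is the tabular F value at the 5% level with 1 and 3 degrees of freedom.

```python
def f_statistic(xs, ys):
    """F = (explained sum of squares / 1) / (residual sum of squares / (n - 2))
    for the pairwise regression y = a + b*x."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    b = (n * sum(x * y for x, y in zip(xs, ys)) - sx * sy) / \
        (n * sum(x * x for x in xs) - sx * sx)
    a = (sy - b * sx) / n
    y_bar = sy / n
    ss_resid = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    ss_expl = sum(((a + b * x) - y_bar) ** 2 for x in xs)
    return ss_expl / (ss_resid / (n - 2))

# Nearly linear hypothetical data: F far exceeds the tabular value 10.13
F = f_statistic([1, 2, 3, 4, 5], [3.1, 4.9, 7.2, 9.0, 10.8])
```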
The standard error of the regression coefficient is

m_b = sqrt( ∑(y − ŷ)² / (n − 2) / ∑(x − x̄)² ).

To assess the significance of the regression coefficient, its value is compared with its standard error; that is, the actual value of Student's t-test, t_b = b / m_b, is determined and then compared with the tabular value at a given significance level and n − 2 degrees of freedom.
The standard error of the parameter a is

m_a = sqrt( ∑(y − ŷ)² / (n − 2) · ∑x² / (n·∑(x − x̄)²) ).

The significance of the linear correlation coefficient is checked on the basis of the error of the correlation coefficient m_r:

m_r = sqrt( (1 − r²) / (n − 2) ),  t_r = r / m_r.

The total variance of the feature x is σ_x² = ∑(x − x̄)² / n.
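A sketch of the t-test for the slope on the same hypothetical data as before (the name `t_for_slope` is ours); note that for pairwise regression t² coincides with the F-statistic.

```python
from math import sqrt

def t_for_slope(xs, ys):
    """t = b / m_b, with m_b = sqrt(s2 / sum((x - x_bar)^2)), s2 = RSS / (n - 2)."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    b = (n * sum(x * y for x, y in zip(xs, ys)) - sx * sy) / \
        (n * sum(x * x for x in xs) - sx * sx)
    a = (sy - b * sx) / n
    s2 = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys)) / (n - 2)
    m_b = sqrt(s2 / sum((x - sx / n) ** 2 for x in xs))
    return b / m_b

# t far exceeds the tabular value 3.18 (5% level, n - 2 = 3 degrees of freedom)
t = t_for_slope([1, 2, 3, 4, 5], [3.1, 4.9, 7.2, 9.0, 10.8])
```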
Multiple Linear Regression
Model building
Multiple regression is a regression of the dependent variable on two or more factors, i.e. a model of the form

y = f(x_1, x_2, …, x_p) + ε.

Pairwise regression can give a good result in modeling if the influence of other factors affecting the object of study can be neglected. The behavior of individual economic variables cannot be controlled; that is, it is not possible to hold all other conditions equal when assessing the influence of the single factor under study. In this case, one should try to capture the influence of other factors by introducing them into the model, i.e. build a multiple regression equation: y = a + b_1·x_1 + b_2·x_2 + … + b_p·x_p + ε.
The main goal of multiple regression is to build a model with a large number of factors, determining the influence of each of them individually as well as their cumulative impact on the modeled indicator. The specification of the model includes two sets of questions: the selection of factors and the choice of the type of regression equation.
Suppose that after smoothing we obtained a function of the form g(x) = ∛(x + 1) + 1.
We can approximate these data with a linear relationship y = ax + b by calculating the appropriate parameters. To do this, we apply the so-called least squares method. It is also useful to make a drawing to check which line best fits the experimental data.
What exactly is the least squares method (LSM)?
The main thing we need to do is find linear-dependence coefficients at which the value of the function of two variables F(a, b) = ∑_{i=1}^n (y_i − (a·x_i + b))² is smallest. In other words, for certain values of a and b, the sum of squared deviations of the presented data from the resulting straight line takes its minimum value. This is the meaning of the least squares method. To solve the example, all we have to do is find the extremum of a function of two variables.
How to derive formulas for calculating coefficients
In order to derive formulas for calculating the coefficients, it is necessary to compose and solve a system of equations in two variables. To do this, we calculate the partial derivatives of F(a, b) = ∑_{i=1}^n (y_i − (a·x_i + b))² with respect to a and b and equate them to 0:

∂F/∂a = 0, ∂F/∂b = 0
⇔ −2∑(y_i − (a·x_i + b))·x_i = 0, −2∑(y_i − (a·x_i + b)) = 0
⇔ a∑x_i² + b∑x_i = ∑x_i·y_i, a∑x_i + n·b = ∑y_i
To solve a system of equations, you can use any methods, such as substitution or Cramer's method. As a result, we should get formulas that calculate the coefficients using the least squares method.
a = (n∑x_i·y_i − ∑x_i·∑y_i) / (n∑x_i² − (∑x_i)²),  b = (∑y_i − a∑x_i) / n
We have calculated the values of the variables at which the function F(a, b) = ∑_{i=1}^n (y_i − (a·x_i + b))² takes its minimum value; in the third section we will prove why this is so.
This is the application of the least squares method in practice. The formula used to find the parameter a includes ∑x_i, ∑y_i, ∑x_i·y_i, ∑x_i² and the parameter n, which denotes the number of experimental data points. We advise calculating each sum separately. The value of the coefficient b is calculated immediately after a.
Let's go back to the original example.
Example 1
Here n equals five. To make it more convenient to calculate the required sums included in the coefficient formulas, we fill out the table.
i       | 1   | 2   | 3   | 4    | 5    | ∑
x_i     | 0   | 1   | 2   | 4    | 5    | 12
y_i     | 2.1 | 2.4 | 2.6 | 2.8  | 3.0  | 12.9
x_i·y_i | 0   | 2.4 | 5.2 | 11.2 | 15.0 | 33.8
x_i²    | 0   | 1   | 4   | 16   | 25   | 46
Solution
The fourth row contains the data obtained by multiplying the values from the second row by the values of the third for each individual i . The fifth line contains the data from the second squared. The last column shows the sums of the values of the individual rows.
Let's use the least squares method to calculate the coefficients a and b we need. To do this, substitute the desired values from the last column and calculate the sums:
a = (n∑x_i·y_i − ∑x_i·∑y_i)/(n∑x_i² − (∑x_i)²),  b = (∑y_i − a∑x_i)/n
⇒ a = (5·33.8 − 12·12.9)/(5·46 − 12²) = 14.2/86 ≈ 0.165,
b = (12.9 − 0.165·12)/5 ≈ 2.184
We found that the desired approximating straight line is y = 0.165x + 2.184. Now we need to determine which line approximates the data better: g(x) = ∛(x + 1) + 1 or y = 0.165x + 2.184. Let us make the assessment using the least squares method.
To estimate the error, we find the sums of squared deviations of the data from each line, σ₁ = ∑(y_i − (a·x_i + b))² and σ₂ = ∑(y_i − g(x_i))²; the smaller value corresponds to the better-fitting line.

σ₁ = ∑_{i=1}^5 (y_i − (0.165·x_i + 2.184))² ≈ 0.019
σ₂ = ∑_{i=1}^5 (y_i − (∛(x_i + 1) + 1))² ≈ 0.096
Answer: since σ₁ < σ₂, the straight line that best approximates the original data is
y = 0.165x + 2.184.
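The arithmetic of this example can be checked with a short script, reading g(x) as the cube root ∛(x + 1) + 1 (this reading reproduces the quoted σ₂ ≈ 0.096):

```python
xs = [0, 1, 2, 4, 5]
ys = [2.1, 2.4, 2.6, 2.8, 3.0]
n = len(xs)
sx, sy = sum(xs), sum(ys)
sxy = sum(x * y for x, y in zip(xs, ys))
sxx = sum(x * x for x in xs)

a = (n * sxy - sx * sy) / (n * sxx - sx ** 2)   # slope: a ≈ 0.165
b = (sy - a * sx) / n                           # intercept: b ≈ 2.184

# Sums of squared deviations from each candidate line
sigma1 = sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))               # ≈ 0.019
sigma2 = sum((y - ((x + 1) ** (1 / 3) + 1)) ** 2 for x, y in zip(xs, ys))  # ≈ 0.096
# sigma1 < sigma2, so the fitted straight line is the better approximation
```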
The least squares method is clearly shown in the graphic illustration. The red line marks the straight line g(x) = ∛(x + 1) + 1, the blue line marks y = 0.165x + 2.184. The raw data are marked with pink dots.
Let us explain why exactly approximations of this type are needed.
They can be used in problems that require data smoothing, as well as in those where the data needs to be interpolated or extrapolated. For example, in the problem discussed above, one could find the value of the observed quantity y at x = 3 or at x = 6 . We have devoted a separate article to such examples.
Proof of the least squares method
For the function to take its minimum value at the calculated a and b, it is necessary that at this point the matrix of the quadratic form of the second differential of F(a, b) = ∑_{i=1}^n (y_i − (a·x_i + b))² be positive definite. Let us show how it should look.
Example 2
We have a second-order differential of the following form:

d²F(a; b) = (∂²F/∂a²)·da² + 2·(∂²F/∂a∂b)·da·db + (∂²F/∂b²)·db²
Solution
∂²F/∂a² = ∂(−2∑(y_i − (a·x_i + b))·x_i)/∂a = 2∑x_i²
∂²F/∂a∂b = ∂(−2∑(y_i − (a·x_i + b))·x_i)/∂b = 2∑x_i
∂²F/∂b² = ∂(−2∑(y_i − (a·x_i + b)))/∂b = 2∑1 = 2n

In other words, this can be written as d²F(a; b) = 2∑x_i²·da² + 2·(2∑x_i)·da·db + 2n·db².

We have obtained the matrix of the quadratic form:

M = ( 2∑x_i²   2∑x_i )
    ( 2∑x_i    2n    )
In this case, the values of the individual elements do not depend on a and b. Is this matrix positive definite? To answer this question, let us check whether its leading principal minors are positive.
The first-order leading principal minor: 2∑x_i² > 0. Since the points x_i do not all coincide, the inequality is strict. We will keep this in mind in the further calculations.
We calculate the second-order leading principal minor:

det(M) = (2∑x_i²)·(2n) − (2∑x_i)² = 4·(n∑x_i² − (∑x_i)²)

After that, we proceed to prove the inequality n∑x_i² − (∑x_i)² > 0 by mathematical induction.
- Let's check whether the inequality holds in the base case n = 2:

2∑_{i=1}² x_i² − (∑_{i=1}² x_i)² = 2(x_1² + x_2²) − (x_1 + x_2)² = x_1² − 2x_1x_2 + x_2² = (x_1 − x_2)² > 0

The inequality is strict (provided the values x_1 and x_2 do not coincide).
- Let's assume that the inequality is true for n, i.e. n∑_{i=1}^n x_i² − (∑_{i=1}^n x_i)² > 0.
- Now let's prove its validity for n + 1, i.e. that (n + 1)∑_{i=1}^{n+1} x_i² − (∑_{i=1}^{n+1} x_i)² > 0, given that n∑_{i=1}^n x_i² − (∑_{i=1}^n x_i)² > 0.
We calculate:
(n + 1)∑_{i=1}^{n+1} x_i² − (∑_{i=1}^{n+1} x_i)²
= (n + 1)(∑_{i=1}^n x_i² + x_{n+1}²) − (∑_{i=1}^n x_i + x_{n+1})²
= n∑_{i=1}^n x_i² + ∑_{i=1}^n x_i² + (n + 1)·x_{n+1}² − (∑_{i=1}^n x_i)² − 2x_{n+1}∑_{i=1}^n x_i − x_{n+1}²
= [n∑_{i=1}^n x_i² − (∑_{i=1}^n x_i)²] + (n·x_{n+1}² − 2x_{n+1}∑_{i=1}^n x_i + ∑_{i=1}^n x_i²)
= [n∑_{i=1}^n x_i² − (∑_{i=1}^n x_i)²] + (x_{n+1} − x_1)² + (x_{n+1} − x_2)² + … + (x_{n+1} − x_n)² > 0

The expression in square brackets is greater than 0 (by the assumption made in step 2), and the remaining terms are non-negative, since they are squares of numbers. We have proven the inequality.
Answer: the found a and b correspond to the smallest value of the function F(a, b) = ∑_{i=1}^n (y_i − (a·x_i + b))², which means that they are the desired parameters of the least squares method (LSM).
The least squares method (OLS, from the English Ordinary Least Squares) is a mathematical method used to solve various problems, based on minimizing the sum of squared deviations of certain functions from the desired variables. It can be used to "solve" overdetermined systems of equations (when the number of equations exceeds the number of unknowns), to find a solution for ordinary (not overdetermined) nonlinear systems of equations, and to approximate the point values of some function. OLS is one of the basic methods of regression analysis for estimating unknown parameters of regression models from sample data.
History
Before the early 19th century, scientists had no definite rules for solving a system of equations in which the number of unknowns is less than the number of equations; until that time, particular methods were used that depended on the type of equations and on the ingenuity of the calculators, so that different calculators, starting from the same observational data, came to different conclusions. Gauss (1795) is credited with the first application of the method, and Legendre (1805) independently discovered and published it under its modern name (French: Méthode des moindres carrés). Laplace connected the method with probability theory, and the American mathematician Adrain (1808) considered its probabilistic applications. The method was spread and improved by the further research of Encke, Bessel, Hansen and others.
The essence of the method of least squares
Let x be a set of n unknown variables (parameters) and f_i(x), i = 1, …, m, with m > n, a set of functions of this set of variables. The problem is to choose values of x such that the values of these functions are as close as possible to certain values y_i. Essentially, this is the "solution" of the overdetermined system of equations f_i(x) = y_i, i = 1, …, m, in the indicated sense of maximal closeness of the left and right sides of the system. The essence of LSM is to take as the "measure of closeness" the sum of the squared deviations of the left and right sides |f_i(x) − y_i|. Thus, the essence of LSM can be expressed as follows:

∑_i e_i² = ∑_i (y_i − f_i(x))² → min_x.

If the system of equations has a solution, then the minimum of the sum of squares equals zero, and exact solutions of the system can be found analytically or, for example, by various numerical optimization methods. If the system is overdetermined, that is, loosely speaking, the number of independent equations is greater than the number of unknown variables, then the system has no exact solution, and the least squares method allows one to find some "optimal" vector x in the sense of maximal closeness of the vectors y and f(x), or maximal closeness of the deviation vector e to zero (closeness is understood in the sense of Euclidean distance).
Example - system of linear equations
In particular, the least squares method can be used to "solve" a system of linear equations

Ax = b,

where A is a rectangular m × n matrix with m > n (i.e. the number of rows of A exceeds the number of unknown variables).

Such a system of equations generally has no solution. Therefore, this system can be "solved" only in the sense of choosing a vector x that minimizes the "distance" between the vectors Ax and b. To do this, one can apply the criterion of minimizing the sum of squared differences of the left and right sides of the system's equations, that is, (Ax − b)ᵀ(Ax − b) → min_x. It is easy to show that solving this minimization problem leads to the following system of equations:

AᵀAx = Aᵀb ⇒ x = (AᵀA)⁻¹Aᵀb.

OLS in regression analysis (data approximation)
Let there be n values of some variable y (these may be the results of observations, experiments, etc.) and the corresponding values of variables x. The problem is to approximate the relationship between y and x by some function f(x, b) known up to unknown parameters b, that is, in fact, to find the values of the parameters b that bring the values f(x, b) as close as possible to the actual values y. In fact, this reduces to the "solution" of an overdetermined system of equations with respect to b:

f(x_t, b) = y_t, t = 1, …, n.

In regression analysis, and in particular in econometrics, probabilistic models of the relationship between variables are used:

y_t = f(x_t, b) + ε_t,

where ε_t are the so-called random errors of the model.

Accordingly, deviations of the observed values y from the model values f(x, b) are assumed in the model itself. The essence of (ordinary, classical) LSM is to find the parameters b at which the sum of squared deviations (errors; for regression models they are often called regression residuals) e_t is minimal:

b̂_OLS = argmin_b RSS(b),

where RSS (Residual Sum of Squares) is defined as:

RSS(b) = eᵀe = ∑_{t=1}^n e_t² = ∑_{t=1}^n (y_t − f(x_t, b))².

In the general case, this problem can be solved by numerical optimization (minimization) methods; in that case one speaks of nonlinear least squares (NLS or NLLS, Non-Linear Least Squares). In many cases an analytical solution can be obtained. To solve the minimization problem, one finds the stationary points of the function RSS(b) by differentiating it with respect to the unknown parameters b, equating the derivatives to zero, and solving the resulting system of equations:

∑_{t=1}^n (y_t − f(x_t, b)) · ∂f(x_t, b)/∂b = 0.

LSM in the case of linear regression
Let the regression dependence be linear:

y_t = ∑_{j=1}^k b_j·x_{tj} + ε_t = x_tᵀb + ε_t.

Let y be the column vector of observations of the explained variable and X the (n × k) matrix of factor observations (the rows of the matrix are the vectors of factor values in a given observation; the columns are the values of a given factor in all observations). The matrix representation of the linear model is:

y = Xb + ε.

Then the vector of estimates of the explained variable and the vector of regression residuals are

ŷ = Xb,  e = y − ŷ = y − Xb,

and accordingly the sum of squared regression residuals is

RSS = eᵀe = (y − Xb)ᵀ(y − Xb).

Differentiating this function with respect to the parameter vector b and equating the derivatives to zero, we obtain a system of equations (in matrix form):

(XᵀX)b = Xᵀy.

In expanded matrix form, this system of equations looks as follows:

∑x_{t1}²·b_1 + ∑x_{t1}x_{t2}·b_2 + … + ∑x_{t1}x_{tk}·b_k = ∑x_{t1}y_t
∑x_{t2}x_{t1}·b_1 + ∑x_{t2}²·b_2 + … + ∑x_{t2}x_{tk}·b_k = ∑x_{t2}y_t
…
∑x_{tk}x_{t1}·b_1 + ∑x_{tk}x_{t2}·b_2 + … + ∑x_{tk}²·b_k = ∑x_{tk}y_t

where all sums are taken over all admissible values of t.

If a constant is included in the model (as usual), then x_{t1} = 1 for all t; therefore, the upper left element of the matrix of the system of equations is the number of observations n, and the remaining elements of the first row and first column are simply the sums of the values of the variables, ∑x_{tj}, while the first element of the right-hand side is ∑y_t.
The solution of this system of equations gives the general formula for the least squares estimates of the linear model:

b̂_OLS = (XᵀX)⁻¹Xᵀy = ((1/n)·XᵀX)⁻¹·(1/n)·Xᵀy = V_x⁻¹·C_xy.

For analytical purposes, the last representation of this formula turns out to be useful (in the system of equations, dividing by n makes arithmetic means appear instead of sums). If the data in the regression model are centered, then in this representation the first matrix is the sample covariance matrix of the factors and the second is the vector of covariances of the factors with the dependent variable. If, in addition, the data are also normalized by the standard deviation (that is, ultimately standardized), then the first matrix is the sample correlation matrix of the factors and the second vector is the vector of sample correlations of the factors with the dependent variable.

An important property of OLS estimates for models with a constant: the line of the constructed regression passes through the center of gravity of the sample data, that is, the equality

ȳ = b̂_1 + ∑_{j=2}^k b̂_j·x̄_j

is fulfilled. In particular, in the extreme case when the only regressor is a constant, we find that the OLS estimate of the single parameter (the constant itself) equals the mean value of the explained variable. That is, the arithmetic mean, known for its good properties from the laws of large numbers, is also a least squares estimate: it satisfies the criterion of the minimum sum of squared deviations from it.
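A sketch of the general formula b̂ = (XᵀX)⁻¹Xᵀy in NumPy on synthetic noise-free data (all names and values here are our own illustration, so the true coefficients are recovered exactly up to rounding):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 30
# Design matrix: a constant column plus two random factors
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
true_b = np.array([1.0, 2.0, -0.5])
y = X @ true_b                              # exact linear data, no random error

b_hat = np.linalg.inv(X.T @ X) @ (X.T @ y)  # OLS estimate
# With a constant included, the fitted regression passes through the sample means:
# y.mean() equals b_hat @ X.mean(axis=0)
```

In practice `numpy.linalg.lstsq`, or solving the normal equations with `numpy.linalg.solve`, is preferred to forming an explicit inverse.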
The simplest special cases
In the case of pairwise linear regression y_t = a + b·x_t + ε_t, when the linear dependence of one variable on another is estimated, the calculation formulas simplify (one can do without matrix algebra). The system of equations has the form:

a + b·x̄ = ȳ,
a·x̄ + b·mean(x²) = mean(xy),

where the bar and mean(·) denote sample means. From here it is easy to find the coefficient estimates:

b̂ = Cov(x, y)/Var(x) = (mean(xy) − x̄·ȳ)/(mean(x²) − x̄²),
â = ȳ − b̂·x̄.

Although models with a constant are generally preferable, in some cases it is known from theoretical considerations that the constant a should be equal to zero. For example, in physics the relationship between voltage and current has the form U = I·R; measuring voltage and current, it is necessary to estimate the resistance. In this case we are dealing with the model y = bx. Then, instead of a system of equations, we have the single equation
(∑x_t²)·b = ∑x_t·y_t.

Therefore, the formula for estimating the single coefficient has the form

b̂ = ∑_{t=1}^n x_t·y_t / ∑_{t=1}^n x_t² = mean(xy)/mean(x²).
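A sketch of the no-constant model with hypothetical voltage/current measurements chosen to lie near U = 2·I:

```python
# Model without a constant: U = R * I; OLS gives R = sum(I*U) / sum(I^2).
currents = [0.5, 1.0, 1.5, 2.0]   # hypothetical current measurements
voltages = [1.1, 2.0, 3.1, 3.9]   # hypothetical voltage measurements
r_hat = sum(i * u for i, u in zip(currents, voltages)) / \
        sum(i * i for i in currents)
# r_hat is close to the "true" resistance of 2 used to generate the data
```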
The case of a polynomial model
If the data are fitted by a polynomial regression function of one variable, f(x) = b_0 + ∑_{i=1}^k b_i·xⁱ, then, treating the powers xⁱ as separate factors for each i, one can estimate the parameters of the model on the basis of the general formula for estimating the parameters of the linear model. To do this, it suffices to note in the general formula that under this interpretation x_{ti}·x_{tj} = x_tⁱ·x_tʲ = x_t^{i+j} and x_{tj}·y_t = x_tʲ·y_t. Therefore, the matrix equations in this case take the form:

n·b_0 + ∑x_t·b_1 + … + ∑x_tᵏ·b_k = ∑y_t
∑x_t·b_0 + ∑x_t²·b_1 + … + ∑x_t^{k+1}·b_k = ∑x_t·y_t
…
∑x_tᵏ·b_0 + ∑x_t^{k+1}·b_1 + … + ∑x_t^{2k}·b_k = ∑x_tᵏ·y_t
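A sketch of polynomial fitting through the general linear formula (the quadratic data here are our own, noise-free illustration):

```python
import numpy as np

xs = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
ys = 1.0 + 2.0 * xs - 0.5 * xs ** 2      # exact quadratic data
X = np.vander(xs, 3, increasing=True)    # columns: 1, x, x^2
# Normal equations (X^T X) b = X^T y, exactly the matrix system above
coef = np.linalg.solve(X.T @ X, X.T @ ys)
# coef recovers [b_0, b_1, b_2] = [1.0, 2.0, -0.5]
```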
Statistical properties of OLS estimates
First of all, we note that for linear models, the least squares estimates are linear estimates, as follows from the above formula. For the unbiasedness of the least squares estimates, it is necessary and sufficient to fulfill the most important condition of regression analysis: the mathematical expectation of a random error conditional on the factors must be equal to zero. This condition is satisfied, in particular, if
- the mathematical expectation of random errors is zero, and
- factors and random errors are independent random values.
The first condition can be considered always satisfied for models with a constant, since the constant absorbs any nonzero mathematical expectation of the errors (which is one reason models with a constant are generally preferable). The second condition, the exogeneity of the factors, is fundamental. If this property is not satisfied, then almost any estimates will be extremely unsatisfactory: they will not even be consistent (that is, even a very large amount of data does not yield good-quality estimates in this case). In the classical case, a stronger assumption of deterministic factors is made (in contrast to the random error), which automatically implies exogeneity. In the general case, for consistency of the estimates it suffices that the exogeneity condition holds and that the matrix $V_x$ converges to some nonsingular matrix as the sample size grows to infinity.
In order for the (ordinary) least squares estimates to be, in addition to consistent and unbiased, also efficient (the best in the class of linear unbiased estimates), additional properties of the random error must hold:

- constant (identical) variance of the random errors in all observations (no heteroscedasticity);
- absence of correlation (autocorrelation) between random errors in different observations.
These assumptions can be formulated in terms of the covariance matrix of the random error vector: $V(\varepsilon) = \sigma^2 I$.
A linear model that satisfies these conditions is called classical. OLS estimates for classical linear regression are unbiased, consistent and the most efficient estimates in the class of all linear unbiased estimates (the English-language literature sometimes uses the abbreviation BLUE, Best Linear Unbiased Estimator; the Russian-language literature more often cites the Gauss-Markov theorem). As is easy to show, the covariance matrix of the vector of coefficient estimates equals:
$$V(\hat b_{OLS}) = \sigma^2 (X^T X)^{-1}.$$
Efficiency means that this covariance matrix is "minimal" (any linear combination of the coefficients, and in particular the coefficients themselves, has minimal variance); that is, in the class of linear unbiased estimates the OLS estimates are the best. The diagonal elements of this matrix, the variances of the coefficient estimates, are important indicators of the quality of the estimates obtained. However, the covariance matrix cannot be computed directly because the variance of the random errors is unknown. It can be proved that an unbiased and consistent (for the classical linear model) estimate of the variance of the random errors is the quantity:
$$s^2 = RSS/(n-k).$$
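A short sketch on simulated data (NumPy; the true error standard deviation of 0.5 and the coefficients are assumed), computing $s^2$ and the estimated covariance matrix $s^2 (X^T X)^{-1}$:

```python
import numpy as np

# Simulate a classical linear model: constant plus one factor, sigma = 0.5.
rng = np.random.default_rng(1)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta = np.array([1.0, 2.0])
y = X @ beta + rng.normal(0, 0.5, n)

b_hat = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ b_hat
k = X.shape[1]
s2 = resid @ resid / (n - k)          # unbiased estimate of sigma^2 = 0.25
cov_b = s2 * np.linalg.inv(X.T @ X)   # estimated covariance of the estimates
print(s2, np.sqrt(np.diag(cov_b)))    # s^2 and the coefficient standard errors
```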
Substituting this value into the formula for the covariance matrix, we obtain an estimate of the covariance matrix. The resulting estimates are also unbiased and consistent. It is also important that the estimate of the error variance (and hence the variances of the coefficients) and the estimates of the model parameters are independent random variables, which makes it possible to obtain test statistics for testing hypotheses about the model coefficients.
It should be noted that if the classical assumptions are not met, the least squares parameter estimates are not the most efficient (though they remain unbiased and consistent). Worse, the estimate of the covariance matrix becomes biased and inconsistent, so statistical conclusions about the quality of the constructed model can be extremely unreliable. One solution to the latter problem is to use special estimates of the covariance matrix that remain consistent when the classical assumptions are violated (White standard errors and Newey-West standard errors). Another approach is so-called generalized least squares.

Generalized least squares

The method of least squares admits a broad generalization. Instead of minimizing the sum of squared residuals, one can minimize some positive definite quadratic form of the residual vector, $e^T W e$, where $W$ is some symmetric positive definite weight matrix. Ordinary least squares is the special case of this approach in which the weight matrix is proportional to the identity matrix. As is known, symmetric matrices (or operators) admit a decomposition $W = P^T P$. Therefore the functional can be represented as $e^T P^T P e = (Pe)^T Pe = e_*^T e_*$, that is, as the sum of squares of transformed "residuals" $e_* = Pe$. Thus one can distinguish a whole class of least squares methods: the LS-methods (Least Squares).
It is proved (Aitken's theorem) that for a generalized linear regression model (in which no restrictions are imposed on the covariance matrix of the random errors), the most efficient estimates in the class of linear unbiased estimates are those of so-called generalized least squares (GLS, Generalized Least Squares): the LS-method with the weight matrix equal to the inverse covariance matrix of the random errors, $W = V_\varepsilon^{-1}$.
It can be shown that the formula for the GLS-estimates of the parameters of the linear model has the form
$$\hat b_{GLS} = (X^T V^{-1} X)^{-1} X^T V^{-1} y.$$
The covariance matrix of these estimates, respectively, will be equal to
$$V(\hat b_{GLS}) = (X^T V^{-1} X)^{-1}.$$
In fact, the essence of GLS lies in a certain (linear) transformation ($P$) of the original data followed by application of ordinary least squares to the transformed data. The purpose of this transformation is that for the transformed data the random errors already satisfy the classical assumptions.
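A minimal sketch of this equivalence (NumPy; the heteroscedastic simulated data and coefficients are assumed for illustration): the direct GLS formula and the "transform, then ordinary OLS" route give the same estimate.

```python
import numpy as np

# GLS as "transform then OLS": with V the error covariance, a matrix P
# with P'P = V^{-1} whitens the data, reducing GLS to ordinary OLS.
rng = np.random.default_rng(2)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta = np.array([1.0, -2.0])
sigmas = rng.uniform(0.5, 2.0, n)          # assumed heteroscedastic error scales
V = np.diag(sigmas ** 2)
y = X @ beta + rng.normal(0, sigmas)

# Direct GLS formula: (X' V^{-1} X)^{-1} X' V^{-1} y
Vinv = np.linalg.inv(V)
b_gls = np.linalg.solve(X.T @ Vinv @ X, X.T @ Vinv @ y)

# Same estimate via the transformation: P such that P'P = V^{-1}
P = np.linalg.cholesky(Vinv).T
b_ols_transformed = np.linalg.lstsq(P @ X, P @ y, rcond=None)[0]
print(np.allclose(b_gls, b_ols_transformed))  # True
```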
Weighted least squares
In the case of a diagonal weight matrix (and hence a diagonal covariance matrix of the random errors), we have so-called weighted least squares (WLS, Weighted Least Squares). In this case, the weighted sum of squared residuals of the model is minimized, that is, each observation receives a "weight" inversely proportional to the variance of the random error in that observation: $e^T W e = \sum_{t=1}^{n} \frac{e_t^2}{\sigma_t^2}$. In fact, the data are transformed by weighting the observations (dividing by an amount proportional to the assumed standard deviation of the random errors), and ordinary least squares is applied to the weighted data.
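A brief sketch with assumed synthetic data in which the error standard deviation grows with $x$: each row is divided by an amount proportional to $\sigma_t$, and ordinary least squares is applied to the weighted data.

```python
import numpy as np

# Weighted least squares: weights 1/sigma_t^2 for a diagonal error covariance.
rng = np.random.default_rng(3)
x = np.linspace(0, 10, 50)
sigmas = 0.1 + 0.3 * x                      # assumed: error sd grows with x
y = 2.0 + 0.5 * x + rng.normal(0, sigmas)

X = np.column_stack([np.ones_like(x), x])
w = 1.0 / sigmas ** 2                       # observation weights

# Equivalent to OLS on rows divided by sigma_t:
Xw = X * np.sqrt(w)[:, None]
yw = y * np.sqrt(w)
b_wls = np.linalg.lstsq(Xw, yw, rcond=None)[0]
print(b_wls)  # close to the assumed [2.0, 0.5]
```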
The method of least squares (OLS, Ordinary Least Squares) is a mathematical method used to solve various problems, based on minimizing the sum of squared deviations of certain functions from the desired variables. It can be used to "solve" overdetermined systems of equations (when the number of equations exceeds the number of unknowns), to find solutions of ordinary (not overdetermined) nonlinear systems of equations, and to approximate point values by some function. OLS is one of the basic methods of regression analysis for estimating unknown parameters of regression models from sample data.
The essence of the method of least squares
Let there be a set of unknown variables (parameters) and a set of functions of this set of variables. The task is to choose values of the parameters so that the values of these functions come as close as possible to certain given values. In essence, this is the "solution" of an overdetermined system of equations in the indicated sense of maximal closeness of the left and right sides of the system. The essence of LSM is to choose as the "measure of closeness" the sum of squared deviations of the left and right sides. Thus, the essence of LSM can be expressed as follows:
If the system of equations has a solution, then the minimum of the sum of squares will be equal to zero and exact solutions of the system of equations can be found analytically or, for example, by various numerical optimization methods. If the system is overdetermined, that is, loosely speaking, the number of independent equations is greater than the number of unknown variables, then the system does not have an exact solution and the least squares method allows us to find some “optimal” vector in the sense of the maximum proximity of the vectors and or the maximum proximity of the deviation vector to zero (proximity understood in the sense of Euclidean distance).
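A small illustration (NumPy; the system and its numbers are invented for the example): three equations in two unknowns have no exact solution, and `np.linalg.lstsq` returns the vector minimizing the sum of squared deviations, with a nonzero minimum.

```python
import numpy as np

# An overdetermined system Ax = b (3 equations, 2 unknowns), "solved"
# in the least-squares sense by minimizing ||Ax - b||^2.
A = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
b = np.array([1.0, 2.0, 2.0])

x, rss, rank, _ = np.linalg.lstsq(A, b, rcond=None)
print(x)    # the least-squares "solution"
print(rss)  # minimized sum of squares (nonzero: no exact solution exists)
```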
Example - system of linear equations
In particular, the least squares method can be used to "solve" the system of linear equations
where the matrix is not square but rectangular (more precisely, the number of independent equations exceeds the number of unknown variables).
Such a system of equations generally has no solution. Therefore, the system can be "solved" only in the sense of choosing a vector that minimizes the "distance" between the left- and right-hand sides. To do this, one can apply the criterion of minimizing the sum of squared differences of the left and right sides of the equations of the system. It is easy to show that solving this minimization problem leads to solving the following system of equations
Using the pseudo-inversion operator, the solution can be rewritten like this:
where is the pseudoinverse matrix for.
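This can be sketched as follows (NumPy; the example matrix is assumed): the pseudoinverse route and `lstsq` produce the same least-squares solution.

```python
import numpy as np

# The least-squares solution via the pseudoinverse: x = pinv(A) @ b.
A = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
b = np.array([1.0, 1.0, 0.0])

x_pinv = np.linalg.pinv(A) @ b
x_lstsq = np.linalg.lstsq(A, b, rcond=None)[0]
print(np.allclose(x_pinv, x_lstsq))  # True
```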
This problem can also be "solved" using so-called weighted least squares (see the section on weighted least squares), when different equations of the system receive different weights from theoretical considerations.
Strict substantiation and determination of the limits of the meaningful applicability of the method were given by A. A. Markov and A. N. Kolmogorov.
OLS in regression analysis (data approximation)

Let there be values of some variable (these can be results of observations, experiments, etc.) and corresponding values of other variables. The task is to approximate the relationship between them by some function known up to unknown parameters, that is, in effect, to find the parameter values that bring the model values as close as possible to the actual values. This reduces to the case of "solving" an overdetermined system of equations with respect to the parameters:
In regression analysis, and in particular in econometrics, probabilistic models of the relationship between variables are used.
where are the so-called random model errors.
Accordingly, the deviations of the observed values from the model values are already assumed in the model itself. The essence of (ordinary, classical) LSM is to find parameters for which the sum of squared deviations (errors; for regression models they are often called regression residuals) is minimal:
where RSS (Residual Sum of Squares) is defined as:
In the general case, this problem can be solved by numerical methods of optimization (minimization). In this case, one speaks of non-linear least squares (NLS or NLLS - Non-Linear Least Squares). In many cases, an analytical solution can be obtained. To solve the minimization problem, it is necessary to find the stationary points of the function by differentiating it with respect to unknown parameters, equating the derivatives to zero and solving the resulting system of equations:
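As a sketch of the numerical route in the nonlinear case, the following fits the assumed model $y = a e^{bx}$ by a damped Gauss-Newton iteration (NumPy, synthetic data; this is an illustrative sketch, not any specific library's implementation):

```python
import numpy as np

# Nonlinear least squares by damped Gauss-Newton for y ~ a * exp(b * x).
rng = np.random.default_rng(4)
x = np.linspace(0, 2, 40)
y = 2.0 * np.exp(0.7 * x) + rng.normal(0, 0.02, 40)   # assumed true (a, b) = (2, 0.7)

params = np.array([1.0, 0.0])                         # starting guess (a, b)
for _ in range(100):
    a, b = params
    f = a * np.exp(b * x)
    r = y - f                                         # residuals
    # Jacobian of f with respect to (a, b)
    J = np.column_stack([np.exp(b * x), a * x * np.exp(b * x)])
    step = np.linalg.lstsq(J, r, rcond=None)[0]       # Gauss-Newton direction
    # Halve the step until the sum of squared residuals decreases
    t = 1.0
    while t > 1e-6:
        trial = params + t * step
        ft = trial[0] * np.exp(trial[1] * x)
        if np.sum((y - ft) ** 2) < np.sum(r ** 2):
            break
        t /= 2
    params = params + t * step
    if np.linalg.norm(step) < 1e-10:
        break

print(params)  # close to the assumed (2.0, 0.7)
```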
OLS in the case of linear regression
Let the regression dependence be linear:
Let y be the column vector of observations of the explained variable, and let X be the matrix of factor observations (the rows of the matrix are the vectors of factor values in a given observation; the columns are the vectors of values of a given factor across all observations). The matrix representation of the linear model has the form:
Then the vector of estimates of the explained variable and the vector of regression residuals will be equal to
accordingly, the sum of the squares of the regression residuals will be equal to
Differentiating this function with respect to the parameter vector and equating the derivatives to zero, we obtain a system of equations (in matrix form):
Written out explicitly, this system of equations looks like this:
where all sums are taken over all admissible values.
If a constant is included in the model (as usual), then the first factor equals 1 for all observations; therefore, the upper left element of the matrix of the system of equations is the number of observations, while the remaining elements of the first row and first column are simply the sums of the variable values, and the first element of the right-hand side is the sum of the observations of the explained variable.
The solution of this system of equations gives the general formula for the least squares estimates for the linear model:
For analytical purposes, the last representation of this formula turns out to be useful (in the system of equations, dividing by n makes arithmetic means appear instead of sums). If the data in the regression model are centered, then in this representation the first matrix has the meaning of the sample covariance matrix of the factors, and the second is the vector of covariances of the factors with the dependent variable. If, in addition, the data are also normalized by the standard deviation (that is, ultimately standardized), then the first matrix has the meaning of the sample correlation matrix of the factors, and the second vector is the vector of sample correlations of the factors with the dependent variable.
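A minimal sketch of the general formula $\hat b = (X^T X)^{-1} X^T y$ on simulated data (NumPy; the coefficients 1, 2, -1 and the noise level are assumed):

```python
import numpy as np

# OLS via the normal equations for a model with a constant and two factors.
rng = np.random.default_rng(5)
n = 100
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
X = np.column_stack([np.ones(n), x1, x2])
y = 1.0 + 2.0 * x1 - 1.0 * x2 + rng.normal(0, 0.1, n)

# b = (X'X)^{-1} X'y, computed via a linear solve rather than an explicit inverse
b_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(b_hat)  # close to the assumed [1.0, 2.0, -1.0]
```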
An important property of OLS estimates for models with a constant is that the line of the constructed regression passes through the center of gravity of the sample data, that is, the equality is fulfilled:
In particular, in the extreme case when the only regressor is a constant, we find that the OLS estimate of the single parameter (the constant itself) equals the mean value of the explained variable. That is, the arithmetic mean, known for its good properties from the laws of large numbers, is also a least squares estimate: it satisfies the criterion of the minimum sum of squared deviations from it.
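A tiny check of this fact (assumed toy data): scanning candidate constants over a grid, the minimizer of the sum of squared deviations coincides with the sample mean.

```python
import numpy as np

# The mean minimizes sum_t (y_t - a)^2 over the constant a.
y = np.array([1.0, 2.0, 4.0, 5.0])
grid = np.linspace(0.0, 6.0, 601)
sse = [np.sum((y - a) ** 2) for a in grid]
a_best = grid[int(np.argmin(sse))]
print(a_best, y.mean())  # both equal 3.0
```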
If the random errors of the model are normally distributed, have the same variance, and are uncorrelated with each other, the least squares parameter estimates coincide with the maximum likelihood (ML) estimates.
Some special cases of application of LSM in practice
Linear Approximation
Consider the case when, as a result of studying the dependence of a certain scalar quantity on another scalar quantity (this can be, for example, the dependence of voltage on current: $U = I \cdot R$, where $R$ is a constant value, the resistance of the conductor), these quantities were measured, yielding pairs of corresponding values. The measurement data are recorded in a table.
Table. Measurement results.
| Measurement No. | | |
|---|---|---|
| 1 | | |
| 2 | | |
| 3 | | |
| 4 | | |
| 5 | | |
| 6 | | |
The question is: what value of the coefficient can be chosen to best describe the dependence? According to the least squares method, this value should be such that the sum of the squared deviations of the measured values from the predicted values is minimal.
The sum of squared deviations has a single extremum, a minimum, which allows this condition to be used. To find the value of the coefficient, we transform the left-hand side as follows:
The last formula allows us to find the value of the coefficient, which is what the problem required.
Story
Until the beginning of the 19th century, scientists had no definite rules for solving a system of equations in which the number of unknowns is less than the number of equations; until then, ad hoc methods were used that depended on the type of the equations and on the ingenuity of the calculators, so different calculators, starting from the same observational data, arrived at different conclusions. Gauss (1795) is credited with the first application of the method, and Legendre (1805) independently discovered and published it under its modern name (fr. méthode des moindres quarrés). Laplace connected the method with probability theory, and the American mathematician Adrain (1808) considered its probabilistic applications. The method was spread and improved by the further research of Encke, Bessel, Hansen and others.
Alternative use of MNCs
The idea of the least squares method can also be used in other cases not directly related to regression analysis. The fact is that the sum of squares is one of the most common proximity measures for vectors (the Euclidean metric in finite-dimensional spaces).
One application is "solving" systems of linear equations in which the number of equations is greater than the number of variables
where the matrix is not square, but rectangular.
Such a system of equations generally has no solution (more precisely, whenever the rank of the augmented matrix exceeds the rank of the matrix of the system). Therefore, this system can be "solved" only in the sense of choosing a vector that minimizes the "distance" between the left- and right-hand sides. To do this, one can apply the criterion of minimizing the sum of squared differences of the left and right sides of the equations of the system. It is easy to show that solving this minimization problem leads to solving the following system of equations