Regression analysis example. Fundamentals of data analysis
RESULTS
Regression statistics | Value
---|---
Multiple R | 0.998364
R-square | 0.99673
Adjusted R-square | 0.996321
Standard error | 0.42405
Observations | 10
Let's first look at the upper part of the calculations, presented in Table 8.3a: the regression statistics.
The R-square value, also called the measure of certainty, characterizes the quality of the resulting regression line, that is, the degree of correspondence between the original data and the regression model (the calculated data). The measure of certainty always lies within the interval [0; 1].
If the R-square value is close to one, the constructed model explains almost all of the variability of the corresponding variables. Conversely, an R-square value close to zero means poor quality of the constructed model.
In our example, the measure of certainty is 0.99673, which indicates a very good fit of the regression line to the original data.
Multiple R, the multiple correlation coefficient R, expresses the degree of dependence between the independent variables (X) and the dependent variable (Y).
Multiple R equals the square root of the coefficient of determination, so it also takes values in the range from zero to one.
In simple linear regression analysis, multiple R equals the Pearson correlation coefficient. Indeed, the multiple R in our case equals the Pearson correlation coefficient from the previous example (0.998364).
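As a sanity check of these definitions, here is a small sketch (with made-up sample data, not the worksheet values from this example) that computes R-square and multiple R for a simple linear fit and confirms that multiple R coincides with the Pearson coefficient:

```python
import numpy as np

# Hypothetical paired observations (not the data from Table 8.3a).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 4.3, 5.9, 8.2, 9.8, 12.1])

b, a = np.polyfit(x, y, deg=1)       # slope, intercept
y_hat = b * x + a

ss_res = np.sum((y - y_hat) ** 2)    # residual sum of squares
ss_tot = np.sum((y - y.mean()) ** 2) # total sum of squares
r_square = 1 - ss_res / ss_tot       # measure of certainty

multiple_r = np.sqrt(r_square)       # square root of the determination coefficient
pearson_r = np.corrcoef(x, y)[0, 1]
print(r_square, multiple_r, pearson_r)  # multiple_r equals |pearson_r|
```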
 | Coefficients | Standard error | t-statistic
---|---|---|---
Y-intercept | 2.694545455 | 0.33176878 | 8.121757129
Variable X 1 | 2.305454545 | 0.04668634 | 49.38177965

* A truncated version of the calculations is given.
Now consider the middle part of the calculations, presented in Table 8.3b. Here the regression coefficient b (2.305454545) and the intercept along the y-axis, i.e. the constant a (2.694545455), are given.
Based on the calculations, we can write the regression equation as follows:
Y = 2.305454545·X + 2.694545455
The direction of the relationship between the variables is determined by the sign (negative or positive) of the regression coefficient b.
If the sign of the regression coefficient is positive, the relationship of the dependent variable with the independent one is positive. In our case, the sign of the regression coefficient is positive, so the relationship is also positive.
If the sign of the regression coefficient is negative, the relationship between the dependent and the independent variable is negative (inverse).
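The fit itself is easy to reproduce outside Excel. In the sketch below, the Y values are reconstructed from Table 8.3c as predicted Y plus residual, and the X values are backed out of the predicted values through the fitted line (an assumption, since the raw worksheet data are not shown in this excerpt):

```python
import numpy as np

# Reconstructed data (see lead-in): actual Y = predicted Y + residual.
x = np.array([3, 2, 4, 5, 6, 7, 8, 9, 10, 11], dtype=float)
y = np.array([9.0, 7.0, 12.0, 15.0, 17.0, 19.0, 21.0, 23.4, 25.6, 27.8])

b, a = np.polyfit(x, y, deg=1)
print(f"Y = {b:.9f}*X + {a:.9f}")  # ~ Y = 2.305454545*X + 2.694545455
```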
In Table 8.3c, the residual output is presented. For these results to appear in the report, the "Residuals" checkbox must be activated when launching the "Regression" tool.
RESIDUAL OUTPUT

Observation | Predicted Y | Residuals | Standard residuals
---|---|---|---
1 | 9.610909091 | -0.610909091 | -1.528044662
2 | 7.305454545 | -0.305454545 | -0.764022331
3 | 11.91636364 | 0.083636364 | 0.209196591
4 | 14.22181818 | 0.778181818 | 1.946437843
5 | 16.52727273 | 0.472727273 | 1.182415512
6 | 18.83272727 | 0.167272727 | 0.418393181
7 | 21.13818182 | -0.138181818 | -0.34562915
8 | 23.44363636 | -0.043636364 | -0.109146047
9 | 25.74909091 | -0.149090909 | -0.372915662
10 | 28.05454545 | -0.254545455 | -0.636685276
Using this part of the report, we can see the deviation of each point from the constructed regression line. The largest absolute residual (0.778181818) belongs to observation 4, and the smallest (0.043636364) to observation 8.
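Continuing the sketch above, the residual output is straightforward to reproduce. Dividing each residual by the sample standard deviation of the residuals (ddof=1) matches the "standard residuals" column of the report:

```python
y_hat = a + b * x
residuals = y - y_hat

# OLS residuals have zero mean, so this equals sqrt(SSE / (n - 1)).
standard_residuals = residuals / residuals.std(ddof=1)

for i, (p, r, s) in enumerate(zip(y_hat, residuals, standard_residuals), 1):
    print(f"{i:2d} {p:13.9f} {r:13.9f} {s:13.9f}")
```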
Regression analysis is a method for modeling measured data and studying their properties. The data consist of pairs of values of the dependent variable (the response variable) and the independent variable (the explanatory variable). A regression model is a function of the independent variable and of parameters, with an added random variable. The model parameters are tuned so that the model approximates the data as well as possible. The approximation quality criterion (the objective function) is usually the mean squared error: the sum of the squares of the differences between the model values and the dependent variable over all values of the independent variable taken as the argument.

Regression analysis is a branch of mathematical statistics and machine learning. It is assumed that the dependent variable is the sum of the values of some model and a random variable ε. Assumptions are made about the nature of this variable's distribution; they are called the data generation hypothesis. To confirm or refute this hypothesis, statistical tests called residual analysis are performed. It is assumed that the independent variable contains no errors. Regression analysis is used for forecasting, time series analysis, hypothesis testing, and discovering hidden relationships in data.
Definition of regression analysis
The sample may be not a function but a relation. For example, the data for building a regression might be, say, {(1, 2), (1, 3), (2, 4)}. In such a sample, one value of the variable x corresponds to several values of the variable y.
Linear Regression
Linear regression assumes that the function f depends linearly on the parameters β. A linear dependence on the free variable x itself is not required; in general the model has the form

f(x, β) = β₀ + β₁·g₁(x) + … + β_N·g_N(x),

where the g_j are fixed functions of the free variable. In the case where the linear regression function has the form

f(x, β) = β₀ + β₁·x₁ + … + β_N·x_N,

the x_j here are the components of the vector x.
The parameter values in the case of linear regression are found using the least squares method. The use of this method is justified by the assumption that the random variable ε has a Gaussian distribution.
The differences between the actual values of the dependent variable and the reconstructed ones are called regression residuals. Synonyms also used in the literature: residues and errors. One of the important estimates of the quality of the obtained dependence is the sum of squared residuals

SSE = Σᵢ (yᵢ − f(xᵢ, β))²,

where SSE stands for Sum of Squared Errors. The variance of the residuals is estimated by the formula

MSE = SSE / (n − 2),

where MSE stands for Mean Squared Error and n − 2 counts the residual degrees of freedom for a two-parameter model.
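A minimal sketch of both quantities, assuming a simple linear model with two fitted parameters (hence n − 2 degrees of freedom in MSE):

```python
import numpy as np

def sse(y, y_hat):
    """Sum of squared errors between observed and fitted values."""
    return float(np.sum((np.asarray(y) - np.asarray(y_hat)) ** 2))

def mse(y, y_hat, n_params=2):
    """Residual variance estimate: SSE over the residual degrees of freedom."""
    return sse(y, y_hat) / (len(y) - n_params)
```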
The graphs show samples marked with blue dots and regression dependencies marked with solid lines. The free variable is plotted along the abscissa, and the dependent variable is plotted along the ordinate. All three dependencies are linear with respect to the parameters.
Nonlinear Regression
Nonlinear regression models are models of the form

y = f(x, β) + ε,

in which f cannot be represented as a dot product ⟨β, g(x)⟩ of the parameter vector with functions of the free variable. Here β are the parameters of the regression model, x is the free variable from the space X, y is the dependent variable, ε is a random variable, and f is a function from some given set.
The parameter values in the case of nonlinear regression are found using iterative numerical optimization methods, for example the Levenberg-Marquardt algorithm.
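As an illustration, SciPy's curve_fit uses the Levenberg-Marquardt algorithm by default for unconstrained problems; the model and data below are made up for the sketch:

```python
import numpy as np
from scipy.optimize import curve_fit

def model(x, a, b, c):
    # A model that is nonlinear in the parameter b.
    return a * np.exp(b * x) + c

rng = np.random.default_rng(0)
x = np.linspace(0.0, 4.0, 50)
y = 2.5 * np.exp(0.7 * x) + 1.0 + rng.normal(0.0, 0.5, x.size)

params, _cov = curve_fit(model, x, y, p0=(1.0, 0.5, 0.0))
print(params)  # close to (2.5, 0.7, 1.0)
```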
About terms
The term "regression" was coined by Francis Galton in the late 19th century. Galton found that children of tall or short parents do not usually inherit outstanding height and called this phenomenon "regression to mediocrity." At first, the term was used exclusively in the biological sense. After the work of Karl Pearson, this term began to be used in statistics.
In the statistical literature, a distinction is made between regression involving one free variable and regression with several free variables: one-dimensional and multidimensional regression. In general it is assumed that several free variables are used, that is, the free variable is a vector; in the special case where the free variable is a scalar, it is denoted x. A distinction is also made between linear and nonlinear regression. If the regression model is not a linear combination of functions of the parameters, one speaks of nonlinear regression. In this case, the model can be an arbitrary superposition of functions from a certain set. Nonlinear models include exponential, trigonometric and other models (for example, radial basis functions or the Rosenblatt perceptron), which assume that the relationship between the parameters and the dependent variable is nonlinear.
A distinction is made between parametric and nonparametric regression. It is difficult to draw a sharp line between these two types of regression, and there is currently no generally accepted criterion for distinguishing one type of model from the other. For example, linear models are considered parametric, while models that average the dependent variable over a region of the space of the free variable are considered nonparametric. Examples of parametric regression models: the linear predictor, the multilayer perceptron. An example of a mixed regression model: radial basis functions. A nonparametric model: the moving average in a window of some width. In general, nonparametric regression differs from parametric regression in that the dependent variable depends not on a single value of the free variable but on some given neighborhood of that value.
There is a distinction between the terms: "function approximation", "approximation", "interpolation", and "regression". It consists in the following.
Function approximation. A function of a discrete or continuous argument is given. It is required to find a function from some parametric family, for example among algebraic polynomials of a given degree. The parameters of the function must minimize some functional, for example the sum (or integral) of the squared deviations between the given function and the approximating one.
The term approximation is a synonym for "function approximation". It is used more often when speaking of a given function of a discrete argument. Here, too, it is required to find a function passing as close as possible to all points of the given function. This introduces the concept of residuals: the distances between the points of the continuous function and the corresponding points of the function of the discrete argument.
Function interpolation is a special case of the approximation problem, in which it is required that at certain points, called interpolation nodes, the values of the function and of the function approximating it coincide. In the more general case, restrictions are also imposed on the values of some derivatives. That is, a function of a discrete argument is given, and it is required to find a function passing through all its points. A metric is usually not used here; instead, the concept of "smoothness" of the desired function is often introduced.
The concept of regression. The relationship between the variables x and y can be described in different ways. In particular, any form of connection can be expressed by a general equation y = f(x), where y is treated as the dependent variable, or function, of the other, independent variable x, called the argument. The correspondence between an argument and a function can be given by a table, a formula, a graph, and so on. A change in the function depending on a change in one or more arguments is called regression. All means used to describe correlations constitute the content of regression analysis.
Correlation equations, or regression equations, empirical and theoretically calculated regression series and their graphs, called regression lines, as well as linear and nonlinear regression coefficients, all serve to express regression.
Regression indicators express the correlation two-way, taking into account both the change in the average values of the attribute Y for given values xᵢ of the attribute X and, conversely, the change in the average values of the attribute X for given values yᵢ of the attribute Y. The exception is time series (series of dynamics), which show the change of attributes over time; the regression of such series is one-sided.
There are many different forms and types of correlations. The task reduces to identifying the form of the connection in each specific case and expressing it by the corresponding correlation equation, which allows one to foresee possible changes in one attribute Y based on known changes in the other attribute X, connected with the first by correlation.
12.1 Linear regression
Regression equation. The results of observations carried out on a particular biological object with respect to the correlated characteristics x and y can be represented by points on a plane in a system of rectangular coordinates. The result is a scatter diagram that makes it possible to judge the form and tightness of the relationship between the varying characteristics. Quite often this relationship looks like a straight line or can be approximated by a straight line.
A linear relationship between the variables x and y is described by the general equation y = a + bx₁ + cx₂ + dx₃ + …, where a, b, c, d, … are parameters of the equation that determine the relationship between the arguments x₁, x₂, x₃, …, x_m and the function y.

In practice, not all possible arguments are taken into account, but only some of them; in the simplest case, only one:

y = a + bx. (1)
In the linear regression equation (1), a is the free term, and the parameter b determines the slope of the regression line with respect to the rectangular coordinate axes. In analytic geometry this parameter is called the slope, and in biometrics the regression coefficient. A visual representation of this parameter and of the position of the regression lines of Y on X and of X on Y in the system of rectangular coordinates is given in Fig. 1.
Fig. 1. Regression lines of Y on X and of X on Y in a system of rectangular coordinates
The regression lines, as shown in Fig. 1, intersect at the point O(x̄, ȳ), corresponding to the arithmetic mean values of the mutually correlated attributes Y and X. When plotting regression graphs, the values of the independent variable X are plotted along the abscissa, and the values of the dependent variable, or function Y, along the ordinate. The line AB passing through the point O(x̄, ȳ) corresponds to a complete (functional) relationship between the variables Y and X, when the correlation coefficient r = ±1. The stronger the connection between Y and X, the closer the regression lines are to AB; conversely, the weaker the relationship between these quantities, the farther the regression lines lie from AB. In the absence of a connection between the attributes, the regression lines are at right angles to each other, and r = 0.
Since the regression indicators express the correlation two-way, regression equation (1) should be written in two forms:

y = a_yx + b_yx·x and x = a_xy + b_xy·y.

The first formula determines the averaged values of Y when the attribute X changes by one unit of measure; the second, the averaged values of X when the attribute Y changes by one unit of measure.
Regression coefficient. The regression coefficient shows by how much, on average, the value of one attribute y changes when the attribute X correlated with it changes by one unit of measure. This indicator is determined by the formula

b_yx = r·(s_y / s_x), (12.5)

where the values of s are multiplied by the size of the class interval λ if they were found from variation series or correlation tables.
The regression coefficient can be calculated bypassing the standard deviations s_y and s_x, using the formula

b_yx = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)².

If the correlation coefficient is unknown, the regression coefficient is determined as follows:

b_yx = (n·Σxy − Σx·Σy) / (n·Σx² − (Σx)²).
Relationship between the regression and correlation coefficients. Comparing formulas (11.1) (topic 11) and (12.5), we see that the same quantity Σ(x − x̄)(y − ȳ) appears in their numerators, which indicates a connection between these indicators. This relationship is expressed by the equality

r_xy = √(b_yx · b_xy). (6)

Thus, the correlation coefficient is equal to the geometric mean of the coefficients b_yx and b_xy. Formula (6) allows, first, to determine the correlation coefficient r_xy from the known values of the regression coefficients b_yx and b_xy, and second, to check the correctness of the calculation of this correlation indicator r_xy between the varying attributes X and Y.
Like the correlation coefficient, the regression coefficient characterizes only a linear relationship and is accompanied by a plus sign for a positive relationship and a minus sign for a negative relationship.
Determination of linear regression parameters. It is known that the sum of squared deviations of the variants xᵢ from their arithmetic mean is smaller than from any other value, i.e. Σ(xᵢ − x̄)² = min. This theorem forms the basis of the method of least squares. With respect to linear regression [see formula (1)], the requirement of this theorem is satisfied by a certain system of equations, called normal:

Σy = n·a + b·Σx,
Σxy = a·Σx + b·Σx².

The joint solution of these equations with respect to the parameters a and b leads to the following results:

a = (Σy·Σx² − Σx·Σxy) / (n·Σx² − (Σx)²);

b = (n·Σxy − Σx·Σy) / (n·Σx² − (Σx)²);

whence a = ȳ − b·x̄.

Given the two-way nature of the relationship between the variables Y and X, the formula for determining the parameter a should be expressed in both directions:

a_yx = ȳ − b_yx·x̄ and a_xy = x̄ − b_xy·ȳ. (7)

The parameter b, the regression coefficient, is determined by the formulas

b_yx = (n·Σxy − Σx·Σy) / (n·Σx² − (Σx)²) and b_xy = (n·Σxy − Σx·Σy) / (n·Σy² − (Σy)²).
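These solution formulas translate directly into code. A small sketch, with illustrative data, of fitting y = a + bx through the normal equations:

```python
import numpy as np

def fit_line(x, y):
    """Solve the normal equations for y = a + b*x."""
    n = len(x)
    b = (n * np.sum(x * y) - np.sum(x) * np.sum(y)) / \
        (n * np.sum(x ** 2) - np.sum(x) ** 2)
    a = np.mean(y) - b * np.mean(x)
    return a, b

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.9, 5.1, 7.0, 9.2, 10.9])
print(fit_line(x, y))  # a close to 1, b close to 2
```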
Construction of empirical regression series. In the presence of a large number of observations, regression analysis begins with the construction of empirical regression series. An empirical regression series is formed by calculating, for the values of one varying attribute X, the average values ȳ of the other attribute Y correlated with it. In other words, the construction of empirical regression series comes down to finding the group means ȳ (or x̄) from the corresponding values of the attributes Y and X.
An empirical regression series is a double series of numbers that can be represented by points on a plane, and then, by connecting these points with straight line segments, an empirical regression line can be obtained. Empirical regression series, especially their plots, called regression lines, give a visual representation of the form and tightness of the correlation dependence between varying features.
Alignment of empirical regression series. Graphs of empirical regression series turn out, as a rule, to be not smooth but broken lines. This is explained by the fact that, along with the main causes that determine the general pattern in the variability of the correlated attributes, their value is affected by numerous secondary causes that produce random fluctuations in the nodal points of the regression. To identify the main tendency (trend) of the conjugate variation of the correlated attributes, the broken lines must be replaced by smooth, smoothly running regression lines. The process of replacing broken lines with smooth ones is called the alignment of empirical series and regression lines.
Graphic alignment method. This is the simplest method, requiring no computational work. Its essence is as follows: the empirical regression series is plotted as a graph in a rectangular coordinate system; then the midpoints of the regression are outlined visually, and a solid line is drawn through them with a ruler or French curve. The disadvantage of this method is obvious: it does not exclude the influence of the researcher's individual traits on the results of the alignment of the empirical regression lines. Therefore, in cases where higher accuracy is required when replacing broken regression lines with smooth ones, other methods of aligning the empirical series are used.
Moving average method. The essence of this method reduces to the sequential calculation of arithmetic averages of two or three neighboring terms of the empirical series. It is especially convenient when the empirical series is represented by a large number of terms, so that the loss of the two extreme terms, which is inevitable with this method of alignment, does not noticeably affect its structure.
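A sketch of the three-term centered moving average; note that the two extreme terms of the series are lost, as described above:

```python
import numpy as np

def moving_average(series, window=3):
    """Centered moving average; returns len(series) - window + 1 values."""
    kernel = np.ones(window) / window
    return np.convolve(series, kernel, mode="valid")

empirical = np.array([3.0, 5.0, 4.0, 6.0, 8.0, 7.0, 9.0])
print(moving_average(empirical))  # 5 smoothed values from 7 terms
```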
Least squares method. This method was proposed at the beginning of the 19th century by A.M. Legendre and, independently of him, by C. Gauss. It allows the most accurate alignment of empirical series. This method, as shown above, is based on the premise that the sum of squared deviations of the variants xᵢ from their average is minimal, i.e. Σ(xᵢ − x̄)² = min, whence the name of the method. It is used not only in ecology but also in technology. The method of least squares is objective and universal; it is used in the most varied cases when finding empirical equations of regression series and determining their parameters.
The requirement of the least squares method is that the theoretical points of the regression line must be obtained in such a way that the sum of squared deviations of the empirical observations yᵢ from these points is minimal, i.e.

Σ(yᵢ − ŷᵢ)² = min.
Calculating the minimum of this expression in accordance with the principles of mathematical analysis and transforming it in a certain way, one can obtain a system of so-called normal equations, in which the unknown values are the desired parameters of the regression equation, and the known coefficients are determined by the empirical values of the features, usually the sums of their values and their cross products.
Multiple linear regression. The relationship between several variables is usually expressed by a multiple regression equation, which can be linear or nonlinear. In its simplest form, multiple regression is expressed by an equation with two independent variables (x, z):

Y = a + bx + cz, (10)

where a is the free term of the equation, and b and c are its parameters. To find the parameters of equation (10) by the least squares method, the following system of normal equations is used:

Σy = n·a + b·Σx + c·Σz,
Σxy = a·Σx + b·Σx² + c·Σxz,
Σzy = a·Σz + b·Σxz + c·Σz².
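Rather than solving that system by hand, the same least-squares estimates can be obtained numerically. A sketch with illustrative data for equation (10):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
z = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
y = np.array([5.1, 5.9, 11.2, 11.8, 17.1, 17.9])

# Design matrix with an intercept column; lstsq minimizes ||A @ p - y||^2.
A = np.column_stack([np.ones_like(x), x, z])
(a, b, c), *_ = np.linalg.lstsq(A, y, rcond=None)
print(a, b, c)  # close to (1, 2, 1) for this made-up data
```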
Time series. Series alignment. The change of attributes over time forms the so-called time series, or series of dynamics. A characteristic feature of such series is that the time factor always acts here as the independent variable X, and the changing attribute as the dependent variable Y. Unlike ordinary regression series, the relationship between the variables X and Y is one-sided, since the time factor does not depend on the variability of the attributes. Despite these features, time series can be likened to regression series and processed by the same methods.
Like regression series, empirical time series are affected not only by the main factors but also by numerous secondary (random) factors that obscure the main tendency in the variability of the attributes, which in the language of statistics is called a trend.
Analysis of a time series begins with identifying the shape of the trend. To do this, the time series is represented as a line graph in a system of rectangular coordinates, with time points (years, months and other units of time) plotted along the abscissa and the values of the dependent variable Y along the ordinate. If there is a linear relationship between the variables X and Y (a linear trend), the series is aligned by the method of least squares using the regression equation written in the form of deviations of the terms of the series of the dependent variable Y from the arithmetic mean of the series of the independent variable X:

y_x = ȳ + b·(x − x̄),

where b is the linear regression parameter.
Numerical characteristics of time series. The main generalizing numerical characteristics of time series include the geometric mean and the arithmetic mean, which is close to it. They characterize the average rate at which the value of the dependent variable changes over certain periods of time.
An estimate of the variability of the terms of a time series is the standard deviation. When choosing regression equations to describe a time series, the form of the trend is taken into account, which can be linear (or reducible to linear) or nonlinear. The correctness of the choice of the regression equation is usually judged by the similarity of the empirically observed and the calculated values of the dependent variable. More accurate in solving this problem is the method of analysis of variance for regression (topic 12, item 4).
Correlation of time series. It is often necessary to compare the dynamics of parallel time series related to each other by certain general conditions, for example, to find out the relationship between agricultural production and the growth of livestock numbers over a certain period of time. In such cases, the relationship between the variables X and Y is characterized by the correlation coefficient r_xy (in the presence of a linear trend).
It is known that the trend of a time series is, as a rule, obscured by fluctuations of the terms of the series of the dependent variable Y. Hence a twofold problem arises: measuring the dependence between the compared series without excluding the trend, and measuring the dependence between adjacent members of the same series with the trend excluded. In the first case, the indicator of the closeness of the connection between the compared time series is the correlation coefficient (if the relationship is linear); in the second, the autocorrelation coefficient. These indicators have different values, although they are calculated by the same formulas (see topic 11).
It is easy to see that the value of the autocorrelation coefficient is affected by the variability of the members of the series of the dependent variable: the less the members of the series deviate from the trend, the higher the autocorrelation coefficient, and vice versa.
Regression and correlation analysis are statistical research methods and the most common ways to show the dependence of a parameter on one or more independent variables.
Below, using concrete practical examples, we consider these two analyses, which are very popular among economists, and give an example of obtaining results when they are combined.
Regression Analysis in Excel
Regression analysis shows the influence of some values (independent ones) on the dependent variable. For example: how does the size of the economically active population depend on the number of enterprises, wages and other parameters? Or: how do foreign investments, energy prices, etc. affect the level of GDP?

The result of the analysis makes it possible to set priorities and, based on the main factors, to predict and plan the development of priority areas and make managerial decisions.
Regression can be:

- linear (y = a + bx);
- parabolic (y = a + bx + cx²);
- exponential (y = a · e^(bx));
- power (y = a · xᵇ);
- hyperbolic (y = b/x + a);
- logarithmic (y = b · ln(x) + a);
- exponential in base form (y = a · bˣ).
Consider the example of building a regression model in Excel and interpreting the results. Let's take a linear type of regression.
Task. At six enterprises, the average monthly salary and the number of employees who quit were analyzed. It is necessary to determine the dependence of the number of employees who quit on the average salary.
The linear regression model has the following form:

Y = a₀ + a₁x₁ + … + a_k·x_k,

where the aᵢ are the regression coefficients, the xᵢ are the influencing variables, and k is the number of factors.
In our example, Y is the number of employees who quit. The influencing factor is the salary (x).
Excel has built-in functions that can be used to calculate the parameters of a linear regression model. But the Analysis ToolPak add-in will do it faster.
Activate this powerful analytical tool in the add-ins settings; once activated, it is available on the Data tab.
Now we will deal directly with the regression analysis.
First of all, we pay attention to the R-square and coefficients.
R-square is the coefficient of determination. In our example it is 0.755, or 75.5%: the calculated parameters of the model explain 75.5% of the relationship between the studied parameters. The higher the coefficient of determination, the better the model: above 0.8 is good, while below 0.5 such an analysis can hardly be considered reasonable. In our example it is "not bad".
The coefficient 64.1428 shows what Y will be if all variables in the model under consideration are equal to 0. That is, the value of the analyzed parameter is also affected by other factors not described in the model.
The coefficient -0.16285 shows the weight of the variable X on Y. That is, within this model, the average monthly salary affects the number of employees who quit with a weight of -0.16285 (a small degree of influence). The "-" sign indicates a negative influence: the higher the salary, the fewer employees quit. Which is fair.
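The same kind of output can be reproduced outside Excel. The worksheet data are not shown in the text, so the sketch below uses hypothetical salary/quit figures; what matters is the shape of the result: an intercept, a negative slope, and an R-square to interpret as above:

```python
import numpy as np
from scipy import stats

# Hypothetical data for six enterprises: average monthly salary (x)
# and number of employees who quit (y).
salary = np.array([250.0, 280.0, 300.0, 320.0, 350.0, 400.0])
quits = np.array([30.0, 25.0, 22.0, 19.0, 13.0, 5.0])

res = stats.linregress(salary, quits)
print(res.intercept)     # the model's Y when all variables are 0
print(res.slope)         # negative: higher salary, fewer quits
print(res.rvalue ** 2)   # coefficient of determination
```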
Correlation analysis in Excel
Correlation analysis helps to establish whether there is a relationship between indicators in one or two samples. For example, between the operating time of the machine and the cost of repairs, the price of equipment and the duration of operation, the height and weight of children, etc.
If there is a relationship, does an increase in one parameter lead to an increase (positive correlation) or a decrease (negative correlation) in the other? Correlation analysis helps the analyst determine whether the value of one indicator can be used to predict the possible value of another.

The correlation coefficient is denoted r and varies from +1 to -1. The classification of correlation strength differs from one field to another. When the coefficient is 0, there is no linear relationship between the samples.
Consider how to use Excel to find the correlation coefficient.
The CORREL function is used to find paired correlation coefficients.
Task: Determine if there is a relationship between the operating time of a lathe and the cost of its maintenance.
- Put the cursor in any cell and press the fx button.
- In the "Statistical" category, select the CORREL function.
- Argument "Array 1": the first range of values, the operating time of the machine (A2:A14).
- Argument "Array 2": the second range of values, the cost of repairs (B2:B14). Click OK.
To determine the strength of the connection, look at the absolute value of the coefficient (each field of activity has its own scale).

For correlation analysis of several parameters (more than two), it is more convenient to use "Data Analysis" (the "Analysis Package" add-in): select "Correlation" in the list and designate the data array. That is all.
The resulting coefficients are displayed in a correlation matrix.
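For reference, a sketch of the same two steps (paired coefficient and full matrix) in pandas, with made-up machine data:

```python
import pandas as pd

df = pd.DataFrame({
    "hours": [12, 25, 33, 48, 60, 71, 85, 97],               # operating time
    "repair_cost": [200, 350, 390, 520, 640, 680, 810, 900],  # maintenance cost
})

print(df["hours"].corr(df["repair_cost"]))  # pairwise, like CORREL
print(df.corr())                            # correlation matrix, like the add-in
```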
Correlation-regression analysis
In practice, these two techniques are often used together: correlation analysis first establishes whether a relationship between the indicators exists, and regression analysis then describes its form and quantifies it.
1. The term "regression" was first introduced by the founder of biometrics F. Galton (19th century), whose ideas were developed by his follower K. Pearson.
Regression analysis - a method of statistical data processing that makes it possible to measure the relationship between one or more causes (factor attributes) and a consequence (resultant attribute).

Attribute - the main distinguishing feature or characteristic of the phenomenon or process being studied.

Resultant attribute - the indicator under investigation.

Factor attribute - an indicator that affects the value of the resultant attribute.
The purpose of regression analysis is to evaluate the functional dependence of the average value of the resultant attribute (y) on the factor attributes (x₁, x₂, …, xₙ), expressed as a regression equation

y = f(x₁, x₂, …, xₙ). (6.1)
There are two types of regression: paired and multiple.
Paired (simple) regression - an equation of the form

y = f(x). (6.2)

The resultant attribute in paired regression is considered as a function of one argument, i.e. one factor attribute.
Regression analysis includes the following steps:

· determining the type of function;
· determining the regression coefficients;
· calculating the theoretical values of the resultant attribute;
· checking the statistical significance of the regression coefficients;
· checking the statistical significance of the regression equation.
Multiple regression - an equation of the form

y = f(x₁, x₂, …, xₙ). (6.3)

The resultant attribute is considered as a function of several arguments, i.e. many factor attributes.
2. In order to determine the type of function correctly, it is necessary to establish the direction of the connection on the basis of theoretical data.

According to the direction of the connection, regression is divided into:

· direct regression, arising when, with an increase or decrease in the independent quantity x, the values of the dependent quantity y correspondingly also increase or decrease;

· inverse regression, arising when, with an increase or decrease in the independent quantity x, the dependent quantity y correspondingly decreases or increases.
To characterize the relationships, the following types of paired regression equations are used:

· y = a + bx - linear;
· y = e^(ax + b) - exponential;
· y = a + b/x - hyperbolic;
· y = a + b₁x + b₂x² - parabolic;
· y = a·bˣ - exponential (base form), etc.,

where a, b, b₁, b₂ are the coefficients (parameters) of the equation, y is the resultant attribute, and x is the factor attribute.
3. The construction of the regression equation reduces to estimating its coefficients (parameters); for this, the least squares method (OLS) is used.

The least squares method yields parameter estimates for which the sum of squared deviations of the actual values of the resultant attribute y from the theoretical values y_x is minimal, that is,

Σ(y − y_x)² → min.

The parameters of the regression equation y = a + bx are estimated by the least squares method using the formulas

b = (n·Σxy − Σx·Σy) / (n·Σx² − (Σx)²), a = ȳ − b·x̄,

where a is the free coefficient and b is the regression coefficient, which shows by how much the resultant attribute y changes on average when the factor attribute x changes by one unit of measure.
4. To assess the statistical significance of the regression coefficients, Student's t-test is used.

Scheme for checking the significance of the regression coefficients:

1) H₀: a = 0, b = 0 - the regression coefficients do not differ significantly from zero.

H₁: a ≠ 0, b ≠ 0 - the regression coefficients differ significantly from zero.

2) α = 0.05 - significance level.

3) t_a = a / m_a, t_b = b / m_b, (6.6)

where m_a, m_b are the random errors of the parameters:

m_b = √(Σ(y − y_x)² / (n − 2)) / √(Σ(x − x̄)²); m_a = m_b·√(Σx² / n). (6.7)

4) t_table(α; f),

where f = n − k − 1 is the number of degrees of freedom (a tabulated value), n is the number of observations, and k is the number of factor attributes x.

5) If |t_calc| > t_table, then H₀ is rejected, i.e. the coefficient is significant.

If |t_calc| ≤ t_table, then H₀ is accepted, i.e. the coefficient is insignificant.
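A compact sketch of steps 3-5 for a single coefficient, assuming its standard error has already been computed by (6.7):

```python
from scipy import stats

def coefficient_is_significant(coef, m_coef, n, k=1, alpha=0.05):
    """Student's t-test for a regression coefficient (two-sided)."""
    t_calc = abs(coef / m_coef)
    t_table = stats.t.ppf(1 - alpha / 2, df=n - k - 1)
    return t_calc > t_table
```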
5. To check the correctness of the constructed regression equation as a whole, Fisher's F-test is used.

Scheme for checking the significance of the regression equation:

1) H₀: the regression equation is not significant.

H₁: the regression equation is significant.

2) α = 0.05 - significance level.

3) F_calc = (Σ(y_x − ȳ)² / k) / (Σ(y − y_x)² / (n − k − 1)) = (r² / (1 − r²)) · (n − k − 1) / k, (6.8)

where n is the number of observations; k is the number of parameters with the variables x; y is the actual value of the resultant attribute; y_x is the theoretical value of the resultant attribute; r is the paired correlation coefficient.

4) F_table(α; f₁; f₂),

where f₁ = k and f₂ = n − k − 1 are the numbers of degrees of freedom (tabulated values).

5) If F_calc > F_table, the regression equation is chosen correctly and can be applied in practice.

If F_calc ≤ F_table, the regression equation is considered insignificant.
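Formula (6.8) and the comparison in step 5 fit in a few lines; the tabulated value comes from the F-distribution quantile:

```python
from scipy import stats

def equation_is_significant(r_square, n, k=1, alpha=0.05):
    """Fisher's F-test for the regression equation as a whole."""
    f_calc = (r_square / (1.0 - r_square)) * (n - k - 1) / k
    f_table = stats.f.ppf(1 - alpha, dfn=k, dfd=n - k - 1)
    return f_calc > f_table
```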
6. The main indicator reflecting the quality of the regression analysis is the coefficient of determination (R²).

The coefficient of determination shows what proportion of the variation of the dependent variable y is accounted for in the analysis and caused by the influence of the factors included in it.

The coefficient of determination (R²) takes values in the range [0; 1]. The regression equation is of good quality if R² ≥ 0.8.

The coefficient of determination is equal to the square of the correlation coefficient, i.e. R² = r².
Example 6.1. Based on the following data, construct and analyze the regression equation:
Solution.
1) Calculate the correlation coefficient: r_xy = 0.47. The relationship between the attributes is direct and moderate.
2) Build a paired linear regression equation.
2.1) Make a calculation table.
№ | x | y | xy | x² | y_x | (y_x − ȳ)² | (y − y_x)²
---|---|---|---|---|---|---|---
1 | | | | | 55.89 | 47.54 | 65.70
2 | | | | | 45.07 | 15.42 | 222.83
3 | | | | | 54.85 | 34.19 | 8.11
4 | | | | | 51.36 | 5.55 | 11.27
5 | | | | | 42.28 | 45.16 | 13.84
6 | | | | | 47.69 | 1.71 | 44.77
7 | | | | | 45.86 | 9.87 | 192.05
Sum | | | | | | 159.45 | 558.55
Average | | | | 77519.6 | | 22.78 | 79.79
2.2) Using the table totals, we obtain the parameter estimates a = 25.17 and b = 0.087.

The paired linear regression equation: y_x = 25.17 + 0.087x.
3) Find the theoretical values y_x by substituting the actual values of x into the regression equation.
4) Plot the graphs of the actual values y and the theoretical values y_x of the resultant attribute (Figure 6.1). The noticeable discrepancy between them is explained by the weak relationship between the attributes (r_xy = 0.47) and the small number of observations.
7) Calculate the coefficient of determination: R² = 0.47² = 0.22. The constructed equation is of poor quality.
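The figure 0.22 can also be checked directly from the table totals, assuming (as the numbers bear out) that the 159.45 column is the explained sum of squares Σ(y_x − ȳ)² and the 558.55 column is the residual sum of squares:

```python
ss_explained = 159.45  # sum of (y_x - y_mean)^2 from the table
ss_residual = 558.55   # sum of (y - y_x)^2 from the table

r_square = ss_explained / (ss_explained + ss_residual)
print(round(r_square, 2))  # 0.22, consistent with r_xy = 0.47
```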
Because the calculations in regression analysis are quite voluminous, it is recommended to use specialized programs ("Statistica 10", SPSS, etc.).
Figure 6.2 shows a table with the results of the regression analysis carried out using the program "Statistica 10".
Figure 6.2. The results of the regression analysis carried out using the program "Statistica 10"