Regression in R

Author: Internet

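Regression analysis is a core topic in statistics. It is typically used to predict a response variable from one or more explanatory variables. Effective regression analysis is an interactive, holistic, multi-step process rather than a single technique.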

OLS regression

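To interpret the coefficients of an OLS model properly, the data must satisfy the standard statistical assumptions: normality of the residuals, independence of the observations, a linear relationship between the response and the predictors, and homoscedasticity (constant residual variance).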
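
One common way to check such assumptions is to look at the diagnostic plots of a fitted model and to test the residuals for normality. The following is a minimal sketch, not part of the original walkthrough. It assumes the setosa subset of iris used later in this post, and the names fit and sw are chosen here for illustration only.

```r
# Sketch: inspect OLS assumptions on a simple model (setosa subset of iris).
setosa <- subset(iris, Species == "setosa")   # base-R way to select one species
fit <- lm(Sepal.Width ~ Sepal.Length, data = setosa)

# Four standard diagnostic plots:
#   residuals vs fitted  -> linearity
#   normal Q-Q           -> normality of residuals
#   scale-location       -> constant variance
#   residuals vs leverage -> influential observations
op <- par(mfrow = c(2, 2))
plot(fit)
par(op)  # restore the previous plotting layout

# Shapiro-Wilk test as a numeric check of residual normality.
sw <- shapiro.test(residuals(fit))
print(sw)
```

A large p-value from the Shapiro-Wilk test gives no evidence against normality. The plots still deserve a visual check, since a single test cannot cover all four assumptions.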

Simple linear regression

#import data
> library(dplyr)  #filter() below is dplyr::filter, not stats::filter
> setosa <- filter(iris,Species=="setosa")

#view the structure of data
> names(setosa)
[1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"  "Species"     
> summary(setosa)
  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width          Species  
 Min.   :4.300   Min.   :2.300   Min.   :1.000   Min.   :0.100   setosa    :50  
 1st Qu.:4.800   1st Qu.:3.200   1st Qu.:1.400   1st Qu.:0.200   versicolor: 0  
 Median :5.000   Median :3.400   Median :1.500   Median :0.200   virginica : 0  
 Mean   :5.006   Mean   :3.428   Mean   :1.462   Mean   :0.246                  
 3rd Qu.:5.200   3rd Qu.:3.675   3rd Qu.:1.575   3rd Qu.:0.300                  
 Max.   :5.800   Max.   :4.400   Max.   :1.900   Max.   :0.600    
> head(setosa)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
> str(setosa)
'data.frame':	50 obs. of  5 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
#draw a scatter plot of the two variables
library(car)
scatterplot(Sepal.Width~Sepal.Length,data = setosa,
            spread=FALSE,smoother.args=list(lty=2),pch=19,
            xlab = "Sepal.Length",
            ylab = "Sepal.Width",
            main="Setosa")

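[Figure: scatter plot of Sepal.Width against Sepal.Length for setosa]
The plot suggests a roughly linear relationship between sepal length and sepal width, so the next step is to fit a model.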
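
As a numeric complement to the scatter plot, the strength of the linear relationship can be estimated with a Pearson correlation test. This is only a sketch added for illustration. It rebuilds the setosa subset with base R so that it runs on its own, and the names ct and r are hypothetical.

```r
# Sketch: quantify the linear association seen in the scatter plot.
setosa <- subset(iris, Species == "setosa")

# Pearson correlation with a test of H0: correlation = 0.
ct <- cor.test(setosa$Sepal.Length, setosa$Sepal.Width)
print(ct)

# For a simple linear regression, the squared correlation equals R-squared.
r <- unname(ct$estimate)
print(r^2)
```

The squared correlation should agree with the Multiple R-squared reported by summary() for the simple model fitted below.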

#create a simple OLS model
> fit1 <- lm(Sepal.Width~Sepal.Length,data = setosa)

#return the results of the model
> summary(fit1)

Call:
lm(formula = Sepal.Width ~ Sepal.Length, data = setosa)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.72394 -0.18273 -0.00306  0.15738  0.51709 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)   -0.5694     0.5217  -1.091    0.281    
Sepal.Length   0.7985     0.1040   7.681 6.71e-10 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.2565 on 48 degrees of freedom
Multiple R-squared:  0.5514,	Adjusted R-squared:  0.542 
F-statistic: 58.99 on 1 and 48 DF,  p-value: 6.71e-10

#return the fitted (predicted) values
> fitted(fit1)
       1        2        3        4        5        6        7        8        9       10       11 
3.503062 3.343356 3.183650 3.103798 3.423209 3.742620 3.103798 3.423209 2.944092 3.343356 3.742620 
      12       13       14       15       16       17       18       19       20       21       22 
3.263503 3.263503 2.864239 4.062031 3.982179 3.742620 3.503062 3.982179 3.503062 3.742620 3.503062 
      23       24       25       26       27       28       29       30       31       32       33 
3.103798 3.503062 3.263503 3.423209 3.423209 3.582914 3.582914 3.183650 3.263503 3.742620 3.582914 
      34       35       36       37       38       39       40       41       42       43       44 
3.822473 3.343356 3.423209 3.822473 3.343356 2.944092 3.503062 3.423209 3.023945 2.944092 3.423209 
      45       46       47       48       49       50 
3.503062 3.263503 3.503062 3.103798 3.662767 3.423209 
#return residuals
> residuals(fit1)
          1           2           3           4           5           6           7           8 
-0.00306166 -0.34335600  0.01634966 -0.00379751  0.17679117  0.15737985  0.29620249 -0.02320883 
          9          10          11          12          13          14          15          16 
-0.04409185 -0.24335600 -0.04262015  0.13649683 -0.26350317  0.13576098 -0.06203147  0.41782136 
         17          18          19          20          21          22          23          24 
 0.15737985 -0.00306166 -0.18217864  0.29693834 -0.34262015  0.19693834  0.49620249 -0.20306166 
         25          26          27          28          29          30          31          32 
 0.13649683 -0.42320883 -0.02320883 -0.08291449 -0.18291449  0.01634966 -0.16350317 -0.34262015 
         33          34          35          36          37          38          39          40 
 0.51708551  0.37752702 -0.24335600 -0.22320883 -0.32247298  0.25664400  0.05590815 -0.10306166 
         41          42          43          44          45          46          47          48 
 0.07679117 -0.72394468  0.25590815  0.07679117  0.29693834 -0.26350317  0.29693834  0.09620249 
         49          50 
 0.03723268 -0.12320883 
#plot the results
> plot(setosa$Sepal.Length,setosa$Sepal.Width,
+      xlab = "Sepal.Length",
+      ylab = "Sepal.Width",
+      main="Setosa")
> abline(fit1,col="red")

[Figure: setosa observations with the fitted regression line in red]
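
The fitted values and residuals returned above are two halves of the observed response, and the same model can also predict for flowers that are not in the data. The sketch below illustrates both points. The data frame new_flowers and its values are made up for the example.

```r
# Sketch: relate fitted(), residuals() and predict() for the simple model.
setosa <- subset(iris, Species == "setosa")
fit1 <- lm(Sepal.Width ~ Sepal.Length, data = setosa)

# Fitted values plus residuals reproduce the observed sepal widths.
recon <- fitted(fit1) + residuals(fit1)
print(all.equal(unname(recon), setosa$Sepal.Width))

# Predict sepal width for new, hypothetical sepal lengths,
# with 95% prediction intervals (columns fit, lwr, upr).
new_flowers <- data.frame(Sepal.Length = c(4.5, 5.0, 5.5))
pred <- predict(fit1, newdata = new_flowers, interval = "prediction")
print(pred)
```

For a sepal length of 5.0, the point prediction matches the fitted value of about 3.42 that the transcript reports for observations with that length.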

Polynomial regression

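Polynomial regression only requires adding higher-order terms of the predictor to the simple regression formula. Each such term is wrapped in I() so that ^ is evaluated as arithmetic rather than as a formula operator.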
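
To illustrate why the I() wrapper matters, the sketch below compares three ways of writing a quadratic model. It is an illustration under the assumption that the setosa subset is used. The names fit_I, fit_poly, and fit_noI are hypothetical and do not appear in the original post.

```r
# Sketch: three ways to write a quadratic term in an lm() formula.
setosa <- subset(iris, Species == "setosa")

# 1) Explicit squared term protected by I().
fit_I <- lm(Sepal.Width ~ Sepal.Length + I(Sepal.Length^2), data = setosa)

# 2) poly() with raw = TRUE builds the same x and x^2 columns.
fit_poly <- lm(Sepal.Width ~ poly(Sepal.Length, 2, raw = TRUE), data = setosa)
same_fit <- all.equal(unname(fitted(fit_I)), unname(fitted(fit_poly)))
print(same_fit)

# 3) Without I(), ^ is interpreted by the formula parser and the squared
#    term collapses to Sepal.Length, so only two coefficients are estimated.
fit_noI <- lm(Sepal.Width ~ Sepal.Length + Sepal.Length^2, data = setosa)
print(length(coef(fit_noI)))
```

The first two forms describe the same model. The third silently falls back to the simple linear model, which is an easy mistake to miss.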

> fit2 <- lm(Sepal.Width~Sepal.Length+I(Sepal.Length^2),data = setosa)
> summary(fit2)

Call:
lm(formula = Sepal.Width ~ Sepal.Length + I(Sepal.Length^2), 
    data = setosa)

Residuals:
    Min      1Q  Median      3Q     Max 
-0.7429 -0.1845  0.0112  0.1552  0.5283 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
(Intercept)         2.5020     5.9163   0.423    0.674
Sepal.Length       -0.4295     2.3586  -0.182    0.856
I(Sepal.Length^2)   0.1222     0.2344   0.521    0.605

Residual standard error: 0.2585 on 47 degrees of freedom
Multiple R-squared:  0.554,	Adjusted R-squared:  0.535 
F-statistic: 29.19 on 2 and 47 DF,  p-value: 5.76e-09


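For this dataset, polynomial regression brings no noticeable improvement: the quadratic term is not significant (p = 0.605), and the adjusted R-squared drops slightly from 0.542 to 0.535.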
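
The comparison between the linear and the quadratic model can also be made formally, since the two models are nested. The sketch below uses anova() and AIC() for that purpose. It is one possible check added for illustration, and the name cmp is hypothetical.

```r
# Sketch: formally compare the nested linear and quadratic models.
setosa <- subset(iris, Species == "setosa")
fit1 <- lm(Sepal.Width ~ Sepal.Length, data = setosa)
fit2 <- lm(Sepal.Width ~ Sepal.Length + I(Sepal.Length^2), data = setosa)

# F-test for the extra quadratic term; with one added term its p-value
# equals the p-value of that term's t-test in summary(fit2).
cmp <- anova(fit1, fit2)
print(cmp)

# AIC penalizes the extra parameter; lower is better.
print(AIC(fit1, fit2))
```

A non-significant F-test together with a lower AIC for fit1 supports keeping the simpler linear model.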

Multiple linear regression

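This section uses the state.x77 dataset, which holds statistics for the 50 US states, to explore how a state's murder rate relates to other factors. The dataset ships with base R as a numeric matrix, which is why it is converted to a data frame before modeling.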
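
A quick way to get an overview of the dataset is to inspect it directly in R. This sketch only uses base functions, and the variable name cols is chosen here for illustration.

```r
# Sketch: inspect state.x77 before modeling.
print(class(state.x77))      # a matrix, not a data frame
print(dim(state.x77))        # rows are states, columns are variables

cols <- colnames(state.x77)  # available variables
print(cols)

print(head(state.x77, 3))    # first three states
# The built-in help page describes each column: ?state.x77
```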

#extract data
> states_data <- as.data.frame(state.x77[,c("Murder","Population","Illiteracy",
 +                                           "Income","Frost")])
> #correlation matrix
> cor(states_data)
               Murder Population Illiteracy     Income      Frost
Murder      1.0000000  0.3436428  0.7029752 -0.2300776 -0.5388834
Population  0.3436428  1.0000000  0.1076224  0.2082276 -0.3321525
Illiteracy  0.7029752  0.1076224  1.0000000 -0.4370752 -0.6719470
Income     -0.2300776  0.2082276 -0.4370752  1.0000000  0.2262822
Frost      -0.5388834 -0.3321525 -0.6719470  0.2262822  1.0000000
> #view the relationship between variables
> library(car)
> scatterplotMatrix(states_data,spread=FALSE,
 +                   smoother.args=list(lty=2),pch=19,
 +                   main="Scatter Plot Matrix")
There were 50 or more warnings (use warnings() to see the first 50)

[Figure: scatter plot matrix of Murder, Population, Illiteracy, Income, and Frost]

#create a model
> fit3 <- lm(Murder~Population+Illiteracy+Income+Frost,data = states_data)
> summary(fit3)

Call:
lm(formula = Murder ~ Population + Illiteracy + Income + Frost, 
    data = states_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.7960 -1.6495 -0.0811  1.4815  7.6210 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 1.235e+00  3.866e+00   0.319   0.7510    
Population  2.237e-04  9.052e-05   2.471   0.0173 *  
Illiteracy  4.143e+00  8.744e-01   4.738 2.19e-05 ***
Income      6.442e-05  6.837e-04   0.094   0.9253    
Frost       5.813e-04  1.005e-02   0.058   0.9541    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.535 on 45 degrees of freedom
Multiple R-squared:  0.567,	Adjusted R-squared:  0.5285 
F-statistic: 14.73 on 4 and 45 DF,  p-value: 9.133e-08
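
In a multiple regression, each coefficient describes the expected change in the response for a one-unit change in that predictor while the other predictors are held constant. For example, the Illiteracy estimate of about 4.14 suggests that a one-point rise in the illiteracy percentage is associated with roughly 4.14 more murders per 100,000 population. The sketch below adds confidence intervals to that reading. It is an illustration only, and the names ci and sig are hypothetical.

```r
# Sketch: confidence intervals for the multiple regression coefficients.
states_data <- as.data.frame(state.x77[, c("Murder", "Population", "Illiteracy",
                                           "Income", "Frost")])
fit3 <- lm(Murder ~ Population + Illiteracy + Income + Frost, data = states_data)

# 95% confidence interval for every coefficient.
ci <- confint(fit3, level = 0.95)
print(ci)

# Terms whose interval excludes zero are the ones that are
# significant at the 5% level in summary(fit3).
sig <- rownames(ci)[ci[, 1] > 0 | ci[, 2] < 0]
print(sig)
```

The intervals for Income and Frost include zero, which is consistent with their large p-values in the summary table.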

Source: https://blog.csdn.net/qingchen98/article/details/106993428