[Andrew Gelman, Data Analysis Using Regression and Multilevel/Hierarchical Models] Section 4.9 exercises: solutions

2021-04-25 10:57:55 Author: Internet

(Partially incomplete)

Exercise 1

Logarithmic transformation and regression: consider the following regression:
log(weight) = −3.5+2.0 log(height) + error
with errors that have standard deviation 0.25. Weights are in pounds and heights are in inches.
(a) Fill in the blanks: approximately 68% of the persons will have weights within a factor of ___ and of ___ their predicted values from the regression.
(b) Draw the regression line and scatterplot of log(weight) versus log(height) that make sense and are consistent with the fitted model. Be sure to label the axes of your graph.

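(a)

Within a factor of exp(−0.25) ≈ 0.78 and of exp(0.25) ≈ 1.28. About 68% of the errors fall within ±0.25 (one standard deviation) on the log scale, and exponentiating turns that interval into multiplicative factors on the weight scale.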
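Because the errors live on the log scale, one standard deviation up or down becomes a multiplicative factor on the weight scale. A minimal check of that arithmetic, where `sigma` and `factors` are illustrative names only:

```r
# Residual sd on the log scale, taken from the exercise statement
sigma <- 0.25
# About 68% of log(weight) values lie within +/- 1 sd of the prediction,
# so on the original scale weights lie within these multiplicative factors
factors <- exp(c(-sigma, sigma))
print(round(factors, 2))
```

So roughly 68% of people have weights between about 78% and 128% of their predicted values.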

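(b)

# Simulate heights in inches (mean 66 and sd 4 are assumed values)
n <- 100
height <- rnorm(n, 66, 4)
# Simulate log(weight) from the model, then convert to pounds
log.weight <- rnorm(n, -3.5 + 2.0*log(height), 0.25)
weight <- exp(log.weight)
# Scatterplot of the simulated data, with labeled axes
plot(log(height), log(weight), xlab="log(height in inches)", ylab="log(weight in pounds)")
# Refit the model to the simulated data
fit.1 <- lm(log(weight) ~ log(height))
# Add the fitted regression line
curve(cbind(1,x) %*% coef(fit.1), add=TRUE)

The simulated log(weight) values cluster around 4.9 (about 130 pounds) and scatter around the line with standard deviation 0.25, which is consistent with the fitted model.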

Exercise 2

The folder earnings has data from the Work, Family, and Well-Being Survey (Ross, 1990). Pull out the data on earnings, sex, height, and weight.
(a) In R, check the dataset and clean any unusually coded data.
(b) Fit a linear regression model predicting earnings from height. What transformation should you perform in order to interpret the intercept from this model as average earnings for people with average height?
(c) Fit some regression models with the goal of predicting earnings from some combination of sex, height, and weight. Be sure to try various transformations and interactions that might make sense. Choose your preferred model and justify.

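This is a simple regression. No data are simulated this time; the analysis approach is as follows:

(a)

Drop rows containing NA: data <- na.omit(data)
Equivalently, drop rows with missing values: x <- x[complete.cases(x),]

(b)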

# Center height at its mean; leave earnings on the original scale
c.height <- height - mean(height)
# Fit the model
fit.1 <- lm(earning ~ c.height)

When height equals its mean, c.height = 0, so the intercept is the predicted (average) earnings for people of average height.
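The effect of centering can be checked on simulated data. The sketch below is only an illustration: the heights, the earnings relation, and the names `c.height` and `fit.c` are made up and do not come from the survey data.

```r
set.seed(1)
n <- 500
# Hypothetical heights in inches and a made-up earnings relation
height <- rnorm(n, 66, 4)
earning <- 20000 + 1300 * (height - 60) + rnorm(n, 0, 15000)

# Center the predictor only; the outcome stays on its original scale
c.height <- height - mean(height)
fit.c <- lm(earning ~ c.height)
fit.raw <- lm(earning ~ height)

# With a centered predictor, the least-squares intercept equals mean(earning)
intercept <- unname(coef(fit.c)[1])
print(c(intercept = intercept, mean.earning = mean(earning)))
```

Centering changes only the intercept; the slope on height is the same in `fit.c` and `fit.raw`.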

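(c)

Start with a simple additive model. R-squared will usually not be very high; check which predictors are clearly different from zero.

# Start with a simple model
fit.2 <- lm(earning ~ height + weight + sex)

Next, consider interactions between the clearly relevant predictors and the others, for example sex, coded sex=1 for men and sex=0 for women.
Consider the interactions of sex with height and with weight:

fit.3 <- lm(earning ~ sex*height + sex*weight)

In this model the coefficients on height and weight are the effects of height and weight on earnings for women.
The height coefficient plus the sex:height coefficient is the effect of height on earnings for men; the same applies to weight.
I think a height*weight interaction would be hard to interpret.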
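How the interaction coefficients combine can be illustrated on simulated data. This is a sketch under stated assumptions: the data are made up, sex is coded 1 for men and 0 for women as above, and weight is left out for brevity.

```r
set.seed(2)
n <- 400
sex <- rbinom(n, 1, 0.5)                 # assumed coding: 1 = male, 0 = female
height <- rnorm(n, 64 + 5 * sex, 3)      # hypothetical heights in inches
earning <- 15000 + 500 * height + 300 * sex * height + rnorm(n, 0, 8000)

# height * sex expands to height + sex + height:sex
fit.int <- lm(earning ~ height * sex)
b <- coef(fit.int)
slope.female <- unname(b["height"])
slope.male <- unname(b["height"] + b["height:sex"])

# The same slopes come out of separate regressions within each sex
sep.female <- unname(coef(lm(earning ~ height, subset = sex == 0))["height"])
sep.male <- unname(coef(lm(earning ~ height, subset = sex == 1))["height"])
print(c(female = slope.female, male = slope.male))
```

The slope for men is the sum of the main effect and the interaction, and it matches the slope from fitting men alone.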

Exercise 3

Plotting linear and nonlinear regressions: we downloaded data with weight (in pounds) and age (in years) from a random sample of American adults. We first created new variables: age10 = age/10 and age10.sq = (age/10)^2, and indicators age18.29, age30.44, age45.64, and age65up for four age categories. We then fit some regressions, with the following results:

lm(formula = weight ~ age10) 
             coef.est coef.se
(Intercept)     161.0     7.3
age10             2.6     1.6
n = 2009, k = 2
residual sd = 119.7, R-Squared = 0.00


lm(formula = weight ~ age10 + age10.sq)
             coef.est coef.se
(Intercept)      96.2    19.3
age10            33.6     8.7
age10.sq         -3.2     0.9
n = 2009, k = 3
residual sd = 119.3, R-Squared = 0.01


lm(formula = weight ~ age30.44 + age45.64 + age65up)
             coef.est coef.se
(Intercept)     157.2     5.4
age30.44TRUE     19.1     7.0
age45.64TRUE     27.2     7.6
age65upTRUE       8.5     8.7
n = 2009, k = 4
residual sd = 119.4, R-Squared = 0.01

(a) On a graph of weights versus age (that is, weight on y-axis, age on x-axis), draw the fitted regression line from the first model.
(b) On the same graph, draw the fitted regression line from the second model.
(c) On another graph with the same axes and scale, draw the fitted regression line from the third model. (It will be discontinuous.)

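(a)

# Simulate data from model 1 using its reported estimates
n.sim <- 2009
mu <- rnorm(n.sim,0,119.7) # residuals, using the residual sd of model 1
intercept <- 161.0
age.coef <- 2.6
# Simulate the predictor
# Assume age is uniform between 18 and 100, so age10 runs from 1.8 to 10
age <- runif(n.sim,18,100)
age10 <- age/10
# Generate weight from the regression equation
weight <- intercept + age.coef*age10 + mu
# Scatterplot with the fitted regression line
plot(age10,weight,col="gray",xlab="age10 (age in decades)",ylab="weight (pounds)")
fit.1 <- lm(weight ~ age10)
curve(cbind(1,x) %*% coef(fit.1), add=TRUE,col="blue")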

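Some simulated weights are negative: with normal errors and a residual sd of about 120 pounds, the model implies impossible values, so it describes the data poorly.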

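(b)

#****************(b)******************************
# The simulated data come from the linear model 1, so draw model 2
# from its reported coefficients (x is age10, matching the plot above)
curve(96.2 + 33.6*x - 3.2*x^2, add=TRUE, col="red")

The fitted curve is slightly parabolic, peaking at about age 52.5, but the curvature is hard to see against such a large residual variance.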

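(c)

The plot for this part makes the relationship between age and weight more visible, which suggests treating age as a categorical rather than a continuous predictor here.

#****************(c)*******************************
# Age-group boundaries on the age10 scale (upper limit 10 matches the simulated ages)
cuts <- c(1.8, 3.0, 4.5, 6.5, 10)
# Fitted weight in each group, from the reported coefficients of model 3
level <- 157.2 + c(0, 19.1, 27.2, 8.5)
# New plot with the same axes, then one horizontal segment per age group
plot(age10,weight,col="gray",xlab="age10 (age in decades)",ylab="weight (pounds)")
segments(cuts[-5], level, cuts[-1], level, col="purple")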
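The exercise only asks for the fitted lines, so they can also be drawn directly from the reported coefficients, with age in years on the x-axis as the exercise requests. A minimal sketch: the function names `fit1`, `fit2`, `fit3` and the plotting range of 18 to 90 years are assumptions made for illustration.

```r
# Coefficients copied from the three regression outputs in the exercise
fit1 <- function(age) 161.0 + 2.6 * (age / 10)
fit2 <- function(age) 96.2 + 33.6 * (age / 10) - 3.2 * (age / 10)^2
fit3 <- function(age) {
  157.2 + 19.1 * (age >= 30 & age < 45) +
    27.2 * (age >= 45 & age < 65) + 8.5 * (age >= 65)
}

# (a) and (b): both curves on one graph of weight versus age
curve(fit1(x), from = 18, to = 90, ylim = c(130, 200),
      xlab = "age (years)", ylab = "weight (pounds)", col = "blue")
curve(fit2(x), add = TRUE, col = "red")

# (c): a second graph with the same axes; one flat segment per age group
cuts <- c(18, 30, 45, 65, 90)
plot(NA, xlim = c(18, 90), ylim = c(130, 200),
     xlab = "age (years)", ylab = "weight (pounds)")
segments(cuts[-5], fit3(cuts[-5]), cuts[-1], fit3(cuts[-5]), col = "purple")
```

Drawing separate segments, rather than one connected curve, keeps the jumps between age groups visible as discontinuities.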


Exercise 4

Logarithmic transformations: the folder pollution contains mortality rates and various environmental factors from 60 U.S. metropolitan areas (see McDonald and Schwing, 1973). For this exercise we shall model mortality rate given nitric oxides, sulfur dioxide, and hydrocarbons as inputs. This model is an extreme oversimplification as it combines all sources of mortality and does not adjust for crucial factors such as age and smoking. We use it to illustrate log transformations in regression.
(a) Create a scatterplot of mortality rate versus level of nitric oxides. Do you think linear regression will fit these data well? Fit the regression and evaluate a residual plot from the regression.
(b) Find an appropriate transformation that will result in data more appropriate for linear regression. Fit a regression to the transformed data and evaluate the new residual plot.
(c) Interpret the slope coefficient from the model you chose in (b).
(d) Now fit a model predicting mortality rate using levels of nitric oxides, sulfur dioxide, and hydrocarbons as inputs. Use appropriate transformations when helpful. Plot the fitted regression model and interpret the coefficients.
(e) Cross-validate: fit the model you chose above to the first half of the data and then predict for the second half. (You used all the data to construct the model in (d), so this is not really cross-validation, but it gives a sense of how the steps of cross-validation can be implemented.)
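The split-half procedure described in (e) can be sketched as follows. This is not a solution to the exercise: the data are simulated stand-ins, and the column names `mort` and `nox` as well as the log transformation are assumptions chosen for illustration.

```r
set.seed(3)
# Simulated stand-in for 60 metropolitan areas (not the real pollution data)
n <- 60
nox <- exp(rnorm(n, 2.5, 1))
mort <- 900 + 15 * log(nox) + rnorm(n, 0, 50)
dat <- data.frame(mort, nox)

# Fit on the first half, predict the second half
train <- dat[1:(n / 2), ]
holdout <- dat[(n / 2 + 1):n, ]
fit.half <- lm(mort ~ log(nox), data = train)
pred <- predict(fit.half, newdata = holdout)

# Root mean squared prediction error on the held-out half
rmse <- sqrt(mean((holdout$mort - pred)^2))
print(rmse)
```

Comparing `rmse` with the residual sd of the fitted model gives a rough sense of how well the model predicts data it was not fit to.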

Exercise 5

Special-purpose transformations: for a study of congressional elections, you would like a measure of the relative amount of money raised by each of the two major-party candidates in each district. Suppose that you know the amount of money raised by each candidate; label these dollar values Di and Ri. You would like to combine these into a single variable that can be included as an input variable into a model predicting vote share for the Democrats.
(a) Discuss the advantages and disadvantages of the following measures:
• The simple difference, Di − Ri
• The ratio, Di/Ri
• The difference on the logarithmic scale, log Di − log Ri
• The relative proportion, Di/(Di + Ri).
(b) Propose an idiosyncratic transformation (as in the example on page 65) and discuss the advantages and disadvantages of using it as a regression input.

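(a)

simple difference:
The intercept is meaningful: it is the predicted Democratic vote share when the two candidates raise the same amount (Di − Ri = 0). The coefficient is easy to interpret as the change in the outcome per unit increase in the difference; here, the change in Democratic vote share for each additional dollar the Democrat raises relative to the Republican.

ratio:
Like the outcome, it is a ratio-type variable, so the size of the coefficient gives a more direct sense of the input's influence. The coefficient itself, however, is harder to interpret.

difference on the logarithmic scale:
This equals log(Di/Ri), so a 1% increase in the ratio Di/Ri corresponds to a change of about β/100 in the outcome. The coefficient is harder to interpret, but when the input varies only a little the log scale magnifies those differences and makes them easier to see.

relative proportion:
Like the outcome, it is a proportion, so the size of the coefficient gives a more direct sense of the input's influence, and the coefficient is fairly easy to interpret: it is the change in Democratic vote share when the Democrat's share of the total money raised rises by one percentage point.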
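The four measures can be compared on a few made-up districts. The dollar amounts below are hypothetical; the first two districts are mirror images of each other and the third is evenly matched.

```r
# Hypothetical fundraising totals (dollars) for three districts
D <- c(100000, 500000, 50000)
R <- c(500000, 100000, 50000)

measures <- data.frame(
  diff     = D - R,             # simple difference
  ratio    = D / R,             # ratio
  log.diff = log(D) - log(R),   # difference on the log scale
  prop     = D / (D + R)        # relative proportion
)
print(measures)
```

For the mirror-image districts, the difference and the log difference are symmetric around 0 and the proportion is symmetric around 0.5, while the ratio is not symmetric around 1 (0.2 versus 5). The ratio and the log difference are also undefined when a candidate raises nothing.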

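(b)

The discussion on page 65 of the book is a classic. It shows how the scale of measurement changes under linear, logarithmic, and square-root transformations, and when to use a continuous variable (for example, using win/loss rather than vote share as the outcome makes close races unpredictable) versus a discrete one (such as age in Exercise 3, where grouping by generation brings out the differences better).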

Exercise 6

An economist runs a regression examining the relations between the average price of cigarettes, P, and the quantity purchased, Q, across a large sample of counties in the United States, assuming the following functional form, log Q = α+β log P. Suppose the estimate for β is 0.3. Interpret this coefficient.

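Here β is an elasticity: a 1% increase in the price of cigarettes is associated with approximately a 0.3% increase in the quantity purchased.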
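The elasticity reading of β in log Q = α + β log P can be checked numerically: multiplying P by 1.01 multiplies Q by 1.01^β, whatever the value of α. A small sketch, with `price.ratio` and `pct.change` as illustrative names:

```r
beta <- 0.3                         # estimate given in the exercise
price.ratio <- 1.01                 # a 1% higher price
# On the log-log scale, Q is multiplied by price.ratio^beta
quantity.ratio <- price.ratio^beta
pct.change <- 100 * (quantity.ratio - 1)
print(round(pct.change, 2))
```

The result is very close to β itself, which is why a log-log coefficient is read as the percent change in Q per 1% change in P.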

Source: https://blog.csdn.net/tianty1121/article/details/116032673