【Andrew Gelman Data Analysis Using Regression and Multilevel/Hierarchical Models】4.9 Exercises: Solutions




  1. Logarithmic transformation and regression: consider the following regression:
    log(weight) = −3.5+2.0 log(height) + error
    with errors that have standard deviation 0.25. Weights are in pounds and heights are in inches.
    (a) Fill in the blanks: approximately 68% of the persons will have weights within a factor of ___ and of ___ their predicted values from the regression.
    (b) Draw the regression line and scatterplot of log(weight) versus log(height) that make sense and are consistent with the fitted model. Be sure to label the axes of your graph.

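(a) On the log scale, about 68% of the errors fall within −0.25 and +0.25 of the predicted value. Exponentiating, about 68% of persons have weights within a factor of exp(−0.25) ≈ 0.78 and exp(0.25) ≈ 1.28 of their predicted values.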

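height <- rnorm(100, 66, 4)                              # simulated heights in inches (assumed distribution)
log.weight <- rnorm(100, -3.5 + 2.0*log(height), 0.25)   # log(weight) from the model, error sd 0.25
weight <- exp(log.weight)                                # weights in pounds
fit.1 <- lm(log(weight) ~ log(height))
plot(log(height), log(weight), xlab="log(height in inches)", ylab="log(weight in pounds)")
curve(cbind(1,x) %*% coef(fit.1), add=TRUE)              # fitted regression line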
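
As a quick check of part (a), the following sketch simulates from the stated model and counts how many simulated weights fall within the factors exp(−0.25) and exp(0.25) of their predicted values. The height distribution (mean 66 inches, sd 4) and the sample size are assumptions made only for illustration.

```r
set.seed(1)
n <- 10000
height <- rnorm(n, 66, 4)                        # assumed height distribution (inches)
pred.log.weight <- -3.5 + 2.0*log(height)        # prediction on the log scale
log.weight <- rnorm(n, pred.log.weight, 0.25)    # add errors with sd 0.25
ratio <- exp(log.weight - pred.log.weight)       # actual weight / predicted weight
lower <- exp(-0.25)
upper <- exp(0.25)
share <- mean(ratio > lower & ratio < upper)     # should be close to 0.68
print(round(c(lower = lower, upper = upper, share = share), 2))
```

The share is close to 68% because the ratio of actual to predicted weight is exp(error), and the error lies within one standard deviation of zero about 68% of the time.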



  2. The folder earnings has data from the Work, Family, and Well-Being Survey (Ross, 1990). Pull out the data on earnings, sex, height, and weight.
    (a) In R, check the dataset and clean any unusually coded data.
    (b) Fit a linear regression model predicting earnings from height. What transformation should you perform in order to interpret the intercept from this model as average earnings for people with average height?
    (c) Fit some regression models with the goal of predicting earnings from some combination of sex, height, and weight. Be sure to try various transformations and interactions that might make sense. Choose your preferred model and justify.


# Remove rows with missing values
x <- x[complete.cases(x),]

# Center height only; centering earnings as well would force the intercept to zero
m.height <- height - mean(height)
fit.1 <- lm(earning ~ m.height)   # intercept = average earnings at average height
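
The effect of centering can be verified on simulated data. This is a minimal sketch: the data, the coefficients used to generate them, and the names sim.height, sim.earning, and c.height are all hypothetical and do not come from the survey.

```r
set.seed(2)
n <- 500
sim.height <- rnorm(n, 66, 4)                                    # hypothetical heights (inches)
sim.earning <- 20000 + 1300*(sim.height - 66) + rnorm(n, 0, 15000)  # hypothetical earnings
c.height <- sim.height - mean(sim.height)                        # centered height

fit.raw <- lm(sim.earning ~ sim.height)       # intercept = prediction at height 0
fit.centered <- lm(sim.earning ~ c.height)    # intercept = prediction at mean height

print(coef(fit.raw))
print(coef(fit.centered))
print(mean(sim.earning))                      # matches the centered intercept
```

The slope is the same in both fits; only the intercept changes, and in the centered fit it equals the sample mean of the outcome.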


fit.2 <- lm(earning ~ height + weight + sex)


fit.3 <- lm(earning ~ height + weight + sex*height + sex*weight)
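
One transformation worth trying for part (c) is the log of earnings with a standardized height and a sex interaction. The sketch below uses simulated data; the variable names (male, z.height, earn) and the generating coefficients are hypothetical, and real earnings data would first need zero earnings removed before taking logs.

```r
set.seed(3)
n <- 1000
male <- rbinom(n, 1, 0.5)                         # hypothetical sex indicator
sim.height <- rnorm(n, 64 + 5*male, 3)            # hypothetical heights (inches)
earn <- exp(9.5 + 0.02*(sim.height - 66) + 0.4*male + rnorm(n, 0, 0.9))

# Standardize height so the main effects are interpretable at average height
z.height <- (sim.height - mean(sim.height)) / sd(sim.height)

fit.log <- lm(log(earn) ~ z.height * male)
print(round(coef(fit.log), 2))

# On the log scale, exp(coefficient) is a multiplicative effect on earnings
print(round(exp(coef(fit.log)["male"]), 2))
```

With the outcome on the log scale, each coefficient is read as an approximate proportional difference in earnings, which is often easier to justify than an additive difference in dollars.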



  3. Plotting linear and nonlinear regressions: we downloaded data with weight (in pounds) and age (in years) from a random sample of American adults. We first created new variables: age10 = age/10 and age10.sq = (age/10)^2, and indicators age18.29, age30.44, age45.64, and age65up for four age categories. We then fit some regressions, with the following results:
lm(formula = weight ~ age10) 
             coef.est coef.se
(Intercept)     161.0     7.3
age10             2.6     1.6
n = 2009, k = 2
residual sd = 119.7, R-Squared = 0.00

lm(formula = weight ~ age10 + age10.sq)
             coef.est coef.se
(Intercept)      96.2    19.3
age10            33.6     8.7
age10.sq         -3.2     0.9
n = 2009, k = 3
residual sd = 119.3, R-Squared = 0.01

lm(formula = weight ~ age30.44 + age45.64 + age65up)
             coef.est coef.se
(Intercept)     157.2     5.4
age30.44TRUE     19.1     7.0
age45.64TRUE     27.2     7.6
age65upTRUE       8.5     8.7
n = 2009, k = 4
residual sd = 119.4, R-Squared = 0.01

(a) On a graph of weights versus age (that is, weight on y-axis, age on x-axis), draw the fitted regression line from the first model.
(b) On the same graph, draw the fitted regression line from the second model.
(c) On another graph with the same axes and scale, draw the fitted regression line from the third model. (It will be discontinuous.)

# The fitted lines follow directly from the reported coefficients,
# so no data need to be simulated. The x-axis is age in years (age10 = age/10).

# (a) Model 1: weight = 161.0 + 2.6*age10
curve(161.0 + 2.6*(x/10), from=18, to=100, ylim=c(100,200),
      xlab="age (years)", ylab="weight (pounds)", col="blue")

# (b) Model 2: add the quadratic term age10.sq = (age/10)^2
curve(96.2 + 33.6*(x/10) - 3.2*(x/10)^2, add=TRUE, col="red")

# (c) Model 3: one fitted level per age category, drawn as a step function
# on a new graph with the same axes and scale.
# Ages are treated as whole years, so each category ends where the next begins.
breaks <- c(18, 30, 45, 65, 100)
level <- 157.2 + c(0, 19.1, 27.2, 8.5)   # baseline is the 18-29 category
plot(c(18,100), c(100,200), type="n", xlab="age (years)", ylab="weight (pounds)")
segments(breaks[1:4], level, breaks[2:5], level, col="purple")
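
To see how the three models differ, it can help to evaluate each one at a few ages using the reported coefficients. The ages chosen here (20, 40, 60, 80) are arbitrary illustration points, one per age category.

```r
age <- c(20, 40, 60, 80)

# Model 1: linear in age10
pred.1 <- 161.0 + 2.6*(age/10)

# Model 2: quadratic in age10
pred.2 <- 96.2 + 33.6*(age/10) - 3.2*(age/10)^2

# Model 3: baseline (ages 18-29) plus the offset for the age category
offsets <- c(0, 19.1, 27.2, 8.5)
category <- findInterval(age, c(18, 30, 45, 65))
pred.3 <- 157.2 + offsets[category]

print(data.frame(age, pred.1, pred.2, pred.3))
```

The quadratic and categorical models both rise through middle age and fall for the oldest group, a pattern the straight line cannot show.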



  4. Logarithmic transformations: the folder pollution contains mortality rates and various environmental factors from 60 U.S. metropolitan areas (see McDonald and Schwing, 1973). For this exercise we shall model mortality rate given nitric oxides, sulfur dioxide, and hydrocarbons as inputs. This model is an extreme oversimplification as it combines all sources of mortality and does not adjust for crucial factors such as age and smoking. We use it to illustrate log transformations in regression.
    (a) Create a scatterplot of mortality rate versus level of nitric oxides. Do you think linear regression will fit these data well? Fit the regression and evaluate a residual plot from the regression.
    (b) Find an appropriate transformation that will result in data more appropriate for linear regression. Fit a regression to the transformed data and evaluate the new residual plot.
    (c) Interpret the slope coefficient from the model you chose in (b).
    (d) Now fit a model predicting mortality rate using levels of nitric oxides, sulfur dioxide, and hydrocarbons as inputs. Use appropriate transformations when helpful. Plot the fitted regression model and interpret the coefficients.
    (e) Cross-validate: fit the model you chose above to the first half of the data and then predict for the second half. (You used all the data to construct the model in (d), so this is not really cross-validation, but it gives a sense of how the steps of cross-validation can be implemented.)
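
The steps in (a), (b), and (e) can be sketched as follows. This uses simulated data rather than the pollution file: the names mort and nox, the generating coefficients, and the split into the first and last 30 rows are assumptions for illustration only.

```r
set.seed(4)
n <- 60
nox <- exp(rnorm(n, 2.5, 1.2))                  # simulated, strongly right-skewed input
mort <- 900 + 15*log(nox) + rnorm(n, 0, 60)     # simulated outcome
dat <- data.frame(mort, nox)

# (a) and (b): fit on the raw scale and on the log scale
fit.raw <- lm(mort ~ nox, data=dat)
fit.log <- lm(mort ~ log(nox), data=dat)

# Residual plot for the log-scale fit
plot(fitted(fit.log), resid(fit.log), xlab="fitted value", ylab="residual")
abline(h=0, lty=2)

# (e): fit to the first half, predict the second half
train <- 1:30
test <- 31:60
fit.half <- lm(mort ~ log(nox), data=dat[train,])
pred <- predict(fit.half, newdata=dat[test,])
rmse <- sqrt(mean((dat$mort[test] - pred)^2))   # out-of-sample prediction error
print(round(rmse, 1))
```

With the input on the log scale, the slope is interpreted per proportional change: a 1% increase in the input corresponds to a change of roughly slope/100 in the outcome.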


  5. Special-purpose transformations: for a study of congressional elections, you would like a measure of the relative amount of money raised by each of the two major-party candidates in each district. Suppose that you know the amount of money raised by each candidate; label these dollar values Di and Ri. You would like to combine these into a single variable that can be included as an input variable into a model predicting vote share for the Democrats.
    (a) Discuss the advantages and disadvantages of the following measures:
    • The simple difference, Di − Ri
    • The ratio, Di/Ri
    • The difference on the logarithmic scale, log Di − log Ri
    • The relative proportion, Di/(Di + Ri).
    (b) Propose an idiosyncratic transformation (as in the example on page 65) and discuss the advantages and disadvantages of using it as a regression input.
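
A small numeric comparison makes the behavior of the four measures concrete. The dollar amounts below are made up for illustration: the first three districts share the same 2:1 ratio at different spending levels, and the fourth reverses the ratio.

```r
D <- c(1e5, 1e6, 2e6, 5e4)   # hypothetical Democratic fundraising
R <- c(5e4, 5e5, 1e6, 1e5)   # hypothetical Republican fundraising

measures <- data.frame(
  diff     = D - R,             # depends on the overall level of spending
  ratio    = D / R,             # asymmetric: (0,1) versus (1,Inf)
  log.diff = log(D) - log(R),   # symmetric around 0
  prop     = D / (D + R)        # bounded in [0,1], symmetric around 0.5
)
print(measures)
```

The simple difference varies twentyfold across the first three districts even though the relative advantage is identical. The ratio and the log difference are undefined when a candidate raises nothing, while the proportion is undefined only when both raise nothing.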

simple difference:


difference on the logarithmic scale:

relative proportion:

  6. An economist runs a regression examining the relations between the average price of cigarettes, P, and the quantity purchased, Q, across a large sample of counties in the United States, assuming the following functional form, log Q = α+β log P. Suppose the estimate for β is 0.3. Interpret this coefficient.
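
In a log-log model the coefficient is an elasticity: since Q is proportional to P^β, multiplying P by some factor multiplies Q by that factor raised to the power β. A short calculation, with the helper name pct.change.Q chosen here only for illustration:

```r
beta <- 0.3

# Percent change in Q implied by a given percent change in P under Q ~ P^beta
pct.change.Q <- function(pct.change.P) {
  100 * ((1 + pct.change.P/100)^beta - 1)
}

print(round(pct.change.Q(1), 2))    # about 0.3
print(round(pct.change.Q(10), 2))   # about 2.9
```

So a 1% difference in price corresponds to a difference of about 0.3% in quantity purchased, in the same direction. The positive sign is worth noting, since demand would usually be expected to fall as price rises.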

