Artificial Intelligence: Homework 2
1 Generating the artificial dataset
import numpy as np
import matplotlib.pyplot as plt
# Sizes of the validation and training sets, plus the true weights and bias
n_vali, n_train, true_w, true_b = 20, 80, [1.2, -3.4, 5.6], 5
X = np.random.normal(size=100)
X = X.reshape(-1,1) # reshape into a single column
# print(X)
poly_features = np.concatenate((X, np.power(X, 2), np.power(X, 3)), axis=1) # stack x, x^2, x^3 as feature columns
y = (true_w[0] * poly_features[:, 0] + true_w[1] * poly_features[:, 1] + true_w[2] * poly_features[:, 2] + true_b)
y += np.random.normal(scale=0.1, size=y.size)
plt.plot(X,y,'b.')
plt.show()
(Figure: scatter plot of the generated data)
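The draws above are unseeded, so each run yields a different dataset and slightly different numbers below. A seeded variant of the generation step (a sketch, not in the original; the seed value is an arbitrary choice):

# Reproducible version of the data generation using NumPy's Generator API (seed 0 is arbitrary)
rng = np.random.default_rng(0)
X_seeded = rng.normal(size=(100, 1))
poly_seeded = np.concatenate((X_seeded, X_seeded**2, X_seeded**3), axis=1)
y_seeded = poly_seeded @ np.array(true_w) + true_b + rng.normal(scale=0.1, size=100)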
2 Fitting with a cubic polynomial of the same order as the data-generating function
2.1 Print the best-fit parameter values, compare them with the true values, and comment on the result
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
lin_reg.fit(poly_features[:n_train],y[:n_train]) # fit on the first n_train samples
y_predict = lin_reg.predict(poly_features[n_train:]) # predictions on the held-out samples
plt.plot(np.sort(poly_features[n_train:,0]),y_predict[np.argsort(poly_features[n_train:,0])],'r-') # fitted curve, sorted by x for a clean line
plt.plot(X,y,'b.')
plt.show()
lin_reg.coef_,lin_reg.intercept_ # coef_: regression coefficients; intercept_: bias term
(Figure: output_5_0.png — the fitted cubic curve over the data points)
(array([ 1.21711851, -3.41096439, 5.5983987 ]), 5.01692982243803)
True parameters: [1.2, -3.4, 5.6], 5. Best-fit parameters: [1.21711851, -3.41096439, 5.5983987], 5.01692982243803.
The fitted values are quite close to the truth, as expected when fitting the correct cubic basis to lightly noised data.
2.2 Print the training and validation errors, and use them to judge whether the model underfits, overfits, or fits well
from sklearn.metrics import mean_squared_error
# Validation error (y_predict was computed on the held-out samples; the original comment mislabeled this as the training error)
mse = mean_squared_error(y_predict, y[n_train:])
mse
0.008063949551266974
# Training error (the original comment mislabeled this as the validation error)
mse2 = mean_squared_error(lin_reg.predict(poly_features[:n_train]),y[:n_train])
mse2
0.008685981589950435
Both errors are tiny and nearly equal (≈0.008-0.009), so the model fits just right.
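As a quick sanity check (a small addition, not in the original), these errors can be compared against the irreducible error implied by the noise term:

# The additive noise has std 0.1, so the best achievable MSE is about its variance, 0.01;
# training/validation MSE of roughly 0.008-0.009 sits right at that noise floor
noise_std = 0.1
print("approximate noise floor (MSE):", noise_std ** 2)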
2.3 Plot the model's learning curves on the training and validation sets; from these two curves, judge whether the model underfits, overfits, or performs well, and explain why
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
def plot_learning_curves(model, X, y):
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.5, random_state=0)
    train_errors, val_errors = [], []
    for m in range(1, len(X_train)):
        model.fit(X_train[:m], y_train[:m]) # fit on the first m training samples
        y_train_predict = model.predict(X_train[:m])
        y_val_predict = model.predict(X_val)
        train_errors.append(mean_squared_error(y_train[:m], y_train_predict))
        val_errors.append(mean_squared_error(y_val, y_val_predict))
    plt.plot(np.sqrt(train_errors), "r-+", linewidth=2, label="train") # training curve
    plt.plot(np.sqrt(val_errors), "b-", linewidth=3, label="val") # validation curve
    plt.legend(loc="upper right", fontsize=14)
    plt.xlabel("Training set size", fontsize=14)
    plt.ylabel("RMSE", fontsize=14)
plot_learning_curves(lin_reg,poly_features, y) # plot the learning curves
plt.axis([0, 80, 0, 3])
plt.show()
(Figure: output_12_0.png — learning curves of the cubic model)
As the figure shows, as the amount of training data grows, the training and validation RMSE converge to the same low value, so the model performs well.
2.4 Standardize the features before training. Print the trained model's parameters and compare them with the true values. Is the result what you expected? Analyze why.
from sklearn.preprocessing import StandardScaler
# Standardize to zero mean and unit variance
scaler = StandardScaler()
X_standard = scaler.fit_transform(X)
lin_reg_standard = LinearRegression()
poly_features_standard = np.concatenate((X_standard, np.power(X_standard, 2),np.power(X_standard, 3)),axis=1)
lin_reg_standard.fit(poly_features_standard[:n_train],y[:n_train])
lin_reg_standard.coef_,lin_reg_standard.intercept_ # (the original printed lin_reg.intercept_ here by mistake; the output below matches the standardized model)
(array([ 1.95900193, -3.80500342, 3.16362369]), 4.998577877853462)
True parameters: [1.2, -3.4, 5.6], 5.
Fitted parameters in the standardized feature space: [1.95900193, -3.80500342, 3.16362369], 4.998577877853462.
The coefficients now look quite different from the true values, which at first seems surprising, since standardization is usually said to help. But the discrepancy is expected rather than a failure: after standardization the weights multiply z = (x - μ)/σ and its powers, not x itself, so they live in a different basis and are not directly comparable to the true weights. Expanding the fitted polynomial in z back into a polynomial in x would recover coefficients close to the truth; the fitted function itself is essentially unchanged.
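A quick check of that claim (a sketch added here, not in the original): the two models should make nearly identical predictions even though their coefficients differ.

# Both models represent (nearly) the same cubic function of x, just in different bases;
# the maximum prediction gap over all 100 points should be tiny
pred_raw = lin_reg.predict(poly_features)                    # model fit on raw x, x^2, x^3
pred_std = lin_reg_standard.predict(poly_features_standard)  # model fit on standardized features
print(np.max(np.abs(pred_raw - pred_std)))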
3 Fitting with a plain linear function
lin_reg_l = LinearRegression()
lin_reg_l.fit(X[:n_train],y[:n_train])
y_predict_l = lin_reg_l.predict(X[n_train:])
lin_reg_l.coef_,lin_reg_l.intercept_
(array([17.3088394]), 2.375876452061711)
3.1 Print the training and validation errors, and use them to judge whether the model underfits, overfits, or fits well
# Training error
# (fixed: the original cell had mean_squared_error(X[:n_train], y[:n_train]), which compares the raw
# feature column against the targets; its printed value, 324.131..., measured that mismatch rather
# than the model's training error)
mse = mean_squared_error(lin_reg_l.predict(X[:n_train]), y[:n_train])
mse
# Validation error
mse2 = mean_squared_error(y_predict_l,y[n_train:])
mse2
57.79268376626595
The validation error (≈57.8) is enormous compared with the ≈0.01 noise floor, so the model clearly underfits: a straight line cannot capture the cubic relationship.
3.2 Plot the model's learning curves on the training and validation sets; from these two curves, judge whether the model underfits, overfits, or performs well
plot_learning_curves(lin_reg_l,X,y)
plt.show()
(Figure: output_25_0.png — learning curves of the linear model)
When the training set is tiny, the training RMSE is near zero, because a line can pass close to just a few points, so the fit looks deceptively good. As the training set grows, the training RMSE climbs steeply, which shows the data are not actually linear. The fluctuations in the middle are probably due to noise. With still more data the validation RMSE drops somewhat, but only a little, and the two curves settle together at a high level. This again shows that a straight line cannot model the data well; we need a more complex model.
3.3 What techniques could make this model better?
- Add more feature terms (e.g. polynomial features; see the sketch below)
- Use a more complex model
- If a regularization term is present, reduce its strength
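A minimal sketch of the first option (added here, not in the original): feeding the linear model polynomial features of degree 3, the true degree, should drop the validation MSE to near the noise floor.

# Add polynomial feature terms so the "linear" model can express the cubic relationship
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
cubic_model = Pipeline([
    ("poly", PolynomialFeatures(degree=3, include_bias=False)),
    ("lin", LinearRegression()),
])
cubic_model.fit(X[:n_train], y[:n_train])
print(mean_squared_error(cubic_model.predict(X[n_train:]), y[n_train:]))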
4 Fitting with a degree-40 polynomial model
4.1 Print the best-fit parameter values and compare them with the true values
from sklearn.preprocessing import PolynomialFeatures
ploy_standard_features = PolynomialFeatures(degree = 40,include_bias=False) # degree-40 features, without the degree-0 (bias) column
# Use the standardized data for better numerical conditioning
x_standard_train_ploy = ploy_standard_features.fit_transform(X_standard)
poly_standard_reg = LinearRegression()
poly_standard_reg.fit(x_standard_train_ploy[:n_train],y[:n_train])
y_predict = poly_standard_reg.predict(x_standard_train_ploy[n_train:])
poly_standard_reg.coef_,poly_standard_reg.intercept_
(array([ 5.14922712e-04, 1.99051969e-04, -2.09776213e-03, -1.41197227e-04,
8.43683489e-05, 9.33758451e-06, 2.66917111e-05, 3.30215891e-05,
5.26737562e-05, 5.26666805e-05, 9.62625133e-05, 8.58476393e-05,
1.77048838e-04, 1.36725084e-04, 3.05514354e-04, 2.10654449e-04,
5.22233575e-04, 3.10666202e-04, 8.49405824e-04, 4.32048821e-04,
1.27673703e-03, 5.54222807e-04, 1.68535925e-03, 6.30204596e-04,
1.75416840e-03, 5.77520861e-04, 1.00088658e-03, 2.90536115e-04,
-6.25584267e-04, -2.38515229e-04, -1.43014812e-03, -5.26001032e-04,
1.05838436e-03, 4.09131920e-04, -2.73765513e-04, -1.08961175e-04,
3.05134818e-05, 1.23954079e-05, -1.23740714e-06, -5.10600696e-07]),
3.7136512248815214)
The best-fit parameter values are shown above; the true parameters are [1.2, -3.4, 5.6], 5. Spread across 40 polynomial features, the learned coefficients bear little resemblance to the generating ones, and the intercept (≈3.71 vs 5) is also off.
4.2 Print the training and validation errors, and use them to judge whether the model underfits, overfits, or fits well
# Training error
mse = mean_squared_error(poly_standard_reg.predict(x_standard_train_ploy[:n_train]),y[:n_train])
mse
11.426871317150074
# Validation error
mse2 = mean_squared_error(y_predict,y[n_train:])
mse2
31.237044090226522
The validation error (≈31.2) is much larger than the training error (≈11.4); that train/validation gap indicates the model overfits.
4.3 Train the model without feature standardization. Print the training and validation errors, and use them to judge whether the model underfits, overfits, or fits well
ploy_features = PolynomialFeatures(degree = 40,include_bias=False)
x_train_ploy = ploy_features.fit_transform(X)
poly_reg = LinearRegression()
poly_reg.fit(x_train_ploy[:n_train],y[:n_train])
y_predict = poly_reg.predict(x_train_ploy[n_train:]) # predictions on the held-out rows (the original re-ran fit_transform on the validation slice, which is unnecessary and gives the same values)
poly_reg.coef_,poly_reg.intercept_ # (the original mistakenly reprinted poly_standard_reg's parameters here; that duplicated output is omitted)
# Training error
mse = mean_squared_error(poly_reg.predict(x_train_ploy[:n_train]),y[:n_train])
mse
0.2580520097018944
# Validation error
mse2 = mean_squared_error(y_predict,y[n_train:])
mse2
0.48198110242193337
Compared with the standardized run, both the training error (0.258 vs 11.4) and the validation error (0.482 vs 31.2) come out smaller without standardization here. In both cases, though, the validation error is roughly double the training error and sits far above the ≈0.01 noise floor, so both models overfit.
4.4 Plot the model's learning curves on the training and validation sets; from these two curves, judge whether the model underfits, overfits, or performs well
from sklearn.pipeline import Pipeline
polynomial_regression = Pipeline([
    ("poly_features", PolynomialFeatures(degree=40, include_bias=False)),
    ("lin_reg", LinearRegression()),
])
plot_learning_curves(polynomial_regression,X,y)
plt.axis([0, 80, 0, 20000000000])
plt.show()
(Figure: output_43_0.png — learning curves of the degree-40 model)
As the figure shows (note the y-axis limit of 2×10^10), the validation RMSE is astronomically larger than the training RMSE: the model very clearly overfits.
4.5 What techniques could make this model better?
- Reduce the number of features (see the sketch below)
- Gather more training data
- Use regularization
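A brief sketch of the first option (added here, not in the original; the intermediate degrees are an arbitrary choice): shrinking the polynomial degree back toward the true degree of 3 should bring the validation MSE down.

# Fewer polynomial features -> less capacity to overfit; degree 3 matches the generating function
for degree in (40, 10, 3):
    feats = PolynomialFeatures(degree=degree, include_bias=False).fit_transform(X_standard)
    model = LinearRegression().fit(feats[:n_train], y[:n_train])
    print(degree, mean_squared_error(model.predict(feats[n_train:]), y[n_train:]))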
4.6 Use L2 regularization to make this model perform better. (Hint: use the Ridge model from sklearn.linear_model)
a) Try several regularization strengths, print the training and validation errors under each, and use them to judge whether the model underfits or overfits
from sklearn.linear_model import Ridge
# Fit Ridge at several regularization strengths; for each, print the training MSE,
# then the validation MSE, separated by a divider line
for i, alpha in enumerate([1e-05, 1e-03, 0, 1, 100]):
    if i > 0:
        print("------------------------------")
    regul_reg = Ridge(alpha=alpha)  # alpha=0 is unregularized least squares
    regul_reg.fit(x_standard_train_ploy[:n_train], y[:n_train])
    print(mean_squared_error(regul_reg.predict(x_standard_train_ploy[:n_train]), y[:n_train]))
    print(mean_squared_error(regul_reg.predict(x_standard_train_ploy[n_train:]), y[n_train:]))
81.69299462950605
39.82951636595671
------------------------------
81.68853734458321
39.830034165366676
------------------------------
81.68978486316549
39.82951636215714
------------------------------
81.69147232682458
39.830004089343404
------------------------------
81.37482365212273
39.86916311212694
The training and validation errors barely change across these regularization strengths and both remain very large, so none of these alpha values rescues the fit; the degree-40 model still generalizes poorly, and the overfitting diagnosed in 4.2 persists.
b) Find the best regularization strength, and print the best parameters under it
from sklearn.linear_model import RidgeCV
regul_reg = RidgeCV(alphas = np.arange(1,1001,dtype=np.float64))
regul_reg.fit(x_standard_train_ploy[:n_train], y[:n_train])
print(regul_reg.alpha_)
1000.0
from sklearn.linear_model import Ridge
regul_reg = Ridge(alpha = regul_reg.alpha_)
regul_reg.fit(x_standard_train_ploy[:n_train], y[:n_train])
regul_reg.coef_,regul_reg.intercept_
(array([ 9.39187979e-02, -1.13553786e-02, 1.49937536e-02, -1.03161377e-03,
5.90093523e-04, -1.63279480e-04, 9.51295710e-05, -1.60597505e-04,
2.59544321e-04, -1.03458779e-04, 5.99888725e-04, -5.17818940e-04,
3.58036918e-04, -9.78468422e-04, 7.68003440e-04, -1.18845038e-03,
5.23562068e-04, -1.09469352e-03, 6.41130617e-04, -7.47134671e-04,
6.93080030e-04, 7.57460657e-05, 6.85419431e-04, 8.76005917e-04,
1.00833898e-03, 1.44422982e-03, 9.72198937e-04, 4.93048333e-04,
2.24536911e-05, -9.27185652e-04, -1.55355163e-03, -1.95604343e-04,
9.66395994e-04, 3.41012683e-04, -2.34194909e-04, -1.03115764e-04,
2.50100033e-05, 1.23708967e-05, -9.78373518e-07, -5.26391722e-07]),
2.3459144084752714)
Within the search range 1-1000, the best regularization strength is alpha = 1000, and the best parameters under that strength are printed above. Note that the selected alpha sits on the upper boundary of the grid, which suggests the search range may be too narrow.
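A follow-up sketch (not in the original; the log-spaced range is an arbitrary choice) that widens the grid to check whether an even larger alpha is preferred:

# Widen the alpha grid with log-spaced values, since the previous search hit its boundary
regul_reg_wide = RidgeCV(alphas=np.logspace(0, 6, 25))
regul_reg_wide.fit(x_standard_train_ploy[:n_train], y[:n_train])
print(regul_reg_wide.alpha_)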
c) Summarize the effect of L2 regularization, and analyze why it can solve the overfitting problem
L2 regularization reduced the error and improved the fit to some extent. Because the L2 penalty favors solutions with small weights, training tends to drive all parameters toward small values, producing a model whose output changes little under small perturbations of the input. A model with small weights is in this sense simpler and smoother; it adapts better across datasets and is therefore less prone to overfitting.
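A small illustration of that shrinkage effect (a sketch added here, not in the original; the alpha values are arbitrary): as alpha grows, the L2 norm of the fitted weight vector falls.

# As the L2 penalty strengthens, the norm of the learned coefficient vector shrinks
for alpha in (1e-3, 1.0, 1000.0):
    w = Ridge(alpha=alpha).fit(x_standard_train_ploy[:n_train], y[:n_train]).coef_
    print(alpha, np.linalg.norm(w))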