
MNL (Using Your Own Dataset)

Author: Internet

1. Import packages

import pandas as pd
import numpy as np
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.linear_model import LogisticRegression

2. Load your own data

data_wide = pd.read_csv("./data/mode_wide.csv", index_col=0)  # index_col=0: use the first column as the row index
data_wide
      choice  cost.car  cost.carpool  cost.bus  cost.rail   time.car  time.carpool   time.bus  time.rail
  1      car  1.507010      2.335612  1.800512   2.358920  18.503200     26.338233  20.867794  30.033469
  2     rail  6.056998      2.896919  2.237128   1.855450  31.311107     34.256956  67.181889  60.293126
  3      car  5.794677      2.137454  2.576385   2.747479  22.547429     23.255171  63.309057  49.171643
  4      car  1.869144      2.572427  1.903518   2.268276  26.090282     29.896023  19.752704  13.472675
  5      car  2.498952      1.722010  2.686000   2.973866   4.699140     12.414084  43.092039  39.743252
...      ...       ...           ...       ...        ...        ...           ...        ...        ...
449     rail  6.990901      0.515137  2.066044   2.171174  48.022792     44.501577  27.271918  18.966319
450      car  4.591647      2.891148  1.900379   1.794407  29.444192     33.727087  66.117345  39.842459
451      car  3.236237      1.206815  1.754674   2.023671  16.349017     18.975074  23.387729  43.298276
452      bus  6.932740      1.171861  2.461495   2.612489  65.420641     60.481668  52.404315  48.370662
453  carpool  6.531509      1.408171  2.214791   1.856338  59.566073     55.141406  67.815635  73.447286

453 rows × 9 columns

2. Process the data

Encode the chosen travel mode as an integer label y:

y = 1 (car);

y = 2 (carpool);

y = 3 (rail);

y = 4 (bus);

def choice_to_y(choice):
    if choice == 'car':
        return 1
    elif choice == 'carpool':
        return 2
    elif choice == 'rail':
        return 3
    else:
        return 4

data_wide['y'] = data_wide['choice'].map(choice_to_y)
data_wide
      choice  cost.car  cost.carpool  cost.bus  cost.rail   time.car  time.carpool   time.bus  time.rail  y
  1      car  1.507010      2.335612  1.800512   2.358920  18.503200     26.338233  20.867794  30.033469  1
  2     rail  6.056998      2.896919  2.237128   1.855450  31.311107     34.256956  67.181889  60.293126  3
  3      car  5.794677      2.137454  2.576385   2.747479  22.547429     23.255171  63.309057  49.171643  1
  4      car  1.869144      2.572427  1.903518   2.268276  26.090282     29.896023  19.752704  13.472675  1
  5      car  2.498952      1.722010  2.686000   2.973866   4.699140     12.414084  43.092039  39.743252  1
...      ...       ...           ...       ...        ...        ...           ...        ...        ...  .
449     rail  6.990901      0.515137  2.066044   2.171174  48.022792     44.501577  27.271918  18.966319  3
450      car  4.591647      2.891148  1.900379   1.794407  29.444192     33.727087  66.117345  39.842459  1
451      car  3.236237      1.206815  1.754674   2.023671  16.349017     18.975074  23.387729  43.298276  1
452      bus  6.932740      1.171861  2.461495   2.612489  65.420641     60.481668  52.404315  48.370662  4
453  carpool  6.531509      1.408171  2.214791   1.856338  59.566073     55.141406  67.815635  73.447286  2

453 rows × 10 columns

3. Define the independent variables X and the dependent variable y

data_wide.columns
Index(['choice', 'cost.car', 'cost.carpool', 'cost.bus', 'cost.rail',
       'time.car', 'time.carpool', 'time.bus', 'time.rail', 'y'],
      dtype='object')
X = data_wide[['cost.car', 'cost.carpool', 'cost.bus', 'cost.rail','time.car', 'time.carpool', 'time.bus', 'time.rail']]
y = data_wide['y']

4. Configure and evaluate the logit model

model = LogisticRegression(multi_class='multinomial', solver='lbfgs')

# define the model evaluation procedure: n_splits is the K in K-fold; n_repeats is how many times the whole K-fold cross-validation is repeated
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate the model and collect the scores
n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# report the model performance 
print('Mean Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))  
Mean Accuracy: 0.665 (0.061)
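
To make the evaluation procedure concrete, here is a minimal sketch on synthetic data showing that 10 folds repeated 3 times yield 30 accuracy scores, which is what the mean and standard deviation above summarize. The `make_classification` data is only a stand-in for `mode_wide.csv` (which is not reproduced here), and the `_demo` names are made up for illustration. The `multi_class` argument is omitted on the assumption that the `lbfgs` solver already fits a multinomial model for multi-class targets; recent scikit-learn releases deprecate that argument.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Synthetic stand-in: 453 rows, 8 features, 4 classes (not the real travel data)
X_demo, y_demo = make_classification(
    n_samples=453, n_features=8, n_informative=5, n_redundant=0,
    n_classes=4, random_state=1,
)

# 10 stratified folds, repeated 3 times with different shuffles
cv_demo = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
scores = cross_val_score(
    LogisticRegression(max_iter=1000), X_demo, y_demo,
    scoring="accuracy", cv=cv_demo,
)

# One accuracy value per (repeat, fold) pair
print(len(scores))
print("Mean Accuracy: %.3f (%.3f)" % (scores.mean(), scores.std()))
```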

5. Fit the model

model.fit(X, y)
D:\ANACONDA\lib\site-packages\sklearn\linear_model\_logistic.py:818: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  extra_warning_msg=_LOGISTIC_SOLVER_CONVERGENCE_MSG,





LogisticRegression(multi_class='multinomial')
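
The ConvergenceWarning above suggests two remedies: raising `max_iter` or scaling the data. A minimal sketch of both, under the assumption that standardizing the cost and time columns is acceptable for the analysis, is shown below. It uses synthetic data rather than the real file, and `scaled_model`, `X_demo` and `y_demo` are illustrative names. Note that coefficients fitted on standardized features are expressed per standard deviation, not per original unit.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for X and y, shifted to a larger, unscaled range
X_demo, y_demo = make_classification(
    n_samples=453, n_features=8, n_informative=5, n_redundant=0,
    n_classes=4, random_state=1,
)
X_demo = X_demo * 20.0 + 30.0

# Standardize the features, then fit the logit model with a higher iteration cap
scaled_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_model.fit(X_demo, y_demo)

# Iterations actually used by lbfgs, and class probabilities for the first row
n_iter = scaled_model.named_steps["logisticregression"].n_iter_
proba = scaled_model.predict_proba(X_demo[:1])
print(n_iter, proba)
```

Because the scaler sits inside the pipeline, the same object can be passed to `cross_val_score`, so the scaling is re-estimated within each training fold.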

6. Create a new data point and predict the outcome

# generate a new data point (8 random values in [0, 1))
new_data = np.random.rand(8)
new_data
array([0.11880174, 0.16505872, 0.14297278, 0.50355392, 0.87629855,
       0.91189688, 0.57073101, 0.19178997])
# predict the class probabilities for the new data point
yhat = model.predict_proba([new_data])

# print the predicted probabilities
print('Predicted Probabilities: %s' % yhat[0])
Predicted Probabilities: [0.3749058  0.20228137 0.20380141 0.21901142]


D:\ANACONDA\lib\site-packages\sklearn\base.py:451: UserWarning: X does not have valid feature names, but LogisticRegression was fitted with feature names
  "X does not have valid feature names, but"

This gives a basic outline of how to run a multinomial logit regression on your own data.

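The UserWarning above appears because the model was fitted on a DataFrame with named columns, while the new data was passed to predict_proba as a plain NumPy array without feature names.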
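
One way to avoid this warning, sketched here on randomly generated data, is to wrap the new observation in a DataFrame that carries the same column names used at fit time. The example also pairs each probability with its class label through `classes_`, since the probabilities are returned in that order. All values are random placeholders, and `demo_model`, `new_row` and `labelled` are illustrative names.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

cols = ['cost.car', 'cost.carpool', 'cost.bus', 'cost.rail',
        'time.car', 'time.carpool', 'time.bus', 'time.rail']

# Random placeholder data with the same column names and the labels 1..4
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.random((200, 8)), columns=cols)
y_demo = rng.integers(1, 5, size=200)
demo_model = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

# New observation as a one-row DataFrame with matching feature names
new_row = pd.DataFrame([rng.random(8)], columns=cols)
proba = demo_model.predict_proba(new_row)[0]

# Probabilities follow the order of demo_model.classes_
labelled = dict(zip(demo_model.classes_, proba))
print(labelled)
```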

Source: https://blog.csdn.net/sheyueyu/article/details/123610338