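Airline Customer Value Clustering Analysis
Author: 互联网 (Internet)
- Feature engineering
- K-means clustering
- RFM model
- DBSCAN algorithm
Description
The arrival of the information age has shifted the focus of corporate marketing from products to customers. Concretely, a company segments its customers, designs a tailored service plan for each segment, and applies a different marketing strategy to each. Concentrating limited marketing resources on high-value customers helps maximize profit.
- Use the airline's data to segment customers
- Analyze the characteristics of each segment and compare the value of the segments
- Provide personalized service to segments of different value and design matching marketing strategies
Approach
Data
Meaning of the fields in the dataset
Data preprocessing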
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import datetime
import sklearn.preprocessing
import sklearn.cluster
air_data_path = "./dataset/air_data.csv"
air_data = pd.read_csv(air_data_path)
air_data.shape
(62988, 44)
air_data.head()
| | MEMBER_NO | FFP_DATE | FIRST_FLIGHT_DATE | GENDER | FFP_TIER | WORK_CITY | WORK_PROVINCE | WORK_COUNTRY | AGE | LOAD_TIME | ... | ADD_Point_SUM | Eli_Add_Point_Sum | L1Y_ELi_Add_Points | Points_Sum | L1Y_Points_Sum | Ration_L1Y_Flight_Count | Ration_P1Y_Flight_Count | Ration_P1Y_BPS | Ration_L1Y_BPS | Point_NotFlight |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 54993 | 2006/11/02 | 2008/12/24 | 男 | 6 | . | 北京 | CN | 31.0 | 2014/03/31 | ... | 39992 | 114452 | 111100 | 619760 | 370211 | 0.509524 | 0.490476 | 0.487221 | 0.512777 | 50 |
1 | 28065 | 2007/02/19 | 2007/08/03 | 男 | 6 | NaN | 北京 | CN | 42.0 | 2014/03/31 | ... | 12000 | 53288 | 53288 | 415768 | 238410 | 0.514286 | 0.485714 | 0.489289 | 0.510708 | 33 |
2 | 55106 | 2007/02/01 | 2007/08/30 | 男 | 6 | . | 北京 | CN | 40.0 | 2014/03/31 | ... | 15491 | 55202 | 51711 | 406361 | 233798 | 0.518519 | 0.481481 | 0.481467 | 0.518530 | 26 |
3 | 21189 | 2008/08/22 | 2008/08/23 | 男 | 5 | Los Angeles | CA | US | 64.0 | 2014/03/31 | ... | 0 | 34890 | 34890 | 372204 | 186100 | 0.434783 | 0.565217 | 0.551722 | 0.448275 | 12 |
4 | 39546 | 2009/04/10 | 2009/04/15 | 男 | 6 | 贵阳 | 贵州 | CN | 48.0 | 2014/03/31 | ... | 22704 | 64969 | 64969 | 338813 | 210365 | 0.532895 | 0.467105 | 0.469054 | 0.530943 | 39 |
5 rows × 44 columns
air_data.dtypes
MEMBER_NO int64
FFP_DATE object
FIRST_FLIGHT_DATE object
GENDER object
FFP_TIER int64
WORK_CITY object
WORK_PROVINCE object
WORK_COUNTRY object
AGE float64
LOAD_TIME object
FLIGHT_COUNT int64
BP_SUM int64
EP_SUM_YR_1 int64
EP_SUM_YR_2 int64
SUM_YR_1 float64
SUM_YR_2 float64
SEG_KM_SUM int64
WEIGHTED_SEG_KM float64
LAST_FLIGHT_DATE object
AVG_FLIGHT_COUNT float64
AVG_BP_SUM float64
BEGIN_TO_FIRST int64
LAST_TO_END int64
AVG_INTERVAL float64
MAX_INTERVAL int64
ADD_POINTS_SUM_YR_1 int64
ADD_POINTS_SUM_YR_2 int64
EXCHANGE_COUNT int64
avg_discount float64
P1Y_Flight_Count int64
L1Y_Flight_Count int64
P1Y_BP_SUM int64
L1Y_BP_SUM int64
EP_SUM int64
ADD_Point_SUM int64
Eli_Add_Point_Sum int64
L1Y_ELi_Add_Points int64
Points_Sum int64
L1Y_Points_Sum int64
Ration_L1Y_Flight_Count float64
Ration_P1Y_Flight_Count float64
Ration_P1Y_BPS float64
Ration_L1Y_BPS float64
Point_NotFlight int64
dtype: object
air_data.describe().T
| | count | mean | std | min | 25% | 50% | 75% | max |
---|---|---|---|---|---|---|---|---|
MEMBER_NO | 62988.0 | 31494.500000 | 18183.213715 | 1.00 | 15747.750000 | 31494.500000 | 47241.250000 | 62988.000000 |
FFP_TIER | 62988.0 | 4.102162 | 0.373856 | 4.00 | 4.000000 | 4.000000 | 4.000000 | 6.000000 |
AGE | 62568.0 | 42.476346 | 9.885915 | 6.00 | 35.000000 | 41.000000 | 48.000000 | 110.000000 |
FLIGHT_COUNT | 62988.0 | 11.839414 | 14.049471 | 2.00 | 3.000000 | 7.000000 | 15.000000 | 213.000000 |
BP_SUM | 62988.0 | 10925.081254 | 16339.486151 | 0.00 | 2518.000000 | 5700.000000 | 12831.000000 | 505308.000000 |
EP_SUM_YR_1 | 62988.0 | 0.000000 | 0.000000 | 0.00 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
EP_SUM_YR_2 | 62988.0 | 265.689623 | 1645.702854 | 0.00 | 0.000000 | 0.000000 | 0.000000 | 74460.000000 |
SUM_YR_1 | 62437.0 | 5355.376064 | 8109.450147 | 0.00 | 1003.000000 | 2800.000000 | 6574.000000 | 239560.000000 |
SUM_YR_2 | 62850.0 | 5604.026014 | 8703.364247 | 0.00 | 780.000000 | 2773.000000 | 6845.750000 | 234188.000000 |
SEG_KM_SUM | 62988.0 | 17123.878691 | 20960.844623 | 368.00 | 4747.000000 | 9994.000000 | 21271.250000 | 580717.000000 |
WEIGHTED_SEG_KM | 62988.0 | 12777.152439 | 17578.586695 | 0.00 | 3219.045000 | 6978.255000 | 15299.632500 | 558440.140000 |
AVG_FLIGHT_COUNT | 62988.0 | 1.542154 | 1.786996 | 0.25 | 0.428571 | 0.875000 | 1.875000 | 26.625000 |
AVG_BP_SUM | 62988.0 | 1421.440249 | 2083.121324 | 0.00 | 336.000000 | 752.375000 | 1690.270833 | 63163.500000 |
BEGIN_TO_FIRST | 62988.0 | 120.145488 | 159.572867 | 0.00 | 9.000000 | 50.000000 | 166.000000 | 729.000000 |
LAST_TO_END | 62988.0 | 176.120102 | 183.822223 | 1.00 | 29.000000 | 108.000000 | 268.000000 | 731.000000 |
AVG_INTERVAL | 62988.0 | 67.749788 | 77.517866 | 0.00 | 23.370370 | 44.666667 | 82.000000 | 728.000000 |
MAX_INTERVAL | 62988.0 | 166.033895 | 123.397180 | 0.00 | 79.000000 | 143.000000 | 228.000000 | 728.000000 |
ADD_POINTS_SUM_YR_1 | 62988.0 | 540.316965 | 3956.083455 | 0.00 | 0.000000 | 0.000000 | 0.000000 | 600000.000000 |
ADD_POINTS_SUM_YR_2 | 62988.0 | 814.689258 | 5121.796929 | 0.00 | 0.000000 | 0.000000 | 0.000000 | 728282.000000 |
EXCHANGE_COUNT | 62988.0 | 0.319775 | 1.136004 | 0.00 | 0.000000 | 0.000000 | 0.000000 | 46.000000 |
avg_discount | 62988.0 | 0.721558 | 0.185427 | 0.00 | 0.611997 | 0.711856 | 0.809476 | 1.500000 |
P1Y_Flight_Count | 62988.0 | 5.766257 | 7.210922 | 0.00 | 2.000000 | 3.000000 | 7.000000 | 118.000000 |
L1Y_Flight_Count | 62988.0 | 6.073157 | 8.175127 | 0.00 | 1.000000 | 3.000000 | 8.000000 | 111.000000 |
P1Y_BP_SUM | 62988.0 | 5366.720550 | 8537.773021 | 0.00 | 946.000000 | 2692.000000 | 6485.250000 | 246197.000000 |
L1Y_BP_SUM | 62988.0 | 5558.360704 | 9351.956952 | 0.00 | 545.000000 | 2547.000000 | 6619.250000 | 259111.000000 |
EP_SUM | 62988.0 | 265.689623 | 1645.702854 | 0.00 | 0.000000 | 0.000000 | 0.000000 | 74460.000000 |
ADD_Point_SUM | 62988.0 | 1355.006223 | 7868.477000 | 0.00 | 0.000000 | 0.000000 | 0.000000 | 984938.000000 |
Eli_Add_Point_Sum | 62988.0 | 1620.695847 | 8294.398955 | 0.00 | 0.000000 | 0.000000 | 345.000000 | 984938.000000 |
L1Y_ELi_Add_Points | 62988.0 | 1080.378882 | 5639.857254 | 0.00 | 0.000000 | 0.000000 | 0.000000 | 728282.000000 |
Points_Sum | 62988.0 | 12545.777100 | 20507.816700 | 0.00 | 2775.000000 | 6328.500000 | 14302.500000 | 985572.000000 |
L1Y_Points_Sum | 62988.0 | 6638.739585 | 12601.819863 | 0.00 | 700.000000 | 2860.500000 | 7500.000000 | 728282.000000 |
Ration_L1Y_Flight_Count | 62988.0 | 0.486419 | 0.319105 | 0.00 | 0.250000 | 0.500000 | 0.711111 | 1.000000 |
Ration_P1Y_Flight_Count | 62988.0 | 0.513581 | 0.319105 | 0.00 | 0.288889 | 0.500000 | 0.750000 | 1.000000 |
Ration_P1Y_BPS | 62988.0 | 0.522293 | 0.339632 | 0.00 | 0.258150 | 0.514252 | 0.815091 | 0.999989 |
Ration_L1Y_BPS | 62988.0 | 0.468422 | 0.338956 | 0.00 | 0.167954 | 0.476747 | 0.728375 | 0.999993 |
Point_NotFlight | 62988.0 | 2.728155 | 7.364164 | 0.00 | 0.000000 | 0.000000 | 1.000000 | 140.000000 |
air_data['MEMBER_NO'].duplicated()
0 False
1 False
2 False
3 False
4 False
...
62983 False
62984 False
62985 False
62986 False
62987 False
Name: MEMBER_NO, Length: 62988, dtype: bool
air_data[air_data['MEMBER_NO'].duplicated()]
MEMBER_NO | FFP_DATE | FIRST_FLIGHT_DATE | GENDER | FFP_TIER | WORK_CITY | WORK_PROVINCE | WORK_COUNTRY | AGE | LOAD_TIME | ... | ADD_Point_SUM | Eli_Add_Point_Sum | L1Y_ELi_Add_Points | Points_Sum | L1Y_Points_Sum | Ration_L1Y_Flight_Count | Ration_P1Y_Flight_Count | Ration_P1Y_BPS | Ration_L1Y_BPS | Point_NotFlight |
---|
0 rows × 44 columns
air_data.isna().any()
MEMBER_NO False
FFP_DATE False
FIRST_FLIGHT_DATE False
GENDER True
FFP_TIER False
WORK_CITY True
WORK_PROVINCE True
WORK_COUNTRY True
AGE True
LOAD_TIME False
FLIGHT_COUNT False
BP_SUM False
EP_SUM_YR_1 False
EP_SUM_YR_2 False
SUM_YR_1 True
SUM_YR_2 True
SEG_KM_SUM False
WEIGHTED_SEG_KM False
LAST_FLIGHT_DATE False
AVG_FLIGHT_COUNT False
AVG_BP_SUM False
BEGIN_TO_FIRST False
LAST_TO_END False
AVG_INTERVAL False
MAX_INTERVAL False
ADD_POINTS_SUM_YR_1 False
ADD_POINTS_SUM_YR_2 False
EXCHANGE_COUNT False
avg_discount False
P1Y_Flight_Count False
L1Y_Flight_Count False
P1Y_BP_SUM False
L1Y_BP_SUM False
EP_SUM False
ADD_Point_SUM False
Eli_Add_Point_Sum False
L1Y_ELi_Add_Points False
Points_Sum False
L1Y_Points_Sum False
Ration_L1Y_Flight_Count False
Ration_P1Y_Flight_Count False
Ration_P1Y_BPS False
Ration_L1Y_BPS False
Point_NotFlight False
dtype: bool
air_data.isnull().any()
MEMBER_NO False
FFP_DATE False
FIRST_FLIGHT_DATE False
GENDER True
FFP_TIER False
WORK_CITY True
WORK_PROVINCE True
WORK_COUNTRY True
AGE True
LOAD_TIME False
FLIGHT_COUNT False
BP_SUM False
EP_SUM_YR_1 False
EP_SUM_YR_2 False
SUM_YR_1 True
SUM_YR_2 True
SEG_KM_SUM False
WEIGHTED_SEG_KM False
LAST_FLIGHT_DATE False
AVG_FLIGHT_COUNT False
AVG_BP_SUM False
BEGIN_TO_FIRST False
LAST_TO_END False
AVG_INTERVAL False
MAX_INTERVAL False
ADD_POINTS_SUM_YR_1 False
ADD_POINTS_SUM_YR_2 False
EXCHANGE_COUNT False
avg_discount False
P1Y_Flight_Count False
L1Y_Flight_Count False
P1Y_BP_SUM False
L1Y_BP_SUM False
EP_SUM False
ADD_Point_SUM False
Eli_Add_Point_Sum False
L1Y_ELi_Add_Points False
Points_Sum False
L1Y_Points_Sum False
Ration_L1Y_Flight_Count False
Ration_P1Y_Flight_Count False
Ration_P1Y_BPS False
Ration_L1Y_BPS False
Point_NotFlight False
dtype: bool
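In pandas, `isnull` is an alias of `isna`, so the two checks above return the same result. A per-column count of missing values is often easier to act on than a True/False flag. The sketch below shows one way to get it; the frame `toy` and its values are made up for illustration and are not taken from air_data.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for air_data; all values are made up.
toy = pd.DataFrame({
    "AGE": [31.0, np.nan, 40.0, 64.0],
    "SUM_YR_1": [100.0, 0.0, np.nan, 250.0],
    "SUM_YR_2": [80.0, 0.0, 30.0, np.nan],
    "FLIGHT_COUNT": [10, 2, 3, 5],
})

# isnull() is an alias of isna(): both produce the same boolean mask.
same_mask = toy.isna().equals(toy.isnull())

# Count missing values per column and keep only the columns that have any.
missing_counts = toy.isna().sum()
missing_counts = missing_counts[missing_counts > 0]

print(same_mask)
print(missing_counts)
```

Applied to the real data, the same two lines would show how many rows each of the flagged columns (GENDER, WORK_CITY, AGE, SUM_YR_1 and so on) is actually missing.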
boolean_filter = air_data['SUM_YR_1'].notnull() & air_data['SUM_YR_2'].notnull()
boolean_filter
0 True
1 True
2 True
3 True
4 True
...
62983 True
62984 True
62985 True
62986 True
62987 False
Length: 62988, dtype: bool
air_data = air_data[boolean_filter]
filter1 = air_data['SUM_YR_1'] != 0
filter2 = air_data['SUM_YR_2'] != 0
air_data = air_data[filter1 | filter2]
air_data.shape
(62044, 44)
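The cleaning rule used above has two parts: drop rows where either year's fare is missing, then keep only rows where at least one year's fare is nonzero. A minimal sketch of the same rule as a single reusable function follows. The function name `clean_fares` and the toy rows are hypothetical; only the column names come from the dataset.

```python
import numpy as np
import pandas as pd

def clean_fares(df: pd.DataFrame) -> pd.DataFrame:
    """Drop rows with a missing fare, then rows whose fare is zero in both years."""
    has_both = df["SUM_YR_1"].notna() & df["SUM_YR_2"].notna()
    any_spend = (df["SUM_YR_1"] != 0) | (df["SUM_YR_2"] != 0)
    # .copy() returns an independent frame, so columns added later
    # do not trigger pandas' SettingWithCopyWarning.
    return df[has_both & any_spend].copy()

# Made-up rows covering each case: normal, both zero, missing, one zero.
toy = pd.DataFrame({
    "MEMBER_NO": [1, 2, 3, 4],
    "SUM_YR_1": [100.0, 0.0, np.nan, 0.0],
    "SUM_YR_2": [80.0, 0.0, 30.0, 45.0],
})
cleaned = clean_fares(toy)
print(cleaned["MEMBER_NO"].tolist())
```

Combining the two masks with `&` gives the same rows as filtering in two passes, and it avoids indexing a filtered frame with masks built from that same frame.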
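Feature engineering
RFM model
RFM is a classic model for customer value analysis.
- Recency: time since the customer's most recent purchase.
- Frequency: how often the customer purchases.
- Monetary Value: the customer's total spending.
Variant: the LRFMC model
- Length of Relationship: how long the customer has been a member, which reflects the possible length of activity.
- Recency: time since the most recent purchase, which reflects how active the customer is now.
- Frequency: how often the customer purchases, which reflects loyalty.
- Mileage: the customer's total flight distance, which reflects how much the customer depends on air travel.
- Coefficient of Discount: the average discount rate the customer received, an indirect indicator of customer value.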
load_time = datetime.datetime.strptime('2014/03/31','%Y/%m/%d')
load_time
datetime.datetime(2014, 3, 31, 0, 0)
ffp_dates = [datetime.datetime.strptime(ffp_date,'%Y/%m/%d') for ffp_date in air_data['FFP_DATE']]
length_of_relationship = [(load_time-ffp_date).days for ffp_date in ffp_dates]
air_data['LEN_REL'] = length_of_relationship
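The membership length computed above can also be obtained without Python-level loops by letting pandas parse the whole date column at once. This is a sketch under the assumption that every FFP_DATE value follows the `%Y/%m/%d` format; the frame `toy` is illustrative, with the first date taken from row 0 of the sample shown earlier and the second one made up.

```python
import pandas as pd

# First date is row 0 of the sample above; the second is made up.
toy = pd.DataFrame({"FFP_DATE": ["2006/11/02", "2013/03/31"]})
load_time = pd.Timestamp("2014-03-31")

# Parse the whole column at once, then subtract to get a timedelta column.
ffp_dates = pd.to_datetime(toy["FFP_DATE"], format="%Y/%m/%d")
toy["LEN_REL"] = (load_time - ffp_dates).dt.days

print(toy["LEN_REL"].tolist())
```

The first value matches the L column that the article reports for row 0.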
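Drop the unneeded columns and keep only the attributes the LRFMC model requires
features = ['LEN_REL','FLIGHT_COUNT','avg_discount','SEG_KM_SUM','LAST_TO_END']
data = air_data[features].copy()  # copy so that renaming the columns does not touch air_data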
features = ['L','F','C','M','R']
data.columns = features
data.shape
(62044, 5)
data.head()
| | L | F | C | M | R |
---|---|---|---|---|---|
0 | 2706 | 210 | 0.961639 | 580717 | 1 |
1 | 2597 | 140 | 1.252314 | 293678 | 7 |
2 | 2615 | 135 | 1.254676 | 283712 | 11 |
3 | 2047 | 23 | 1.090870 | 281336 | 97 |
4 | 1816 | 152 | 0.970658 | 309928 | 5 |
data.describe().T
| | count | mean | std | min | 25% | 50% | 75% | max |
---|---|---|---|---|---|---|---|---|
L | 62044.0 | 1488.691090 | 847.880920 | 365.000000 | 735.000000 | 1278.000000 | 2182.000000 | 3437.0 |
F | 62044.0 | 11.971359 | 14.110619 | 2.000000 | 3.000000 | 7.000000 | 15.000000 | 213.0 |
C | 62044.0 | 0.722180 | 0.184833 | 0.136017 | 0.613085 | 0.712162 | 0.809293 | 1.5 |
M | 62044.0 | 17321.694749 | 21052.728111 | 368.000000 | 4874.000000 | 10200.000000 | 21522.500000 | 580717.0 |
R | 62044.0 | 172.532703 | 181.526164 | 1.000000 | 29.000000 | 105.000000 | 260.000000 | 731.0 |
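Standardization
Standardization puts the attributes on a common scale. Common methods include min-max scaling and z-score (standard-deviation) standardization.
- Standardize each feature to mean 0 and variance 1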
((data -data.mean(axis=0)) /data.std(axis=0)).describe().T
| | count | mean | std | min | 25% | 50% | 75% | max |
---|---|---|---|---|---|---|---|---|
L | 62044.0 | 1.117739e-16 | 1.0 | -1.325294 | -0.888911 | -0.248491 | 0.817696 | 2.297857 |
F | 62044.0 | 3.664717e-17 | 1.0 | -0.706656 | -0.635788 | -0.352313 | 0.214636 | 14.246621 |
C | 62044.0 | 4.251071e-16 | 1.0 | -3.171310 | -0.590233 | -0.054199 | 0.471304 | 4.208225 |
M | 62044.0 | -5.863547e-17 | 1.0 | -0.805297 | -0.591263 | -0.338279 | 0.199537 | 26.761154 |
R | 62044.0 | 1.465887e-16 | 1.0 | -0.944948 | -0.790700 | -0.372027 | 0.481844 | 3.076511 |
ss = sklearn.preprocessing.StandardScaler(with_mean=True,with_std=True)
data = ss.fit_transform(data)
data
array([[ 1.43571897, 14.03412875, 1.29555058, 26.76136996, -0.94495516], [ 1.30716214, 9.07328567, 2.86819902, 13.1269701 , -0.9119018 ], [ 1.32839171, 8.71893974, 2.88097321, 12.65358345, -0.88986623], ..., [-0.14942206, -0.70666211, -2.68990622, -0.77233818, -0.73561725], [-1.20618274, -0.70666211, -2.55464809, -0.77984321, 1.6056619 ], [-0.47965977, -0.70666211, -2.39233833, -0.78668323, 0.60304353]])
data = pd.DataFrame(data,columns=features)
data.head()
| | L | F | C | M | R |
---|---|---|---|---|---|
0 | 1.435719 | 14.034129 | 1.295551 | 26.761370 | -0.944955 |
1 | 1.307162 | 9.073286 | 2.868199 | 13.126970 | -0.911902 |
2 | 1.328392 | 8.718940 | 2.880973 | 12.653583 | -0.889866 |
3 | 0.658481 | 0.781591 | 1.994730 | 12.540723 | -0.416102 |
4 | 0.386035 | 9.923716 | 1.344346 | 13.898848 | -0.922920 |
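data_db = data.copy()  # assumed: the original never defines data_db; a copy of the standardized frame matches the output below
data_db.describe().T
| | count | mean | std | min | 25% | 50% | 75% | max |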
---|---|---|---|---|---|---|---|---|
L | 62044.0 | 1.246004e-16 | 1.000008 | -1.325304 | -0.888919 | -0.248493 | 0.817703 | 2.297875 |
F | 62044.0 | 5.863547e-17 | 1.000008 | -0.706662 | -0.635793 | -0.352316 | 0.214637 | 14.246736 |
C | 62044.0 | 3.957894e-16 | 1.000008 | -3.171335 | -0.590238 | -0.054200 | 0.471308 | 4.208258 |
M | 62044.0 | -1.026121e-16 | 1.000008 | -0.805303 | -0.591268 | -0.338282 | 0.199539 | 26.761370 |
R | 62044.0 | 4.397660e-17 | 1.000008 | -0.944955 | -0.790706 | -0.372030 | 0.481848 | 3.076536 |
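The manual z-score earlier reports a std of exactly 1.0, while the StandardScaler output reports 1.000008. A likely explanation is the choice of denominator: pandas `std()` divides by n - 1 by default, whereas StandardScaler scales by the population standard deviation (dividing by n). The sketch below reproduces the effect with NumPy only; the array `x` is random, made-up data.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=1500.0, scale=850.0, size=1000)  # made-up values

# Sample standard deviation (ddof=1), the pandas default for std().
z_sample = (x - x.mean()) / x.std(ddof=1)
# Population standard deviation (ddof=0), which StandardScaler uses.
z_population = (x - x.mean()) / x.std(ddof=0)

n = x.size
# Measured with ddof=1, the population-scaled data has std sqrt(n / (n - 1)).
print(z_sample.std(ddof=1))
print(z_population.std(ddof=1), np.sqrt(n / (n - 1)))

# For the 62044 rows used in this article the factor is about 1.000008.
factor_article = np.sqrt(62044 / 62043)
print(factor_article)
```

The difference is negligible for clustering at this sample size; it only matters when comparing printed summaries.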
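Model training and prediction
Segment the customers into five groups: key customers to retain, key customers to develop, key customers to win back, general customers, and low-value customers
K-means clustering algorithm
- The goal is to partition \(n\) observations into \(k\) clusters, each with its own center (mean).
- Each sample belongs to exactly one cluster: the one whose center is nearest to it.
- Notation: \(S_{i}\) is a cluster, \(m_{i}\) is the center of the samples in \(S_{i}\), and \(x_{i}\) is a sample point.
- Assignment step (expectation step): assign each sample to the cluster with the nearest center
- Update step (maximization step): recompute each cluster's center from the samples currently assigned to it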
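The two alternating steps listed above can be sketched in a few lines of NumPy. This is an illustrative version of Lloyd's algorithm, not the implementation scikit-learn uses: the function name `kmeans_lloyd`, the random initialisation, and the toy points are all assumptions made for the example.

```python
import numpy as np

def kmeans_lloyd(X, k, n_iter=100, seed=0):
    """Minimal Lloyd's algorithm: alternate assignment and update steps."""
    rng = np.random.default_rng(seed)
    # Initialise the centers m_i with k distinct samples chosen at random.
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Assignment step: each sample joins the cluster S_i of its nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each center m_i becomes the mean of the samples in S_i.
        new_centers = centers.copy()
        for i in range(k):
            members = X[labels == i]
            if len(members) > 0:  # keep the old center if a cluster is empty
                new_centers[i] = members.mean(axis=0)
        if np.allclose(new_centers, centers):
            break  # centers stopped moving, so the assignment is stable
        centers = new_centers
    return labels, centers

# Two made-up, well-separated groups of points.
X = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
              [5.0, 5.0], [5.2, 4.9], [4.9, 5.1]])
labels, centers = kmeans_lloyd(X, k=2)
print(labels)
print(centers)
```

Which group ends up with label 0 depends on the initialisation, which is why cluster numbers should not be compared across runs.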
num_clusters = 5  # number of clusters
# n_jobs was deprecated in scikit-learn 0.23 and removed in 1.0, so it is omitted here
km = sklearn.cluster.KMeans(n_clusters=num_clusters)  # build the model
km.fit(data)  # train the model
KMeans(n_clusters=5)
# Inspect the five cluster centers learned by the model and the number of samples in each cluster
r1 = pd.Series(km.labels_).value_counts()
r2 = pd.DataFrame(km.cluster_centers_)
r = pd.concat([r2,r1],axis=1)
r.columns = list(data.columns) + ['counts']
r
| | L | F | C | M | R | counts |
---|---|---|---|---|---|---|
0 | 0.482004 | 2.478716 | 0.298630 | 2.420403 | -0.798959 | 5338 |
1 | 1.155203 | -0.091881 | -0.150515 | -0.099938 | -0.373781 | 15858 |
2 | 0.110721 | -0.189617 | 2.353276 | -0.185116 | -0.015167 | 3684 |
3 | -0.700396 | -0.164828 | -0.234397 | -0.165888 | -0.410842 | 24970 |
4 | -0.315083 | -0.574115 | -0.162570 | -0.537185 | 1.684579 | 12194 |
# Inspect the cluster label the model assigned to each sample
km.labels_
array([0, 0, 0, ..., 3, 4, 4], dtype=int32)
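The centers shown above are in standardized units, which makes them hard to read as days, flights, or kilometres. One option is to map them back with the scaler's `inverse_transform`. The sketch below uses made-up two-feature data; the names `scaler`, `model`, and `centers_original` are illustrative and do not appear in the article.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Made-up data: (flight count, kilometres) for two loose groups of customers.
low = rng.normal(loc=[5, 8000], scale=[1, 500], size=(50, 2))
high = rng.normal(loc=[60, 120000], scale=[5, 5000], size=(50, 2))
raw = np.vstack([low, high])

scaler = StandardScaler()
scaled = scaler.fit_transform(raw)

# Fixed random_state and explicit n_init keep the run reproducible.
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(scaled)

# Map the standardized centers back to the original units for reporting.
centers_original = scaler.inverse_transform(model.cluster_centers_)
# Sort by the first feature so the row order does not depend on label numbering.
centers_original = centers_original[np.argsort(centers_original[:, 0])]
print(np.round(centers_original, 1))
```

This only works while the scaler still holds the statistics of the training data, so it should be done before the scaler is refitted on anything else.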
Trying the RFM model
data_rfm = data[['R','F','M']]
data_rfm.head()
| | R | F | M |
---|---|---|---|
0 | -0.944955 | 14.034129 | 26.761370 |
1 | -0.911902 | 9.073286 | 13.126970 |
2 | -0.889866 | 8.718940 | 12.653583 |
3 | -0.416102 | 0.781591 | 12.540723 |
4 | -0.922920 | 9.923716 | 13.898848 |
km.fit(data_rfm)  # retrain the model on the R, F, M columns only
km.labels_
array([3, 3, 3, ..., 2, 1, 2], dtype=int32)
r1 = pd.Series(km.labels_).value_counts()
r2 = pd.DataFrame(km.cluster_centers_)
rr = pd.concat([r2,r1],axis=1)
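# Note: this refits ss on the 5-row table and standardizes every column across clusters,
# so the 'counts' column below is a z-score, not a number of samples
rr = pd.DataFrame(ss.fit_transform(rr))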
rr.columns = list(data_rfm.columns) + ['counts']
rr
| | R | F | M | counts |
---|---|---|---|---|
0 | -0.475915 | -0.389200 | -0.395668 | 0.146242 |
1 | 1.958565 | -0.918959 | -0.893438 | 0.118661 |
2 | -0.129480 | -0.846644 | -0.841995 | 1.712033 |
3 | -0.727717 | 1.772255 | 1.795436 | -1.187639 |
4 | -0.625453 | 0.382548 | 0.335664 | -0.789296 |
分析与决策
Use a radar chart to visualize the characteristics of the five clusters learned by the model
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.patches import Circle,RegularPolygon
from matplotlib.path import Path
from matplotlib.projections.polar import PolarAxes
from matplotlib.projections import register_projection
from matplotlib.spines import Spine
from matplotlib.transforms import Affine2D
def radar_factory(num_vars, frame='circle'):
    # compute evenly spaced axis angles
    theta = np.linspace(0, 2*np.pi, num_vars, endpoint=False)
    class RadarAxes(PolarAxes):
        name = 'radar'
        # connect the given points with a single line segment
        RESOLUTION = 1
        def __init__(self, *args, **kwargs):
            super().__init__(*args, **kwargs)
            # rotate the plot so that the first axis is at the top
            self.set_theta_zero_location('N')
        def fill(self, *args, closed=True, **kwargs):
            """Override fill so that the line is closed by default"""
            return super().fill(closed=closed, *args, **kwargs)
        def plot(self, *args, **kwargs):
            """Override plot so that the line is closed by default"""
            lines = super().plot(*args, **kwargs)
            for line in lines:
                self._close_line(line)
        def _close_line(self, line):
            x, y = line.get_data()
            # FIXME: markers at x[0], y[0] get doubled up
            if x[0] != x[-1]:
                x = np.concatenate((x, [x[0]]))
                y = np.concatenate((y, [y[0]]))
                line.set_data(x, y)
        def set_varlabels(self, labels):
            self.set_thetagrids(np.degrees(theta), labels)
        def _gen_axes_patch(self):
            # The axes patch must be centered at (0.5, 0.5) with radius 0.5
            # in axes coordinates.
            if frame == 'circle':
                return Circle((0.5, 0.5), 0.5)
            elif frame == 'polygon':
                return RegularPolygon((0.5, 0.5), num_vars,
                                      radius=.5, edgecolor="k")
            else:
                raise ValueError("unknown value for 'frame': %s" % frame)
        def _gen_axes_spines(self):
            if frame == 'circle':
                return super()._gen_axes_spines()
            elif frame == 'polygon':
                # spine_type must be 'left'/'right'/'top'/'bottom'/'circle'.
                spine = Spine(axes=self,
                              spine_type='circle',
                              path=Path.unit_regular_polygon(num_vars))
                # unit_regular_polygon gives a polygon of radius 1 centered at
                # (0, 0), but we want a polygon of radius 0.5 centered at
                # (0.5, 0.5) in axes coordinates.
                spine.set_transform(Affine2D().scale(.5).translate(.5, .5)
                                    + self.transAxes)
                return {'polar': spine}
            else:
                raise ValueError("unknown value for 'frame': %s" % frame)
    register_projection(RadarAxes)
    return theta
Plotting the LRFMC model
N = len(features)  # one radar axis per feature (L, F, C, M, R), not per cluster
theta = radar_factory(N, frame='polygon')
data = r.to_numpy()
fig, ax = plt.subplots(figsize=(5, 5), nrows=1, ncols=1, subplot_kw=dict(projection='radar'))
fig.subplots_adjust(wspace=0.25, hspace=0.20, top=0.85, bottom=0.05)
# drop the last column (counts)
case_data = data[:, :-1]
# hide the radial axis
ax.get_yaxis().set_visible(False)
# chart title
title = "Radar Chart for Different Means"
ax.set_title(title, weight='bold', size='medium', position=(0.5, 1.1),
             horizontalalignment='center', verticalalignment='center')
for d in case_data:
    # draw the outline
    ax.plot(theta, d)
    # fill the polygon
    ax.fill(theta, d, alpha=0.05)
# label each axis with its feature name
ax.set_varlabels(features)
# add the legend
labels = ["CustomerCluster_" + str(i) for i in range(1, 6)]
legend = ax.legend(labels, loc=(0.9, .75), labelspacing=0.1)
plt.show()
Plotting the RFM model
theta = radar_factory(3, frame='polygon')
data = rr.to_numpy()
fig, ax = plt.subplots(figsize=(5, 5), nrows=1, ncols=1,
                       subplot_kw=dict(projection='radar'))
fig.subplots_adjust(wspace=0.25, hspace=0.20, top=0.85, bottom=0.05)
# drop the last column (counts)
case_data = data[:, :-1]
# hide the radial axis
ax.get_yaxis().set_visible(False)
# chart title
title = "Radar Chart for Different Means"
ax.set_title(title, weight='bold', size='medium', position=(0.5, 1.1),
             horizontalalignment='center', verticalalignment='center')
for d in case_data:
    # draw the outline
    ax.plot(theta, d)
    # fill the polygon
    ax.fill(theta, d, alpha=0.05)
# label each axis with its feature name
ax.set_varlabels(['R', 'F', 'M'])
# add the legend
labels = ["CustomerCluster_" + str(i) for i in range(1, 6)]
legend = ax.legend(labels, loc=(0.9, .75), labelspacing=0.1)
plt.show()
Clustering the LRFMC features with DBSCAN
from sklearn.cluster import DBSCAN
# Kagging debug
db = DBSCAN(eps=10,min_samples=2).fit(data_db.sample(10000))
DBSCAN_labels = db.labels_
DBSCAN_labels
array([0, 0, 0, ..., 0, 0, 0])
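The visible labels above are all 0, which suggests that DBSCAN merged the sample into a single cluster. On standardized features most pairwise distances are only a few units, so `eps=10` with `min_samples=2` links almost every point to every other. The sketch below illustrates the effect on made-up two-dimensional data; the blobs, the helper `count_clusters`, and the smaller `eps` value are assumptions chosen for the demonstration, not tuned settings for the airline data.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
# Two made-up blobs, 6 units apart, each with a spread of 0.3.
blob_a = rng.normal(loc=[0.0, 0.0], scale=0.3, size=(100, 2))
blob_b = rng.normal(loc=[6.0, 0.0], scale=0.3, size=(100, 2))
points = np.vstack([blob_a, blob_b])

def count_clusters(eps, min_samples):
    """Number of DBSCAN clusters, ignoring the noise label -1."""
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit(points).labels_
    return len(set(labels.tolist()) - {-1})

# eps larger than the gap between the blobs: everything is linked together.
n_large_eps = count_clusters(eps=10, min_samples=2)
# eps smaller than the gap but larger than the in-blob spacing: two clusters.
n_small_eps = count_clusters(eps=1.0, min_samples=5)
print(n_large_eps, n_small_eps)
```

A common way to pick `eps` on real data is to plot each point's distance to its k-th nearest neighbour in sorted order and look for the bend in the curve.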
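Analysis based on the LRFMC results
In line with business practice, the clustering results are discretized into scores from 1 to 5, where a larger attribute value gets a higher score:
- Key customers to retain
High average discount rate (C↑), recent flights (R↓), high flight count (F↑) or high mileage (M↑):
These customers pay high fares, are not sensitive to discounts, and fly often. They are the ideal customer type.
The company should direct resources to them first and maintain their loyalty.
- Key customers to develop
High average discount rate (C↑), recent flights (R↓), low flight count (F↓) or low mileage (M↓):
These customers pay high fares, are not sensitive to discounts, and have flown recently, but their total mileage is low, so they have large growth potential.
The company should improve their satisfaction so that they gradually become loyal customers.
- Key customers to win back
High average discount rate (C↑), high flight count (F↑) or high mileage (M↑), no recent flights (R↑):
These customers have high total mileage but have not flown for a long time and may be churning.
The company should engage with them more, win them back, and extend the customer life cycle.
- General customers
Low average discount rate (C↓), no recent flights (R↑), low flight count (F↓) or low mileage (M↓), short membership (L↓):
These customers pay low fares, often buy discounted tickets, and have not flown recently. They probably buy only when a discount is on offer and show no brand loyalty.
The company should strengthen contact with them when resources allow.
- Low-value customers
Low average discount rate (C↓), no recent flights (R↑), low flight count (F↓) or low mileage (M↓), short membership (L↓):
Like general customers, these customers pay low fares, often buy discounted tickets, and have not flown recently. They probably buy only when a discount is on offer and show no brand loyalty.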
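The article does not say how the 1 to 5 scores are assigned. One simple reading is to rank each attribute across the five clusters, which the sketch below does with pandas. The center values are copied from the K-means output table earlier in the article; the rank-based rule itself and the name `scores` are assumptions.

```python
import pandas as pd

# Cluster centers copied from the K-means output table in this article.
centers = pd.DataFrame(
    {
        "L": [0.482004, 1.155203, 0.110721, -0.700396, -0.315083],
        "F": [2.478716, -0.091881, -0.189617, -0.164828, -0.574115],
        "C": [0.298630, -0.150515, 2.353276, -0.234397, -0.162570],
        "M": [2.420403, -0.099938, -0.185116, -0.165888, -0.537185],
        "R": [-0.798959, -0.373781, -0.015167, -0.410842, 1.684579],
    }
)

# Rank each attribute across the five clusters: smallest -> 1, largest -> 5.
scores = centers.rank(axis=0, method="first").astype(int)
print(scores)
```

Because a larger R means a longer time since the last flight, a high R score is a bad sign, while high scores on L, F, M and C are good ones.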
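Result analysis
- Cluster 1 has the largest L
- Cluster 2 has the smallest L and C
- Cluster 3 has the largest C
- Cluster 4 has the largest M and F and the smallest R
- Cluster 5 has the largest R and the smallest F and M
K-means label numbers are arbitrary and change between runs: in the center table printed earlier, these five profiles appear as rows 1, 3, 2, 0 and 4 respectively.
The business meaning of each indicator is:
- L: length of membership. A larger value means a longer-standing member
- R: time since the most recent flight. A larger value means a longer time without flying
- F: number of flights. A larger value means more flights
- M: total flight mileage. A larger value means more total mileage
- C: average discount rate. A larger value means a weaker discount. The source describes the scale as 0 for a free ticket and 10 for a full-price ticket; in this dataset avg_discount is stored as a fraction (about 0.14 to 1.5 after filtering)
Key customers to retain: cluster 4
Key customers to develop: cluster 3
Key customers to win back: cluster 1
General customers: cluster 2
Low-value customers: cluster 5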
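Decisions
- The three groups of key customers (to develop, to retain, to win back) correspond to three stages of the customer life cycle: growth, stability, and decline.
- From a life-cycle point of view, resources should also be focused on winning back customers in the decline stage.
- In general, the end goal of data analysis is to propose and run operations and marketing strategies based on the findings, in order to help the business grow. In this example the operational strategy has three directions:
- Raise activity: raise the activity of general and low-value customers and convert them into high-quality customers
- Raise retention: engage with key customers to win back and raise their retention rate
- Raise the paying rate: maintain the loyalty of key customers to retain and to develop, keeping revenue healthy
- Each direction maps to different tactics, such as membership upgrades, points redemption, cross-selling, and discount coupons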
Source: https://www.cnblogs.com/oceaneyes-gzy/p/16462998.html