基于关联规则算法实现电影推荐系统
作者:互联网
基于关联规则算法实现电影推荐系统
- 利用数据挖掘算法中的Apriori(关联规则)算法来实现一个电影推荐系统
- 加载数据
- 数据预处理
- 生成频繁项集、关联规则
- 通过关联规则生成电影推荐的列表
Apriori算法
-
案例:
啤酒与尿布: 沃尔玛超市在分析销售记录时,发现了啤酒与尿布经常一起被购买,于是他们调整了货架将两者放在了一起,结果真的提升了啤酒的销量。 原因解释: 爸爸在给宝宝买尿布的时候,会顺便给自己买点啤酒? -
概述:
Apriori算法是一种最有影响力的挖掘布尔关联规则的频繁项集的算法,其命名Apriori源于算法使用了频繁项集性质的先验(Prior)知识。
接下来我们将以超市订单的例子理解关联分析相关的重要概念: Support(支持度)、Confidence(置信度)、Lift(提升度)。- Support(支持度):指某事件出现的概率,在本例中即指某个商品组合出现的次数占总次数的比例。
例:Support('Bread') = 4/5 = 0.8 Support('Milk') = 4/5 = 0.8
Support('Bread+Milk') = 3/5 = 0.6- Confidence(置信度):本质上是个条件概率,即当购买了商品A的前提下,购买商品B的概率。
例:Confidence('Bread'—> 'Milk') = Support('Bread+Milk')/ Support('Bread') = 0.6/0.8 = 0.75
- Lift(提升度): 指商品A的出现,对商品B的出现的概率的提升程度。Lift(A->B) = Confidence(A, B) / Support(B)
例:Lift('Bread'—> 'Milk') = 0.75/0.8 = 0.9375
-
对于Lift(提升度)有三种情况:
- Lift(A->B)>1: 代表A对B的出现概率有提升。
- Lift(A->B)=1: 代表A对B的出现概率没有提升,也没有下降。
- Lift(A->B)<1: 代表A对B的出现概率有下降效果。
-
原理:
该算法挖掘关联规则的过程,即是查找频繁项集(frequent itemset)的过程:- 频繁项集:支持度大于等于最小支持度(Min Support)阈值的项集。
- 非频繁集:支持度小于最小支持度的项集。
-
流程:
K = 1, 计算K项集的支持度;
筛选掉小于最小支持度的项集;
如果项集为空,则对应K-1项集的结果为最终结果。否则K = K+1重复2-3步
import pandas as pd
import matplotlib.pyplot as plt
import mlxtend
import numpy as np
电影数据准备
movie_data_file = './movie_dataset/movies_metadata.csv'
ratings_file = './movie_dataset/ratings_small.csv'
movie_data_df = pd.read_csv(movie_data_file)
ratings_df = pd.read_csv(ratings_file)
c:\users\ysilhouette\documents\pyenv\py3.6.5\lib\site-packages\IPython\core\interactiveshell.py:3072: DtypeWarning: Columns (10) have mixed types.Specify dtype option on import or set low_memory=False.
interactivity=interactivity, compiler=compiler, result=result)
movie_data_df.head(5)
adult | belongs_to_collection | budget | genres | homepage | id | imdb_id | original_language | original_title | overview | ... | release_date | revenue | runtime | spoken_languages | status | tagline | title | video | vote_average | vote_count | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | False | {'id': 10194, 'name': 'Toy Story Collection', ... | 30000000 | [{'id': 16, 'name': 'Animation'}, {'id': 35, '... | http://toystory.disney.com/toy-story | 862 | tt0114709 | en | Toy Story | Led by Woody, Andy's toys live happily in his ... | ... | 1995-10-30 | 373554033.0 | 81.0 | [{'iso_639_1': 'en', 'name': 'English'}] | Released | NaN | Toy Story | False | 7.7 | 5415.0 |
1 | False | NaN | 65000000 | [{'id': 12, 'name': 'Adventure'}, {'id': 14, '... | NaN | 8844 | tt0113497 | en | Jumanji | When siblings Judy and Peter discover an encha... | ... | 1995-12-15 | 262797249.0 | 104.0 | [{'iso_639_1': 'en', 'name': 'English'}, {'iso... | Released | Roll the dice and unleash the excitement! | Jumanji | False | 6.9 | 2413.0 |
2 | False | {'id': 119050, 'name': 'Grumpy Old Men Collect... | 0 | [{'id': 10749, 'name': 'Romance'}, {'id': 35, ... | NaN | 15602 | tt0113228 | en | Grumpier Old Men | A family wedding reignites the ancient feud be... | ... | 1995-12-22 | 0.0 | 101.0 | [{'iso_639_1': 'en', 'name': 'English'}] | Released | Still Yelling. Still Fighting. Still Ready for... | Grumpier Old Men | False | 6.5 | 92.0 |
3 | False | NaN | 16000000 | [{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam... | NaN | 31357 | tt0114885 | en | Waiting to Exhale | Cheated on, mistreated and stepped on, the wom... | ... | 1995-12-22 | 81452156.0 | 127.0 | [{'iso_639_1': 'en', 'name': 'English'}] | Released | Friends are the people who let you be yourself... | Waiting to Exhale | False | 6.1 | 34.0 |
4 | False | {'id': 96871, 'name': 'Father of the Bride Col... | 0 | [{'id': 35, 'name': 'Comedy'}] | NaN | 11862 | tt0113041 | en | Father of the Bride Part II | Just when George Banks has recovered from his ... | ... | 1995-02-10 | 76578911.0 | 106.0 | [{'iso_639_1': 'en', 'name': 'English'}] | Released | Just When His World Is Back To Normal... He's ... | Father of the Bride Part II | False | 5.7 | 173.0 |
5 rows × 24 columns
movie_data_df.describe()
revenue | runtime | vote_average | vote_count | |
---|---|---|---|---|
count | 4.546000e+04 | 45203.000000 | 45460.000000 | 45460.000000 |
mean | 1.120935e+07 | 94.128199 | 5.618207 | 109.897338 |
std | 6.433225e+07 | 38.407810 | 1.924216 | 491.310374 |
min | 0.000000e+00 | 0.000000 | 0.000000 | 0.000000 |
25% | 0.000000e+00 | 85.000000 | 5.000000 | 3.000000 |
50% | 0.000000e+00 | 95.000000 | 6.000000 | 10.000000 |
75% | 0.000000e+00 | 107.000000 | 6.800000 | 34.000000 |
max | 2.787965e+09 | 1256.000000 | 10.000000 | 14075.000000 |
movie_data_df.info
<bound method DataFrame.info of adult belongs_to_collection budget \
0 False {'id': 10194, 'name': 'Toy Story Collection', ... 30000000
1 False NaN 65000000
2 False {'id': 119050, 'name': 'Grumpy Old Men Collect... 0
3 False NaN 16000000
4 False {'id': 96871, 'name': 'Father of the Bride Col... 0
... ... ... ...
45461 False NaN 0
45462 False NaN 0
45463 False NaN 0
45464 False NaN 0
45465 False NaN 0
genres \
0 [{'id': 16, 'name': 'Animation'}, {'id': 35, '...
1 [{'id': 12, 'name': 'Adventure'}, {'id': 14, '...
2 [{'id': 10749, 'name': 'Romance'}, {'id': 35, ...
3 [{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...
4 [{'id': 35, 'name': 'Comedy'}]
... ...
45461 [{'id': 18, 'name': 'Drama'}, {'id': 10751, 'n...
45462 [{'id': 18, 'name': 'Drama'}]
45463 [{'id': 28, 'name': 'Action'}, {'id': 18, 'nam...
45464 []
45465 []
homepage id imdb_id \
0 http://toystory.disney.com/toy-story 862 tt0114709
1 NaN 8844 tt0113497
2 NaN 15602 tt0113228
3 NaN 31357 tt0114885
4 NaN 11862 tt0113041
... ... ... ...
45461 http://www.imdb.com/title/tt6209470/ 439050 tt6209470
45462 NaN 111109 tt2028550
45463 NaN 67758 tt0303758
45464 NaN 227506 tt0008536
45465 NaN 461257 tt6980792
original_language original_title \
0 en Toy Story
1 en Jumanji
2 en Grumpier Old Men
3 en Waiting to Exhale
4 en Father of the Bride Part II
... ... ...
45461 fa رگ خواب
45462 tl Siglo ng Pagluluwal
45463 en Betrayal
45464 en Satana likuyushchiy
45465 en Queerama
overview ... release_date \
0 Led by Woody, Andy's toys live happily in his ... ... 1995-10-30
1 When siblings Judy and Peter discover an encha... ... 1995-12-15
2 A family wedding reignites the ancient feud be... ... 1995-12-22
3 Cheated on, mistreated and stepped on, the wom... ... 1995-12-22
4 Just when George Banks has recovered from his ... ... 1995-02-10
... ... ... ...
45461 Rising and falling between a man and woman. ... NaN
45462 An artist struggles to finish his work while a... ... 2011-11-17
45463 When one of her hits goes wrong, a professiona... ... 2003-08-01
45464 In a small town live two brothers, one a minis... ... 1917-10-21
45465 50 years after decriminalisation of homosexual... ... 2017-06-09
revenue runtime spoken_languages \
0 373554033.0 81.0 [{'iso_639_1': 'en', 'name': 'English'}]
1 262797249.0 104.0 [{'iso_639_1': 'en', 'name': 'English'}, {'iso...
2 0.0 101.0 [{'iso_639_1': 'en', 'name': 'English'}]
3 81452156.0 127.0 [{'iso_639_1': 'en', 'name': 'English'}]
4 76578911.0 106.0 [{'iso_639_1': 'en', 'name': 'English'}]
... ... ... ...
45461 0.0 90.0 [{'iso_639_1': 'fa', 'name': 'فارسی'}]
45462 0.0 360.0 [{'iso_639_1': 'tl', 'name': ''}]
45463 0.0 90.0 [{'iso_639_1': 'en', 'name': 'English'}]
45464 0.0 87.0 []
45465 0.0 75.0 [{'iso_639_1': 'en', 'name': 'English'}]
status tagline \
0 Released NaN
1 Released Roll the dice and unleash the excitement!
2 Released Still Yelling. Still Fighting. Still Ready for...
3 Released Friends are the people who let you be yourself...
4 Released Just When His World Is Back To Normal... He's ...
... ... ...
45461 Released Rising and falling between a man and woman
45462 Released NaN
45463 Released A deadly game of wits.
45464 Released NaN
45465 Released NaN
title video vote_average vote_count
0 Toy Story False 7.7 5415.0
1 Jumanji False 6.9 2413.0
2 Grumpier Old Men False 6.5 92.0
3 Waiting to Exhale False 6.1 34.0
4 Father of the Bride Part II False 5.7 173.0
... ... ... ... ...
45461 Subdue False 4.0 1.0
45462 Century of Birthing False 9.0 3.0
45463 Betrayal False 3.8 6.0
45464 Satan Triumphant False 0.0 0.0
45465 Queerama False 0.0 0.0
[45466 rows x 24 columns]>
movie_data_df.count()
adult 45466
belongs_to_collection 4494
budget 45466
genres 45466
homepage 7782
id 45466
imdb_id 45449
original_language 45455
original_title 45466
overview 44512
popularity 45461
poster_path 45080
production_companies 45463
production_countries 45463
release_date 45379
revenue 45460
runtime 45203
spoken_languages 45460
status 45379
tagline 20412
title 45460
video 45460
vote_average 45460
vote_count 45460
dtype: int64
movie_data_df.columns
Index(['adult', 'belongs_to_collection', 'budget', 'genres', 'homepage', 'id',
'imdb_id', 'original_language', 'original_title', 'overview',
'popularity', 'poster_path', 'production_companies',
'production_countries', 'release_date', 'revenue', 'runtime',
'spoken_languages', 'status', 'tagline', 'title', 'video',
'vote_average', 'vote_count'],
dtype='object')
ratings_df.head(5)
userId | movieId | rating | timestamp | |
---|---|---|---|---|
0 | 1 | 31 | 2.5 | 1260759144 |
1 | 1 | 1029 | 3.0 | 1260759179 |
2 | 1 | 1061 | 3.0 | 1260759182 |
3 | 1 | 1129 | 2.0 | 1260759185 |
4 | 1 | 1172 | 4.0 | 1260759205 |
ratings_df.columns
Index(['userId', 'movieId', 'rating', 'timestamp'], dtype='object')
ratings_df.count()
userId 100004
movieId 100004
rating 100004
timestamp 100004
dtype: int64
ratings_df.shape
(100004, 4)
movie_data_df.shape
(45466, 24)
数据预处理
- 缺失值处理
- 数据去重
- 电影源信息 merge 电影评分信息
movie_data_df_t=movie_data_df[['title','id']]
movie_data_df_t.dtypes
title object
id object
dtype: object
ratings_df_s = ratings_df.drop(['timestamp'], axis=1) #axis=0 跨列删除行 ,axis=1 跨行删除列
ratings_df_s.dtypes
userId int64
movieId int64
rating float64
dtype: object
缺失值处理
- pandas中用NaN(Not a Number)表示浮点数和非浮点数数组中的缺失值,同时python中None值也被当作缺失值。
# pd.to_numeric 将id列 的数据 由字符串转为数值类型, 不能转换的数据设置为NaN
pd.to_numeric(movie_data_df_t['id'],errors='coerce')
0 862.0
1 8844.0
2 15602.0
3 31357.0
4 11862.0
...
45461 439050.0
45462 111109.0
45463 67758.0
45464 227506.0
45465 461257.0
Name: id, Length: 45466, dtype: float64
#np.where返回满足()内条件的数据所在的位置
np.where(pd.to_numeric(movie_data_df_t['id'], errors='coerce').isna()) #返回缺失值的位置,其中isna() 对于NaN返回True,否则返回False
(array([19730, 29503, 35587], dtype=int64),)
- loc works on labels in the index.
- loc为Selection by Label函数,即为按标签取数据,标签是什么,就是上面的’0’~‘4’, ‘A’~‘B’。
- 例如第一个参数选择index,第二个参数选择column
- 建议写df.loc[0, :],这样可以清楚的看出为第0行的所有记录,同样如果取第’A’列的所有记录,可以写df.loc[:, ‘A’],如下图:
- :表示所有,[]里边为先行后列
- iloc works on the positions in the index (so it only takes integers).
- iloc函数为Selection by Position,即按位置选择数据,即第n行,第n列数据,只接受整型参数, 比如 0:2为左闭右开区间,即取0,1
- 若要取第一列的所有数据,则为df.iloc[:, 0],不接受’A’作为参数
movie_data_df_t.iloc[19730]
title NaN
id 1997-08-20
Name: 19730, dtype: object
movie_data_df_t.iloc[[19730,29503,35587]]
title | id | |
---|---|---|
19730 | NaN | 1997-08-20 |
29503 | NaN | 2012-09-29 |
35587 | NaN | 2014-01-01 |
# 将格式转换后的数据 赋值给id列
movie_data_df_t['id'] = pd.to_numeric(movie_data_df_t['id'], errors='coerce')
c:\users\ysilhouette\documents\pyenv\py3.6.5\lib\site-packages\ipykernel_launcher.py:2: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
movie_data_df_t.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45466 entries, 0 to 45465
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 title 45460 non-null object
1 id 45463 non-null float64
dtypes: float64(1), object(1)
memory usage: 710.5+ KB
movie_data_df_t.iloc[[19730,29503,35587]]
title | id | |
---|---|---|
19730 | NaN | NaN |
29503 | NaN | NaN |
35587 | NaN | NaN |
movie_data_df_t.shape
(45466, 2)
movie_data_df_t.drop(np.where(movie_data_df_t['id'].isna())[0], inplace=True)
c:\users\ysilhouette\documents\pyenv\py3.6.5\lib\site-packages\pandas\core\frame.py:4174: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
errors=errors,
movie_data_df_t.shape
(45463, 2)
数据去重
movie_data_df_t.duplicated(['id','title']).sum()
30
movie_data_df_t.drop_duplicates(['id'],inplace=True)
c:\users\ysilhouette\documents\pyenv\py3.6.5\lib\site-packages\ipykernel_launcher.py:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
"""Entry point for launching an IPython kernel.
movie_data_df_t.shape
(45433, 2)
ratings_df_s.duplicated(['userId','movieId']).sum()
0
movie_data_df_t['id'] = movie_data_df_t['id'].astype(np.int64)
c:\users\ysilhouette\documents\pyenv\py3.6.5\lib\site-packages\ipykernel_launcher.py:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
"""Entry point for launching an IPython kernel.
movie_data_df_t.dtypes
title object
id int64
dtype: object
ratings_df_s.dtypes
userId int64
movieId int64
rating float64
dtype: object
数据合并
# 左dataframe 和 右dataframe 根据 movieId 和 id进行合并
ratings_df_s = pd.merge(ratings_df_s,movie_data_df_t, left_on='movieId',right_on='id')
ratings_df_s.head()
userId | movieId | rating | title | id | |
---|---|---|---|---|---|
0 | 1 | 1371 | 2.5 | Rocky III | 1371 |
1 | 4 | 1371 | 4.0 | Rocky III | 1371 |
2 | 7 | 1371 | 3.0 | Rocky III | 1371 |
3 | 19 | 1371 | 4.0 | Rocky III | 1371 |
4 | 21 | 1371 | 3.0 | Rocky III | 1371 |
ratings_df_s.drop(['id'],axis=1,inplace=True)
ratings_df_s
userId | movieId | rating | title | |
---|---|---|---|---|
0 | 1 | 1371 | 2.5 | Rocky III |
1 | 4 | 1371 | 4.0 | Rocky III |
2 | 7 | 1371 | 3.0 | Rocky III |
3 | 19 | 1371 | 4.0 | Rocky III |
4 | 21 | 1371 | 3.0 | Rocky III |
... | ... | ... | ... | ... |
44984 | 652 | 129009 | 4.0 | Love Is a Ball |
44985 | 653 | 2103 | 3.0 | Solaris |
44986 | 659 | 167 | 4.0 | K-PAX |
44987 | 659 | 563 | 3.0 | Starship Troopers |
44988 | 665 | 129 | 3.0 | Spirited Away |
44989 rows × 4 columns
ratings_df_s.shape
(44989, 4)
# 有评分记录的电影的个数
len(ratings_df_s['title'].unique())
2794
ratings_df_s['title'].unique()
array(['Rocky III', 'Greed', 'American Pie', ..., 'K-PAX',
'Starship Troopers', 'Spirited Away'], dtype=object)
ratings_df_s.groupby([ratings_df_s['title'],ratings_df_s['rating']]).count().reset_index()
title | rating | userId | movieId | |
---|---|---|---|---|
0 | !Women Art Revolution | 3.0 | 1 | 1 |
1 | !Women Art Revolution | 3.5 | 1 | 1 |
2 | 'Gator Bait | 0.5 | 1 | 1 |
3 | 'Twas the Night Before Christmas | 3.5 | 1 | 1 |
4 | 'Twas the Night Before Christmas | 4.5 | 1 | 1 |
... | ... | ... | ... | ... |
10263 | À nos amours | 4.0 | 5 | 5 |
10264 | À nos amours | 4.5 | 1 | 1 |
10265 | À nos amours | 5.0 | 1 | 1 |
10266 | Ödipussi | 4.5 | 1 | 1 |
10267 | Şaban Oğlu Şaban | 4.5 | 1 | 1 |
10268 rows × 4 columns
ratings_df_s.groupby(ratings_df_s['title']).count().reset_index()
title | userId | movieId | rating | |
---|---|---|---|---|
0 | !Women Art Revolution | 2 | 2 | 2 |
1 | 'Gator Bait | 1 | 1 | 1 |
2 | 'Twas the Night Before Christmas | 2 | 2 | 2 |
3 | ...And God Created Woman | 1 | 1 | 1 |
4 | 00 Schneider - Jagd auf Nihil Baxter | 2 | 2 | 2 |
... | ... | ... | ... | ... |
2789 | xXx | 28 | 28 | 28 |
2790 | ¡Three Amigos! | 1 | 1 | 1 |
2791 | À nos amours | 14 | 14 | 14 |
2792 | Ödipussi | 1 | 1 | 1 |
2793 | Şaban Oğlu Şaban | 1 | 1 | 1 |
2794 rows × 4 columns
ratings_df_s_allcounts = ratings_df_s.groupby(ratings_df_s['title'])['userId'].count().reset_index()
ratings_df_s_allcounts = ratings_df_s_allcounts.rename(columns = {'userId':'totalRatings'})
ratings_df_s_allcounts
title | totalRatings | |
---|---|---|
0 | !Women Art Revolution | 2 |
1 | 'Gator Bait | 1 |
2 | 'Twas the Night Before Christmas | 2 |
3 | ...And God Created Woman | 1 |
4 | 00 Schneider - Jagd auf Nihil Baxter | 2 |
... | ... | ... |
2789 | xXx | 28 |
2790 | ¡Three Amigos! | 1 |
2791 | À nos amours | 14 |
2792 | Ödipussi | 1 |
2793 | Şaban Oğlu Şaban | 1 |
2794 rows × 2 columns
ratings_df_s_allcounts.shape
(2794, 2)
ratings_df_s_allcounts['totalRatings'].describe()
count 2794.000000
mean 16.102004
std 31.481795
min 1.000000
25% 1.000000
50% 4.000000
75% 15.750000
max 324.000000
Name: totalRatings, dtype: float64
ratings_df_s_allcounts.hist()
array([[<AxesSubplot:title={'center':'totalRatings'}>]], dtype=object)
ratings_df_s_allcounts['totalRatings'].quantile(np.arange(0.6,1, 0.01)) #分位点
0.60 7.00
0.61 7.00
0.62 7.00
0.63 8.00
0.64 8.00
0.65 9.00
0.66 9.00
0.67 10.00
0.68 10.00
0.69 11.00
0.70 12.00
0.71 12.00
0.72 13.00
0.73 14.00
0.74 14.00
0.75 15.75
0.76 17.00
0.77 18.00
0.78 19.00
0.79 20.00
0.80 21.00
0.81 22.33
0.82 24.00
0.83 26.00
0.84 27.00
0.85 29.00
0.86 31.00
0.87 34.00
0.88 37.00
0.89 41.77
0.90 45.00
0.91 49.00
0.92 52.56
0.93 59.00
0.94 64.42
0.95 71.00
0.96 83.28
0.97 98.21
0.98 119.14
0.99 168.49
Name: totalRatings, dtype: float64
- 从分位点数据分析可以看出,21%的电影 评分记录数超过20个
votes_count_threshold = 20
ratings_df_s_top=ratings_df_s_allcounts.query('totalRatings > @votes_count_threshold').reset_index()
ratings_df_s_top
index | title | totalRatings | |
---|---|---|---|
0 | 18 | 20,000 Leagues Under the Sea | 89 |
1 | 19 | 2001: A Space Odyssey | 87 |
2 | 24 | 24 Hour Party People | 22 |
3 | 26 | 28 Days Later | 26 |
4 | 27 | 28 Weeks Later | 47 |
... | ... | ... | ... |
575 | 2770 | Young Adam | 34 |
576 | 2772 | Young Frankenstein | 29 |
577 | 2774 | Young and Innocent | 193 |
578 | 2781 | Zatoichi | 61 |
579 | 2789 | xXx | 28 |
580 rows × 3 columns
ratings_df_s_top.drop(['index'],axis=1,inplace=True)
ratings_df_s_top.head()
title | totalRatings | |
---|---|---|
0 | 20,000 Leagues Under the Sea | 89 |
1 | 2001: A Space Odyssey | 87 |
2 | 24 Hour Party People | 22 |
3 | 28 Days Later | 26 |
4 | 28 Weeks Later | 47 |
ratings_df_s['title']
0 Rocky III
1 Rocky III
2 Rocky III
3 Rocky III
4 Rocky III
...
44984 Love Is a Ball
44985 Solaris
44986 K-PAX
44987 Starship Troopers
44988 Spirited Away
Name: title, Length: 44989, dtype: object
ratings_df_s_top['title']
0 20,000 Leagues Under the Sea
1 2001: A Space Odyssey
2 24 Hour Party People
3 28 Days Later
4 28 Weeks Later
...
575 Young Adam
576 Young Frankenstein
577 Young and Innocent
578 Zatoichi
579 xXx
Name: title, Length: 580, dtype: object
ratings_df_s[ratings_df_s['title'].isin(ratings_df_s_top['title'])]
userId | movieId | rating | title | |
---|---|---|---|---|
0 | 1 | 1371 | 2.5 | Rocky III |
1 | 4 | 1371 | 4.0 | Rocky III |
2 | 7 | 1371 | 3.0 | Rocky III |
3 | 19 | 1371 | 4.0 | Rocky III |
4 | 21 | 1371 | 3.0 | Rocky III |
... | ... | ... | ... | ... |
44507 | 624 | 3057 | 4.0 | Frankenstein |
44781 | 547 | 97936 | 3.0 | Sweet November |
44782 | 624 | 97936 | 3.0 | Sweet November |
44909 | 609 | 1450 | 5.0 | Blood: The Last Vampire |
44985 | 653 | 2103 | 3.0 | Solaris |
34552 rows × 4 columns
ratings_df_s[ratings_df_s['title'].isin(ratings_df_s_top['title'])] #得到评分数量大于20的
userId | movieId | rating | title | |
---|---|---|---|---|
0 | 1 | 1371 | 2.5 | Rocky III |
1 | 4 | 1371 | 4.0 | Rocky III |
2 | 7 | 1371 | 3.0 | Rocky III |
3 | 19 | 1371 | 4.0 | Rocky III |
4 | 21 | 1371 | 3.0 | Rocky III |
... | ... | ... | ... | ... |
44507 | 624 | 3057 | 4.0 | Frankenstein |
44781 | 547 | 97936 | 3.0 | Sweet November |
44782 | 624 | 97936 | 3.0 | Sweet November |
44909 | 609 | 1450 | 5.0 | Blood: The Last Vampire |
44985 | 653 | 2103 | 3.0 | Solaris |
34552 rows × 4 columns
ratings_df_s[~ratings_df_s['title'].isin(ratings_df_s_top['title'])] # 得到评分数量小于20的
userId | movieId | rating | title | |
---|---|---|---|---|
1714 | 2 | 248 | 3.0 | Pocketful of Miracles |
1715 | 36 | 248 | 2.0 | Pocketful of Miracles |
1716 | 110 | 248 | 4.0 | Pocketful of Miracles |
1717 | 239 | 248 | 4.0 | Pocketful of Miracles |
1718 | 242 | 248 | 3.0 | Pocketful of Miracles |
... | ... | ... | ... | ... |
44983 | 652 | 127728 | 5.0 | 8:46 |
44984 | 652 | 129009 | 4.0 | Love Is a Ball |
44986 | 659 | 167 | 4.0 | K-PAX |
44987 | 659 | 563 | 3.0 | Starship Troopers |
44988 | 665 | 129 | 3.0 | Spirited Away |
10437 rows × 4 columns
ratings_df_s_cntD20 = ratings_df_s[ratings_df_s['title'].isin(ratings_df_s_top['title'])]
ratings_df_s_cntX20 = ratings_df_s[~ratings_df_s['title'].isin(ratings_df_s_top['title'])]
ratings_df_s_cntD20.shape
(34552, 4)
ratings_df_s_cntX20.shape
(10437, 4)
ratings_df_s_cntD20.isna().sum() #检查有无缺失值
userId 0
movieId 0
rating 0
title 0
dtype: int64
ratings_df_s_cntD20.duplicated(['userId','title']).sum()
140
ratings_df_s_cntD20=ratings_df_s_cntD20.drop_duplicates(['userId','title']) # 只保留每个用户对每个电影的一条评论记录
ratings_df_s_cntD20
userId | movieId | rating | title | |
---|---|---|---|---|
0 | 1 | 1371 | 2.5 | Rocky III |
1 | 4 | 1371 | 4.0 | Rocky III |
2 | 7 | 1371 | 3.0 | Rocky III |
3 | 19 | 1371 | 4.0 | Rocky III |
4 | 21 | 1371 | 3.0 | Rocky III |
... | ... | ... | ... | ... |
44506 | 472 | 3057 | 3.0 | Frankenstein |
44507 | 624 | 3057 | 4.0 | Frankenstein |
44782 | 624 | 97936 | 3.0 | Sweet November |
44909 | 609 | 1450 | 5.0 | Blood: The Last Vampire |
44985 | 653 | 2103 | 3.0 | Solaris |
34412 rows × 4 columns
ratings_df_s_cntD20.duplicated(['userId','title']).sum()
0
# 将一个dataframe的记录数据整合成表格,而且是按照pivot(‘index=xx’,’columns=xx’,’values=xx’)来整合的。还有另外一种写法,就是pivot(‘索引列’,‘列名’,‘值’)。
ratings_df_s_cntD20_for_apriori = ratings_df_s_cntD20.pivot(index='userId',columns='title',values='rating')
ratings_df_s_cntD20_for_apriori
title | 20,000 Leagues Under the Sea | 2001: A Space Odyssey | 24 Hour Party People | 28 Days Later | 28 Weeks Later | 300 | 48 Hrs. | 5 Card Stud | 7 Virgins | 8 Women | ... | Within the Woods | X-Men Origins: Wolverine | Y Tu Mamá También | Yankee Doodle Dandy | Yesterday | Young Adam | Young Frankenstein | Young and Innocent | Zatoichi | xXx |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
userId | |||||||||||||||||||||
1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 | NaN | 3.0 | NaN | NaN | NaN | NaN | 5.0 | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3 | NaN | NaN | NaN | NaN | NaN | 3.0 | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 3.5 | NaN | NaN |
4 | 3.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | 5.0 | NaN | NaN | NaN | NaN | 5.0 | NaN | NaN | NaN |
5 | NaN | NaN | NaN | NaN | NaN | NaN | 4.0 | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 3.5 | NaN | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
667 | NaN | NaN | NaN | NaN | NaN | NaN | 4.0 | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
668 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
669 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
670 | NaN | NaN | NaN | NaN | NaN | NaN | 3.0 | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
671 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 5.0 | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 4.0 | NaN | NaN |
671 rows × 580 columns
ratings_df_s_cntD20_for_apriori= ratings_df_s_cntD20_for_apriori.fillna(0) #缺失值 填充0
def encode_units(x): # 有效评分规则, 1表示有效,0 表示无效
if x <= 0:
return 0
if x>0:
return 1
ratings_df_s_cntD20_for_apriori = ratings_df_s_cntD20_for_apriori.applymap(encode_units)
计算频繁项集 和关联规则
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules
ratings_df_s_cntD20_for_apriori.head()
title | 20,000 Leagues Under the Sea | 2001: A Space Odyssey | 24 Hour Party People | 28 Days Later | 28 Weeks Later | 300 | 48 Hrs. | 5 Card Stud | 7 Virgins | 8 Women | ... | Within the Woods | X-Men Origins: Wolverine | Y Tu Mamá También | Yankee Doodle Dandy | Yesterday | Young Adam | Young Frankenstein | Young and Innocent | Zatoichi | xXx |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
userId | |||||||||||||||||||||
1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
3 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
4 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
5 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
5 rows × 580 columns
ratings_df_s_cntD20_for_apriori.isna().sum() #检查是否有nan值
title
20,000 Leagues Under the Sea 0
2001: A Space Odyssey 0
24 Hour Party People 0
28 Days Later 0
28 Weeks Later 0
..
Young Adam 0
Young Frankenstein 0
Young and Innocent 0
Zatoichi 0
xXx 0
Length: 580, dtype: int64
frequent_itemsets = apriori(ratings_df_s_cntD20_for_apriori, min_support=0.10, use_colnames=True) #生成符合条件的频繁项集
frequent_itemsets.sort_values('support',ascending=False) #support降序排列的频繁项集
support | itemsets | |
---|---|---|
111 | 0.482861 | (Terminator 3: Rise of the Machines) |
130 | 0.463487 | (The Million Dollar Hotel) |
105 | 0.454545 | (Solaris) |
113 | 0.433681 | (The 39 Steps) |
69 | 0.408346 | (Monsoon Wedding) |
... | ... | ... |
1613 | 0.101341 | (Sleepless in Seattle, 5 Card Stud, The Tunnel) |
5455 | 0.101341 | (Beauty and the Beast, Rain Man, Terminator 3:... |
5454 | 0.101341 | (The Passion of Joan of Arc, Beauty and the Be... |
6769 | 0.101341 | (The Million Dollar Hotel, The Hours, Three Co... |
3108 | 0.101341 | (The Conversation, Men in Black II, The Millio... |
7327 rows × 2 columns
rules= association_rules(frequent_itemsets, metric="lift", min_threshold=1) #生成关联规则,只保留lift>1的部分
rules
antecedents | consequents | antecedent support | consequent support | support | confidence | lift | leverage | conviction | |
---|---|---|---|---|---|---|---|---|---|
0 | (5 Card Stud) | (48 Hrs.) | 0.298063 | 0.298063 | 0.108793 | 0.365000 | 1.224575 | 0.019952 | 1.105413 |
1 | (48 Hrs.) | (5 Card Stud) | 0.298063 | 0.298063 | 0.108793 | 0.365000 | 1.224575 | 0.019952 | 1.105413 |
2 | (A Clockwork Orange) | (48 Hrs.) | 0.152012 | 0.298063 | 0.102832 | 0.676471 | 2.269559 | 0.057523 | 2.169625 |
3 | (48 Hrs.) | (A Clockwork Orange) | 0.298063 | 0.152012 | 0.102832 | 0.345000 | 2.269559 | 0.057523 | 1.294638 |
4 | (48 Hrs.) | (A Nightmare on Elm Street) | 0.298063 | 0.268256 | 0.156483 | 0.525000 | 1.957083 | 0.076526 | 1.540513 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
75531 | (The Hours) | (The Million Dollar Hotel, Terminator 3: Rise ... | 0.301043 | 0.126677 | 0.104322 | 0.346535 | 2.735585 | 0.066187 | 1.336449 |
75532 | (Terminator 3: Rise of the Machines) | (The Million Dollar Hotel, The Hours, Rain Man... | 0.482861 | 0.114754 | 0.104322 | 0.216049 | 1.882716 | 0.048912 | 1.129211 |
75533 | (Rain Man) | (The Million Dollar Hotel, The Hours, Terminat... | 0.295082 | 0.120715 | 0.104322 | 0.353535 | 2.928669 | 0.068701 | 1.360143 |
75534 | (Sissi) | (The Million Dollar Hotel, The Hours, Terminat... | 0.317437 | 0.117735 | 0.104322 | 0.328638 | 2.791347 | 0.066949 | 1.314143 |
75535 | (Solaris) | (The Million Dollar Hotel, The Hours, Terminat... | 0.454545 | 0.113264 | 0.104322 | 0.229508 | 2.026316 | 0.052838 | 1.150870 |
75536 rows × 9 columns
rules.sort_values('lift',ascending=False)
antecedents | consequents | antecedent support | consequent support | support | confidence | lift | leverage | conviction | |
---|---|---|---|---|---|---|---|---|---|
1473 | (Muxmäuschenstill) | (Waiter) | 0.156483 | 0.120715 | 0.105812 | 0.676190 | 5.601529 | 0.086922 | 2.715438 |
1472 | (Waiter) | (Muxmäuschenstill) | 0.120715 | 0.156483 | 0.105812 | 0.876543 | 5.601529 | 0.086922 | 6.832489 |
38208 | (Titanic, Big Fish) | (Psycho, Rain Man) | 0.150522 | 0.131148 | 0.101341 | 0.673267 | 5.133663 | 0.081601 | 2.659215 |
38209 | (Psycho, Rain Man) | (Titanic, Big Fish) | 0.131148 | 0.150522 | 0.101341 | 0.772727 | 5.133663 | 0.081601 | 3.737705 |
38238 | (Titanic, Big Fish) | (Psycho, Solaris) | 0.150522 | 0.134128 | 0.102832 | 0.683168 | 5.093399 | 0.082642 | 2.732908 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
108 | (5 Card Stud) | (Men in Black II) | 0.298063 | 0.333830 | 0.110283 | 0.370000 | 1.108348 | 0.010781 | 1.057413 |
571 | (Bang, Boom, Bang) | (The 39 Steps) | 0.260805 | 0.433681 | 0.125186 | 0.480000 | 1.106804 | 0.012080 | 1.089075 |
570 | (The 39 Steps) | (Bang, Boom, Bang) | 0.433681 | 0.260805 | 0.125186 | 0.288660 | 1.106804 | 0.012080 | 1.039159 |
1137 | (Sissi) | (License to Wed) | 0.317437 | 0.301043 | 0.102832 | 0.323944 | 1.076070 | 0.007269 | 1.033874 |
1136 | (License to Wed) | (Sissi) | 0.301043 | 0.317437 | 0.102832 | 0.341584 | 1.076070 | 0.007269 | 1.036675 |
75536 rows × 9 columns
- 结果说明:上述输出的即为所有关联规则的结果,每一行代表一个关联规则,其中行号1473所在的关联规则(Waiter->Muxmauschenstill)关联度最高(conviction值越大,代表antecedents与consequents的关联度越大))。
电影推荐
推荐电影列表
all_antecedents = [list(x) for x in rules['antecedents'].values]
desired_indices = [i for i in range(len(all_antecedents)) if len(all_antecedents[i]) == 1 and all_antecedents[i][0] == 'Batman Returns']
apriori_recommendations =rules.iloc[desired_indices,].sort_values(by=['lift'],ascending=False)
apriori_recommendations.head()
antecedents | consequents | antecedent support | consequent support | support | confidence | lift | leverage | conviction | |
---|---|---|---|---|---|---|---|---|---|
63981 | (Batman Returns) | (The Hours, Monsoon Wedding, Silent Hill, Rese... | 0.298063 | 0.107303 | 0.102832 | 0.345 | 3.215208 | 0.070849 | 1.362897 |
36084 | (Batman Returns) | (Reservoir Dogs, Wag the Dog, Silent Hill) | 0.298063 | 0.105812 | 0.101341 | 0.340 | 3.213239 | 0.069803 | 1.354830 |
63891 | (Batman Returns) | (Monsoon Wedding, Silent Hill, Reservoir Dogs,... | 0.298063 | 0.107303 | 0.101341 | 0.340 | 3.168611 | 0.069358 | 1.352572 |
63351 | (Batman Returns) | (Monsoon Wedding, Silent Hill, Reservoir Dogs,... | 0.298063 | 0.107303 | 0.101341 | 0.340 | 3.168611 | 0.069358 | 1.352572 |
36014 | (Batman Returns) | (The Hours, Reservoir Dogs, Silent Hill) | 0.298063 | 0.116244 | 0.108793 | 0.365 | 3.139936 | 0.074145 | 1.391741 |
apriori_recommendations_list = [list(x) for x in apriori_recommendations['consequents'].values]
print("Apriori Recommendations for movie: Batman Returns\n")
for i in range(5):
print("{0}:{1} with lift of {2}" .format(i+1, apriori_recommendations_list[i], apriori_recommendations.iloc[i,6]))
Apriori Recommendations for movie: Batman Returns
1:['The Hours', 'Monsoon Wedding', 'Silent Hill', 'Reservoir Dogs'] with lift of 3.215208333333333
2:['Reservoir Dogs', 'Wag the Dog', 'Silent Hill'] with lift of 3.2132394366197183
3:['Monsoon Wedding', 'Silent Hill', 'Reservoir Dogs', 'Sissi'] with lift of 3.168611111111111
4:['Monsoon Wedding', 'Silent Hill', 'Reservoir Dogs', 'Rain Man'] with lift of 3.168611111111111
5:['The Hours', 'Reservoir Dogs', 'Silent Hill'] with lift of 3.139935897435898
推荐单部电影
apriori_single_recommendations = apriori_recommendations.iloc[[x for x in range(len(apriori_recommendations_list)) if len(apriori_recommendations_list[x]) ==1],]
apriori_single_recommendations_list = [list(x) for x in apriori_single_recommendations['consequents'].values]
print("Apriori single-movie Recommendations for movie: Batman Returns\n")
for i in range(5):
print("{0}: {1}, with lift of {2}".format(i+1,apriori_single_recommendations_list[i][0],apriori_single_recommendations.iloc[i,6]))
Apriori single-movie Recommendations for movie: Batman Returns
1: Reservoir Dogs, with lift of 2.6094444444444447
2: Ariel, with lift of 2.5397663551401872
3: Wag the Dog, with lift of 2.496744186046512
4: To Kill a Mockingbird, with lift of 2.478125
5: Romeo + Juliet, with lift of 2.4705000000000004
- 结果说明:我们约束consequents(后件)的长度为1,选出lift降序排列的前五个关联规则(关联规则格式为前件——>后件)。对于用户观看的电影记录《Batman Returns》,即antecedents(前件),我们根据规则按照推荐程度降序给出了单部电影推荐结果
协同过滤
基于user的协同过滤
- 在海量的用户中发现一小部分和你品味比较相近的,在协同过滤中,这些用户称为邻居,然后根据他们喜欢的东西组织成一个排序的目录来推荐给你
- 重点就是怎样去寻找和你比较相似的用户,怎么将那些邻居的喜好组织成一个排序的目录给用户
- 在世纪钟给出一个数字K表示和你最为相似的用户。
- 在计算相似度的时候,理论上要计算被推荐的用户与所有用户的相似度,但是当数据量比较大的时候,这样做是很费时间的 ,
- 数据集中可能有很多用户和需要被推荐的用户是没有关系的, 在计算是完全是没有必要的,
- 所以需要物品到用户的反查表,也就是没一件物品对应的用户信息,有了这个表,就可以过滤掉很多和你没有关系的用户,减少计算量。
- 总结来说,推荐的过程就是先计算用户之间的相似度,根据相似度的高低选取前K个用户,在这K个用户中计算每一件物品的推荐程度。
- 重点就是怎样去寻找和你比较相似的用户,怎么将那些邻居的喜好组织成一个排序的目录给用户
# 读取ratings_small.csv数据用于建模
ratings_small_path = "./movie_dataset/ratings_small.csv"
ratings_small_df = pd.read_csv(ratings_small_path)
ratings_small_df.shape
(100004, 4)
ratings_small_df.head()
userId | movieId | rating | timestamp | |
---|---|---|---|---|
0 | 1 | 31 | 2.5 | 1260759144 |
1 | 1 | 1029 | 3.0 | 1260759179 |
2 | 1 | 1061 | 3.0 | 1260759182 |
3 | 1 | 1129 | 2.0 | 1260759185 |
4 | 1 | 1172 | 4.0 | 1260759205 |
# 原始的movieId 并非从0到1 的连续值, 为方便更贱user-item矩阵, 重新排列movie_id
movie_id = ratings_small_df['movieId'].drop_duplicates()
movie_id = pd.DataFrame(movie_id)
movie_id['movieid'] = range(len(movie_id))
movie_id
movieId | movieid | |
---|---|---|
0 | 31 | 0 |
1 | 1029 | 1 |
2 | 1061 | 2 |
3 | 1129 | 3 |
4 | 1172 | 4 |
... | ... | ... |
99131 | 64997 | 9061 |
99159 | 72380 | 9062 |
99274 | 129 | 9063 |
99678 | 4736 | 9064 |
99820 | 6425 | 9065 |
9066 rows × 2 columns
ratings_small_df = pd.merge(ratings_small_df, movie_id, on =['movieId'], how='left')
ratings_small_df
userId | movieId | rating | timestamp | movieid | |
---|---|---|---|---|---|
0 | 1 | 31 | 2.5 | 1260759144 | 0 |
1 | 1 | 1029 | 3.0 | 1260759179 | 1 |
2 | 1 | 1061 | 3.0 | 1260759182 | 2 |
3 | 1 | 1129 | 2.0 | 1260759185 | 3 |
4 | 1 | 1172 | 4.0 | 1260759205 | 4 |
... | ... | ... | ... | ... | ... |
99999 | 671 | 6268 | 2.5 | 1065579370 | 7005 |
100000 | 671 | 6269 | 4.0 | 1065149201 | 4771 |
100001 | 671 | 6365 | 4.0 | 1070940363 | 1329 |
100002 | 671 | 6385 | 2.5 | 1070979663 | 1331 |
100003 | 671 | 6565 | 3.5 | 1074784724 | 2946 |
100004 rows × 5 columns
ratings_small_df = ratings_small_df[['userId','movieid','rating','timestamp']] #更新 movieId ----> movieid
ratings_small_df
userId | movieid | rating | timestamp | |
---|---|---|---|---|
0 | 1 | 0 | 2.5 | 1260759144 |
1 | 1 | 1 | 3.0 | 1260759179 |
2 | 1 | 2 | 3.0 | 1260759182 |
3 | 1 | 3 | 2.0 | 1260759185 |
4 | 1 | 4 | 4.0 | 1260759205 |
... | ... | ... | ... | ... |
99999 | 671 | 7005 | 2.5 | 1065579370 |
100000 | 671 | 4771 | 4.0 | 1065149201 |
100001 | 671 | 1329 | 4.0 | 1070940363 |
100002 | 671 | 1331 | 2.5 | 1070979663 |
100003 | 671 | 2946 | 3.5 | 1074784724 |
100004 rows × 4 columns
# 用户物品统计
# unique()是以 数组形式(numpy.ndarray)返回列的所有唯一值(特征的所有唯一值)
# nunique() Return number of unique elements in the object.即返回的是唯一值的个数
n_users = ratings_small_df.userId.nunique()
n_users
671
n_items = ratings_small_df.movieid.nunique()
n_items
9066
# 拆分数据集
from sklearn.model_selection import train_test_split
#按照训练集70% 测试集30%的比例 对数据进行拆分
train_data,test_data = train_test_split(ratings_small_df,test_size= 0.3)
train_data
userId | movieid | rating | timestamp | |
---|---|---|---|---|
69526 | 481 | 329 | 4.0 | 1437001087 |
41670 | 299 | 917 | 3.5 | 1344188856 |
49260 | 358 | 288 | 2.0 | 957480147 |
39317 | 287 | 3582 | 4.0 | 1470168974 |
35991 | 262 | 2094 | 3.0 | 1433899624 |
... | ... | ... | ... | ... |
6262 | 33 | 1095 | 2.0 | 1032769543 |
8504 | 56 | 367 | 2.0 | 1467005360 |
8540 | 56 | 1435 | 4.0 | 1467006577 |
77937 | 542 | 1496 | 1.0 | 1424966216 |
94226 | 624 | 476 | 3.0 | 1053249671 |
70002 rows × 4 columns
# 训练集 用户-物品 矩阵
user_item_matrix = np.zeros((n_users,n_items))
user_item_matrix.shape
(671, 9066)
# iterrows() : 将DataFrame迭代成(index ,series)
# iteritems(): 将DataFrame迭代成(列名,series)
# itertuples(): 将DataFrame迭代成元组
for line in train_data.itertuples():
user_item_matrix[line[1]-1,line[2]]=line[3]
user_item_matrix
array([[0., 3., 3., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
...,
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.]])
user_item_matrix.shape
(671, 9066)
# 构建用户相似矩阵 ---采用余弦距离
from sklearn.metrics.pairwise import pairwise_distances
# 相似度计算 定义余弦距离
user_similarity_m = pairwise_distances(user_item_matrix,metric='cosine') # 每个用户为1行数据,故此处不需要再进行转置
a=[[1,3],[2,2]]
a
[[1, 3], [2, 2]]
pairwise_distances(a,metric='euclidean')
array([[0. , 1.41421356],
[1.41421356, 0. ]])
b = np.array([[1,2],[1,3],[2,1]])
b
array([[1, 2],
[1, 3],
[2, 1]])
pairwise_distances(b,metric='euclidean') #结果数组的第一行第二列表示 a[0]与a[1]的距离
array([[0. , 1. , 1.41421356],
[1. , 0. , 2.23606798],
[1.41421356, 2.23606798, 0. ]])
pairwise_distances(b,metric='cosine')
array([[0. , 0.01005051, 0.2 ],
[0.01005051, 0. , 0.29289322],
[0.2 , 0.29289322, 0. ]])
b.shape
(3, 2)
b[1]
array([1, 3])
b[0]
array([1, 2])
user_similarity_m.shape
(671, 671)
user_similarity_m[0:5,0:5].round(2)
array([[0. , 1. , 1. , 0.94, 0.97],
[1. , 0. , 0.89, 0.93, 0.92],
[1. , 0.89, 0. , 0.93, 0.93],
[0.94, 0.93, 0.93, 0. , 0.94],
[0.97, 0.92, 0.93, 0.94, 0. ]])
user_similarity_m_triu = np.triu(user_similarity_m,k=1) #取得上三角数据
np.round(user_similarity_m_triu[user_similarity_m_triu.nonzero()],3)
array([1. , 1. , 0.938, ..., 0.934, 0.919, 0.814])
user_sim_nonzero = np.round(user_similarity_m_triu[user_similarity_m_triu.nonzero()],3)
np.percentile(user_sim_nonzero,np.arange(0,101,10))
array([0.316, 0.844, 0.885, 0.911, 0.93 , 0.947, 0.961, 0.976, 1. ,
1. , 1. ])
训练集预测
mean_user_rating = user_item_matrix.mean(axis=1)
mean_user_rating
array([0.00297816, 0.0198544 , 0.01301566, 0.06265167, 0.03027796,
0.01196779, 0.02404589, 0.03805427, 0.01114053, 0.0147805 ,
0.01047871, 0.01301566, 0.01615928, 0.004743 , 0.33984116,
0.01069932, 0.0991617 , 0.0147805 , 0.11780278, 0.02294286,
0.0443415 , 0.04936025, 0.21111846, 0.00683874, 0.00617692,
0.04031546, 0.00816236, 0.01555261, 0.00363997, 0.29643724,
0.02625193, 0.01080962, 0.03684094, 0.05702625, 0.0025921 ,
0.03000221, 0.01147143, 0.03838518, 0.0196338 , 0.01301566,
0.0592323 , 0.0196338 , 0.02172954, 0.00694904, 0.00512905,
0.01312597, 0.01069932, 0.14212442, 0.02371498, 0.01169204,
0.01069932, 0.01941319, 0.00893448, 0.01384293, 0.00838297,
0.14615045, 0.06254136, 0.01753805, 0.02090227, 0.02018531,
0.04197 , 0.0172623 , 0.02454225, 0.00739025, 0.00694904,
0.01544231, 0.02856828, 0.03331127, 0.02327377, 0.02856828,
0.00794176, 0.05035297, 0.42096845, 0.01544231, 0.0394882 ,
0.00573572, 0.07346128, 0.08234061, 0.00921024, 0.0100375 ,
0.05283477, 0.01235385, 0.04555482, 0.03424884, 0.0247077 ,
0.05084933, 0.00650783, 0.06281712, 0.02448709, 0.01577322,
0.04671299, 0.02779616, 0.0444518 , 0.0497463 , 0.08769027,
0.02288771, 0.03160159, 0.02332892, 0.0497463 , 0.00661813,
0.0173726 , 0.1978822 , 0.02437679, 0.02360468, 0.13942202,
0.01411869, 0.00656298, 0.00783146, 0.00628723, 0.04015001,
0.09541143, 0.00650783, 0.00959629, 0.00827267, 0.01351202,
0.00667328, 0.01389808, 0.05592323, 0.17185087, 0.03833002,
0.02503861, 0.00882418, 0.00937569, 0.02415619, 0.0666777 ,
0.01808957, 0.00573572, 0.09695566, 0.00595632, 0.09254357,
0.01433929, 0.0297816 , 0.03518641, 0.10065078, 0.00529451,
0.01637988, 0.02018531, 0.02024046, 0.02090227, 0.01213325,
0.00816236, 0.01103022, 0.02365983, 0.01158173, 0.01455989,
0.02134348, 0.01312597, 0.04081182, 0.06121774, 0.09557688,
0.01654533, 0.05664019, 0.01698654, 0.00926539, 0.01279506,
0.01604897, 0.08697331, 0.00562541, 0.04070152, 0.03132583,
0.02702405, 0.00915508, 0.02007501, 0.02421134, 0.10870285,
0.01422899, 0.00761085, 0.02790646, 0.03623428, 0.00590117,
0.01588352, 0.0051842 , 0.01169204, 0.00639753, 0.03551732,
0.05834988, 0.07059343, 0.03259431, 0.01080962, 0.00452239,
0.01108537, 0.04301787, 0.01433929, 0.01125083, 0.05625414,
0.0099272 , 0.1014229 , 0.02867858, 0.03320097, 0.02150893,
0.00739025, 0.01687624, 0.01886168, 0.01367748, 0.10533863,
0.02702405, 0.02051621, 0.01979925, 0.12221487, 0.05846018,
0.03910214, 0.01831017, 0.01086477, 0.00871388, 0.05542687,
0.00948599, 0.00672844, 0.01604897, 0.00452239, 0.00683874,
0.01610413, 0.2176263 , 0.18409442, 0.06254136, 0.01753805,
0.02465255, 0.03430399, 0.01433929, 0.03855063, 0.07842488,
0.00501875, 0.02658284, 0.01158173, 0.02625193, 0.00827267,
0.00921024, 0.00750055, 0.02349437, 0.00987205, 0.03347673,
0.00937569, 0.20416942, 0.00871388, 0.03160159, 0.04858813,
0.06507831, 0.01075447, 0.02432164, 0.0843812 , 0.07246856,
0.0147805 , 0.14328259, 0.08151335, 0.02195014, 0.04268696,
0.00739025, 0.07677035, 0.03595853, 0.0051842 , 0.05581293,
0.03628943, 0.01147143, 0.05840503, 0.03739246, 0.04252151,
0.00650783, 0.02768586, 0.01025811, 0.00926539, 0.01235385,
0.01047871, 0.13710567, 0.03000221, 0.01091992, 0.054379 ,
0.00959629, 0.01158173, 0.10324289, 0.00628723, 0.06176925,
0.02029561, 0.01158173, 0.0296713 , 0.01114053, 0.06585043,
0.00479815, 0.01808957, 0.0147805 , 0.01136113, 0.00838297,
0.02509376, 0.03524156, 0.05333113, 0.01433929, 0.10302228,
0.00915508, 0.0893448 , 0.02029561, 0.0049636 , 0.02090227,
0.02553497, 0.08018972, 0.02217075, 0.25672844, 0.06849768,
0.00634238, 0.03662034, 0.02647253, 0.11763733, 0.01389808,
0.00551511, 0.00750055, 0.06243106, 0.03309067, 0.00595632,
0.16997573, 0.02029561, 0.0148908 , 0.04594088, 0.00468784,
0.23830796, 0.07290977, 0.08112729, 0.01169204, 0.01246415,
0.03524156, 0.00573572, 0.01588352, 0.00595632, 0.01571807,
0.02283256, 0.01323627, 0.00700419, 0.04180454, 0.00446724,
0.00783146, 0.02073682, 0.04649239, 0.00584602, 0.02680344,
0.00689389, 0.00816236, 0.02503861, 0.01086477, 0.007666 ,
0.00816236, 0.00330907, 0.01323627, 0.02950585, 0.01384293,
0.00595632, 0.05868079, 0.01114053, 0.04500331, 0.0619347 ,
0.09055813, 0.00650783, 0.01621443, 0.00639753, 0.0495257 ,
0.01378778, 0.02443194, 0.1039047 , 0.01544231, 0.09039268,
0.00419148, 0.00948599, 0.15243768, 0.01483565, 0.0098169 ,
0.01533201, 0.03071917, 0.05404809, 0.00909993, 0.0224465 ,
0.0097066 , 0.05217295, 0.00628723, 0.01345687, 0.03055372,
0.0446724 , 0.00849327, 0.06165895, 0.00838297, 0.00705934,
0.01808957, 0.00645268, 0.03750276, 0.01990955, 0.28375248,
0.02945069, 0.07654975, 0.01544231, 0.11973307, 0.03132583,
0.02691374, 0.09276417, 0.22865652, 0.01246415, 0.03430399,
0.02923009, 0.00617692, 0.0125193 , 0.04511361, 0.00683874,
0.03540702, 0.01632473, 0.01544231, 0.00595632, 0.01676594,
0.024818 , 0.09303993, 0.00783146, 0.0098169 , 0.11675491,
0.0270792 , 0.10699316, 0.05978381, 0.01566292, 0.00799691,
0.00882418, 0.05129054, 0.00650783, 0.01698654, 0.00893448,
0.02724465, 0.04114273, 0.0494154 , 0.01643503, 0.02823737,
0.0101478 , 0.0296713 , 0.09458416, 0.00799691, 0.01588352,
0.06507831, 0.09458416, 0.04560997, 0.00457754, 0.09618354,
0.09303993, 0.02013016, 0.06221046, 0.05382749, 0.00606662,
0.02161924, 0.00683874, 0.00612177, 0.05779837, 0.01367748,
0.03568277, 0.07572248, 0.01775866, 0.00441209, 0.00540481,
0.00904478, 0.01808957, 0.00639753, 0.00871388, 0.03943305,
0.01599382, 0.33085153, 0.02294286, 0.0101478 , 0.00821752,
0.01660049, 0.14179351, 0.02272226, 0.00705934, 0.08283697,
0.15784249, 0.0121884 , 0.13335539, 0.01058901, 0.01119568,
0.0593426 , 0.02095742, 0.30228326, 0.0048533 , 0.01869623,
0.0569711 , 0.24652548, 0.02614163, 0.01301566, 0.14284139,
0.01114053, 0.00490845, 0.02774101, 0.03132583, 0.1185749 ,
0.1435032 , 0.01819987, 0.03259431, 0.00573572, 0.004743 ,
0.0398191 , 0.04037062, 0.01781381, 0.00672844, 0.0051842 ,
0.01875138, 0.01941319, 0.02923009, 0.02415619, 0.00617692,
0.03309067, 0.03419369, 0.0048533 , 0.01235385, 0.05741231,
0.05658504, 0.03353188, 0.01334657, 0.004743 , 0.09927201,
0.0051842 , 0.01125083, 0.01334657, 0.2351092 , 0.04367968,
0.00948599, 0.00921024, 0.00584602, 0.1037944 , 0.00876903,
0.03805427, 0.01411869, 0.19170527, 0.05619899, 0.03987426,
0.01384293, 0.06083168, 0.04003971, 0.01968895, 0.03992941,
0.00777631, 0.03171189, 0.03325612, 0.16804544, 0.02062652,
0.03298037, 0.01384293, 0.0394882 , 0.08030002, 0.01378778,
0.03011251, 0.10070593, 0.00739025, 0.01058901, 0.00551511,
0.00683874, 0.01704169, 0.01544231, 0.09265387, 0.02713435,
0.02178469, 0.63484447, 0.03562762, 0.00623208, 0.03353188,
0.02360468, 0.00783146, 0.06358923, 0.01511141, 0.01831017,
0.00959629, 0.01329142, 0.07224796, 0.04378998, 0.03253916,
0.07798368, 0.07026252, 0.04616148, 0.52404589, 0.00871388,
0.00777631, 0.01147143, 0.01180234, 0.02283256, 0.03634458,
0.01577322, 0.02950585, 0.0101478 , 0.09022722, 0.14284139,
0.01125083, 0.0917163 , 0.00805206, 0.00209574, 0.22887712,
0.00595632, 0.03502096, 0.00821752, 0.06072138, 0.09728657,
0.0150011 , 0.15938672, 0.01400838, 0.01047871, 0.02228105,
0.00849327, 0.03904699, 0.02128833, 0.02514891, 0.05118023,
0.14399956, 0.06243106, 0.07842488, 0.05757776, 0.01119568,
0.01268476, 0.03926759, 0.03617913, 0.00330907, 0.11096404,
0.0196338 , 0.12618575, 0.08879329, 0.02283256, 0.01913744,
0.01080962, 0.01742775, 0.01560777, 0.02889918, 0.10225017,
0.01069932, 0.01764836, 0.0100375 , 0.01257445, 0.04086698,
0.02614163, 0.01185749, 0.03105008, 0.39383411, 0.02079197,
0.04290757, 0.04500331, 0.0223362 , 0.00959629, 0.0075557 ,
0.00937569, 0.01185749, 0.00772116, 0.00534966, 0.00750055,
0.00739025, 0.00976175, 0.004743 , 0.01455989, 0.01191264,
0.04059122, 0.01169204, 0.00490845, 0.01125083, 0.007666 ,
0.05834988, 0.05162144, 0.07715641, 0.0245974 , 0.00827267,
0.00595632, 0.08509817, 0.01753805, 0.20257004, 0.03353188,
0.0445621 , 0.00419148, 0.01952349, 0.03827487, 0.02950585,
0.00843812, 0.01742775, 0.00871388, 0.15927642, 0.1088683 ,
0.00816236, 0.01687624, 0.00739025, 0.0098169 , 0.00716964,
0.0347452 ])
rating_diff = (user_item_matrix - mean_user_rating[:,np.newaxis]) # np.newaxis作用:为mean_user_rating增加一个维度,实现加减操作
rating_diff
array([[-2.97816016e-03, 2.99702184e+00, 2.99702184e+00, ...,
-2.97816016e-03, -2.97816016e-03, -2.97816016e-03],
[-1.98544011e-02, -1.98544011e-02, -1.98544011e-02, ...,
-1.98544011e-02, -1.98544011e-02, -1.98544011e-02],
[-1.30156629e-02, -1.30156629e-02, -1.30156629e-02, ...,
-1.30156629e-02, -1.30156629e-02, -1.30156629e-02],
...,
[-9.81689830e-03, -9.81689830e-03, -9.81689830e-03, ...,
-9.81689830e-03, -9.81689830e-03, -9.81689830e-03],
[-7.16964483e-03, -7.16964483e-03, -7.16964483e-03, ...,
-7.16964483e-03, -7.16964483e-03, -7.16964483e-03],
[-3.47452019e-02, -3.47452019e-02, -3.47452019e-02, ...,
-3.47452019e-02, -3.47452019e-02, -3.47452019e-02]])
user_prediction = mean_user_rating[:,np.newaxis] + user_similarity_m.dot(rating_diff) / np.array([np.abs(user_similarity_m).sum(axis=1)]).T
# 处以np.array([np.abs(item_similarity_m).sum(axis=1)]是为了可以使评分在1~5之间,使1~5的标准化
user_prediction
array([[ 8.48587738e-02, 1.11549860e-01, 7.78496257e-02, ...,
-3.30873704e-02, -3.59785123e-02, -3.59132569e-02],
[ 9.36489784e-02, 1.35396758e-01, 1.04357090e-01, ...,
-1.62815182e-02, -1.93136443e-02, -1.93247190e-02],
[ 9.44428457e-02, 1.33314515e-01, 9.83052575e-02, ...,
-2.28228892e-02, -2.58037344e-02, -2.59258365e-02],
...,
[ 9.29750987e-02, 1.27902780e-01, 9.32275326e-02, ...,
-2.60694824e-02, -2.89101875e-02, -2.87905826e-02],
[ 8.62056229e-02, 1.26697599e-01, 9.17810994e-02, ...,
-2.88942031e-02, -3.19119828e-02, -3.20590645e-02],
[ 1.17342284e-01, 1.50739909e-01, 1.17908253e-01, ...,
-7.69495365e-05, -2.99819315e-03, -3.02101562e-03]])
# 只取数据集中有评分的数据集进行评估
from sklearn.metrics import mean_squared_error
from math import sqrt
prediction_flatten = user_prediction[user_item_matrix.nonzero()]
prediction_flatten
array([0.11154986, 0.07784963, 0.14877094, ..., 0.04236321, 0.01114962,
0.02448394])
user_item_matrix_flatten = user_item_matrix[user_item_matrix.nonzero()]
user_item_matrix_flatten
array([3., 3., 2., ..., 4., 4., 4.])
error_test = sqrt(mean_squared_error(prediction_flatten,user_item_matrix_flatten)) # 均方根误差计算
error_test
3.390138302832629
标签:...,ratings,df,电影,title,NaN,关联,算法,id 来源: https://www.cnblogs.com/oceaneyes-gzy/p/16462988.html