
Building a Movie Recommendation System with Association Rule Mining

2022-07-10 13:04:05 Author: Internet

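This article uses Apriori, an association rule algorithm from data mining, to build a movie recommendation system:
- Load the data
- Preprocess the data
- Generate frequent itemsets and association rules
- Generate a list of movie recommendations from the association rules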

The Apriori Algorithm

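Example:
Beer and diapers: while analyzing sales records, Walmart found that beer and diapers were often bought together, so it rearranged the shelves to put the two side by side, which reportedly did increase beer sales. One proposed explanation: fathers buying diapers for the baby pick up some beer for themselves along the way.
Overview:
Apriori is one of the most influential algorithms for mining the frequent itemsets behind Boolean association rules. Its name comes from the fact that the algorithm uses prior knowledge about the properties of frequent itemsets.
Next, we use supermarket orders to explain the key concepts of association analysis: Support, Confidence, and Lift. The examples assume five orders, of which four contain Bread, four contain Milk, and three contain both.
- Support: the probability that an event occurs. Here it is the number of orders containing a given combination of items divided by the total number of orders.
Example: Support('Bread') = 4/5 = 0.8, Support('Milk') = 4/5 = 0.8,
Support('Bread+Milk') = 3/5 = 0.6
- Confidence: essentially a conditional probability, i.e. the probability of buying item B given that item A was bought.
Example: Confidence('Bread' -> 'Milk') = Support('Bread+Milk') / Support('Bread') = 0.6/0.8 = 0.75
- Lift: how much the presence of item A raises the probability that item B appears. Lift(A->B) = Confidence(A->B) / Support(B)
Example: Lift('Bread' -> 'Milk') = 0.75/0.8 = 0.9375
There are three cases for Lift:
- Lift(A->B) > 1: A raises the probability that B appears.
- Lift(A->B) = 1: A neither raises nor lowers the probability that B appears.
- Lift(A->B) < 1: A lowers the probability that B appears.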
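
The three measures can be checked with a few lines of plain Python. The five orders below are hypothetical; they were chosen only so that the counts match the figures quoted above (Bread in 4 of 5 orders, Milk in 4 of 5, both in 3 of 5). The helper names `support`, `confidence`, and `lift` are illustrative, not part of any library.

```python
# Hypothetical orders, chosen only to reproduce the counts used in the text.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer"},
    {"Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper"},
    {"Bread", "Milk", "Beer"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Support(A and B) / Support(A), an estimate of P(B | A)."""
    both = support(set(antecedent) | set(consequent), transactions)
    return both / support(antecedent, transactions)

def lift(antecedent, consequent, transactions):
    """Confidence(A -> B) / Support(B)."""
    return confidence(antecedent, consequent, transactions) / support(consequent, transactions)

print(support({"Bread"}, transactions))
print(support({"Bread", "Milk"}, transactions))
print(round(confidence({"Bread"}, {"Milk"}, transactions), 4))
print(round(lift({"Bread"}, {"Milk"}, transactions), 4))
```

Because the lift here is below 1, buying Bread slightly lowers the chance of also buying Milk in this toy data, which matches the third case in the list above.
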
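How it works:
The process by which the algorithm mines association rules is essentially the process of finding frequent itemsets:
- Frequent itemset: an itemset whose support is greater than or equal to the minimum support (Min Support) threshold.
- Infrequent itemset: an itemset whose support is below the minimum support.
Procedure:
1. Set K = 1 and compute the support of every K-itemset.
2. Discard the itemsets whose support is below the minimum support.
3. If no itemsets remain, the (K-1)-itemsets are the final result. Otherwise set K = K + 1, compute the support of the new K-itemsets, and repeat steps 2-3.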
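
The level-wise procedure above can be sketched in pure Python as follows. This is a minimal illustration rather than the mlxtend implementation used later: the function name `apriori_frequent_itemsets` and the toy orders are assumptions, and the sketch returns the frequent itemsets of every level, not only the last one.

```python
from itertools import combinations

def apriori_frequent_itemsets(transactions, min_support):
    """Return {frozenset: support} for every frequent itemset (level-wise search)."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support_of(candidate):
        return sum(1 for t in transactions if candidate <= t) / n

    # K = 1: every single item is a candidate.
    items = sorted({item for t in transactions for item in t})
    candidates = [frozenset([item]) for item in items]
    frequent = {}
    k = 1
    while candidates:
        # Compute support and keep only itemsets that meet the threshold.
        level = {c: support_of(c) for c in candidates}
        level = {c: s for c, s in level.items() if s >= min_support}
        if not level:
            break  # nothing survives at this level, so the search stops
        frequent.update(level)
        # Join frequent K-itemsets into (K+1)-candidates and prune any candidate
        # that has an infrequent K-subset (the Apriori property).
        k += 1
        joined = {a | b for a in level for b in level if len(a | b) == k}
        candidates = [
            c for c in joined
            if all(frozenset(sub) in level for sub in combinations(c, k - 1))
        ]
    return frequent

# Hypothetical orders for illustration only.
orders = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer"},
    {"Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper"},
    {"Bread", "Milk", "Beer"},
]
result = apriori_frequent_itemsets(orders, min_support=0.6)
for itemset, s in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
```

The pruning step is what gives the algorithm its name: a candidate is only counted if all of its smaller subsets were already found to be frequent.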

import pandas as pd
import matplotlib.pyplot as plt
import mlxtend
import numpy as np

Preparing the movie data

movie_data_file = './movie_dataset/movies_metadata.csv'
ratings_file = './movie_dataset/ratings_small.csv'

movie_data_df = pd.read_csv(movie_data_file)
ratings_df = pd.read_csv(ratings_file)

c:\users\ysilhouette\documents\pyenv\py3.6.5\lib\site-packages\IPython\core\interactiveshell.py:3072: DtypeWarning: Columns (10) have mixed types.Specify dtype option on import or set low_memory=False.
  interactivity=interactivity, compiler=compiler, result=result)

movie_data_df.head(5)

	adult	belongs_to_collection	budget	genres	homepage	id	imdb_id	original_language	original_title	overview	...	release_date	revenue	runtime	spoken_languages	status	tagline	title	video	vote_average	vote_count
0	False	{'id': 10194, 'name': 'Toy Story Collection', ...	30000000	[{'id': 16, 'name': 'Animation'}, {'id': 35, '...	http://toystory.disney.com/toy-story	862	tt0114709	en	Toy Story	Led by Woody, Andy's toys live happily in his ...	...	1995-10-30	373554033.0	81.0	[{'iso_639_1': 'en', 'name': 'English'}]	Released	NaN	Toy Story	False	7.7	5415.0
1	False	NaN	65000000	[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...	NaN	8844	tt0113497	en	Jumanji	When siblings Judy and Peter discover an encha...	...	1995-12-15	262797249.0	104.0	[{'iso_639_1': 'en', 'name': 'English'}, {'iso...	Released	Roll the dice and unleash the excitement!	Jumanji	False	6.9	2413.0
2	False	{'id': 119050, 'name': 'Grumpy Old Men Collect...	0	[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...	NaN	15602	tt0113228	en	Grumpier Old Men	A family wedding reignites the ancient feud be...	...	1995-12-22	0.0	101.0	[{'iso_639_1': 'en', 'name': 'English'}]	Released	Still Yelling. Still Fighting. Still Ready for...	Grumpier Old Men	False	6.5	92.0
3	False	NaN	16000000	[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...	NaN	31357	tt0114885	en	Waiting to Exhale	Cheated on, mistreated and stepped on, the wom...	...	1995-12-22	81452156.0	127.0	[{'iso_639_1': 'en', 'name': 'English'}]	Released	Friends are the people who let you be yourself...	Waiting to Exhale	False	6.1	34.0
4	False	{'id': 96871, 'name': 'Father of the Bride Col...	0	[{'id': 35, 'name': 'Comedy'}]	NaN	11862	tt0113041	en	Father of the Bride Part II	Just when George Banks has recovered from his ...	...	1995-02-10	76578911.0	106.0	[{'iso_639_1': 'en', 'name': 'English'}]	Released	Just When His World Is Back To Normal... He's ...	Father of the Bride Part II	False	5.7	173.0

5 rows × 24 columns

movie_data_df.describe()

	revenue	runtime	vote_average	vote_count
count	4.546000e+04	45203.000000	45460.000000	45460.000000
mean	1.120935e+07	94.128199	5.618207	109.897338
std	6.433225e+07	38.407810	1.924216	491.310374
min	0.000000e+00	0.000000	0.000000	0.000000
25%	0.000000e+00	85.000000	5.000000	3.000000
50%	0.000000e+00	95.000000	6.000000	10.000000
75%	0.000000e+00	107.000000	6.800000	34.000000
max	2.787965e+09	1256.000000	10.000000	14075.000000

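movie_data_df.info  # without parentheses this shows the bound method's repr, as below; call movie_data_df.info() for the column summary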

<bound method DataFrame.info of        adult                              belongs_to_collection    budget  \
0      False  {'id': 10194, 'name': 'Toy Story Collection', ...  30000000   
1      False                                                NaN  65000000   
2      False  {'id': 119050, 'name': 'Grumpy Old Men Collect...         0   
3      False                                                NaN  16000000   
4      False  {'id': 96871, 'name': 'Father of the Bride Col...         0   
...      ...                                                ...       ...   
45461  False                                                NaN         0   
45462  False                                                NaN         0   
45463  False                                                NaN         0   
45464  False                                                NaN         0   
45465  False                                                NaN         0   

                                                  genres  \
0      [{'id': 16, 'name': 'Animation'}, {'id': 35, '...   
1      [{'id': 12, 'name': 'Adventure'}, {'id': 14, '...   
2      [{'id': 10749, 'name': 'Romance'}, {'id': 35, ...   
3      [{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...   
4                         [{'id': 35, 'name': 'Comedy'}]   
...                                                  ...   
45461  [{'id': 18, 'name': 'Drama'}, {'id': 10751, 'n...   
45462                      [{'id': 18, 'name': 'Drama'}]   
45463  [{'id': 28, 'name': 'Action'}, {'id': 18, 'nam...   
45464                                                 []   
45465                                                 []   

                                   homepage      id    imdb_id  \
0      http://toystory.disney.com/toy-story     862  tt0114709   
1                                       NaN    8844  tt0113497   
2                                       NaN   15602  tt0113228   
3                                       NaN   31357  tt0114885   
4                                       NaN   11862  tt0113041   
...                                     ...     ...        ...   
45461  http://www.imdb.com/title/tt6209470/  439050  tt6209470   
45462                                   NaN  111109  tt2028550   
45463                                   NaN   67758  tt0303758   
45464                                   NaN  227506  tt0008536   
45465                                   NaN  461257  tt6980792   

      original_language               original_title  \
0                    en                    Toy Story   
1                    en                      Jumanji   
2                    en             Grumpier Old Men   
3                    en            Waiting to Exhale   
4                    en  Father of the Bride Part II   
...                 ...                          ...   
45461                fa                      رگ خواب   
45462                tl          Siglo ng Pagluluwal   
45463                en                     Betrayal   
45464                en          Satana likuyushchiy   
45465                en                     Queerama   

                                                overview  ... release_date  \
0      Led by Woody, Andy's toys live happily in his ...  ...   1995-10-30   
1      When siblings Judy and Peter discover an encha...  ...   1995-12-15   
2      A family wedding reignites the ancient feud be...  ...   1995-12-22   
3      Cheated on, mistreated and stepped on, the wom...  ...   1995-12-22   
4      Just when George Banks has recovered from his ...  ...   1995-02-10   
...                                                  ...  ...          ...   
45461        Rising and falling between a man and woman.  ...          NaN   
45462  An artist struggles to finish his work while a...  ...   2011-11-17   
45463  When one of her hits goes wrong, a professiona...  ...   2003-08-01   
45464  In a small town live two brothers, one a minis...  ...   1917-10-21   
45465  50 years after decriminalisation of homosexual...  ...   2017-06-09   

           revenue runtime                                   spoken_languages  \
0      373554033.0    81.0           [{'iso_639_1': 'en', 'name': 'English'}]   
1      262797249.0   104.0  [{'iso_639_1': 'en', 'name': 'English'}, {'iso...   
2              0.0   101.0           [{'iso_639_1': 'en', 'name': 'English'}]   
3       81452156.0   127.0           [{'iso_639_1': 'en', 'name': 'English'}]   
4       76578911.0   106.0           [{'iso_639_1': 'en', 'name': 'English'}]   
...            ...     ...                                                ...   
45461          0.0    90.0             [{'iso_639_1': 'fa', 'name': 'فارسی'}]   
45462          0.0   360.0                  [{'iso_639_1': 'tl', 'name': ''}]   
45463          0.0    90.0           [{'iso_639_1': 'en', 'name': 'English'}]   
45464          0.0    87.0                                                 []   
45465          0.0    75.0           [{'iso_639_1': 'en', 'name': 'English'}]   

         status                                            tagline  \
0      Released                                                NaN   
1      Released          Roll the dice and unleash the excitement!   
2      Released  Still Yelling. Still Fighting. Still Ready for...   
3      Released  Friends are the people who let you be yourself...   
4      Released  Just When His World Is Back To Normal... He's ...   
...         ...                                                ...   
45461  Released         Rising and falling between a man and woman   
45462  Released                                                NaN   
45463  Released                             A deadly game of wits.   
45464  Released                                                NaN   
45465  Released                                                NaN   

                             title  video vote_average vote_count  
0                        Toy Story  False          7.7     5415.0  
1                          Jumanji  False          6.9     2413.0  
2                 Grumpier Old Men  False          6.5       92.0  
3                Waiting to Exhale  False          6.1       34.0  
4      Father of the Bride Part II  False          5.7      173.0  
...                            ...    ...          ...        ...  
45461                       Subdue  False          4.0        1.0  
45462          Century of Birthing  False          9.0        3.0  
45463                     Betrayal  False          3.8        6.0  
45464             Satan Triumphant  False          0.0        0.0  
45465                     Queerama  False          0.0        0.0  

[45466 rows x 24 columns]>

movie_data_df.count()

adult                    45466
belongs_to_collection     4494
budget                   45466
genres                   45466
homepage                  7782
id                       45466
imdb_id                  45449
original_language        45455
original_title           45466
overview                 44512
popularity               45461
poster_path              45080
production_companies     45463
production_countries     45463
release_date             45379
revenue                  45460
runtime                  45203
spoken_languages         45460
status                   45379
tagline                  20412
title                    45460
video                    45460
vote_average             45460
vote_count               45460
dtype: int64

movie_data_df.columns

Index(['adult', 'belongs_to_collection', 'budget', 'genres', 'homepage', 'id',
       'imdb_id', 'original_language', 'original_title', 'overview',
       'popularity', 'poster_path', 'production_companies',
       'production_countries', 'release_date', 'revenue', 'runtime',
       'spoken_languages', 'status', 'tagline', 'title', 'video',
       'vote_average', 'vote_count'],
      dtype='object')

ratings_df.head(5)

	userId	movieId	rating	timestamp
0	1	31	2.5	1260759144
1	1	1029	3.0	1260759179
2	1	1061	3.0	1260759182
3	1	1129	2.0	1260759185
4	1	1172	4.0	1260759205

ratings_df.columns

Index(['userId', 'movieId', 'rating', 'timestamp'], dtype='object')

ratings_df.count()

userId       100004
movieId      100004
rating       100004
timestamp    100004
dtype: int64

ratings_df.shape

(100004, 4)

movie_data_df.shape

(45466, 24)

Data preprocessing

Handle missing values
Remove duplicates
Merge the movie metadata with the movie ratings

movie_data_df_t=movie_data_df[['title','id']]

movie_data_df_t.dtypes

title    object
id       object
dtype: object

ratings_df_s = ratings_df.drop(['timestamp'], axis=1)  # axis=0 drops rows, axis=1 drops columns

ratings_df_s.dtypes

userId       int64
movieId      int64
rating     float64
dtype: object

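Handling missing values

pandas uses NaN (Not a Number) to represent missing values in both floating-point and non-floating-point arrays; Python's None is also treated as a missing value.

# pd.to_numeric converts the id column from strings to numbers; values that cannot be converted become NaN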
pd.to_numeric(movie_data_df_t['id'],errors='coerce')

0           862.0
1          8844.0
2         15602.0
3         31357.0
4         11862.0
           ...   
45461    439050.0
45462    111109.0
45463     67758.0
45464    227506.0
45465    461257.0
Name: id, Length: 45466, dtype: float64

# np.where returns the positions of the elements that satisfy the condition
np.where(pd.to_numeric(movie_data_df_t['id'], errors='coerce').isna())  # positions of missing values; isna() returns True for NaN and False otherwise

(array([19730, 29503, 35587], dtype=int64),)
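
As a small self-contained illustration of the same idea, the sketch below runs `pd.to_numeric(..., errors='coerce')` and `np.where` on a hypothetical four-element Series in which one id is actually a date string, mirroring the malformed rows found above.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: the third "id" is a date string and cannot be parsed as a number.
ids = pd.Series(["862", "8844", "1997-08-20", "15602"])

numeric_ids = pd.to_numeric(ids, errors="coerce")   # unparseable values become NaN
bad_positions = np.where(numeric_ids.isna())[0]     # integer positions of the NaN entries

print(numeric_ids)
print(bad_positions)
print(ids.iloc[bad_positions])                      # inspect the original bad values
```

`np.where` with a single argument returns a tuple with one array per dimension, which is why the result is indexed with `[0]`.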

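loc works on labels in the index.
- loc selects by label, i.e. it retrieves data using row and column labels, for example row labels 0 to 4 or column labels 'A' to 'B'.
- The first argument selects index (row) labels and the second selects columns.
- Writing df.loc[0, :] is recommended because it makes clear that every column of the row labeled 0 is selected; likewise, df.loc[:, 'A'] selects every row of column 'A'.
- : means "all"; inside [] rows come first, then columns.
iloc works on the positions in the index (so it only takes integers).
- iloc selects by position, i.e. the n-th row and n-th column, and accepts only integer arguments. A slice such as 0:2 is half-open, so it selects positions 0 and 1.
- To select all data in the first column, use df.iloc[:, 0]; 'A' is not accepted as an argument.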
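
The difference between the two indexers is easiest to see on a small frame whose row labels are not integers. The frame below is hypothetical and exists only for this illustration.

```python
import pandas as pd

# Hypothetical frame with string row labels so labels and positions differ.
df = pd.DataFrame({"A": [10, 20, 30], "B": [1, 2, 3]}, index=["x", "y", "z"])

row_by_label = df.loc["y", :]      # row labeled "y", all columns
col_by_label = df.loc[:, "A"]      # all rows of column "A"
label_slice = df.loc["x":"y"]      # label slices include BOTH endpoints

rows_by_pos = df.iloc[0:2]         # positions 0 and 1 (half-open slice)
col_by_pos = df.iloc[:, 0]         # first column by position

print(row_by_label)
print(label_slice)
print(rows_by_pos)
```

Note that a label slice with `loc` includes its end label, while a position slice with `iloc` excludes its end position.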

movie_data_df_t.iloc[19730]

title           NaN
id       1997-08-20
Name: 19730, dtype: object

movie_data_df_t.iloc[[19730,29503,35587]]

	title	id
19730	NaN	1997-08-20
29503	NaN	2012-09-29
35587	NaN	2014-01-01

# Assign the converted values back to the id column
movie_data_df_t['id'] = pd.to_numeric(movie_data_df_t['id'], errors='coerce')

c:\users\ysilhouette\documents\pyenv\py3.6.5\lib\site-packages\ipykernel_launcher.py:2: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

movie_data_df_t.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45466 entries, 0 to 45465
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   title   45460 non-null  object 
 1   id      45463 non-null  float64
dtypes: float64(1), object(1)
memory usage: 710.5+ KB

movie_data_df_t.iloc[[19730,29503,35587]]

	title	id
19730	NaN	NaN
29503	NaN	NaN
35587	NaN	NaN

movie_data_df_t.shape

(45466, 2)

movie_data_df_t.drop(np.where(movie_data_df_t['id'].isna())[0], inplace=True)

c:\users\ysilhouette\documents\pyenv\py3.6.5\lib\site-packages\pandas\core\frame.py:4174: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  errors=errors,

movie_data_df_t.shape

(45463, 2)
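
The SettingWithCopyWarning messages in this section appear because movie_data_df_t was created by selecting columns from movie_data_df and was then modified in place. One possible way to avoid them, sketched here on a tiny hypothetical frame, is to take an explicit `.copy()` when creating the subset and to use `dropna` instead of `drop` with `np.where`. The names `raw` and `subset` are illustrative only.

```python
import pandas as pd

# Hypothetical stand-in for the movie metadata: the second row has a date in the id column.
raw = pd.DataFrame({
    "title": ["Toy Story", None, "Jumanji"],
    "id": ["862", "1997-08-20", "8844"],
    "budget": [30000000, 0, 65000000],
})

# An explicit copy makes it unambiguous that later assignments modify `subset`, not `raw`.
subset = raw[["title", "id"]].copy()
subset["id"] = pd.to_numeric(subset["id"], errors="coerce")

# Drop rows whose id could not be parsed, then store the ids as integers.
subset = subset.dropna(subset=["id"]).astype({"id": "int64"})

print(subset)
print(subset.dtypes)
```

Dropping the NaN rows before converting to `int64` matters, because NaN cannot be stored in a plain integer column.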

Removing duplicates

movie_data_df_t.duplicated(['id','title']).sum()

movie_data_df_t.drop_duplicates(['id'],inplace=True)

c:\users\ysilhouette\documents\pyenv\py3.6.5\lib\site-packages\ipykernel_launcher.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.

movie_data_df_t.shape

(45433, 2)

ratings_df_s.duplicated(['userId','movieId']).sum()

movie_data_df_t['id'] = movie_data_df_t['id'].astype(np.int64)

c:\users\ysilhouette\documents\pyenv\py3.6.5\lib\site-packages\ipykernel_launcher.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.

movie_data_df_t.dtypes

title    object
id        int64
dtype: object

ratings_df_s.dtypes

userId       int64
movieId      int64
rating     float64
dtype: object

Merging the data

# Merge the left DataFrame and the right DataFrame on movieId (left) and id (right)
ratings_df_s = pd.merge(ratings_df_s,movie_data_df_t, left_on='movieId',right_on='id')

ratings_df_s.head()

	userId	movieId	rating	title	id
0	1	1371	2.5	Rocky III	1371
1	4	1371	4.0	Rocky III	1371
2	7	1371	3.0	Rocky III	1371
3	19	1371	4.0	Rocky III	1371
4	21	1371	3.0	Rocky III	1371

ratings_df_s.drop(['id'],axis=1,inplace=True)

ratings_df_s

	userId	movieId	rating	title
0	1	1371	2.5	Rocky III
1	4	1371	4.0	Rocky III
2	7	1371	3.0	Rocky III
3	19	1371	4.0	Rocky III
4	21	1371	3.0	Rocky III
...	...	...	...	...
44984	652	129009	4.0	Love Is a Ball
44985	653	2103	3.0	Solaris
44986	659	167	4.0	K-PAX
44987	659	563	3.0	Starship Troopers
44988	665	129	3.0	Spirited Away

44989 rows × 4 columns

ratings_df_s.shape

(44989, 4)
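
The merge above is an inner join by default, so ratings whose movieId has no matching id are silently dropped; this is why the row count falls from 100004 to 44989. A hedged sketch of how one might inspect the unmatched rows, using hypothetical toy frames and the `indicator` option of `pd.merge`, is shown below. It may also be worth verifying that movieId and id come from the same identifier scheme; if the dataset ships a separate link table between the two, the join should probably go through it.

```python
import pandas as pd

# Hypothetical toy data: movieId 20 has no counterpart in `movies`.
ratings = pd.DataFrame({"userId": [1, 1, 2], "movieId": [10, 20, 30], "rating": [4.0, 3.0, 5.0]})
movies = pd.DataFrame({"id": [10, 30], "title": ["Movie A", "Movie C"]})

# A left join with indicator=True keeps every rating and records where each row came from.
merged = pd.merge(ratings, movies, left_on="movieId", right_on="id", how="left", indicator=True)

unmatched = merged[merged["_merge"] == "left_only"]                       # ratings without a title
matched = merged[merged["_merge"] == "both"].drop(columns=["id", "_merge"])

print(len(unmatched), "rating(s) had no matching movie")
print(matched)
```

The `matched` frame is equivalent to what the default inner join returns, while `unmatched` shows what that join would have discarded.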

# Number of movies that have at least one rating record
len(ratings_df_s['title'].unique())

ratings_df_s['title'].unique()

array(['Rocky III', 'Greed', 'American Pie', ..., 'K-PAX',
       'Starship Troopers', 'Spirited Away'], dtype=object)

ratings_df_s.groupby([ratings_df_s['title'],ratings_df_s['rating']]).count().reset_index()

	title	rating	userId	movieId
0	!Women Art Revolution	3.0	1	1
1	!Women Art Revolution	3.5	1	1
2	'Gator Bait	0.5	1	1
3	'Twas the Night Before Christmas	3.5	1	1
4	'Twas the Night Before Christmas	4.5	1	1
...	...	...	...	...
10263	À nos amours	4.0	5	5
10264	À nos amours	4.5	1	1
10265	À nos amours	5.0	1	1
10266	Ödipussi	4.5	1	1
10267	Şaban Oğlu Şaban	4.5	1	1

10268 rows × 4 columns

ratings_df_s.groupby(ratings_df_s['title']).count().reset_index()

	title	userId	movieId	rating
0	!Women Art Revolution	2	2	2
1	'Gator Bait	1	1	1
2	'Twas the Night Before Christmas	2	2	2
3	...And God Created Woman	1	1	1
4	00 Schneider - Jagd auf Nihil Baxter	2	2	2
...	...	...	...	...
2789	xXx	28	28	28
2790	¡Three Amigos!	1	1	1
2791	À nos amours	14	14	14
2792	Ödipussi	1	1	1
2793	Şaban Oğlu Şaban	1	1	1

2794 rows × 4 columns

ratings_df_s_allcounts = ratings_df_s.groupby(ratings_df_s['title'])['userId'].count().reset_index()

ratings_df_s_allcounts = ratings_df_s_allcounts.rename(columns = {'userId':'totalRatings'})

ratings_df_s_allcounts

	title	totalRatings
0	!Women Art Revolution	2
1	'Gator Bait	1
2	'Twas the Night Before Christmas	2
3	...And God Created Woman	1
4	00 Schneider - Jagd auf Nihil Baxter	2
...	...	...
2789	xXx	28
2790	¡Three Amigos!	1
2791	À nos amours	14
2792	Ödipussi	1
2793	Şaban Oğlu Şaban	1

2794 rows × 2 columns

ratings_df_s_allcounts.shape

(2794, 2)

ratings_df_s_allcounts['totalRatings'].describe()

count    2794.000000
mean       16.102004
std        31.481795
min         1.000000
25%         1.000000
50%         4.000000
75%        15.750000
max       324.000000
Name: totalRatings, dtype: float64

ratings_df_s_allcounts.hist()

array([[<AxesSubplot:title={'center':'totalRatings'}>]], dtype=object)

[Figure: histogram of totalRatings]

ratings_df_s_allcounts['totalRatings'].quantile(np.arange(0.6,1, 0.01))  # quantiles

0.60      7.00
0.61      7.00
0.62      7.00
0.63      8.00
0.64      8.00
0.65      9.00
0.66      9.00
0.67     10.00
0.68     10.00
0.69     11.00
0.70     12.00
0.71     12.00
0.72     13.00
0.73     14.00
0.74     14.00
0.75     15.75
0.76     17.00
0.77     18.00
0.78     19.00
0.79     20.00
0.80     21.00
0.81     22.33
0.82     24.00
0.83     26.00
0.84     27.00
0.85     29.00
0.86     31.00
0.87     34.00
0.88     37.00
0.89     41.77
0.90     45.00
0.91     49.00
0.92     52.56
0.93     59.00
0.94     64.42
0.95     71.00
0.96     83.28
0.97     98.21
0.98    119.14
0.99    168.49
Name: totalRatings, dtype: float64

The quantiles show that roughly 21% of the movies have more than 20 rating records (the 0.79 quantile is 20).

votes_count_threshold = 20

ratings_df_s_top=ratings_df_s_allcounts.query('totalRatings > @votes_count_threshold').reset_index()

ratings_df_s_top

	index	title	totalRatings
0	18	20,000 Leagues Under the Sea	89
1	19	2001: A Space Odyssey	87
2	24	24 Hour Party People	22
3	26	28 Days Later	26
4	27	28 Weeks Later	47
...	...	...	...
575	2770	Young Adam	34
576	2772	Young Frankenstein	29
577	2774	Young and Innocent	193
578	2781	Zatoichi	61
579	2789	xXx	28

580 rows × 3 columns

ratings_df_s_top.drop(['index'],axis=1,inplace=True)

ratings_df_s_top.head()

	title	totalRatings
0	20,000 Leagues Under the Sea	89
1	2001: A Space Odyssey	87
2	24 Hour Party People	22
3	28 Days Later	26
4	28 Weeks Later	47
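
The thresholding step can be reproduced on a toy frame to see what share of titles survives. The data below is hypothetical. The sketch also uses `reset_index(drop=True)`, which avoids creating the extra index column that is dropped afterwards in the code above.

```python
import pandas as pd

# Hypothetical rating counts per title.
counts = pd.DataFrame({
    "title": ["A", "B", "C", "D", "E"],
    "totalRatings": [1, 4, 20, 21, 89],
})

threshold = 20
# The mean of a boolean Series is the fraction of True values.
share_above = (counts["totalRatings"] > threshold).mean()

# drop=True discards the old index instead of adding it back as a column.
top = counts.query("totalRatings > @threshold").reset_index(drop=True)

print(share_above)
print(top)
```

Because the comparison is strict, a title with exactly 20 ratings is not kept.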

ratings_df_s['title']

0                Rocky III
1                Rocky III
2                Rocky III
3                Rocky III
4                Rocky III
               ...        
44984       Love Is a Ball
44985              Solaris
44986                K-PAX
44987    Starship Troopers
44988        Spirited Away
Name: title, Length: 44989, dtype: object

ratings_df_s_top['title']

0      20,000 Leagues Under the Sea
1             2001: A Space Odyssey
2              24 Hour Party People
3                     28 Days Later
4                    28 Weeks Later
                   ...             
575                      Young Adam
576              Young Frankenstein
577              Young and Innocent
578                        Zatoichi
579                             xXx
Name: title, Length: 580, dtype: object

ratings_df_s[ratings_df_s['title'].isin(ratings_df_s_top['title'])]  # ratings of movies with more than 20 ratings

	userId	movieId	rating	title
0	1	1371	2.5	Rocky III
1	4	1371	4.0	Rocky III
2	7	1371	3.0	Rocky III
3	19	1371	4.0	Rocky III
4	21	1371	3.0	Rocky III
...	...	...	...	...
44507	624	3057	4.0	Frankenstein
44781	547	97936	3.0	Sweet November
44782	624	97936	3.0	Sweet November
44909	609	1450	5.0	Blood: The Last Vampire
44985	653	2103	3.0	Solaris

34552 rows × 4 columns

ratings_df_s[~ratings_df_s['title'].isin(ratings_df_s_top['title'])]  # ratings of movies with at most 20 ratings

	userId	movieId	rating	title
1714	2	248	3.0	Pocketful of Miracles
1715	36	248	2.0	Pocketful of Miracles
1716	110	248	4.0	Pocketful of Miracles
1717	239	248	4.0	Pocketful of Miracles
1718	242	248	3.0	Pocketful of Miracles
...	...	...	...	...
44983	652	127728	5.0	8:46
44984	652	129009	4.0	Love Is a Ball
44986	659	167	4.0	K-PAX
44987	659	563	3.0	Starship Troopers
44988	665	129	3.0	Spirited Away

10437 rows × 4 columns

ratings_df_s_cntD20 = ratings_df_s[ratings_df_s['title'].isin(ratings_df_s_top['title'])]

ratings_df_s_cntX20 = ratings_df_s[~ratings_df_s['title'].isin(ratings_df_s_top['title'])]

ratings_df_s_cntD20.shape

(34552, 4)

ratings_df_s_cntX20.shape

(10437, 4)

ratings_df_s_cntD20.isna().sum()  # check for missing values

userId     0
movieId    0
rating     0
title      0
dtype: int64

ratings_df_s_cntD20.duplicated(['userId','title']).sum()

ratings_df_s_cntD20=ratings_df_s_cntD20.drop_duplicates(['userId','title'])  # keep only one rating record per user per movie

ratings_df_s_cntD20

	userId	movieId	rating	title
0	1	1371	2.5	Rocky III
1	4	1371	4.0	Rocky III
2	7	1371	3.0	Rocky III
3	19	1371	4.0	Rocky III
4	21	1371	3.0	Rocky III
...	...	...	...	...
44506	472	3057	3.0	Frankenstein
44507	624	3057	4.0	Frankenstein
44782	624	97936	3.0	Sweet November
44909	609	1450	5.0	Blood: The Last Vampire
44985	653	2103	3.0	Solaris

34412 rows × 4 columns

ratings_df_s_cntD20.duplicated(['userId','title']).sum()

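# pivot reshapes the long-format records of a DataFrame into a table, using pivot(index=..., columns=..., values=...).
# The same three arguments can also be passed positionally, pivot(index column, column names, values), in pandas versions that still accept it.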
ratings_df_s_cntD20_for_apriori = ratings_df_s_cntD20.pivot(index='userId',columns='title',values='rating')

ratings_df_s_cntD20_for_apriori

title	20,000 Leagues Under the Sea	2001: A Space Odyssey	24 Hour Party People	28 Days Later	28 Weeks Later	300	48 Hrs.	5 Card Stud	7 Virgins	8 Women	...	Within the Woods	X-Men Origins: Wolverine	Y Tu Mamá También	Yankee Doodle Dandy	Yesterday	Young Adam	Young Frankenstein	Young and Innocent	Zatoichi	xXx
userId
1	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
2	NaN	3.0	NaN	NaN	NaN	NaN	5.0	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
3	NaN	NaN	NaN	NaN	NaN	3.0	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	3.5	NaN	NaN
4	3.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	5.0	NaN	NaN	NaN	NaN	5.0	NaN	NaN	NaN
5	NaN	NaN	NaN	NaN	NaN	NaN	4.0	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	3.5	NaN	NaN
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
667	NaN	NaN	NaN	NaN	NaN	NaN	4.0	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
668	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
669	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
670	NaN	NaN	NaN	NaN	NaN	NaN	3.0	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
671	NaN	NaN	NaN	NaN	NaN	NaN	NaN	5.0	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	4.0	NaN	NaN

671 rows × 580 columns
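
The deduplication step before the pivot is not optional: `pivot` raises a ValueError when the same (index, column) pair appears more than once. The sketch below demonstrates this on a hypothetical long-format frame in which user 1 rated title "A" twice.

```python
import pandas as pd

# Hypothetical long-format ratings; (userId=1, title="A") appears twice.
long_df = pd.DataFrame({
    "userId": [1, 1, 2, 2],
    "title": ["A", "A", "A", "B"],
    "rating": [3.0, 4.0, 5.0, 2.0],
})

try:
    long_df.pivot(index="userId", columns="title", values="rating")
    pivot_failed = False
except ValueError:
    pivot_failed = True   # duplicate (userId, title) pairs cannot be reshaped

# Keep the first rating for each (userId, title) pair, then pivot.
deduped = long_df.drop_duplicates(["userId", "title"])
wide = deduped.pivot(index="userId", columns="title", values="rating")

print(pivot_failed)
print(wide)
```

Combinations that never occur in the long table, such as user 1 and title "B", become NaN in the wide table, which is why the next step fills missing values with 0.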

ratings_df_s_cntD20_for_apriori= ratings_df_s_cntD20_for_apriori.fillna(0)  # fill missing values with 0

def encode_units(x):  # rating validity rule: 1 means the user rated the movie, 0 means no rating
    if x <= 0:
        return 0
    if x>0:
        return 1

ratings_df_s_cntD20_for_apriori = ratings_df_s_cntD20_for_apriori.applymap(encode_units)
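
An element-wise `applymap` works, but the same encoding can be expressed as a single vectorized comparison, which tends to be faster on a 671 x 580 matrix. The sketch below uses a hypothetical ratings matrix; it yields a boolean frame, which mlxtend's `apriori` accepts in place of 0/1 values. In recent pandas releases `applymap` is also deprecated in favor of `DataFrame.map`, so a vectorized form sidesteps that as well.

```python
import numpy as np
import pandas as pd

# Hypothetical user x title rating matrix; NaN means "not rated".
ratings_matrix = pd.DataFrame(
    {"Movie A": [np.nan, 3.0, 0.5], "Movie B": [5.0, np.nan, np.nan]},
    index=pd.Index([1, 2, 3], name="userId"),
)

# True where the user gave any positive rating, False otherwise.
basket = ratings_matrix.fillna(0) > 0

# If 0/1 integers are preferred, the boolean frame converts directly.
basket_int = basket.astype(int)

print(basket)
print(basket_int)
```

Note that this encoding records only whether a user rated a movie, not whether the rating was high, so the mined rules describe co-viewing rather than shared liking.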

Computing frequent itemsets and association rules

from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules

ratings_df_s_cntD20_for_apriori.head()

title	20,000 Leagues Under the Sea	2001: A Space Odyssey	24 Hour Party People	28 Days Later	28 Weeks Later	300	48 Hrs.	5 Card Stud	7 Virgins	8 Women	...	Within the Woods	X-Men Origins: Wolverine	Y Tu Mamá También	Yankee Doodle Dandy	Yesterday	Young Adam	Young Frankenstein	Young and Innocent	Zatoichi	xXx
userId
1	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
2	0	1	0	0	0	0	1	0	0	0	...	0	0	0	0	0	0	0	0	0	0
3	0	0	0	0	0	1	0	0	0	0	...	0	0	0	0	0	0	0	1	0	0
4	1	0	0	0	0	0	0	0	0	0	...	0	1	0	0	0	0	1	0	0	0
5	0	0	0	0	0	0	1	0	0	0	...	0	0	0	0	0	0	0	1	0	0

5 rows × 580 columns

ratings_df_s_cntD20_for_apriori.isna().sum()  # check for NaN values

title
20,000 Leagues Under the Sea    0
2001: A Space Odyssey           0
24 Hour Party People            0
28 Days Later                   0
28 Weeks Later                  0
                               ..
Young Adam                      0
Young Frankenstein              0
Young and Innocent              0
Zatoichi                        0
xXx                             0
Length: 580, dtype: int64

frequent_itemsets = apriori(ratings_df_s_cntD20_for_apriori, min_support=0.10, use_colnames=True)  # generate the frequent itemsets with support >= 0.10

frequent_itemsets.sort_values('support',ascending=False)  # frequent itemsets sorted by support, descending

	support	itemsets
111	0.482861	(Terminator 3: Rise of the Machines)
130	0.463487	(The Million Dollar Hotel)
105	0.454545	(Solaris)
113	0.433681	(The 39 Steps)
69	0.408346	(Monsoon Wedding)
...	...	...
1613	0.101341	(Sleepless in Seattle, 5 Card Stud, The Tunnel)
5455	0.101341	(Beauty and the Beast, Rain Man, Terminator 3:...
5454	0.101341	(The Passion of Joan of Arc, Beauty and the Be...
6769	0.101341	(The Million Dollar Hotel, The Hours, Three Co...
3108	0.101341	(The Conversation, Men in Black II, The Millio...

7327 rows × 2 columns

rules= association_rules(frequent_itemsets, metric="lift", min_threshold=1)  # generate association rules, keeping only those with lift >= 1
rules

	antecedents	consequents	antecedent support	consequent support	support	confidence	lift	leverage	conviction
0	(5 Card Stud)	(48 Hrs.)	0.298063	0.298063	0.108793	0.365000	1.224575	0.019952	1.105413
1	(48 Hrs.)	(5 Card Stud)	0.298063	0.298063	0.108793	0.365000	1.224575	0.019952	1.105413
2	(A Clockwork Orange)	(48 Hrs.)	0.152012	0.298063	0.102832	0.676471	2.269559	0.057523	2.169625
3	(48 Hrs.)	(A Clockwork Orange)	0.298063	0.152012	0.102832	0.345000	2.269559	0.057523	1.294638
4	(48 Hrs.)	(A Nightmare on Elm Street)	0.298063	0.268256	0.156483	0.525000	1.957083	0.076526	1.540513
...	...	...	...	...	...	...	...	...	...
75531	(The Hours)	(The Million Dollar Hotel, Terminator 3: Rise ...	0.301043	0.126677	0.104322	0.346535	2.735585	0.066187	1.336449
75532	(Terminator 3: Rise of the Machines)	(The Million Dollar Hotel, The Hours, Rain Man...	0.482861	0.114754	0.104322	0.216049	1.882716	0.048912	1.129211
75533	(Rain Man)	(The Million Dollar Hotel, The Hours, Terminat...	0.295082	0.120715	0.104322	0.353535	2.928669	0.068701	1.360143
75534	(Sissi)	(The Million Dollar Hotel, The Hours, Terminat...	0.317437	0.117735	0.104322	0.328638	2.791347	0.066949	1.314143
75535	(Solaris)	(The Million Dollar Hotel, The Hours, Terminat...	0.454545	0.113264	0.104322	0.229508	2.026316	0.052838	1.150870

75536 rows × 9 columns

rules.sort_values('lift',ascending=False)

	antecedents	consequents	antecedent support	consequent support	support	confidence	lift	leverage	conviction
1473	(Muxmäuschenstill)	(Waiter)	0.156483	0.120715	0.105812	0.676190	5.601529	0.086922	2.715438
1472	(Waiter)	(Muxmäuschenstill)	0.120715	0.156483	0.105812	0.876543	5.601529	0.086922	6.832489
38208	(Titanic, Big Fish)	(Psycho, Rain Man)	0.150522	0.131148	0.101341	0.673267	5.133663	0.081601	2.659215
38209	(Psycho, Rain Man)	(Titanic, Big Fish)	0.131148	0.150522	0.101341	0.772727	5.133663	0.081601	3.737705
38238	(Titanic, Big Fish)	(Psycho, Solaris)	0.150522	0.134128	0.102832	0.683168	5.093399	0.082642	2.732908
...	...	...	...	...	...	...	...	...	...
108	(5 Card Stud)	(Men in Black II)	0.298063	0.333830	0.110283	0.370000	1.108348	0.010781	1.057413
571	(Bang, Boom, Bang)	(The 39 Steps)	0.260805	0.433681	0.125186	0.480000	1.106804	0.012080	1.089075
570	(The 39 Steps)	(Bang, Boom, Bang)	0.433681	0.260805	0.125186	0.288660	1.106804	0.012080	1.039159
1137	(Sissi)	(License to Wed)	0.317437	0.301043	0.102832	0.323944	1.076070	0.007269	1.033874
1136	(License to Wed)	(Sissi)	0.301043	0.317437	0.102832	0.341584	1.076070	0.007269	1.036675

75536 rows × 9 columns

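Result: the output above lists all association rules, one rule per row. The table is sorted by lift, and the rules in rows 1473 (Muxmäuschenstill -> Waiter) and 1472 (Waiter -> Muxmäuschenstill) share the highest lift, 5.60. Of the two, row 1472 has the larger conviction (6.83 versus 2.72); a larger conviction means the consequents depend more strongly on the antecedents.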
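
The metric columns in the rules table can all be derived from three supports. The sketch below recomputes them for row 1472 using the standard definitions of confidence, lift, leverage, and conviction; the helper name `rule_metrics` is illustrative. Because the inputs are the rounded values printed in the table, the results agree with the table only to a few decimal places.

```python
def rule_metrics(support_a, support_b, support_ab):
    """Derive rule metrics for A -> B from Support(A), Support(B), Support(A and B)."""
    confidence = support_ab / support_a
    lift = confidence / support_b
    leverage = support_ab - support_a * support_b
    # Conviction is unbounded when the rule always holds (confidence == 1).
    if confidence == 1:
        conviction = float("inf")
    else:
        conviction = (1 - support_b) / (1 - confidence)
    return {
        "confidence": confidence,
        "lift": lift,
        "leverage": leverage,
        "conviction": conviction,
    }

# Row 1472 of the table: (Waiter) -> (Muxmäuschenstill)
metrics = rule_metrics(support_a=0.120715, support_b=0.156483, support_ab=0.105812)
for name, value in metrics.items():
    print(name, round(value, 4))
```

Lift and leverage are symmetric in A and B, which is why rows 1472 and 1473 share them, whereas confidence and conviction depend on the direction of the rule.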

Movie recommendation

	antecedents	consequents	antecedent support	consequent support	support	confidence	lift	leverage	conviction
63981	(Batman Returns)	(The Hours, Monsoon Wedding, Silent Hill, Rese...	0.298063	0.107303	0.102832	0.345	3.215208	0.070849	1.362897
36084	(Batman Returns)	(Reservoir Dogs, Wag the Dog, Silent Hill)	0.298063	0.105812	0.101341	0.340	3.213239	0.069803	1.354830
63891	(Batman Returns)	(Monsoon Wedding, Silent Hill, Reservoir Dogs,...	0.298063	0.107303	0.101341	0.340	3.168611	0.069358	1.352572
63351	(Batman Returns)	(Monsoon Wedding, Silent Hill, Reservoir Dogs,...	0.298063	0.107303	0.101341	0.340	3.168611	0.069358	1.352572
36014	(Batman Returns)	(The Hours, Reservoir Dogs, Silent Hill)	0.298063	0.116244	0.108793	0.365	3.139936	0.074145	1.391741
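
The code that produced the table above is not shown in the source. A plausible sketch, under the assumption that it filtered the rules whose antecedent is exactly (Batman Returns) and sorted them by lift, is given below. The `rules_demo` frame is a hypothetical stand-in with the same column layout as the frame returned by `association_rules` (antecedents and consequents stored as frozensets), and `recommend` is an illustrative helper rather than the author's function.

```python
import pandas as pd

# Hypothetical stand-in for the `rules` frame; lift values are made up.
rules_demo = pd.DataFrame({
    "antecedents": [
        frozenset({"Batman Returns"}),
        frozenset({"Batman Returns"}),
        frozenset({"Sissi"}),
    ],
    "consequents": [
        frozenset({"The Hours", "Silent Hill"}),
        frozenset({"Reservoir Dogs"}),
        frozenset({"Solaris"}),
    ],
    "lift": [3.1, 3.2, 1.1],
})

def recommend(rules, watched_title, top_n=5):
    """Titles from the consequents of the top_n highest-lift rules for one watched movie."""
    target = frozenset({watched_title})
    mask = rules["antecedents"].apply(lambda items: items == target)
    ranked = rules[mask].sort_values("lift", ascending=False).head(top_n)
    recommended = []
    for consequent in ranked["consequents"]:
        for title in sorted(consequent):       # sorted only to make the order deterministic
            if title not in recommended:
                recommended.append(title)
    return recommended

print(recommend(rules_demo, "Batman Returns"))
```

Flattening the consequents into a deduplicated list is one design choice among several; one could equally recommend the consequent sets as bundles.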

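Collaborative filtering

User-based collaborative filtering

Among a very large number of users, find a small group whose tastes are close to yours. In collaborative filtering these users are called neighbors; the items they like are then organized into a ranked list and recommended to you.
- The key questions are how to find the users most similar to you, and how to organize the neighbors' preferences into a ranked list for the user.
  - In practice, a number K is given that specifies how many of the most similar users to use.
  - In theory, computing similarity means comparing the target user with every other user, which is very time-consuming when the dataset is large.
  - Many users in the dataset may have nothing in common with the target user, so computing their similarity is unnecessary.
  - An inverted item-to-user table is therefore needed, i.e. for each item, the users associated with it. With this table, many unrelated users can be filtered out, which reduces the amount of computation.
- In summary, the recommendation process first computes the similarity between users, selects the top K users by similarity, and then computes a recommendation score for each item among those K users.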
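
The process summarized above can be sketched with NumPy. This is a minimal brute-force illustration under several assumptions: the rating matrix is hypothetical, cosine similarity is used as the similarity measure (the text does not name one), and similarity is computed against every user rather than through the inverted item-to-user table.

```python
import numpy as np

def recommend_user_based(ratings, user, k=2, top_n=3):
    """ratings: 2-D array (users x items) where 0 means 'not rated'."""
    k = min(k, ratings.shape[0] - 1)           # cannot have more neighbors than other users
    norms = np.linalg.norm(ratings, axis=1)
    norms[norms == 0] = 1.0                    # avoid division by zero for empty rows
    sims = (ratings @ ratings[user]) / (norms * norms[user])   # cosine similarity
    sims[user] = -np.inf                       # never pick the user as their own neighbor
    neighbors = np.argsort(sims)[::-1][:k]     # indices of the K most similar users

    # Score each item by the similarity-weighted ratings of the neighbors.
    scores = sims[neighbors] @ ratings[neighbors]
    scores[ratings[user] > 0] = -np.inf        # skip items the user has already rated
    ranked = np.argsort(scores)[::-1]
    return [int(i) for i in ranked[:top_n] if np.isfinite(scores[i])]

# Hypothetical 4 users x 4 items rating matrix.
toy_ratings = np.array([
    [5, 4, 0, 0],
    [5, 5, 4, 0],
    [0, 0, 5, 4],
    [4, 4, 0, 5],
], dtype=float)

print(recommend_user_based(toy_ratings, user=0, k=2))
```

For user 0 the two nearest neighbors are users 1 and 3, so the unrated items 2 and 3 are ranked by how highly those neighbors rated them.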

# Load ratings_small.csv for modeling
ratings_small_path = "./movie_dataset/ratings_small.csv"
ratings_small_df = pd.read_csv(ratings_small_path)

ratings_small_df.shape

(100004, 4)

ratings_small_df.head()

	userId	movieId	rating	timestamp
0	1	31	2.5	1260759144
1	1	1029	3.0	1260759179
2	1	1061	3.0	1260759182
3	1	1129	2.0	1260759185
4	1	1172	4.0	1260759205

# The original movieId values are not consecutive integers starting from 0; to make the user-item matrix easier to build, re-index them as movieid
movie_id = ratings_small_df['movieId'].drop_duplicates()
movie_id = pd.DataFrame(movie_id)
movie_id['movieid'] = range(len(movie_id))

movie_id

	movieId	movieid
0	31	0
1	1029	1
2	1061	2
3	1129	3
4	1172	4
...	...	...
99131	64997	9061
99159	72380	9062
99274	129	9063
99678	4736	9064
99820	6425	9065

9066 rows × 2 columns
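The same re-indexing can also be done in one step with `pd.factorize`, which assigns integer codes in order of first appearance. This is an alternative sketch, not the approach the article uses; the small `ratings` frame is illustrative only.

```python
import pandas as pd

# Toy ratings with non-consecutive, repeated movie ids.
ratings = pd.DataFrame({"movieId": [31, 1029, 31, 1061, 1029]})

# codes[i] is the 0-based index of ratings["movieId"][i] among the unique ids,
# numbered in order of first appearance; uniques maps a code back to the id.
codes, uniques = pd.factorize(ratings["movieId"])
ratings["movieid"] = codes

print(ratings)
print(list(uniques))
```

This avoids building a separate lookup frame and merging it back, at the cost of keeping `uniques` around if the original ids are needed later.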

ratings_small_df = pd.merge(ratings_small_df, movie_id, on =['movieId'], how='left')
ratings_small_df

	userId	movieId	rating	timestamp	movieid
0	1	31	2.5	1260759144	0
1	1	1029	3.0	1260759179	1
2	1	1061	3.0	1260759182	2
3	1	1129	2.0	1260759185	3
4	1	1172	4.0	1260759205	4
...	...	...	...	...	...
99999	671	6268	2.5	1065579370	7005
100000	671	6269	4.0	1065149201	4771
100001	671	6365	4.0	1070940363	1329
100002	671	6385	2.5	1070979663	1331
100003	671	6565	3.5	1074784724	2946

100004 rows × 5 columns

ratings_small_df = ratings_small_df[['userId','movieid','rating','timestamp']]  # replace movieId with the re-indexed movieid
ratings_small_df

	userId	movieid	rating	timestamp
0	1	0	2.5	1260759144
1	1	1	3.0	1260759179
2	1	2	3.0	1260759182
3	1	3	2.0	1260759185
4	1	4	4.0	1260759205
...	...	...	...	...
99999	671	7005	2.5	1065579370
100000	671	4771	4.0	1065149201
100001	671	1329	4.0	1070940363
100002	671	1331	2.5	1070979663
100003	671	2946	3.5	1074784724

100004 rows × 4 columns

# User and item counts
# unique() returns all unique values of a column as an array (numpy.ndarray)
# nunique() returns the number of unique elements in the object

n_users = ratings_small_df.userId.nunique()
n_users

n_items = ratings_small_df.movieid.nunique()
n_items

# Split the data set
from sklearn.model_selection import train_test_split
# Split the data into a 70% training set and a 30% test set
train_data,test_data = train_test_split(ratings_small_df,test_size= 0.3)

train_data

	userId	movieid	rating	timestamp
69526	481	329	4.0	1437001087
41670	299	917	3.5	1344188856
49260	358	288	2.0	957480147
39317	287	3582	4.0	1470168974
35991	262	2094	3.0	1433899624
...	...	...	...	...
6262	33	1095	2.0	1032769543
8504	56	367	2.0	1467005360
8540	56	1435	4.0	1467006577
77937	542	1496	1.0	1424966216
94226	624	476	3.0	1053249671

70002 rows × 4 columns

# User-item matrix for the training set
user_item_matrix = np.zeros((n_users,n_items))
user_item_matrix.shape

(671, 9066)

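# iterrows(): iterates over the DataFrame as (index, Series) pairs
# iteritems(): iterates over the DataFrame as (column name, Series) pairs (renamed items() in newer pandas)
# itertuples(): iterates over the DataFrame rows as tuples
# line[1] is userId (1-based), line[2] is movieid (0-based), line[3] is rating
for line in train_data.itertuples():
    user_item_matrix[line[1]-1,line[2]]=line[3]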

user_item_matrix

array([[0., 3., 3., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

user_item_matrix.shape

(671, 9066)
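The row-by-row loop can be replaced by a single NumPy indexed assignment. The sketch below uses toy arrays rather than the article's data, and assumes that user ids run from 1 to `n_users` and that each (user, movie) pair occurs at most once.

```python
import numpy as np

# Toy training triples; values are illustrative only.
user_ids = np.array([1, 1, 2, 3])        # 1-based user ids
movie_idx = np.array([0, 2, 1, 2])       # 0-based re-indexed movie ids
values = np.array([2.5, 3.0, 4.0, 1.0])  # ratings

n_users, n_items = 3, 3
matrix = np.zeros((n_users, n_items))

# Write every rating into its (user, movie) cell in one vectorized step.
matrix[user_ids - 1, movie_idx] = values

print(matrix)
```

With the names used in this article, the equivalent call would take `train_data['userId'].to_numpy() - 1`, `train_data['movieid'].to_numpy()` and `train_data['rating'].to_numpy()` as the three arrays.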

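# Build the user similarity matrix using cosine distance
from sklearn.metrics.pairwise import pairwise_distances
# Pairwise cosine distance (1 - cosine similarity, so 0 means the two rating vectors point in the same direction)
user_similarity_m = pairwise_distances(user_item_matrix,metric='cosine')  # each user is one row, so no transpose is needed here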

a=[[1,3],[2,2]]
a

[[1, 3], [2, 2]]

pairwise_distances(a,metric='euclidean')

array([[0.        , 1.41421356],
       [1.41421356, 0.        ]])

b = np.array([[1,2],[1,3],[2,1]])
b

array([[1, 2],
       [1, 3],
       [2, 1]])

pairwise_distances(b,metric='euclidean') # row 0, column 1 of the result is the distance between b[0] and b[1]

array([[0.        , 1.        , 1.41421356],
       [1.        , 0.        , 2.23606798],
       [1.41421356, 2.23606798, 0.        ]])

pairwise_distances(b,metric='cosine')

array([[0.        , 0.01005051, 0.2       ],
       [0.01005051, 0.        , 0.29289322],
       [0.2       , 0.29289322, 0.        ]])
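The cosine output above can be checked by hand. With `metric='cosine'`, `pairwise_distances` returns the cosine distance, which is 1 minus the cosine similarity, so the diagonal is 0 and smaller values mean more similar rows. The NumPy-only sketch below recomputes it for the same array `b`.

```python
import numpy as np

b = np.array([[1, 2], [1, 3], [2, 1]], dtype=float)

# Cosine similarity: dot product divided by the product of the row norms.
norms = np.linalg.norm(b, axis=1)
cosine_similarity = (b @ b.T) / np.outer(norms, norms)

# Cosine distance is the complement of the similarity.
cosine_distance = 1.0 - cosine_similarity

print(np.round(cosine_distance, 8))
```

This also means that `user_similarity_m`, as computed in this article, holds distances rather than similarities; a similarity matrix would be `1 - user_similarity_m`.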

b.shape

(3, 2)

b[1]

array([1, 3])

b[0]

array([1, 2])

user_similarity_m.shape

(671, 671)

user_similarity_m[0:5,0:5].round(2)

array([[0.  , 1.  , 1.  , 0.94, 0.97],
       [1.  , 0.  , 0.89, 0.93, 0.92],
       [1.  , 0.89, 0.  , 0.93, 0.93],
       [0.94, 0.93, 0.93, 0.  , 0.94],
       [0.97, 0.92, 0.93, 0.94, 0.  ]])

user_similarity_m_triu = np.triu(user_similarity_m,k=1) # keep only the upper triangle, excluding the diagonal
np.round(user_similarity_m_triu[user_similarity_m_triu.nonzero()],3)

array([1.   , 1.   , 0.938, ..., 0.934, 0.919, 0.814])

user_sim_nonzero = np.round(user_similarity_m_triu[user_similarity_m_triu.nonzero()],3)

np.percentile(user_sim_nonzero,np.arange(0,101,10))

array([0.316, 0.844, 0.885, 0.911, 0.93 , 0.947, 0.961, 0.976, 1.   ,
       1.   , 1.   ])

Prediction on the training set

mean_user_rating = user_item_matrix.mean(axis=1)
mean_user_rating

array([0.00297816, 0.0198544 , 0.01301566, 0.06265167, 0.03027796,
       0.01196779, 0.02404589, 0.03805427, 0.01114053, 0.0147805 ,
       0.01047871, 0.01301566, 0.01615928, 0.004743  , 0.33984116,
       0.01069932, 0.0991617 , 0.0147805 , 0.11780278, 0.02294286,
       0.0443415 , 0.04936025, 0.21111846, 0.00683874, 0.00617692,
       0.04031546, 0.00816236, 0.01555261, 0.00363997, 0.29643724,
       0.02625193, 0.01080962, 0.03684094, 0.05702625, 0.0025921 ,
       0.03000221, 0.01147143, 0.03838518, 0.0196338 , 0.01301566,
       0.0592323 , 0.0196338 , 0.02172954, 0.00694904, 0.00512905,
       0.01312597, 0.01069932, 0.14212442, 0.02371498, 0.01169204,
       0.01069932, 0.01941319, 0.00893448, 0.01384293, 0.00838297,
       0.14615045, 0.06254136, 0.01753805, 0.02090227, 0.02018531,
       0.04197   , 0.0172623 , 0.02454225, 0.00739025, 0.00694904,
       0.01544231, 0.02856828, 0.03331127, 0.02327377, 0.02856828,
       0.00794176, 0.05035297, 0.42096845, 0.01544231, 0.0394882 ,
       0.00573572, 0.07346128, 0.08234061, 0.00921024, 0.0100375 ,
       0.05283477, 0.01235385, 0.04555482, 0.03424884, 0.0247077 ,
       0.05084933, 0.00650783, 0.06281712, 0.02448709, 0.01577322,
       0.04671299, 0.02779616, 0.0444518 , 0.0497463 , 0.08769027,
       0.02288771, 0.03160159, 0.02332892, 0.0497463 , 0.00661813,
       0.0173726 , 0.1978822 , 0.02437679, 0.02360468, 0.13942202,
       0.01411869, 0.00656298, 0.00783146, 0.00628723, 0.04015001,
       0.09541143, 0.00650783, 0.00959629, 0.00827267, 0.01351202,
       0.00667328, 0.01389808, 0.05592323, 0.17185087, 0.03833002,
       0.02503861, 0.00882418, 0.00937569, 0.02415619, 0.0666777 ,
       0.01808957, 0.00573572, 0.09695566, 0.00595632, 0.09254357,
       0.01433929, 0.0297816 , 0.03518641, 0.10065078, 0.00529451,
       0.01637988, 0.02018531, 0.02024046, 0.02090227, 0.01213325,
       0.00816236, 0.01103022, 0.02365983, 0.01158173, 0.01455989,
       0.02134348, 0.01312597, 0.04081182, 0.06121774, 0.09557688,
       0.01654533, 0.05664019, 0.01698654, 0.00926539, 0.01279506,
       0.01604897, 0.08697331, 0.00562541, 0.04070152, 0.03132583,
       0.02702405, 0.00915508, 0.02007501, 0.02421134, 0.10870285,
       0.01422899, 0.00761085, 0.02790646, 0.03623428, 0.00590117,
       0.01588352, 0.0051842 , 0.01169204, 0.00639753, 0.03551732,
       0.05834988, 0.07059343, 0.03259431, 0.01080962, 0.00452239,
       0.01108537, 0.04301787, 0.01433929, 0.01125083, 0.05625414,
       0.0099272 , 0.1014229 , 0.02867858, 0.03320097, 0.02150893,
       0.00739025, 0.01687624, 0.01886168, 0.01367748, 0.10533863,
       0.02702405, 0.02051621, 0.01979925, 0.12221487, 0.05846018,
       0.03910214, 0.01831017, 0.01086477, 0.00871388, 0.05542687,
       0.00948599, 0.00672844, 0.01604897, 0.00452239, 0.00683874,
       0.01610413, 0.2176263 , 0.18409442, 0.06254136, 0.01753805,
       0.02465255, 0.03430399, 0.01433929, 0.03855063, 0.07842488,
       0.00501875, 0.02658284, 0.01158173, 0.02625193, 0.00827267,
       0.00921024, 0.00750055, 0.02349437, 0.00987205, 0.03347673,
       0.00937569, 0.20416942, 0.00871388, 0.03160159, 0.04858813,
       0.06507831, 0.01075447, 0.02432164, 0.0843812 , 0.07246856,
       0.0147805 , 0.14328259, 0.08151335, 0.02195014, 0.04268696,
       0.00739025, 0.07677035, 0.03595853, 0.0051842 , 0.05581293,
       0.03628943, 0.01147143, 0.05840503, 0.03739246, 0.04252151,
       0.00650783, 0.02768586, 0.01025811, 0.00926539, 0.01235385,
       0.01047871, 0.13710567, 0.03000221, 0.01091992, 0.054379  ,
       0.00959629, 0.01158173, 0.10324289, 0.00628723, 0.06176925,
       0.02029561, 0.01158173, 0.0296713 , 0.01114053, 0.06585043,
       0.00479815, 0.01808957, 0.0147805 , 0.01136113, 0.00838297,
       0.02509376, 0.03524156, 0.05333113, 0.01433929, 0.10302228,
       0.00915508, 0.0893448 , 0.02029561, 0.0049636 , 0.02090227,
       0.02553497, 0.08018972, 0.02217075, 0.25672844, 0.06849768,
       0.00634238, 0.03662034, 0.02647253, 0.11763733, 0.01389808,
       0.00551511, 0.00750055, 0.06243106, 0.03309067, 0.00595632,
       0.16997573, 0.02029561, 0.0148908 , 0.04594088, 0.00468784,
       0.23830796, 0.07290977, 0.08112729, 0.01169204, 0.01246415,
       0.03524156, 0.00573572, 0.01588352, 0.00595632, 0.01571807,
       0.02283256, 0.01323627, 0.00700419, 0.04180454, 0.00446724,
       0.00783146, 0.02073682, 0.04649239, 0.00584602, 0.02680344,
       0.00689389, 0.00816236, 0.02503861, 0.01086477, 0.007666  ,
       0.00816236, 0.00330907, 0.01323627, 0.02950585, 0.01384293,
       0.00595632, 0.05868079, 0.01114053, 0.04500331, 0.0619347 ,
       0.09055813, 0.00650783, 0.01621443, 0.00639753, 0.0495257 ,
       0.01378778, 0.02443194, 0.1039047 , 0.01544231, 0.09039268,
       0.00419148, 0.00948599, 0.15243768, 0.01483565, 0.0098169 ,
       0.01533201, 0.03071917, 0.05404809, 0.00909993, 0.0224465 ,
       0.0097066 , 0.05217295, 0.00628723, 0.01345687, 0.03055372,
       0.0446724 , 0.00849327, 0.06165895, 0.00838297, 0.00705934,
       0.01808957, 0.00645268, 0.03750276, 0.01990955, 0.28375248,
       0.02945069, 0.07654975, 0.01544231, 0.11973307, 0.03132583,
       0.02691374, 0.09276417, 0.22865652, 0.01246415, 0.03430399,
       0.02923009, 0.00617692, 0.0125193 , 0.04511361, 0.00683874,
       0.03540702, 0.01632473, 0.01544231, 0.00595632, 0.01676594,
       0.024818  , 0.09303993, 0.00783146, 0.0098169 , 0.11675491,
       0.0270792 , 0.10699316, 0.05978381, 0.01566292, 0.00799691,
       0.00882418, 0.05129054, 0.00650783, 0.01698654, 0.00893448,
       0.02724465, 0.04114273, 0.0494154 , 0.01643503, 0.02823737,
       0.0101478 , 0.0296713 , 0.09458416, 0.00799691, 0.01588352,
       0.06507831, 0.09458416, 0.04560997, 0.00457754, 0.09618354,
       0.09303993, 0.02013016, 0.06221046, 0.05382749, 0.00606662,
       0.02161924, 0.00683874, 0.00612177, 0.05779837, 0.01367748,
       0.03568277, 0.07572248, 0.01775866, 0.00441209, 0.00540481,
       0.00904478, 0.01808957, 0.00639753, 0.00871388, 0.03943305,
       0.01599382, 0.33085153, 0.02294286, 0.0101478 , 0.00821752,
       0.01660049, 0.14179351, 0.02272226, 0.00705934, 0.08283697,
       0.15784249, 0.0121884 , 0.13335539, 0.01058901, 0.01119568,
       0.0593426 , 0.02095742, 0.30228326, 0.0048533 , 0.01869623,
       0.0569711 , 0.24652548, 0.02614163, 0.01301566, 0.14284139,
       0.01114053, 0.00490845, 0.02774101, 0.03132583, 0.1185749 ,
       0.1435032 , 0.01819987, 0.03259431, 0.00573572, 0.004743  ,
       0.0398191 , 0.04037062, 0.01781381, 0.00672844, 0.0051842 ,
       0.01875138, 0.01941319, 0.02923009, 0.02415619, 0.00617692,
       0.03309067, 0.03419369, 0.0048533 , 0.01235385, 0.05741231,
       0.05658504, 0.03353188, 0.01334657, 0.004743  , 0.09927201,
       0.0051842 , 0.01125083, 0.01334657, 0.2351092 , 0.04367968,
       0.00948599, 0.00921024, 0.00584602, 0.1037944 , 0.00876903,
       0.03805427, 0.01411869, 0.19170527, 0.05619899, 0.03987426,
       0.01384293, 0.06083168, 0.04003971, 0.01968895, 0.03992941,
       0.00777631, 0.03171189, 0.03325612, 0.16804544, 0.02062652,
       0.03298037, 0.01384293, 0.0394882 , 0.08030002, 0.01378778,
       0.03011251, 0.10070593, 0.00739025, 0.01058901, 0.00551511,
       0.00683874, 0.01704169, 0.01544231, 0.09265387, 0.02713435,
       0.02178469, 0.63484447, 0.03562762, 0.00623208, 0.03353188,
       0.02360468, 0.00783146, 0.06358923, 0.01511141, 0.01831017,
       0.00959629, 0.01329142, 0.07224796, 0.04378998, 0.03253916,
       0.07798368, 0.07026252, 0.04616148, 0.52404589, 0.00871388,
       0.00777631, 0.01147143, 0.01180234, 0.02283256, 0.03634458,
       0.01577322, 0.02950585, 0.0101478 , 0.09022722, 0.14284139,
       0.01125083, 0.0917163 , 0.00805206, 0.00209574, 0.22887712,
       0.00595632, 0.03502096, 0.00821752, 0.06072138, 0.09728657,
       0.0150011 , 0.15938672, 0.01400838, 0.01047871, 0.02228105,
       0.00849327, 0.03904699, 0.02128833, 0.02514891, 0.05118023,
       0.14399956, 0.06243106, 0.07842488, 0.05757776, 0.01119568,
       0.01268476, 0.03926759, 0.03617913, 0.00330907, 0.11096404,
       0.0196338 , 0.12618575, 0.08879329, 0.02283256, 0.01913744,
       0.01080962, 0.01742775, 0.01560777, 0.02889918, 0.10225017,
       0.01069932, 0.01764836, 0.0100375 , 0.01257445, 0.04086698,
       0.02614163, 0.01185749, 0.03105008, 0.39383411, 0.02079197,
       0.04290757, 0.04500331, 0.0223362 , 0.00959629, 0.0075557 ,
       0.00937569, 0.01185749, 0.00772116, 0.00534966, 0.00750055,
       0.00739025, 0.00976175, 0.004743  , 0.01455989, 0.01191264,
       0.04059122, 0.01169204, 0.00490845, 0.01125083, 0.007666  ,
       0.05834988, 0.05162144, 0.07715641, 0.0245974 , 0.00827267,
       0.00595632, 0.08509817, 0.01753805, 0.20257004, 0.03353188,
       0.0445621 , 0.00419148, 0.01952349, 0.03827487, 0.02950585,
       0.00843812, 0.01742775, 0.00871388, 0.15927642, 0.1088683 ,
       0.00816236, 0.01687624, 0.00739025, 0.0098169 , 0.00716964,
       0.0347452 ])

rating_diff = (user_item_matrix - mean_user_rating[:,np.newaxis])   # np.newaxis adds a dimension to mean_user_rating so the subtraction broadcasts across each row
rating_diff

array([[-2.97816016e-03,  2.99702184e+00,  2.99702184e+00, ...,
        -2.97816016e-03, -2.97816016e-03, -2.97816016e-03],
       [-1.98544011e-02, -1.98544011e-02, -1.98544011e-02, ...,
        -1.98544011e-02, -1.98544011e-02, -1.98544011e-02],
       [-1.30156629e-02, -1.30156629e-02, -1.30156629e-02, ...,
        -1.30156629e-02, -1.30156629e-02, -1.30156629e-02],
       ...,
       [-9.81689830e-03, -9.81689830e-03, -9.81689830e-03, ...,
        -9.81689830e-03, -9.81689830e-03, -9.81689830e-03],
       [-7.16964483e-03, -7.16964483e-03, -7.16964483e-03, ...,
        -7.16964483e-03, -7.16964483e-03, -7.16964483e-03],
       [-3.47452019e-02, -3.47452019e-02, -3.47452019e-02, ...,
        -3.47452019e-02, -3.47452019e-02, -3.47452019e-02]])

user_prediction = mean_user_rating[:,np.newaxis] + user_similarity_m.dot(rating_diff) / np.array([np.abs(user_similarity_m).sum(axis=1)]).T
# Dividing by np.array([np.abs(user_similarity_m).sum(axis=1)]).T normalizes the weighted sum, with the intent of keeping predictions on the 1~5 rating scale

user_prediction

array([[ 8.48587738e-02,  1.11549860e-01,  7.78496257e-02, ...,
        -3.30873704e-02, -3.59785123e-02, -3.59132569e-02],
       [ 9.36489784e-02,  1.35396758e-01,  1.04357090e-01, ...,
        -1.62815182e-02, -1.93136443e-02, -1.93247190e-02],
       [ 9.44428457e-02,  1.33314515e-01,  9.83052575e-02, ...,
        -2.28228892e-02, -2.58037344e-02, -2.59258365e-02],
       ...,
       [ 9.29750987e-02,  1.27902780e-01,  9.32275326e-02, ...,
        -2.60694824e-02, -2.89101875e-02, -2.87905826e-02],
       [ 8.62056229e-02,  1.26697599e-01,  9.17810994e-02, ...,
        -2.88942031e-02, -3.19119828e-02, -3.20590645e-02],
       [ 1.17342284e-01,  1.50739909e-01,  1.17908253e-01, ...,
        -7.69495365e-05, -2.99819315e-03, -3.02101562e-03]])
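The predictions above are far below the 1~5 rating scale. Two things in the code contribute to that: `user_item_matrix.mean(axis=1)` averages over all 9066 columns, including the zeros that stand for "not rated", and `user_similarity_m` holds cosine distances rather than similarities. The sketch below shows one possible variant that addresses both points; it is not the article's code, `predict_user_based` is a hypothetical name, and the toy matrix is illustrative only.

```python
import numpy as np


def predict_user_based(ratings):
    """Mean-centered user-based prediction; 0 in `ratings` means 'not rated'."""
    ratings = np.asarray(ratings, dtype=float)
    rated = ratings > 0
    counts = rated.sum(axis=1)

    # Per-user mean over rated entries only (0 for users without ratings).
    means = np.divide(ratings.sum(axis=1), counts,
                      out=np.zeros(len(ratings)), where=counts > 0)

    # Deviation from the user's own mean; unrated cells contribute nothing.
    deviations = np.where(rated, ratings - means[:, np.newaxis], 0.0)

    # Cosine similarity (not distance) between user rows.
    norms = np.linalg.norm(ratings, axis=1)
    norms[norms == 0] = 1.0
    similarity = (ratings @ ratings.T) / np.outer(norms, norms)

    # Normalize by the total absolute similarity of each user.
    weights = np.abs(similarity).sum(axis=1, keepdims=True)
    weights[weights == 0] = 1.0
    return means[:, np.newaxis] + (similarity @ deviations) / weights


toy = np.array([[5, 3, 0],
                [4, 0, 4],
                [0, 2, 5]])
predictions = predict_user_based(toy)
print(np.round(predictions, 3))
```

On this toy matrix every prediction falls between 1 and 5. That is not guaranteed in general, so clipping to the rating range may still be needed.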

# Evaluate only on the entries that actually have a rating
from sklearn.metrics import mean_squared_error
from math import sqrt

prediction_flatten = user_prediction[user_item_matrix.nonzero()]
prediction_flatten

array([0.11154986, 0.07784963, 0.14877094, ..., 0.04236321, 0.01114962,
       0.02448394])

user_item_matrix_flatten = user_item_matrix[user_item_matrix.nonzero()]
user_item_matrix_flatten

array([3., 3., 2., ..., 4., 4., 4.])

error_test = sqrt(mean_squared_error(prediction_flatten,user_item_matrix_flatten)) # root mean squared error (RMSE)

error_test

3.390138302832629
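The RMSE above is computed on the same training entries that were used to build the predictions. The masked evaluation can be wrapped in a small helper so that it can be reused on a matrix built from `test_data`. The sketch below is NumPy-only; `rmse_on_rated` is a hypothetical name and the two small arrays are illustrative only.

```python
import numpy as np


def rmse_on_rated(predictions, actual):
    """RMSE over the cells of `actual` that hold a rating (non-zero)."""
    mask = actual > 0
    errors = predictions[mask] - actual[mask]
    return float(np.sqrt(np.mean(errors ** 2)))


# Toy example: only two cells of `actual` are rated.
actual = np.array([[4.0, 0.0],
                   [0.0, 2.0]])
predictions = np.array([[3.0, 1.0],
                        [5.0, 4.0]])

score = rmse_on_rated(predictions, actual)
print(score)
```

Here the errors on the rated cells are -1 and 2, so the result is the square root of 2.5.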

Source: https://www.cnblogs.com/oceaneyes-gzy/p/16462988.html