
Building a Movie Recommendation System with Association Rule Mining

Author: Internet

The Apriori Algorithm

import pandas as pd
import matplotlib.pyplot as plt
import mlxtend
import numpy as np

Preparing the Movie Data

movie_data_file = './movie_dataset/movies_metadata.csv'
ratings_file = './movie_dataset/ratings_small.csv'
movie_data_df = pd.read_csv(movie_data_file)
ratings_df = pd.read_csv(ratings_file)
c:\users\ysilhouette\documents\pyenv\py3.6.5\lib\site-packages\IPython\core\interactiveshell.py:3072: DtypeWarning: Columns (10) have mixed types.Specify dtype option on import or set low_memory=False.
  interactivity=interactivity, compiler=compiler, result=result)
movie_data_df.head(5)
adult belongs_to_collection budget genres homepage id imdb_id original_language original_title overview ... release_date revenue runtime spoken_languages status tagline title video vote_average vote_count
0 False {'id': 10194, 'name': 'Toy Story Collection', ... 30000000 [{'id': 16, 'name': 'Animation'}, {'id': 35, '... http://toystory.disney.com/toy-story 862 tt0114709 en Toy Story Led by Woody, Andy's toys live happily in his ... ... 1995-10-30 373554033.0 81.0 [{'iso_639_1': 'en', 'name': 'English'}] Released NaN Toy Story False 7.7 5415.0
1 False NaN 65000000 [{'id': 12, 'name': 'Adventure'}, {'id': 14, '... NaN 8844 tt0113497 en Jumanji When siblings Judy and Peter discover an encha... ... 1995-12-15 262797249.0 104.0 [{'iso_639_1': 'en', 'name': 'English'}, {'iso... Released Roll the dice and unleash the excitement! Jumanji False 6.9 2413.0
2 False {'id': 119050, 'name': 'Grumpy Old Men Collect... 0 [{'id': 10749, 'name': 'Romance'}, {'id': 35, ... NaN 15602 tt0113228 en Grumpier Old Men A family wedding reignites the ancient feud be... ... 1995-12-22 0.0 101.0 [{'iso_639_1': 'en', 'name': 'English'}] Released Still Yelling. Still Fighting. Still Ready for... Grumpier Old Men False 6.5 92.0
3 False NaN 16000000 [{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam... NaN 31357 tt0114885 en Waiting to Exhale Cheated on, mistreated and stepped on, the wom... ... 1995-12-22 81452156.0 127.0 [{'iso_639_1': 'en', 'name': 'English'}] Released Friends are the people who let you be yourself... Waiting to Exhale False 6.1 34.0
4 False {'id': 96871, 'name': 'Father of the Bride Col... 0 [{'id': 35, 'name': 'Comedy'}] NaN 11862 tt0113041 en Father of the Bride Part II Just when George Banks has recovered from his ... ... 1995-02-10 76578911.0 106.0 [{'iso_639_1': 'en', 'name': 'English'}] Released Just When His World Is Back To Normal... He's ... Father of the Bride Part II False 5.7 173.0

5 rows × 24 columns

movie_data_df.describe()
revenue runtime vote_average vote_count
count 4.546000e+04 45203.000000 45460.000000 45460.000000
mean 1.120935e+07 94.128199 5.618207 109.897338
std 6.433225e+07 38.407810 1.924216 491.310374
min 0.000000e+00 0.000000 0.000000 0.000000
25% 0.000000e+00 85.000000 5.000000 3.000000
50% 0.000000e+00 95.000000 6.000000 10.000000
75% 0.000000e+00 107.000000 6.800000 34.000000
max 2.787965e+09 1256.000000 10.000000 14075.000000
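movie_data_df.info  # parentheses are missing, so this shows the bound method's repr (a dump of the frame) rather than the info() summary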
<bound method DataFrame.info of        adult                              belongs_to_collection    budget  \
0      False  {'id': 10194, 'name': 'Toy Story Collection', ...  30000000   
1      False                                                NaN  65000000   
2      False  {'id': 119050, 'name': 'Grumpy Old Men Collect...         0   
3      False                                                NaN  16000000   
4      False  {'id': 96871, 'name': 'Father of the Bride Col...         0   
...      ...                                                ...       ...   
45461  False                                                NaN         0   
45462  False                                                NaN         0   
45463  False                                                NaN         0   
45464  False                                                NaN         0   
45465  False                                                NaN         0   

                                                  genres  \
0      [{'id': 16, 'name': 'Animation'}, {'id': 35, '...   
1      [{'id': 12, 'name': 'Adventure'}, {'id': 14, '...   
2      [{'id': 10749, 'name': 'Romance'}, {'id': 35, ...   
3      [{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...   
4                         [{'id': 35, 'name': 'Comedy'}]   
...                                                  ...   
45461  [{'id': 18, 'name': 'Drama'}, {'id': 10751, 'n...   
45462                      [{'id': 18, 'name': 'Drama'}]   
45463  [{'id': 28, 'name': 'Action'}, {'id': 18, 'nam...   
45464                                                 []   
45465                                                 []   

                                   homepage      id    imdb_id  \
0      http://toystory.disney.com/toy-story     862  tt0114709   
1                                       NaN    8844  tt0113497   
2                                       NaN   15602  tt0113228   
3                                       NaN   31357  tt0114885   
4                                       NaN   11862  tt0113041   
...                                     ...     ...        ...   
45461  http://www.imdb.com/title/tt6209470/  439050  tt6209470   
45462                                   NaN  111109  tt2028550   
45463                                   NaN   67758  tt0303758   
45464                                   NaN  227506  tt0008536   
45465                                   NaN  461257  tt6980792   

      original_language               original_title  \
0                    en                    Toy Story   
1                    en                      Jumanji   
2                    en             Grumpier Old Men   
3                    en            Waiting to Exhale   
4                    en  Father of the Bride Part II   
...                 ...                          ...   
45461                fa                      رگ خواب   
45462                tl          Siglo ng Pagluluwal   
45463                en                     Betrayal   
45464                en          Satana likuyushchiy   
45465                en                     Queerama   

                                                overview  ... release_date  \
0      Led by Woody, Andy's toys live happily in his ...  ...   1995-10-30   
1      When siblings Judy and Peter discover an encha...  ...   1995-12-15   
2      A family wedding reignites the ancient feud be...  ...   1995-12-22   
3      Cheated on, mistreated and stepped on, the wom...  ...   1995-12-22   
4      Just when George Banks has recovered from his ...  ...   1995-02-10   
...                                                  ...  ...          ...   
45461        Rising and falling between a man and woman.  ...          NaN   
45462  An artist struggles to finish his work while a...  ...   2011-11-17   
45463  When one of her hits goes wrong, a professiona...  ...   2003-08-01   
45464  In a small town live two brothers, one a minis...  ...   1917-10-21   
45465  50 years after decriminalisation of homosexual...  ...   2017-06-09   

           revenue runtime                                   spoken_languages  \
0      373554033.0    81.0           [{'iso_639_1': 'en', 'name': 'English'}]   
1      262797249.0   104.0  [{'iso_639_1': 'en', 'name': 'English'}, {'iso...   
2              0.0   101.0           [{'iso_639_1': 'en', 'name': 'English'}]   
3       81452156.0   127.0           [{'iso_639_1': 'en', 'name': 'English'}]   
4       76578911.0   106.0           [{'iso_639_1': 'en', 'name': 'English'}]   
...            ...     ...                                                ...   
45461          0.0    90.0             [{'iso_639_1': 'fa', 'name': 'فارسی'}]   
45462          0.0   360.0                  [{'iso_639_1': 'tl', 'name': ''}]   
45463          0.0    90.0           [{'iso_639_1': 'en', 'name': 'English'}]   
45464          0.0    87.0                                                 []   
45465          0.0    75.0           [{'iso_639_1': 'en', 'name': 'English'}]   

         status                                            tagline  \
0      Released                                                NaN   
1      Released          Roll the dice and unleash the excitement!   
2      Released  Still Yelling. Still Fighting. Still Ready for...   
3      Released  Friends are the people who let you be yourself...   
4      Released  Just When His World Is Back To Normal... He's ...   
...         ...                                                ...   
45461  Released         Rising and falling between a man and woman   
45462  Released                                                NaN   
45463  Released                             A deadly game of wits.   
45464  Released                                                NaN   
45465  Released                                                NaN   

                             title  video vote_average vote_count  
0                        Toy Story  False          7.7     5415.0  
1                          Jumanji  False          6.9     2413.0  
2                 Grumpier Old Men  False          6.5       92.0  
3                Waiting to Exhale  False          6.1       34.0  
4      Father of the Bride Part II  False          5.7      173.0  
...                            ...    ...          ...        ...  
45461                       Subdue  False          4.0        1.0  
45462          Century of Birthing  False          9.0        3.0  
45463                     Betrayal  False          3.8        6.0  
45464             Satan Triumphant  False          0.0        0.0  
45465                     Queerama  False          0.0        0.0  

[45466 rows x 24 columns]>
movie_data_df.count()
adult                    45466
belongs_to_collection     4494
budget                   45466
genres                   45466
homepage                  7782
id                       45466
imdb_id                  45449
original_language        45455
original_title           45466
overview                 44512
popularity               45461
poster_path              45080
production_companies     45463
production_countries     45463
release_date             45379
revenue                  45460
runtime                  45203
spoken_languages         45460
status                   45379
tagline                  20412
title                    45460
video                    45460
vote_average             45460
vote_count               45460
dtype: int64
movie_data_df.columns
Index(['adult', 'belongs_to_collection', 'budget', 'genres', 'homepage', 'id',
       'imdb_id', 'original_language', 'original_title', 'overview',
       'popularity', 'poster_path', 'production_companies',
       'production_countries', 'release_date', 'revenue', 'runtime',
       'spoken_languages', 'status', 'tagline', 'title', 'video',
       'vote_average', 'vote_count'],
      dtype='object')
ratings_df.head(5)
userId movieId rating timestamp
0 1 31 2.5 1260759144
1 1 1029 3.0 1260759179
2 1 1061 3.0 1260759182
3 1 1129 2.0 1260759185
4 1 1172 4.0 1260759205
ratings_df.columns
Index(['userId', 'movieId', 'rating', 'timestamp'], dtype='object')
ratings_df.count()
userId       100004
movieId      100004
rating       100004
timestamp    100004
dtype: int64
ratings_df.shape
(100004, 4)
movie_data_df.shape
(45466, 24)

Data Preprocessing

movie_data_df_t = movie_data_df[['title', 'id']]  # appending .copy() here would avoid the SettingWithCopyWarning messages that appear later
movie_data_df_t.dtypes
title    object
id       object
dtype: object
ratings_df_s = ratings_df.drop(['timestamp'], axis=1)  # axis=0 drops rows, axis=1 drops columns
ratings_df_s.dtypes
userId       int64
movieId      int64
rating     float64
dtype: object
Handling Missing Values
# pd.to_numeric converts the id column from strings to numbers; values that cannot be converted become NaN
pd.to_numeric(movie_data_df_t['id'],errors='coerce')
0           862.0
1          8844.0
2         15602.0
3         31357.0
4         11862.0
           ...   
45461    439050.0
45462    111109.0
45463     67758.0
45464    227506.0
45465    461257.0
Name: id, Length: 45466, dtype: float64
# np.where returns the positions where the condition in parentheses holds
np.where(pd.to_numeric(movie_data_df_t['id'], errors='coerce').isna())  # positions of the missing values; isna() returns True for NaN and False otherwise
(array([19730, 29503, 35587], dtype=int64),)
movie_data_df_t.iloc[19730]
title           NaN
id       1997-08-20
Name: 19730, dtype: object
movie_data_df_t.iloc[[19730,29503,35587]]
title id
19730 NaN 1997-08-20
29503 NaN 2012-09-29
35587 NaN 2014-01-01
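
The coercion step above can be tried on a tiny Series. This is a minimal sketch assuming only pandas and NumPy; the variable names and the short list of values are illustrative and stand in for the real id column.

```python
import numpy as np
import pandas as pd

# Illustrative values: numeric strings plus one date string that ended up
# in the id column (the kind of row found above).
raw_ids = pd.Series(['862', '8844', '1997-08-20', '15602'])

# errors='coerce' turns anything that cannot be parsed as a number into NaN.
numeric_ids = pd.to_numeric(raw_ids, errors='coerce')

# np.where on the boolean mask returns the positions of the NaN entries.
bad_positions = np.where(numeric_ids.isna())[0]

print(numeric_ids)
print(bad_positions)
```

Because one entry becomes NaN, the converted Series has a float dtype, which is why the article casts the column back to int64 after dropping the bad rows.
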
# Assign the converted values back to the id column
movie_data_df_t['id'] = pd.to_numeric(movie_data_df_t['id'], errors='coerce')
c:\users\ysilhouette\documents\pyenv\py3.6.5\lib\site-packages\ipykernel_launcher.py:2: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

movie_data_df_t.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45466 entries, 0 to 45465
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   title   45460 non-null  object 
 1   id      45463 non-null  float64
dtypes: float64(1), object(1)
memory usage: 710.5+ KB
movie_data_df_t.iloc[[19730,29503,35587]]
title id
19730 NaN NaN
29503 NaN NaN
35587 NaN NaN
movie_data_df_t.shape
(45466, 2)
movie_data_df_t.drop(np.where(movie_data_df_t['id'].isna())[0], inplace=True)
c:\users\ysilhouette\documents\pyenv\py3.6.5\lib\site-packages\pandas\core\frame.py:4174: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  errors=errors,
movie_data_df_t.shape
(45463, 2)
Removing Duplicates
movie_data_df_t.duplicated(['id','title']).sum()
30
movie_data_df_t.drop_duplicates(['id'],inplace=True)
c:\users\ysilhouette\documents\pyenv\py3.6.5\lib\site-packages\ipykernel_launcher.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.
movie_data_df_t.shape
(45433, 2)
ratings_df_s.duplicated(['userId','movieId']).sum()
0
movie_data_df_t['id'] = movie_data_df_t['id'].astype(np.int64)
c:\users\ysilhouette\documents\pyenv\py3.6.5\lib\site-packages\ipykernel_launcher.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.
movie_data_df_t.dtypes
title    object
id        int64
dtype: object
ratings_df_s.dtypes
userId       int64
movieId      int64
rating     float64
dtype: object

Merging the Data

# Merge the left and right DataFrames on movieId (left) and id (right)
ratings_df_s = pd.merge(ratings_df_s,movie_data_df_t, left_on='movieId',right_on='id')
ratings_df_s.head()
userId movieId rating title id
0 1 1371 2.5 Rocky III 1371
1 4 1371 4.0 Rocky III 1371
2 7 1371 3.0 Rocky III 1371
3 19 1371 4.0 Rocky III 1371
4 21 1371 3.0 Rocky III 1371
ratings_df_s.drop(['id'],axis=1,inplace=True)
ratings_df_s
userId movieId rating title
0 1 1371 2.5 Rocky III
1 4 1371 4.0 Rocky III
2 7 1371 3.0 Rocky III
3 19 1371 4.0 Rocky III
4 21 1371 3.0 Rocky III
... ... ... ... ...
44984 652 129009 4.0 Love Is a Ball
44985 653 2103 3.0 Solaris
44986 659 167 4.0 K-PAX
44987 659 563 3.0 Starship Troopers
44988 665 129 3.0 Spirited Away

44989 rows × 4 columns

ratings_df_s.shape
(44989, 4)
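
One thing worth double-checking at this point: the merge treats movieId from the ratings file and id from the metadata file as the same key. If the CSV files come from the Kaggle "The Movies Dataset" (an assumption; the article does not name its source), movieId would be a MovieLens ID and id a TMDB ID, with a separate links file mapping one to the other. The sketch below uses small made-up frames, and the links frame with movieId/tmdbId columns is an assumed stand-in for such a mapping file.

```python
import pandas as pd

# Made-up stand-ins for the three tables (values are illustrative only).
ratings = pd.DataFrame({'userId': [1, 2], 'movieId': [10, 20], 'rating': [4.0, 3.5]})
links = pd.DataFrame({'movieId': [10, 20], 'tmdbId': [501, 502]})  # assumed mapping table
movies = pd.DataFrame({'id': [501, 502], 'title': ['Movie A', 'Movie B']})

# Step 1: attach the metadata-side ID to each rating.
with_tmdb = ratings.merge(links, on='movieId', how='inner')

# Step 2: join on the metadata ID to pick up the title.
merged = with_tmdb.merge(movies, left_on='tmdbId', right_on='id', how='inner')
merged = merged[['userId', 'movieId', 'rating', 'title']]
print(merged)
```

Whether this applies depends on where the files came from; the outputs in this article were produced with the direct merge shown above.
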
# Number of movies that have at least one rating
len(ratings_df_s['title'].unique())
2794
ratings_df_s['title'].unique()
array(['Rocky III', 'Greed', 'American Pie', ..., 'K-PAX',
       'Starship Troopers', 'Spirited Away'], dtype=object)
ratings_df_s.groupby([ratings_df_s['title'],ratings_df_s['rating']]).count().reset_index()
title rating userId movieId
0 !Women Art Revolution 3.0 1 1
1 !Women Art Revolution 3.5 1 1
2 'Gator Bait 0.5 1 1
3 'Twas the Night Before Christmas 3.5 1 1
4 'Twas the Night Before Christmas 4.5 1 1
... ... ... ... ...
10263 À nos amours 4.0 5 5
10264 À nos amours 4.5 1 1
10265 À nos amours 5.0 1 1
10266 Ödipussi 4.5 1 1
10267 Şaban Oğlu Şaban 4.5 1 1

10268 rows × 4 columns

ratings_df_s.groupby(ratings_df_s['title']).count().reset_index()
title userId movieId rating
0 !Women Art Revolution 2 2 2
1 'Gator Bait 1 1 1
2 'Twas the Night Before Christmas 2 2 2
3 ...And God Created Woman 1 1 1
4 00 Schneider - Jagd auf Nihil Baxter 2 2 2
... ... ... ... ...
2789 xXx 28 28 28
2790 ¡Three Amigos! 1 1 1
2791 À nos amours 14 14 14
2792 Ödipussi 1 1 1
2793 Şaban Oğlu Şaban 1 1 1

2794 rows × 4 columns

ratings_df_s_allcounts = ratings_df_s.groupby(ratings_df_s['title'])['userId'].count().reset_index()
ratings_df_s_allcounts = ratings_df_s_allcounts.rename(columns = {'userId':'totalRatings'})
ratings_df_s_allcounts
title totalRatings
0 !Women Art Revolution 2
1 'Gator Bait 1
2 'Twas the Night Before Christmas 2
3 ...And God Created Woman 1
4 00 Schneider - Jagd auf Nihil Baxter 2
... ... ...
2789 xXx 28
2790 ¡Three Amigos! 1
2791 À nos amours 14
2792 Ödipussi 1
2793 Şaban Oğlu Şaban 1

2794 rows × 2 columns

ratings_df_s_allcounts.shape
(2794, 2)
ratings_df_s_allcounts['totalRatings'].describe()
count    2794.000000
mean       16.102004
std        31.481795
min         1.000000
25%         1.000000
50%         4.000000
75%        15.750000
max       324.000000
Name: totalRatings, dtype: float64
ratings_df_s_allcounts.hist()
array([[<AxesSubplot:title={'center':'totalRatings'}>]], dtype=object)

[Figure: histogram of totalRatings]

ratings_df_s_allcounts['totalRatings'].quantile(np.arange(0.6,1, 0.01))  # quantiles
0.60      7.00
0.61      7.00
0.62      7.00
0.63      8.00
0.64      8.00
0.65      9.00
0.66      9.00
0.67     10.00
0.68     10.00
0.69     11.00
0.70     12.00
0.71     12.00
0.72     13.00
0.73     14.00
0.74     14.00
0.75     15.75
0.76     17.00
0.77     18.00
0.78     19.00
0.79     20.00
0.80     21.00
0.81     22.33
0.82     24.00
0.83     26.00
0.84     27.00
0.85     29.00
0.86     31.00
0.87     34.00
0.88     37.00
0.89     41.77
0.90     45.00
0.91     49.00
0.92     52.56
0.93     59.00
0.94     64.42
0.95     71.00
0.96     83.28
0.97     98.21
0.98    119.14
0.99    168.49
Name: totalRatings, dtype: float64
votes_count_threshold = 20
ratings_df_s_top=ratings_df_s_allcounts.query('totalRatings > @votes_count_threshold').reset_index()
ratings_df_s_top
index title totalRatings
0 18 20,000 Leagues Under the Sea 89
1 19 2001: A Space Odyssey 87
2 24 24 Hour Party People 22
3 26 28 Days Later 26
4 27 28 Weeks Later 47
... ... ... ...
575 2770 Young Adam 34
576 2772 Young Frankenstein 29
577 2774 Young and Innocent 193
578 2781 Zatoichi 61
579 2789 xXx 28

580 rows × 3 columns

ratings_df_s_top.drop(['index'],axis=1,inplace=True)
ratings_df_s_top.head()
title totalRatings
0 20,000 Leagues Under the Sea 89
1 2001: A Space Odyssey 87
2 24 Hour Party People 22
3 28 Days Later 26
4 28 Weeks Later 47
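
The cut-off of 20 sits near the 0.79 quantile in the table above, so roughly the top fifth of titles survive (580 of 2794). Here is a minimal sketch of the same filter on made-up counts, assuming only pandas; the counts frame and its values are hypothetical.

```python
import pandas as pd

# Hypothetical per-title rating counts.
counts = pd.DataFrame({
    'title': ['A', 'B', 'C', 'D', 'E'],
    'totalRatings': [1, 4, 20, 21, 89],
})

votes_count_threshold = 20

# query() can reference a local Python variable with the @ prefix.
# The comparison is strict, so a title with exactly 20 ratings is dropped.
top = counts.query('totalRatings > @votes_count_threshold').reset_index(drop=True)

kept_share = len(top) / len(counts)
print(top)
print(kept_share)
```

Passing drop=True to reset_index avoids creating the extra index column that the article removes in its next step.
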
ratings_df_s['title']
0                Rocky III
1                Rocky III
2                Rocky III
3                Rocky III
4                Rocky III
               ...        
44984       Love Is a Ball
44985              Solaris
44986                K-PAX
44987    Starship Troopers
44988        Spirited Away
Name: title, Length: 44989, dtype: object
ratings_df_s_top['title']
0      20,000 Leagues Under the Sea
1             2001: A Space Odyssey
2              24 Hour Party People
3                     28 Days Later
4                    28 Weeks Later
                   ...             
575                      Young Adam
576              Young Frankenstein
577              Young and Innocent
578                        Zatoichi
579                             xXx
Name: title, Length: 580, dtype: object
ratings_df_s[ratings_df_s['title'].isin(ratings_df_s_top['title'])]  # ratings for movies with more than 20 ratings
userId movieId rating title
0 1 1371 2.5 Rocky III
1 4 1371 4.0 Rocky III
2 7 1371 3.0 Rocky III
3 19 1371 4.0 Rocky III
4 21 1371 3.0 Rocky III
... ... ... ... ...
44507 624 3057 4.0 Frankenstein
44781 547 97936 3.0 Sweet November
44782 624 97936 3.0 Sweet November
44909 609 1450 5.0 Blood: The Last Vampire
44985 653 2103 3.0 Solaris

34552 rows × 4 columns

ratings_df_s[~ratings_df_s['title'].isin(ratings_df_s_top['title'])]  # ratings for movies with 20 or fewer ratings
userId movieId rating title
1714 2 248 3.0 Pocketful of Miracles
1715 36 248 2.0 Pocketful of Miracles
1716 110 248 4.0 Pocketful of Miracles
1717 239 248 4.0 Pocketful of Miracles
1718 242 248 3.0 Pocketful of Miracles
... ... ... ... ...
44983 652 127728 5.0 8:46
44984 652 129009 4.0 Love Is a Ball
44986 659 167 4.0 K-PAX
44987 659 563 3.0 Starship Troopers
44988 665 129 3.0 Spirited Away

10437 rows × 4 columns

ratings_df_s_cntD20 = ratings_df_s[ratings_df_s['title'].isin(ratings_df_s_top['title'])]
ratings_df_s_cntX20 = ratings_df_s[~ratings_df_s['title'].isin(ratings_df_s_top['title'])]
ratings_df_s_cntD20.shape
(34552, 4)
ratings_df_s_cntX20.shape
(10437, 4)
ratings_df_s_cntD20.isna().sum()  # check for missing values
userId     0
movieId    0
rating     0
title      0
dtype: int64
ratings_df_s_cntD20.duplicated(['userId','title']).sum()
140
ratings_df_s_cntD20=ratings_df_s_cntD20.drop_duplicates(['userId','title'])  # keep only one rating record per user per movie
ratings_df_s_cntD20
userId movieId rating title
0 1 1371 2.5 Rocky III
1 4 1371 4.0 Rocky III
2 7 1371 3.0 Rocky III
3 19 1371 4.0 Rocky III
4 21 1371 3.0 Rocky III
... ... ... ... ...
44506 472 3057 3.0 Frankenstein
44507 624 3057 4.0 Frankenstein
44782 624 97936 3.0 Sweet November
44909 609 1450 5.0 Blood: The Last Vampire
44985 653 2103 3.0 Solaris

34412 rows × 4 columns

ratings_df_s_cntD20.duplicated(['userId','title']).sum()
0
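# pivot reshapes the long-format records of a DataFrame into a table, organised as pivot(index=..., columns=..., values=...).
# Older pandas versions also accepted the positional form pivot('index column', 'columns column', 'values column').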
ratings_df_s_cntD20_for_apriori = ratings_df_s_cntD20.pivot(index='userId',columns='title',values='rating')
ratings_df_s_cntD20_for_apriori
title 20,000 Leagues Under the Sea 2001: A Space Odyssey 24 Hour Party People 28 Days Later 28 Weeks Later 300 48 Hrs. 5 Card Stud 7 Virgins 8 Women ... Within the Woods X-Men Origins: Wolverine Y Tu Mamá También Yankee Doodle Dandy Yesterday Young Adam Young Frankenstein Young and Innocent Zatoichi xXx
userId
1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 NaN 3.0 NaN NaN NaN NaN 5.0 NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN 3.0 NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN 3.5 NaN NaN
4 3.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN 5.0 NaN NaN NaN NaN 5.0 NaN NaN NaN
5 NaN NaN NaN NaN NaN NaN 4.0 NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN 3.5 NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
667 NaN NaN NaN NaN NaN NaN 4.0 NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
668 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
669 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
670 NaN NaN NaN NaN NaN NaN 3.0 NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
671 NaN NaN NaN NaN NaN NaN NaN 5.0 NaN NaN ... NaN NaN NaN NaN NaN NaN NaN 4.0 NaN NaN

671 rows × 580 columns
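
The de-duplication before pivot matters because pivot refuses to reshape when an (index, column) pair occurs more than once. A small sketch with made-up ratings, assuming only pandas:

```python
import pandas as pd

# Made-up long-format ratings; user 1 rated 'Solaris' twice.
long_df = pd.DataFrame({
    'userId': [1, 1, 2, 1],
    'title': ['Solaris', 'K-PAX', 'Solaris', 'Solaris'],
    'rating': [3.0, 4.0, 5.0, 2.0],
})

try:
    long_df.pivot(index='userId', columns='title', values='rating')
    pivot_failed = False
except ValueError:
    # pandas raises ValueError when index/column pairs are duplicated.
    pivot_failed = True

# Keep the first rating per (userId, title), then reshape to wide format.
deduped = long_df.drop_duplicates(['userId', 'title'])
wide = deduped.pivot(index='userId', columns='title', values='rating')

print(pivot_failed)
print(wide)
```

drop_duplicates keeps the first occurrence by default, so which rating survives depends on row order; averaging the duplicates would be another reasonable choice.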

ratings_df_s_cntD20_for_apriori= ratings_df_s_cntD20_for_apriori.fillna(0)  # fill missing values with 0
def encode_units(x):  # valid-rating rule: 1 means the user rated the movie, 0 means no rating
    if x <= 0:
        return 0
    if x>0:
        return 1
ratings_df_s_cntD20_for_apriori = ratings_df_s_cntD20_for_apriori.applymap(encode_units)
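
The same 0/1 encoding can also be written as a single vectorised comparison. This is a sketch on a made-up two-user frame, assuming only pandas and NumPy. It may be useful because newer pandas releases deprecate applymap in favour of DataFrame.map, and some mlxtend versions warn when the input is not boolean.

```python
import numpy as np
import pandas as pd

# Made-up ratings matrix with missing entries.
wide = pd.DataFrame(
    {'Solaris': [3.0, np.nan], 'K-PAX': [np.nan, 4.5]},
    index=pd.Index([1, 2], name='userId'),
)

def encode_units(x):
    # Same rule as in the article: any positive rating counts as "rated".
    return 1 if x > 0 else 0

# Element-wise version (a column-by-column map, equivalent to applymap here).
encoded_loop = wide.fillna(0).apply(lambda col: col.map(encode_units))

# Vectorised version: a boolean matrix, True where a rating exists.
encoded_bool = wide.fillna(0) > 0

print(encoded_bool)
```

Both versions mark the same cells; the boolean form simply skips the per-cell Python function call.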

Computing Frequent Itemsets and Association Rules

from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules
ratings_df_s_cntD20_for_apriori.head()
title 20,000 Leagues Under the Sea 2001: A Space Odyssey 24 Hour Party People 28 Days Later 28 Weeks Later 300 48 Hrs. 5 Card Stud 7 Virgins 8 Women ... Within the Woods X-Men Origins: Wolverine Y Tu Mamá También Yankee Doodle Dandy Yesterday Young Adam Young Frankenstein Young and Innocent Zatoichi xXx
userId
1 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2 0 1 0 0 0 0 1 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 1 0 0 0 0 ... 0 0 0 0 0 0 0 1 0 0
4 1 0 0 0 0 0 0 0 0 0 ... 0 1 0 0 0 0 1 0 0 0
5 0 0 0 0 0 0 1 0 0 0 ... 0 0 0 0 0 0 0 1 0 0

5 rows × 580 columns

ratings_df_s_cntD20_for_apriori.isna().sum()  # check for NaN values
title
20,000 Leagues Under the Sea    0
2001: A Space Odyssey           0
24 Hour Party People            0
28 Days Later                   0
28 Weeks Later                  0
                               ..
Young Adam                      0
Young Frankenstein              0
Young and Innocent              0
Zatoichi                        0
xXx                             0
Length: 580, dtype: int64
frequent_itemsets = apriori(ratings_df_s_cntD20_for_apriori, min_support=0.10, use_colnames=True)  # generate the frequent itemsets that meet the minimum support
frequent_itemsets.sort_values('support',ascending=False)  # frequent itemsets sorted by support, descending
support itemsets
111 0.482861 (Terminator 3: Rise of the Machines)
130 0.463487 (The Million Dollar Hotel)
105 0.454545 (Solaris)
113 0.433681 (The 39 Steps)
69 0.408346 (Monsoon Wedding)
... ... ...
1613 0.101341 (Sleepless in Seattle, 5 Card Stud, The Tunnel)
5455 0.101341 (Beauty and the Beast, Rain Man, Terminator 3:...
5454 0.101341 (The Passion of Joan of Arc, Beauty and the Be...
6769 0.101341 (The Million Dollar Hotel, The Hours, Three Co...
3108 0.101341 (The Conversation, Men in Black II, The Millio...

7327 rows × 2 columns
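
To make the level-wise search behind apriori more concrete, here is a pure-Python sketch on five made-up transactions. It is an illustration of the general technique, not mlxtend's implementation; the function name apriori_sketch and the toy data are hypothetical.

```python
from itertools import combinations

def apriori_sketch(transactions, min_support):
    """Return {frozenset(itemset): support} for itemsets meeting min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        # Fraction of transactions that contain every item of the itemset.
        return sum(itemset <= t for t in transactions) / n

    # Level 1: frequent single items.
    items = sorted({item for t in transactions for item in t})
    current = {}
    for item in items:
        s = support(frozenset([item]))
        if s >= min_support:
            current[frozenset([item])] = s
    frequent = dict(current)

    k = 2
    while current:
        # Join step: combine frequent (k-1)-itemsets into size-k candidates.
        candidates = {a | b for a, b in combinations(list(current), 2)
                      if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(sub) in current
                             for sub in combinations(c, k - 1))}
        current = {}
        for c in candidates:
            s = support(c)
            if s >= min_support:
                current[c] = s
        frequent.update(current)
        k += 1
    return frequent

toy = [{'A', 'B', 'C'}, {'A', 'B'}, {'A', 'C'}, {'B', 'C'}, {'A', 'B', 'C'}]
result = apriori_sketch(toy, min_support=0.6)
for itemset, s in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
```

The prune step is what keeps the search tractable: an itemset can only be frequent if all of its subsets are, so candidates with an infrequent subset are discarded before their support is counted.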

rules= association_rules(frequent_itemsets, metric="lift", min_threshold=1)  # generate association rules, keeping only those with lift >= 1
rules
antecedents consequents antecedent support consequent support support confidence lift leverage conviction
0 (5 Card Stud) (48 Hrs.) 0.298063 0.298063 0.108793 0.365000 1.224575 0.019952 1.105413
1 (48 Hrs.) (5 Card Stud) 0.298063 0.298063 0.108793 0.365000 1.224575 0.019952 1.105413
2 (A Clockwork Orange) (48 Hrs.) 0.152012 0.298063 0.102832 0.676471 2.269559 0.057523 2.169625
3 (48 Hrs.) (A Clockwork Orange) 0.298063 0.152012 0.102832 0.345000 2.269559 0.057523 1.294638
4 (48 Hrs.) (A Nightmare on Elm Street) 0.298063 0.268256 0.156483 0.525000 1.957083 0.076526 1.540513
... ... ... ... ... ... ... ... ... ...
75531 (The Hours) (The Million Dollar Hotel, Terminator 3: Rise ... 0.301043 0.126677 0.104322 0.346535 2.735585 0.066187 1.336449
75532 (Terminator 3: Rise of the Machines) (The Million Dollar Hotel, The Hours, Rain Man... 0.482861 0.114754 0.104322 0.216049 1.882716 0.048912 1.129211
75533 (Rain Man) (The Million Dollar Hotel, The Hours, Terminat... 0.295082 0.120715 0.104322 0.353535 2.928669 0.068701 1.360143
75534 (Sissi) (The Million Dollar Hotel, The Hours, Terminat... 0.317437 0.117735 0.104322 0.328638 2.791347 0.066949 1.314143
75535 (Solaris) (The Million Dollar Hotel, The Hours, Terminat... 0.454545 0.113264 0.104322 0.229508 2.026316 0.052838 1.150870

75536 rows × 9 columns
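
The metric columns follow from the three support values. As a worked check, the sketch below recomputes confidence, lift, leverage and conviction for rule 2 in the table, (A Clockwork Orange) -> (48 Hrs.), using the standard definitions of these measures; the inputs are the rounded values printed above, so the results match only up to rounding.

```python
# Values copied from rule 2 in the table above:
# (A Clockwork Orange) -> (48 Hrs.)
antecedent_support = 0.152012
consequent_support = 0.298063
support = 0.102832

# confidence = P(consequent | antecedent)
confidence = support / antecedent_support
# lift = confidence relative to the consequent's baseline frequency
lift = confidence / consequent_support
# leverage = observed joint support minus the support expected under independence
leverage = support - antecedent_support * consequent_support
# conviction = how much more often the rule would fail if the two were independent
conviction = (1 - consequent_support) / (1 - confidence)

print(round(confidence, 4), round(lift, 4), round(leverage, 4), round(conviction, 4))
```

A lift above 1 means the two movies are rated together more often than independence would predict, which is why the rules are later ranked by lift.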

rules.sort_values('lift',ascending=False)
antecedents consequents antecedent support consequent support support confidence lift leverage conviction
1473 (Muxmäuschenstill) (Waiter) 0.156483 0.120715 0.105812 0.676190 5.601529 0.086922 2.715438
1472 (Waiter) (Muxmäuschenstill) 0.120715 0.156483 0.105812 0.876543 5.601529 0.086922 6.832489
38208 (Titanic, Big Fish) (Psycho, Rain Man) 0.150522 0.131148 0.101341 0.673267 5.133663 0.081601 2.659215
38209 (Psycho, Rain Man) (Titanic, Big Fish) 0.131148 0.150522 0.101341 0.772727 5.133663 0.081601 3.737705
38238 (Titanic, Big Fish) (Psycho, Solaris) 0.150522 0.134128 0.102832 0.683168 5.093399 0.082642 2.732908
... ... ... ... ... ... ... ... ... ...
108 (5 Card Stud) (Men in Black II) 0.298063 0.333830 0.110283 0.370000 1.108348 0.010781 1.057413
571 (Bang, Boom, Bang) (The 39 Steps) 0.260805 0.433681 0.125186 0.480000 1.106804 0.012080 1.089075
570 (The 39 Steps) (Bang, Boom, Bang) 0.433681 0.260805 0.125186 0.288660 1.106804 0.012080 1.039159
1137 (Sissi) (License to Wed) 0.317437 0.301043 0.102832 0.323944 1.076070 0.007269 1.033874
1136 (License to Wed) (Sissi) 0.301043 0.317437 0.102832 0.341584 1.076070 0.007269 1.036675

75536 rows × 9 columns

Movie Recommendations

Recommending Lists of Movies

all_antecedents = [list(x) for x in rules['antecedents'].values]
desired_indices = [i for i in range(len(all_antecedents)) if len(all_antecedents[i]) == 1 and all_antecedents[i][0] == 'Batman Returns'] 
apriori_recommendations =rules.iloc[desired_indices,].sort_values(by=['lift'],ascending=False)
apriori_recommendations.head()
antecedents consequents antecedent support consequent support support confidence lift leverage conviction
63981 (Batman Returns) (The Hours, Monsoon Wedding, Silent Hill, Rese... 0.298063 0.107303 0.102832 0.345 3.215208 0.070849 1.362897
36084 (Batman Returns) (Reservoir Dogs, Wag the Dog, Silent Hill) 0.298063 0.105812 0.101341 0.340 3.213239 0.069803 1.354830
63891 (Batman Returns) (Monsoon Wedding, Silent Hill, Reservoir Dogs,... 0.298063 0.107303 0.101341 0.340 3.168611 0.069358 1.352572
63351 (Batman Returns) (Monsoon Wedding, Silent Hill, Reservoir Dogs,... 0.298063 0.107303 0.101341 0.340 3.168611 0.069358 1.352572
36014 (Batman Returns) (The Hours, Reservoir Dogs, Silent Hill) 0.298063 0.116244 0.108793 0.365 3.139936 0.074145 1.391741
apriori_recommendations_list = [list(x) for x in apriori_recommendations['consequents'].values]
print("Apriori Recommendations for movie: Batman Returns\n")
for i in range(5):
    print("{0}:{1} with lift of {2}" .format(i+1, apriori_recommendations_list[i], apriori_recommendations.iloc[i,6]))
Apriori Recommendations for movie: Batman Returns

1:['The Hours', 'Monsoon Wedding', 'Silent Hill', 'Reservoir Dogs'] with lift of 3.215208333333333
2:['Reservoir Dogs', 'Wag the Dog', 'Silent Hill'] with lift of 3.2132394366197183
3:['Monsoon Wedding', 'Silent Hill', 'Reservoir Dogs', 'Sissi'] with lift of 3.168611111111111
4:['Monsoon Wedding', 'Silent Hill', 'Reservoir Dogs', 'Rain Man'] with lift of 3.168611111111111
5:['The Hours', 'Reservoir Dogs', 'Silent Hill'] with lift of 3.139935897435898

Recommending a Single Movie

apriori_single_recommendations = apriori_recommendations.iloc[[x for x in range(len(apriori_recommendations_list)) if len(apriori_recommendations_list[x]) ==1],]
apriori_single_recommendations_list = [list(x) for x in apriori_single_recommendations['consequents'].values]
print("Apriori single-movie Recommendations for movie: Batman Returns\n")
for i in range(5):
    print("{0}: {1}, with lift of {2}".format(i+1,apriori_single_recommendations_list[i][0],apriori_single_recommendations.iloc[i,6]))
Apriori single-movie Recommendations for movie: Batman Returns

1: Reservoir Dogs, with lift of 2.6094444444444447
2: Ariel, with lift of 2.5397663551401872
3: Wag the Dog, with lift of 2.496744186046512
4: To Kill a Mockingbird, with lift of 2.478125
5: Romeo + Juliet, with lift of 2.4705000000000004
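
The two filtering steps above can be folded into one reusable function. This is a sketch under stated assumptions: single_movie_recommendations is a hypothetical helper name, and toy_rules is a made-up frame that only mimics the antecedents, consequents and lift columns of the real rules table.

```python
import pandas as pd

# Made-up rules frame with just the columns the helper needs.
toy_rules = pd.DataFrame({
    'antecedents': [frozenset(['X']), frozenset(['X']), frozenset(['X']), frozenset(['Y'])],
    'consequents': [frozenset(['A']), frozenset(['B', 'C']), frozenset(['D']), frozenset(['A'])],
    'lift': [1.5, 3.0, 2.5, 4.0],
})

def single_movie_recommendations(rules, movie, top_n=5):
    """Hypothetical helper: rules whose antecedent is exactly {movie} and
    whose consequent is a single title, ranked by lift."""
    target = frozenset([movie])
    from_movie = rules['antecedents'].apply(lambda s: s == target)
    one_title = rules['consequents'].apply(lambda s: len(s) == 1)
    picked = rules[from_movie & one_title].sort_values('lift', ascending=False).head(top_n)
    return [(next(iter(c)), l) for c, l in zip(picked['consequents'], picked['lift'])]

recs = single_movie_recommendations(toy_rules, 'X')
print(recs)
```

Using head(top_n) instead of a fixed range(5) loop means the function returns a shorter list, rather than raising an IndexError, when fewer than five matching rules exist.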

Collaborative Filtering

User-Based Collaborative Filtering

# Read ratings_small.csv for modelling
ratings_small_path = "./movie_dataset/ratings_small.csv"
ratings_small_df = pd.read_csv(ratings_small_path)
ratings_small_df.shape
(100004, 4)
ratings_small_df.head()
userId movieId rating timestamp
0 1 31 2.5 1260759144
1 1 1029 3.0 1260759179
2 1 1061 3.0 1260759182
3 1 1129 2.0 1260759185
4 1 1172 4.0 1260759205
# The original movieId values are not contiguous integers starting from 0; to make the user-item matrix easier to build, re-index them as movieid
movie_id = ratings_small_df['movieId'].drop_duplicates()
movie_id = pd.DataFrame(movie_id)
movie_id['movieid'] = range(len(movie_id))
movie_id
movieId movieid
0 31 0
1 1029 1
2 1061 2
3 1129 3
4 1172 4
... ... ...
99131 64997 9061
99159 72380 9062
99274 129 9063
99678 4736 9064
99820 6425 9065

9066 rows × 2 columns
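The re-indexing step maps sparse IDs to dense column positions. A minimal alternative sketch using only NumPy is shown below, on hypothetical IDs. Note one difference from the notebook: `np.unique` numbers the IDs in sorted order, whereas the `drop_duplicates` approach above numbers them in order of first appearance (pandas' `pd.factorize` also follows first-appearance order).

```python
import numpy as np

# Hypothetical raw IDs, sparse and repeated, standing in for the movieId column.
raw_movie_ids = np.array([31, 1029, 1061, 31, 1172, 1029])

# unique_ids is sorted; dense_ids[i] is the position of raw_movie_ids[i] in unique_ids.
unique_ids, dense_ids = np.unique(raw_movie_ids, return_inverse=True)

print(unique_ids)
print(dense_ids)
```

Either numbering works for building the matrix, as long as the same mapping is kept to translate column positions back to the original movieId.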

ratings_small_df = pd.merge(ratings_small_df, movie_id, on =['movieId'], how='left')
ratings_small_df
userId movieId rating timestamp movieid
0 1 31 2.5 1260759144 0
1 1 1029 3.0 1260759179 1
2 1 1061 3.0 1260759182 2
3 1 1129 2.0 1260759185 3
4 1 1172 4.0 1260759205 4
... ... ... ... ... ...
99999 671 6268 2.5 1065579370 7005
100000 671 6269 4.0 1065149201 4771
100001 671 6365 4.0 1070940363 1329
100002 671 6385 2.5 1070979663 1331
100003 671 6565 3.5 1074784724 2946

100004 rows × 5 columns

ratings_small_df = ratings_small_df[['userId','movieid','rating','timestamp']]  # keep the re-indexed movieid in place of the original movieId
ratings_small_df
userId movieid rating timestamp
0 1 0 2.5 1260759144
1 1 1 3.0 1260759179
2 1 2 3.0 1260759182
3 1 3 2.0 1260759185
4 1 4 4.0 1260759205
... ... ... ... ...
99999 671 7005 2.5 1065579370
100000 671 4771 4.0 1065149201
100001 671 1329 4.0 1070940363
100002 671 1331 2.5 1070979663
100003 671 2946 3.5 1074784724

100004 rows × 4 columns

# Count users and items
# unique() returns all distinct values of a column as an array (numpy.ndarray)
# nunique() returns the number of distinct values in the object

n_users = ratings_small_df.userId.nunique()
n_users
671
n_items = ratings_small_df.movieid.nunique()
n_items 
9066
# Split the dataset
from sklearn.model_selection import train_test_split
# Split the data into 70% training and 30% test
train_data,test_data = train_test_split(ratings_small_df,test_size= 0.3)
train_data
userId movieid rating timestamp
69526 481 329 4.0 1437001087
41670 299 917 3.5 1344188856
49260 358 288 2.0 957480147
39317 287 3582 4.0 1470168974
35991 262 2094 3.0 1433899624
... ... ... ... ...
6262 33 1095 2.0 1032769543
8504 56 367 2.0 1467005360
8540 56 1435 4.0 1467006577
77937 542 1496 1.0 1424966216
94226 624 476 3.0 1053249671

70002 rows × 4 columns

# User-item matrix for the training set
user_item_matrix = np.zeros((n_users,n_items))
user_item_matrix.shape
(671, 9066)
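# iterrows(): iterate over the DataFrame as (index, Series) pairs
# iteritems(): iterate over the DataFrame as (column name, Series) pairs (removed in pandas 2.0; use items())
# itertuples(): iterate over the DataFrame as tuples
for line in train_data.itertuples():
    # line[0] is the row index; line[1] is userId (starts at 1), line[2] is movieid, line[3] is rating
    user_item_matrix[line[1]-1,line[2]]=line[3]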
user_item_matrix
array([[0., 3., 3., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])
user_item_matrix.shape
(671, 9066)
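The loop above fills the matrix one rating at a time. As a sketch of a vectorized alternative, NumPy fancy indexing can write all cells in one assignment. The arrays below are hypothetical stand-ins for the `userId`, `movieid` and `rating` columns of `train_data`.

```python
import numpy as np

# Hypothetical ratings: userId starts at 1, the dense movie index starts at 0.
user_ids = np.array([1, 1, 2, 3])
movie_idx = np.array([0, 2, 1, 2])
ratings = np.array([2.5, 3.0, 4.0, 1.0])

n_users, n_items = 3, 3
matrix = np.zeros((n_users, n_items))

# Fancy indexing writes every (row, column) pair at once; userId - 1 gives a 0-based row.
matrix[user_ids - 1, movie_idx] = ratings

print(matrix)
```

This assumes each (user, movie) pair appears at most once; with duplicates, the last value written wins, as it would in the loop.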
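# Build the user-user matrix using cosine distance
from sklearn.metrics.pairwise import pairwise_distances
# Note: metric='cosine' returns cosine distance (1 - cosine similarity), so the diagonal is 0 and smaller values mean more similar users
user_similarity_m = pairwise_distances(user_item_matrix,metric='cosine')  # each user is one row, so no transpose is needed here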

a=[[1,3],[2,2]]
a
[[1, 3], [2, 2]]
pairwise_distances(a,metric='euclidean')
array([[0.        , 1.41421356],
       [1.41421356, 0.        ]])
b = np.array([[1,2],[1,3],[2,1]])
b
array([[1, 2],
       [1, 3],
       [2, 1]])
pairwise_distances(b,metric='euclidean') # row 0, column 1 of the result is the distance between b[0] and b[1]
array([[0.        , 1.        , 1.41421356],
       [1.        , 0.        , 2.23606798],
       [1.41421356, 2.23606798, 0.        ]])
pairwise_distances(b,metric='cosine')
array([[0.        , 0.01005051, 0.2       ],
       [0.01005051, 0.        , 0.29289322],
       [0.2       , 0.29289322, 0.        ]])
b.shape
(3, 2)
b[1]
array([1, 3])
b[0]
array([1, 2])
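To make the cosine output on `b` concrete, the sketch below reproduces it with plain NumPy: normalize each row to unit length, take dot products to get cosine similarity, then subtract from 1 to get the distance. The variable names `unit`, `cosine_similarity` and `cosine_distance` are illustrative only.

```python
import numpy as np

b = np.array([[1, 2], [1, 3], [2, 1]], dtype=float)

# Scale every row to unit length so that a dot product equals the cosine of the angle.
norms = np.linalg.norm(b, axis=1, keepdims=True)
unit = b / norms

cosine_similarity = unit @ unit.T
cosine_distance = 1.0 - cosine_similarity  # what metric='cosine' reports

print(cosine_distance.round(8))
```

The off-diagonal entries agree with the `pairwise_distances(b,metric='cosine')` result shown above, which confirms that the function returns a distance rather than a similarity.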
user_similarity_m.shape
(671, 671)
user_similarity_m[0:5,0:5].round(2)
array([[0.  , 1.  , 1.  , 0.94, 0.97],
       [1.  , 0.  , 0.89, 0.93, 0.92],
       [1.  , 0.89, 0.  , 0.93, 0.93],
       [0.94, 0.93, 0.93, 0.  , 0.94],
       [0.97, 0.92, 0.93, 0.94, 0.  ]])
user_similarity_m_triu = np.triu(user_similarity_m,k=1) #取得上三角数据
np.round(user_similarity_m_triu[user_similarity_m_triu.nonzero()],3)
array([1.   , 1.   , 0.938, ..., 0.934, 0.919, 0.814])
user_sim_nonzero = np.round(user_similarity_m_triu[user_similarity_m_triu.nonzero()],3)
np.percentile(user_sim_nonzero,np.arange(0,101,10))
array([0.316, 0.844, 0.885, 0.911, 0.93 , 0.947, 0.961, 0.976, 1.   ,
       1.   , 1.   ])
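One caveat with selecting the upper triangle through `nonzero()`: any pair of users whose distance is exactly 0 is dropped along with the masked lower triangle. A small sketch of an alternative, using `np.triu_indices` on a hypothetical 3x3 distance matrix, keeps every pair:

```python
import numpy as np

# Hypothetical symmetric distance matrix; users 1 and 2 are at distance exactly 0.
d = np.array([
    [0.0, 0.2, 0.5],
    [0.2, 0.0, 0.0],
    [0.5, 0.0, 0.0],
])

# Index pairs (i, j) with i < j, i.e. the strict upper triangle.
rows, cols = np.triu_indices(d.shape[0], k=1)
pair_values = d[rows, cols]

print(pair_values)
```

Whether this changes the percentiles in practice depends on how many exact zeros the real matrix contains.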

Prediction on the training set

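# Note: this mean is taken over all 9066 columns, unrated zeros included, which is why the values are far below the 1-5 rating scale
mean_user_rating = user_item_matrix.mean(axis=1)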
mean_user_rating
array([0.00297816, 0.0198544 , 0.01301566, 0.06265167, 0.03027796,
       0.01196779, 0.02404589, 0.03805427, 0.01114053, 0.0147805 ,
       0.01047871, 0.01301566, 0.01615928, 0.004743  , 0.33984116,
       0.01069932, 0.0991617 , 0.0147805 , 0.11780278, 0.02294286,
       0.0443415 , 0.04936025, 0.21111846, 0.00683874, 0.00617692,
       0.04031546, 0.00816236, 0.01555261, 0.00363997, 0.29643724,
       0.02625193, 0.01080962, 0.03684094, 0.05702625, 0.0025921 ,
       0.03000221, 0.01147143, 0.03838518, 0.0196338 , 0.01301566,
       0.0592323 , 0.0196338 , 0.02172954, 0.00694904, 0.00512905,
       0.01312597, 0.01069932, 0.14212442, 0.02371498, 0.01169204,
       0.01069932, 0.01941319, 0.00893448, 0.01384293, 0.00838297,
       0.14615045, 0.06254136, 0.01753805, 0.02090227, 0.02018531,
       0.04197   , 0.0172623 , 0.02454225, 0.00739025, 0.00694904,
       0.01544231, 0.02856828, 0.03331127, 0.02327377, 0.02856828,
       0.00794176, 0.05035297, 0.42096845, 0.01544231, 0.0394882 ,
       0.00573572, 0.07346128, 0.08234061, 0.00921024, 0.0100375 ,
       0.05283477, 0.01235385, 0.04555482, 0.03424884, 0.0247077 ,
       0.05084933, 0.00650783, 0.06281712, 0.02448709, 0.01577322,
       0.04671299, 0.02779616, 0.0444518 , 0.0497463 , 0.08769027,
       0.02288771, 0.03160159, 0.02332892, 0.0497463 , 0.00661813,
       0.0173726 , 0.1978822 , 0.02437679, 0.02360468, 0.13942202,
       0.01411869, 0.00656298, 0.00783146, 0.00628723, 0.04015001,
       0.09541143, 0.00650783, 0.00959629, 0.00827267, 0.01351202,
       0.00667328, 0.01389808, 0.05592323, 0.17185087, 0.03833002,
       0.02503861, 0.00882418, 0.00937569, 0.02415619, 0.0666777 ,
       0.01808957, 0.00573572, 0.09695566, 0.00595632, 0.09254357,
       0.01433929, 0.0297816 , 0.03518641, 0.10065078, 0.00529451,
       0.01637988, 0.02018531, 0.02024046, 0.02090227, 0.01213325,
       0.00816236, 0.01103022, 0.02365983, 0.01158173, 0.01455989,
       0.02134348, 0.01312597, 0.04081182, 0.06121774, 0.09557688,
       0.01654533, 0.05664019, 0.01698654, 0.00926539, 0.01279506,
       0.01604897, 0.08697331, 0.00562541, 0.04070152, 0.03132583,
       0.02702405, 0.00915508, 0.02007501, 0.02421134, 0.10870285,
       0.01422899, 0.00761085, 0.02790646, 0.03623428, 0.00590117,
       0.01588352, 0.0051842 , 0.01169204, 0.00639753, 0.03551732,
       0.05834988, 0.07059343, 0.03259431, 0.01080962, 0.00452239,
       0.01108537, 0.04301787, 0.01433929, 0.01125083, 0.05625414,
       0.0099272 , 0.1014229 , 0.02867858, 0.03320097, 0.02150893,
       0.00739025, 0.01687624, 0.01886168, 0.01367748, 0.10533863,
       0.02702405, 0.02051621, 0.01979925, 0.12221487, 0.05846018,
       0.03910214, 0.01831017, 0.01086477, 0.00871388, 0.05542687,
       0.00948599, 0.00672844, 0.01604897, 0.00452239, 0.00683874,
       0.01610413, 0.2176263 , 0.18409442, 0.06254136, 0.01753805,
       0.02465255, 0.03430399, 0.01433929, 0.03855063, 0.07842488,
       0.00501875, 0.02658284, 0.01158173, 0.02625193, 0.00827267,
       0.00921024, 0.00750055, 0.02349437, 0.00987205, 0.03347673,
       0.00937569, 0.20416942, 0.00871388, 0.03160159, 0.04858813,
       0.06507831, 0.01075447, 0.02432164, 0.0843812 , 0.07246856,
       0.0147805 , 0.14328259, 0.08151335, 0.02195014, 0.04268696,
       0.00739025, 0.07677035, 0.03595853, 0.0051842 , 0.05581293,
       0.03628943, 0.01147143, 0.05840503, 0.03739246, 0.04252151,
       0.00650783, 0.02768586, 0.01025811, 0.00926539, 0.01235385,
       0.01047871, 0.13710567, 0.03000221, 0.01091992, 0.054379  ,
       0.00959629, 0.01158173, 0.10324289, 0.00628723, 0.06176925,
       0.02029561, 0.01158173, 0.0296713 , 0.01114053, 0.06585043,
       0.00479815, 0.01808957, 0.0147805 , 0.01136113, 0.00838297,
       0.02509376, 0.03524156, 0.05333113, 0.01433929, 0.10302228,
       0.00915508, 0.0893448 , 0.02029561, 0.0049636 , 0.02090227,
       0.02553497, 0.08018972, 0.02217075, 0.25672844, 0.06849768,
       0.00634238, 0.03662034, 0.02647253, 0.11763733, 0.01389808,
       0.00551511, 0.00750055, 0.06243106, 0.03309067, 0.00595632,
       0.16997573, 0.02029561, 0.0148908 , 0.04594088, 0.00468784,
       0.23830796, 0.07290977, 0.08112729, 0.01169204, 0.01246415,
       0.03524156, 0.00573572, 0.01588352, 0.00595632, 0.01571807,
       0.02283256, 0.01323627, 0.00700419, 0.04180454, 0.00446724,
       0.00783146, 0.02073682, 0.04649239, 0.00584602, 0.02680344,
       0.00689389, 0.00816236, 0.02503861, 0.01086477, 0.007666  ,
       0.00816236, 0.00330907, 0.01323627, 0.02950585, 0.01384293,
       0.00595632, 0.05868079, 0.01114053, 0.04500331, 0.0619347 ,
       0.09055813, 0.00650783, 0.01621443, 0.00639753, 0.0495257 ,
       0.01378778, 0.02443194, 0.1039047 , 0.01544231, 0.09039268,
       0.00419148, 0.00948599, 0.15243768, 0.01483565, 0.0098169 ,
       0.01533201, 0.03071917, 0.05404809, 0.00909993, 0.0224465 ,
       0.0097066 , 0.05217295, 0.00628723, 0.01345687, 0.03055372,
       0.0446724 , 0.00849327, 0.06165895, 0.00838297, 0.00705934,
       0.01808957, 0.00645268, 0.03750276, 0.01990955, 0.28375248,
       0.02945069, 0.07654975, 0.01544231, 0.11973307, 0.03132583,
       0.02691374, 0.09276417, 0.22865652, 0.01246415, 0.03430399,
       0.02923009, 0.00617692, 0.0125193 , 0.04511361, 0.00683874,
       0.03540702, 0.01632473, 0.01544231, 0.00595632, 0.01676594,
       0.024818  , 0.09303993, 0.00783146, 0.0098169 , 0.11675491,
       0.0270792 , 0.10699316, 0.05978381, 0.01566292, 0.00799691,
       0.00882418, 0.05129054, 0.00650783, 0.01698654, 0.00893448,
       0.02724465, 0.04114273, 0.0494154 , 0.01643503, 0.02823737,
       0.0101478 , 0.0296713 , 0.09458416, 0.00799691, 0.01588352,
       0.06507831, 0.09458416, 0.04560997, 0.00457754, 0.09618354,
       0.09303993, 0.02013016, 0.06221046, 0.05382749, 0.00606662,
       0.02161924, 0.00683874, 0.00612177, 0.05779837, 0.01367748,
       0.03568277, 0.07572248, 0.01775866, 0.00441209, 0.00540481,
       0.00904478, 0.01808957, 0.00639753, 0.00871388, 0.03943305,
       0.01599382, 0.33085153, 0.02294286, 0.0101478 , 0.00821752,
       0.01660049, 0.14179351, 0.02272226, 0.00705934, 0.08283697,
       0.15784249, 0.0121884 , 0.13335539, 0.01058901, 0.01119568,
       0.0593426 , 0.02095742, 0.30228326, 0.0048533 , 0.01869623,
       0.0569711 , 0.24652548, 0.02614163, 0.01301566, 0.14284139,
       0.01114053, 0.00490845, 0.02774101, 0.03132583, 0.1185749 ,
       0.1435032 , 0.01819987, 0.03259431, 0.00573572, 0.004743  ,
       0.0398191 , 0.04037062, 0.01781381, 0.00672844, 0.0051842 ,
       0.01875138, 0.01941319, 0.02923009, 0.02415619, 0.00617692,
       0.03309067, 0.03419369, 0.0048533 , 0.01235385, 0.05741231,
       0.05658504, 0.03353188, 0.01334657, 0.004743  , 0.09927201,
       0.0051842 , 0.01125083, 0.01334657, 0.2351092 , 0.04367968,
       0.00948599, 0.00921024, 0.00584602, 0.1037944 , 0.00876903,
       0.03805427, 0.01411869, 0.19170527, 0.05619899, 0.03987426,
       0.01384293, 0.06083168, 0.04003971, 0.01968895, 0.03992941,
       0.00777631, 0.03171189, 0.03325612, 0.16804544, 0.02062652,
       0.03298037, 0.01384293, 0.0394882 , 0.08030002, 0.01378778,
       0.03011251, 0.10070593, 0.00739025, 0.01058901, 0.00551511,
       0.00683874, 0.01704169, 0.01544231, 0.09265387, 0.02713435,
       0.02178469, 0.63484447, 0.03562762, 0.00623208, 0.03353188,
       0.02360468, 0.00783146, 0.06358923, 0.01511141, 0.01831017,
       0.00959629, 0.01329142, 0.07224796, 0.04378998, 0.03253916,
       0.07798368, 0.07026252, 0.04616148, 0.52404589, 0.00871388,
       0.00777631, 0.01147143, 0.01180234, 0.02283256, 0.03634458,
       0.01577322, 0.02950585, 0.0101478 , 0.09022722, 0.14284139,
       0.01125083, 0.0917163 , 0.00805206, 0.00209574, 0.22887712,
       0.00595632, 0.03502096, 0.00821752, 0.06072138, 0.09728657,
       0.0150011 , 0.15938672, 0.01400838, 0.01047871, 0.02228105,
       0.00849327, 0.03904699, 0.02128833, 0.02514891, 0.05118023,
       0.14399956, 0.06243106, 0.07842488, 0.05757776, 0.01119568,
       0.01268476, 0.03926759, 0.03617913, 0.00330907, 0.11096404,
       0.0196338 , 0.12618575, 0.08879329, 0.02283256, 0.01913744,
       0.01080962, 0.01742775, 0.01560777, 0.02889918, 0.10225017,
       0.01069932, 0.01764836, 0.0100375 , 0.01257445, 0.04086698,
       0.02614163, 0.01185749, 0.03105008, 0.39383411, 0.02079197,
       0.04290757, 0.04500331, 0.0223362 , 0.00959629, 0.0075557 ,
       0.00937569, 0.01185749, 0.00772116, 0.00534966, 0.00750055,
       0.00739025, 0.00976175, 0.004743  , 0.01455989, 0.01191264,
       0.04059122, 0.01169204, 0.00490845, 0.01125083, 0.007666  ,
       0.05834988, 0.05162144, 0.07715641, 0.0245974 , 0.00827267,
       0.00595632, 0.08509817, 0.01753805, 0.20257004, 0.03353188,
       0.0445621 , 0.00419148, 0.01952349, 0.03827487, 0.02950585,
       0.00843812, 0.01742775, 0.00871388, 0.15927642, 0.1088683 ,
       0.00816236, 0.01687624, 0.00739025, 0.0098169 , 0.00716964,
       0.0347452 ])
rating_diff = (user_item_matrix - mean_user_rating[:,np.newaxis])   # np.newaxis adds a dimension to mean_user_rating so the subtraction broadcasts row by row
rating_diff
array([[-2.97816016e-03,  2.99702184e+00,  2.99702184e+00, ...,
        -2.97816016e-03, -2.97816016e-03, -2.97816016e-03],
       [-1.98544011e-02, -1.98544011e-02, -1.98544011e-02, ...,
        -1.98544011e-02, -1.98544011e-02, -1.98544011e-02],
       [-1.30156629e-02, -1.30156629e-02, -1.30156629e-02, ...,
        -1.30156629e-02, -1.30156629e-02, -1.30156629e-02],
       ...,
       [-9.81689830e-03, -9.81689830e-03, -9.81689830e-03, ...,
        -9.81689830e-03, -9.81689830e-03, -9.81689830e-03],
       [-7.16964483e-03, -7.16964483e-03, -7.16964483e-03, ...,
        -7.16964483e-03, -7.16964483e-03, -7.16964483e-03],
       [-3.47452019e-02, -3.47452019e-02, -3.47452019e-02, ...,
        -3.47452019e-02, -3.47452019e-02, -3.47452019e-02]])
user_prediction = mean_user_rating[:,np.newaxis] + user_similarity_m.dot(rating_diff) / np.array([np.abs(user_similarity_m).sum(axis=1)]).T
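# Dividing by np.array([np.abs(user_similarity_m).sum(axis=1)]).T normalizes the weighted sum by each user's total weight; the intent is to keep predictions on the 1-5 rating scale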
user_prediction
array([[ 8.48587738e-02,  1.11549860e-01,  7.78496257e-02, ...,
        -3.30873704e-02, -3.59785123e-02, -3.59132569e-02],
       [ 9.36489784e-02,  1.35396758e-01,  1.04357090e-01, ...,
        -1.62815182e-02, -1.93136443e-02, -1.93247190e-02],
       [ 9.44428457e-02,  1.33314515e-01,  9.83052575e-02, ...,
        -2.28228892e-02, -2.58037344e-02, -2.59258365e-02],
       ...,
       [ 9.29750987e-02,  1.27902780e-01,  9.32275326e-02, ...,
        -2.60694824e-02, -2.89101875e-02, -2.87905826e-02],
       [ 8.62056229e-02,  1.26697599e-01,  9.17810994e-02, ...,
        -2.88942031e-02, -3.19119828e-02, -3.20590645e-02],
       [ 1.17342284e-01,  1.50739909e-01,  1.17908253e-01, ...,
        -7.69495365e-05, -2.99819315e-03, -3.02101562e-03]])
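The predictions above come out well below the 1-5 scale. Two likely contributors are that the user means include unrated zeros and that the weights are cosine distances rather than similarities. The sketch below shows one possible variant of the same formula, mean plus similarity-weighted deviations divided by the sum of absolute weights, on a hypothetical 3x4 matrix. It is an illustration under those assumptions, not the original author's method; the names `R`, `rated`, `user_mean`, `sim` and `pred` are made up here.

```python
import numpy as np

# Hypothetical 3-user x 4-item rating matrix; 0 means "not rated".
R = np.array([
    [5.0, 4.0, 0.0, 1.0],
    [4.0, 5.0, 1.0, 0.0],
    [1.0, 0.0, 5.0, 4.0],
])
rated = R > 0

# Mean over rated items only, so unrated zeros do not drag the mean down.
user_mean = R.sum(axis=1) / rated.sum(axis=1)

# Centered ratings; unrated cells contribute 0 to the weighted sum.
diff = np.where(rated, R - user_mean[:, np.newaxis], 0.0)

# Cosine similarity (1 - cosine distance), so similar users get larger weights.
unit = R / np.linalg.norm(R, axis=1, keepdims=True)
sim = unit @ unit.T

# Same shape of formula as in the notebook: mean + weighted deviations / total weight.
pred = user_mean[:, np.newaxis] + sim.dot(diff) / np.abs(sim).sum(axis=1, keepdims=True)

print(pred.round(2))
```

In this toy case user 0 resembles user 1, who disliked item 2, so the prediction for user 0 on item 2 falls below user 0's own mean.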
# Evaluate only on the entries that actually have a rating
from sklearn.metrics import mean_squared_error
from math import sqrt
prediction_flatten = user_prediction[user_item_matrix.nonzero()]
prediction_flatten
array([0.11154986, 0.07784963, 0.14877094, ..., 0.04236321, 0.01114962,
       0.02448394])
user_item_matrix_flatten = user_item_matrix[user_item_matrix.nonzero()]
user_item_matrix_flatten
array([3., 3., 2., ..., 4., 4., 4.])
error_test = sqrt(mean_squared_error(prediction_flatten,user_item_matrix_flatten)) # root mean squared error (computed here on the training ratings)
error_test
3.390138302832629
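The error above is measured on the same ratings used to build the model. A minimal sketch of scoring held-out ratings is shown below; `rmse_on_rated` is a hypothetical helper, and the two small matrices stand in for a test matrix (which could be built from `test_data` in the same way as `user_item_matrix`) and for `user_prediction`.

```python
import numpy as np

def rmse_on_rated(prediction, truth):
    """RMSE over the cells where `truth` holds a rating (non-zero)."""
    rows, cols = truth.nonzero()
    errors = prediction[rows, cols] - truth[rows, cols]
    return float(np.sqrt(np.mean(errors ** 2)))

# Hypothetical held-out matrix and predictions, same shape as each other.
test_matrix = np.array([[0.0, 4.0], [3.0, 0.0]])
prediction = np.array([[2.0, 3.0], [5.0, 1.0]])

test_rmse = rmse_on_rated(prediction, test_matrix)
print(test_rmse)
```

Only the two rated cells enter the error here; predictions for unrated cells are ignored, matching the `nonzero()` filtering used in the notebook.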

Source: https://www.cnblogs.com/oceaneyes-gzy/p/16462988.html