This project aims to build a recommendation system based on Matrix Factorization. The dataset contains 3 columns (User_ID, Item_ID, and rating) and over 16,000 rows, covering around 800 items and 600 users. As the final output, the rating of each user-item pair in the test dataset will be predicted, and the top 5 preferred items for the user with ID 18 (the 156-th record in the test dataset) will be provided.
To begin with, I would like to perform some preliminary analyses to get an overview of the patterns and characteristics of the data, which will benefit the model development.
The dataset contains the following columns:
User ID
Item ID
Rating (0 to 10)
Basic descriptions of the dataset:
Data type: integer
Total number of users: 600
Let’s visualize the data for user 9 and item 786 to see whether there is any noticeable pattern. The following are the histograms of their ratings.
Insight:
It is noticeable that user 9 (user_mode) rated most of the items 0 and a few items 1. More interestingly, all ratings of item 786 (item_mode) are 0. To investigate whether other users and items exhibit this extreme pattern, I check the variance of the ratings received by each item and of the ratings given by each user in the training set.
Insight:
Through the histograms, we can observe that the rating variance is small for most users and most items, i.e. the ratings given by a single user, or received by a single item, are tightly concentrated around their means.
Thus, the user mean and item mean should be useful in making predictions.
Apart from the variance, I would like to have a look at the distribution of ratings to see whether users show any tendency in how they rate.
Insight:
In this plot, we can see that the distribution is heavily skewed towards the low end of the scale, with 0 being by far the most frequent rating.
Based on this, we can conclude that users tend to give low ratings, especially 0.
The popularity of an item and the activeness of a user may also be taken into consideration, as they provide useful information about the dataset. Usually, a high-quality item easily attracts many users, so the item receives more ratings and these ratings are often quite high. On the other hand, since an active user has tried many items before, a new item is less likely to impress him or her, and thus the ratings given by such a user may tend to fall into a narrow range (e.g. around 3 to 5) rather than spread over the whole scale (0 to 10).
To analyze this, I plot the average rating of each item against the number of ratings it received, and the average rating of each user against the number of ratings he or she gave.
Insight:
The item popularity scatter plot shows that the average rating increases as the number of ratings an item receives increases (i.e. as the item becomes more popular). On the other hand, the user activeness scatter plot shows that as the number of ratings given by a user increases, the average rating seems to converge to a narrow range (roughly 3 to 5).
More importantly, just as we observed in Visualization 3, there are quite a number of users who gave a rating of 0 to every item they rated, and plenty of items whose ratings are all 0. Thus, special treatment may be needed to deal with these patterns during model development.
Visualizations 1 and 4 suggest the following findings, which may be useful in developing the model:
The user mean and item mean carry useful information, since the ratings of most users and items have low variance.
The popularity of an item (the number of ratings it receives) is related to its average rating, and some users and items have all-zero ratings.
Two main models are used in this section:
SVD
SVD++
As mentioned in Section 1, the item mean and user mean are useful in explaining the data, so a residual-based method is used to incorporate the item and user means into our SVD or SVD++ model.
Remark: In this project, the Surprise package (a scikit-style Python library for building recommender systems) is used to implement the SVD and SVD++ models.
To begin with, let’s consider the SVD algorithm below.
The predicted rating is given as: $$ \widehat{r}_{ui} = \mu + {b}_u + {b}_i + \mathbf{p}_u^T \mathbf{q}_i, $$
where $(b_u, \mathbf{p}_u)$ $(u=1, \cdots, n)$ are the bias and latent factors of user $u$, $(b_i, \mathbf{q}_i)$ $(i=1,\cdots,m)$ are the bias and latent factors of item $i$, and $\mathbf{p}_u$ and $\mathbf{q}_i$ are vectors of length $k$.
If user ${u}$ is unknown, then the bias ${b}_u$ and the factors ${p}_u$ are assumed to be zero. The same applies for item $i$ with ${b}_i$ and ${q}_i$.
To estimate all the unknowns, we minimize the following regularized squared error: $$ \sum_{(u,i) \in \Omega} ( r_{ui} - \mu - {b}_u - {b}_i - \mathbf{p}_u^T \mathbf{q}_i )^2 + \lambda \big( {b}_u^2 + {b}_i^2 + \|\mathbf{p}_u\|_2^2 + \|\mathbf{q}_i\|_2^2 \big). $$
The minimization is performed by a very straightforward stochastic gradient descent: $$ \begin{aligned} b_u &\leftarrow b_u + \gamma (e_{ui} - \lambda b_u),\\ b_i &\leftarrow b_i + \gamma (e_{ui} - \lambda b_i),\\ \mathbf{p}_u &\leftarrow \mathbf{p}_u + \gamma (e_{ui} \cdot \mathbf{q}_i - \lambda \mathbf{p}_u),\\ \mathbf{q}_i &\leftarrow \mathbf{q}_i + \gamma (e_{ui} \cdot \mathbf{p}_u - \lambda \mathbf{q}_i), \end{aligned} $$
where $e_{ui} = r_{ui} - \widehat{r}_{ui}$.
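To make the update rule concrete, below is a minimal NumPy sketch of one SGD pass over the observed ratings. The names (mu, bu, bi, P, Q) and the single shared learning rate and regularization constant are my own simplification for illustration, not the Surprise implementation itself.
import numpy as np
def sgd_epoch(train_pair, train_rating, mu, bu, bi, P, Q, lr=0.005, reg=0.02):
    # One pass of SGD over all observed (user, item, rating) triples.
    # mu: global mean; bu, bi: bias vectors; P, Q: factor matrices of shape (n_user, k) and (n_item, k).
    for (u, i), r_ui in zip(train_pair, train_rating):
        e_ui = r_ui - (mu + bu[u] + bi[i] + P[u] @ Q[i])   # prediction error e_ui
        bu[u] += lr * (e_ui - reg * bu[u])                  # update user bias
        bi[i] += lr * (e_ui - reg * bi[i])                  # update item bias
        p_old = P[u].copy()
        P[u] += lr * (e_ui * Q[i] - reg * P[u])             # update user factors
        Q[i] += lr * (e_ui * p_old - reg * Q[i])            # update item factors
    return bu, bi, P, Q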
The hyperparameters of this model include:
the length of $\mathbf{p}_u$ and $\mathbf{q}_i$, ${k}$ ($n\_factors$)
the learning rate $\gamma$ for ${b}_u$ ($lr\_bu$), for ${b}_i$ ($lr\_bi$), for ${p}_u$ ($lr\_pu$), and for ${q}_i$ ($lr\_qi$)
the regularization term $\lambda$ for ${b}_u$ ($reg\_bu$), for ${b}_i$ ($reg\_bi$), for ${p}_u$ ($reg\_pu$), and for ${q}_i$ ($reg\_qi$)
the number of iterations of the SGD procedure ($n\_epochs$)
Hyperparameters tuning -- Grid Search
To find the optimal hyperparameters for the SVD model, I use GridSearchCV() from the Surprise package to select the best parameters by cross-validation.
The parameter tuning begins with a wide and sparse grid, which is then narrowed down to the local region around the best parameters. If a parameter ends up at the boundary of the grid after an iteration, I enlarge the grid for that parameter to explore more possibilities. The tuning ends when a set of parameters that gives an acceptably low valid rmse is found.
First, the parameter tuning begins with the following grid.
param_grid = {
"n_factors": [3, 5, 10, 40, 70, 100],
"n_epochs": [200],
"lr_all": [0.0001, 0.001, 0.005, 0.01, 0.1],
"reg_all": [0.0001, 0.001, 0.1, 0.5, 1, 2],
"reg_bu": [0],
"reg_bi": [0],
}
After several rounds of tuning, I finalized the parameter set as follows.
Final parameter set, valid rmse, and score in Leaderboard
param_grid = {
"n_factors": [3],
"n_epochs": [200],
"lr_all": [0.005],
"reg_all": [0.1],
"reg_bu": [0],
"reg_bi": [0],
}
The valid rmse for this attempt is 0.6839 and the score in Public Leaderboard is 0.66779.
Issue of GridSearchCV
As the SVD model involves many hyperparameters, it is difficult and time-consuming to tune them to their optimal values manually with grid search.
Improvement
Because of this issue, an automated method becomes necessary for the optimization step. Thus, I would like to improve the hyperparameter tuning step by using a Python library called Auto-Surprise to find an optimal parameter set.
Hyperparameters tuning -- Auto Surprise
Auto-Surprise is a Python library that automates algorithm selection and hyperparameter optimization in a highly parallelized manner. It is built on top of the Surprise library for recommender algorithms and Hyperopt for hyperparameter tuning.
Currently, three search algorithms from Hyperopt are available:
Tree of Parzen Estimators (TPE)
Adaptive TPE (ATPE)
Random Search
I have tried all three algorithms to do the automated hyperparameter optimization. The valid rmse values are as follows.
| | Tree of Parzen Estimators (TPE) | Adaptive TPE (ATPE) | Random Search |
| --- | --- | --- | --- |
| max_evals = 100 | 0.7033 | 0.6740 | 0.7296 |
Remark: The maximum number of evaluations each algorithm gets for hyperparameter optimization, `max_evals`, is set to 100.
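For reference, the three search strategies differ only in the `hpo_algo` argument passed to the Auto-Surprise engine. The sketch below mirrors the training call in the Codes section; the `hpo_algo` values are the corresponding Hyperopt suggest functions.
import hyperopt
from auto_surprise.engine import Engine
# Initialize the Auto-Surprise engine for the SVD algorithm only
engine = Engine(verbose=True, algorithms=['svd'])
# The three Hyperopt search strategies compared in the table above
hpo_algos = {
    'TPE': hyperopt.tpe.suggest,      # Tree of Parzen Estimators
    'ATPE': hyperopt.atpe.suggest,    # Adaptive TPE
    'Random': hyperopt.rand.suggest,  # Random Search
}
best_algo, best_params, best_score, tasks = engine.train(
    data=data,                   # Surprise Dataset built from the training dataframe
    target_metric='test_rmse',
    cpu_time_limit=60*60*2,
    max_evals=100,
    hpo_algo=hpo_algos['ATPE'],  # switch to 'TPE' or 'Random' for the other columns
)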
Final parameter set, valid rmse, and score in Leaderboard
From the above table, only the parameter set generated by the ATPE algorithm gives a smaller valid rmse than that of Model 1.1, so I use this parameter set as the final set for Model 1.2.
best_params = {'lr_bi': 0.03566094677967572,
'lr_bu': 0.028341147137220187,
'lr_pu': 0.049421104542630157,
'lr_qi': 0.024081217368581088,
'n_epochs': 106,
'n_factors': 1,
'reg_bi': 0.009582782844206644,
'reg_bu': 0.00011494409326701549,
'reg_pu': 0.02323855631873504,
'reg_qi': 0.047742294740985826
}
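This parameter set can be unpacked directly into Surprise's SVD to fit on the full training set and predict the test pairs. A minimal sketch, assuming trainset and test_pair are the objects defined in the Codes section:
import numpy as np
from surprise import SVD
# Fit SVD with the parameter set returned by Auto-Surprise
svd = SVD(**best_params)
svd.fit(trainset)
# Predict the rating of every user-item pair in the test set
predicted_rating = np.array([svd.predict(uid, iid).est for uid, iid in test_pair])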
The valid rmse for this attempt is 0.6740 and the score in Public Leaderboard is 0.67215.
Issue of this model
The Leaderboard score is worse than that of Model 1.1 (0.67215 vs. 0.66779). I guess the reason is that the `max_evals = 100` setting is not large enough to search for the optimal parameter set.
Improvement
Because of the above issue, I would like to enlarge `max_evals` in the following model. Also, from Model 1.2, the performance of Random Search (rmse = 0.7296) is much worse than that of TPE (rmse = 0.7033) and ATPE (rmse = 0.6740). Therefore, I decided to use only TPE and ATPE to automate the hyperparameter optimization in the next model.
Hyperparameters tuning -- Auto Surprise (Update the setting of `max_evals`)
I would like to first enlarge `max_evals` from 100 to 500 and see whether the valid rmse will decrease. The valid rmse values for `max_evals = 500` are as follows.
| | Tree of Parzen Estimators (TPE) | Adaptive TPE (ATPE) |
| --- | --- | --- |
| max_evals = 100 | 0.7033 | 0.6740 |
| max_evals = 500 | 0.6639 | 0.6629 |
In this table, we can see that the valid rmse decreases as the maximum number of evaluations increases. To be precise, the rmse of the TPE and ATPE algorithms decreased to 0.6639 and 0.6629 respectively, which is a good improvement over the results of Model 1.1 and Model 1.2.
Final parameter set, valid rmse, and score in Leaderboard
The following are the final parameter sets for the TPE and ATPE algorithms.
# Parameter set for TPE
best_params = {'lr_bi': 0.024300743378841992,
'lr_bu': 0.0011815948581443635,
'lr_pu':0.012482710076706212,
'lr_qi': 0.00823562527513479,
'n_epochs': 143,
'n_factors': 1,
'reg_bi':0.0010038894084862182,
'reg_bu': 0.0003472980363023683,
'reg_pu': 0.0004453249677069114,
'reg_qi':0.0003207232594835713
}
# Parameter set for ATPE
best_params = {'lr_bi': 0.019090101711133622,
'lr_bu': 0.015683258397985133,
'lr_pu': 0.025778789373718986,
'lr_qi': 0.002495939257442814,
'n_epochs': 198,
'n_factors': 1,
'reg_bi': 0.00022237810533239877,
'reg_bu': 0.0003240391997232606,
'reg_pu': 0.005122792017321656,
'reg_qi': 0.09904016788429931
}
The table below shows the valid rmse and Leaderboard score for TPE and ATPE.
| | Tree of Parzen Estimators (TPE) | Adaptive TPE (ATPE) |
| --- | --- | --- |
| Valid rmse | 0.6639 | 0.6629 |
| Leaderboard score | 0.65762 | 0.65717 |
Apart from a decrease in the rmse value, we can also observe that there is a big improvement in terms of the leaderboard score, compared to 0.66779 in Model 1.1 and 0.67215 in Model 1.2.
Improvement
The decreasing trend in rmse implies that a larger `max_evals` setting can improve the performance of the automated hyperparameter optimization and help it find a better parameter set. Therefore, I would like to further enlarge the `max_evals` setting in the next model.
On the other hand, I observed that ATPE consistently generates a parameter set with a lower rmse than TPE. To reduce the search time, I decided to use only ATPE for the automated hyperparameter optimization in the next model.
Hyperparameters tuning -- Auto Surprise (Further update the setting of `max_evals`)
I would like to enlarge `max_evals` from 500 to 700, 1000, 1300, and 1500 and see whether the valid rmse decreases further. The valid rmse values for the different `max_evals` settings are as follows.
| | Adaptive TPE (ATPE) |
| --- | --- |
| max_evals = 100 | 0.6740 |
| max_evals = 500 | 0.6629 |
| max_evals = 700 | 0.6618 |
| max_evals = 1000 | 0.6583 |
| max_evals = 1300 | 0.6562 |
| max_evals = 1500 | 0.6569 |
In this table, we can see that the valid rmse continues to decrease as the maximum number of evaluations increases, reaching its minimum of 0.6562 at `max_evals = 1300` before rising slightly at 1500.
Final parameter set, valid rmse, and score in Leaderboard
As mentioned above, the valid rmse is smallest when `max_evals = 1300`, so I use the corresponding parameter set as the final set, shown below.
best_params = {'lr_bi': 0.023726067831278404,
'lr_bu': 0.00019672517575026568,
'lr_pu': 0.004848241381931481,
'lr_qi': 0.003689737817264781,
'n_epochs': 162,
'n_factors': 1,
'reg_bi': 0.00033922470306219375,
'reg_bu': 0.0789915548366538,
'reg_pu': 0.008997208546887707,
'reg_qi': 0.004304762657110084
}
The valid rmse is 0.6562 and the score in Public Leaderboard is 0.64861.
The table below compares the valid rmse and Leaderboard score for Model 1.1 to Model 1.4.
| | Valid rmse | Leaderboard score |
| --- | --- | --- |
| Model 1.1 (GridSearch) | 0.6839 | 0.66779 |
| Model 1.2 (Auto-Surprise, ATPE, max_evals = 100) | 0.6740 | 0.67215 |
| Model 1.3 (Auto-Surprise, ATPE, max_evals = 500) | 0.6629 | 0.65717 |
| Model 1.4 (Auto-Surprise, ATPE, max_evals = 1300) | 0.6562 | 0.64861 |
From the table, we can see that Model 1.4 gives a substantial improvement in Leaderboard performance.
Issue of this model
Although we have found a high-quality parameter set in this model, Model 1.4 is still not the best model to explain the dataset.
Recall part of the summary of the EDA: the item popularity and the item and user means carry useful information about the ratings. However, Model 1.4 does not take item popularity into account and does not explicitly incorporate the item and user means.
Improvement
Therefore, I would like to use another model that takes item popularity into account: the SVD++ model.
Consider a more complex model, SVD++, which takes implicit feedback (and hence item popularity) into account. The predicted rating is given as:
$$ \widehat{r}_{ui} = \mu + b_u + b_i + \mathbf{q}^T_i \big(\mathbf{p}_u+\frac{1}{\sqrt{|\mathcal{I}_u|}}\sum_{j\in\mathcal{I}_u}\mathbf{w}_j\big), $$ where $(b_u, \mathbf{p}_u)$ $(u=1, \cdots, n)$ are the bias and latent factors of user $u$, $(b_i, \mathbf{q}_i)$ $(i=1,\cdots,m)$ are the bias and latent factors of item $i$, $\mathbf{p}_u$ and $\mathbf{q}_i$ are vectors of length $k$, $\mathcal{I}_u$ is the set of items rated by user $u$, and the $\mathbf{w}_j$ terms are a new set of item factors that capture implicit ratings. (An implicit rating describes the fact that a user $u$ rated an item $j$, regardless of the rating value.)
Just as SVD model, if user ${u}$ is unknown, then the bias ${b}_u$ and the factors ${p}_u$ are assumed to be zero. The same applies for item $i$ with ${b}_i$ and ${q}_i$.
To estimate all the unknowns, we minimize the following regularized squared error: $$ \sum_{(u,i) \in \Omega} \Big( r_{ui} - \mu - b_u - b_i - \mathbf{q}^T_i\big(\mathbf{p}_u+\frac{1}{\sqrt{|\mathcal{I}_u|}}\sum_{j\in\mathcal{I}_u}\mathbf{w}_j\big) \Big)^2 + \lambda \big( {b}_u^2 + {b}_i^2 + \|\mathbf{p}_u\|_2^2 + \|\mathbf{q}_i\|_2^2 + \sum_{j \in \mathcal{I}_u} \|\mathbf{w}_j\|_2^2\big). $$
The minimization is performed by a very straightforward stochastic gradient descent: $$ \begin{aligned} b_u &\leftarrow b_u + \gamma (e_{ui} - \lambda b_u),\\ b_i &\leftarrow b_i + \gamma (e_{ui} - \lambda b_i),\\ \mathbf{p}_u &\leftarrow \mathbf{p}_u + \gamma (e_{ui} \cdot \mathbf{q}_i - \lambda \mathbf{p}_u),\\ \mathbf{q}_i &\leftarrow \mathbf{q}_i + \gamma \Big(e_{ui} \cdot \big(\mathbf{p}_u +\frac{1}{\sqrt{|\mathcal{I}_u|}}\sum_{j\in\mathcal{I}_u}\mathbf{w}_j\big) - \lambda \mathbf{q}_i\Big),\\ \mathbf{w}_j &\leftarrow \mathbf{w}_j + \gamma \Big(e_{ui} \cdot \frac{1}{\sqrt{|\mathcal{I}_u|}}\,\mathbf{q}_i - \lambda \mathbf{w}_j\Big), \end{aligned} $$
where $e_{ui} = r_{ui} - \widehat{r}_{ui}$.
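To illustrate how the implicit term enters the prediction, here is a small NumPy sketch of the SVD++ prediction rule. The array names (mu, bu, bi, P, Q, W, items_of_user) are my own notation for illustration, not Surprise's internals.
import numpy as np
def svdpp_predict(u, i, mu, bu, bi, P, Q, W, items_of_user):
    # Predicted rating r_hat_ui for user u and item i under SVD++.
    # items_of_user[u] is the list of items rated by user u (the implicit feedback set I_u).
    I_u = items_of_user[u]
    # Implicit-feedback term: (1 / sqrt(|I_u|)) * sum of w_j over the rated items
    implicit = W[I_u].sum(axis=0) / np.sqrt(len(I_u)) if len(I_u) > 0 else 0.0
    return mu + bu[u] + bi[i] + Q[i] @ (P[u] + implicit)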
The hyperparameters of this model include:
the length of $\mathbf{p}_u$ and $\mathbf{q}_i$, ${k}$ ($n\_factors$)
the learning rate $\gamma$ for ${b}_u$ ($lr\_bu$), for ${b}_i$ ($lr\_bi$), for ${p}_u$ ($lr\_pu$), for ${q}_i$ ($lr\_qi$), and for ${w}_j$ ($lr\_yj$)
the regularization term $\lambda$ for ${b}_u$ ($reg\_bu$), for ${b}_i$ ($reg\_bi$), for ${p}_u$ ($reg\_pu$), for ${q}_i$ ($reg\_qi$) and for ${w}_j$ ($reg\_yj$)
the number of iterations of the SGD procedure ($n\_epochs$)
Hyperparameters tuning -- Auto Surprise
Just as in Model 1.1 to Model 1.4, I use the Auto-Surprise library to search for optimal hyperparameters automatically. As mentioned before, only the ATPE algorithm is applied.
I first set `max_evals` to 100 and 200 respectively and obtained the following valid rmse results.
| | Adaptive TPE (ATPE) |
| --- | --- |
| max_evals = 100 | 0.6741 |
| max_evals = 200 | 0.6650 |
Final parameter set, valid rmse, and score in Leaderboard
From the above table, we already get an acceptable valid rmse, 0.6650, with `max_evals = 200`. However, running 200 evaluations of hyperparameter optimization for an SVD++ model takes around 3 hours. Therefore, I decided to stop at 200 evaluations and use the corresponding parameter set as the final set for Model 2.1.
best_params = {'lr_bi': 0.04904279833282683,
'lr_bu': 0.003052099513601416,
'lr_pu': 0.0008506282110052645,
'lr_qi': 0.07069184202596947,
'lr_yj': None,
'n_epochs': 134,
'n_factors': 1,
'reg_bi': 0.002813741909570969,
'reg_bu': 0.01234489749886016,
'reg_pu': 0.011836209632271517,
'reg_qi': 0.0035975818910033456,
'reg_yj': None
}
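A note on applying this parameter set: since lr_yj and reg_yj came back as None, they should be dropped (so SVDpp falls back to its defaults for them) before unpacking the dictionary. A minimal sketch, assuming trainset and best_params as above:
from surprise import SVDpp
# Drop the None-valued entries so SVDpp uses its default lr_yj / reg_yj
params = {k: v for k, v in best_params.items() if v is not None}
svdpp = SVDpp(**params)
svdpp.fit(trainset)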
The valid rmse for this attempt is 0.6650 and the score in Public Leaderboard is 0.65777.
From this result, we can observe that the SVD++ model performs quite well in predicting ratings. For example, with `max_evals` increased only to 200, we already obtain a score of 0.65777, which is similar to the scores of Model 1.3 with `max_evals = 500` (0.65762 and 0.65717).
Issue of this model
The big issue with the SVD++ model is its running time. As mentioned above, the SVD++ model takes too long for automated hyperparameter optimization; if I increased `max_evals` to 1000, I believe it would take more than 12 hours to search for the parameter set. Therefore, I will not use the SVD++ model as the foundation of my final model. Instead, I will use the SVD model.
The EDA in Section 1 implies that the user mean and item mean are useful in explaining the training data. Therefore, in the final model, I use a residual-based approach to incorporate the item mean and user mean.
Overview of the final model:
Apply the SVD model with the parameter set from Model 1.4 to the training data and compute the training residuals.
Apply the item mean to the training residuals and compute the residuals again.
Apply the user mean to the residuals generated in step 2.
The final predicted rating is the sum of the following 3 components: the SVD prediction, the item-mean correction of the residuals, and the user-mean correction of the remaining residuals. A sketch of this pipeline is given after this list.
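The following is a minimal sketch of this residual-based combination (my own illustration, not the exact project code). It assumes the objects from the Codes section (svd fitted with the Model 1.4 parameters, train_pair, train_rating, test_pair) and falls back to a zero correction for users or items unseen in training.
import numpy as np
# Step 1: SVD predictions and residuals on the training data
train_pred = np.array([svd.predict(u, i).est for u, i in train_pair])
resid = train_rating - train_pred
# Step 2: item-mean correction fitted on the residuals
item_mean = {i: resid[train_pair[:, 1] == i].mean() for i in set(train_pair[:, 1])}
resid2 = resid - np.array([item_mean[i] for i in train_pair[:, 1]])
# Step 3: user-mean correction fitted on the remaining residuals
user_mean = {u: resid2[train_pair[:, 0] == u].mean() for u in set(train_pair[:, 0])}
# Final prediction = SVD prediction + item-mean correction + user-mean correction
final_pred = np.array([
    svd.predict(u, i).est + item_mean.get(i, 0.0) + user_mean.get(u, 0.0)
    for u, i in test_pair
])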
Score in Leaderboard
After incorporating item mean and user mean, the score improved from 0.64861 (Model 1.4) to 0.64718, which is a big improvement.
Further improvement of the final model
First, through the EDA, I found that some users gave a rating of 0 to every item they rated and some items received a rating of 0 from every user who rated them. Thus, I apply a special treatment to these user_id and item_id values: whenever a test pair involves such a user or item, the predicted rating is set directly to 0.
Second, I observed that after applying the SVD model (Model 1.4) together with the item mean and user mean in the residual-based method, some predicted ratings become small negative numbers. As the minimum rating is 0, I clip those negative predictions to 0. A sketch of both adjustments is given below.
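A minimal sketch of these two adjustments, continuing from the previous sketch (final_pred, train_pair, train_rating, test_pair); the all-zero detection logic is my own illustration.
import numpy as np
# Users / items whose training ratings are all 0
zero_users = {u for u in set(train_pair[:, 0])
              if np.all(train_rating[train_pair[:, 0] == u] == 0)}
zero_items = {i for i in set(train_pair[:, 1])
              if np.all(train_rating[train_pair[:, 1] == i] == 0)}
# Special treatment: force the prediction to 0 for these users/items
for idx, (u, i) in enumerate(test_pair):
    if u in zero_users or i in zero_items:
        final_pred[idx] = 0.0
# Clip small negative predictions up to the minimum rating (0)
final_pred = np.clip(final_pred, 0, None)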
Score in Leaderboard after improvement
After making the improvements stated above, the score in Leaderboard further improved to 0.64263.
Confirmation of parameter set
In Model 1.4, I chose the parameter set generated under `max_evals = 1300` as it has the lowest valid rmse. However, the valid rmse values are also quite low when `max_evals = 1000` or `1500`. Thus, I would also like to check the Leaderboard score when the parameter sets generated under these two settings are used in the final model.
| | Valid rmse | Leaderboard score |
| --- | --- | --- |
| Final Model (max_evals = 1000) | 0.6583 | 0.64100 |
| Final Model (max_evals = 1300) | 0.6562 | 0.64263 |
| Final Model (max_evals = 1500) | 0.6569 | 0.64232 |
Interestingly, the parameter sets generated under `max_evals = 1000` and `1500` perform better in terms of the Leaderboard score. I think the parameter set used in Model 1.4 may slightly overfit the training data compared with the other two. Therefore, I prefer to use the parameter set generated under `max_evals = 1000` as the final parameter set in the final model.
Conclusion of the Final Model
The final parameter set is generated by the ATPE algorithm with `max_evals = 1000` and is shown below:
# max_evals = 1000
best_params = {'lr_bi': 0.058334702767909,
'lr_bu': 0.00015136679318740118,
'lr_pu': 0.026987499312161187,
'lr_qi': 0.0037913998145982074,
'n_epochs': 90,
'n_factors': 1,
'reg_bi': 0.00047465760297261343,
'reg_bu': 0.01994131998600225,
'reg_pu': 0.0004814169070326491,
'reg_qi': 0.015573943840555361
}
My SID is 1155142156, so the 156-th record in the test dataset (the user with ID 18) is printed.
The top-5 preferred items are 629, 665, 551, 283, and 440.
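For completeness, here is a small sketch of how such a top-5 list could be produced with the final model; it is my own illustration, scoring every item id for user 18 with the fitted SVD plus the item-mean and user-mean corrections from the earlier sketch and ranking the results.
import numpy as np
user_id = 18
candidate_items = np.arange(n_item)  # all item ids appearing in the data
# Score every candidate item with the final model
scores = np.array([
    svd.predict(user_id, i).est + item_mean.get(i, 0.0) + user_mean.get(user_id, 0.0)
    for i in candidate_items
])
# Top-5 preferred items for user 18, highest predicted rating first
top5 = candidate_items[np.argsort(-scores)[:5]]
print(top5)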
Codes for Section 1: Exploratory data analysis (EDA)
# Load and pre-process the dataset
import numpy as np
import pandas as pd
## Read in dataset
dtrain = pd.read_csv('train.csv')
dtest = pd.read_csv('test.csv')
train_rating = dtrain['rating'].values
train_pair = dtrain[['user_id', 'item_id']].values
test_pair = dtest[['user_id', 'item_id']].values
## store total number of users and items
n_user = max( max(train_pair[:,0]), max(test_pair[:,0]) ) + 1
n_item = max( max(train_pair[:,1]), max(test_pair[:,1]) ) + 1
## store user/item index set
index_item = [np.where(train_pair[:,1] == i)[0] for i in range(n_item)]
index_user = [np.where(train_pair[:,0] == u)[0] for u in range(n_user)]
## Rating's upper bound and lower bound
rating_upper_bound=max(train_rating)
rating_lower_bound=min(train_rating)
# Print basic description of the dataset
# Data type
print('Data type for user_id and item_id:', train_pair.dtype)
print('Data type for rating:', train_rating.dtype)
# Number of users
print('Total number of users:', n_user)
# Number of items
print('Total number of items:', n_item)
from statistics import mode
# Most rated user id
user_mode = mode(train_pair[:,0])
print("Most rated user's id:", user_mode)
# Most popular item id
item_mode = mode(train_pair[:,1])
print("Most popular item's id:",item_mode)
# Histogram of ratings of user_mode
import matplotlib.pyplot as plt
plt.hist(train_rating[index_user[user_mode]], edgecolor='white', align='left',
bins=np.arange(start=rating_lower_bound, stop=rating_upper_bound+2, step=1))
plt.xlabel("Rating")
plt.ylabel("Frequency")
plt.title('Histogram of Ratings given by user_mode (user_id 9)')
plt.savefig('EDA_plot_1.png')
plt.close()  # start a fresh figure for the next plot
# Histogram of ratings of item_mode
plt.hist(train_rating[index_item[item_mode]], edgecolor='white', align='left',
bins=np.arange(start=rating_lower_bound, stop=rating_upper_bound+2, step=1))
plt.xlabel("Rating")
plt.ylabel("Frequency")
plt.title('Histogram of Ratings received by item_mode (item_id 786)')
plt.savefig('EDA_plot_2.png')
plt.close()  # start a fresh figure for the next plot
# Check the variance of ratings given by each user in record
var_user = [np.var(train_rating[index_user[u]]) for u in set(train_pair[:,0])]
plt.hist(var_user, edgecolor='white')
plt.xlabel("Variance")
plt.ylabel("Count")
plt.title('Variance of ratings given by users in train data')
plt.savefig('EDA_plot_3.png')
plt.close()  # start a fresh figure for the next plot
# Check the variance of ratings of each item in record
var_item = [np.var(train_rating[index_item[i]]) for i in set(train_pair[:,1])]
plt.hist(var_item, edgecolor='white')
plt.xlabel("Variance")
plt.ylabel("Count")
plt.title('Variance of ratings of items in train data')
plt.savefig('EDA_plot_4.png')
plt.close()  # start a fresh figure for the next plot
# Distribution of ratings
plt.hist(train_rating,edgecolor='white', align='left',
bins=np.arange(start=rating_lower_bound, stop=rating_upper_bound+2, step=1))
plt.xlabel("Rating")
plt.ylabel("Frequency")
plt.title('Distribution of Ratings')
plt.savefig('EDA_plot_5.png')
plt.close()  # start a fresh figure for the next plot
# Item popularity
# Item mean against number of ratings received by item
item_set = list(set(train_pair[:,1]))
n_item_train = len(item_set)
aver_ratings = np.ones(n_item_train)
n_ratings = np.ones(n_item_train)
for i in range(n_item_train):
item_index_temp = index_item[item_set[i]]
aver_ratings[i]=np.mean(train_rating[item_index_temp])
n_ratings[i]=len(item_index_temp)
plt.scatter(n_ratings, aver_ratings)
plt.xlabel("Number of ratings")
plt.ylabel("Average rating")
plt.title('Average rating against item popularity')
plt.savefig('EDA_plot_6.png')
plt.close()  # start a fresh figure for the next plot
# User activeness
# User mean against number of ratings given by user
user_set = list(set(train_pair[:,0]))
n_user_train = len(user_set)
aver_ratings = np.ones(n_user_train)  # one entry per user in the training set
n_ratings = np.ones(n_user_train)
for u in range(n_user_train):
user_index_temp = index_user[user_set[u]]
aver_ratings[u] = np.mean(train_rating[user_index_temp])
n_ratings[u] = len(user_index_temp)
plt.scatter(n_ratings, aver_ratings)
plt.xlabel("Number of ratings")
plt.ylabel("Average rating")
plt.title('Average rating against user activeness')
plt.savefig('EDA_plot_7.png')
plt.close()
Codes for Model 1
Codes for the ALS model from the lecture -- used for reference only, to get a sense of the parameter grid for Model 1.1
class user_item_MF(object):
def __init__(self, n_user, n_item, lam=.001, K=10, iterNum=50, tol=1e-4, verbose=1, user_mean=True, item_mean=True):
self.P = np.random.randn(n_user, K)
self.Q = np.random.randn(n_item, K)
self.n_user = n_user
self.n_item = n_item
self.user_mean=user_mean
self.item_mean=item_mean
if(user_mean):
self.mu=np.random.randn(n_user)
else:
self.mu=np.zeros(n_user)
if(item_mean):
self.rho=np.random.randn(n_item)
else:
self.rho=np.zeros(n_item)
self.lam = lam
self.K = K
self.iterNum = iterNum
self.tol = tol
self.verbose = verbose
def fit(self,train_pair,train_rating):
diff, tol=0, self.tol
iterNum=self.iterNum
n_item=self.n_item
n_user=self.n_user
K=self.K
lam=self.lam
index_item = [np.where(train_pair[:,1] == i)[0] for i in range(n_item)]
index_user = [np.where(train_pair[:,0] == u)[0] for u in range(n_user)]
if(self.verbose==1):
print('Fitting User-Item-MF: K: %d, lam: %.5f' %(K, lam))
for iter in range(iterNum):
obj_old=self.obj(test_pair=train_pair,test_rating=train_rating)
#udpate mu
if(self.user_mean):
for u in range(n_user):
ind_user_temp=index_user[u]
if(len(ind_user_temp)==0):
self.mu[u]=0
continue
temp= train_rating[ind_user_temp] - self.rho[train_pair[ind_user_temp,1]] - np.dot(self.Q[train_pair[ind_user_temp,1],:],self.P[u])
self.mu[u]=np.mean(temp)
#update rho
if(self.item_mean):
for i in range(n_item):
ind_item_temp=index_item[i]
if(len(ind_item_temp)==0):
self.rho[i]=0
continue
temp= train_rating[ind_item_temp] - self.mu[train_pair[ind_item_temp,0]] - np.dot(self.P[train_pair[ind_item_temp,0],:],self.Q[i])
self.rho[i]=np.mean(temp)
#update P
for u in range(n_user):
ind_user_temp=index_user[u]
sum_matrix_qi=np.zeros((K,K))
sum_qi=np.zeros(K)
if(len(ind_user_temp)==0):
self.P[u,:]=np.zeros(K)
continue
for record in ind_user_temp:
item_temp=train_pair[record,1]
rating_temp=train_rating[record]
sum_matrix_qi=sum_matrix_qi+np.outer(self.Q[item_temp,:],self.Q[item_temp,:])
sum_qi=sum_qi+(rating_temp-self.mu[u]-self.rho[item_temp])*self.Q[item_temp,:]
self.P[u,:]=np.linalg.inv( sum_matrix_qi + lam*len(train_rating)*np.identity(K)) @ sum_qi
#update Q
for i in range(n_item):
ind_item_temp=index_item[i]
sum_matrix_pu=np.zeros((K,K))
sum_pu=np.zeros(K)
if(len(ind_item_temp)==0):
self.Q[i,:]=np.zeros(K)
continue
for record in ind_item_temp:
user_temp=train_pair[record,0]
rating_temp=train_rating[record]
sum_matrix_pu=sum_matrix_pu+np.outer(self.P[user_temp,:],self.P[user_temp,:])
sum_pu=sum_pu+(rating_temp-self.mu[user_temp]-self.rho[i])*self.P[user_temp,:]
self.Q[i,:]=np.linalg.inv( sum_matrix_pu + lam*len(train_rating)*np.identity(K)) @ sum_pu
obj_new=self.obj(test_pair=train_pair,test_rating=train_rating)
diff=abs(obj_old-obj_new)
if self.verbose:
print("User-Item-MF: ite: %d; diff: %.3f Obj: %.3f" %(iter, diff, obj_new))
if(diff < tol):
break
def obj(self, test_pair, test_rating):
return self.rmse(test_pair=test_pair, test_rating=test_rating)**2 + self.lam*( np.sum((self.P)**2) + np.sum((self.Q)**2))
def rmse(self, test_pair, test_rating):
pred=self.predict(test_pair=test_pair)
return(np.sqrt(np.mean((test_rating-pred)**2)))
def predict(self,test_pair):
pred_rating=[self.mu[record[0]]+self.rho[record[1]]+np.dot(self.P[record[0],:],self.Q[record[1],:]) for record in test_pair]
return(np.array(pred_rating))
from sklearn.model_selection import KFold
import itertools
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
class MF_CV(object):
def __init__(self, n_user, n_item, cv=5,
lams=[.000001,.0001,.001,.01], # lams and Ks are vectors (our grid)
Ks=[3,5,10,20],
iterNum=10, tol=1e-4):
# self.index_item = []
# self.index_user = []
self.n_user = n_user
self.n_item = n_item
self.cv = cv
self.lams = lams
self.Ks = Ks
self.iterNum = iterNum
self.tol = tol
self.best_model = {}
self.cv_result = {'K': [], 'lam': [], 'train_rmse': [], 'valid_rmse': []} # our result
def grid_search(self, train_pair, train_rating):
## generate all comb of `K` and `lam`
kf = KFold(n_splits=self.cv, shuffle=True)
for (K,lam) in itertools.product(self.Ks, self.lams):
train_rmse_tmp, valid_rmse_tmp = 0., 0.
for train_index, valid_index in kf.split(train_pair):
# produce training/validation sets
train_pair_cv, train_rating_cv = train_pair[train_index], train_rating[train_index]
valid_pair_cv, valid_rating_cv = train_pair[valid_index], train_rating[valid_index]
# fit the model based on CV data
model_tmp = user_item_MF(self.n_user, self.n_item, K=K, lam=lam, verbose=0)
model_tmp.fit(train_pair=train_pair_cv, train_rating=train_rating_cv)
## valid model in the validation set
train_rmse_tmp_cv = model_tmp.rmse(train_pair_cv, train_rating_cv)
valid_rmse_tmp_cv = model_tmp.rmse(valid_pair_cv, valid_rating_cv)
train_rmse_tmp += train_rmse_tmp_cv/self.cv
valid_rmse_tmp += valid_rmse_tmp_cv/self.cv
print('%d-Fold CV for K: %d; lam: %.5f: train_rmse: %.3f, valid_rmse: %.3f'
%(self.cv, K, lam, train_rmse_tmp_cv, valid_rmse_tmp_cv))
self.cv_result['K'].append(K)
self.cv_result['lam'].append(lam)
self.cv_result['train_rmse'].append(train_rmse_tmp)
self.cv_result['valid_rmse'].append(valid_rmse_tmp)
self.cv_result = pd.DataFrame.from_dict(self.cv_result)
best_ind = self.cv_result['valid_rmse'].argmin()
self.best_model = self.cv_result.loc[best_ind]
def plot_grid(self, data_source='valid'):
sns.set_theme()
if data_source == 'train':
cv_pivot = self.cv_result.pivot("K", "lam", "train_rmse")
elif data_source == 'valid':
cv_pivot = self.cv_result.pivot("K", "lam", "valid_rmse")
else:
raise ValueError('data_source must be train or valid!')
sns.heatmap(cv_pivot, annot=True, fmt=".3f", linewidths=.5, cmap="YlGnBu")
plt.show()
Ks, lams = [3, 5, 10], 10**np.arange(-5, -1, .5)
MF_cv = MF_CV(n_user, n_item, cv=5, Ks=Ks, lams=lams)
MF_cv.grid_search(train_pair, train_rating)
MF_cv.plot_grid()
Insight: the optimal K and λ are both small.
# Import libraries
from surprise.model_selection import GridSearchCV
from surprise.dataset import DatasetAutoFolds
from surprise import Dataset
from surprise import Reader
from surprise import SVD
reader = Reader(rating_scale=(rating_lower_bound, rating_upper_bound))
data = Dataset.load_from_df(dtrain[['user_id', 'item_id', 'rating']], reader)
trainset = data.build_full_trainset()
# Grid search begins with the following parameter grid
param_grid = {"n_factors": [3, 5, 10, 40, 70, 100],
"biased": ['True'],
"n_epochs": [200],
"lr_all": [0.0001, 0.001, 0.005, 0.01, 0.1],
"reg_all": [0.0001, 0.001, 0.1, 0.5, 1, 2],
"reg_bu": [0],
"reg_bi": [0]
}
gs = GridSearchCV(SVD, param_grid, cv=5,n_jobs =-1, joblib_verbose =2)
gs.fit(data)
print(gs.best_score["rmse"])
print(gs.best_params["rmse"])
param_grid = {"n_factors": [3], 'biased': ['True'],"n_epochs": [200],
"lr_all": [0.003, 0.005, 0.007],
"reg_all": [0.01, 0.1, 0.3],
"reg_bu": [0], "reg_bi": [0]
}
gs = GridSearchCV(SVD, param_grid, cv=5,n_jobs =-1, joblib_verbose =2)
gs.fit(data)
print(gs.best_score["rmse"])
print(gs.best_params["rmse"])
param_grid = {"n_factors": [3], 'biased': ['True'],"n_epochs": [200],
"lr_all": [0.004, 0.005, 0.006],
"reg_all": [0.08, 0.1, 0.15],
"reg_bu": [0], "reg_bi": [0]
}
gs = GridSearchCV(SVD, param_grid, cv=5,n_jobs =-1, joblib_verbose =2)
gs.fit(data)
print(gs.best_score["rmse"])
print(gs.best_params["rmse"])
param_grid = {"n_factors": [3], 'biased': ['True'],"n_epochs": [200],
"lr_all": [0.005], "reg_all": [0.14, 0.15,0.16],
"reg_bu": [0], "reg_bi": [0]
}
gs = GridSearchCV(SVD, param_grid, cv=5,n_jobs =-1, joblib_verbose =2)
gs.fit(data)
print(gs.best_score["rmse"])
print(gs.best_params["rmse"])
param_grid = {"n_factors": [3], 'biased': ['True'],
"n_epochs": [200,600,900,1200,1400,2000],
"lr_all": [0.005], "reg_all": [0.1,0.14, 0.15],
"reg_bu": [0], "reg_bi": [0]
}
gs = GridSearchCV(SVD, param_grid, cv=5,n_jobs =-1, joblib_verbose =2)
gs.fit(data)
print(gs.best_score["rmse"])
print(gs.best_params["rmse"])
param_grid = {"n_factors": [3,10,50], 'biased': ['True'],
"n_epochs": [200,500,1300,2000],
"lr_all": [0.005], "reg_all": [0.1],
"reg_bu": [0], "reg_bi": [0]
}
gs = GridSearchCV(SVD, param_grid, cv=5,n_jobs =-1, joblib_verbose =2)
gs.fit(data)
print(gs.best_score["rmse"])
print(gs.best_params["rmse"])
param_grid = {"n_factors": [10], 'biased': ['True'],
"n_epochs": [200,500,1300,2000], "lr_all": [0.005],
"reg_all": [0.1], "reg_bu": [0], "reg_bi": [0]
}
gs = GridSearchCV(SVD, param_grid, cv=5,n_jobs =-1, joblib_verbose =2)
gs.fit(data)
print(gs.best_score["rmse"])
print(gs.best_params["rmse"])
# Apply the finalized parameter set generated by GridSearch
svd = SVD(n_factors=3, biased=True, n_epochs=200, lr_all=0.005, reg_all=0.1, reg_bu=0, reg_bi=0)
svd.fit(trainset)
# Predict the ratings
predicted_rating = np.zeros(len(test_pair[:,1]))
for i in range(len(test_pair[:,1])):
predicted_rating[i] = svd.predict(test_pair[i,0],test_pair[i,1]).est
# Output the results
sub = pd.read_csv('sample_submission.csv')
sub['rating'] = predicted_rating
sub.to_csv('sub3.csv', index=False)
Codes for Model 1.2
# Import library
from auto_surprise.engine import Engine
import hyperopt
# Initialize the Auto-Surprise engine
engine = Engine(verbose=True, algorithms=['svd'])
# Start the trainer
# Set the maximum number of evaluations to 100
# TPE algorithm is applied
best_algo, best_params, best_score, tasks = engine.train(
data=data,
target_metric='test_rmse',
cpu_time_limit=60*60*2,
max_evals=100,
hpo_algo=hyperopt.tpe.suggest
)