Before actually implementing any models on this project, it is best for us to get a better know on the dataset, namely `train.csv`

and `test.csv`

, that we are going to be dealing with. Let's start off with some basic information:

- Three fields: user_id, item_id, rating
- Data types: All integer
- User range: 0 - 599
- Total number of unique users: 593
- Item range: 0 - 798
- Total number of unique items: 785
- Rating range: 0 - 10
- Number of rows with missing data: 0

We might also be interested in the distribution of the dataset. This helps us better understand which model we might actually want to apply. Nothing gives us more information than some graphs. To begin with, let's take a look at the user and item distribution in the 2 sets, along with the rating's distribution:

Both the train set and test set appears to be of a quite similar distribution, which is definitely a good sign to see as it implies that we should be able to create a nice prediction model learning from the train set alone. We can also see that the distributions of both `user_id`

and `item_id`

are somewhat uniform, with the only exception of `user_id`

's frequency falling off by the end as it approaches 599. Another interesting thing to note is that the `rating`

distribution is not uniform or normal at all, instead it is as if it were a triangular distribution centered at around 3 with a much heavy tail around 0.

Let's have a deeper dive into distributions by looking at some violin plots the frequencies of users and items, i.e. the number of times they have rated or being rated:

The plots reveals that there are a lot of user and items with few ratings, forming such an skewed looking distribution. This might also be a hint towards whether popularity of an item or the activeness of a user has anything to do with the prediction.

Here are a few more important parts in the whole data set when dealing with frequencies:

- Most rated user: user_id 9, with a count of 152
- Most rated item: item_id 786, with a count of 105
- Most popular item: 679, with an average rating of 7.729730

Some scatters plots and histogram might reveal more information:

This is an interesting discovery! Just like the plots in the lecture, the number of ratings seem to have effect on the item's average rating. The frequency of user's number of rating also looks similar, suggesting a potential integration of "user activeness"/"item popularity" in our model. These has to be a hint to prefer some matrix factorization with neighborhood effects, namely SVD and SVD++.

Let's look at some more frequency distributions:

There are two fascinating discoveries from the two scatter plots above:

- The lower the rating, the denser the points are (especially significant between 'item_id' and 'rating'), suggesting a bias towards lower ratings overall.
- There seems to be a "border" between ratings > 1, and either 'user_id' < around 80 or 'item_id' > 750, suggested by a clear empty rectangular areas in both plots. It appears to be manually configured to behave so.

Now comes the main part of the project - creating a model to generate rating predictions. Let's start with the most basic ones:

To server as a benchmark, we are going to use the following models demonstrated in the lecture. This includes the following models:

- Global Average Model,
- User Average Model,
- Item Average Model,
- User Average + Item Residual Model,
- Item Average + User Residual Model,
- Matrix Factorization Model

These models serve as a baseline for the further models to be evaluated against so as to get a feel of how good our they do. It is believed that we are already very familiar with the above models that no introductions and descriptions for them are required here.

The training RMSE is as follows, sorted by increasing error:

Model | Train RMSE |
---|---|

MatrixFactorization | 0.994387 |

ItemMeanUserResidual | 1.056764 |

UserMeanItemResidual | 1.147565 |

ItemMean | 1.611223 |

UserMean | 2.256926 |

GlobalAverage | 2.494066 |

The above shows that there are great potential for matrix factorization algorithms in our current settings. Do note that the RMSE taken here on the matrix factorization model is not tuned, unlike that of the rest requiring no room for adjustments.

During my homework attempts, I have come across prediction ratings way out of bounds of the origin rating range. This has led me to develop a "clamping" add-on to the predictions generated, by simply applying a min-max function on the prediction data. The new modified prediction results are as follows:

Model | Train RMSE |
---|---|

ClampedItemMeanUserResidual | 0.965831 |

ClampedMatrixFactorization | 0.994305 |

ClampedUserMeanItemResidual | 1.080747 |

*Note: The pure mean models, i.e. ItemMean, UserMean, and GloablAverage models, are not shown here as they require no clamping. It would not make sense to have an average rating out of range of the original bound. Only the above model could achieve such behaviors.*

There are significant improvements to the 3 models, yet the train RMSE is still very high, compared to the current top public score of 0.64100. Perhaps a more complex model would yield better results.

As suggested the section above, matrix factorization methods seem to gain a bit of favour. Our initial EDA also gave insight to a popularity/activeness-related feature, thus leading us to focus on the following two models:

- SVD
- SVD++

SVD is a matrix factorization algorithm with the objective to predict rating $\hat{r}_{ui}$:

$\hat{r}\_{ui} = \mu + b_u + b_i + q_i^\intercal p_u $

where $\mu$ is the global mean, $b_u$ is the user bias, $b_i$ is the item bias, and $q_i$ and $p_u$ are the $k$-dimensional latent factors for item and user respectively. The following regularized squared error is minimized:

$\sum_{r_{ui} \in R_{train}} \left(r_{ui} - \hat{r}_{ui} \right)^2 + \lambda\left(b_i^2 + b_u^2 + ||q_i||^2 + ||p_u||^2\right)$

Note that except $\mu$, all other variables are updated sequentially with stochastic gradient descent:

$b_u \leftarrow b_u + \gamma (e_{ui} - \lambda b_u)\\ b_i \leftarrow b_i + \gamma (e_{ui} - \lambda b_i)\\ p_u \leftarrow p_u + \gamma (e_{ui} \cdot q_i - \lambda p_u)\\ q_i \leftarrow q_i + \gamma (e_{ui} \cdot p_u - \lambda q_i)$

where $e_{ui} = r_{ui} - \hat{r}_{ui}$. This model essentially is a combination of global mean, user mean, item mean, and matrix factorization models. It is also possible to use an unbiased version where there are no the equation becomes:

$\hat{r}\_{ui} = q_i^\intercal p_u $

making it identical to pure matrix factorization but with stochastic gradient descent.

SVD performed very well on the data set. At the very beginning, a naive brute force approach on a sparse set of parameters was used to find the optimal ones. The number of epochs on stochastic gradient descent is chosen to be fixed at 200, since this number is found to be big enough for the values to converge and the relative time tradeoff is acceptable.

Parameters | Representing Variable | List of values |
---|---|---|

n_factors | $k$ | `np.arange(5, 500, 25)` |

lr_all | $\gamma$ | `10 ** (np.arange(-5, 0, 0.4))` |

reg_all | $\lambda$ | `10 ** (np.arange(-5, 0, 0.4))` |

Fitting the best parameter with a 3-fold cross validation in this preliminary round of search into SVD achieved a decent score. The final parameter and scores are:

Model | `n_factors` |
`lr_all` |
`reg_all` |
Train RMSE | Public Score RMSE | |
---|---|---|---|---|---|---|

Attempt1 | SVD | 80 | 0.03981 | 0.1 | 0.71646 | 0.66628 |

The SVD++ algorithm essentially is an upgraded version of SVD, by taking implicit ratings into account. The prediction $\hat{r}_{ui}$ becomes:

$\hat{r}_{ui} = \mu + b_u + b_i + q_i^T\left(p_u + |I_u|^{-\frac{1}{2}} \sum_{j \in I_u}y_j\right)$

where $y_j$ is a set of item factors designed to model their respective implicit ratings. SVD++ updates itself with stochastic gradient descent also.

Again, we are also making an initial attempt here. A similar setting as above is used here:

Parameters | Representing Variable | List of values |
---|---|---|

n_factors | $k$ | `np.arange(3, 250, 2)` |

n_epochs | N/A | `np.arange(100, 1901, 200)` |

lr_all | $\gamma$ | `10 ** (np.arange(-5, 0, 0.4))` |

reg_all | $\lambda$ | `10 ** (np.arange(-5, 0, 0.4))` |

Fitting the best parameter with a 3-fold cross validation in this first round of search into SVD++ achieved similar scores:

Model | `n_factors` |
`n_epochs` |
`lr_all` |
`reg_all` |
Train RMSE | Public Score RMSE | |
---|---|---|---|---|---|---|---|

Attempt2 | SVD++ | 137 | 1500 | 0.00630 | 0.1 | 0.72342 | 0.66389 |

which is not an improvement when compared to our first attempt with SVD. However, I would argue that the two public scores of these 2 attempts are very close to each other, of which its performance cannot be truly tested with such an initial sparse scan on hyperparameters.

Cross validation is a very common technique in machine learning to combat the issue of overfitting. In this project, using cross validation is a defacto setting. However, choosing the right fold for cross validation also seems to have a huge effect on model development. Throughout the entire course of this project, the number of fold in cross validation seems to have a significant impact on the public score, which is a very new discovery to me.

In my first attempt, a 3-Fold cross validation is used to pick the right hyperparameter. Surprisingly, by manipulating the number of folds, we can get a very different set of hyperparameter values as well as training error during our initial search.

Fold | `n_factor` |
`lr_all` |
`reg_all` |
Train RMSE | Public score RMSE |
---|---|---|---|---|---|

2 | 450 | 0.003162 | 0.0031622 | 0.74594 | 0.65766 |

3 | 80 | 0.039810 | 0.1 | 0.71646 | 0.66628 |

4 | 55 | 0.039810 | 0.1 | 0.70164 | 0.65893 |

5 | 137 | 0.006309 | 0.1 | 0.72342 | 0.70530 |

As we can see from the table above, the difference is actually very significant, especially on the RMSEs. Folds of 2, 3 and 4 seems to work well while 5 appears to start malfunction. This is a rather strange behavior as usually 3 and 5 folds are the go-to folds in a cross validation, yet 5 folds here actually have a very undesirable behavior. The discovery of this phenomenon has to thanks to my obstinacy. I have spent multiple attempts on models trained with 5-fol