Last updated 2014-06-05


Features

Algorithms #

We implemented three constant baselines, two memory-based collaborative filtering algorithms, and several state-of-the-art ones, mainly implemented by Matrix Factorization.
Constant Baselines
Predicting unknown ratings, this group of metrics assigns a constant estimate, regardless of the target user or item, or neither. These metrics are extremely fast to compute, but do not produce good results in general. These can be used for baselines, which should be overcome by any algorithm.

Constant Model


A constant (usually the median of possible ratings) is assigned for every prediction.

Overall Average


Each rating is predicted by average of all available ratings.

User Average


Each rating is predicted by average of the target user's available ratings on every item.

Item Average


Each rating is predicted by average of the target item's available ratings by all users.

Random


Randomly assign any rating between the minimum and the maximum possible rating score.

Memory-based Collaborative Filtering
Memory-based CF algorithms load the entire data on the memory, and make prediction based on similar users or items to the target user or item. Algorithms in this group do not explicitly build a model. Instead, they use the data for each query. Thus, they do not take learning time, but take longer time for prediction phase. Specific algorithm can be varied with choice of similarity measure. We provide Pearson Correlation, Vector Cosine, Mean Square Difference, and Mean Absolute Difference.

User-based Collaborative Filtering

[1,2]
In user-based CF algorithm, a rating for an item is estimated based on other users' ratings on the target item. For this task, a weigth factor based on user similarity is considered. That is, ratings by users who have similar interest to the target user have larger influence on determining estimation. This algorithm is one of popular traditional collaborative filtering methods, as it is relatively fast, but achieves reasonable accuracy in prediction.

Item-based Collaborative Filtering

[3]
Item-based CF algorithm first analyze the rating matrix to identify relationships between items. Then, it uses these relationships to predict ratings for unseen items. Sarwar et al. [3] argues that this algorithm is scalable to larger data as well as achieves better accuracy compared to user-based CF.

Extension: Default Voting

[4]
In real situation, it is common that the rating matrix is extremely sparse. In this case, we have very little number of ratings of common users or for common items. In order to overcome this problem, this extension assigns a default vote on every missing rating, enabling making use of every rating for estimation.

Extension: Inverse User Frequency

[4]
Another extension of memory-based CF is inverse user frequency. In this setting, weights for universally liked items are degraded, as they are less meaningful in capturing similarity.

Slope One

[5]
Slope-one is a relatively recent memory-based algorithm, aimed to easy and simple implementation with reasonable accuracy. It precomputes average difference between the ratings of one item and another for users who rated both, and use the information to predict unseen ratings in the form of f(x) = x + b. It has empiricaly shown that Slope-One achieves comparable accuracy with other slower memory-based CF algorithms.

Latent Semantic Models using Matrix Factorization
Contrast to memory-based CF algorithms, ones in this group explicitly build a model during the learning phase, by recognizing patterns in the training data. After the learning phase, they use the model to predict ratings of given queries. Many recent achievement of precise prediction used this approach. Specifically, these methods factorize the rating matrix into two low-rank matrices: user profile and item profile. They tend to take longer time compared to previous ones, but claimed to achieve better accuracy.

Regularized SVD

[6]
Regularized SVD (Singular Value Decomposition) minimizes squared error between actual ratings and predicted estimations for all available votes. In order to control overfitting issue, it adds regularization terms both for user and item profiles. For minimization process, it uses gradient descent. With a proper choice of parameters, this algorithm is known to achieve good accuracy.

Non-negative Matrix Factorization (NMF)

[7, 8]
NMF also factorizes the rating matrix into user and item profiles, but it has one more restriction: both low-rank profile matrices should have only positive values in them. This method uses multiplicative update rules for minimizing Euclidean distance or Kullback-Leibler divergence between the actual ratings and estimation. Lee and Seung [7] proved convergence of this algorithm mathematically.

Probabilistic Matrix Factorization (PMF)

[9]
PMF adopts a probabilistic linear model with Gaussian observation noise for representing latent features both for users and items. Salakhutdinov [9] argues that this can be viewed as a probabilistic extension of the SVD model.

Bayesian Probabilistic Matrix Factorization (BPMF)

[10]
This algorithm applies a fully Bayesian treatment of PMF model in which model capacity is controlled automatically by integrating over all model parameters and hyperparameters. Salakhutdinov [10] uses Markov Chain Monte Carlo (MCMC) method for training BPMF, and claims good performance on Netflix dataset.

Non-linear Probabilistic Matrix Factorization (NLPMF)

[11]
This algorithm develops a non-linear probabilistic matrix factorization using Gaussian process latent variable models. The model is optimized using stochastic gradient descent method, allownig to apply Gaussian processes to data sets with millions of observations without approximate methods.

Rank-based SVD

[19]
With matrix factorization form, this method minimizes various ranked loss functions, which preserve relative order of preference between two items rated by a same user.

Local Low-Rank Matrix Approximation
LLORMA [18] is a powerful and scalable matrix approximation tool. Under the assumption that the target matrix is locally in low-rank, it learns several local models on different anchor points. A convex combination using Nadaraya-Watson regression gives a reconstruction of the target matrix.

Singleton Parallel LLORMA


Parallel LLORMA learns each local model independently in parallel. By taking advantage of multi-cores, it achieves high accuracy without taking long time.

Singleton Global LLORMA


It minimizes squared error between the convex combination of q local models and observed entries. All local models are learned together, minimizing the squared error exactly. Although it does not take advantage of multi-cores, it may achieve better accuracy than parallel version.

Paired Global LLORMA (Local Collaborative Ranking)

[19]
It minimizes various ranked loss functions, which preserve relative order of preference between two items rated by a same user. All local models are learned together, but unlike the singleton version, it runs in parallel.

Other State-of-the-art Algorithms
Besides memory-based and latent semantic models, several other recommendation algorithms were designed and achieved good performance. We do provide some of those state-of-the-art algorithms, listed below.

Fast Nonlinear Principal Component Analysis (NPCA)

[12]
Yu [12] proposes efficient implementation of nonparametric matrix factorization methods. Nonparametric methods are appealing due to its flexibility in modeling complex data. With his scalable implementation, fast NPCA is claimed to achieve good performance on large scale data.

Rank-based Recommender

[14]
Sun et.al [14] construct a kernel density estimator for incomplete partial rankings and predict the rankings of new items which minimized the posterior loss. The probability modeling approach thus adapts to customized loss functions such as MAE and Asymmetric loss.

Evaluation Metrics #

Many researchers have suggested how to evaluate recommender systems or collaborative filtering algorithms. Gunawardana and Shani [15] categorizes them into three groups according to the main task: predicting ratings, optimizing utility, and recommending good items. As they argue, a proper metric should be used in order to evaluate the target system correctly. We do provide several popular evaluation metrics, including RMSE, MAE, rank-based metrics, as well as our own evaluation scheme.
Comparison of Predicted Ratings
Evaluation metrics in this category have heavily used for comparing accuracy of predicting unseen ratings. Note that p indicates predicted rating, while r observed true rating. rmax and rmin mean maximum and minimum observed ratings, respectively. Although these metrics have used frequently in literature, Gunawardana and Shani [13] argues that they are limited in that they do not capture the fact that the space of ratings is not actually uniform.

Root of the Mean Square Error (RMSE)



Mean Absolute Error (MAE)



Normalized Mean Absolute Error (NMAE)


NMAE normalizes MAE measure for the purpose of direct comparison between two systems using different rating scale.


Asymmetric Measures


In contrast to the standard MAE and RMSE, the asymmetric loss captures the fact that recommending bad movies as good movies is worse than recommending good movies as bad.
where the loss is asymmetric such that loss(5,1) > loss(1,5). The matrix below shows one example of the asymmetric loss when minRate is 1 and maxRate is 6. The rows are predicted ratings and the columns are real ratings. For example,


Rank-based Metrics
Unlike the previous category, Rank-based Metrics use relative order of recommended items rather than exact scores of them.

Half-Life Utility (HLU)

[4]
Suppose we have a list of recommended items, and want to know how likely the user will actually retrieve the recommended items. Assuming an exponential decay with a half-life of alpha, it evaluates how many actually preferred items are located in the front of recommended list. Overall score is in relative scale to the maximum possible score of the target user. Gunawardana and Shani [12] argues that this is a proper measure when the recommender system is aimed to optimize utility.



Normalized Discounted Cumulative Gain (NDCG)

[16]
NDCG is widely used for measuring effectiveness of a web search engine or other information retrieval applications. Discounted cumulative gain (DCG) measures the usefulness of an item based on its rank in a recommended list. NDCG normalizes the DCG in order to compensating varying length of recommended item list. For this, DCG score is divided by maximum possible gain with the given list. This is conceptually similar to HLU, but gives slightly different result.

Kendall's Tau

[17]
The kendall's Tau distance between two rankings x, y over n items is the number of discordant pairs.
where I is the indicator function and xi is the rank of item i in x.

Spearman

[17]
The Spearman's distance is the square of Euclidean distance between two rankings interpreted as vectors.


Benchmark #

We present benchmark performance of our toolkit by comparing accuracy measure and computation time with another recommender system toolkit MyMedia. We chose MyMedia to compare with since provided features (both recommendation algorithms and evaluation measures) overlap the most. Although it is originally in C#, we used Java version of MyMedia for fair comparison. We used a machine with Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz and 16GB of main memory. We used Movielens 1M dataset with random split of 80% train and 20% test. All the parameters affecting computation time (e.g, neighborhood size, learning rate, and number of iterations) were set equally for both toolkits. Some parameters adjustable only in one toolkit were set as default.
Algorithm MAE RMSE Train Time Test Time
PREA MyMedia PREA MyMedia PREA MyMedia PREA MyMedia
Constant 1.0252 1.0221 1.2630 1.2604 00:00 00:00 00:00 00:00
All Average 0.9362 0.9345 1.1198 1.1180 00:00 00:00 00:00 00:00
User Average 0.8310 0.8300 1.0381 1.0362 00:00 00:00 00:00 00:00
Item Average 0.7824 0.7824 0.9802 0.9797 00:00 00:00 00:00 00:00
Random 1.3966 3.0820 1.7089 3.2908 00:00 00:00 00:00 00:00
User-based 0.6941 0.7322 0.8915 0.9257 00:00 03:08 00:42 03:27
Item-based 0.6965 0.6880 0.8916 0.8766 00:00 02:04 27:52 13:52
Slope-one 0.7109 0.7120 0.9012 0.9019 00:32 06:16 00:12 02:26
Regularized SVD 0.6970 0.6900 0.8746 0.8682 27:05 10:34 00:01 00:01
NMF 0.7299 N/A 0.9256 N/A 03:07 N/A 00:01 N/A
PMF 0.7626 N/A 0.9439 N/A 12:08 N/A 00:01 N/A
Bayesian PMF 0.7077 N/A 0.9004 N/A 00:08 N/A 00:01 N/A
Nonlinear PMF 0.7515 N/A 0.9429 N/A 30:42 N/A 08:42 N/A
Fast NPCA 0.7776 N/A 0.9685 N/A 11:18 N/A 02:44 N/A
Rank(Mean) 0.7344 N/A 0.9578 N/A 00:00 N/A 39:44 N/A
Rank(Asymmetric) 0.7294 N/A 0.9914 N/A 00:00 N/A 39:44 N/A
LLORMA(Global, r=5) 0.7018 N/A 0.8854 N/A 60:07 N/A 00:00 N/A
LLORMA(Parallel, r=10) 0.6603 N/A 0.8453 N/A 49:15 N/A 00:00 N/A

(N/A indicates that the algorithm is Not Available in the corresponding toolkit.)

Here are some observations about the benchmark test:
  • The result indicates that MAE and RMSE values are very similar in most algorithms. This supports that our implementation is correct, assumming that MyMedia implementation is correct. (One exception is Random, where MyMedia shows much worse accuracy. We do not think this is a big deal since Random is a very simple baseline.)
  • Implementation of User-based and Item-based CF is a little bit different in each toolkit. MyMedia uses memoization technique, remembering previously used similarity values in main memory. This reduces computation time, while dramatically increases memory consumption. For example, the User-based CF cannot run without expanding memory to 2GB in Java. On the other hand, PREA supports using precalculated user/item similarity. Using this makes the computation much faster, as indicated in User-based row. If it is not used, the computation takes about twice than memoization used in MyMedia, as indicated in Item-based row.
  • PREA is faster than MyMedia for some algorithms like Slope-one, while it is the opposite for other algorithms like Regularized SVD. In general, however, the performance difference is at most about twice to three times at most. We believe that both toolkits have some room to improve performance a little bit, but it may not that big.
  • Algorithms which are not available in MyMedia do not take time much longer than Regularized SVD. This means that those algorithms are efficiently implemented for practical use.
  • We note that this table is not proper for comparing accuracy of each algorithm, since the parameters are not perfectly tuned in some algorithms. We provide this table for the purpose of verifying correct implementation and comparing computation time between the two toolkits.

References #

  1. Xiaoyuan Su and Taghi M. Khoshgoftaar, A survey of Collaborative Filtering Techniques, Advances in Artificial Intelligence Volume 2009, Article ID 421425, 2009.
  2. Gediminas Adomavicius and Alexander Tuzhilin, Toward the Next Generation of Recommender Systems: A Survey of the State-of-the-Art and Possible Extensions, IEEE Transactions on Knowledge and Data Engineering, Vol. 17, No. 6, 2005.
  3. Badrul Sarwar et al, Item-based Collaborative Filtering Recommendation Algorithms, Proceedings of the international conference on World Wide Web 2001, Hong Kong, 2001.
  4. John S. Breese and David Heckerman and Carl Kadie, Empirical Analysis of Predictive Algorithms for Collaborative Filtering, Proceedings of the 14th conference on Uncertainty in Artificial Intelligence, 1998.
  5. Daniel Lemire and Anna Maclachlan, Slope One Predictors for Online Rating-Based Collaborative Filtering, Society for Industrial Mathematics, 05:471–480, 2005.
  6. Arkadiusz Paterek, Improving Regularized Singular Value Decomposition Collaborative Filtering, Proceedings of KDD Cup and Workshop 2007, 2007.
  7. Daniel D. Lee and H. Sebastian Seung, Learning the parts of Objects by Non-negative Matrix Factorization, Letters to Nature, Vol. 401, 1999.
  8. Daniel D. Lee and H. Sebastian Seung, Algorithms for Non-negative Matrix Factorization, Advances in Neural Information Processing Systems, 2001.
  9. Ruslan Salakhutdinov and Andriy Mnih, Probabilistic Matrix Factorization, Advances in Neural Information Processing Systems 20, Cambridge, MA: MIT Press, 2008.
  10. Ruslan Salakhutdinov and Andriy Mnih, Bayesian Probabilistic Matrix Factorization using Markov Chain Monte Carlo, Proceedings of the 25th International Conference on Machine Learning, 2008.
  11. Neil D. Lawrence and Raquel Urtasun, Non-linear Matrix Factorization with Gaussian Processes, Proceedings of the 26th International Conference on Machine Learning, 2009.
  12. Kai Yu et al, Fast Nonparametric Matrix Factorization for Large-scale Collaborative Filtering, Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval, 2009.
  13. Jason D. M. Rennie and Nathan Srebro, Fast Maximum Margin Matrix Factorization, Proceedings of the 22nd International Conference on Machine Learning, 2005.
  14. M. Sun and G. Lebanon and P. Kidwell, Estimating Probabilities in Recommendation Systems, Proceedings of the fourteenth international conference on Artificial Intelligence and Statistics, 2011.
  15. Asela Gunawardana and Guy Shani, A Survey of Accuracy Evaluation Metrics of Recommendation Tasks, The Journal of Machine Learning Research, Volume 10, 2009.
  16. Kalervo Jarvelin and Jaana Kekalainen, Cumulated gain-based evaluation of IR techniques, ACM Transactions on Information Systems 20(4), 422–446, 2002.
  17. J. I. Marden, Analyzing and modeling rank data, CRC Press, 1996.
  18. Joonseok Lee and Seungyeon Kim and Guy Lebanon and Yoram Singer, Local Low-Rank Matrix Approximation, Proceedings of the 30th International Conference on Machine Learning, 2013.
  19. Joonseok Lee and Samy Bengio and Seungyeon Kim and Guy Lebanon and Yoram Singer, Local Collaborative Ranking, Proceedings of the 23rd International World Wide Web Conference (WWW), 2014.