For years MovieLens has required users to enter 15 ratings before they can get personalized recommendations. This design makes sense: how can we make recommendations for a user we know nothing about? That said, we don’t know whether this gives users the best experience. Why should users have to enter 15 ratings, and not ten or five? What would happen if we let users into the system without any ratings at all? To answer these questions we need to understand how our algorithms behave for users with very few ratings.
To understand how algorithms behave for users just joining the system, we looked at historic MovieLens ratings. We trained three popular recommender algorithms, ItemItem, UserUser, and SVD, on this rating data. While training, we limited some users to having only a small number of ratings. We used the ratings that were not given to the algorithm to measure several things (a sketch of this setup follows the list below):
- How accurate are the predictions? Can the algorithm accurately predict the user’s future ratings?
- How good are the recommendations? Does the algorithm suggest movies for the user that the user would like?
- What type of recommendations does the algorithm generate? Is there a good diversity of movies? Are the movies popular, or more obscure?
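To make this setup concrete, here is a minimal sketch of the profile-size evaluation in Python. The `ratings` DataFrame, the `make_recommender` factory, and the `fit`/`predict` interface are illustrative placeholders, not the actual evaluation scripts (those are linked at the end of the post).

```python
# Sketch of the profile-size evaluation: give the algorithm only the first
# n_profile ratings from each test user, then measure prediction error on
# that user's held-out ratings.
import numpy as np
import pandas as pd

def evaluate_at_profile_size(ratings, make_recommender, test_users, n_profile):
    """ratings: DataFrame with user, item, rating, timestamp columns (assumed)."""
    train_parts, test_parts = [], []
    for user, user_ratings in ratings.groupby("user"):
        user_ratings = user_ratings.sort_values("timestamp")
        if user in test_users:
            train_parts.append(user_ratings.head(n_profile))   # limited profile
            test_parts.append(user_ratings.iloc[n_profile:])   # held-out ratings
        else:
            train_parts.append(user_ratings)                    # full profile
    train, test = pd.concat(train_parts), pd.concat(test_parts)

    model = make_recommender()       # e.g. ItemItem, UserUser, or SVD
    model.fit(train)
    preds = model.predict(test[["user", "item"]])

    # Prediction accuracy on the held-out ratings (RMSE).
    return np.sqrt(np.mean((preds - test["rating"].to_numpy()) ** 2))

# Example: sweep profile sizes from 0 to 19 ratings.
# results = {n: evaluate_at_profile_size(ratings, ItemItem, test_users, n)
#            for n in range(20)}
```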
We compared our algorithms against two baselines: the average rating and the user adjusted average rating. The average rating baseline is a standard non-personalized baseline that recommends the items with the highest average rating. The user adjusted average rating shifts each user’s predictions so that people who tend to rate movies highly get higher predictions, and people who tend to rate movies low get lower predictions. Because this adjustment is the same across all movies, this algorithm gives the same recommendations as the average rating. Any good personalized algorithm should beat both baselines. On most metrics, even our best algorithm needed two or three ratings to outperform the average rating, and at least four ratings to outperform the user adjusted average rating.
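As a rough sketch of what these baselines compute (the exact bias formulation used in the paper may differ), the user adjusted average is just the item’s average rating shifted by a per-user constant, which is why the two baselines rank movies identically:

```python
# Baselines sketched with the same ratings-style DataFrame as above.
import pandas as pd

def fit_baselines(train):
    item_mean = train.groupby("item")["rating"].mean()
    # Per-user offset: how far each user's ratings sit above or below the item means.
    centered = train["rating"] - train["item"].map(item_mean)
    user_offset = centered.groupby(train["user"]).mean()
    return item_mean, user_offset

def predict_average(item_mean, items):
    """Average rating baseline: every user gets the item's mean rating."""
    return items.map(item_mean)

def predict_user_adjusted(item_mean, user_offset, user, items):
    """User adjusted average: shift all of a user's predictions by a constant.
    A constant shift changes predicted values but not the ranking of items,
    so both baselines recommend the same movies."""
    return items.map(item_mean) + user_offset.get(user, 0.0)
```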
ItemItem performed the worst of the personalized algorithms. It needs at least 13 ratings to make more accurate predictions than the average rating baseline, and at 19 ratings its predictions and recommendations are still worse than the user adjusted average rating. UserUser provided good predictions, but had a tendency to recommend items that few other users had seen. The recommendations from UserUser were so uncommon that we were unable to evaluate whether a user would actually like them. SVD performed quite well, giving good predictions and recommendations. Its only downside is that it appeared to make the least personalized recommendations of the personalized algorithms.
These results are surprising: we use these algorithms everywhere, and we expect them to work well for all types of users. But they don’t; our algorithms have trouble with new users. This shapes how we introduce users to our recommenders. If we can make better recommendations for new users, then maybe we can build systems where the new user experience isn’t a burden new users have to bear.
If you want to learn more about this work, I’ll be at RECSYS ’14 presenting a paper, written by Joe and me, which discusses this work in depth. For those who can’t attend, I’ve been told that the technical proceedings will be live streamed, so you can catch my presentation online. More information about the live stream will be posted to the RECSYS ’14 webpage. I will also post a copy of the paper on the GroupLens publication page as soon as possible. Finally, for those who are interested in how this work was done, the scripts for running this evaluation can be found online at bitbucket.org/kluver/coldstartrecommendation.
Update: You can see a recording of this presentation on YouTube at this link