Typically, user-user collaborative filtering has used Pearson correlation to compare users. Early work tried Spearman correlation and (raw) cosine similarity, but found Pearson to work better, and the issue wasn’t revisited for quite some time.

When I was revisiting some of these algorithmic decisions for the LensKit paper, I tried cosine similarity on mean-centered vectors (sometimes called ‘Adjusted Cosine’) and found it to work better (on our offline evaluation metrics) than Pearson correlation, even without any significance weighting. So now my recommendation is to use cosine similarity over mean-centered data. But why the change, and why does it work?

There are two issues with computing similarity between users using Pearson correlation. One is the question of what to do with items that one user has rated but the other has not. The straightforward, statistically correct way to handle this is to only consider items that both users have rated, and to do this consistently. This results in the following formula, where \(I_u\) is the set of items rated by \(u\):

\(\mathrm{sim}(u, v) = \frac{\sum_{i \in I_u \cap I_v} (r_{ui} - \mu_u)(r_{vi} - \mu_v)}{\sqrt{\sum_{i \in I_u \cap I_v} (r_{ui} - \mu_u)^2} \sqrt{\sum_{i \in I_u \cap I_v} (r_{vi} - \mu_v)^2}}\)

Strictly speaking, \(\mu_u\) should be computed just over the ratings in \(I_u \cap I_v\); in practice, computing it over all of \(u\)’s ratings doesn’t seem to make a very significant difference.
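As a concrete reference point, here is a minimal Python sketch of this co-rated-items Pearson correlation. The dictionary-based rating representation and the function name are illustrative assumptions on my part, not code from any particular library:

```python
import numpy as np

def pearson_sim(ratings_u, ratings_v):
    """Pearson correlation over co-rated items.

    ratings_u, ratings_v: dicts mapping item id -> rating value.
    """
    common = ratings_u.keys() & ratings_v.keys()
    if len(common) < 2:
        return 0.0  # correlation is undefined (or degenerate) with < 2 co-rated items

    ru = np.array([ratings_u[i] for i in common], dtype=float)
    rv = np.array([ratings_v[i] for i in common], dtype=float)

    # center each user's ratings on their mean over the co-rated items
    ru -= ru.mean()
    rv -= rv.mean()

    denom = np.sqrt(np.sum(ru ** 2)) * np.sqrt(np.sum(rv ** 2))
    if denom == 0:
        return 0.0  # a user rated every co-rated item identically
    return float(ru @ rv / denom)
```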

The other issue is that users with few rated items in common can have spuriously high similarities. The Pearson correlation over a single pair of ratings is undefined (division by 0); over 2 pairs it is \(\pm 1\). But if each of the users has rated many items, the fact that they have rated two of the same items is not a good basis for saying that they are perfectly similar. The typical way to deal with this is significance weighting (Herlocker et al., 1999): multiply the similarity by \(\frac{\min(|I_u \cap I_v|, 50)}{50}\), discounting the similarity linearly when the users have fewer than 50 rated items in common. This is somewhat ad hoc, but it improves the performance of recommenders using Pearson correlation.
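In code, significance weighting is just a linear discount applied to whatever similarity was computed. A small sketch (the parameter names are mine; the threshold of 50 comes from Herlocker et al.):

```python
def significance_weighted(sim, n_common, threshold=50):
    """Discount a similarity when the users share few co-rated items."""
    return sim * min(n_common, threshold) / threshold

# e.g. two users with only 5 co-rated items keep just 10% of their raw similarity:
# significance_weighted(1.0, 5)  ->  0.1
```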

So what’s up with cosine? First, cosine similarity over mean-centered vectors is very similar to Pearson correlation (\(\hat{r_{ui}}\) is the normalized rating \(r_{ui} - \mu_u\)):

\[\begin{align*}
\mathrm{sim}(u,v) & = \frac{\mathbf{\hat u} \cdot \mathbf{\hat v}}{\|\mathbf{\hat u}\|\|\mathbf{\hat v}\|} \\
& = \frac{\sum_i \hat{r_{ui}} \hat{r_{vi}}}{\sqrt{\sum_i \hat{r_{ui}}^2} \sqrt{\sum_i \hat{r_{vi}}^2}}
\end{align*}\]

I have not yet specified what values \(i\) takes in these summations. If we sum over \(I_u \cap I_v\), it is exactly the Pearson correlation. If, however, we sum over \(I_u \cup I_v\), with \(\hat{r_{ui}} = 0\) whenever \(u\) has not rated \(i\), things change. If the users have rated exactly the same items, it remains the Pearson correlation. However, if one user has rated items the other has not, those ratings drop out of the numerator (since they are multiplied by 0) but still contribute to the denominator. So this similarity function is naturally self-damping, roughly in proportion to \(\frac{|I_u \cap I_v|}{\sqrt{|I_u|}\sqrt{|I_v|}}\), the ratio of the number of co-rated items to the geometric mean of the users’ rated-item counts (it’s not a strict linear scaling, since the rating values affect the computation, but I find this framing useful for conveying the basic idea). In offline experiments, this dynamic self-damping seems to be more effective than significance weighting, and it has the added benefit of not depending on the fairly arbitrary cutoff of 50.
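To make the difference from the co-rated-items version concrete, here is a sketch of the mean-centered cosine over the union of rated items, in the same illustrative dictionary representation used above (again, not any library’s actual implementation):

```python
import numpy as np

def mean_centered_cosine(ratings_u, ratings_v):
    """Cosine similarity of mean-centered rating vectors, summed over I_u ∪ I_v.

    Items rated by only one user contribute 0 to the numerator but still
    contribute to that user's norm in the denominator, which damps the
    similarity when few items are co-rated.
    """
    mu_u = np.mean(list(ratings_u.values()))
    mu_v = np.mean(list(ratings_v.values()))

    # numerator: only co-rated items have both factors nonzero
    common = ratings_u.keys() & ratings_v.keys()
    num = sum((ratings_u[i] - mu_u) * (ratings_v[i] - mu_v) for i in common)

    # denominators: each user's norm over *all* of their (centered) ratings
    norm_u = np.sqrt(sum((r - mu_u) ** 2 for r in ratings_u.values()))
    norm_v = np.sqrt(sum((r - mu_v) ** 2 for r in ratings_v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # a user who rates everything identically has zero norm after centering
    return float(num / (norm_u * norm_v))
```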

The earliest work I know of using cosine similarity for user-user CF (Breese et al., 1998) did not mean-center the data prior to computing the similarity. I suspect this is why it did not perform as well (there could also be domain- or task-specific factors at play). Mean centering for similarity computation also didn’t have much (if any) visibility in the recommender literature until item-item collaborative filtering, and then wasn’t often backported to user-user.

Some references, including an earlier version of the Wikipedia article on collaborative filtering, define cosine similarity to be computed exactly like Pearson correlation (considering only items in common). This does not have the self-damping benefits of considering all items.

Mean-centering and summing over the union of the users’ rated items (resulting in a dynamically attenuated correlation) is the best current way I know of to compare users for similarity in general. It’s still good to experiment with your data (and ideally users) to see what works best for any particular application, but mean-centered cosine is a good starting point.

