It has been about six years since we released our previous major ratings dataset, MovieLens 10M. Today, we have released its successor, MovieLens 20M, alongside two new non-archival datasets for education and development. These datasets are available for download at http://grouplens.org/datasets/movielens/.
Our latest benchmark dataset has 20 million ratings and nearly 0.5 million tag applications. We are releasing this new version to serve as a new, more comprehensive reference dataset for recommender systems research.
We’ve made a few changes to modernize the dataset. The files are now formatted as .csv with a header row (previous releases used custom delimiters). A new file, links.csv, maps MovieLens ids to the corresponding entities in The Movie Database and IMDb. Finally, the dataset no longer includes pre-computed cross-validation folds or scripts to generate them, since most data mining and recommender toolkits provide this functionality out of the box.
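As a rough illustration of working with the new format, the sketch below reads header-row CSV data, attaches external ids from a links table, and splits the ratings into folds by hand. The column names (userId, movieId, rating, timestamp, imdbId, tmdbId) are assumptions, and the inline rows are made-up stand-ins for ratings.csv and links.csv, so check the headers of the real files before relying on it. The helper names read_rows, attach_links, and k_folds are hypothetical.

```python
import csv
import io
import random

# Made-up sample rows standing in for ratings.csv (column names are assumed).
RATINGS_CSV = """userId,movieId,rating,timestamp
1,10,4.0,1000
1,20,3.5,1001
2,10,5.0,1002
2,30,2.0,1003
3,20,4.5,1004
3,30,3.0,1005
"""

# Made-up sample rows standing in for links.csv (ids are placeholders).
LINKS_CSV = """movieId,imdbId,tmdbId
10,0000001,9001
20,0000002,9002
30,0000003,9003
"""


def read_rows(text):
    """Parse CSV text with a header row into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text)))


def attach_links(ratings, links):
    """Add imdbId and tmdbId to each rating by looking up its movieId."""
    by_movie = {row["movieId"]: row for row in links}
    joined = []
    for rating in ratings:
        link = by_movie.get(rating["movieId"], {})
        joined.append({
            **rating,
            "imdbId": link.get("imdbId"),
            "tmdbId": link.get("tmdbId"),
        })
    return joined


def k_folds(rows, k, seed=0):
    """Shuffle the rows reproducibly and deal them into k disjoint folds."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]


ratings = attach_links(read_rows(RATINGS_CSV), read_rows(LINKS_CSV))
folds = k_folds(ratings, k=3)
```

In practice, a data mining or recommender toolkit's own cross-validation utilities would replace the hand-rolled k_folds helper; it is shown here only to make the idea concrete.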
MovieLens latest and latest-small
We are also releasing two new non-archival datasets to fulfill the common request for current data that can be incorporated into courses, exploratory software, and the like. These datasets are not intended for research benchmarking. To keep their movie content recent, we plan to regenerate them regularly.
Latest contains 21 million ratings and 470,000 tag applications. Unlike the benchmark datasets, this one includes all users, not just those with at least 20 ratings.
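If you want Latest to resemble the benchmark datasets in this respect, one option is to apply the activity threshold yourself. The sketch below is a minimal example on synthetic (user id, movie id, rating) tuples; the function name filter_active_users and the tuple layout are hypothetical, and the threshold of 20 is taken from the description above.

```python
from collections import Counter


def filter_active_users(ratings, min_ratings=20):
    """Keep only ratings from users who have at least min_ratings ratings."""
    counts = Counter(user_id for user_id, _movie_id, _rating in ratings)
    return [row for row in ratings if counts[row[0]] >= min_ratings]


# Synthetic data: user 1 has 25 ratings, user 2 has only 5.
ratings = [(1, movie_id, 4.0) for movie_id in range(25)]
ratings += [(2, movie_id, 3.0) for movie_id in range(5)]

kept = filter_active_users(ratings)
```

Here only user 1's ratings survive the filter, since user 2 falls below the threshold.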
Latest-small contains 100,000 ratings and 2,500 tag applications. It is intended for applications (e.g., education, software demos, testing) where a smaller amount of data is actually better. We have also made it possible to redistribute this dataset for non-commercial applications (see the README for details).
Let us know what you think! Please take a short survey about the MovieLens datasets.