Datasets and availability

By on

Occasionally, GroupLens receives requests for datasets that we possess. In many cases, we are able to provide this data as we have with the Movielens rating datasets. One of the data collections that we have is a 10% sample of Wikipedia page requests (essentially every 10th HTTP request), since April 2007. This data accumulates at a rate of about 5 GB/day, and we currently have around 4 TB of unprocessed compressed data. This is approximately 40 TB when uncompressed. While we sometimes get requests for this data, the sheer size of it makes it difficult for us to make it available for download.

Although we cannot make this data available for download, depending on your request and our availability, we may be able to collaborate with you by performing the analysis you need on our data.

Also, we are not the only ones who have view data of Wikipedia. There are several other sources that have data on page views. Here are some of these resources and the type of data that they have available:

  • stats.grok.se – Provides data on per-page view counts by month.
  • dammit.lt/wikistats – Has files containing hourly per-page view count snapshots, with archives that currently go back to October 2009.
  • Wikipedia Page Traffic Statistics on AWS – Hourly traffic statistics for a 7 month window (October 2008 – April 2009) are available on Amazon Web Services. This data was assembled from files that were available from dammit.lt/wikistats at the time.