Jester Collaborative Filtering Dataset
Anonymous Ratings Data from the Jester Online Joke Recommender System
Old page below:
Collaborative Filtering Data:
4.1 million continuous ratings (-10.00 to +10.00) of 100 jokes from 73,421 users,
collected between April 1999 and May 2003.
Freely available for research use when acknowledged with the following reference:
Eigentaste: A Constant Time Collaborative Filtering Algorithm.
Ken Goldberg, Theresa Roeder, Dhruv Gupta, and Chris Perkins.
Information Retrieval, 4(2), 133-151. July 2001.
(Aside: many papers, including ours, report Normalized Mean Absolute
Error (NMAE) rates of approximately 20%. How good is this compared with
random guessing? In the Appendix to our paper, we show that if user
ratings are uniformly distributed, random guessing yields NMAE = 33%.)
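The 33% baseline in the aside can be checked numerically. The sketch below is not taken from the paper: it assumes NMAE is defined as the mean absolute error divided by the width of the rating scale (here 20), and that both the true ratings and the guesses are drawn independently and uniformly from [-10, +10]. The function names, sample size, and seed are illustrative choices.

```python
import random

R_MIN, R_MAX = -10.0, 10.0  # Jester rating scale


def nmae(predictions, ratings, r_min=R_MIN, r_max=R_MAX):
    """Mean absolute error divided by the width of the rating scale (assumed definition)."""
    total = sum(abs(p - r) for p, r in zip(predictions, ratings))
    return total / len(ratings) / (r_max - r_min)


def simulate_random_guessing(n=200_000, seed=0):
    """Estimate the NMAE of uniform random guesses against uniform random ratings."""
    rng = random.Random(seed)
    ratings = [rng.uniform(R_MIN, R_MAX) for _ in range(n)]
    guesses = [rng.uniform(R_MIN, R_MAX) for _ in range(n)]
    return nmae(guesses, ratings)


if __name__ == "__main__":
    print(f"NMAE of random guessing: {simulate_random_guessing():.3f}")
```

Under these assumptions the estimate should land close to 1/3, since the expected absolute difference of two independent uniform values on an interval is one third of the interval's width.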
As a courtesy, if you use the data,
I would appreciate knowing your name, what research group
you are in, and the publications that may result.
The Jester Dataset (save to disk, then unzip to obtain Excel files):
- jester-data-1.zip (3.9MB): Data from 24,983 users who have rated 36 or more jokes, a matrix with dimensions 24,983 X 101.
- jester-data-2.zip (3.6MB): Data from 23,500 users who have rated 36 or more jokes, a matrix with dimensions 23,500 X 101.
- jester-data-3.zip (2.1MB): Data from 24,938 users who have rated between 15 and 35 jokes, a matrix with dimensions 24,938 X 101.
Format:
- The 3 data files contain anonymous ratings data from 73,421 users.
- Data files are in .zip format; when unzipped, they are in Excel (.xls) format.
- Ratings are real values ranging from -10.00 to +10.00
(the value "99" corresponds to "null" = "not rated").
- One row per user
- The first column gives the number of jokes rated by that user.
The next 100 columns give the ratings for jokes 01 - 100.
- The sub-matrix including only columns
{5, 7, 8, 13, 15, 16, 17, 18, 19, 20} is dense.
Almost all users have rated those jokes (see discussion
of "universal queries" in the above paper).
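The row layout described above can be handled with a few lines of code. This is a minimal sketch, assuming each row has already been read from the spreadsheet into a list of 101 numbers (reading the .xls file itself would need a third-party Excel reader, which is not shown). It also assumes the indices {5, 7, ...} refer to joke numbers rather than spreadsheet column positions. The names `parse_row`, `dense_ratings`, and the constants are hypothetical.

```python
NOT_RATED = 99   # sentinel used in the data files for "null" / "not rated"
NUM_JOKES = 100
# Jokes almost all users have rated (assumed to be joke numbers, 1-based).
DENSE_JOKES = [5, 7, 8, 13, 15, 16, 17, 18, 19, 20]


def parse_row(row):
    """Split one 101-value row into (count, ratings).

    ratings[i] holds the rating for joke i + 1, or None if it was not rated.
    """
    if len(row) != NUM_JOKES + 1:
        raise ValueError(f"expected {NUM_JOKES + 1} values, got {len(row)}")
    count = int(row[0])  # first column: number of jokes rated by this user
    ratings = [None if value == NOT_RATED else float(value) for value in row[1:]]
    return count, ratings


def dense_ratings(ratings):
    """Return {joke_number: rating} restricted to the dense jokes."""
    return {joke: ratings[joke - 1] for joke in DENSE_JOKES}


if __name__ == "__main__":
    # Synthetic example row: a user who rated only the dense jokes.
    row = [10] + [NOT_RATED] * NUM_JOKES
    for joke in DENSE_JOKES:
        row[joke] = 2.5  # row[0] is the count, so row[joke] is joke number `joke`
    count, ratings = parse_row(row)
    print(count, dense_ratings(ratings))
```

Mapping the sentinel 99 to `None` keeps unrated entries from being mistaken for real ratings when computing averages or errors.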
Other Collaborative Filtering Datasets:
- The MovieLens Dataset: 1,000,000 integer ratings (from 1-5) of 3500 films from 6,040 users.
- The EachMovie Dataset: 2,811,983 integer ratings (from 1-5) of 1628 films from 72,916 users.
- The BookCrossing Dataset: 1,149,780 integer ratings (from 0-10) of 271,379 books from 278,858 users.
Papers using the Jester Dataset (or a subset from 18,000 users):
For further information please contact:
Ken Goldberg
goldberg @ berkeley.edu
Prof of IEOR and EECS
4135 Etcheverry Hall
University of California
Berkeley, CA 94720-1777
(510) 643-9565 (phone)
(510) 642-1403 (fax)