Jester Collaborative Filtering Dataset

Anonymous Ratings Data from
the Jester Online Joke Recommender System

Please See:
the Updated Jester Collaborative Filtering Dataset

Old page below:

Collaborative Filtering Data:

4.1 million continuous ratings (-10.00 to +10.00) of 100 jokes from 73,421 users, collected between April 1999 and May 2003.

Freely available for research use when acknowledged with the following reference:

Eigentaste: A Constant Time Collaborative Filtering Algorithm. Ken Goldberg, Theresa Roeder, Dhruv Gupta, and Chris Perkins. Information Retrieval, 4(2), 133-151. July 2001.

(Aside: many papers, including ours, report Normalized Mean Absolute Error (NMAE) rates of approximately 20%. How good is this compared with random guessing? In the Appendix to our paper, we show that if user ratings are uniformly distributed, random guessing yields an NMAE of 33%.)
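The 33% figure can be checked numerically. The sketch below is not the derivation from the paper's Appendix; it is a small simulation under two stated assumptions: NMAE is taken to be the mean absolute error divided by the width of the rating range (here 20), and the random guesses are drawn uniformly and independently of the ratings. The expected absolute difference of two independent uniform draws on an interval of width w is w/3, so the simulated value should land close to 1/3. The function name nmae and the sample size are illustrative choices.

```python
import random


def nmae(predictions, ratings, r_min=-10.0, r_max=10.0):
    """Mean absolute error divided by the width of the rating range.

    Assumed definition of NMAE; r_min and r_max default to the Jester scale.
    """
    total = sum(abs(p - r) for p, r in zip(predictions, ratings))
    return total / len(ratings) / (r_max - r_min)


rng = random.Random(0)  # fixed seed so the run is repeatable
n = 200_000             # illustrative sample size

# Assumption: both the true ratings and the guesses are uniform on [-10, 10].
ratings = [rng.uniform(-10.0, 10.0) for _ in range(n)]
guesses = [rng.uniform(-10.0, 10.0) for _ in range(n)]

random_nmae = nmae(guesses, ratings)
print(f"NMAE of random guessing: {random_nmae:.3f}")
```

Real Jester ratings are not uniformly distributed, so this only illustrates the baseline quoted in the aside, not the error a random predictor would score on the actual data.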

As a courtesy, if you use the data, I would appreciate knowing your name, what research group you are in, and the publications that may result.

The Jester Dataset (save to disk, then unzip to obtain Excel files):

jester-data-1.zip : (3.9MB) Data from 24,983 users who have rated 36 or more jokes, a matrix with dimensions 24,983 X 101.
jester-data-2.zip : (3.6MB) Data from 23,500 users who have rated 36 or more jokes, a matrix with dimensions 23,500 X 101.
jester-data-3.zip : (2.1MB) Data from 24,938 users who have rated between 15 and 35 jokes, a matrix with dimensions 24,938 X 101.

Format:

The 3 data files together contain anonymous ratings data from 73,421 users.
Data files are in .zip format; when unzipped, they are in Excel (.xls) format.
Ratings are real values ranging from -10.00 to +10.00 (the value "99" corresponds to "null" = "not rated").
One row per user
The first column gives the number of jokes rated by that user. The next 100 columns give the ratings for jokes 01 - 100.
The sub-matrix including only columns {5, 7, 8, 13, 15, 16, 17, 18, 19, 20} is dense. Almost all users have rated those jokes (see discussion of "universal queries" in the above paper).
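The row layout described above can be handled with a few lines of code. The following is a minimal sketch, not an official loader: it works on a row that has already been read into a list of 101 numbers (reading the .xls files themselves is left out), treats the value 99 as "not rated", and assumes that the numbers in the dense set refer to joke numbers rather than spreadsheet column positions. The names parse_row, NOT_RATED, and DENSE_JOKES are hypothetical, and the example row is made up for illustration.

```python
NOT_RATED = 99   # sentinel value meaning "null" / "not rated"
NUM_JOKES = 100
# Assumption: these are joke numbers, not spreadsheet column indices.
DENSE_JOKES = [5, 7, 8, 13, 15, 16, 17, 18, 19, 20]


def parse_row(row):
    """Split one 101-value user row into (count, {joke_number: rating})."""
    if len(row) != NUM_JOKES + 1:
        raise ValueError("expected 101 values per user row")
    count = int(row[0])  # first column: number of jokes rated
    ratings = {
        joke: value
        for joke, value in enumerate(row[1:], start=1)
        if value != NOT_RATED  # drop unrated jokes
    }
    return count, ratings


# Made-up user who rated only the ten jokes in the dense set.
example_row = [10] + [NOT_RATED] * NUM_JOKES
for joke in DENSE_JOKES:
    example_row[joke] = -2.5 + joke * 0.25  # arbitrary ratings in range

count, ratings = parse_row(example_row)
print(count == len(ratings))           # stored count matches parsed ratings
print(sorted(ratings) == DENSE_JOKES)  # only the dense-set jokes are rated
```

Comparing the stored count with the number of parsed ratings is a cheap sanity check that the sentinel is being handled correctly before any averaging is done; leaving 99 in the data would badly skew means on a -10 to +10 scale.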

Other Collaborative Filtering Datasets:

The MovieLens Dataset : 1,000,000 integer ratings (from 1-5) of 3500 films from 6,040 users.
The EachMovie Dataset : 2,811,983 integer ratings (from 1-5) of 1628 films from 72,916 users.
The BookCrossing Dataset : 1,149,780 integer ratings (from 0-10) of 271,379 books from 278,858 users.

Papers using the Jester Dataset (or a subset from 18,000 users):

Weighted Low-Rank Approximations, Nathan Srebro and Tommi Jaakkola (MIT), to appear in ICML 2003.
Collaborative Filtering with Privacy via Factor Analysis, John Canny (UC Berkeley), ACM SIGIR, Tampere Finland, August 2002.

For further information please contact:

Ken Goldberg
goldberg @ berkeley.edu
Prof of IEOR and EECS
4135 Etcheverry Hall
University of California
Berkeley, CA 94720-1777
(510) 643-9565 (phone)
(510) 642-1403 (fax)
