Jester Collaborative Filtering Dataset

Anonymous Ratings Data from
the Jester Online Joke Recommender System

Please See:
the Updated Jester Collaborative Filtering Dataset

Old page below:

Collaborative Filtering Data:

4.1 million continuous ratings (-10.00 to +10.00) of 100 jokes from 73,421 users, collected between April 1999 and May 2003.

Freely available for research use when acknowledged with the following reference:

Eigentaste: A Constant Time Collaborative Filtering Algorithm. Ken Goldberg, Theresa Roeder, Dhruv Gupta, and Chris Perkins. Information Retrieval, 4(2), 133-151. July 2001.

(Aside: many papers, including ours, report Normalized Mean Absolute Error (NMAE) rates of approximately 20%. How good is this compared with random guessing? In the Appendix to our paper, we show that if user ratings are uniformly distributed, random guessing yields NMAE = 33%.)
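The 33% baseline can be checked numerically. The sketch below is illustrative and not taken from the paper: it assumes NMAE is the mean absolute error divided by the width of the rating scale (here 20, for -10.00 to +10.00), and that "random guessing" means drawing each prediction uniformly and independently from the same scale. The function names `nmae` and `random_guess_nmae` are hypothetical.

```python
import random

R_MIN, R_MAX = -10.0, 10.0  # rating scale stated on this page


def nmae(predictions, ratings, r_min=R_MIN, r_max=R_MAX):
    """Mean absolute error divided by the width of the rating scale.

    Assumption: this is the normalization meant by "NMAE" above.
    """
    if len(predictions) != len(ratings) or not ratings:
        raise ValueError("need two non-empty sequences of equal length")
    mae = sum(abs(p - r) for p, r in zip(predictions, ratings)) / len(ratings)
    return mae / (r_max - r_min)


def random_guess_nmae(n_samples=200_000, seed=0):
    """Monte Carlo estimate of NMAE for uniform ratings and uniform guesses.

    Assumption: guesses are independent of the true ratings.
    """
    rng = random.Random(seed)
    ratings = [rng.uniform(R_MIN, R_MAX) for _ in range(n_samples)]
    guesses = [rng.uniform(R_MIN, R_MAX) for _ in range(n_samples)]
    return nmae(guesses, ratings)


if __name__ == "__main__":
    # Two independent uniform draws on an interval of width w differ by w/3
    # on average, so the estimate should land close to 1/3.
    print(round(random_guess_nmae(), 3))
```

Under these assumptions, a reported NMAE of about 20% is a clear improvement over the roughly 33% that uninformed guessing would give.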

As a courtesy, if you use the data, I would appreciate knowing your name, what research group you are in, and the publications that may result.

The Jester Dataset (save to disk, then unzip to obtain Excel files):

Format:

  1. Three data files contain anonymous ratings data from 73,421 users.
  2. Data files are in .zip format; when unzipped, they are in Excel (.xls) format.
  3. Ratings are real values ranging from -10.00 to +10.00 (the value "99" corresponds to "null" = "not rated").
  4. Each row holds the data for one user.
  5. The first column gives the number of jokes rated by that user. The next 100 columns give the ratings for jokes 01 - 100.
  6. The sub-matrix including only columns {5, 7, 8, 13, 15, 16, 17, 18, 19, 20} is dense. Almost all users have rated those jokes (see discussion of "universal queries" in the above paper).
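The row layout described above can be handled as in the following minimal sketch. It assumes each row has already been read from a spreadsheet into a plain list of 101 numbers (reading the .xls file itself is not shown), and it assumes the dense column numbers refer to joke numbers 1 to 100 and not to spreadsheet columns. The names `parse_row` and `dense_ratings` are hypothetical.

```python
NOT_RATED = 99  # sentinel described in item 3: "null" = "not rated"

# Assumption: these are joke numbers (1-based), not spreadsheet column indices.
DENSE_JOKES = (5, 7, 8, 13, 15, 16, 17, 18, 19, 20)


def parse_row(row):
    """Split one user's row into (count, ratings).

    row[0] is the number of jokes the user rated; row[1:] holds the 100
    ratings, with 99 replaced by None so it is never mistaken for a rating.
    """
    if len(row) != 101:
        raise ValueError("expected 1 count column plus 100 rating columns")
    count = int(row[0])
    ratings = [None if value == NOT_RATED else float(value) for value in row[1:]]
    return count, ratings


def dense_ratings(ratings):
    """Return the ratings for the ten jokes listed in item 6, in order."""
    return [ratings[joke - 1] for joke in DENSE_JOKES]


if __name__ == "__main__":
    # Synthetic row for illustration only: a user who rated just joke 5.
    row = [1] + [99] * 100
    row[5] = 3.5  # spreadsheet column 5 (0-based) holds joke 5
    count, ratings = parse_row(row)
    rated = sum(r is not None for r in ratings)
    print(count == rated, dense_ratings(ratings)[0])
```

Converting 99 to an explicit missing value before any arithmetic matters because 99 lies outside the rating scale and would otherwise distort averages and error measures. The count in the first column can serve as a consistency check against the number of non-missing ratings.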

Other Collaborative Filtering Datasets:

Papers using the Jester Dataset (or a subset from 18,000 users):

For further information please contact:

Ken Goldberg
goldberg @ berkeley.edu
Professor of IEOR and EECS
4135 Etcheverry Hall
University of California
Berkeley, CA 94720-1777
(510) 643-9565 (phone)
(510) 642-1403 (fax)