Datasets For Recommender Systems

This is a repository of public data sources for Recommender Systems (RS).

All of these recommendation datasets can be converted to the atomic files defined in RecBole, which is a unified, comprehensive and efficient recommendation library.

After conversion, you can easily use RecBole to test the performance of different recommender models on these datasets. For more information, please refer to the RecBole project.
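As a rough illustration of that workflow, the sketch below assumes that RecBole is installed and that the atomic files of a dataset sit under a local `dataset/` folder. The folder name, the dataset name `ml-100k`, the loaded columns, and the choice of the BPR model are assumptions made for illustration, not settings prescribed by this repository.

```python
# Hypothetical settings: the folder layout and column names are assumptions.
config_dict = {
    "data_path": "dataset/",  # RecBole would look for dataset/ml-100k/ml-100k.inter
    "load_col": {"inter": ["user_id", "item_id", "rating", "timestamp"]},
}


def evaluate(model="BPR", dataset="ml-100k"):
    """Train and evaluate one model on one converted dataset."""
    # Imported lazily so the settings above can be inspected without RecBole installed.
    from recbole.quick_start import run_recbole

    return run_recbole(model=model, dataset=dataset, config_dict=config_dict)
```

Calling `evaluate()` would start a full training and evaluation run, so it is left uncalled here.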

Usage

To use RecBole, you need to convert the original datasets into atomic files, the data format defined by RecBole.

We provide two ways to obtain the atomic files for these datasets:

  1. Download the raw dataset and process it with conversion tools we provide in this repository. Please refer to conversion tools.

  2. Directly download the processed atomic files. Baidu Wangpan (Password: e272), Google Drive.
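To give a feel for the target format, here is a minimal sketch of writing an interaction atomic file (`.inter`): a tab-separated file whose header names each column as `field:type`. The field names, the sample rows, and the helper `write_inter` are assumptions for illustration; this is not the conversion tool shipped in this repository.

```python
import io


def write_inter(rows, out):
    """Write (user, item, rating, timestamp) rows in a tab-separated atomic-file layout."""
    # Header: each column is declared as field:type (token for IDs, float for numbers).
    out.write("user_id:token\titem_id:token\trating:float\ttimestamp:float\n")
    for user_id, item_id, rating, timestamp in rows:
        out.write(f"{user_id}\t{item_id}\t{rating}\t{timestamp}\n")


# Made-up sample interactions, written to an in-memory buffer instead of a file.
raw_rows = [("196", "242", 3.0, 881250949), ("186", "302", 3.0, 891717742)]
buffer = io.StringIO()
write_inter(raw_rows, buffer)
inter_text = buffer.getvalue()
print(inter_text)
```

For a real dataset, the same function could be pointed at an opened file such as `open("mydata.inter", "w")`, where `mydata` is a placeholder name.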

Dataset links and brief introductions

Shopping

  • Amazon: Amazon Review Data includes reviews (ratings, text, helpfulness votes) and product metadata (descriptions, category information, price, brand, and image features). It is available in two versions: the original 2014 release and an updated 2018 release. Our processed datasets are detailed here.
    • Amazon 2014: This dataset contains product reviews and metadata from Amazon, including 24 categories and 142.8 million reviews spanning May 1996 - July 2014.
    • Amazon 2018: This dataset is an updated version of the Amazon review dataset released in 2014. It contains 233.1 million reviews across 29 categories (up from 142.8 million and 24 in 2014), spanning May 1996 - Oct 2018.
  • Alibaba-iFashion: This is a fashion outfit dataset collected from Alibaba's online shopping systems and introduced in the POG paper. The items from each outfit are viewed as the items being recommended to users, where each item consists of attributes such as category and title.
  • Epinions: This dataset was collected from Epinions.com, a popular online consumer review website. It contains trust relationships amongst users and spans more than a decade, from January 2001 to November 2013.
  • Yelp: This dataset was collected from Yelp. It is a subset of Yelp's businesses, reviews, and user data, released for personal, educational, and academic use. Counting from Yelp Challenge 2018 (the original link to this competition can no longer be found, and there will not be another round of the Yelp Dataset Challenge), there are four versions of the Yelp dataset in total. Yelp has also posted the dataset on Kaggle, where a few earlier versions can be downloaded as well. Our 5 processed datasets are detailed here.
    • Yelp 2018: It is the first version of Yelp dataset released in Yelp Challenge 2018 including 5,261,669 reviews.
    • Yelp 2020: It is the second version of Yelp dataset released in 2020, including 8,021,122 reviews.
    • Yelp 2021: It is the first version of Yelp dataset released in 2021, including 8,635,403 reviews.
    • Yelp 2022: It is the latest version of the Yelp dataset. It contains 908,915 tips by 1,987,897 users, over 1.2 million business attributes such as hours, parking, availability, and ambience, and aggregated check-ins over time for each of the 131,930 businesses.
    • Yelp-full: This is a combined dataset of the four Yelp versions mentioned above, with duplicates dropped; the total number of reviews is 28,908,240.
  • Tmall: This dataset is provided by Ant Financial Services and was used in the IJCAI16 contest.
  • DIGINETICA: The dataset includes user sessions extracted from e-commerce search engine logs, with anonymized user ids, hashed queries, hashed query terms, hashed product descriptions and meta-data, log-scaled prices, clicks, and purchases.
  • YOOCHOOSE: This dataset was constructed by YOOCHOOSE GmbH to support participants in the RecSys Challenge 2015.
  • Retailrocket: The data has been collected from a real-world e-commerce website. It is raw data, i.e. without any content transformations; however, all values are hashed due to confidentiality concerns.
  • Ta Feng: The dataset contains transaction data from a Chinese grocery store, covering November 2000 to February 2001.

Advertising

  • Criteo: This dataset was collected from Criteo, which consists of a portion of Criteo's traffic over a period of several days.

  • Avazu: This dataset was used in the Avazu CTR prediction contest.

  • iPinYou: This dataset was provided by iPinYou. It contains all training datasets and leaderboard testing datasets from the three seasons of the iPinYou Global RTB (Real-Time Bidding) Bidding Algorithm Competition.

  • AliEC: Ali_Display_Ad_Click is a click-through rate prediction dataset for display ads shown on the Taobao website. The dataset is provided by Alibaba.

Check-in

  • Foursquare: This dataset contains check-ins in NYC and Tokyo collected over about 10 months. Each check-in is associated with its time stamp, its GPS coordinates and its semantic meaning.

  • Gowalla: This dataset is from a location-based social networking website where users share their locations by checking in. It contains a total of 6,442,890 check-ins by these users over the period of Feb. 2009 - Oct. 2010.

Movies

  • MovieLens: GroupLens Research has collected and made available rating datasets from their movie website.
  • Netflix: This is the official data set used in the Netflix Prize competition.
  • Douban: Douban Movie is a Chinese website that allows Internet users to share their comments and viewpoints about movies. This dataset contains more than 2 million short comments on 28 movies from the Douban Movie website.
  • Twitch: This is a dataset of users consuming streaming content on Twitch. We retrieved all streamers, and all users connected in their respective chats, every 10 minutes during 43 days.
    • Twitch-100k: Twitch-100k is a subset of 100k users for benchmark purposes. The code is available in this GitHub repository.
    • Twitch-full: See the Google Drive folder containing all Twitch files. Twitch-full contains the full dataset while Twitch-100k is a subset.

Music

  • Last.FM: This dataset contains social networking, tagging, and music artist listening information from a set of 2K users of the Last.fm online music system.
  • LFM-1b: This dataset contains more than one billion music listening events created by more than 120,000 users of Last.FM. Each listening event is characterized by artist, album, and track name, and includes a timestamp.
  • Yahoo Music: This dataset represents a snapshot of the Yahoo! Music community's preferences for various musical artists.
  • KGRec: Music and Sound Recommendation with Knowledge Graphs comprises two datasets, one for music recommendation (KGRec-music) and the other for sound recommendation (KGRec-sound). Each provides users, items, implicit feedback interactions between users and items, item tags, and item text descriptions.
    • KGRec-music: All the data comes from songfacts.com and last.fm websites. Items are songs, which are described in terms of textual description extracted from songfacts.com, and tags from last.fm.
    • KGRec-sound: All the data comes from Freesound.org. Items are sounds, which are described in terms of textual description and tags created by the sound creator at uploading time.

Books

  • Book-Crossing: This dataset was collected by Cai-Nicolas Ziegler in a 4-week crawl (August / September 2004) from the Book-Crossing community with kind permission from Ron Hornbaker, CTO of Humankind Systems. It contains 278,858 users (anonymized but with demographic information) providing 1,149,780 ratings (explicit / implicit) about 271,379 books.

  • GoodReads: This dataset contains reviews from the Goodreads book review website, and a variety of attributes describing the items. Critically, the datasets have multiple levels of user interaction, ranging from adding to a shelf, to rating, to reading.

Games

  • Steam: This dataset contains reviews and game information from Steam, covering 7,793,069 reviews, 2,567,538 users, and 32,135 games. In addition to the review text, the data also includes the users' play hours in each review.

Anime

Pictures

  • Pinterest: This dataset was originally constructed in the paper "Learning image and user features for recommendations in social networks" for evaluating content-based image recommendation, and was later processed in the paper "Neural Collaborative Filtering".

Jokes

  • Jester: This dataset contains anonymous ratings of jokes by users of the Jester Joke Recommender System.

Exercises

  • KDD2010: This dataset was released in the KDD Cup 2010 Educational Data Mining Challenge. It contains records of students submitting exercises on the systems.

  • EndoMondo: This is a collection of workout logs from users of EndoMondo. Data includes multiple sources of sequential sensor data such as heart rate logs, speed, GPS, as well as sport type, gender and weather conditions.

Websites

  • Phishing Websites: This dataset contains 30 kinds of features of 11,055 websites and labels of whether they are phishing websites or not. The websites' features include 12 address-bar based features, 6 abnormal based features, 5 HTML-and-JavaScript based features and 7 domain based features.

  • Behance: This is a small, anonymized version of a larger proprietary dataset about likes and image data from the community art website Behance.

Adult

  • Adult: This dataset is extracted by Barry Becker from the 1994 Census database, which consists of a list of people's attributes and whether they make over 50k a year.

News

  • MIND: This is a large-scale dataset for news recommendation research. It was collected from anonymized behavior logs of the Microsoft News website. MIND contains about 160k English news articles and more than 15 million impression logs generated by 1 million users.

Food

  • DianPing: This dataset contains user reviews as well as detailed business metadata crawled from the well-known Chinese online review website DianPing.com, including 3,605,300 reviews by 510,071 users of 209,132 businesses.

  • Food: These datasets contain recipe details and reviews from Food.com (formerly GeniusKitchen). Data includes cooking recipes and review texts.

Beverages

  • BeerAdvocate: This dataset includes beer reviews with multiple rated dimensions, covering sensory aspects such as taste, look, feel, and smell.
  • RateBeer: This dataset contains beer reviews with multiple rated dimensions, including item attributes with sensory aspects such as taste, look, feel, and smell.

Clothes

Datasets information statistics

General Datasets

| SN | Dataset | #User | #Item | #Interaction | Sparsity | Interaction Type | TimeStamp | User Context | Item Context | Interaction Context |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | MovieLens | - | - | - | - | Rating |  |  |  |  |
| 2 | Anime | 73,515 | 11,200 | 7,813,737 | 99.05% | Rating [-1, 1-10] |  |  |  |  |
| 3 | Epinions | 116,260 | 41,269 | 188,478 | 99.99% | Rating [1-5] |  |  |  |  |
| 4 | Yelp (5 versions) | - | - | - | - | Rating [1-5] |  |  |  |  |
| 5 | Netflix | 480,189 | 17,770 | 100,480,507 | 98.82% | Rating [1-5] |  |  |  |  |
| 6 | Book-Crossing | 105,284 | 340,557 | 1,149,780 | 99.99% | Rating [0-10] |  |  |  |  |
| 7 | Jester | 73,421 | 101 | 4,136,360 | 44.22% | Rating [-10, 10] |  |  |  |  |
| 8 | Douban | 738,701 | 28 | 2,125,056 | 89.73% | Rating [0,5] |  |  |  |  |
| 9 | Yahoo Music | 1,948,882 | 98,211 | 11,557,943 | 99.99% | Rating [0, 100] |  |  |  |  |
| 10 | KDD2010 | - | - | - | - | Rating |  |  |  |  |
| 11 | Amazon (2014 & 2018) | - | - | - | - | Rating [0,5] |  |  |  |  |
| 12 | Pinterest | 55,187 | 9,911 | 1,445,622 | 99.74% | - |  |  |  |  |
| 13 | Gowalla | 107,092 | 1,280,969 | 6,442,892 | 99.99% | Check-in |  |  |  |  |
| 14 | Last.FM | 1,892 | 17,632 | 92,834 | 99.72% | Click |  |  |  |  |
| 15 | DIGINETICA | 204,789 | 184,047 | 993,483 | 99.99% | Click |  |  |  |  |
| 16 | Steam | 2,567,538 | 32,135 | 7,793,069 | 99.99% | Buy |  |  |  |  |
| 17 | Ta Feng | 32,266 | 23,812 | 817,741 | 99.89% | Click |  |  |  |  |
| 18 | Foursquare | - | - | - | - | Check-in |  |  |  |  |
| 19 | Tmall | 963,923 | 2,353,207 | 44,528,127 | 99.99% | Click/Buy |  |  |  |  |
| 20 | YOOCHOOSE | 9,249,729 | 52,739 | 34,154,697 | 99.99% | Click/Buy |  |  |  |  |
| 21 | Retailrocket | 1,407,580 | 247,085 | 2,756,101 | 99.99% | View/Addtocart/Transaction |  |  |  |  |
| 22 | LFM-1b | 120,322 | 3,123,496 | 1,088,161,692 | 99.71% | Click |  |  |  |  |
| 23 | MIND | - | - | - | - | Click |  |  |  |  |
| 24 | BeerAdvocate | 33,388 | 66,055 | 1,586,614 | 99.9281% | Rating [0,5] |  |  |  |  |
| 25 | Behance | 63,497 | 178,788 | 1,000,000 | 99.9912% | Likes |  |  |  |  |
| 26 | DianPing | 542,706 | 243,247 | 4,422,473 | 99.9967% | Rating [0,5] |  |  |  |  |
| 27 | EndoMondo | 1,104 | 253,020 | 253,020 | 99.9094% | Workout Logs |  |  |  |  |
| 28 | Food | 226,570 | 231,637 | 1,132,367 | 99.9978% | Rating [0,5] |  |  |  |  |
| 29 | GoodReads | 876,145 | 2,360,650 | 228,648,342 | 99.9889% | Rating [0,5] |  |  |  |  |
| 30 | KGRec | - | - | - | - | Click |  |  |  |  |
| 31 | ModCloth | 47,958 | 1,378 | 82,790 | 99.8747% | Rating [0,5] |  |  |  |  |
| 32 | RateBeer | 29,265 | 110,369 | 2,924,163 | 99.9095% | Overall Rating [0,20] |  |  |  |  |
| 33 | RentTheRunway | 105,571 | 5,850 | 192,544 | 99.9688% | Rating [0,10] |  |  |  |  |
| 34 | Twitch | 15,524,309 | 6,161,666 | 474,676,929 | 99.9995% | Click |  |  |  |  |
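
The Sparsity column appears to follow the usual definition: one minus the fraction of the user-item matrix that is filled by observed interactions. The sketch below assumes that definition (the source does not state the formula) and checks it against two rows of the table; the function name `sparsity` is introduced here for illustration.

```python
def sparsity(num_users, num_items, num_interactions):
    """Fraction of the user-item matrix that has no observed interaction."""
    return 1.0 - num_interactions / (num_users * num_items)


# Figures taken from the Netflix and Jester rows of the table above.
netflix = sparsity(480_189, 17_770, 100_480_507)
jester = sparsity(73_421, 101, 4_136_360)
print(f"Netflix: {netflix:.2%}")  # Netflix: 98.82%
print(f"Jester: {jester:.2%}")    # Jester: 44.22%
```

Jester stands out because it has only 101 items, so each user rates a large share of them.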

CTR Datasets

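| SN | Dataset | #User | #Item | #Interaction | Sparsity | Interaction Type | TimeStamp | User Context | Item Context | Interaction Context |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Criteo | - | - | 45,850,617 | - | Click |  |  |  |  |
| 2 | Avazu | - | - | 40,428,967 | - | Click [0, 1] |  |  |  |  |
| 3 | iPinYou | 19,731,660 | 163 | 24,637,657 | 99.23% | View/Click |  |  |  |  |
| 4 | Phishing websites | - | - | 11,055 | - |  |  |  |  |  |
| 5 | Adult | - | - | 32,561 | - | income>=50k [0, 1] |  |  |  |  |
| 6 | Alibaba-iFashion | 3,569,112 | 4,463,302 | 191,394,393 | 99.9988% | Click |  |  |  |  |
| 7 | AliEC | 491,647 | 240,130 | 1,366,056 | 99.9988% | Click |  |  |  |  |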

Knowledge-aware Datasets

These knowledge-aware recommender datasets are based on KB4Rec, which associates items from recommender systems with entities from Freebase. Note that the Amazon-book dataset is the version released in 2014.

Raw datasets information

| SN | Dataset | #Items | #Linked-Items | #Users | #Interactions |
|---|---|---|---|---|---|
| 1 | MovieLens | 27,278 | 25,503 | 138,493 | 20,000,263 |
| 2 | Amazon-book | 2,370,605 | 108,515 | 8,026,324 | 22,507,155 |
| 3 | LFM-1b (tracks) | 31,634,450 | 1,254,923 | 120,322 | 319,951,294 |

After filtering by 5-core (for LFM-1b, tracks listened to fewer than 10 times are also filtered out)

| SN | Dataset | #Items | #Linked-Items | #Users | #Interactions |
|---|---|---|---|---|---|
| 1 | MovieLens | 18,345 | 18,057 | 138,493 | 19,984,024 |
| 2 | Amazon-book | 367,982 | 34,476 | 603,668 | 8,898,041 |
| 3 | LFM-1b (tracks) | 615,823 | 337,349 | 79,133 | 15,765,756 |
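
For readers unfamiliar with the term, k-core filtering is commonly understood as repeatedly dropping users and items with fewer than k interactions until every remaining user and item has at least k. The sketch below follows that common definition; the exact procedure used to produce the figures above is not specified in the source, and the function name `k_core_filter` and the toy data are illustrative assumptions.

```python
from collections import Counter


def k_core_filter(interactions, k=5):
    """Iteratively keep only (user, item) pairs whose user and item both occur >= k times."""
    kept = list(interactions)
    while True:
        user_counts = Counter(user for user, _ in kept)
        item_counts = Counter(item for _, item in kept)
        filtered = [
            (user, item)
            for user, item in kept
            if user_counts[user] >= k and item_counts[item] >= k
        ]
        # Stop once a full pass removes nothing; otherwise recount and repeat.
        if len(filtered) == len(kept):
            return filtered
        kept = filtered


# Toy data with k=2: removing rare item "i3" leaves user "u3" with one interaction,
# so "u3" is dropped in the next pass.
toy = [("u1", "i1"), ("u1", "i2"), ("u2", "i1"), ("u2", "i2"), ("u3", "i1"), ("u3", "i3")]
core = k_core_filter(toy, k=2)
print(core)
```

The loop matters: a single pass would keep `("u3", "i1")` even though `u3` no longer meets the threshold once `i3` is gone.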
