zygmuntz/goodbooks-10k

Stars
788
Rank 57,762 (Top 2 %)
Language
Jupyter Notebook
License
Other
Created about 7 years ago
Updated over 1 year ago

Ten thousand books, six million ratings

goodbooks-10k

This dataset contains six million ratings for the ten thousand most popular books (those with the most ratings). There are also:

books marked to read by the users
book metadata (author, year, etc.)
tags/shelves/genres

Access

Some of these files are quite large, so GitHub won't show their contents online. See samples/ for smaller CSV snippets.

Open the notebook for a quick look at the data. Download individual zipped files from releases.

The dataset is accessible from Spotlight, recommender software based on PyTorch.

Contents

ratings.csv contains ratings sorted by time. It is 69MB and looks like this:

user_id,book_id,rating
1,258,5
2,4081,4
2,260,5
2,9296,5
2,2318,3

Ratings go from one to five. Both book IDs and user IDs are contiguous. For books, they are 1-10000, for users, 1-53424.
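As a minimal sketch of how the file might be loaded and checked against the ranges stated above, the following uses pandas on the five sample rows. The function name `load_ratings` is hypothetical; in practice you would pass the path to ratings.csv instead of the in-memory sample.

```python
import io

import pandas as pd

# The five sample rows shown above; replace with the path to ratings.csv.
SAMPLE = """user_id,book_id,rating
1,258,5
2,4081,4
2,260,5
2,9296,5
2,2318,3
"""


def load_ratings(source):
    """Read ratings and verify the documented value ranges (hypothetical helper)."""
    ratings = pd.read_csv(source)
    if not ratings["rating"].between(1, 5).all():
        raise ValueError("rating outside 1-5")
    if not ratings["book_id"].between(1, 10000).all():
        raise ValueError("book_id outside 1-10000")
    if not ratings["user_id"].between(1, 53424).all():
        raise ValueError("user_id outside 1-53424")
    return ratings


ratings = load_ratings(io.StringIO(SAMPLE))

# Mean rating per book, indexed by book_id.
mean_per_book = ratings.groupby("book_id")["rating"].mean()
```

Because the IDs are contiguous and start at one, subtracting one gives zero-based indices suitable for an embedding table or a sparse matrix.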

to_read.csv provides IDs of the books marked "to read" by each user, as user_id,book_id pairs, sorted by time. There are close to a million pairs.
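The pairs can be collected into one list per user, for example as below. This is a sketch that assumes the file has a `user_id,book_id` header row; the three sample rows are made up for illustration and are not taken from the dataset.

```python
import io

import pandas as pd

# Made-up rows for illustration; replace with the path to to_read.csv.
SAMPLE = """user_id,book_id
1,112
1,235
2,4
"""

to_read = pd.read_csv(io.StringIO(SAMPLE))

# groupby keeps the original row order within each group,
# so each list preserves the time ordering of the file.
wishlists = to_read.groupby("user_id", sort=False)["book_id"].apply(list)
```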

books.csv has metadata for each book (goodreads IDs, authors, title, average rating, etc.). The metadata have been extracted from goodreads XML files, available in books_xml.

Tags

book_tags.csv contains tags/shelves/genres assigned by users to books. Tags in this file are represented by their IDs. They are sorted by goodreads_book_id ascending and count descending.

In raw XML files, tags look like this:

<popular_shelves>
	<shelf name="science-fiction" count="833"/>
	<shelf name="fantasy" count="543"/>
	<shelf name="sci-fi" count="542"/>
	...
	<shelf name="for-fun" count="8"/>
	<shelf name="all-time-favorites" count="8"/>
	<shelf name="science-fiction-and-fantasy" count="7"/>	
</popular_shelves>
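If you work with the raw XML directly, a fragment like this can be read with the standard library. This is a sketch: `shelf_counts` is a hypothetical helper, the snippet is a shortened copy of the example above, and the real files in books_xml may wrap this element in a larger document.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the fragment above (the elided rows are left out).
SNIPPET = """<popular_shelves>
    <shelf name="science-fiction" count="833"/>
    <shelf name="fantasy" count="543"/>
    <shelf name="sci-fi" count="542"/>
</popular_shelves>"""


def shelf_counts(xml_text):
    """Return a {shelf name: count} dict for every <shelf> element found."""
    root = ET.fromstring(xml_text)
    # iter() walks all descendants, so nesting depth does not matter.
    return {s.get("name"): int(s.get("count")) for s in root.iter("shelf")}


counts = shelf_counts(SNIPPET)
```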

In book_tags.csv, each tag/shelf is represented by an ID. tags.csv translates tag IDs to names.
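Joining the two files gives readable tag names per book. The sketch below assumes the columns are named `goodreads_book_id`, `tag_id`, `count` in book_tags.csv and `tag_id`, `tag_name` in tags.csv; the tag IDs and book ID in the sample rows are made up for illustration.

```python
import io

import pandas as pd

# Made-up IDs for illustration; replace with the paths to the real files.
BOOK_TAGS = """goodreads_book_id,tag_id,count
1,10,833
1,11,543
1,12,542
"""
TAGS = """tag_id,tag_name
10,science-fiction
11,fantasy
12,sci-fi
"""

book_tags = pd.read_csv(io.StringIO(BOOK_TAGS))
tags = pd.read_csv(io.StringIO(TAGS))

# Attach the tag name to every (book, tag) row.
named = book_tags.merge(tags, on="tag_id", how="left")

# Keep the two most frequent tags for each book.
top = (
    named.sort_values(["goodreads_book_id", "count"], ascending=[True, False])
    .groupby("goodreads_book_id")
    .head(2)
)
```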

goodreads IDs

Each book may have many editions. goodreads_book_id and best_book_id generally point to the most popular edition of a given book, while goodreads work_id refers to the book in the abstract sense.

You can use the goodreads book and work IDs to create URLs as follows:

https://www.goodreads.com/book/show/2767052
https://www.goodreads.com/work/editions/2792775
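The two URL patterns above can be wrapped in small helpers. The function names are hypothetical; only the URL formats come from the examples above.

```python
BOOK_URL = "https://www.goodreads.com/book/show/{}"
WORK_URL = "https://www.goodreads.com/work/editions/{}"


def book_url(goodreads_book_id: int) -> str:
    """URL of a specific edition, built from a goodreads book ID."""
    return BOOK_URL.format(goodreads_book_id)


def work_url(work_id: int) -> str:
    """URL listing the editions of a work, built from a goodreads work ID."""
    return WORK_URL.format(work_id)
```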

Note that book_id in ratings.csv and to_read.csv maps to work_id, not to goodreads_book_id, meaning that ratings for different editions are aggregated.
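To attach goodreads IDs to ratings, one option is to join on `book_id` through books.csv. This sketch assumes books.csv has `book_id`, `goodreads_book_id` and `work_id` columns; the ID values in the sample rows are made up for illustration.

```python
import io

import pandas as pd

# Made-up goodreads IDs for illustration; replace with the real files.
BOOKS = """book_id,goodreads_book_id,work_id
258,1001,5001
260,1002,5002
"""
RATINGS = """user_id,book_id,rating
1,258,5
2,260,5
"""

books = pd.read_csv(io.StringIO(BOOKS))
ratings = pd.read_csv(io.StringIO(RATINGS))

# validate="many_to_one" raises if book_id is not unique in books.
with_ids = ratings.merge(
    books[["book_id", "goodreads_book_id", "work_id"]],
    on="book_id",
    how="left",
    validate="many_to_one",
)
```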
