AI/ML citation graph with postgres + graphql

papergraph

papergraph is a Rust library and binary for building and managing a citation graph from Semantic Scholar data, focused on AI/ML papers (for now). Data is stored in a postgres database with a Hasura GraphQL backend (schema) on top for easy graph queries. It comes with Jupyter notebooks that show you how to analyze and visualize the data.

Live version at https://papergraph.dbz.dev

Thanks to @ArtirKel for the useful feedback and ideas.

Notebooks

The following notebooks work out of the box using a publicly available API endpoint for the data. You can run them locally, or in the cloud via Google Colab. Please read the caveats about the public endpoint below!

Use Cases

  • Finding landmark papers - Papers with a large number of citations may be considered landmark papers. The ideas in such papers often form the foundation for incremental improvements. Given some arbitrary paper you're interested in, you may want to know which landmark papers you should study for the required background knowledge.
  • Reference research - When writing a paper, you don't want to miss prior work. Looking through the citation graph for a related paper can help you find potentially interesting papers to read and cite.
  • Graph Analysis - Run sophisticated graph algorithms on the dataset to gain insights.
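
The landmark-paper use case can be sketched on a tiny in-memory graph. This is only an illustration, not how the notebooks do it: the function name `landmark_candidates`, the dictionary layout, and the toy paper ids are all made up here, and in practice the reference links would come from the GraphQL API.

```python
from collections import Counter, deque


def landmark_candidates(references, start, max_depth=2, top_k=3):
    """Rank papers reachable from `start` via reference links by citation count.

    `references` maps a paper id to the list of paper ids it cites.
    """
    # Citation count = number of papers in the graph that cite a given paper.
    citation_counts = Counter(
        cited for cited_list in references.values() for cited in cited_list
    )

    # Breadth-first walk over the reference links, up to max_depth hops.
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        paper, depth = queue.popleft()
        if depth == max_depth:
            continue
        for cited in references.get(paper, []):
            if cited not in seen:
                seen.add(cited)
                queue.append((cited, depth + 1))

    seen.discard(start)
    # Sort by citation count (descending), then by id for a stable order.
    ranked = sorted(seen, key=lambda p: (-citation_counts[p], p))
    return [(p, citation_counts[p]) for p in ranked[:top_k]]


# Toy, made-up graph: each key cites the papers in its value list.
toy_references = {
    "new-paper": ["survey", "method-a"],
    "survey": ["foundation", "method-a"],
    "method-a": ["foundation"],
    "method-b": ["foundation"],
    "foundation": [],
}

print(landmark_candidates(toy_references, "new-paper"))
# [('foundation', 3), ('method-a', 2), ('survey', 1)]
```

Limiting the walk with `max_depth` keeps the candidate set close to the starting paper, so the ranking favors heavily cited work in its own lineage over papers that are merely popular.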

Graph Example

IMPORTANT! Using the public endpoint

The database is publicly available at http://34.107.246.233/v1/graphql, so please be gentle with your queries! This is running on a small postgres server that I'm paying for, so please don't overload it with automated scripts. Be nice :) As long as you're running queries by hand through notebooks everything should be fine.
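
One way to stay gentle when scripting against the endpoint is to put a small client-side delay between requests. The sketch below is an assumption about how you might do that, not part of papergraph: `Throttle`, `build_request`, and `run_query` are hypothetical helpers, and the one-second default interval is an arbitrary choice. The example query uses only `__typename`, which is valid against any GraphQL schema, so it assumes nothing about papergraph's tables.

```python
import json
import time
import urllib.request

ENDPOINT = "http://34.107.246.233/v1/graphql"  # public endpoint named above


class Throttle:
    """Enforce a minimum delay between consecutive calls to wait()."""

    def __init__(self, min_interval_s=1.0):
        self.min_interval_s = min_interval_s
        self._last = None

    def wait(self):
        now = time.monotonic()
        if self._last is not None:
            remaining = self.min_interval_s - (now - self._last)
            if remaining > 0:
                time.sleep(remaining)
        self._last = time.monotonic()


def build_request(query, variables=None, endpoint=ENDPOINT):
    """Build (but do not send) a GraphQL POST request."""
    body = json.dumps({"query": query, "variables": variables or {}}).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_query(query, variables=None, throttle=None, timeout_s=30):
    """Send one query, waiting on the throttle first if one is given."""
    if throttle is not None:
        throttle.wait()
    request = build_request(query, variables)
    with urllib.request.urlopen(request, timeout=timeout_s) as resp:
        return json.load(resp)


# Building the request touches no network; only run_query() sends anything.
request = build_request("query { __typename }")
print(request.get_method(), request.full_url)
```

To send queries, create one `Throttle()` and pass it to every `run_query(...)` call so that all requests share the same delay.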

If you want to do lots of queries you should clone this repo and build the database yourself locally or in the cloud. Instructions for this are below. If you are running Kubernetes, you can also use the scripts in deploy/.

Building the database from a postgresql snapshot

TODO. See this issue

Building the database from scratch

Requirements:

  • Docker

If you want to build the database from scratch, you must download the full S2 research corpus. The total compressed size is currently around 120GB.

Clone the repo and download the corpus

git clone https://github.com/dennybritz/papergraph
cd papergraph
aws s3 sync --no-sign-request s3://ai2-s2-research-public/open-corpus/2020-04-10/ data/s2-research-corpus

Start up an empty postgres database server and create the schema

export DATABASE_URL=postgres://papergraph:papergraph@postgres:5432/papergraph
export RUST_LOG=info

# Run the postgres docker container
docker-compose up postgres

# Set up the database and run migrations
docker run --rm --network papergraph_default \
  -e DATABASE_URL \
  dennybritz/papergraph \
  diesel database setup

Now that we have a postgres server with the right database schema running, we need to insert the data:

# Assuming you downloaded the data into data/s2-research-corpus
# as shown in the AWS command above
DATA_PATH=data/s2-research-corpus/s2-corpus-017.gz

# Repeat this for all files you want to insert
# This will take a while. On my laptop, each file takes around 1min.
docker run --rm -it --network papergraph_default \
  -e DATABASE_URL -e RUST_LOG \
  -v `pwd`/${DATA_PATH}:/data/${DATA_PATH} \
  dennybritz/papergraph \
  papergraph insert -d /data/${DATA_PATH}
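
Repeating the insert by hand for every file gets tedious, so the step could be scripted. The following is a minimal sketch under some assumptions: `insert_command` and `insert_all` are hypothetical helpers, the `s2-corpus-*.gz` glob is guessed from the single file name shown above, and `-it` is dropped because a script does not need an interactive terminal. It expects `DATABASE_URL` and `RUST_LOG` to be exported as before.

```python
import subprocess
from pathlib import Path


def insert_command(data_path, cwd):
    """Build the `docker run` argv for one corpus file (path relative to cwd)."""
    host_file = Path(cwd) / data_path
    # Same mount layout as the manual command: the file appears under /data/.
    container_file = f"/data/{data_path}"
    return [
        "docker", "run", "--rm", "--network", "papergraph_default",
        "-e", "DATABASE_URL", "-e", "RUST_LOG",
        "-v", f"{host_file}:{container_file}",
        "dennybritz/papergraph",
        "papergraph", "insert", "-d", container_file,
    ]


def insert_all(corpus_dir="data/s2-research-corpus",
               pattern="s2-corpus-*.gz",  # assumed naming scheme
               dry_run=True):
    """Insert every matching file in order; returns the number of files found."""
    cwd = Path.cwd()
    files = sorted(Path(corpus_dir).glob(pattern))
    for path in files:
        cmd = insert_command(path.as_posix(), cwd)
        if dry_run:
            # Only show what would be run.
            print(" ".join(cmd))
        else:
            # Stop at the first failing file instead of continuing silently.
            subprocess.run(cmd, check=True)
    return len(files)


# Show the command for the example file without running docker.
print(" ".join(insert_command(
    "data/s2-research-corpus/s2-corpus-017.gz", Path.cwd()
)))
```

Run `insert_all()` from the repository root to preview the commands, then `insert_all(dry_run=False)` to execute them one after another.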

Now that we have seeded the database, we can also start Hasura to serve the graphql API. Stop the postgres docker process with ctrl+c and run

docker-compose up

You should now be able to access the API via http://localhost:8080.

Freshness

papergraph is updated when new data snapshots become available. This typically happens once a month. This means it will not contain all the latest papers.

Misc

Generating postgres database dumps

pg_dump -h localhost -p 15432 -F tar -U papergraph papergraph > pg_dump.tar

Build docker image

docker build -t dennybritz/papergraph .

Export graphql schema

gq http://34.107.246.233/v1/graphql --introspect > hasura/schema.graphql  
