

Platform for designing and evaluating Graph Neural Networks (GNN)

GraphGym

GraphGym is a platform for designing and evaluating Graph Neural Networks (GNN). GraphGym was proposed in Design Space for Graph Neural Networks, Jiaxuan You, Rex Ying, Jure Leskovec, NeurIPS 2020 Spotlight.

Please also refer to PyG for a tightly integrated version of GraphGym and PyG.

Highlights

1. Highly modularized pipeline for GNN

  • Data: Data loading, data splitting
  • Model: Modularized GNN implementation
  • Tasks: Node / edge / graph level GNN tasks
  • Evaluation: Accuracy, ROC AUC, ...

2. Reproducible experiment configuration

  • Each experiment is fully described by a configuration file

3. Scalable experiment management

  • Easily launch thousands of GNN experiments in parallel
  • Auto-generate experiment analyses and figures across random seeds and experiments.

4. Flexible user customization

  • Easily register your own modules in graphgym/contrib/, such as data loaders, GNN layers, loss functions, etc.

News

  • GraphGym 0.3.0 has been released. You can now install the stable version of GraphGym via pip install graphgym.
  • GraphGym 0.2.0 has been released. GraphGym now supports the PyTorch Geometric backend in addition to the default DeepSNAP backend. You can try it out with run_single_pyg.sh:
cd run
bash run_single_pyg.sh 

Example use cases

Why GraphGym?

TL;DR: GraphGym is great for GNN beginners, domain experts and GNN researchers.

Scenario 1: You are a beginner to GNN, who wants to understand how GNN works.

You have probably read many exciting papers on GNNs and tried to write your own GNN implementation. Even with existing GNN packages, you still have to code up the essential pipeline on your own. GraphGym is a good place for you to start learning standardized GNN implementation and evaluation.


Figure 1: Modularized GNN implementation.

Scenario 2: You want to apply GNN to your exciting applications.

You probably know that there are hundreds of possible GNN models, and selecting the best model is notoriously hard. Even worse, we have shown in our paper that the best GNN designs for different tasks differ drastically. GraphGym provides a simple interface to try out thousands of GNNs in parallel and understand the best designs for your specific task. GraphGym also recommends a "go-to" GNN design space, after investigating 10 million GNN model-task combinations.


Figure 2: A guideline for desirable GNN design choices.

(Sampling from 10 million GNN model-task combinations.)

Scenario 3: You are a GNN researcher, who wants to innovate GNN models / propose new GNN tasks.

Say you have proposed a new GNN layer ExampleConv. GraphGym can help you convincingly argue that ExampleConv is better than, say, GCNConv: when randomly sampling from 10 million possible model-task combinations, how often does ExampleConv outperform GCNConv when everything else is fixed (including the computational cost)? Moreover, GraphGym can help you easily run a hyper-parameter search and visualize which design choices are better. In sum, GraphGym can greatly facilitate your GNN research.


Figure 3: Evaluation of a given GNN design dimension (BatchNorm here).
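
The controlled comparison described in Scenario 3 can be sketched in plain Python. This is only an illustration under stated assumptions, not GraphGym's analysis code: the function name win_rate and the accuracy values are hypothetical placeholders. Each pair is assumed to hold the scores of two runs that differ only in the layer type.

```python
from typing import Sequence, Tuple


def win_rate(pairs: Sequence[Tuple[float, float]]) -> float:
    """Fraction of paired setups in which the first design beats the second.

    Each pair is (score_a, score_b) measured under an identical sampled
    setup (same task, same remaining design choices, same budget).
    Ties count as half a win.
    """
    if not pairs:
        raise ValueError("need at least one paired comparison")
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a, b in pairs)
    return wins / len(pairs)


# Placeholder numbers for illustration only (not real results).
paired_scores = [(0.81, 0.79), (0.64, 0.66), (0.90, 0.90), (0.72, 0.70)]
rate = win_rate(paired_scores)
print(f"ExampleConv wins in {rate:.1%} of sampled setups")
```

Pairing the runs is what keeps the comparison controlled: a setup that is hard for both layers does not count against either of them.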

Installation

Requirements

  • CPU or NVIDIA GPU, Linux, Python3
  • PyTorch, various Python packages; Instructions for installing these dependencies are found below

1. Python environment (Optional): We recommend using the Conda package manager

conda create -n graphgym python=3.7
source activate graphgym

2. PyTorch: Install PyTorch. We have verified GraphGym under PyTorch 1.8.0, and GraphGym should work with PyTorch 1.4.0+. For example:

# CUDA versions: cpu, cu92, cu101, cu102, cu110, cu111
pip install torch==1.8.0+cu101 -f https://download.pytorch.org/whl/torch_stable.html

3. PyTorch Geometric: Install PyTorch Geometric by following its installation instructions. For example:

# CUDA versions: cpu, cu92, cu101, cu102, cu110, cu111
# TORCH versions: 1.4.0, 1.5.0, 1.6.0, 1.7.0, 1.8.0
CUDA=cu101
TORCH=1.8.0
pip install torch-scatter -f https://pytorch-geometric.com/whl/torch-${TORCH}+${CUDA}.html
pip install torch-sparse -f https://pytorch-geometric.com/whl/torch-${TORCH}+${CUDA}.html
pip install torch-cluster -f https://pytorch-geometric.com/whl/torch-${TORCH}+${CUDA}.html
pip install torch-spline-conv -f https://pytorch-geometric.com/whl/torch-${TORCH}+${CUDA}.html
pip install torch-geometric

4. GraphGym and other dependencies:

git clone https://github.com/snap-stanford/GraphGym
cd GraphGym
pip install -r requirements.txt
pip install -e .  # From the latest version
pip install graphgym # (Optional) From the PyPI stable version

5. Test the installation

Run a single experiment. Run a test GNN experiment using GraphGym run_single.sh. Configurations are specified in example.yaml. The experiment is node classification on the Cora dataset (random 80/20 train/val split).

cd run
bash run_single.sh # run a single experiment
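
A random 80/20 train/val split like the one used in this test can be sketched as follows. This is a minimal illustration, not GraphGym's splitting code; the function name random_split and its parameters are assumptions made for the example.

```python
import random
from typing import List, Tuple


def random_split(num_nodes: int, train_ratio: float = 0.8,
                 seed: int = 0) -> Tuple[List[int], List[int]]:
    """Shuffle node indices and cut them into train and validation parts."""
    indices = list(range(num_nodes))
    random.Random(seed).shuffle(indices)  # local RNG keeps the split reproducible
    cut = int(train_ratio * num_nodes)
    return indices[:cut], indices[cut:]


train_idx, val_idx = random_split(num_nodes=2708)  # Cora has 2,708 nodes
print(len(train_idx), len(val_idx))
```

Changing the seed gives a different split, which is why repeated runs with different random seeds can report different validation results.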

Run a batch of experiments. Run a batch of GNN experiments using GraphGym run_batch.sh. Configurations are specified in example.yaml (controls the basic architecture) and example.txt (controls how to do grid search). The experiment examines 96 models in the recommended GNN design space, on 2 graph classification datasets. Each experiment is repeated 3 times, and up to 8 jobs run concurrently. Depending on your infrastructure, finishing all the experiments may take a long time; you can stop the experiments with Ctrl-C (GraphGym will properly kill all the processes).

cd run
bash run_batch.sh # run a batch of experiments 

(Optional) Run GraphGym with CPU backend. GraphGym also supports a CPU backend: you only need to add the line device: cpu to the .yaml file. An example is provided here.

cd run
bash run_single_cpu.sh # run a single experiment using CPU backend

(Optional) Run GraphGym with PyG backend. Run GraphGym with the PyTorch Geometric (PyG) backend, instead of the default DeepSNAP backend, using run_single_pyg.sh and run_batch_pyg.sh. The PyG backend follows the native PyG implementation and is slightly more efficient than the DeepSNAP backend. Currently the PyG backend only supports user-provided dataset splits, such as PyG native datasets or OGB datasets.

cd run
bash run_single_pyg.sh # run a single experiment using PyG backend
bash run_batch_pyg.sh # run a batch of experiments using PyG backend 

GraphGym In-depth Usage

1 Run a single GNN experiment

A full example is specified in run/run_single.sh.

1.1 Specify a configuration file. In GraphGym, an experiment is fully specified by a .yaml file. Configurations that are not specified in the .yaml file are populated with the default values in graphgym/config.py. For example, run/configs/example.yaml contains configurations for the dataset, training, model, GNN, etc. Each configuration is documented in graphgym/config.py.
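
The idea of filling unspecified fields with defaults can be sketched with plain dictionaries. This is a minimal illustration and not GraphGym's configuration code: merge_config and all field names and values below are hypothetical, and the real defaults live in graphgym/config.py.

```python
import copy
from typing import Any, Dict


def merge_config(defaults: Dict[str, Any], overrides: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the defaults, updated recursively with the overrides."""
    merged = copy.deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_config(merged[key], value)  # descend into sections
        else:
            merged[key] = value  # leaf value from the user wins
    return merged


# Hypothetical defaults and user file contents, for illustration only.
defaults = {"device": "auto", "gnn": {"layers": 2, "dim": 64}, "optim": {"lr": 0.01}}
user_yaml = {"device": "cpu", "gnn": {"layers": 3}}

cfg = merge_config(defaults, user_yaml)
print(cfg)
```

Only the fields named by the user change; everything else keeps its default, so a short .yaml file still describes a complete experiment.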

1.2 Launch an experiment. For example, in run/run_single.sh:

python main.py --cfg configs/example.yaml --repeat 3

You can specify the number of different random seeds to repeat via --repeat.

1.3 Understand the results. Experimental results are automatically saved in the directory run/results/${CONFIG_NAME}/; in the example above, it is run/results/example/. Results for different random seeds are saved in different subdirectories, such as run/results/example/2. The results aggregated over all random seeds are automatically written to run/results/example/agg, including the mean and the standard deviation (suffix _std) of each metric. Train/val/test results are further saved into subdirectories, such as run/results/example/agg/val; there, stats.json stores the per-epoch results aggregated across random seeds, and best.json stores the results at the epoch with the highest validation accuracy.
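
The per-epoch aggregation across random seeds can be sketched as below. This is an illustration under assumptions, not GraphGym's aggregation code: aggregate_seeds is a hypothetical helper, the numbers are placeholders, and the use of the population standard deviation is a guess.

```python
from statistics import mean, pstdev
from typing import Dict, List


def aggregate_seeds(runs: List[List[Dict[str, float]]]) -> List[Dict[str, float]]:
    """Aggregate per-epoch metrics across seeds.

    runs[s][e] is the metric dict of seed s at epoch e. The result has one
    dict per epoch with the mean of each metric and a '<metric>_std' entry.
    """
    aggregated = []
    for epoch_stats in zip(*runs):  # one tuple of per-seed dicts per epoch
        row = {}
        for metric in epoch_stats[0]:
            values = [stats[metric] for stats in epoch_stats]
            row[metric] = mean(values)
            row[f"{metric}_std"] = pstdev(values)  # population std; an assumption
        aggregated.append(row)
    return aggregated


# Placeholder numbers for two seeds and two epochs.
runs = [
    [{"accuracy": 0.5}, {"accuracy": 0.75}],
    [{"accuracy": 0.75}, {"accuracy": 0.75}],
]
for row in aggregate_seeds(runs):
    print(row)
```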

2 Run a batch of GNN experiments

A full example is specified in run/run_batch.sh.

2.1 Specify a base file. GraphGym supports running a batch of experiments. To start, select a base architecture via --config. The batch of experiments is created by perturbing certain configurations of the base architecture.

2.2 (Optional) Specify a base file for computational budget. Additionally, GraphGym allows a user to select a base architecture that controls the computational budget for the grid search, via --config_budget. The computational budget is currently measured by the number of trainable parameters; it is controlled by automatically adjusting the hidden dimension size of the GNN. If no --config_budget is provided, GraphGym will not control the computational budget.
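
Matching a parameter budget by adjusting the hidden dimension can be sketched with a toy model. This is not GraphGym's budget code: count_params and match_budget are hypothetical helpers, and the toy stack of dense layers only stands in for a GNN. The input size 1433 and output size 7 correspond to Cora's feature and class counts.

```python
def count_params(dim_in: int, dim_hidden: int, dim_out: int, num_layers: int) -> int:
    """Weights plus biases of a toy stack: input layer, hidden layers, output layer."""
    params = dim_in * dim_hidden + dim_hidden                            # input layer
    params += (num_layers - 1) * (dim_hidden * dim_hidden + dim_hidden)  # hidden layers
    params += dim_hidden * dim_out + dim_out                             # output layer
    return params


def match_budget(budget: int, dim_in: int, dim_out: int, num_layers: int,
                 max_dim: int = 4096) -> int:
    """Return the hidden size whose parameter count is closest to the budget."""
    return min(range(1, max_dim + 1),
               key=lambda d: abs(count_params(dim_in, d, dim_out, num_layers) - budget))


# Budget defined by a hypothetical base model: 2 layers with hidden size 256.
budget = count_params(dim_in=1433, dim_hidden=256, dim_out=7, num_layers=2)
# A deeper variant must shrink its hidden size to stay near the same budget.
dim_deep = match_budget(budget, dim_in=1433, dim_out=7, num_layers=6)
print(budget, dim_deep, count_params(1433, dim_deep, 7, 6))
```

With the budget held fixed, a deeper model gets narrower layers, so a comparison across depths is not confounded by model size.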

2.3 Specify a grid file. A grid file describes how to perturb the base file in order to generate the batch of experiments. For example, the base file could specify an experiment with a 3-layer GCN for Cora node classification. The grid file then specifies how to perturb the experiment along different dimensions, such as the number of layers, model architecture, dataset, level of task, etc.
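
Conceptually, perturbing a base configuration along several dimensions is a Cartesian product. The sketch below illustrates this with dictionaries; it does not reproduce GraphGym's grid file syntax or configs_gen.py, and expand_grid, set_nested and the field names are hypothetical.

```python
import copy
import itertools
from typing import Any, Dict, List


def set_nested(cfg: Dict[str, Any], dotted_key: str, value: Any) -> None:
    """Set cfg['a']['b'] = value for a key written as 'a.b'."""
    *parents, leaf = dotted_key.split(".")
    node = cfg
    for name in parents:
        node = node.setdefault(name, {})
    node[leaf] = value


def expand_grid(base: Dict[str, Any], grid: Dict[str, List[Any]]) -> List[Dict[str, Any]]:
    """Create one config per combination of the grid values."""
    keys = list(grid)
    configs = []
    for combo in itertools.product(*(grid[k] for k in keys)):
        cfg = copy.deepcopy(base)  # never mutate the base experiment
        for key, value in zip(keys, combo):
            set_nested(cfg, key, value)
        configs.append(cfg)
    return configs


# Hypothetical base experiment and grid; the field names are illustrative.
base = {"dataset": {"name": "Cora", "task": "node"}, "gnn": {"layers": 3, "type": "gcn"}}
grid = {"gnn.layers": [2, 4, 6], "gnn.type": ["gcn", "gat"]}

configs = expand_grid(base, grid)
print(len(configs))  # 3 * 2 combinations
```

The number of generated experiments is the product of the option counts, which is why a handful of grid dimensions quickly yields a large batch.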

2.4 Generate config files for the batch of experiments, based on the information specified above. For example, in run/run_batch.sh:

python configs_gen.py --config configs/${DIR}/${CONFIG}.yaml \
  --config_budget configs/${DIR}/${CONFIG}.yaml \
  --grid grids/${DIR}/${GRID}.txt \
  --out_dir configs

2.5 Launch the batch of experiments. For example, in run/run_batch.sh:

bash parallel.sh configs/${CONFIG}_grid_${GRID} $REPEAT $MAX_JOBS

Each experiment is repeated $REPEAT times. We implemented a queue system that launches the jobs in order, keeping at most $MAX_JOBS jobs running at the same time. In practice, the system works well when handling thousands of jobs.
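
A queue with a bounded number of concurrent jobs can be sketched with the Python standard library. This is not how parallel.sh is implemented (it is a shell script); run_jobs is a hypothetical helper, and the stand-in jobs are short Python processes rather than training runs.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor
from typing import List, Sequence


def run_jobs(commands: Sequence[List[str]], max_jobs: int) -> List[int]:
    """Run every command, keeping at most max_jobs processes alive at once.

    Returns the exit codes in the same order as the commands.
    """
    def run_one(cmd: List[str]) -> int:
        return subprocess.run(cmd).returncode

    # The pool starts a queued command as soon as a worker becomes free.
    with ThreadPoolExecutor(max_workers=max_jobs) as pool:
        return list(pool.map(run_one, commands))


# Stand-in jobs: each one is a short Python process instead of a training run.
repeat = 3
commands = [[sys.executable, "-c", f"print('job seed {seed}')"] for seed in range(repeat)]
exit_codes = run_jobs(commands, max_jobs=2)
print(exit_codes)
```

Threads are sufficient here because each worker only waits for its child process; the heavy computation happens in the launched processes.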

2.6 Understand the results. Experimental results are automatically saved in the directory run/results/${CONFIG_NAME}_grid_${GRID_NAME}/; in the example above, it is run/results/example_grid_example/. After running each experiment, GraphGym also automatically aggregates the results across the different models and saves them in run/results/example_grid_example/agg. There, val.csv contains the validation accuracy of each model configuration at the final epoch; val_best.csv contains the results at the epoch with the highest average validation accuracy; val_best_epoch.csv contains the results at the epoch with the highest validation accuracy, averaged over different random seeds. When a test split is provided, test.csv contains the test accuracy of each model configuration at the final epoch; test_best.csv contains the test results at the epoch with the highest average validation accuracy; test_best_epoch.csv contains the test results at the epoch with the highest validation accuracy, averaged over different random seeds.
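
One possible reading of the difference between the two "best" files is sketched below. This is an interpretation, not GraphGym's aggregation code: it assumes the selection metric is validation accuracy (higher is better), the helper names are hypothetical, and the numbers are placeholders.

```python
from statistics import mean
from typing import List


def best_of_average(acc: List[List[float]]) -> float:
    """Average over seeds per epoch, then pick the epoch with the best average."""
    per_epoch_mean = [mean(epoch_values) for epoch_values in zip(*acc)]
    return max(per_epoch_mean)


def average_of_best(acc: List[List[float]]) -> float:
    """Pick each seed's own best epoch, then average those values over seeds."""
    return mean(max(seed_curve) for seed_curve in acc)


# acc[s][e]: validation accuracy of seed s at epoch e (placeholder numbers).
acc = [
    [0.5, 0.75, 0.5],
    [0.75, 0.5, 0.5],
]
print(best_of_average(acc))
print(average_of_best(acc))
```

The second quantity can never be smaller than the first, because every seed is allowed to pick its own best epoch.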

3 Analyze the results

We provide a handy tool in analysis/example.ipynb that automatically generates an overview of a batch of experiments.

cd analysis
jupyter notebook example.ipynb   # automatically provide an overview of a batch of experiments

4 User customization

A highlight of GraphGym is that it allows users to easily register their customized modules. The supported customized modules are provided in the directory graphgym/contrib/ and include, for example, data loaders, GNN layers and loss functions.

Each directory contains at least one example showing how to register a user-customized module. Note that new user-customized modules may require new configurations; in these cases, new configuration fields can be registered in graphgym/contrib/config/.
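
The general idea of registering a module under a name and looking it up from a configuration can be sketched as follows. This is a generic registry pattern, not GraphGym's registration API: LAYERS, register and build_layer are hypothetical names, and ExampleConv here is an empty placeholder rather than a real GNN layer. The examples in graphgym/contrib/ show the actual mechanism.

```python
from typing import Callable, Dict

# Hypothetical registry that maps a config string to a layer class.
LAYERS: Dict[str, Callable] = {}


def register(name: str) -> Callable[[Callable], Callable]:
    """Decorator that stores a class under a string key."""
    def decorator(cls: Callable) -> Callable:
        if name in LAYERS:
            raise KeyError(f"layer '{name}' is already registered")
        LAYERS[name] = cls
        return cls
    return decorator


@register("exampleconv")
class ExampleConv:
    """Placeholder layer; a real one would subclass a PyTorch module."""

    def __init__(self, dim_in: int, dim_out: int) -> None:
        self.dim_in = dim_in
        self.dim_out = dim_out


def build_layer(name: str, **kwargs):
    """Look up the class named in the configuration and instantiate it."""
    return LAYERS[name](**kwargs)


layer = build_layer("exampleconv", dim_in=16, dim_out=32)
print(type(layer).__name__, layer.dim_in, layer.dim_out)
```

With such a registry, a configuration file only needs to name the layer; the core pipeline does not have to change when a new module is added.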

Note: Applying to your own datasets. A common use case is applying GraphGym to your own datasets. To do so, you may follow our example in graphgym/contrib/loader/example.py. GraphGym currently accepts a list of NetworkX graphs or PyG datasets.

Use case: Design Space for Graph Neural Networks (NeurIPS 2020 Spotlight)

Reproducing experiments in Design Space for Graph Neural Networks, Jiaxuan You, Rex Ying, Jure Leskovec, NeurIPS 2020 Spotlight. You may refer to the paper or project webpage for more details.

# NOTE: We include the raw results with GraphGym.
# If you run the following code, the results will be overwritten.
cd run/scripts/design/
bash run_design_round1.sh   # first round experiments, on a design space of 315K GNN designs
bash run_design_round2.sh   # second round experiments, on a design space of 96 GNN designs
cd ../../../analysis
jupyter notebook design_space.ipynb   # reproducing all the analyses in the paper

Figure 4: Overview of the proposed GNN design space and task space.

Use case: Identity-aware Graph Neural Networks (AAAI 2021)

Reproducing experiments in Identity-aware Graph Neural Networks, Jiaxuan You, Jonathan Gomes-Selman, Rex Ying, Jure Leskovec, AAAI 2021. You may refer to the paper or project webpage for more details.

# NOTE: We include the raw results for ID-GNN in analysis/idgnn.csv
cd run/scripts/IDGNN/
bash run_idgnn_node.sh   # Reproduce ID-GNN node-level results
bash run_idgnn_edge.sh   # Reproduce ID-GNN edge-level results
bash run_idgnn_graph.sh   # Reproduce ID-GNN graph-level results

Figure 5: Overview of Identity-aware Graph Neural Networks (ID-GNN).

Use case: Relational Multi-Task Learning: Modeling Relations between Data and Tasks (ICLR 2022 Spotlight)

Reproducing experiments in Relational Multi-Task Learning: Modeling Relations between Data and Tasks, Kaidi Cao*, Jiaxuan You*, Jure Leskovec, ICLR 2022.

# NOTE: We include the raw results for ID-GNN in analysis/idgnn.csv
git checkout meta_link
cd run/scripts/MetaLink/
bash run_metalink.sh   # Reproduce MetaLink results for graph classification

Figure 5: Overview of Identity-aware Graph Neural Networks (ID-GNN).

Use case: ROLAND: Graph Learning Framework for Dynamic Graphs (KDD 2022)

ROLAND: Graph Learning Framework for Dynamic Graphs, Jiaxuan You, Tianyu Du, Jure Leskovec, KDD 2022. ROLAND is built on a fork of the GraphGym implementation. Please check out the corresponding repository for ROLAND.

Contributors

Jiaxuan You initiated the project and is the main contributor to the entire GraphGym platform. Rex Ying contributed the feature augmentation modules. Jonathan Gomes-Selman added OGB support to GraphGym.

GraphGym is inspired by the framework of pycls. GraphGym adopts DeepSNAP as the default data representation. Part of GraphGym relies on PyTorch Geometric functionalities.

Contributing

We warmly welcome the community to contribute to GraphGym. GraphGym is particularly designed to enable contribution / customization in a simple way. For example, you may contribute your modules to graphgym/contrib/ by creating pull requests.

Citing GraphGym

If you find GraphGym or our paper useful, please cite our paper:

@InProceedings{you2020design,
  title = {Design Space for Graph Neural Networks},
  author = {You, Jiaxuan and Ying, Rex and Leskovec, Jure},
  booktitle = {NeurIPS},
  year = {2020}
}
