GCC: Graph Contrastive Coding for Graph Neural Network Pre-Training
The original implementation of the paper "GCC: Graph Contrastive Coding for Graph Neural Network Pre-Training".
GCC is a contrastive learning framework that performs unsupervised, structural pre-training of graph neural networks and achieves state-of-the-art results on 10 datasets across 3 graph mining tasks (node classification, graph classification, and similarity search).
Installation
Requirements
- Linux with Python ≥ 3.6
- PyTorch ≥ 1.4.0
- DGL ≥ 0.4.3, < 0.5
- `pip install -r requirements.txt`
- Install RDKit with `conda install -c conda-forge rdkit=2019.09.2`.
Quick Start
Pretraining
Pre-training datasets
python scripts/download.py --url https://drive.google.com/open?id=1JCHm39rf7HAJSp-1755wa32ToHCn2Twz --path data --fname small.bin
# For regions where Google is not accessible, use
# python scripts/download.py --url https://cloud.tsinghua.edu.cn/f/b37eed70207c468ba367/?dl=1 --path data --fname small.bin
E2E
Pretrain E2E with K = 255 (in end-to-end mode the negatives for each query are the other samples in the same batch, so K = batch size − 1):
bash scripts/pretrain.sh <gpu> --batch-size 256
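For intuition, here is a minimal sketch of the InfoNCE objective GCC optimizes, assuming two encoded views `q` and `k` of the same batch of subgraph instances and the default temperature `nce_t = 0.07`; this is illustrative, not the repo's exact implementation:

```python
import torch
import torch.nn.functional as F

def info_nce(q, k, tau=0.07):
    """InfoNCE over an in-batch contrast (illustrative sketch).

    q, k: [batch, dim] embeddings of two augmented views of the same
    subgraph instances. For each query, the matching key is the positive
    and the other batch entries are the K = batch - 1 negatives.
    """
    q = F.normalize(q, dim=1)
    k = F.normalize(k, dim=1)
    logits = q @ k.t() / tau                           # [batch, batch] similarities
    labels = torch.arange(q.size(0), device=q.device)  # positives on the diagonal
    return F.cross_entropy(logits, labels)
```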
MoCo
Pretrain MoCo with K = 16384 and m = 0.999 (the MoCo queue decouples the number of negatives K from the batch size; m is the momentum of the key-encoder update):
bash scripts/pretrain.sh <gpu> --moco --nce-k 16384
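In MoCo mode the key encoder is an exponential moving average of the query encoder. A minimal sketch of that momentum update with m = 0.999, again illustrative rather than the repo's exact code:

```python
import torch

@torch.no_grad()
def momentum_update(encoder_q, encoder_k, m=0.999):
    """EMA update of the key encoder's parameters (MoCo-style sketch)."""
    for p_q, p_k in zip(encoder_q.parameters(), encoder_k.parameters()):
        p_k.data.mul_(m).add_(p_q.data, alpha=1.0 - m)
```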
Download Pretrained Models
Instead of pretraining from scratch, you can download our pretrained models.
python scripts/download.py --url https://drive.google.com/open?id=1lYW_idy9PwSdPEC7j9IH5I5Hc7Qv-22- --path saved --fname pretrained.tar.gz
# For regions where Google is not accessible, use
# python scripts/download.py --url https://cloud.tsinghua.edu.cn/f/cabec37002a9446d9b20/?dl=1 --path saved --fname pretrained.tar.gz
Downstream Tasks
Downstream datasets
python scripts/download.py --url https://drive.google.com/open?id=12kmPV3XjVufxbIVNx5BQr-CFM9SmaFvM --path data --fname downstream.tar.gz
# For regions where Google is not accessible, use
# python scripts/download.py --url https://cloud.tsinghua.edu.cn/f/2535437e896c4b73b6bb/?dl=1 --path data --fname downstream.tar.gz
Generate embeddings on multiple datasets with
bash scripts/generate.sh <gpu> <load_path> <dataset_1> <dataset_2> ...
For example:
bash scripts/generate.sh 0 saved/Pretrain_moco_True_dgl_gin_layer_5_lr_0.005_decay_1e-05_bsz_32_hid_64_samples_2000_nce_t_0.07_nce_k_16384_rw_hops_256_restart_prob_0.8_aug_1st_ft_False_deg_16_pos_32_momentum_0.999/current.pth usa_airport kdd imdb-binary
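generate.sh writes one embedding matrix per dataset under the load path. A sketch of loading one for your own downstream use, assuming a per-dataset `.npy` file of shape `[num_instances, hidden_size]` (the exact file naming is an assumption; check generate.py in your checkout):

```python
import numpy as np

# Hypothetical output location; the actual naming convention may differ.
emb = np.load("<load_path>/usa_airport.npy")
print(emb.shape)  # expected: (num_nodes, hidden_size)
```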
Node Classification
Unsupervised (Table 2 freeze)
Run baselines on multiple datasets with `bash scripts/node_classification/baseline.sh <hidden_size> <baseline:prone/graphwave> usa_airport h-index`.
Evaluate GCC on multiple datasets:
bash scripts/generate.sh <gpu> <load_path> usa_airport h-index
bash scripts/node_classification/ours.sh <load_path> <hidden_size> usa_airport h-index
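Under the freeze protocol the encoder stays fixed and only a lightweight classifier is trained on the generated embeddings. A minimal sketch of that protocol with scikit-learn logistic regression; the file names and the exact classifier/cross-validation setup are assumptions, not necessarily what ours.sh uses:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Hypothetical files: frozen GCC node embeddings and the node labels.
X = np.load("usa_airport_embeddings.npy")
y = np.load("usa_airport_labels.npy")

clf = LogisticRegression(max_iter=1000)
scores = cross_val_score(clf, X, y, cv=10, scoring="f1_micro")
print(f"10-fold micro-F1: {scores.mean():.4f}")
```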
Supervised (Table 2 full)
Finetune GCC on multiple datasets:
bash scripts/finetune.sh <load_path> <gpu> usa_airport
Note that this fine-tunes the whole network and takes much longer than the frozen (freeze) experiments above.
Graph Classification
Unsupervised (Table 3 freeze)
bash scripts/generate.sh <gpu> <load_path> imdb-binary imdb-multi collab rdt-b rdt-5k
bash scripts/graph_classification/ours.sh <load_path> <hidden_size> imdb-binary imdb-multi collab rdt-b rdt-5k
Supervised (Table 3 full)
bash scripts/finetune.sh <load_path> <gpu> imdb-binary
Similarity Search (Table 4)
Run the baseline (GraphWave) on multiple datasets with `bash scripts/similarity_search/baseline.sh <hidden_size> graphwave kdd_icdm sigir_cikm sigmod_icde`.
Run GCC:
bash scripts/generate.sh <gpu> <load_path> kdd icdm sigir cikm sigmod icde
bash scripts/similarity_search/ours.sh <load_path> <hidden_size> kdd_icdm sigir_cikm sigmod_icde
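The similarity-search task pairs two conference co-author graphs (e.g. KDD vs. ICDM) and matches vertices across them by nearest-neighbor search in embedding space. A minimal sketch, assuming index-aligned embedding matrices produced by generate.sh and a top-1 hit-rate metric (not necessarily the repo's exact evaluation):

```python
import numpy as np

def top1_hit_rate(emb_a, emb_b):
    """Fraction of rows in emb_a whose cosine nearest neighbor in emb_b
    is the row with the same index (assumes index-aligned vertices)."""
    a = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
    b = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)
    nn = (a @ b.T).argmax(axis=1)  # nearest neighbor for each query row
    return float((nn == np.arange(len(a))).mean())

# kdd.npy / icdm.npy are hypothetical outputs of generate.sh
print(top1_hit_rate(np.load("kdd.npy"), np.load("icdm.npy")))
```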
❗ Common Issues
"XXX file not found" when running pretraining/downstream tasks.
Make sure you have downloaded the pretraining dataset or the downstream task datasets as described in GETTING_STARTED.md.
Server crashes/hangs after launching pretraining experiments.
In addition to a GPU, the pretraining stage requires substantial CPU and RAM. A crash or hang usually means CPU or RAM is exhausted on your machine. You can decrease `--num-workers` (the number of dataloader workers, which use CPU) and `--num-copies` (the number of dataset copies held in RAM). With the lowest profile, try `--num-workers 1 --num-copies 1`.
If this still fails, please upgrade your machine :). In the meantime, you can still download our pretrained model and evaluate it on downstream tasks.
Citing GCC
If you use GCC in your research or wish to refer to the baseline results, please use the following BibTeX entry.
@article{qiu2020gcc,
title={GCC: Graph Contrastive Coding for Graph Neural Network Pre-Training},
author={Qiu, Jiezhong and Chen, Qibin and Dong, Yuxiao and Zhang, Jing and Yang, Hongxia and Ding, Ming and Wang, Kuansan and Tang, Jie},
journal={arXiv preprint arXiv:2006.09963},
year={2020}
}
Acknowledgements
Part of this code is inspired by Yonglong Tian et al.'s CMC: Contrastive Multiview Coding.