Repository Details

Open MatSci ML Toolkit is a framework for prototyping and scaling out deep learning models for materials discovery supporting widely used materials science datasets, and built on top of PyTorch Lightning, the Deep Graph Library, and PyTorch Geometric.

Open MatSci ML Toolkit: A Broad, Multi-Task Benchmark for Solid-State Materials Modeling

matsciml-preprint hpo-paper lightning pytorch dgl pyg License: MIT

This is the implementation of the MatSci ML benchmark, which includes ~1.5 million ground-state materials collected from various datasets, as well as integration of the OpenCatalyst dataset. It supports diverse data formats (point clouds, DGL graphs, PyG graphs), learning methods (single-task, multi-task, multi-data), and deep learning models. Primary project contributors include: Santiago Miret (Intel Labs), Kin Long Kelvin Lee (Intel AXG), Carmelo Gonzales (Intel Labs), Mikhail Galkin (Intel Labs), Marcel Nassar (Intel Labs), Matthew Spellings (Vector Institute).

News

  • [2023/09/27] Release of pre-packaged lmdb-based datasets from v1.0.0 via Zenodo.
  • [2023/08/31] Initial release of the MatSci ML Benchmark with integration of ~1.5 million ground state materials.
  • [2023/07/31] The Open MatSci ML Toolkit: A Flexible Framework for Deep Learning on the OpenCatalyst Dataset paper is accepted into TMLR. See previous version for code related to the benchmark.

Introduction

The MatSci ML Benchmark contains diverse sets of tasks (energy prediction, force prediction, property prediction) across a broad range of datasets (OpenCatalyst Project [1], Materials Project [2], LiPS [3], OQMD [4], NOMAD [5], Carolina Materials Database [6]). Most of the data relates to the energy prediction task, since energy is the most commonly tracked property for materials systems in the literature. The codebase supports single-task learning, as well as multi-task (training one model for multiple tasks within a dataset) and multi-data (training a model across multiple datasets with a common property) learning. Additionally, we provide a generative materials pipeline that applies diffusion models (CDVAE [7]) to generate new unit cells.

The package follows the original design principles of the Open MatSci ML Toolkit, including:

  • Ease of use for new ML researchers and practitioners who want to get started on interacting with the OpenCatalyst dataset.
  • Scalable computation of experiments leveraging PyTorch Lightning across different computation capabilities (laptop, server, cluster) and hardware platforms (CPU, GPU, XPU) without sacrificing performance in the compute and modeling.
  • Integrating support for DGL and PyTorch Geometric for rapid GNN development.

The examples outlined in the next section show how to get started with Open MatSci ML Toolkit using simple Python scripts, Jupyter notebooks, or the PyTorch Lightning CLI for a simple training run on a portable subset of the original dataset (dev-set) that can be run on a laptop. Subsequently, we scale our example Python script to large compute systems, including distributed data parallel training (multiple GPUs on a single node) and multi-node training (multiple GPUs across multiple nodes) in a computing cluster. Leveraging both PyTorch Lightning and DGL capabilities, we can enable compute and experiment scaling with minimal additional complexity.

Installation

  • Docker: We provide a Dockerfile inside the docker folder that can be used to build a container using standard docker commands.
  • Conda: We have included a conda specification that provides a complete installation including XPU support for PyTorch. Run conda env create -n matsciml --file conda.yml, and in the newly created environment, run pip install './[all]' to install all of the dependencies.
  • pip: In some cases, you might want to install matsciml to an existing environment. Due to how DGL distributes wheels, you will need to add an extra index URL when installing via pip. As an example: pip install -f https://data.dgl.ai/wheels/repo.html './[all]' will install all the matsciml dependencies, in addition to telling pip where to look for CPU-only DGL wheels for your particular platform and Python version. Please consult the DGL documentation for additional help.

Additionally, for a development install, one can specify the extra packages like black and pytest with pip install './[dev]'. These can be added to the commit workflow by running pre-commit install to generate git hooks.

Examples

The examples folder contains simple, unit scripts that demonstrate how to use the pipeline in specific ways:

Get started with different datasets with "devsets"

# Materials project
python examples/datasets/materials_project/single_task_devset.py

# Carolina materials database
python examples/datasets/carolina_db/single_task_devset.py

# NOMAD
python examples/datasets/nomad/single_task_devset.py

# OQMD
python examples/datasets/oqmd/single_task_devset.py

Representation learning with symmetry pretraining

# uses the devset for synthetic point group point clouds
python examples/tasks/symmetry/single_symmetry_example.py

Example notebook-based development and testing

jupyter notebook examples/devel-example.ipynb

For more advanced use cases:

Check out materials generation with CDVAE

CDVAE [7] is a latent diffusion model that trains a VAE on the reconstruction objective, adds Gaussian noise to the latent variable, and learns to predict the noise. The noised and generated features include lattice parameters, atom composition, and atom coordinates. The generation process is based on annealed Langevin dynamics.
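
To give a feel for annealed Langevin dynamics, here is a minimal one-dimensional sketch. It is not the CDVAE sampler: CDVAE uses a learned network to score atom types and coordinates, whereas this toy assumes a known Gaussian target so the score can be written analytically. The function names, the noise schedule `sigmas`, and the step sizes are all illustrative assumptions.

```python
import math
import random


def gaussian_score(x, mean, std, sigma):
    """Score (gradient of the log-density) of N(mean, std^2) blurred by noise of scale sigma."""
    return -(x - mean) / (std**2 + sigma**2)


def annealed_langevin_sample(rng, mean, std, sigmas, steps_per_level=50, base_step=1e-3):
    """Draw one sample by running Langevin updates over a decreasing noise schedule."""
    x = rng.gauss(0.0, 1.0)  # arbitrary starting point
    for sigma in sigmas:  # from the largest noise level to the smallest
        # step size shrinks with the noise level, as in annealed Langevin dynamics
        step = base_step * (sigma / sigmas[-1]) ** 2
        for _ in range(steps_per_level):
            noise = rng.gauss(0.0, 1.0)
            x = x + 0.5 * step * gaussian_score(x, mean, std, sigma) + math.sqrt(step) * noise
    return x


rng = random.Random(0)
sigmas = [1.0, 0.6, 0.36, 0.2, 0.1, 0.05]  # hypothetical noise schedule
samples = [annealed_langevin_sample(rng, mean=2.0, std=0.5, sigmas=sigmas) for _ in range(500)]
sample_mean = sum(samples) / len(samples)
print(f"sample mean: {sample_mean:.2f} (target mean is 2.0)")
```

Each sample needs many sequential update steps across all noise levels, which is why sampling is the slow part of the pipeline.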

CDVAE is implemented in the GenerationTask and we provide a custom data split from the Materials Project bounded by 25 atoms per structure. The process is split into 3 parts with 3 respective scripts found in examples/model_demos/cdvae/.

  1. Training CDVAE on the reconstruction and denoising objectives: cdvae.py
  2. Sampling the structures (from scratch or reconstruct the test set): cdvae_inference.py
  3. Evaluating the sampled structures: cdvae_metrics.py

The sampling procedure takes some time (about 5-8 hours for 10,000 structures, depending on the hardware) due to the Langevin dynamics. The default hyperparameters of the CDVAE components correspond to those from the original paper and can be found in cdvae_configs.py.

# training
python examples/model_demos/cdvae/cdvae.py --data_path <path/to/splits>

# sampling 10,000 structures from scratch
python examples/model_demos/cdvae/cdvae_inference.py --model_path <path/to/checkpoint> --data_path <path/to/splits> --tasks gen

# evaluating the sampled structures
python examples/model_demos/cdvae/cdvae_metrics.py --root_path <path/to/generated_samples> --data_path <path/to/splits> --tasks gen

Multiple tasks trained using the same dataset

# this script requires modification as you'll need to download the materials
# project dataset, and point L24 to the folder where it was saved
python examples/tasks/multitask/single_data_multitask_example.py

Utilizes Materials Project data to train property regression and material classification jointly

Multiple tasks trained using multiple datasets

python examples/tasks/multitask/three_datasets.py

Train regression tasks against IS2RE, S2EF, and LiPS datasets jointly

Data Pipeline

In the scripts folder you will find two scripts needed to download and preprocess datasets: download_datasets.py can be used to obtain the Carolina DB, Materials Project, NOMAD, and OQMD datasets, while download_ocp_data.py preserves the original Open Catalyst script.

In the current release, we have implemented interfaces to a number of large scale materials science datasets. Under the hood, the data structures pulled from each dataset have been homogenized, and the only real interaction layer for users is through the MatSciMLDataModule, a subclass of LightningDataModule.

from matsciml.lightning.data_utils import MatSciMLDataModule

# no configuration needed, although one can specify the batch size and number of workers
devset_module = MatSciMLDataModule.from_devset(dataset="MaterialsProjectDataset")

This will let you springboard into development without needing to worry about how to wrangle the datasets; just grab a batch and go! With the exception of Open Catalyst, datasets will typically return point cloud representations; we provide a flexible transform interface to convert between representations and frameworks:

From point clouds to DGL graphs

from matsciml.datasets.transforms import PointCloudToGraphTransform

# make the materials project dataset emit DGL graphs, based on an atom-atom distance cutoff of 10
devset = MatSciMLDataModule.from_devset(
    dataset="MaterialsProjectDataset",
    dset_kwargs={"transforms": [PointCloudToGraphTransform(backend="dgl", cutoff_dist=10.)]}
)
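
To illustrate what a distance cutoff means when turning a point cloud into a graph, the sketch below connects every pair of atoms closer than the cutoff. This is only a conceptual illustration, not the implementation behind PointCloudToGraphTransform: the helper `radius_graph` is hypothetical, and it ignores periodic boundary conditions and backend-specific graph objects.

```python
import math
from itertools import combinations


def radius_graph(positions, cutoff_dist):
    """Return undirected edges (i, j) with i < j for atom pairs within cutoff_dist."""
    edges = []
    for (i, p), (j, q) in combinations(enumerate(positions), 2):
        if math.dist(p, q) <= cutoff_dist:
            edges.append((i, j))
    return edges


# three atoms on a line, spaced 1.5 apart (arbitrary units)
positions = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 0.0, 0.0)]
edges = radius_graph(positions, cutoff_dist=2.0)
print(edges)  # [(0, 1), (1, 2)]
```

A larger cutoff yields a denser graph: with a cutoff of 10, all three atoms in this toy example would be connected to each other.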

But I want to use PyG?

from matsciml.datasets.transforms import PointCloudToGraphTransform

# change the backend argument to obtain PyG graphs
devset = MatSciMLDataModule.from_devset(
    dataset="MaterialsProjectDataset",
    dset_kwargs={"transforms": [PointCloudToGraphTransform(backend="pyg", cutoff_dist=10.)]}
)

What else can I configure with `MatSciMLDataModule`?

Datasets beyond devsets can be configured through class arguments:

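from matsciml.datasets.transforms import PointCloudToGraphTransform
from matsciml.lightning.data_utils import MatSciMLDataModule

# point the data module at LMDB folders instead of a devset
datamodule = MatSciMLDataModule(
    dataset="MaterialsProjectDataset",
    train_path="/path/to/training/lmdb/folder",
    batch_size=64,
    num_workers=4,     # configure data loader instances
    dset_kwargs={"transforms": [PointCloudToGraphTransform(backend="pyg", cutoff_dist=10.)]},
    val_split="/path/to/val/lmdb/folder"
)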

In particular, val_split and test_split can point to their LMDB folders, or just a float between [0,1] to do quick, uniform splits. The rest, including distributed sampling, will be taken care of for you under the hood.
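
As a rough picture of what a fractional split does, the sketch below shuffles sample indices and holds out the requested fraction. It is an illustration under assumptions, not the toolkit's own splitting code: the helper `uniform_split`, its `seed` argument, and the rounding rule are all hypothetical.

```python
import random


def uniform_split(num_samples, val_fraction, seed=0):
    """Shuffle indices and hold out val_fraction of them for validation."""
    if not 0.0 <= val_fraction <= 1.0:
        raise ValueError("val_fraction must lie in [0, 1]")
    indices = list(range(num_samples))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    num_val = int(round(num_samples * val_fraction))
    return indices[num_val:], indices[:num_val]


train_idx, val_idx = uniform_split(num_samples=100, val_fraction=0.2)
print(len(train_idx), len(val_idx))  # 80 20
```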

How do I compose multiple datasets?

Given the amount of configuration involved, composing multiple datasets takes a little more work but we have tried to make it as seamless as possible. The main difference from the single dataset case is replacing MatSciMLDataModule with MultiDataModule from matsciml.lightning.data_utils, configuring each dataset manually, and passing them collectively into the data module:

from matsciml.datasets import MaterialsProjectDataset, OQMDDataset, MultiDataset
from matsciml.lightning.data_utils import MultiDataModule

# configure training only here, but same logic extends to validation/test splits
train_dset = MultiDataset(
  [
    MaterialsProjectDataset("/path/to/train/materialsproject"),
    OQMDDataset("/path/to/train/oqmd")
  ]
)

# this configures the actual data module passed into Lightning
datamodule = MultiDataModule(
  batch_size=32,
  num_workers=4,
  train_dataset=train_dset
)

While it does require a bit of extra work, this was to ensure flexibility in how you can compose datasets. We welcome feedback on the user experience! 😃
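
One common way to present several datasets as a single indexable collection is to concatenate them and remember which dataset each sample came from. The sketch below shows that idea with plain Python lists. The class `ConcatenatedDataset` is hypothetical, and it is an assumption that MultiDataset works along similar lines.

```python
from bisect import bisect_right
from itertools import accumulate


class ConcatenatedDataset:
    """Index into several datasets as if they were one, tracking the source dataset."""

    def __init__(self, datasets):
        self.datasets = list(datasets)
        # running totals of dataset lengths, e.g. [3, 5] for lengths 3 and 2
        self.offsets = list(accumulate(len(d) for d in self.datasets))

    def __len__(self):
        return self.offsets[-1] if self.offsets else 0

    def __getitem__(self, index):
        if not 0 <= index < len(self):
            raise IndexError(index)
        dataset_idx = bisect_right(self.offsets, index)
        start = self.offsets[dataset_idx - 1] if dataset_idx > 0 else 0
        # return the source dataset index alongside the sample
        return dataset_idx, self.datasets[dataset_idx][index - start]


materials_project = ["mp-0", "mp-1", "mp-2"]  # stand-ins for real samples
oqmd = ["oqmd-0", "oqmd-1"]
combined = ConcatenatedDataset([materials_project, oqmd])
print(len(combined), combined[1], combined[3])  # 5 (0, 'mp-1') (1, 'oqmd-0')
```

Keeping the source index with each sample is what lets a multi-data training loop route a batch to the task that matches its dataset.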

Task abstraction

In Open MatSci ML Toolkit, tasks effectively form learning objectives: at a high level, a task takes an encoding model/backbone that ingests a structure to predict one or several properties, or classify a material. In the single-task case, there may be multiple targets and the neural network architecture may be fluid, but there is only one optimizer. Under this definition, multi-task learning comprises multiple tasks and optimizers operating jointly through a single embedding.
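
The toy sketch below makes the "multiple tasks and optimizers, single embedding" idea concrete with scalar parameters and hand-written gradients. It does not use the toolkit's task classes: the shared encoder is a single weight, each task is a single output head, and each "optimizer" is reduced to its own learning rate. All names and values are illustrative assumptions.

```python
def mean(values):
    values = list(values)
    return sum(values) / len(values)


xs = [-1.0, -0.5, 0.5, 1.0]
# two regression targets that depend on the same input
targets = {"task_a": [2.0 * x for x in xs], "task_b": [-3.0 * x for x in xs]}

encoder_w = 1.0  # shared embedding: z = encoder_w * x
heads = {"task_a": 0.0, "task_b": 0.0}  # per-task output heads: y = head * z
learning_rates = {"encoder": 0.02, "task_a": 0.05, "task_b": 0.05}  # one "optimizer" each


def joint_loss():
    """Sum of per-task mean squared errors."""
    return sum(
        mean((heads[t] * encoder_w * x - y) ** 2 for x, y in zip(xs, targets[t]))
        for t in heads
    )


initial_loss = joint_loss()
for _ in range(2000):
    encoder_grad = 0.0
    head_grads = {}
    for task, ys in targets.items():
        residuals = [heads[task] * encoder_w * x - y for x, y in zip(xs, ys)]
        head_grads[task] = mean(2 * r * encoder_w * x for r, x in zip(residuals, xs))
        # every task contributes a gradient to the shared encoder
        encoder_grad += mean(2 * r * heads[task] * x for r, x in zip(residuals, xs))
    for task, grad in head_grads.items():
        heads[task] -= learning_rates[task] * grad
    encoder_w -= learning_rates["encoder"] * encoder_grad
final_loss = joint_loss()
print(f"joint loss: {initial_loss:.3f} -> {final_loss:.2e}")
```

The point of the sketch is that each head only sees gradients from its own task, while the shared encoder accumulates gradients from all of them.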

References

  • [1] Chanussot, L., Das, A., Goyal, S., Lavril, T., Shuaibi, M., Riviere, M., Tran, K., Heras-Domingo, J., Ho, C., Hu, W. and Palizhati, A., 2021. Open catalyst 2020 (OC20) dataset and community challenges. ACS Catalysis, 11(10), pp.6059-6072.
  • [2] Jain, A., Ong, S.P., Hautier, G., Chen, W., Richards, W.D., Dacek, S., Cholia, S., Gunter, D., Skinner, D., Ceder, G. and Persson, K.A., 2013. Commentary: The Materials Project: A materials genome approach to accelerating materials innovation. APL Materials, 1(1).
  • [3] Batzner, S., Musaelian, A., Sun, L., Geiger, M., Mailoa, J.P., Kornbluth, M., Molinari, N., Smidt, T.E. and Kozinsky, B., 2022. E(3)-equivariant graph neural networks for data-efficient and accurate interatomic potentials. Nature Communications, 13(1), p.2453.
  • [4] Kirklin, S., Saal, J.E., Meredig, B., Thompson, A., Doak, J.W., Aykol, M., Rühl, S. and Wolverton, C., 2015. The Open Quantum Materials Database (OQMD): assessing the accuracy of DFT formation energies. npj Computational Materials, 1(1), pp.1-15.
  • [5] Draxl, C. and Scheffler, M., 2019. The NOMAD laboratory: from data sharing to artificial intelligence. Journal of Physics: Materials, 2(3), p.036001.
  • [6] Zhao, Y., Al-Fahdi, M., Hu, M., Siriwardane, E.M., Song, Y., Nasiri, A. and Hu, J., 2021. High-throughput discovery of novel cubic crystal materials using deep generative neural networks. Advanced Science, 8(20), p.2100566.
  • [7] Xie, T., Fu, X., Ganea, O.E., Barzilay, R. and Jaakkola, T.S., 2021, October. Crystal Diffusion Variational Autoencoder for Periodic Material Generation. In International Conference on Learning Representations.

Citations

If you use Open MatSci ML Toolkit in your technical work or publication, we would appreciate it if you cite the Open MatSci ML Toolkit paper in TMLR:

Miret, S.; Lee, K. L. K.; Gonzales, C.; Nassar, M.; Spellings, M. The Open MatSci ML Toolkit: A Flexible Framework for Machine Learning in Materials Science. Transactions on Machine Learning Research, 2023.
@article{openmatscimltoolkit,
  title = {The Open {{MatSci ML}} Toolkit: {{A}} Flexible Framework for Machine Learning in Materials Science},
  author = {Miret, Santiago and Lee, Kin Long Kelvin and Gonzales, Carmelo and Nassar, Marcel and Spellings, Matthew},
  year = {2023},
  journal = {Transactions on Machine Learning Research},
  issn = {2835-8856}
}

If you use v1.0.0, please cite our paper:

Lee, K. L. K., Gonzales, C., Nassar, M., Spellings, M., Galkin, M., & Miret, S. (2023). MatSciML: A Broad, Multi-Task Benchmark for Solid-State Materials Modeling. arXiv preprint arXiv:2309.05934.
@article{lee2023matsciml,
  title={MatSciML: A Broad, Multi-Task Benchmark for Solid-State Materials Modeling},
  author={Lee, Kin Long Kelvin and Gonzales, Carmelo and Nassar, Marcel and Spellings, Matthew and Galkin, Mikhail and Miret, Santiago},
  journal={arXiv preprint arXiv:2309.05934},
  year={2023}
}

Please cite datasets used in your work as well. You can find additional descriptions and details regarding each dataset here.
