Get started with CORD-19

The COVID-19 Open Research Dataset (CORD-19)

CORD-19 is a corpus of academic papers about COVID-19 and related coronavirus research. It's curated and maintained by the Semantic Scholar team at the Allen Institute for AI to support text mining and NLP research. Please read our paper for an in-depth description of how it was created: https://www.aclweb.org/anthology/2020.nlpcovid19-acl.1/

The final version of CORD-19 was released on June 2, 2022. Since we launched the dataset on March 13, 2020, we have released an updated version of the dataset almost every week. Starting from around 40K articles in its first version, the dataset has grown to index over 1M papers, and includes full text content for nearly 370K papers. We thank you for your support and feedback throughout this process. For more information, please see this blog post. A list of alternate data resources is provided under Other resources.

Updates

  • 2022-06-02 - Final release of CORD-19
  • 2021-03-01 - Review article published in Briefings in Bioinformatics
  • 2020-07-09 - CORD-19 presented at the NLP-COVID workshop.
  • 2020-03-13 - CORD-19 initial release

Important notes

We have performed some data cleaning that is sufficient for most text mining and NLP research efforts. We do not intend to clean this data to the point where it can be used for directly reading papers about COVID-19 or coronaviruses. There will always be some amount of error, which makes CORD-19 more usable for some applications than for others. We leave it up to the user to make this determination, though please feel free to consult us for recommendations.

While CORD-19 was initially released on 2020-03-13, the current schema is based on an update from 2020-05-26. Older versions of CORD-19 will not necessarily adhere to exactly the schema defined in this README. Please reach out for help if you are working with old CORD-19 versions.

Download

All versions of CORD-19 can be found HERE.

First published version (2020-03-13): Download Link (size: 0.3GB, md5: a36fe181, sha1: 8fbea927)

Last published version (2022-06-02): Download Link (size: 18.7GB, md5: c557069e, sha1: dd2c32bc)
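
After downloading an archive, it can be worth checking it against the digests listed above. The listed md5 and sha1 values are only eight characters long, so the sketch below assumes they are prefixes of the full hexadecimal digests; this README does not say so explicitly. The function names are hypothetical.

```python
import hashlib

def file_digests(path, chunk_size=1 << 20):
    """Return (md5_hex, sha1_hex) for the file at `path`, reading it in chunks."""
    md5 = hashlib.md5()
    sha1 = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            md5.update(chunk)
            sha1.update(chunk)
    return md5.hexdigest(), sha1.hexdigest()

def matches_listed(path, md5_prefix, sha1_prefix):
    """Compare a file against the short digests shown in this README.

    Assumption: the short values are prefixes of the full hex digests.
    """
    md5_hex, sha1_hex = file_digests(path)
    return (md5_hex.startswith(md5_prefix.lower())
            and sha1_hex.startswith(sha1_prefix.lower()))
```

For example, `matches_listed(local_path, "c557069e", "dd2c32bc")` would check a local copy of the last published version, where `local_path` is wherever you saved the download.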

Dataset Versions Used for TREC-COVID Shared Task

TREC-COVID Shared Task Website: https://ir.nist.gov/covidSubmit/index.html

TREC-COVID  Date        Changelog  Link to download                   md5       sha1
Round 1     2020-04-10  link       cord-19_2020-04-10.tar.gz (1.5GB)  f4c3e742  4980d8ee
Round 2     2020-05-01  link       cord-19_2020-05-01.tar.gz (1.7GB)  e8c56920  dc22dbc9
Round 3     2020-05-19  link       cord-19_2020-05-19.tar.gz (2.8GB)  6424de9c  1781b935
Round 4     2020-06-19  link       cord-19_2020-06-19.tar.gz (3.3GB)  47b61215  fdd0490e
Round 5     2020-07-16  link       cord-19_2020-07-16.tar.gz (3.7GB)  018c4bc4  7adcf31a

Dataset Versions Used for EPIC-QA Shared Task

EPIC-QA Shared Task Website: https://bionlp.nlm.nih.gov/epic_qa/

EPIC-QA            Date        Changelog  Link to download                   md5       sha1
Preliminary round  2020-06-19  link       cord-19_2020-06-19.tar.gz (3.3GB)  47b61215  fdd0490e
Primary round      2020-10-22  link       cord-19_2020-10-22.tar.gz (5.3GB)  7cb9e743  7efe285f

Overview

CORD-19 was released roughly weekly until the final version on 2022-06-02. Each version of the corpus is tagged with a datestamp (e.g. 2020-05-26). Releases look like:

|-- 2020-05-26/
    |-- changelog
    |-- cord_19_embeddings.tar.gz
    |-- document_parses.tar.gz
    |-- metadata.csv
|-- 2020-05-27/
|-- ...

The files in each version are:

  • changelog: A text file summarizing changes between this and the previous version.
  • cord_19_embeddings.tar.gz: A collection of precomputed SPECTER document embeddings for each CORD-19 paper
  • document_parses.tar.gz: A collection of JSON files that contain full text parses of a subset of CORD-19 papers
  • metadata.csv: Metadata for all CORD-19 papers.

When cord_19_embeddings.tar.gz is uncompressed, it is a 769-column CSV file, where the first column is the cord_uid and the remaining columns correspond to a 768-dimensional document embedding. For example:

ug7v899j,-2.939983606338501,-6.312200546264648,-1.0459030866622925,5.164162635803223,-0.32564637064933777,-2.507413387298584,1.735608696937561,1.9363566637039185,0.622501015663147,1.5613162517547607,...
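
A minimal sketch of reading rows in this layout into a dictionary keyed by cord_uid. It assumes the CSV has no header row, which matches the example row above but is not stated explicitly; the function name is hypothetical, and the demo uses a 3-dimensional stand-in rather than real 768-dimensional data.

```python
import csv
import io

def read_embeddings(lines, dim=768):
    """Yield (cord_uid, vector) pairs from rows of the embeddings CSV.

    `lines` is any iterable of text lines, such as an open file handle.
    """
    for row in csv.reader(lines):
        if not row:
            continue  # skip blank lines
        cord_uid, values = row[0], row[1:]
        if len(values) != dim:
            raise ValueError(f"{cord_uid}: expected {dim} values, got {len(values)}")
        yield cord_uid, [float(v) for v in values]

# Tiny in-memory demo: a made-up 3-dimensional row in place of the real 768 columns.
sample = io.StringIO("ug7v899j,-2.5,0.25,1.0\n")
embeddings = dict(read_embeddings(sample, dim=3))
```

For the real file, pass a handle opened with `newline=''` and keep the default `dim=768`. Holding every vector in a Python list is memory-hungry for a corpus of this size, so streaming the generator may be the better fit.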

When document_parses.tar.gz is uncompressed, it is a directory:

|-- document_parses/
    |-- pdf_json/
        |-- 80013c44d7d2d3949096511ad6fa424a2c740813.json
        |-- bfe20b3580e7c539c16ce4b1e424caf917d3be39.json
        |-- ...
    |-- pmc_json/
        |-- PMC7096781.xml.json
        |-- PMC7118448.xml.json
        |-- ...
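
Judging from the file names above and the sha and pmcid columns described later in this README, PDF parses appear to be named by the PDF's SHA1 and PMC parses by the PMC ID. Under that assumption, the following sketch recovers the identifier from a parse path; the function name is hypothetical.

```python
from pathlib import PurePosixPath

def parse_id_from_path(json_path):
    """Return (kind, identifier) for a path under document_parses/.

    kind is the parent directory name: 'pdf_json' or 'pmc_json'.
    """
    p = PurePosixPath(json_path)
    kind, name = p.parent.name, p.name
    if kind == "pdf_json" and name.endswith(".json"):
        return kind, name[: -len(".json")]       # assumed to be the PDF SHA1
    if kind == "pmc_json" and name.endswith(".xml.json"):
        return kind, name[: -len(".xml.json")]   # assumed to be the PMC ID
    raise ValueError(f"unrecognized parse path: {json_path}")
```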

Example usage

We recommend everyone primarily use metadata.csv & augment data when needed with full text in document_parses/. For example, let's say we wanted to collect a bunch of Titles, Abstracts, and Introductions of papers. In Python, such a script might look like:

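import csv
import json
from collections import defaultdict

cord_uid_to_text = defaultdict(list)

# open the metadata file; newline='' is what the csv module recommends,
# and UTF-8 is assumed here for the file encoding
with open('metadata.csv', newline='', encoding='utf-8') as f_in: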
    reader = csv.DictReader(f_in)
    for row in reader:

        # access some metadata
        cord_uid = row['cord_uid']
        title = row['title']
        abstract = row['abstract']
        authors = row['authors'].split('; ')

        # access the full text (if available) for Intro
        introduction = []
        if row['pdf_json_files']:
            for json_path in row['pdf_json_files'].split('; '):
                with open(json_path) as f_json:
                    full_text_dict = json.load(f_json)

                    # grab introduction section from *some* version of the full text
                    for paragraph_dict in full_text_dict['body_text']:
                        paragraph_text = paragraph_dict['text']
                        section_name = paragraph_dict['section']
                        if 'intro' in section_name.lower():
                            introduction.append(paragraph_text)

                    # stop searching other copies of full text if already got introduction
                    if introduction:
                        break

        # save for later usage
        cord_uid_to_text[cord_uid].append({
            'title': title,
            'abstract': abstract,
            'introduction': introduction
        })

metadata.csv overview

We recommend everyone work with metadata.csv as the starting point. This file is comma-separated with the following columns:

  • cord_uid: A str-valued field that assigns a unique identifier to each CORD-19 paper. This is not necessarily unique per row, which is explained in the FAQs.
  • sha: A List[str]-valued field that is the SHA1 of all PDFs associated with the CORD-19 paper. Most papers will have either zero or one value here (since we either have a PDF or we don't), but some papers will have multiple. For example, the main paper might have supplemental information saved in a separate PDF. Or we might have two separate PDF copies of the same paper. If multiple PDFs exist, their SHA1 will be semicolon-separated (e.g. '4eb6e165ee705e2ae2a24ed2d4e67da42831ff4a; d4f0247db5e916c20eae3f6d772e8572eb828236')
  • source_x: A List[str]-valued field that is the names of sources from which we received this paper. Also semicolon-separated. For example, 'ArXiv; Elsevier; PMC; WHO'. There should always be at least one source listed.
  • title: A str-valued field for the paper title
  • doi: A str-valued field for the paper DOI
  • pmcid: A str-valued field for the paper's ID on PubMed Central. Should begin with PMC followed by an integer.
  • pubmed_id: An int-valued field for the paper's ID on PubMed.
  • license: A str-valued field with the most permissive license we've found associated with this paper. Possible values include: 'cc0', 'hybrid-oa', 'els-covid', 'no-cc', 'cc-by-nc-sa', 'cc-by', 'gold-oa', 'biorxiv', 'green-oa', 'bronze-oa', 'cc-by-nc', 'medrxiv', 'cc-by-nd', 'arxiv', 'unk', 'cc-by-sa', 'cc-by-nc-nd'
  • abstract: A str-valued field for the paper's abstract
  • publish_time: A str-valued field for the published date of the paper. This is in yyyy-mm-dd format. Not always accurate as some publishers will denote unknown dates with future dates like yyyy-12-31
  • authors: A List[str]-valued field for the authors of the paper. Each author name is in Last, First Middle format and semicolon-separated.
  • journal: A str-valued field for the paper journal. Strings are not normalized (e.g. BMJ and British Medical Journal can both exist). Empty string if unknown.
  • mag_id: Deprecated, but originally an int-valued field for the paper as represented in the Microsoft Academic Graph.
  • who_covidence_id: A str-valued field for the ID assigned by the WHO for this paper. Format looks like #72306.
  • arxiv_id: A str-valued field for the arXiv ID of this paper.
  • pdf_json_files: A List[str]-valued field containing paths from the root of the current data dump version to the parses of the paper PDFs into JSON format. Multiple paths are semicolon-separated. Example: document_parses/pdf_json/4eb6e165ee705e2ae2a24ed2d4e67da42831ff4a.json; document_parses/pdf_json/d4f0247db5e916c20eae3f6d772e8572eb828236.json
  • pmc_json_files: A List[str]-valued field. Same as above, but corresponding to the full text XML files downloaded from PMC, parsed into the same JSON format as above.
  • url: A List[str]-valued field containing all URLs associated with this paper. Semicolon-separated.
  • s2_id: A str-valued field containing the Semantic Scholar ID for this paper. Can be used with the Semantic Scholar API (e.g. s2_id=9445722 corresponds to http://api.semanticscholar.org/corpusid:9445722)
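
Several of the columns above are described as List[str]-valued and semicolon-separated. A small sketch of turning them into Python lists after reading a row with csv.DictReader; the helper names and the MULTI_VALUED tuple are hypothetical, chosen from the column descriptions above.

```python
# Columns described above as semicolon-separated lists.
MULTI_VALUED = ("sha", "source_x", "authors", "pdf_json_files", "pmc_json_files", "url")

def split_multi(value):
    """Split a semicolon-separated field into a list of stripped, non-empty strings."""
    return [part.strip() for part in value.split(";") if part.strip()]

def normalize_row(row):
    """Return a copy of a metadata.csv row with multi-valued columns as lists."""
    out = dict(row)
    for col in MULTI_VALUED:
        out[col] = split_multi(row.get(col) or "")
    return out

# In-memory demo using the example values quoted in the column descriptions.
example = normalize_row({
    "cord_uid": "ug7v899j",
    "source_x": "ArXiv; Elsevier; PMC; WHO",
    "sha": "",
})
```

An empty field becomes an empty list, so downstream code can check `if row["pdf_json_files"]:` in the same way as the example script.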

Questions about CORD-19

Why can the same cord_uid appear in multiple rows?

This is a very tricky issue, and we have not decided on the best way forward. To explain, let's take cord_uid=hox2xwjg as an example. Examining its rows in the metadata file, we see that they are the same paper, but sent from different sources (Elsevier, PMC). The Elsevier row has DOI and PDF, but the PMC row doesn't. Furthermore, the PMC ID, publication date, and URL differ between these rows.

Technically all of this data is representative of paper hox2xwjg so we don't want to remove any of it. But combining them into one cluster would require a schema change to the data, which would break a lot of people's code. Hopefully this is not too big an issue because only a small percentage of papers are affected, but know that this issue exists and we're debating the best way forward.
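
Until that is resolved, one way to cope is to group rows by cord_uid and decide per group how to merge them. A minimal sketch follows; the sample rows are made up apart from the cord_uid and source names mentioned above, and the function name is hypothetical.

```python
import csv
import io
from collections import defaultdict

def group_by_cord_uid(rows):
    """Group metadata rows (dicts) by cord_uid, preserving file order within each group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["cord_uid"]].append(row)
    return groups

# Made-up three-row stand-in for metadata.csv; the DOI and second cord_uid are invented.
sample_csv = io.StringIO(
    "cord_uid,source_x,doi\n"
    "hox2xwjg,Elsevier,10.0000/example\n"
    "hox2xwjg,PMC,\n"
    "zzzz0000,WHO,\n"
)
groups = group_by_cord_uid(csv.DictReader(sample_csv))

# cord_uids that occur in more than one row, with the source of each row.
repeated = {uid: [r["source_x"] for r in rows]
            for uid, rows in groups.items() if len(rows) > 1}
```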

Why do the PMC JSONs not contain any abstracts, yet the PDF JSONs contain abstracts?

Abstracts in the metadata.csv file are "gold" provided directly from publishers or digital archives. Because PMC is very consistent at providing us "gold" abstracts, we do not bother with parsing the PMC XMLs for abstract text (it's already in the metadata.csv). As such, the PMC JSONs do not contain abstracts. This is not the case for PDF JSONs. We often obtain PDFs through crawling, and in this manner, we would not have "gold" abstracts provided to us. As such, we still opt to parse the PDF for abstract text, which is why that field exists.

Why do the title/authors in the JSON look different from what's in the metadata file?

The most likely reason is PDF parsing errors. Occasionally, publishers will have different metadata from what is actually displayed on the PDF itself (e.g. slight differences in author names). We encourage users to use fields in the metadata file by default and only fall back on the JSON when a metadata field is missing.

Why is the JSON missing certain metadata, like publication dates?

The JSONs are only meant for representing the full text of the PDF in a structured, machine-readable format. Many metadata fields like dates and venues don't commonly appear on the PDF. Please defer to the metadata file for all such fields, since these come from the publishers directly.

How do you handle paper objects like tables, figures, equations?

Many papers in CORD-19 include HTML table parses. These table parses are available in the document parse files under ref_entries of type table. Note: not all tables will have HTML parses. These parses leverage IBM Watson Discovery capabilities (more details can be found in our paper).
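
A hedged sketch of pulling those table parses out of a loaded document parse. This README only says the tables live under ref_entries with type table; the dictionary layout and the `type` and `html` key names used below are assumptions to check against an actual parse file, and the function name is hypothetical.

```python
def html_tables(parse):
    """Collect (ref_id, html) pairs for table entries that carry an HTML parse.

    Assumptions: `ref_entries` maps reference ids to dicts, each with a
    'type' key and, for some tables, an 'html' key.
    """
    tables = []
    for ref_id, entry in parse.get("ref_entries", {}).items():
        if entry.get("type") == "table" and entry.get("html"):
            tables.append((ref_id, entry["html"]))
    return tables
```

Here `parse` is the dictionary returned by `json.load` for one file under document_parses/. Tables without an HTML parse are skipped rather than returned empty.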

Figure images are currently not available. We're looking into how to best support these. As for equations, we do not do anything special here: the symbols are treated as text and should be included in the text blobs.

What should we do if both PDF and PMC JSONs exist? Or if there are multiple PDF JSONs?

We view these as different attempts/views to represent the same paper/document. Some are going to be higher quality than others. Treat these as separate representations of the same document: you can choose to use one, both, or neither (i.e. just use the metadata fields). On average, we believe the PMC JSONs are cleaner than the PDF JSONs but that's not necessarily true.
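
If you want a single default, one option is to try PMC parses first and fall back to PDF parses. The sketch below encodes that ordering as one possible policy, not a recommendation that holds for every paper; the function name is hypothetical.

```python
def candidate_parse_paths(row):
    """List parse paths for a metadata row: PMC parses first, then PDF parses."""
    paths = []
    for col in ("pmc_json_files", "pdf_json_files"):
        value = row.get(col) or ""
        # Multiple paths in one field are semicolon-separated.
        paths.extend(p.strip() for p in value.split(";") if p.strip())
    return paths
```

A caller can loop over the returned paths and stop at the first file that contains the section it needs, or return an empty result and rely on the metadata fields alone.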

Why can the same sha appear for different cord_uid?

Let's take a look at examples cord_uid=d9v5xtx7 and cord_uid=8avkjc84. They both share PDF sha=5d0d0bd116976e1412c10a84902894999df4a342. These are two papers we sourced from Elsevier. If you follow the URLs, you'll notice that they actually retrieve the same PDF despite having different DOIs. This is an upstream error from the publisher, which we can't necessarily do anything about. Hopefully the number of these cases is small.
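
To see how many such cases a given release contains, you could index rows by sha and keep the values that appear under more than one cord_uid. A minimal sketch, with a hypothetical function name and a two-row demo built from the example above:

```python
from collections import defaultdict

def shared_shas(rows):
    """Map each sha to the cord_uids listing it, keeping shas seen under 2+ cord_uids."""
    sha_to_uids = defaultdict(set)
    for row in rows:
        # The sha field may hold several semicolon-separated values.
        for sha in (row.get("sha") or "").split(";"):
            sha = sha.strip()
            if sha:
                sha_to_uids[sha].add(row["cord_uid"])
    return {sha: uids for sha, uids in sha_to_uids.items() if len(uids) > 1}

# Demo rows containing only the two columns this check needs.
demo_rows = [
    {"cord_uid": "d9v5xtx7", "sha": "5d0d0bd116976e1412c10a84902894999df4a342"},
    {"cord_uid": "8avkjc84", "sha": "5d0d0bd116976e1412c10a84902894999df4a342"},
]
collisions = shared_shas(demo_rows)
```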

Contact

Mailing list

Subscribe to notifications about CORD-19 at: https://share.hsforms.com/1cM7MMF68RqCdbBKTcyN7VQ3ioxm

Email

Please email [email protected] and [email protected] for any questions or concerns.

Citing CORD-19

Our paper was accepted to the NLP-COVID workshop at ACL 2020. See the reviews on OpenReview: https://openreview.net/forum?id=0gLzHrE_t3z. The paper is available in the ACL Anthology (BibTeX below): https://www.aclweb.org/anthology/2020.nlpcovid19-acl.1

@inproceedings{wang-etal-2020-cord,
    title = "{CORD-19}: The {COVID-19} Open Research Dataset",
    author = "Wang, Lucy Lu  and Lo, Kyle  and Chandrasekhar, Yoganand  and Reas, Russell  and Yang, Jiangjiang  and Burdick, Doug  and Eide, Darrin  and Funk, Kathryn  and Katsis, Yannis  and Kinney, Rodney Michael  and Li, Yunyao  and Liu, Ziyang  and Merrill, William  and Mooney, Paul  and Murdick, Dewey A.  and Rishi, Devvret  and Sheehan, Jerry  and Shen, Zhihong  and Stilson, Brandon  and Wade, Alex D.  and Wang, Kuansan  and Wang, Nancy Xin Ru  and Wilhelm, Christopher  and Xie, Boya  and Raymond, Douglas M.  and Weld, Daniel S.  and Etzioni, Oren  and Kohlmeier, Sebastian",
    booktitle = "Proceedings of the 1st Workshop on {NLP} for {COVID-19} at {ACL} 2020",
    month = jul,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.nlpcovid19-acl.1"
}

Projects using CORD-19

This is a Google Sheet tracking systems and demos that use CORD-19. Projects are listed in random order. Our focus here is to collect community efforts that might not be discoverable because systems and demos don't always translate to papers (which we can find via citations of CORD-19).

Missing yours or incomplete data? Let us know using this Google Form or email us!

Other resources

S2ORC-doc2json: We use this library to process PDFs and PubMed JATS XML into the format released in CORD-19. This library can be adapted to produce your own versions of the dataset. Source code and instructions for using the library can be found here.

Semantic Scholar API: Metadata, paper abstracts, and citation information for papers we index are available through our API. Documentation here.

S2ORC: A dataset of millions of full text papers processed in the same way as CORD-19, but covering many different fields of science. Not regularly updated; intended for offline research, like model development. Available here.

PubMed Central: The National Library of Medicine (NLM) continues to collaborate with publishers to make COVID-19 and coronavirus-related publications and associated data immediately accessible in PubMed Central (PMC) in human- and machine-readable forms. Available here.

LitCovid: NLM continues to update its LitCovid dataset of COVID-19 related publications to facilitate text mining. Available here.
