• Stars
    star
    1,566
  • Rank 28,734 (Top 0.6 %)
  • Language
    Python
  • License
    Apache License 2.0
  • Created over 5 years ago
  • Updated 6 months ago

Reviews

There are no reviews yet. Be the first to send feedback to the community and the maintainers!

Repository Details

A full spaCy pipeline and models for scientific/biomedical documents.

This repository contains custom pipes and models related to using spaCy for scientific documents.

In particular, there is a custom tokenizer that adds tokenization rules on top of spaCy's rule-based tokenizer, a POS tagger and syntactic parser trained on biomedical data and an entity span detection model. Separately, there are also NER models for more specific tasks.

Just looking to test out the models on your data? Check out our demo (Note: this demo is running an older version of scispaCy and may produce different results than the latest version).

Installation

Installing scispacy requires two steps: installing the library and intalling the models. To install the library, run:

pip install scispacy

to install a model (see our full selection of available models below), run a command like the following:

pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.3/en_core_sci_sm-0.5.3.tar.gz

Note: We strongly recommend that you use an isolated Python environment (such as virtualenv or conda) to install scispacy. Take a look below in the "Setting up a virtual environment" section if you need some help with this. Additionally, scispacy uses modern features of Python and as such is only available for Python 3.6 or greater.

Setting up a virtual environment

Conda can be used set up a virtual environment with the version of Python required for scispaCy. If you already have a Python environment you want to use, you can skip to the 'installing via pip' section.

  1. Follow the installation instructions for Conda.

  2. Create a Conda environment called "scispacy" with Python 3.9 (any version >= 3.6 should work):

    conda create -n scispacy python=3.9
  3. Activate the Conda environment. You will need to activate the Conda environment in each terminal in which you want to use scispaCy.

    source activate scispacy

Now you can install scispacy and one of the models using the steps above.

Once you have completed the above steps and downloaded one of the models below, you can load a scispaCy model as you would any other spaCy model. For example:

import spacy
nlp = spacy.load("en_core_sci_sm")
doc = nlp("Alterations in the hypocretin receptor 2 and preprohypocretin genes produce narcolepsy in some animals.")

Note on upgrading

If you are upgrading scispacy, you will need to download the models again, to get the model versions compatible with the version of scispacy that you have. The link to the model that you download should contain the version number of scispacy that you have.

Available Models

To install a model, click on the link below to download the model, and then run

pip install </path/to/download>

Alternatively, you can install directly from the URL by right-clicking on the link, selecting "Copy Link Address" and running

pip install CMD-V(to paste the copied URL)
Model Description Install URL
en_core_sci_sm A full spaCy pipeline for biomedical data with a ~100k vocabulary. Download
en_core_sci_md A full spaCy pipeline for biomedical data with a ~360k vocabulary and 50k word vectors. Download
en_core_sci_lg A full spaCy pipeline for biomedical data with a ~785k vocabulary and 600k word vectors. Download
en_core_sci_scibert A full spaCy pipeline for biomedical data with a ~785k vocabulary and allenai/scibert-base as the transformer model. You may want to use a GPU with this model. Download
en_ner_craft_md A spaCy NER model trained on the CRAFT corpus. Download
en_ner_jnlpba_md A spaCy NER model trained on the JNLPBA corpus. Download
en_ner_bc5cdr_md A spaCy NER model trained on the BC5CDR corpus. Download
en_ner_bionlp13cg_md A spaCy NER model trained on the BIONLP13CG corpus. Download

Additional Pipeline Components

AbbreviationDetector

The AbbreviationDetector is a Spacy component which implements the abbreviation detection algorithm in "A simple algorithm for identifying abbreviation definitions in biomedical text.", (Schwartz & Hearst, 2003).

You can access the list of abbreviations via the doc._.abbreviations attribute and for a given abbreviation, you can access it's long form (which is a spacy.tokens.Span) using span._.long_form, which will point to another span in the document.

Example Usage

import spacy

from scispacy.abbreviation import AbbreviationDetector

nlp = spacy.load("en_core_sci_sm")

# Add the abbreviation pipe to the spacy pipeline.
nlp.add_pipe("abbreviation_detector")

doc = nlp("Spinal and bulbar muscular atrophy (SBMA) is an \
           inherited motor neuron disease caused by the expansion \
           of a polyglutamine tract within the androgen receptor (AR). \
           SBMA can be caused by this easily.")

print("Abbreviation", "\t", "Definition")
for abrv in doc._.abbreviations:
	print(f"{abrv} \t ({abrv.start}, {abrv.end}) {abrv._.long_form}")

>>> Abbreviation	 Span	    Definition
>>> SBMA 		 (33, 34)   Spinal and bulbar muscular atrophy
>>> SBMA 	   	 (6, 7)     Spinal and bulbar muscular atrophy
>>> AR   		 (29, 30)   androgen receptor

Note If you want to be able to serialize your doc objects, load the abbreviation detector with make_serializable=True, e.g. nlp.add_pipe("abbreviation_detector", config={"make_serializable": True})

EntityLinker

The EntityLinker is a SpaCy component which performs linking to a knowledge base. The linker simply performs a string overlap - based search (char-3grams) on named entities, comparing them with the concepts in a knowledge base using an approximate nearest neighbours search.

Currently (v2.5.0), there are 5 supported linkers:

  • umls: Links to the Unified Medical Language System, levels 0,1,2 and 9. This has ~3M concepts.
  • mesh: Links to the Medical Subject Headings. This contains a smaller set of higher quality entities, which are used for indexing in Pubmed. MeSH contains ~30k entities. NOTE: The MeSH KB is derived directly from MeSH itself, and as such uses different unique identifiers than the other KBs.
  • rxnorm: Links to the RxNorm ontology. RxNorm contains ~100k concepts focused on normalized names for clinical drugs. It is comprised of several other drug vocabularies commonly used in pharmacy management and drug interaction, including First Databank, Micromedex, and the Gold Standard Drug Database.
  • go: Links to the Gene Ontology. The Gene Ontology contains ~67k concepts focused on the functions of genes.
  • hpo: Links to the Human Phenotype Ontology. The Human Phenotype Ontology contains 16k concepts focused on phenotypic abnormalities encountered in human disease.

You may want to play around with some of the parameters below to adapt to your use case (higher precision, higher recall etc).

  • resolve_abbreviations : bool = True, optional (default = False) Whether to resolve abbreviations identified in the Doc before performing linking. This parameter has no effect if there is no AbbreviationDetector in the spacy pipeline.
  • k : int, optional, (default = 30) The number of nearest neighbours to look up from the candidate generator per mention.
  • threshold : float, optional, (default = 0.7) The threshold that a mention candidate must reach to be added to the mention in the Doc as a mention candidate.
  • no_definition_threshold : float, optional, (default = 0.95) The threshold that a entity candidate must reach to be added to the mention in the Doc as a mention candidate if the entity candidate does not have a definition.
  • filter_for_definitions: bool, default = True Whether to filter entities that can be returned to only include those with definitions in the knowledge base.
  • max_entities_per_mention : int, optional, default = 5 The maximum number of entities which will be returned for a given mention, regardless of how many are nearest neighbours are found.

This class sets the ._.kb_ents attribute on spacy Spans, which consists of a List[Tuple[str, float]] corresponding to the KB concept_id and the associated score for a list of max_entities_per_mention number of entities.

You can look up more information for a given id using the kb attribute of this class:

print(linker.kb.cui_to_entity[concept_id])

Example Usage

import spacy
import scispacy

from scispacy.linking import EntityLinker

nlp = spacy.load("en_core_sci_sm")

# This line takes a while, because we have to download ~1GB of data
# and load a large JSON file (the knowledge base). Be patient!
# Thankfully it should be faster after the first time you use it, because
# the downloads are cached.
# NOTE: The resolve_abbreviations parameter is optional, and requires that
# the AbbreviationDetector pipe has already been added to the pipeline. Adding
# the AbbreviationDetector pipe and setting resolve_abbreviations to True means
# that linking will only be performed on the long form of abbreviations.
nlp.add_pipe("scispacy_linker", config={"resolve_abbreviations": True, "linker_name": "umls"})

doc = nlp("Spinal and bulbar muscular atrophy (SBMA) is an \
           inherited motor neuron disease caused by the expansion \
           of a polyglutamine tract within the androgen receptor (AR). \
           SBMA can be caused by this easily.")

# Let's look at a random entity!
entity = doc.ents[1]

print("Name: ", entity)
>>> Name: bulbar muscular atrophy

# Each entity is linked to UMLS with a score
# (currently just char-3gram matching).
linker = nlp.get_pipe("scispacy_linker")
for umls_ent in entity._.kb_ents:
	print(linker.kb.cui_to_entity[umls_ent[0]])


>>> CUI: C1839259, Name: Bulbo-Spinal Atrophy, X-Linked
>>> Definition: An X-linked recessive form of spinal muscular atrophy. It is due to a mutation of the
  				gene encoding the ANDROGEN RECEPTOR.
>>> TUI(s): T047
>>> Aliases (abbreviated, total: 50):
         Bulbo-Spinal Atrophy, X-Linked, Bulbo-Spinal Atrophy, X-Linked, ....

>>> CUI: C0541794, Name: Skeletal muscle atrophy
>>> Definition: A process, occurring in skeletal muscle, that is characterized by a decrease in protein content,
                fiber diameter, force production and fatigue resistance in response to ...
>>> TUI(s): T046
>>> Aliases: (total: 9):
         Skeletal muscle atrophy, ATROPHY SKELETAL MUSCLE, skeletal muscle atrophy, ....

>>> CUI: C1447749, Name: AR protein, human
>>> Definition: Androgen receptor (919 aa, ~99 kDa) is encoded by the human AR gene.
                This protein plays a role in the modulation of steroid-dependent gene transcription.
>>> TUI(s): T116, T192
>>> Aliases (abbreviated, total: 16):
         AR protein, human, Androgen Receptor, Dihydrotestosterone Receptor, AR, DHTR, NR3C4, ...

Hearst Patterns (v0.3.0 and up)

This component implements Automatic Aquisition of Hyponyms from Large Text Corpora using the SpaCy Matcher component.

Passing extended=True to the HyponymDetector will use the extended set of hearst patterns, which include higher recall but lower precision hyponymy relations (e.g X compared to Y, X similar to Y, etc).

This component produces a doc level attribute on the spacy doc: doc._.hearst_patterns, which is a list containing tuples of extracted hyponym pairs. The tuples contain:

  • The relation rule used to extract the hyponym (type: str)
  • The more general concept (type: spacy.Span)
  • The more specific concept (type: spacy.Span)

Usage:

import spacy
from scispacy.hyponym_detector import HyponymDetector

nlp = spacy.load("en_core_sci_sm")
nlp.add_pipe("hyponym_detector", last=True, config={"extended": False})

doc = nlp("Keystone plant species such as fig trees are good for the soil.")

print(doc._.hearst_patterns)
>>> [('such_as', Keystone plant species, fig trees)]

Citing

If you use ScispaCy in your research, please cite ScispaCy: Fast and Robust Models for Biomedical Natural Language Processing. Additionally, please indicate which version and model of ScispaCy you used so that your research can be reproduced.

@inproceedings{neumann-etal-2019-scispacy,
    title = "{S}cispa{C}y: {F}ast and {R}obust {M}odels for {B}iomedical {N}atural {L}anguage {P}rocessing",
    author = "Neumann, Mark  and
      King, Daniel  and
      Beltagy, Iz  and
      Ammar, Waleed",
    booktitle = "Proceedings of the 18th BioNLP Workshop and Shared Task",
    month = aug,
    year = "2019",
    address = "Florence, Italy",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/W19-5034",
    doi = "10.18653/v1/W19-5034",
    pages = "319--327",
    eprint = {arXiv:1902.07669},
    abstract = "Despite recent advances in natural language processing, many statistical models for processing text perform extremely poorly under domain shift. Processing biomedical and clinical text is a critically important application area of natural language processing, for which there are few robust, practical, publicly available models. This paper describes scispaCy, a new Python library and models for practical biomedical/scientific text processing, which heavily leverages the spaCy library. We detail the performance of two packages of models released in scispaCy and demonstrate their robustness on several tasks and datasets. Models and code are available at https://allenai.github.io/scispacy/.",
}

ScispaCy is an open-source project developed by the Allen Institute for Artificial Intelligence (AI2). AI2 is a non-profit institute with the mission to contribute to humanity through high-impact AI research and engineering.

More Repositories

1

allennlp

An open-source NLP research library, built on PyTorch.
Python
11,691
star
2

OLMo

Modeling, training, eval, and inference code for OLMo
Python
3,949
star
3

RL4LMs

A modular RL library to fine-tune language models to human preferences
Python
2,020
star
4

longformer

Longformer: The Long-Document Transformer
Python
1,955
star
5

bilm-tf

Tensorflow implementation of contextualized word representations from bi-directional language models
Python
1,621
star
6

bi-att-flow

Bi-directional Attention Flow (BiDAF) network is a multi-stage hierarchical process that represents context at different levels of granularity and uses a bi-directional attention flow mechanism to achieve a query-aware context representation without early summarization.
Python
1,524
star
7

scibert

A BERT model for scientific text.
Python
1,432
star
8

ai2thor

An open-source platform for Visual AI.
C#
1,010
star
9

open-instruct

Python
932
star
10

XNOR-Net

ImageNet classification using binary Convolutional Neural Networks
Lua
839
star
11

mmc4

MultimodalC4 is a multimodal extension of c4 that interleaves millions of images with text.
Python
793
star
12

s2orc

S2ORC: The Semantic Scholar Open Research Corpus: https://www.aclweb.org/anthology/2020.acl-main.447/
Python
745
star
13

scitldr

Python
734
star
14

natural-instructions

Expanding natural instructions
Python
690
star
15

dolma

Data and tools for generating and inspecting OLMo pre-training data.
Python
678
star
16

visprog

Official code for VisProg (CVPR 2023 Best Paper!)
Python
642
star
17

papermage

library supporting NLP and CV research on scientific papers
Python
605
star
18

science-parse

Science Parse parses scientific papers (in PDF form) and returns them in structured form.
Java
566
star
19

writing-code-for-nlp-research-emnlp2018

A companion repository for the "Writing code for NLP Research" Tutorial at EMNLP 2018
Python
558
star
20

pdffigures2

Given a scholarly PDF, extract figures, tables, captions, and section titles.
Scala
514
star
21

allennlp-models

Officially supported AllenNLP models
Python
512
star
22

tango

Organize your experiments into discrete steps that can be cached and reused throughout the lifetime of your research project.
Python
507
star
23

objaverse-xl

🪐 Objaverse-XL is a Universe of 10M+ 3D Objects. Contains API Scripts for Downloading and Processing!
Python
490
star
24

dont-stop-pretraining

Code associated with the Don't Stop Pretraining ACL 2020 paper
Python
488
star
25

specter

SPECTER: Document-level Representation Learning using Citation-informed Transformers
Python
485
star
26

unified-io-2

Python
471
star
27

macaw

Multi-angle c(q)uestion answering
Python
451
star
28

document-qa

Python
420
star
29

scholarphi

An interactive PDF reader.
Python
410
star
30

deep_qa

A deep NLP library, based on Keras / tf, focused on question answering (but useful for other NLP too)
Python
405
star
31

acl2018-semantic-parsing-tutorial

Materials from the ACL 2018 tutorial on neural semantic parsing
402
star
32

unifiedqa

UnifiedQA: Crossing Format Boundaries With a Single QA System
Python
384
star
33

kb

KnowBert -- Knowledge Enhanced Contextual Word Representations
Python
359
star
34

pawls

Software that makes labeling PDFs easy.
Python
356
star
35

PeerRead

Data and code for Kang et al., NAACL 2018's paper titled "A Dataset of Peer Reviews (PeerRead): Collection, Insights and NLP Applications"
Python
354
star
36

naacl2021-longdoc-tutorial

Python
343
star
37

openie-standalone

Quality information extraction at web scale. Edit
Scala
329
star
38

python-package-template

A template repo for Python packages
Python
318
star
39

acl2022-zerofewshot-tutorial

293
star
40

allenact

An open source framework for research in Embodied-AI from AI2.
Python
293
star
41

ir_datasets

Provides a common interface to many IR ranking datasets.
Python
291
star
42

s2orc-doc2json

Parsers for scientific papers (PDF2JSON, TEX2JSON, JATS2JSON)
Python
290
star
43

beaker-cli

A collaborative platform for rapid and reproducible research.
Go
230
star
44

Holodeck

CVPR 2024: Language Guided Generation of 3D Embodied AI Environments.
Python
220
star
45

procthor

🏘️ Scaling Embodied AI by Procedurally Generating Interactive 3D Houses
Python
214
star
46

comet-atomic-2020

Python
212
star
47

FineGrainedRLHF

Python
209
star
48

fm-cheatsheet

Website for hosting the Open Foundation Models Cheat Sheet.
Python
207
star
49

spv2

Science-parse version 2
Python
206
star
50

scifact

Data and models for the SciFact verification task.
Python
206
star
51

OLMo-Eval

Evaluation suite for LLMs
Python
200
star
52

unified-io-inference

Jupyter Notebook
196
star
53

allennlp-demo

Code for the AllenNLP demo.
TypeScript
191
star
54

lumos

Code and data for "Lumos: Learning Agents with Unified Data, Modular Design, and Open-Source LLMs"
Python
190
star
55

citeomatic

A citation recommendation system that allows users to find relevant citations for their paper drafts. The tool is backed by Semantic Scholar's OpenCorpus dataset.
Jupyter Notebook
182
star
56

cartography

Dataset Cartography: Mapping and Diagnosing Datasets with Training Dynamics
Jupyter Notebook
180
star
57

savn

Learning to Learn how to Learn: Self-Adaptive Visual Navigation using Meta-Learning (https://arxiv.org/abs/1812.00971)
Python
175
star
58

vampire

Variational Methods for Pretraining in Resource-limited Environments
Python
173
star
59

objaverse-rendering

📷 Scripts for rendering Objaverse
Python
169
star
60

hidden-networks

Python
164
star
61

ScienceWorld

ScienceWorld is a text-based virtual environment centered around accomplishing tasks from the standardized elementary science curriculum.
Scala
156
star
62

vila

Incorporating VIsual LAyout Structures for Scientific Text Classification
Python
155
star
63

mmda

multimodal document analysis
Jupyter Notebook
154
star
64

cord19

Get started with CORD-19
149
star
65

PRIMER

The official code for PRIMERA: Pyramid-based Masked Sentence Pre-training for Multi-document Summarization
Python
145
star
66

dnw

Discovering Neural Wirings (https://arxiv.org/abs/1906.00586)
Python
139
star
67

tpu_pretrain

LM Pretraining with PyTorch/TPU
Python
129
star
68

deepfigures-open

Companion code to the paper "Extracting Scientific Figures with Distantly Supervised Neural Networks" 🤖
Python
129
star
69

catwalk

This project studies the performance and robustness of language models and task-adaptation methods.
Python
129
star
70

allentune

Hyperparameter Search for AllenNLP
Python
128
star
71

lm-explorer

interactive explorer for language models
Python
127
star
72

pdffigures

Command line tool to extract figures, tables, and captions from scholarly documents in PDF form.
C++
125
star
73

SciREX

Data/Code Repository for https://api.semanticscholar.org/CorpusID:218470122
Python
125
star
74

s2-folks

Public space for the user community of Semantic Scholar APIs to share scripts, report issues, and make suggestions.
125
star
75

scidocs

Dataset accompanying the SPECTER model
Python
124
star
76

gooaq

Question-answers, collected from Google
Python
116
star
77

OpenBookQA

Code for experiments on OpenBookQA from the EMNLP 2018 paper "Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering"
Python
113
star
78

allennlp-as-a-library-example

A simple example for how to build your own model using AllenNLP as a dependency.
Python
113
star
79

alexafsm

With alexafsm, developers can model dialog agents with first-class concepts such as states, attributes, transition, and actions. alexafsm also provides visualization and other tools to help understand, test, debug, and maintain complex FSM conversations.
Python
108
star
80

allennlp-semparse

A framework for building semantic parsers (including neural module networks) with AllenNLP, built by the authors of AllenNLP
Python
107
star
81

scicite

Repository for NAACL 2019 paper on Citation Intent prediction
Python
106
star
82

peS2o

Pretraining Efficiently on S2ORC!
105
star
83

multimodalqa

Python
102
star
84

commonsense-kg-completion

Python
102
star
85

real-toxicity-prompts

Jupyter Notebook
101
star
86

ai2thor-rearrangement

🔀 Visual Room Rearrangement
Python
97
star
87

embodied-clip

Official codebase for EmbCLIP
Python
97
star
88

aristo-mini

Aristo mini is a light-weight question answering system that can quickly evaluate Aristo science questions with an evaluation web server and the provided baseline solvers.
Python
96
star
89

s2search

The Semantic Scholar Search Reranker
Python
93
star
90

elastic

Python
91
star
91

reward-bench

RewardBench: the first evaluation tool for reward models.
Python
90
star
92

flex

Few-shot NLP benchmark for unified, rigorous eval
Python
89
star
93

gpv-1

A task-agnostic vision-language architecture as a step towards General Purpose Vision
Jupyter Notebook
89
star
94

manipulathor

ManipulaTHOR, a framework that facilitates visual manipulation of objects using a robotic arm
Jupyter Notebook
86
star
95

medicat

Dataset of medical images, captions, subfigure-subcaption annotations, and inline textual references
Python
85
star
96

propara

ProPara (Process Paragraph Comprehension) dataset and models
Python
82
star
97

allennlp-guide

Code and material for the AllenNLP Guide
Python
81
star
98

hierplane

A tool for visualizing trees, tailored specifically to the analysis of parse trees.
JavaScript
81
star
99

S2AND

Semantic Scholar's Author Disambiguation Algorithm & Evaluation Suite
Python
78
star
100

ARC-Solvers

ARC Question Solvers
Python
78
star