Repository Details

A corpus of Biomedical papers annotated with mentions of UMLS entities.

MedMentions: A UMLS Annotated Dataset

This is a preliminary release of the MedMentions dataset, a corpus of Biomedical papers annotated with mentions of UMLS entities. CZI Meta is releasing this data to promote NLP research on Biomedical text.

This data is being released under the CC0 license. The papers in the corpus were selected from those available from PubMed® / Medline®. Users are referred to that source for the most current and accurate version of the text for the corresponding papers.

Introduction

Corpus: The MedMentions corpus consists of 4,392 papers (Titles and Abstracts) randomly selected from among papers released on PubMed in 2016 that were in the biomedical field, were published in English, and had both a Title and an Abstract.

Annotators: We recruited a team of professional annotators with rich experience in biomedical content curation to exhaustively annotate all UMLS® (2017AA full version) entity mentions in these papers.

Annotation quality: We did not collect stringent IAA (Inter-annotator agreement) data. To gain insight into the annotation quality of MedMentions, we randomly selected eight papers from the annotated corpus, containing a total of 469 concepts. Two biologists ('Reviewers') who did not participate in the annotation task then each reviewed four papers. The agreement between Reviewers and Annotators, an estimate of the Precision of the annotations, was 97.3%.

The Full Dataset, and Subsets

  • full: This is the full dataset
  • ST21pv: This is the ST21pv subset, containing a subset of the full annotations, targeting information retrieval.

The PubTator format

The annotated data is published in PubTator format:

Each document (paper) is followed by a blank line and is represented as follows (the spaces around the separators are shown only for readability, and TAB stands for a single tab character):

PMID | t | Title text
PMID | a | Abstract text
PMID TAB StartIndex TAB EndIndex TAB MentionTextSegment TAB SemanticTypeID TAB EntityID
...

The first two lines give the Title and Abstract texts (neither contains line breaks or tabs). Each subsequent line gives one mention. StartIndex and EndIndex are 0-based character indices into the document text, which is constructed by concatenating the Title and the Abstract, separated by a single SPACE character. EndIndex is exclusive: in the example below, the five-character mention "DCTN4" spans indices 0 to 5. MentionTextSegment is the actual mention text between those character positions. EntityID is the UMLS entity (concept) ID, and SemanticTypeID is the ID of the Semantic Type that the entity is linked to in UMLS. If the UMLS entity is linked to more than one semantic type, this field contains a comma-separated list of all those type IDs. All UMLS concepts that are not in the 2017AA Active release are linked to the special semantic type UnknownType.
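The format described above can be read with a few lines of Python. The following is a minimal sketch, not part of the MedMentions release: the names Mention, Document and parse_pubtator are hypothetical, and the code assumes that mention lines are separated by real tab characters, as the format specifies.

```python
from dataclasses import dataclass, field
from typing import Iterable, Iterator, List, Optional


@dataclass
class Mention:
    start: int                 # 0-based index into title + " " + abstract
    end: int                   # exclusive end index
    text: str                  # MentionTextSegment
    semantic_types: List[str]  # one or more SemanticTypeIDs
    entity_id: str             # UMLS concept ID


@dataclass
class Document:
    pmid: str
    title: str = ""
    abstract: str = ""
    mentions: List[Mention] = field(default_factory=list)

    @property
    def text(self) -> str:
        # Document text as defined by the format: Title, one space, Abstract.
        return self.title + " " + self.abstract


def parse_pubtator(lines: Iterable[str]) -> Iterator[Document]:
    """Yield one Document per blank-line-terminated block."""
    doc: Optional[Document] = None
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            # A blank line closes the current document.
            if doc is not None:
                yield doc
                doc = None
            continue
        if "\t" in line:
            # Mention line: six tab-separated fields.
            if doc is None:
                raise ValueError("mention line before any title line")
            _pmid, start, end, text, types, entity = line.split("\t")
            doc.mentions.append(
                Mention(int(start), int(end), text, types.split(","), entity)
            )
            continue
        # Title or abstract line; split at most twice so "|" in the text survives.
        pmid, kind, body = line.split("|", 2)
        if kind == "t":
            doc = Document(pmid=pmid, title=body)
        elif kind == "a" and doc is not None:
            doc.abstract = body
    if doc is not None:
        # Tolerate a final document with no trailing blank line.
        yield doc
```

Because parse_pubtator accepts any iterable of lines, it can be fed an open file handle directly, for example `for doc in parse_pubtator(open(path, encoding="utf-8"))`, where `path` is a placeholder for wherever the corpus file has been saved locally.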

Here is an example:

25763772|t|DCTN4 as a modifier of chronic Pseudomonas aeruginosa infection in cystic fibrosis
25763772|a|Pseudomonas aeruginosa (Pa) infection in cystic fibrosis (CF) patients is associated with worse long-term pulmonary disease and shorter survival, and chronic Pa infection (CPA) is associated with reduced lung function, faster rate of lung decline, increased rates of exacerbations and shorter survival. By using exome sequencing and extreme phenotype design, it was recently shown that isoforms of dynactin 4 (DCTN4) may influence Pa infection in CF, leading to worse respiratory disease. The purpose of this study was to investigate the role of DCTN4 missense variants on Pa infection incidence, age at first Pa infection and chronic Pa infection incidence in a cohort of adult CF patients from a single centre. Polymerase chain reaction and direct sequencing were used to screen DNA samples for DCTN4 variants. A total of 121 adult CF patients from the Cochin Hospital CF centre have been included, all of them carrying two CFTR defects: 103 developed at least 1 pulmonary infection with Pa, and 68 patients of them had CPA. DCTN4 variants were identified in 24% (29/121) CF patients with Pa infection and in only 17% (3/18) CF patients with no Pa infection. Of the patients with CPA, 29% (20/68) had DCTN4 missense variants vs 23% (8/35) in patients without CPA. Interestingly, p.Tyr263Cys tend to be more frequently observed in CF patients with CPA than in patients without CPA (4/68 vs 0/35), and DCTN4 missense variants tend to be more frequent in male CF patients with CPA bearing two class II mutations than in male CF patients without CPA bearing two class II mutations (P = 0.06). Our observations reinforce that DCTN4 missense variants, especially p.Tyr263Cys, may be involved in the pathogenesis of CPA in male CF.
25763772        0       5       DCTN4   T116,T123    C4308010
25763772        23      63      chronic Pseudomonas aeruginosa infection        T047    C0854135
25763772        67      82      cystic fibrosis T047    C0010674
25763772        83      120     Pseudomonas aeruginosa (Pa) infection   T047    C0854135
...

In this example, the Title is 82 characters long. The first mention is for the UMLS concept "DCTN4 protein, human" whose UMLS id is C4308010. This entity is linked to two semantic types: "Amino Acid, Peptide, or Protein" (T116) and "Biologically Active Substance" (T123).
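The offsets in this example can be checked by hand or with a short script. The sketch below is illustrative only: the helper name mismatches is hypothetical, and the abstract is deliberately cut to its opening words, since the four mentions shown all end by index 120.

```python
title = (
    "DCTN4 as a modifier of chronic Pseudomonas aeruginosa "
    "infection in cystic fibrosis"
)
# Only the opening words of the abstract are needed for the offsets checked here.
abstract_prefix = (
    "Pseudomonas aeruginosa (Pa) infection in cystic fibrosis (CF) patients"
)

# Document text = Title + one space + Abstract, so the abstract starts at
# index len(title) + 1, which is 83 for this paper.
text = title + " " + abstract_prefix

# (StartIndex, EndIndex, MentionTextSegment) from the first four mention lines.
mentions = [
    (0, 5, "DCTN4"),
    (23, 63, "chronic Pseudomonas aeruginosa infection"),
    (67, 82, "cystic fibrosis"),
    (83, 120, "Pseudomonas aeruginosa (Pa) infection"),
]


def mismatches(text, mentions):
    """Return the mentions whose offsets do not reproduce the mention text."""
    return [(s, e, m) for s, e, m in mentions if text[s:e] != m]


print(len(title))                  # 82
print(mismatches(text, mentions))  # []
```

Slicing with text[start:end] works directly because EndIndex is exclusive, which matches Python's slice convention.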

How to cite

If you use MedMentions, please cite the following paper:

Sunil Mohan and Donghui Li. 2019. MedMentions: A Large Biomedical Corpus Annotated with UMLS Concepts. In Proceedings of the 2019 Conference on Automated Knowledge Base Construction (AKBC 2019). Amherst, Massachusetts, USA. May 2019. [Preprint]

Our Latest Model

Our model achieves state-of-the-art results (as of 2021) on UMLS concept recognition on the ST21pv subset: a lower-bound F1 score of 0.570 for mention-level entity recognition (detection and linking), and an F1 score of 0.657 for recognizing UMLS concepts at the document level. For details, please see the following paper:

Sunil Mohan, Rico Angell, Nicholas Monath, Andrew McCallum. 2021. Low Resource Recognition and Linking of Biomedical Concepts from a Large Ontology. In Proceedings of the ACM Conference on Bioinformatics, Computational Biology, and Health Informatics (ACM-BCB), 2021. [doi] [Preprint]

Other papers on MedMentions

Shikhar Murty, Patrick Verga, Luke Vilnis, Irena Radovanovic and Andrew McCallum. 2018. Hierarchical Losses and New Resources for Fine-grained Entity Typing and Linking. The 56th Annual Meeting of the Association for Computational Linguistics (ACL). Melbourne, Australia. July 2018.

Feedback, Questions

If you have any comments, questions or issues, please post a note in GitHub issues.
