Awesome Scholarly Data Analysis
A list of resources on scholarly data analysis, including datasets, papers, and code for bibliometrics, citation analysis, and other scholarly commons resources. Available online at https://shubhanshu.com/awesome-scholarly-data-analysis/
Datasets
Publication and Citation
- Arnet Miner
- Microsoft Academic Graph
- OpenAlex - Replacement for MAG
- Open Academic Graph - MAG + AMiner
- OpenAIRE Research Graph - More info here
- Semantic Scholar Corpus
- CiteSeer
- PubMed
- CORA datasets for citation string parsing
- Humanities and multilingual citation string parsing: Flux-CiM and ICONIP (see the Neural ParsCit paper for details)
- Citation string parsing data for social sciences for English and German citations - comparison with Grobid and Cermine
- CrossRef DOI URLs
- DOIboost (CrossRef + MAG + ORCID + Unpaywall)
- DBLP Citation dataset
- DBLP XML data
- DBLP Discovery Dataset (D3)
- NBER Patent Citations
- Scopus Citation Database
- Papers, patents, and grants from Indiana University
- Small Network Data - Mark Newman's Lab
- The Koblenz Network Collection
- Google Scholar citation relations
- Google Scholar Citations data set direct-download
- Open citations project
- Wikicite Project
- Economic Papers
- ArXiv data dump
- ArXiv data on Kaggle
- EuropePMC
- Complete ACL anthology as bibtex file
- ACL Anthology Reference Corpus
- Astrophysics data system (ADS) - All physics papers
- CORE 37M full text open access papers
- Inspire database for high energy physics articles
- Scholarly Data of workshops and conferences in RDF triples
- The Collection of Computer Science Bibliographies
- OpenCitations corpus
- COCI Doi-Doi citation data
- DOAJ API (Directory of Open Access Journals)
- ROAD (Directory of Open Access Scholarly Resources)
- Sherpa/Romeo (Publisher copyright policies & self-archiving)
- OpenAPC (fees paid for open access journal articles)
- OSF API (Open Science Framework)
- Digital tools for researchers
- Fatcat - versioned, publicly-editable catalog of research publications
- Microsoft Academic Knowledge Graph - RDF dump
- arXiv CS citation in context
- arXiv fulltext + citations dataset
- Self-citation analysis data based on PubMed Central subset (2002-2005)
- Unpaywalled Corpus - PDF to 23M DOIs Data Schema
- A dataset of publication records for Nobel laureates - paper
- OpenAIRE Scholexplorer - 126+ Million literature-dataset and dataset-dataset links between 12+ Million objects - About the data
- Manually annotated citation data from the ACL Anthology into uses, motivation, future, extends, compare or contrast, and background
- iCite - NIH Open Citation Collection
- MEDLINE/PubMed Baseline Repository (MBR) - All Medline abstracts and paper metadata in XML
- American Physical Society Data Sets for Research
- Co-citation networks of all Nature papers
- Semantic Scholar Graph of References in Context (GORC) dataset
- Multiple journal publication datasets
- Structured citations in the English Wikipedia
- ICSR Lab (free for researchers) for Scopus and PlumX use
- COVID-19 Open Research Dataset (CORD-19)
- PaperRobot - includes PubMed Paper Reading Dataset
- SciMag - Microsoft Academic Linked to SciMago Journals - WebPage
- SciGraph Springer Nature
- Citations to scholarly data in various language wikipedias Code
- 800K publications matched from CrossRef, CORE, and Mendeley with data on publication and open access dates
- Coronavirus Open Citations Dataset
- Crossref dumps DOI meta-data
- S2ORC: The Semantic Scholar Open Research Corpus - 12.7M full text papers
- Wikipedia Citations: A comprehensive dataset of citations with identifiers extracted from English Wikipedia
- Microsoft Academic Data for conducting covid-19 research
- Initiative for Open Abstracts
- Dataset Search: metadata for datasets - Datasets with DOIs and compact identifiers
- Open Syllabus Project
- Journal Causal effect in Citations
- Sci-Hub Download Logs - Latest
- Sci-Hub databases
- SAGE Rejected article tracker dataset from ArXiv - Github
- The Open Research Knowledge Graph (ORKG)
- ACADEMIA INDUSTRY DYNAMICS
- Test of Time Awards
- ACL-Cite-Net
- The DBLP Discovery Dataset (D3): A Massive Dataset of Scholarly Metadata for Analyzing the State of Computer Science Research Zenodo
- Papers and patents are becoming less disruptive over time - Paper
- OpenAIRE Research Graph Dump
- OpCitance: Citation contexts identified from the PubMed Central open access articles
- A large dataset of scientific text reuse in Open-Access publications
Peer Review
- PeerRead - paper drafts, reviews, and accept/reject decision
- CiteTracked: A Longitudinal Dataset of Peer Reviews and Citations - Contact Author
- Elsevier's Peer Review Workbench
- ACL-18 Numerical Peer Review Dataset
- Argument Mining for Understanding Peer Reviews
- APE: Argument Pair Extraction - Annotated ICLR 2013-2020 review-rebuttal argument pair
- Argument Mining Driven Analysis of Peer-Reviews Dataset
- Publons review length dataset with 498K reviews - anonymized
- Peer review analyze: A novel benchmark resource for computational analysis of peer reviews
- Open Editors: data about scholarly journals' editors and editorial board members - Github
- NLPEER: A Unified Resource for the Computational Study of Peer Review
- eLife Open Peer Review Corpus
- PLoS Open Peer Review Corpus
- MDPI Open Peer Review Corpus
Grants and Funding
- GrantExplorer: a free, open-source tool for examining the phrases funded by U.S. federal agencies
- USASpending.gov: Award Data Archive
- NIH research funding
- Authors linked to PIs in NIH Grants
Academic Genealogy
- Mathematics Genealogy Project
- Academic Tree - Cross discipline academic genealogies
- MPACT project - Library Sciences
- PhDTree
- Chemistry Genealogy - curated at UIUC
- Notre Dame Genealogy Project
- UIUC Chemistry, Chemical Engineering, and Biochemistry
- Software Engineering Academic Genealogy
- Other lists of genealogy projects
- Wikipedia - Computer Science Genealogy
- Wikipedia - Theoretical Physicists Genealogy
- Wikipedia - Chemists Genealogy
- SCIENTIFIC GENEALOGY MASTER LIST - Scientists Associated with Concepts in Chemistry & Physics
- Economic Genealogy Text Format
- S2AMP : Semantic Scholar Analysis of Mentorship Dataset
- MENTORSHIP - A dataset of mentorship in science with semantic and demographic estimations - Code
Author Profiles
- Temporal profiles of PubMed authors
- ORCID data dump
- National Library of Medicine Profiles
- UIUC Professors database - Publications, Affiliations
- Author Profiles of scholarly authors in Wikipedia
- Career Transitions of CS students
- Author name gender and ethnicity dataset based on PubMed
- MapAffil 2016 dataset -- PubMed author affiliations mapped to cities and their geocodes worldwide
- Conceptual novelty scores for PubMed articles
- A database of 100,000 top scientists that provides standardized information on citations, h-index, co-authorship-adjusted hm-index, citations to papers in different authorship positions, and a composite indicator
- Canadian PhD career survey - Science report
- Data from the CVs of over 150 assistant professors in psychology at top-ranked research universities and small liberal arts colleges in the US - Used in this blog
- Wikidata Author Disambiguation Dataset
- The 4 Universities Data Set - Web pages of CS departments classified for author role (faculty, student, etc.)
- Journal editors dataset
- Career long various citation metrics for 100,000 top-scientists
- Network-Data-Career-Transitions - two anonymized network datasets of post-PhD career transitions and trajectories in computing research
- Open dataset of scholars on Twitter - 500K OpenAlex Author ID to Twitter User Id
- Gender Inequities in the Online Dissemination of Scholars’ Work
Author name disambiguation
- INSPIRE dataset
- Lee Giles dataset
- Cleaner version of Lee Giles dataset
- DBLP Korean Authors
- Arnet Miner
- Arnet Miner - Manual Name Disambiguation data 210 authors
- DBLP Name disambiguation dataset - Error corrected version
- rexa-coref-data
- Deduped author names on IEEE Vis papers 1990-2018
- Author-ity dataset for PubMed 2009
- ACL Anthology dataset
- Base data for estimating precision and recall of Author-ity among NIH-funded scientists
- ORCID-Linked Labeled Data for Evaluating Author Name Disambiguation at Scale
- S2AND - Semantic Scholar Author Name Disambiguation Tool and Dataset
- BibTex Dataset for 1M authors
- Ethnicity sensitive author disambiguation from INSPIRE HEP
- Pre-processed PubMed data for a study of coauthorship
- WhoIsWho: Web-Scale Academic Name Disambiguation: the WhoIsWho Benchmark, Leaderboard, and Toolkit - https://www.aminer.cn/whoiswho
- LAGOS-AND: A Large Gold Standard Dataset for Scholarly Author Name Disambiguation - Github
- Chain Dream : Name Disambiguation Task2
Thesis datasets
- Open Access Theses and Dissertations
- The Networked Digital Library of Theses and Dissertations (NDLTD)
- PhD Dissertations in the Area of Software Engineering
- ProQuest Dissertations & Theses Global
- History Dissertation Analysis
- Peer-making: the interconnections between PhD Thesis Committee membership and co-publishing - Zenodo
- DISAPERE: A Dataset for DIscourse Structure in Academic PEer REview
- ETDs: Virginia Tech Electronic Theses and Dissertations
- DSpace@MIT: a digital repository for MIT's research, including peer-reviewed articles, technical reports, working papers, theses, and more
- The ScanBank Dataset: A Benchmark Dataset for Figure Extraction from Scanned Electronic Theses and Dissertations
- ETDMiner: extract metadata from scanned ETD Google Drive
Information Extraction and NLP
- Citation Parsing
- Citation Parsing in humanities
- Sentences tagged for Drug Disease pairs
- Document Summarization and citation span identification
- ACL Anthology human summaries for 1000 papers
- Keyphrase Extraction
- Related Work Summarization
- Biomedical NLP annotated datasets
- Chemical compound and drug name recognition task
- Semantic Scholar Dataset
- ScienceIE
- ACL RD TEC 2.0 also at @CLARIN
- SEPID Corpus - Segmented ACL ARC 1.0
- PubMed Central Open Access - BioC
- PubMed Fulltext - protein-protein and genetic interactions
- BioNLP - Argo
- Biomedical NLP - Stav
- GENIA - BioNLP 2011
- Genia Treebank used for SciSpacy training - SciSpacy link
- Full GENIA corpus
- Anatomical Entity Mention (AnEM) corpus
- CellFinder - Entity detection
- Multi-Level Event Extraction (MLEE)
- Biomedical sentence simplification
- PubMed - Colorado Richly Annotated Full-Text
- Biomedical NER datasets related publication
- BioVerbNet
- Lunar and Planetary Science abstracts for NER and Relations
- ACM data affiliations
- ACM - DBLP database entry matching
- Colorado Richly Annotated Full-Text - PubMed abstract annotated with entities mapped to 10 biomedical ontology terms.
- CLEF datasets for multilingual Biomedical NLP+IE
- MedMentions - UMLS entities in PubMed
- Coleridge Initiative - Rich Context competition
- SciERC - scientific entities, their relations, and coreference clusters for 500 AI conf abstracts
- PubMed200k_RCT - Label abstract sentences into Objective, Background, Method, Results, Conclusions
- NER, Parsing, Classification datasets from SciBERT
- ACA Wiki - Paper summaries of more than 1600 papers
- SemEval-2018 task 7 Semantic Relation Extraction and Classification in Scientific Papers
- A Compendium of Free, Public Biomedical Text Mining Tools Available on the Web
- Medical Information Extraction from PubMed abstracts
- Corpus of 40 scientific papers manually annotated by multiple scientific discourse facets
- PharmaCoNER: Pharmacological Substances, Compounds and proteins and Named Entity Recognition track - Train - Dev - Test - Background Test set
- Bacteria Biotope (BB) Task - NER, NEL, Relation, KB Extraction
- Entity/relation recognition and GOF/LOF mutated gene text identification task based on the Active Gene Annotation Corpus
- The Regulatory Network of Plant Seed Development (SeeDev) Task - NER, Relation
- TalkSumm - Summary of papers via alignment to talks
- SeminalSurveyDBLP - Classification of seminal or survey papers
- Supp.ai - PubMed supplement-drug interactions and supplement-supplement interactions
- GENETAG - More recent versions Publication and Download 2005
- MedTag: A Collection of Biomedical Annotations - Download
- Open Biomedical corpora
- Biomedical Abstract Meaning Representation corpus based on PubMed Fulltext - Also see other NLM curated biomedical resources
- SciDTB: Discourse Dependency TreeBank for Scientific Abstracts
- SciDTB corpus annotated for argumentation mining - Paper
- Dr. Inventor Multi-layer Scientific Corpus for multiple scientific discourse facets
- ART corpus - 225 papers manually annotated with the CISP labels (i.e. "Goal", "Method", "Result") - Browse files - Project details
- Multi-CoreSC CRA corpus (MCCRA) - 50 papers annotated with multiple CoreSC labels per sentence. - Project details
- PubMedQA - Question answering on PubMed
- Corposaurus - Collection of biomedical corpus for NER
- BioNER corpus
- NeuroQuery - 14,000 full-text publications and 400,000 peak activations - NeuroQuery website
- Medical Information Extraction dataset
- A Large Parallel Corpus of Full-Text Scientific Articles
- Annotated Corpus of Scientific Conference's Homepages for Information Extraction
- Chi QA - Health Question Answering dataset from NLM
- Corpus of Open Access articles from multiple fields in Science, Technology, and Medicine - Includes wikification data
- Grounding Scientific Entity References in STEM Scholarly Content to Authoritative Encyclopedic and Lexicographic Sources
- Open Research Knowledge Graph project - Website
- Academic PhraseBank
- SciKG - Statement extraction datasets
- A Fully Coreference-annotated Corpus of Scholarly Papers from the ACL Anthology
- A manual corpus of annotated main findings of clinical case reports
- TREC Precision Medicine / Clinical Decision Support Track
- Lots of biomedical entity linking and entity identification datasets
- Materials Science Named Entity Recognition: train/development/test sets
- Entities in 3.27 million materials science abstracts
- Normalized entities in material science papers
- Named Entity Recognition for Bacterial Type IV Secretion Systems - Paper
- Annotating and detecting phenotypic information for chronic obstructive pulmonary disease
- MiRoR11 - P2 - Annotated corpus for primary and reported outcomes extraction
- Data from: PGxCorpus, a Manually Annotated Corpus for Pharmacogenomics
- Multiple PUBMED annotated corpora from iProLink project
- Mars Target Encyclopedia - LPSC abstracts labeled data set
- Annotation of phenotypes using ontologies
- The SPECIES and ORGANISMS Resources for Fast and Accurate Identification of Taxonomic Names in Text - SPECIES Direct Download - ORGANISMS Direct Download
- Entity mention in articles used for benchmark
- RAMBO 800+: A Corpus for the Development of Gene/Protein Recognition from Rare and Ambiguous Abbreviations
- Medical Relation Extraction - CrowdTruth
- KP20k - Keyphrase extraction on 20k abstracts
- Named Entity Recognition: (17.3 MB), 8 datasets on biomedical named entity recognition
- Relation Extraction: (2.5 MB), 2 datasets on biomedical relation extraction
- Question Answering: (5.23 MB), 3 datasets on biomedical question answering task
- SciREX : A Challenge Dataset for Document-Level Information Extraction
- Papers with Code - Links between papers and repositories and extraction of SOTA results
- Citation Context Classification based on purpose
- Citation Context Classification based on influence
- PubMed knowledge graph (PKG) Figshare
- Citation and Header Datasets
- Grobid-NER data
- Multiple NER and Entity Linking data for science
- Scitation Context Classification
- S2ORC: The Semantic Scholar Open Research Corpus - 12.7M full text papers
- EuropePMC annotations for entities and relationships
- NLPContributionGraph - Structuring Scholarly NLP Contributions in the Open Research Knowledge Graph
- GROBID NER
- GROBID Sequence Labeling data
- The General Index - Metadata, Ngrams, and Keyphrases in 107,233,728 journal articles
- Pubtrends Review Dataset
- PubTator Central (PTC) - NLP annotated PMC datasets
- PubMedCentral Author Manuscript Collection
- Paper analyzer pubmed
- NER on Material Science Papers
- SoMeSci - Software Mentions in Science
- NLMChem a new resource for chemical entity recognition in PubMed full-text literature
- Scientific summarization datasets
- PubMed Classification
- Annotated scientific findings with sentence-level and aspect-level certainty
- SoftwareKG_Social and SoftwareKG_PubMed - Software mentions in articles
- Bioinformatics Named Entity Recogniser for Databases and Software
- The CodeMeta Project: preservation, discovery, reuse, and attribution of software
- Social Science Software Citation Dataset
- SoMeSci - A 5 Star Open Data Gold Standard Knowledge Graph of Software Mentions in Scientific Articles
- Softcite dataset: A gold-standard dataset of software mentions in research publications for supervised learning based named entity recognition
- SoftwareKG-PMC:a Knowledge Graph of Software mentions extracted from articles of the PMC Open Access Dataset
- DEAL: Detecting Entities in the Astrophysics Literature
- COMPUTER SCIENCE KNOWLEDGE GRAPH
- SCIERC: Multi-Task Identification of Entities, Relations, and Coreference for Scientific Knowledge Graph Construction - Code
- University of Washington BIO NLP datasets
- multimodal_summ: Multimodal summarization of research papers
- ACL Anthology Corpus - Full Text
- Entity Linking of Crossref Funding Orgs in Acknowledgements - paper
- Microsoft Academic Knowledge Graph (MAKG) - Zenodo ComplEx entity embeddings (120 GB) for all 243 million authors, 239 million publications, 49,000 journals, and 16,000 conferences
- Wikidata:WikiProject Clinical Trials
- A Dataset of Alt Texts from HCI Publications
- PubMed-OA-Extraction-dataset
- SciRepEval: A Multi-Format Benchmark for Scientific Document Representations
- The MAPLE Benchmark for Scientific Literature Tagging
Networks
- ACL Anthology Network
- I³ Open Innovation Dataset Index - Multiple datasets related to patent networks, inventor careers, etc.
- Science4cast Competition - capture the evolution of scientific concepts and predict which research topics will emerge in the coming years
Taxonomies and Ontologies of Research Concepts
- SciGraph Springer Nature
- Medical Subject Headings maintained by the National Library of Medicine of the United States
- Computer Science Ontology maintained by Scholarly Knowledge: Modeling, Mining and Sense Making
- Physics Subject Headings (PhySH) maintained by American Physical Society (APS) GitHub
- Open Biological and Biomedical Ontology (OBO) maintained by the OBO Foundry
- ACM Computing Classification System maintained by the Association for Computing Machinery
- Physics and Astronomy Classification Scheme (PACS) maintained by the American Institute of Physics (AIP); discontinued in 2010 and replaced by Physics Subject Headings
- Mathematics Subject Classification (MSC) maintained by Mathematical Reviews and zbMATH
- Journal of Economic Literature (JEL) maintained by the American Economic Association
- STW Thesaurus for Economics maintained by ZBW - Leibniz Information Centre for Economics
- Australian and New Zealand Standard Research Classification (ANZSRC) maintained by the Australian Bureau of Statistics; it consists of 3 sub-classification schemes:
  - Fields of Research (FoR) classification
  - Research Fields, Courses and Disciplines (RFCD) classification
  - Socio-Economic Objective (SEO) classification
- Library of Congress Classification (LCC) maintained by Library of Congress
- Fields of Study (FoS) maintained by Microsoft Academic
- CrossRef Open Funder's Registry
- Scientific Keyphrase Extraction Datasets - KP20k, NUS, MAG_KP
- Grounding Scientific Entity References in STEM Scholarly Content to Authoritative Encyclopedic and Lexicographic Sources
- XL-BEL - a benchmark for cross-lingual biomedical entity linking that spans 10 typologically diverse languages
- IteraTeR: Understanding Iterative Revision from Human-Written Text based on ArXiv abstract edit versions
- CiteSum: Citation Text-guided Scientific Extreme Summarization and Low-resource Domain Adaptation
- AckExtract: Extraction of acknowledgements and their named entities from scholarly papers
- The MSVEC Dataset: Multi-Domain Scientific Claim Verification Evaluation Corpus (MSVEC)
- GIANT: The 1-Billion Annotated Synthetic Bibliographic-Reference-String Dataset for Deep Citation Parsing - dataverse
Affiliations
Altmetrics and Dimensions
- Altmetrics API
- Dimensions.ai API - documentation, example
- Core Conference Rankings
- China Computer Federation Conference Rankings
Tools
User interface to publication datasets and analysis
- Google Scholar
- Semantic Scholar
- Microsoft Academic Graph
- OpenAIRE Explore
- AceMap
- GitXiv
- ACL Anthology
- NIPS papers
- Abel tools for PubMed data
- infolis: linking research data and publications
- Metrics toolkit
- Rcrossref (R library)
- Rscopus (R library)
- Scholar (R library)
- Bibliometrix (R library)
- CITAN (R library)
- BibeR (BibeR: A Web-based tool for bibliometric analysis in scientific literature)
- scihub.py (Python library)
- SoPaper (Python library)
- CiteSeer tools
- Novelty quantification in PubMed articles
- TidyPMC - R based PMC XML parser
- PublicationHarvester - Download PubMed publications of an author
- Publish or Perish - retrieves and analyzes academic citations from MS Academic and Scholar
- Affiliation string parser
- CiteSeerX
- Data Set Knowledge Graph (DSKG) - an RDF data set about data sets
- Citation Gecko - Find related papers
- pySciSci - Python tool for working with MAG, PubMed, etc.
- ACM Digital Library
Tools for collecting open access papers
- ContentMine - getpapers
- rcoreoa - CORE API R client
- metaknowledge - A Python library for doing bibliometric and network analysis in science and health policy research
- PubMedPortable - PubMed to Postgres
- medic - Parsing MEDLINE and storing into a DB
Tools for classifying research papers
Visualizations
Language Processing and Information Extraction
- Biomedical - BioSentVec Embeddings
- Biomedical embeddings - CambridgeLTL
- NIH scientific paper pre-processing
- SciSpacy - spaCy models for Biomedical NLP from AllenAI
- Multitask Biomedical NER
- SciBERT - BERT LM for Biomedical and CS papers
Citation and metadata extraction
- CERMINE
- Grobid
- EXCITE (Extraction of Citations from PDF Documents)
- Science-Parse
- unarXive (Citation in context from arXiv)
- Biblio-Glutton
- PDF/LaTeX to JSON
- CrossRef Reference Matching code and evaluation data
- Citation style classifier and evaluation data
- refextract - extracting references used in scholarly communication
Publication and Publisher Info
Author Name Disambiguation
Community
Journals
- Frontiers in Research Metrics and Analytics
- Scientometrics
- Journal of Informetrics
- Quantitative Science Studies (Open Access)
- Science, Technology, and Human Values
- Social Studies of Science
- Science and Public Policy
Conferences
- Joint Conference on Digital Libraries (JCDL)
- International Conference on Theory and Practice of Digital Libraries (TPDL)
- European Semantic Web Conference (ESWC), Research of Research Track
- STI Conference series (Science and Technology indicators, e.g., 2018)
- ISSI Conference series (International Conference on Scientometrics & Informetrics, e.g., 2019)
Workshops
- SIGMET - Metrics workshop
- International Workshop on Mining Scientific Publications
- Semantics, Analytics, Visualisation: Enhancing Scholarly Dissemination (SAVE-SD)
- Workshop on Reframing Research (RefResh)
- Enabling Open Semantic Science (SemSci)
- Workshop on Scholarly Document Processing
Summer Schools
Courses
Associations & Community
- International Society for Scientometrics and Informetrics (ISSI)
- European Network of Indicator Designers (ENID)
- 4S (Society for Social Studies of Science)
- SIG/MET - Special Interest Group for the measurement of information production and use
Research Groups
Blogs
Contributions
The following people have contributed to the items on this list.
- Shubhanshu Mishra - Maintainer of the list.
- Angelo Antonio Salatino
- Philipp Zumstein
- Ali (Aliakbar Akbaritabar)
- Andrea Mannocci