  • Stars: 117
  • Rank: 301,828 (Top 6 %)
  • Language: Python
  • License: MIT License
  • Created about 10 years ago
  • Updated 11 months ago

Repository Details

Uses publisher APIs to programmatically retrieve scientific journal articles for text mining.

article-downloader

Uses publisher-approved APIs to programmatically retrieve large amounts of scientific journal articles for text mining. Exposes a top-level ArticleDownloader class which provides methods for retrieving lists of DOIs (== unique article IDs) from text search queries, downloading HTML and PDF articles given DOIs, and programmatically sweeping through search parameters for large scale downloading.

Important Note: This package is only intended to be used for publisher-approved text-mining activities! The code in this repository only provides an interface to existing publisher APIs and web routes; you need your own set of API keys / permissions to download articles from any source that isn't open-access.

Full API Documentation

You can read the documentation for this repository here.

Installation

Install with pip install articledownloader. If you don't have pip, you can instead download the ZIP of this repository and import the ArticleDownloader class into your own Python code manually.

Usage

Use the ArticleDownloader class to download articles. You'll need an API key, and please respect each publisher's terms of use.

It's usually best to add your API key to your environment variables with something like export API_KEY=xxxxx.
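
A minimal sketch of reading that key back inside Python, assuming the variable is named API_KEY as in the export command above. The helper name load_api_key is hypothetical and is not part of articledownloader.

```python
import os


def load_api_key(env_var="API_KEY"):
    """Return the API key stored in the given environment variable.

    load_api_key is a hypothetical helper, not part of articledownloader.
    """
    key = os.environ.get(env_var)
    if not key:
        # Fail early with a hint instead of sending requests without a key
        raise RuntimeError(
            f"Set {env_var} first, e.g. `export {env_var}=xxxxx`"
        )
    return key
```

The returned value can then be passed as els_api_key when constructing ArticleDownloader, which keeps the key out of your source files.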

To find DOIs from text searches, supply a CSV file whose first column holds the search queries. Each query is used to find matching articles and retrieve their DOIs.
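
To make the expected CSV layout concrete, here is a small sketch that pulls the first column out of such a file using only the standard library. The function read_queries is a hypothetical stand-in written for illustration; it is not the library's own load_queries_from_csv, whose internals may differ.

```python
import csv
import io


def read_queries(csv_file):
    """Collect the first column of each non-empty row as a search query."""
    queries = []
    for row in csv.reader(csv_file):
        # Skip blank lines and rows whose first cell is empty
        if row and row[0].strip():
            queries.append(row[0].strip())
    return queries


# In-memory sample standing in for a real CSV file on disk
sample = io.StringIO("search query 001,\nsearch query 002,\n")
print(read_queries(sample))
```

Only the first column is read, so the trailing comma after each query is harmless.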

Examples

Downloading a single PDF article

from articledownloader.articledownloader import ArticleDownloader

downloader = ArticleDownloader(els_api_key='your_elsevier_API_key')

# PDFs are binary data, so open the output file in binary mode ('wb')
with open('my_path/something.pdf', 'wb') as my_file:
    downloader.get_pdf_from_doi('my_doi', my_file, 'crossref')

Downloading a single HTML article

from articledownloader.articledownloader import ArticleDownloader

downloader = ArticleDownloader(els_api_key='your_elsevier_API_key')

# The with block closes the file once the download finishes
with open('my_path/something.html', 'w') as my_file:
    downloader.get_html_from_doi('my_doi', my_file, 'elsevier')

Getting metadata

from articledownloader.articledownloader import ArticleDownloader

downloader = ArticleDownloader(els_api_key='your_elsevier_API_key')

# Get up to 500 DOIs for articles published after the year 2000 in a single journal
downloader.get_dois_from_journal_issn('journal_issn', rows=500, pub_after=2000)

# Get the title for a single article (only works with CrossRef for now)
downloader.get_title_from_doi('my_doi', 'crossref')

# Get the abstract for a single article (only works with Elsevier for now)
downloader.get_abstract_from_doi('my_doi', 'elsevier')

Using search queries to find DOIs

CSV file:

search query 001,
search query 002,
search query 003,
.
.
.

Python:

from articledownloader.articledownloader import ArticleDownloader

downloader = ArticleDownloader('your_API_key')

# Load the search queries from the first column of the CSV file
with open('path_to_csv_file', 'r') as csv_file:
    queries = list(downloader.load_queries_from_csv(csv_file))

# Keep up to 5 DOIs per search; extend (not append) keeps the list flat
dois = []
for query in queries:
    dois.extend(downloader.get_dois_from_search(query)[:5])

# Save each article as 0.pdf, 1.pdf, ... (binary mode for PDF data)
for i, doi in enumerate(dois):
    with open(str(i) + '.pdf', 'wb') as my_file:
        downloader.get_pdf_from_doi(doi, my_file, 'crossref')  # or 'elsevier'
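
For sweeping through search parameters, one possible approach is to generate the query CSV programmatically. The sketch below combines lists of terms into every possible query and writes them one per row, matching the CSV layout shown above. The helper names build_queries and write_query_csv, and the example terms, are hypothetical and not part of articledownloader.

```python
import csv
import io
from itertools import product


def build_queries(*term_lists):
    """Join one term from each list into a query, covering every combination."""
    return [" ".join(terms) for terms in product(*term_lists)]


def write_query_csv(queries, csv_file):
    """Write one query per row, in the first column."""
    writer = csv.writer(csv_file)
    for query in queries:
        writer.writerow([query])


# Hypothetical example terms; substitute your own search parameters
materials = ["TiO2", "ZnO"]
methods = ["sol-gel", "hydrothermal"]
queries = build_queries(materials, methods)

# An in-memory buffer stands in for a real file opened with open(path, 'w', newline='')
buffer = io.StringIO()
write_query_csv(queries, buffer)
```

The number of queries grows multiplicatively with each added term list, so it is worth checking the total against each publisher's rate limits and terms of use before running a large sweep.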
