titipata/scipdf_parser

Stars: 309
Rank: 135,306 (Top 3 %)
Language: Python
License: MIT License
Created: over 5 years ago
Updated: 8 months ago


Python PDF parser for scientific publications: content and figures

SciPDF Parser

A Python parser for scientific PDFs, based on GROBID.

Installation

Use pip to install directly from this GitHub repository:

pip install git+https://github.com/titipata/scipdf_parser

Note

The parser also needs the en_core_web_sm model for spaCy. Run python -m spacy download en_core_web_sm to download it.
You can change the GROBID version in serve_grobid.sh to test the parser against a newer GROBID release.
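A quick way to confirm that the prerequisites above are importable before parsing is sketched below. It assumes the spaCy model is installed as a regular Python package named en_core_web_sm, which is how spacy download installs it. The helper name missing_packages is hypothetical and not part of scipdf.

```python
import importlib.util


def missing_packages(names):
    """Return the top-level package names in `names` that cannot be imported."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Package names taken from the installation notes; adjust as needed.
    missing = missing_packages(["scipdf", "spacy", "en_core_web_sm"])
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All prerequisites are importable.")
```

The check only looks up each package and does not load the model, so it stays cheap even when spaCy is installed.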

Usage

Run GROBID using the provided bash script before parsing a PDF:

bash serve_grobid.sh

This script downloads GROBID and runs the service on the default port, 8070. To parse a PDF from the example_data folder or from a direct URL, use the following function:

import scipdf
article_dict = scipdf.parse_pdf_to_dict('example_data/futoma2017improved.pdf')  # returns a dictionary

# A direct URL to a PDF also works. If as_list is set to True, the 'text' of each
# parsed section is returned as a list of paragraphs instead of a single string.
article_dict = scipdf.parse_pdf_to_dict('https://www.biorxiv.org/content/biorxiv/early/2018/11/20/463760.full.pdf', as_list=False)

# output example
>> {
    'title': 'Proceedings of Machine Learning for Healthcare',
    'abstract': '...',
    'sections': [
        {'heading': '...', 'text': '...'},
        {'heading': '...', 'text': '...'},
        ...
    ],
    'references': [
        {'title': '...', 'year': '...', 'journal': '...', 'author': '...'},
        ...
    ],
    'figures': [
        {'figure_label': '...', 'figure_type': '...', 'figure_id': '...', 'figure_caption': '...', 'figure_data': '...'},
        ...
    ],
    'doi': '...'
}

xml = scipdf.parse_pdf('example_data/futoma2017improved.pdf', soup=True) # option to parse full XML from GROBID
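As a rough illustration of working with the dictionary shown above, the sketch below counts the words in each parsed section. It relies only on the 'sections', 'heading', and 'text' keys from the output example, and handles 'text' both as a string and as a list of paragraphs (the as_list=True case). The helper name section_word_counts and the sample values are made up for illustration, so the snippet runs without a GROBID service.

```python
def section_word_counts(article_dict):
    """Return (heading, word_count) pairs for each parsed section.

    Accepts 'text' as a single string (as_list=False) or as a list of
    paragraph strings (as_list=True).
    """
    counts = []
    for section in article_dict.get("sections", []):
        text = section.get("text", "")
        if isinstance(text, list):
            text = " ".join(text)
        counts.append((section.get("heading", ""), len(text.split())))
    return counts


# Stand-in for the output of scipdf.parse_pdf_to_dict(...); values are invented.
sample = {
    "title": "Example title",
    "sections": [
        {"heading": "Introduction", "text": "First paragraph here."},
        {"heading": "Methods", "text": ["One paragraph.", "Another short paragraph."]},
    ],
}

for heading, n_words in section_word_counts(sample):
    print(f"{heading}: {n_words} words")
# Introduction: 3 words
# Methods: 5 words
```

In real use, pass the dictionary returned by scipdf.parse_pdf_to_dict in place of sample.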

To parse figures from PDFs using pdffigures2, run:

scipdf.parse_figures('example_data', output_folder='figures') # folder should contain only PDF files

You can see example output figures in the figures folder.
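Because the input folder for parse_figures should contain only PDF files, it may be worth checking the folder before calling it. The following is a minimal sketch using only the standard library. The helper name non_pdf_entries is hypothetical and not part of scipdf, and treating the .pdf extension case-insensitively is an assumption.

```python
from pathlib import Path


def non_pdf_entries(folder):
    """Return sorted names of entries in `folder` that are not .pdf files.

    Subdirectories and hidden files (for example .DS_Store) are reported too.
    """
    return sorted(
        entry.name
        for entry in Path(folder).iterdir()
        if not (entry.is_file() and entry.suffix.lower() == ".pdf")
    )


if __name__ == "__main__":
    folder = Path("example_data")
    if folder.is_dir():
        extras = non_pdf_entries(folder)
        if extras:
            print("Remove or move these before parsing figures:", extras)
        else:
            print("Folder contains only PDF files.")
```

An empty result means the folder is ready to pass to scipdf.parse_figures.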
