  • Stars: 309
  • Rank: 134,507 (Top 3 %)
  • Language: Python
  • License: MIT License
  • Created about 5 years ago
  • Updated 6 months ago


Repository Details

Python PDF parser for scientific publications: content and figures

SciPDF Parser

A Python parser for scientific PDFs, based on GROBID.

Installation

Use pip to install directly from the GitHub repository:

pip install git+https://github.com/titipata/scipdf_parser

Note

  • The parser also needs spaCy's en_core_web_sm model; run python -m spacy download en_core_web_sm to download it.
  • To test the parser against a newer GROBID release, change the GROBID version in serve_grobid.sh.
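
Before parsing, it can be useful to confirm that the spaCy model is actually installed. The following is a minimal sketch using only the standard library; the helper name spacy_model_installed is hypothetical and is not part of scipdf.

```python
import importlib.util


def spacy_model_installed(model_name: str = "en_core_web_sm") -> bool:
    """Return True if the named spaCy model package is on the import path."""
    # Models fetched with `python -m spacy download` are installed as regular
    # Python packages, so find_spec can locate them without loading the model.
    return importlib.util.find_spec(model_name) is not None


if not spacy_model_installed():
    print("Model missing; run: python -m spacy download en_core_web_sm")
```

This only checks that the package can be found; it does not verify that the model version matches the installed spaCy version.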

Usage

Start GROBID with the provided bash script before parsing a PDF:

bash serve_grobid.sh
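
Parsing depends on the GROBID service started by serve_grobid.sh being reachable, so a quick availability check can save some debugging. The sketch below assumes the service listens on http://localhost:8070 and exposes GROBID's /api/isalive endpoint; the helper name grobid_is_alive is hypothetical and is not part of scipdf.

```python
import urllib.error
import urllib.request


def grobid_is_alive(base_url: str = "http://localhost:8070", timeout: float = 5.0) -> bool:
    """Return True if a GROBID service answers at base_url."""
    try:
        # Assumption: the service exposes GET /api/isalive and answers
        # with HTTP 200 when it is up.
        with urllib.request.urlopen(f"{base_url}/api/isalive", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error: treat as not running.
        return False


if __name__ == "__main__":
    print("GROBID reachable:", grobid_is_alive())
```

GROBID can take a while to start, so a False result right after launching the script may simply mean the service is not ready yet.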

The serve_grobid.sh script downloads GROBID and runs the service on the default port, 8070. To parse a PDF from the example_data folder or from a direct URL, use the following function:

import scipdf

# Parse a local PDF; returns a dictionary
article_dict = scipdf.parse_pdf_to_dict('example_data/futoma2017improved.pdf')

# Parse directly from a URL to a PDF. If as_list is set to True, the 'text' of
# each parsed section is returned as a list of paragraphs instead.
article_dict = scipdf.parse_pdf_to_dict('https://www.biorxiv.org/content/biorxiv/early/2018/11/20/463760.full.pdf', as_list=False)

# output example
>> {
    'title': 'Proceedings of Machine Learning for Healthcare',
    'abstract': '...',
    'sections': [
        {'heading': '...', 'text': '...'},
        {'heading': '...', 'text': '...'},
        ...
    ],
    'references': [
        {'title': '...', 'year': '...', 'journal': '...', 'author': '...'},
        ...
    ],
    'figures': [
        {'figure_label': '...', 'figure_type': '...', 'figure_id': '...', 'figure_caption': '...', 'figure_data': '...'},
        ...
    ],
    'doi': '...'
}

xml = scipdf.parse_pdf('example_data/futoma2017improved.pdf', soup=True) # option to parse full XML from GROBID
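
As a rough illustration of how the returned dictionary can be consumed, the sketch below builds a short summary from the keys shown in the output example. The function summarize_article and the sample data are hypothetical; the sample is a hand-written stand-in, not real parser output.

```python
def summarize_article(article_dict: dict) -> str:
    """Build a short text summary from a parse_pdf_to_dict-style dictionary."""
    lines = [f"Title: {article_dict.get('title', '')}"]
    for section in article_dict.get("sections", []):
        text = section.get("text", "")
        # With as_list=True, 'text' is a list of paragraphs; join them first.
        if isinstance(text, list):
            text = " ".join(text)
        lines.append(f"- {section.get('heading', '')}: {len(text.split())} words")
    lines.append(f"References: {len(article_dict.get('references', []))}")
    lines.append(f"Figures: {len(article_dict.get('figures', []))}")
    return "\n".join(lines)


# Hand-written stand-in that uses the same keys as the output example.
sample = {
    "title": "An example paper",
    "sections": [
        {"heading": "Introduction", "text": "One two three."},
        {"heading": "Methods", "text": ["First paragraph.", "Second one here."]},
    ],
    "references": [{"title": "A", "year": "2017", "journal": "J", "author": "X"}],
    "figures": [],
}
print(summarize_article(sample))
```

Using .get with defaults keeps the helper tolerant of missing keys, which may matter for PDFs where GROBID extracts no references or figures.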

To extract figures from PDFs using pdffigures2, run:

scipdf.parse_figures('example_data', output_folder='figures') # folder should contain only PDF files

Example output figures are in the figures folder.
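
Since the input folder passed to parse_figures should contain only PDF files, a quick pre-check can catch stray files. This is a minimal sketch using only the standard library; non_pdf_files is a hypothetical helper and is not part of scipdf.

```python
from pathlib import Path


def non_pdf_files(folder: str) -> list[str]:
    """Return the sorted names of entries in folder that are not PDF files."""
    return sorted(
        entry.name
        for entry in Path(folder).iterdir()
        # Anything that is not a regular file with a .pdf suffix is reported,
        # including subfolders and hidden files such as .DS_Store.
        if not (entry.is_file() and entry.suffix.lower() == ".pdf")
    )
```

For example, non_pdf_files('example_data') returning an empty list suggests the folder is ready to pass to scipdf.parse_figures.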

More Repositories

1. pubmed_parser (Python, 571 stars)
   📋 A Python Parser for PubMed Open-Access XML Subset and MEDLINE XML Dataset

2. detecting-scientific-claim (Python, 141 stars)
   Extracting scientific claims from biomedical abstracts (powered by AllenNLP)

3. paper-reviewer-matcher (Python, 82 stars)
   Linear programming solver for paper-reviewer matching and mind-matching

4. arxivpy (Python, 51 stars)
   Python wrapper for arXiv API

5. science_concierge (Python, 47 stars)
   📻 a Python repository for content-based recommendation based on Latent semantic analysis (LSA) topic distance and Rocchio Algorithm, see the implementation interactively on

6. affiliation_parser (Python, 37 stars)
   Simple Python parser for MEDLINE and PubMed OA affiliation strings

7. allennlp-tutorial (Jupyter Notebook, 32 stars)
   Tutorial on AllenNLP library with demo "which journal to submit paper?"

8. customize_ipython_notebook (Jupyter Notebook, 29 stars)
   🐧 CSS and logo to customize ipython notebook display for Kording lab

9. wos_parser (Python, 26 stars)
   Python parser for Web of Science XML, Web of Science parser, WoS parser

10. grant_database (Python, 20 stars)
    💵 Downloader, preprocessor, parser and deduper for NIH and NSF grants

11. yelp_dataset_challenge (Python, 19 stars)
    Play around with Yelp dataset in Python (in progress and very messy repo)

12. keyphrase_extraction (Python, 17 stars)
    Implementing keyword extraction algorithm using tf-idf weighting, see

13. affilparser (Python, 13 stars)
    Conditional Random Field (CRF) Parser for Affiliation String in MEDLINE and PubMed OA

14. penn-events-calendar (Python, 11 stars)
    University of Pennsylvania events with search and recommendation engine

15. scrape_google_scholar (Python, 5 stars)
    Snippet for scraping Google Scholar and transforming the results into a Spark DataFrame

16. titipata.github.io (JavaScript, 4 stars)
    Minimal personal page for Titipat.

17. touchbar-example (JavaScript, 4 stars)
    Sample project for the Touch Bar on the new Mac using Electron

18. cooccurence (Python, 3 stars)
    Simple class for converting documents to a co-occurrence matrix

19. dogbreed (Python, 3 stars)
    Streamlit demo for dog breed identification/classification

20. bme469_neural_control_of_movement (Jupyter Notebook, 2 stars)
    Reading, homework and project for BME 469 Neural Control of Movement (Spring 2016)

21. random_commands (2 stars)
    A place to put my note taking

22. forecast (2 stars)
    Very simple weather forecast UI

23. aibuilders-vision (Jupyter Notebook, 1 star)
    Lessons for AI Builders: Vision Track

24. google_scholar_scoreboard (Python, 1 star)
    🌽 Real time citation scoreboard from Google Scholar [default for Kording lab]

25. science_concierge_manuscript (TeX, 1 star)
    LaTeX document and PDF for Science Concierge Manuscript

26. me454_nonlinear_optimal_control (Mathematica, 1 star)
    Mathematica project for ME 454 Nonlinear Optimal Control class

27. be566_network_neuroscience (Python, 1 star)
    Analysis code for BE566 projects at UPenn