  • Stars: 2,175
  • Rank 21,206 (Top 0.5 %)
  • Language: Python
  • License: MIT License
  • Created about 8 years ago
  • Updated about 2 months ago

Repository Details

Simple wrapper of tabula-java: extract table from PDF into pandas DataFrame

tabula-py

tabula-py is a simple Python wrapper of tabula-java, which can read tables in a PDF. You can read tables from a PDF and convert them into a pandas DataFrame. tabula-py also enables you to convert a PDF file into a CSV, a TSV or a JSON file.

You can see the example notebook and try it on Google Colab, or we highly recommend reading our documentation, especially the FAQ section.

Requirements

  • Java 8+
  • Python 3.8+

OS

I have confirmed that it works on macOS and Ubuntu, and some users report that it also works on Windows 10. See the documentation for detailed Windows 10 installation instructions.

Usage

Install

Ensure you have a Java runtime installed and that it is on your PATH.

pip install tabula-py
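
Because tabula-py depends on a Java runtime being reachable through PATH, it can help to check for one before the first run. The following is a minimal sketch using only the Python standard library; the helper names `find_java` and `java_version_banner` are hypothetical and are not part of tabula-py.

```python
import shutil
import subprocess


def find_java():
    """Return the path of the `java` executable found on PATH, or None."""
    return shutil.which("java")


def java_version_banner(java_path):
    """Run `java -version` and return its raw banner text."""
    # `java -version` conventionally prints to stderr, so capture both streams.
    result = subprocess.run(
        [java_path, "-version"],
        capture_output=True,
        text=True,
        check=False,
    )
    return (result.stderr or result.stdout).strip()


if __name__ == "__main__":
    java_path = find_java()
    if java_path is None:
        print("No `java` executable found on PATH; install Java 8+ first.")
    else:
        print(f"Found java at {java_path}")
        print(java_version_banner(java_path))
```

The sketch only reports what it finds; it does not parse the version string, so confirming that the runtime is Java 8 or newer is left to the reader.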

Example

tabula-py enables you to extract tables from a PDF into a DataFrame, or a JSON. It can also extract tables from a PDF and save the file as a CSV, a TSV, or a JSON.

import tabula

# Read pdf into list of DataFrame
dfs = tabula.read_pdf("test.pdf", pages='all')

# Read remote pdf into list of DataFrame
dfs2 = tabula.read_pdf("https://github.com/tabulapdf/tabula-java/raw/master/src/test/resources/technology/tabula/arabic.pdf")

# convert PDF into CSV file
tabula.convert_into("test.pdf", "output.csv", output_format="csv", pages='all')

# convert all PDFs in a directory
tabula.convert_into_by_batch("input_directory", output_format='csv', pages='all')

See an example notebook for more details. I also recommend reading the tutorial article written by @aegis4048, and another tutorial written by @tdpetrou.
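
If you need per-file control that the batch call does not offer, one option is to loop over the directory yourself and call `tabula.convert_into` for each PDF. The sketch below is an illustration under assumptions, not how `convert_into_by_batch` is implemented: the helpers `plan_batch_conversion` and `convert_directory` are hypothetical, and writing each output next to its input with the extension swapped is a choice made for this example.

```python
from pathlib import Path


def plan_batch_conversion(input_dir, output_format="csv"):
    """Pair every *.pdf in input_dir with an output path placed next to it."""
    # Note: the "*.pdf" pattern is case-sensitive on most non-Windows systems.
    pdfs = sorted(Path(input_dir).glob("*.pdf"))
    return [(pdf, pdf.with_suffix("." + output_format)) for pdf in pdfs]


def convert_directory(input_dir, output_format="csv", pages="all"):
    """Convert each PDF in turn, collecting failures instead of stopping."""
    import tabula  # imported lazily so the planning helper works without Java

    failures = []
    for pdf, out in plan_batch_conversion(input_dir, output_format):
        try:
            tabula.convert_into(
                str(pdf), str(out), output_format=output_format, pages=pages
            )
        except Exception as exc:  # broad on purpose: record the error and continue
            failures.append((pdf, exc))
    return failures
```

Calling `convert_directory("input_directory")` returns a list of `(path, exception)` pairs for the files that could not be converted, which is empty when every conversion succeeds.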

Contributing

Interested in helping out? I'd love to have your help!

You can help by:

  • Reporting a bug.
  • Adding or editing documentation.
  • Contributing code via a Pull Request. See also the contribution guidelines.
  • Writing a blog post or spreading the word about tabula-py to people who might benefit from using it.

Other ways to support

You can also support our continued work on tabula-py with a donation on GitHub Sponsors or Patreon.

More Repositories

1

julia-100-exercises

julia version of 100 numpy exercises
Jupyter Notebook
129
star
2

Mykytea-python

Python wrapper for KyTea
C++
36
star
3

notebooks

Jupyter Notebook
31
star
4

ml_in_production

Machine Learning infrastructure/architecture/operation for productionization
30
star
5

MeCab.jl

Julia binding of Japanese morphological analyzer MeCab
Julia
21
star
6

cloudera-parcel

customized cloudera-parcel
Python
13
star
7

sparkavro

Load Avro data into Spark with sparklyr
R
12
star
8

ibis-demo

Demo notebook of Ibis for "Spark + Python + Data science Festival"
Jupyter Notebook
12
star
9

homebrew-cloudera

Homebrew Formulas for cloudera tools
Ruby
10
star
10

sparklyr-distribute

Example code of spark_apply with sparklyr for CDH
R
8
star
11

NLTK-pyspark

Example repository for NLTK execution on PySpark cluster with Cloudera Data Science Workbench
Python
8
star
12

spacyr-sparklyr

Example code of spacyr with sparklyr
R
8
star
13

tdworkflow

Unofficial Treasure Workflow Client
Python
7
star
14

cdsw-simple-serving-python

Python
7
star
15

Mykytea-ruby

Ruby wrapper for KyTea
C++
7
star
16

amazon-movie-review

Recommendation for Amazon movie review data
Python
6
star
17

pollynomial

AWS Polly wrapper for Ruby: Text to speech gem
Ruby
6
star
18

solar-power-prediction

Jupyter Notebook
6
star
19

cloudera-sparklyr

Build script and Demo for Cloudera Director with Sparklyr
HTML
5
star
20

hocon-validator

HOCON validator
Python
5
star
21

cJuman-installer

This is an installer for cJuman, which is a wrapper of JUMAN.
C
5
star
22

cdsw-serve-docker

REST API server example with Docker for Cloudera Data Science Workbench
5
star
23

docker-sphinx-recommonmark

Sphinx documentation toolchain, including latex and recommonmark in an Ubuntu docker container.
Dockerfile
5
star
24

sparklytd

sparklyr plugin for td-spark to connect TD from R
R
4
star
25

digdaglog2sql

Extract SQLs from digdag log
Python
4
star
26

mecab-on-pyspark

Example code for distributing Python packages on Spark cluster
Python
3
star
27

implyr-example

Example repository of implyr
R
3
star
28

JPKyteaTokenizer

Japanese tokenizer with KyTea for nltk
Python
3
star
29

pficommon_json_test

pficommon::text::json test
C++
3
star
30

molehill

Hivemall SQLs and digdag workflows generator
Python
3
star
31

morph-websocket

Real time morphological analyzing web-app.
Ruby
2
star
32

cookiecutter-digdag

A template generates digdag workflows for SQL and Python
Python
2
star
33

audience_generator

Create dummy data for Audience Studio on Treasure Data
Python
2
star
34

kytea_sinatra

Test application for KyTea with Sinatra
Ruby
2
star
35

JuliaTokyoTutorial

Julia Tokyo Tutorial
2
star
36

homebrew-jumanpp

A Homebrew formula for juman++ http://nlp.ist.i.kyoto-u.ac.jp/index.php?JUMAN++
Ruby
2
star
37

ml_intern2015

Cookpad summer intern 2015 exercise
Python
1
star
38

chezou-hugo

HTML
1
star
39

japan_weather

Python
1
star
40

mizuyarilink_octopress

CSS
1
star
41

prelims-cli

Python
1
star
42

ConfidenceWeighted.jl

confidence weighted classifier
Julia
1
star