  • Stars: 154
  • Rank: 242,095 (Top 5 %)
  • Language: C#
  • License: MIT License
  • Created about 4 years ago
  • Updated 4 months ago


Repository Details

Extract tables from PDF files (port of tabula-java)

tabula-sharp

tabula-sharp is a library for extracting tables from PDF files. It is a port of tabula-java.

Windows Linux Mac OS

  • Supports .NET 6, .NET Core 3.1, .NET Standard 2.0, .NET Framework 4.5.2, 4.6, 4.6.1, 4.6.2, 4.7
  • No Java bindings

NuGet packages are available on the releases page and on www.nuget.org.

Differences with tabula-java

  • Uses PdfPig, and not PdfBox.
  • Coordinate system starts from the bottom left point (going up) of the page, and not from the top left point (going down).
  • The NurminenDetectionAlgorithm is replaced by SimpleNurminenDetectionAlgorithm, because the original requires an image management library.
  • Table results might differ because of the way PdfPig builds Letter bounding boxes.
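
The coordinate system difference matters when reusing areas that were defined for tabula-java. As a minimal sketch, assuming the tabula-java area is given as top, left, bottom, right in points, and that the page is unrotated with its origin at (0, 0), the conversion could look like the following. ToBottomLeftOrigin is a hypothetical helper and not part of tabula-sharp, and the page height of 792 points is only an example value.

```csharp
using System;

// Hypothetical helper (not part of tabula-sharp): converts a rectangle given in
// top-left-origin coordinates (y grows downward, as in tabula-java) to
// bottom-left-origin coordinates (y grows upward).
// Assumes an unrotated page whose origin is at (0, 0).
static (double Left, double Bottom, double Right, double Top) ToBottomLeftOrigin(
    double top, double left, double bottom, double right, double pageHeight)
{
    // x values are unchanged; y values are mirrored around the page height.
    return (left, pageHeight - bottom, right, pageHeight - top);
}

// Example: an area on a page that is 792 points tall (assumed value).
var area = ToBottomLeftOrigin(top: 100, left: 50, bottom: 300, right: 400, pageHeight: 792);
Console.WriteLine($"left={area.Left}, bottom={area.Bottom}, right={area.Right}, top={area.Top}");
// prints: left=50, bottom=492, right=400, top=692
```

Pages with a rotation or a crop box that does not start at (0, 0) would need an extra offset, which this sketch does not handle.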

Usage

Stream mode - BasicExtractionAlgorithm

using (PdfDocument document = PdfDocument.Open("doc.pdf", new ParsingOptions() { ClipPaths = true }))
{
	ObjectExtractor oe = new ObjectExtractor(document);
	PageArea page = oe.Extract(1);
	
	// detect candidate table zones
	SimpleNurminenDetectionAlgorithm detector = new SimpleNurminenDetectionAlgorithm();
	var regions = detector.Detect(page);
	
	IExtractionAlgorithm ea = new BasicExtractionAlgorithm();
	List<Table> tables = ea.Extract(page.GetArea(regions[0].BoundingBox)); // take first candidate area
	var table = tables[0];
	var rows = table.Rows;
}

Lattice mode - SpreadsheetExtractionAlgorithm

using (PdfDocument document = PdfDocument.Open("doc.pdf", new ParsingOptions() { ClipPaths = true }))
{
	ObjectExtractor oe = new ObjectExtractor(document);
	PageArea page = oe.Extract(1);

	IExtractionAlgorithm ea = new SpreadsheetExtractionAlgorithm();
	List<Table> tables = ea.Extract(page);
	var table = tables[0];
	var rows = table.Rows;
}

Results

Stream mode - BasicExtractionAlgorithm

[Image: example extraction result]

Lattice mode - SpreadsheetExtractionAlgorithm

[Image: example extraction result]
