BobLd/tabula-sharp

Language: C#
License: MIT License

Extract tables from PDF files (port of tabula-java)

tabula-sharp

tabula-sharp is a library for extracting tables from PDF files. It is a port of tabula-java.

Supports .NET 6, .NET Core 3.1, .NET Standard 2.0, .NET Framework 4.5.2, 4.6, 4.6.1, 4.6.2, 4.7
No Java bindings

NuGet packages are available on the releases page and on www.nuget.org.

Differences with tabula-java

Uses PdfPig, and not PdfBox.
Coordinate system starts from the bottom left point (going up) of the page, and not from the top left point (going down).
The NurminenDetectionAlgorithm is replaced by SimpleNurminenDetectionAlgorithm, because the original requires an image management library.
Table results might differ from tabula-java because of the way PdfPig builds the bounding boxes of letters.
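
The coordinate difference matters when reusing area selections written for tabula-java. The sketch below shows one way to flip a top-left-origin rectangle into bottom-left-origin values. CoordinateSketch and ToBottomLeft are hypothetical names for illustration only and are not part of tabula-sharp or PdfPig. It assumes an unrotated page whose origin is at (0, 0) and whose height is known in the same units as the rectangle.

```csharp
using System;

// Hypothetical helper for illustration; not part of tabula-sharp or PdfPig.
public static class CoordinateSketch
{
    // Input: a rectangle measured from the top-left corner of the page,
    // with y growing downward (the tabula-java convention).
    // Output: the same rectangle measured from the bottom-left corner,
    // with y growing upward.
    // Assumption: the page is not rotated and its origin is at (0, 0).
    public static (double Left, double Bottom, double Right, double Top) ToBottomLeft(
        double top, double left, double bottom, double right, double pageHeight)
    {
        if (pageHeight <= 0)
        {
            throw new ArgumentOutOfRangeException(nameof(pageHeight));
        }

        // x values are unchanged; each y value is mirrored across the page height.
        return (left, pageHeight - bottom, right, pageHeight - top);
    }
}
```

For example, on a page 792 points high, a tabula-java area with top 100, left 50, bottom 300, right 400 becomes left 50, bottom 492, right 400, top 692.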

Usage

Stream mode - BasicExtractionAlgorithm

using (PdfDocument document = PdfDocument.Open("doc.pdf", new ParsingOptions() { ClipPaths = true }))
{
	ObjectExtractor oe = new ObjectExtractor(document);
	PageArea page = oe.Extract(1);
	
	// detect candidate table zones
	SimpleNurminenDetectionAlgorithm detector = new SimpleNurminenDetectionAlgorithm();
	var regions = detector.Detect(page);
	
	IExtractionAlgorithm ea = new BasicExtractionAlgorithm();
	List<Table> tables = ea.Extract(page.GetArea(regions[0].BoundingBox)); // take first candidate area
	var table = tables[0];
	var rows = table.Rows;
}

Lattice mode - SpreadsheetExtractionAlgorithm

using (PdfDocument document = PdfDocument.Open("doc.pdf", new ParsingOptions() { ClipPaths = true }))
{
	ObjectExtractor oe = new ObjectExtractor(document);
	PageArea page = oe.Extract(1);

	IExtractionAlgorithm ea = new SpreadsheetExtractionAlgorithm();
	List<Table> tables = ea.Extract(page);
	var table = tables[0];
	var rows = table.Rows;
}

Results

[Figures: example extraction output for stream mode (BasicExtractionAlgorithm) and lattice mode (SpreadsheetExtractionAlgorithm)]

DocumentLayoutAnalysis

Document layout analysis resources for development with PdfPig.

YOLOv4MLNet

Use the YOLO v4 and v5 (ONNX) models for object detection in C# using ML.Net

camelot-sharp

A C# library to extract tabular data from PDFs (port of camelot Python version using PdfPig).

PdfPigMLNetBlockClassifier

Proof of concept of training a simple Region Classifier using PdfPig and ML.NET (LightGBM). The objective is to classify each text block in a PDF document page as one of title, text, list, table, or image.

lean-monitor-2

Windows/Linux/macOS desktop app to browse QuantConnect Lean engine backtests and monitor live performance. Original project: https://github.com/mirthestam/lean-monitor

Caly

Cross-platform pdf reader application

simple-docstrum

A step-by-step C# implementation of the Docstrum algorithm

YOLOv3MLNet

Use the YOLO v3 (ONNX) model for object detection in C# using ML.Net

Nzy3d

A .Net API for 3d charts (based on nzy3d-api)

PublayNet-maskrcnn-mlnet

Using a MaskRCNN model trained on the PublayNet dataset with ML.Net in C# / .Net for document layout analysis and page segmentation tasks.

youtube-transcript-api-sharp

This is a C# API that lets you get the transcript/subtitles for a given YouTube video. It also works for automatically generated subtitles and does not require a headless browser, unlike Selenium-based solutions. Ported from https://github.com/jdepoix/youtube-transcript-api

PdfPig.Rendering.Skia

Cross-platform library to render pdf documents as images with PdfPig using SkiaSharp

RamerDouglasPeuckerNetV2

An algorithm that decimates a curve composed of line segments to a similar curve with fewer points.

RamerDouglasPeuckerNet

Ramer-Douglas-Peucker algorithm for 2D data in C#

PdfPigSvmRegionClassifier

Proof of concept of a simple SVM Region Classifier using PdfPig and Accord.Net. The objective is to classify each text block in a PDF document page as one of title, text, list, table, or image.

SnipsNlu

Snips NLU C# wrapper library to extract meaning from text

RapidOcrNet

Cross-platform OCR processing using PaddleOCR ONNX models. Based on RapidAI's RapidOCR

PublayNetSharp

Extract and convert PubLayNet data to PageXml format

GenerateRandomScatter

Tool to generate random scatter plots with their ground truth in C# using OxyPlot.

VirtualizingStackPanelDuplicatesIssue

Minimum app for issue with VirtualizingStackPanel

OxyPlot.Dark.Wpf

OxyPlot modification with a dark theme

IccProfile

ICC Profile reader