  • Stars: 544
  • Rank 78,613 (Top 2 %)
  • Language: HTML
  • License: MIT License
  • Created about 9 years ago
  • Updated 5 months ago

Repository Details

🚜 Parse text and tables from PDF files.

pdfreader

Read text and parse tables from PDF files.

Supports tabular data with automatic column detection, and rule-based parsing.

Dependencies: it is based on pdf2json, which itself relies on Mozilla's pdf.js.

ℹ️ Important notes:

  • This module is meant to be run using Node.js only. It does not work from a web browser.
  • This module extracts text entries from PDF files. It does not support photographed text. If you cannot select text from the PDF file, you may need to use OCR software first.

Installation, tests and CLI usage

After installing Node.js:

git clone https://github.com/adrienjoly/npm-pdfreader.git
cd npm-pdfreader
npm install
npm test
node parse.js test/sample.pdf

Installation into an existing project

To install pdfreader as a dependency of your Node.js project:

npm install pdfreader

Then, see below for examples of use.

Raw PDF reading

This module exposes the PdfReader class, which you instantiate. You can pass { debug: true } to the constructor to log debugging information (useful for troubleshooting).

Your instance has two methods for parsing a PDF. They return the same output and differ only in input: PdfReader.parseFileItems (as below) takes a filename, and PdfReader.parseBuffer (see "Raw PDF reading from a PDF buffer") takes data that you don't want to reference from the filesystem.

Whichever method you choose, it asks for a callback, which gets called each time the instance finds what it denotes as a PDF item.

An item object can match one of the following objects:

  • null, when the parsing is over or an error occurred.
  • File metadata, {file:{path:string}}, emitted when a PDF file is being opened. It is always the first item.
  • Page metadata, {page:integer, width:float, height:float}, emitted when a new page is being parsed. It provides the page number, starting at 1, and basically acts as a carriage return for the coordinates of the text items that follow.
  • Text items, {text:string, x:float, y:float, w:float, ...}, which you can think of as simple objects with a text property and floating-point 2D AABB (axis-aligned bounding box) coordinates on the page.

It's up to your callback to process these items into a data structure of your choice, and also to handle any errors passed to it.

For example:

import { PdfReader } from "pdfreader";

new PdfReader().parseFileItems("test/sample.pdf", (err, item) => {
  if (err) console.error("error:", err);
  else if (!item) console.warn("end of file");
  else if (item.text) console.log(item.text);
});
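To go one step beyond printing each text item, the callback can accumulate items into a structure of your own. The sketch below groups text items into lines per page, using the item shapes listed above. `createRowCollector` is a hypothetical helper (not part of pdfreader), and the items fed to it are hand-written stand-ins rather than real parser output. It assumes that items on the same visual line share exactly the same y value, which may not hold for every PDF.

```javascript
// Sketch: group text items into rows per page, keyed by their y coordinate.
// `createRowCollector` is a hypothetical helper, not part of pdfreader.
function createRowCollector() {
  const pages = []; // one entry per page: { page, rows: { [y]: [{ x, text }] } }
  let current = null;
  return {
    // Same (err, item) signature as the callback passed to parseFileItems.
    onItem(err, item) {
      if (err) throw err;
      if (!item) return; // null item: parsing is over
      if (item.page) {
        current = { page: item.page, rows: {} };
        pages.push(current);
      } else if (item.text && current) {
        const row = (current.rows[item.y] = current.rows[item.y] || []);
        row.push({ x: item.x, text: item.text });
      }
    },
    // For each page, return its lines sorted top to bottom, words left to right.
    lines() {
      return pages.map(({ page, rows }) => ({
        page,
        lines: Object.keys(rows)
          .sort((a, b) => parseFloat(a) - parseFloat(b))
          .map((y) =>
            rows[y]
              .sort((a, b) => a.x - b.x)
              .map((entry) => entry.text)
              .join(" ")
          ),
      }));
    },
  };
}

// Hand-written items in the shapes listed above (made-up values).
const collector = createRowCollector();
[
  { file: { path: "sample.pdf" } },
  { page: 1, width: 8.5, height: 11 },
  { text: "World", x: 5, y: 2.5, w: 3 },
  { text: "Title", x: 1, y: 1, w: 3 },
  { text: "Hello", x: 1, y: 2.5, w: 3 },
  null,
].forEach((item) => collector.onItem(null, item));

console.log(JSON.stringify(collector.lines()));
// prints [{"page":1,"lines":["Title","Hello World"]}]
```

With a real file, `collector.onItem` could be passed as the callback to parseFileItems, and `collector.lines()` read once the null item has been received.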

Parsing a password-protected PDF file

import { PdfReader } from "pdfreader";

new PdfReader({ password: "YOUR_PASSWORD" }).parseFileItems(
  "test/sample-with-password.pdf",
  function (err, item) {
    if (err) console.error(err);
    else if (!item) console.warn("end of file");
    else if (item.text) console.log(item.text);
  }
);

Raw PDF reading from a PDF buffer

As above, but reading from a buffer in memory rather than from a file referenced by path. For example:

import fs from "fs";
import { PdfReader } from "pdfreader";

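fs.readFile("test/sample.pdf", (readErr, pdfBuffer) => {
  // stop here if the file could not be read
  if (readErr) return console.error("error:", readErr);
  // pdfBuffer contains the file content
  new PdfReader().parseBuffer(pdfBuffer, (err, item) => {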
    if (err) console.error("error:", err);
    else if (!item) console.warn("end of buffer");
    else if (item.text) console.log(item.text);
  });
});
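Because items arrive one at a time through a callback, it can be convenient to wrap the parsing call in a Promise that resolves once the null item arrives. The following is a minimal sketch of that pattern. `collectTexts` and `fakeParse` are hypothetical names introduced for illustration, and `fakeParse` emits made-up items so the example runs without a PDF file.

```javascript
// Sketch: turn the (err, item) callback stream into a Promise of all text strings.
// `collectTexts` is a hypothetical helper; `startParsing` is any function that
// accepts the item callback and starts emitting items to it.
function collectTexts(startParsing) {
  return new Promise((resolve, reject) => {
    const texts = [];
    startParsing((err, item) => {
      if (err) reject(err);
      else if (!item) resolve(texts); // null item: end of document
      else if (item.text) texts.push(item.text);
    });
  });
}

// Stand-in for a real parser, emitting made-up items so the sketch runs on its own.
function fakeParse(callback) {
  [{ page: 1, width: 8.5, height: 11 }, { text: "Hello" }, { text: "World" }, null]
    .forEach((item) => callback(null, item));
}

collectTexts(fakeParse).then((texts) => console.log(texts.join(" ")));
// prints "Hello World"
```

With pdfreader, the same helper could presumably be used as `collectTexts((cb) => new PdfReader().parseBuffer(pdfBuffer, cb))`, since parseBuffer takes the buffer and the item callback.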

Other examples of use

[Image: example of parsing a CV/résumé, converting a PDF to text]

[Image: example of parsing a CV/résumé, converting a PDF table to text]

Source code of the examples above: parsing a CV/résumé.

For more, see Examples of use.

Rule-based data extraction

The Rule class can be used to define and process data extraction rules, while parsing a PDF document.

Rule instances expose "accumulators": methods that define the data extraction strategy to be used for each rule.

Example:

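import { PdfReader, Rule } from "pdfreader";

// Placeholder handlers for this example: print whatever each rule extracted.
const displayValue = (value) => console.log("extracted value:", value);
const displayTable = (table) => console.log("extracted table:", table);

const processItem = Rule.makeItemProcessor([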
  Rule.on(/^Hello \"(.*)\"$/)
    .extractRegexpValues()
    .then(displayValue),
  Rule.on(/^Value\:/)
    .parseNextItemValue()
    .then(displayValue),
  Rule.on(/^c1$/).parseTable(3).then(displayTable),
  Rule.on(/^Values\:/)
    .accumulateAfterHeading()
    .then(displayValue),
]);
new PdfReader().parseFileItems("test/sample.pdf", (err, item) => {
  if (err) console.error(err);
  else processItem(item);
});
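To picture what a rule such as "find a label, then take the value of the item that follows it" involves, here is a hand-rolled sketch in plain JavaScript. It illustrates the general idea only and is not pdfreader's implementation of the Rule class. `makeNextItemRule` is a hypothetical name, and the items are made up.

```javascript
// Illustration only: a hand-rolled "label, then value" rule, independent of pdfreader.
// `makeNextItemRule` is a hypothetical name.
function makeNextItemRule(labelRegexp, onValue) {
  let waitingForValue = false;
  return function processItem(item) {
    if (!item || !item.text) return; // skip end-of-file and metadata items
    if (waitingForValue) {
      waitingForValue = false;
      onValue(item.text); // the item right after the label is the value
    } else if (labelRegexp.test(item.text)) {
      waitingForValue = true;
    }
  };
}

// Made-up items: a label, its value, an unrelated item, then end of file.
const found = [];
const processItem = makeNextItemRule(/^Value:/, (value) => found.push(value));
[{ text: "Value:" }, { text: "42" }, { text: "other" }, null].forEach((item) =>
  processItem(item)
);
console.log(found);
// prints [ '42' ]
```

The processor keeps a single flag of state between items, which is why it has to see items in the order the parser emits them.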

Troubleshooting & FAQ

Is it possible to parse a PDF document from a web application?

Solutions exist, but this module cannot be run directly by a web browser. If you really want to use this module, you will have to integrate it into your back-end so that PDF files can be read from your server.

Cannot read property 'userAgent' of undefined error from an express-based node.js app

Dmitry found out that you may need to run these instructions before including the pdfreader module:

global.navigator = {
  userAgent: "node",
};

window.navigator = {
  userAgent: "node",
};

Source: express - TypeError: Cannot read property 'userAgent' of undefined error on node.js app run - Stack Overflow
