  • Stars: 395
  • Rank 109,040 (Top 3 %)
  • Language: TypeScript
  • License: MIT License
  • Created over 1 year ago
  • Updated 5 months ago

Repository Details

📄 Utilities to work with PDFs in Node.js, browser and workers

unpdf

A collection of utilities to work with PDFs. Designed specifically for Deno, workers and other nodeless environments.

unpdf ships with a redistribution of Mozilla's PDF.js built for serverless environments. Apart from some string replacements and mocks, unenv does the heavy lifting by converting Node.js specific code to be platform-agnostic. See pdfjs.rollup.config.ts for all the details.

This library is also intended as a modern alternative to the unmaintained but still popular pdf-parse.

Features

  • 🏗️ Works in Node.js, browser and workers
  • 🪭 Includes serverless build of PDF.js (unpdf/pdfjs)
  • 💬 Extract text and images from PDFs
  • 🧱 Opt-in to legacy PDF.js build
  • 💨 Zero dependencies

Installation

Run the following command to add unpdf to your project.

# pnpm
pnpm add unpdf

# npm
npm install unpdf

# yarn
yarn add unpdf

Usage

Extract Text From PDF

import { extractText, getDocumentProxy } from "unpdf";

// Fetch a PDF file from the web
const buffer = await fetch(
  "https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf",
).then((res) => res.arrayBuffer());

// Or load it from the filesystem (Node.js) instead of fetching it
// import { readFile } from "node:fs/promises";
// const buffer = await readFile("./dummy.pdf");

// Load PDF from buffer
const pdf = await getDocumentProxy(new Uint8Array(buffer));
// Extract text from PDF
const { totalPages, text } = await extractText(pdf, { mergePages: true });

Use Legacy Or Custom PDF.js Build

Generally, you don't need to worry about the PDF.js build. unpdf ships with a serverless build of the latest PDF.js version. However, if you want to use the official PDF.js version or the legacy build, you can define a custom PDF.js module.

// Before using any other methods, define the PDF.js module
import { defineUnPDFConfig } from "unpdf";

defineUnPDFConfig({
  // Use the legacy build
  pdfjs: () => import("pdfjs-dist/legacy/build/pdf.js"),
});

// Now, you can use the other methods
// …

Access the PDF.js Module

This will return the resolved PDF.js module. If no build is defined, the serverless build bundled with unpdf will be initialized.

import { getResolvedPDFJS } from "unpdf";

const { version } = await getResolvedPDFJS();

Use Serverless PDF.js Build In 🦕 Deno

Instead of using the methods provided by unpdf, you can directly import the serverless PDF.js build in Deno. This is useful if you want to use the PDF.js API directly.

import { getDocument } from "https://esm.sh/unpdf/pdfjs";

const data = Deno.readFileSync("dummy.pdf");
const doc = await getDocument(data).promise;

console.log(await doc.getMetadata());

for (let i = 1; i <= doc.numPages; i++) {
  const page = await doc.getPage(i);
  const textContent = await page.getTextContent();
  const contents = textContent.items.map((item) => item.str).join(" ");
  console.log(contents);
}
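
The loop above reads `item.str` from every entry. In the PDF.js type definitions, `textContent.items` is typed as a union of text items and marked-content entries, and the latter carry no `str` field, so strict TypeScript setups may reject that line. A minimal sketch of a type-safe join follows. The interfaces are simplified stand-ins for the PDF.js types, and `joinTextItems` is a hypothetical helper, not part of unpdf or PDF.js.

```typescript
// Simplified stand-ins for the PDF.js item types (assumption: only the
// fields used here are modeled).
interface TextItemLike {
  str: string;
}
interface MarkedContentLike {
  type: string;
}
type ContentItem = TextItemLike | MarkedContentLike;

// Type guard: true only for entries that actually carry text.
function hasText(item: ContentItem): item is TextItemLike {
  return "str" in item;
}

// Hypothetical helper: join the text of one page, skipping non-text entries.
function joinTextItems(items: ContentItem[]): string {
  return items
    .filter(hasText)
    .map((item) => item.str)
    .join(" ");
}

// Made-up sample; with PDF.js this would be `textContent.items`.
const sample: ContentItem[] = [
  { str: "Dummy" },
  { type: "beginMarkedContent" },
  { str: "PDF file" },
];
console.log(joinTextItems(sample)); // prints "Dummy PDF file"
```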

Config

interface UnPDFConfiguration {
  /**
   * By default, UnPDF will use the latest version of PDF.js. If you want to
   * use an older version or the legacy build, set a function that returns a
   * promise resolving to the PDF.js module.
   *
   * @example
   * // Use the legacy build
   * () => import('pdfjs-dist/legacy/build/pdf.js')
   */
  pdfjs?: () => Promise<PDFJS>;
}

Methods

defineUnPDFConfig

Define a custom PDF.js module, like the legacy build. Make sure to call this method before using any other methods.

function defineUnPDFConfig(config: UnPDFConfiguration): Promise<void>;

getResolvedPDFJS

Returns the resolved PDF.js module. If no build is defined, the serverless build bundled with unpdf will be initialized.

function getResolvedPDFJS(): Promise<PDFJS>;
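
The `pdfjs` option is a function rather than a module, so the import can be deferred until a method first needs it. One common way to get that behavior is a cached lazy loader. The following is only an illustrative sketch of the pattern, not unpdf's actual implementation; `createLazyResolver` and the fake module are hypothetical.

```typescript
// Hypothetical sketch of a "resolve on first use" loader. The loader function
// (for example () => import("pdfjs-dist")) is not called until the module is
// needed, and the resulting promise is cached for later calls.
function createLazyResolver<T>(loader: () => Promise<T>): () => Promise<T> {
  let cached: Promise<T> | undefined;
  return () => {
    if (cached === undefined) {
      cached = loader();
    }
    return cached;
  };
}

// Stand-in for a PDF.js module; a real loader would use a dynamic import.
let loadCount = 0;
const resolveFakePdfjs = createLazyResolver(async () => {
  loadCount++;
  return { version: "0.0.0-example" };
});

const first = resolveFakePdfjs();
const second = resolveFakePdfjs();
console.log(first === second, loadCount); // prints "true 1"
```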

getMeta

function getMeta(data: BinaryData | PDFDocumentProxy): Promise<{
  info: Record<string, any>;
  metadata: Record<string, any>;
}>;
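
The README gives no usage example for `getMeta`. As a sketch, the helper below summarizes its result, assuming that `info` holds the usual PDF document-information keys such as `Title` and `Author` (as PDF.js reports them). `describeMeta` is a hypothetical helper and the sample object is made up for illustration.

```typescript
// Result shape as declared by getMeta above.
interface PdfMeta {
  info: Record<string, any>;
  metadata: Record<string, any>;
}

// Return the value if it is a non-empty string, otherwise the fallback.
function textOr(value: unknown, fallback: string): string {
  return typeof value === "string" && value !== "" ? value : fallback;
}

// Hypothetical helper: build a one-line summary of a document.
function describeMeta(meta: PdfMeta): string {
  const title = textOr(meta.info.Title, "(untitled)");
  const author = textOr(meta.info.Author, "(unknown author)");
  return `${title} by ${author}`;
}

// Made-up sample; with unpdf this object would come from `await getMeta(pdf)`.
const sampleMeta: PdfMeta = { info: { Title: "Dummy" }, metadata: {} };
console.log(describeMeta(sampleMeta)); // prints "Dummy by (unknown author)"
```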

extractText

Extracts all text from a PDF. If mergePages is set to true, the text of all pages will be merged into a single string. Otherwise, an array of strings for each page will be returned.

function extractText(
  data: BinaryData | PDFDocumentProxy,
  { mergePages }?: { mergePages?: boolean },
): Promise<{
  totalPages: number;
  text: string | string[];
}>;
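
Because `text` is either one merged string or one string per page, calling code usually has to branch on its type. A minimal sketch of normalizing both shapes follows; `toPages` and `labelPages` are hypothetical helpers, not part of unpdf, and the sample results are made up.

```typescript
// Result shape as declared by extractText above.
interface ExtractTextResult {
  totalPages: number;
  text: string | string[];
}

// Hypothetical helper: always return an array, whether or not pages were merged.
function toPages(result: ExtractTextResult): string[] {
  return Array.isArray(result.text) ? result.text : [result.text];
}

// Hypothetical helper: prefix each entry with its 1-based position.
function labelPages(result: ExtractTextResult): string[] {
  return toPages(result).map((page, i) => `Page ${i + 1}: ${page.trim()}`);
}

// Made-up samples; with unpdf these would come from `await extractText(...)`.
const perPage: ExtractTextResult = { totalPages: 2, text: ["First ", "Second"] };
const merged: ExtractTextResult = { totalPages: 2, text: "First Second" };

console.log(labelPages(perPage).join(" | ")); // prints "Page 1: First | Page 2: Second"
console.log(toPages(merged).length); // prints 1
```

Note that a merged result yields a single entry, so the label produced for it does not correspond to a real page number.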

renderPageAsImage

Note
This method will only work in Node.js and browser environments.

To render a PDF page as an image, you can use the renderPageAsImage method. This method will return an ArrayBuffer of the rendered image.

In order to use this method, you have to meet the following requirements:

  • Use the official PDF.js build
  • Install the canvas package in Node.js environments

Example

import { readFile, writeFile } from "node:fs/promises";
import { defineUnPDFConfig, renderPageAsImage } from "unpdf";

defineUnPDFConfig({
  // Use the official PDF.js build
  pdfjs: () => import("pdfjs-dist"),
});

const pdf = await readFile("./dummy.pdf");
const buffer = new Uint8Array(pdf);
const pageNumber = 1;

const result = await renderPageAsImage(buffer, pageNumber, {
  canvas: () => import("canvas"),
});
await writeFile("dummy-page-1.png", Buffer.from(result));

Type Declaration

declare function renderPageAsImage(
  data: BinaryData | PDFDocumentProxy,
  pageNumber: number,
  options?: {
    canvas?: () => Promise<typeof import("canvas")>;
    /** @default 1 */
    scale?: number;
    width?: number;
    height?: number;
  },
): Promise<ArrayBuffer>;
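
Before writing the returned `ArrayBuffer` to disk, it can be worth checking that it really holds an image. The sketch below assumes the rendered image is PNG-encoded, as the `.png` filename in the example suggests; the README does not state the format explicitly. `looksLikePng` is a hypothetical helper, not part of unpdf.

```typescript
// The 8-byte signature that starts every PNG file.
const PNG_SIGNATURE = [0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a];

// Hypothetical helper: check whether a buffer starts with the PNG signature.
function looksLikePng(data: ArrayBuffer): boolean {
  const bytes = new Uint8Array(data);
  if (bytes.length < PNG_SIGNATURE.length) return false;
  return PNG_SIGNATURE.every((byte, i) => bytes[i] === byte);
}

// Made-up buffers for illustration; a real one would come from renderPageAsImage.
const fakePng = new ArrayBuffer(PNG_SIGNATURE.length + 2);
new Uint8Array(fakePng).set(PNG_SIGNATURE);
const notPng = new ArrayBuffer(4);
new Uint8Array(notPng).set([0x25, 0x50, 0x44, 0x46]); // the bytes of "%PDF"

console.log(looksLikePng(fakePng), looksLikePng(notPng)); // prints "true false"
```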

extractImages

function extractImages(
  data: BinaryData | PDFDocumentProxy,
  pageNumber: number,
): Promise<Uint8ClampedArray[]>;

FAQ

Why Is canvas An Optional Dependency?

The official PDF.js library depends on the canvas module for Node.js environments, which doesn't work inside worker threads. That's why unpdf ships with a serverless build of PDF.js that mocks the canvas module.

However, to render PDF pages as images in Node.js environments, you need to install the canvas module. That's why it is a peer dependency.

License

MIT License © 2023-PRESENT Johann Schopplich

More Repositories

1

nitro

Next Generation Server Toolkit. Create web servers with everything you need and deploy them wherever you prefer.
TypeScript
5,939
star
2

consola

🐨 Elegant Console Logger for Node.js and Browser
TypeScript
5,919
star
3

ofetch

😱 A better fetch API. Works on node, browser and workers.
TypeScript
3,876
star
4

magic-regexp

A compiled-away, type-safe, readable RegExp alternative
TypeScript
3,685
star
5

h3

⚑️ Minimal H(TTP) framework built for high performance and portability
TypeScript
3,433
star
6

unplugin

Unified plugin system for Vite, Rollup, Webpack, esbuild, Rolldown, and more
TypeScript
3,018
star
7

unbuild

📦 A unified JavaScript build system
TypeScript
2,270
star
8

magicast

🧀 Programmatically modify JavaScript and TypeScript source codes with a simplified, elegant and familiar syntax powered by recast and babel.
TypeScript
2,270
star
9

webpackbar

Elegant ProgressBar and Profiler for Webpack 3, 4 and 5
TypeScript
2,056
star
10

unstorage

💾 Unstorage provides an async Key-Value storage API with conventional features like multi driver mounting, watching and working with metadata, dozens of built-in drivers and a tiny core.
TypeScript
1,707
star
11

jiti

Runtime TypeScript and ESM support for Node.js
TypeScript
1,573
star
12

ipx

🖼️ High performance, secure and easy-to-use image optimizer.
TypeScript
1,491
star
13

fontaine

Automatic font fallback based on font metrics
TypeScript
1,478
star
14

destr

🚀 Faster, secure and convenient alternative for JSON.parse for arbitrary inputs
TypeScript
1,058
star
15

ufo

🔗 URL utils for humans
TypeScript
1,002
star
16

defu

🌊 Assign default properties recursively
TypeScript
992
star
17

untun

🚇 Tunnel your local HTTP(s) server to the world! powered by Cloudflare Quick Tunnels.
TypeScript
969
star
18

changelogen

💅 Beautiful Changelogs using Conventional Commits
TypeScript
877
star
19

citty

🌆 Elegant CLI Builder
TypeScript
729
star
20

hookable

🪝 Awaitable Hooks
TypeScript
693
star
21

unhead

Unhead is the any-framework document head manager built for performance and delightful developer experience.
TypeScript
618
star
22

ohash

Super fast hashing library based on murmurhash3 written in Vanilla JS
JavaScript
526
star
23

uqr

Generate QR Code universally, in any runtime, to ANSI, Unicode or SVG.
TypeScript
523
star
24

unimport

Unified utils for auto importing APIs in modules.
TypeScript
498
star
25

c12

⚙️ Smart Configuration Loader
TypeScript
474
star
26

nypm

🌈 Unified Package Manager for Node.js and Bun
TypeScript
455
star
27

ungh

🐙 Unlimited access to GitHub API
TypeScript
453
star
28

std-env

Runtime Agnostic JS utils
TypeScript
447
star
29

mlly

🀝 Common ECMAScript module utils
TypeScript
446
star
30

giget

✨ Download templates and git repositories with pleasure!
TypeScript
439
star
31

rou3

🌳 Lightweight and fast rou(ter) for JavaScript
TypeScript
432
star
32

listhen

👂 Elegant HTTP Listener
TypeScript
423
star
33

untyped

Generate types and markdown from a config object.
TypeScript
419
star
34

unctx

🍦 Composables in vanilla JS
TypeScript
396
star
35

pathe

🛣️ Drop-in replacement for the Node.js path module that ensures paths are normalized
TypeScript
396
star
36

unenv

🕊️ Convert JavaScript code to be runtime agnostic
TypeScript
358
star
37

mkdist

Lightweight file-to-file transpiler.
TypeScript
342
star
38

scule

🧡 String Case Utils
TypeScript
342
star
39

crossws

🔌 Cross-platform WebSocket Servers for Node.js, Deno, Bun and Cloudflare Workers.
TypeScript
299
star
40

rc9

Read/Write config couldn't be easier!
TypeScript
271
star
41

knitwork

🧢 Utilities to generate safe JavaScript code.
TypeScript
264
star
42

get-port-please

🔌 Get an available open port
TypeScript
243
star
43

runtime-compat

Display APIs compatibility across different JavaScript runtimes
Vue
230
star
44

theme-colors

🎨 Easily generate color shades for themes
TypeScript
213
star
45

perfect-debounce

Debounce promise-returning & async functions.
TypeScript
210
star
46

pkg-types

Node.js utilities and TypeScript definitions for package.json and tsconfig.json
TypeScript
206
star
47

lmify

🤙 Install NPM dependencies programmatically (please switch to unjs/nypm)
JavaScript
200
star
48

uncrypto

Single API for Web Crypto API and Crypto Subtle working in Node.js, Browsers and other runtimes
TypeScript
184
star
49

undio

⇔ Conventionally and Safely convert between various JavaScript data types
TypeScript
184
star
50

httpxy

🔀 A Full-Featured HTTP and WebSocket Proxy for Node.js
TypeScript
179
star
51

unwasm

🇼 WebAssembly tools for JavaScript
JavaScript
176
star
52

unkit

📙 UnJS standard library
TypeScript
174
star
53

undocs

Minimal Documentation theme and CLI for shared usage across UnJS projects.
Vue
161
star
54

automd

🤖 Automated markdown maintainer
TypeScript
161
star
55

db0

📚 Lightweight SQL Connector
TypeScript
160
star
56

node-fetch-native

better fetch for Node.js. Works on any JavaScript runtime!
TypeScript
154
star
57

template

📋 UnJS Project Starter Template
TypeScript
152
star
58

serve-placeholder

♡ Smart placeholder for missing assets
TypeScript
149
star
59

cookie-es

🍪 Cookie and Set-Cookie parser and serializer
TypeScript
132
star
60

website

UnJS website Content and Design!
Vue
129
star
61

mongoz

🥭 Zero Config MongoDB Server
TypeScript
108
star
62

jimp-compact

✏️ Lightweight version of Jimp -- An image processing library written entirely in JavaScript for Node.js
TypeScript
106
star
63

confbox

Compact and high quality YAML, TOML, JSONC and JSON5 parsers
TypeScript
102
star
64

nanotar

📼 Tiny and fast tar utils for any JavaScript runtime!
TypeScript
100
star
65

redirect-ssl

Connect/Express middleware to enforce https using is-https
TypeScript
100
star
66

mdbox

⬇ Just simple markdown utils
JavaScript
79
star
67

image-meta

Detect image type and size using pure javascript.
TypeScript
78
star
68

errx

Zero dependency library to capture and parse stack traces in Node, Bun, Deno and more.
TypeScript
78
star
69

compatx

🌴 Compatibility toolkit.
TypeScript
56
star
70

items-promise

Bare minimum async methods using promises
JavaScript
55
star
71

nitro-deploys

Continuous Nitro deployments for end-to-end testing deployment providers.
TypeScript
49
star
72

unrouting

Making filesystem routing universal
TypeScript
45
star
73

ezpass

Dead simple password protection middleware
TypeScript
37
star
74

eslint-config

✅ Shared ESLint config for unjs repositories
TypeScript
36
star
75

workbox-cdn

Workbox Unofficial CDN and standalone NPM package.
Shell
32
star
76

externality

TypeScript
31
star
77

create-require

Polyfill for Node.js module.createRequire (<= v12.2.0)
JavaScript
31
star
78

codeup

Automated codebase updater [POC]
TypeScript
30
star
79

is-https

Check if the given request is HTTPS
TypeScript
29
star
80

impound

TypeScript
29
star
81

rollup-plugin-node-deno

Convert NodeJS to Deno compatible code with rollup
TypeScript
29
star
82

requrl

Grab full URL from request.
TypeScript
28
star
83

fs-memo

Easy persisted memo object for Node.js
TypeScript
26
star
84

bundle-runner

Run webpack bundles in Node.js with optional VM sandboxing
TypeScript
25
star
85

community

UnJS Community Notes
22
star
86

renovate-config

16
star
87

nitro-preset-starter

TypeScript
16
star
88

glob-native

TypeScript
16
star
89

nitro-starter

Nitro starter template
TypeScript
16
star
90

governance

UnJS Governance Notes
15
star
91

.github

Community Health Files
8
star
92

unjs.github.io

HTML
5
star
93

html-validate-es

TypeScript
4
star