  • Stars: 284
  • Rank 140,523 (Top 3 %)
  • Language: TypeScript
  • License: MIT License
  • Created 9 months ago
  • Updated 3 months ago

Repository Details

📄 Utilities to work with PDFs in Node.js, browser and workers

unpdf

A collection of utilities to work with PDFs. Designed specifically for Deno, workers and other nodeless environments.

unpdf ships with a redistribution of Mozilla's PDF.js that is built for serverless environments. Apart from some string replacements and mocks, unenv does the heavy lifting by converting Node.js-specific code to be platform-agnostic. See pdfjs.rollup.config.ts for all the details.

This library is also intended as a modern alternative to the unmaintained but still popular pdf-parse.

Features

  • ๐Ÿ—๏ธ Works in Node.js, browser and workers
  • ๐Ÿชญ Includes serverless build of PDF.js (unpdf/pdfjs)
  • ๐Ÿ’ฌ Extract text and images from PDFs
  • ๐Ÿงฑ Opt-in to legacy PDF.js build
  • ๐Ÿ’จ Zero dependencies

Installation

Run the following command to add unpdf to your project.

# pnpm
pnpm add unpdf

# npm
npm install unpdf

# yarn
yarn add unpdf

Usage

Extract Text From PDF

import { extractText, getDocumentProxy } from "unpdf";

// Fetch a PDF file from the web
const buffer = await fetch(
  "https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf",
).then((res) => res.arrayBuffer());

// Or load it from the filesystem instead
// (requires: import { readFile } from "node:fs/promises")
// const buffer = await readFile("./dummy.pdf");

// Load PDF from buffer
const pdf = await getDocumentProxy(new Uint8Array(buffer));
// Extract text from PDF
const { totalPages, text } = await extractText(pdf, { mergePages: true });
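
The example above wraps its input in a Uint8Array whether it came from fetch (an ArrayBuffer) or from the filesystem (a Node.js Buffer). As a minimal sketch, that normalization step could be factored into a small helper. The name toUint8Array is hypothetical and not part of unpdf.

```typescript
// Hypothetical helper (not part of unpdf): normalize binary input to a Uint8Array.
function toUint8Array(data: ArrayBuffer | ArrayBufferView): Uint8Array {
  if (data instanceof Uint8Array) {
    // Already the right type (a Node.js Buffer is a Uint8Array subclass).
    return data;
  }
  if (ArrayBuffer.isView(data)) {
    // Other typed arrays or a DataView: create a byte view over the same memory.
    return new Uint8Array(data.buffer, data.byteOffset, data.byteLength);
  }
  // A plain ArrayBuffer, e.g. the result of Response.arrayBuffer().
  return new Uint8Array(data);
}
```

The result can then be passed wherever the examples in this README pass `new Uint8Array(buffer)`.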

Use Legacy Or Custom PDF.js Build

Generally, you don't need to worry about the PDF.js build. unpdf ships with a serverless build of the latest PDF.js version. However, if you want to use the official PDF.js version or the legacy build, you can define a custom PDF.js module.

// Before using any other methods, define the PDF.js module
import { defineUnPDFConfig } from "unpdf";

// defineUnPDFConfig returns a promise, so wait for it to settle
await defineUnPDFConfig({
  // Use the legacy build
  pdfjs: () => import("pdfjs-dist/legacy/build/pdf.js"),
});

// Now, you can use the other methods
// …

Access the PDF.js Module

This will return the resolved PDF.js module. If no build is defined, the serverless build bundled with unpdf will be initialized.

import { getResolvedPDFJS } from "unpdf";

const { version } = await getResolvedPDFJS();

Use Serverless PDF.js Build In 🦕 Deno

Instead of using the methods provided by unpdf, you can directly import the serverless PDF.js build in Deno. This is useful if you want to use the PDF.js API directly.

import { getDocument } from "https://esm.sh/unpdf/pdfjs";

const data = Deno.readFileSync("dummy.pdf");
const doc = await getDocument(data).promise;

console.log(await doc.getMetadata());

for (let i = 1; i <= doc.numPages; i++) {
  const page = await doc.getPage(i);
  const textContent = await page.getTextContent();
  const contents = textContent.items.map((item) => item.str).join(" ");
  console.log(contents);
}
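
When the loop above is type-checked, `item.str` may be flagged, because PDF.js typings can describe text content items as a union in which marked-content entries carry no `str` field. This is an assumption about the PDF.js version in use. A minimal sketch of a type-safe join follows. The local types and the names hasStr and joinTextItems are illustrative stand-ins, not PDF.js or unpdf exports.

```typescript
// Minimal local shapes, assumed to mirror PDF.js text content items.
interface TextItemLike {
  str: string;
}
type TextContentItem = TextItemLike | { type: string };

// Type guard: keep only items that actually carry text.
function hasStr(item: TextContentItem): item is TextItemLike {
  return typeof (item as TextItemLike).str === "string";
}

// Join the text of one page, skipping entries without a `str` field.
function joinTextItems(items: TextContentItem[], separator = " "): string {
  return items
    .filter(hasStr)
    .map((item) => item.str)
    .join(separator);
}
```

In the Deno loop, `joinTextItems(textContent.items)` would take the place of the inline `map` and `join` call.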

Config

interface UnPDFConfiguration {
  /**
   * By default, UnPDF will use the latest version of PDF.js. If you want to
   * use an older version or the legacy build, provide a function that returns
   * a promise resolving to the PDF.js module.
   *
   * @example
   * // Use the legacy build
   * () => import('pdfjs-dist/legacy/build/pdf.js')
   */
  pdfjs?: () => Promise<PDFJS>;
}

Methods

defineUnPDFConfig

Define a custom PDF.js module, like the legacy build. Make sure to call this method before using any other methods.

function defineUnPDFConfig(config: UnPDFConfiguration): Promise<void>;

getResolvedPDFJS

Returns the resolved PDF.js module. If no build is defined, the serverless build bundled with unpdf will be initialized.

function getResolvedPDFJS(): Promise<PDFJS>;
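
To illustrate why a custom module should be defined before anything resolves it, here is a generic "define, then lazily resolve once" pattern. It is a sketch only and makes no claim about how unpdf implements this internally. createModuleResolver and its method names are hypothetical.

```typescript
type ModuleLoader<T> = () => Promise<T>;

// Hypothetical illustration of a lazily resolved, cached module.
function createModuleResolver<T>(defaultLoader: ModuleLoader<T>) {
  let loader: ModuleLoader<T> = defaultLoader;
  let resolved: Promise<T> | undefined;

  return {
    // Swap in a custom loader and drop any previously cached module.
    define(custom: ModuleLoader<T>): void {
      loader = custom;
      resolved = undefined;
    },
    // Load on first use, then reuse the same promise for later calls.
    resolve(): Promise<T> {
      if (!resolved) {
        resolved = loader();
      }
      return resolved;
    },
  };
}
```

With this pattern, whichever loader is set at the time of the first `resolve()` call decides which module every later caller receives.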

getMeta

function getMeta(data: BinaryData | PDFDocumentProxy): Promise<{
  info: Record<string, any>;
  metadata: Record<string, any>;
}>;
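
Because `info` is typed as `Record<string, any>`, values read from it are unchecked. The following sketch shows one way to read a string entry defensively. pickInfoString is a hypothetical helper. Keys such as Title and Author are standard PDF document information entries, but whether a given file sets them is not guaranteed.

```typescript
// Hypothetical helper (not part of unpdf): read a non-empty string entry from `info`.
function pickInfoString(
  info: Record<string, unknown>,
  key: string,
): string | undefined {
  const value = info[key];
  if (typeof value !== "string") {
    return undefined;
  }
  const trimmed = value.trim();
  return trimmed === "" ? undefined : trimmed;
}
```

For example, `pickInfoString(info, "Title") ?? "Untitled"` gives a display title without assuming the entry exists.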

extractText

Extracts all text from a PDF. If mergePages is set to true, the text of all pages will be merged into a single string. Otherwise, an array of strings for each page will be returned.

function extractText(
  data: BinaryData | PDFDocumentProxy,
  { mergePages }?: { mergePages?: boolean },
): Promise<{
  totalPages: number;
  text: string | string[];
}>;
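
Since `text` is either a single string or an array of per-page strings depending on mergePages, calling code usually has to narrow the union. A minimal sketch follows. The ExtractTextResult interface mirrors the return type shown above, and toPageArray is a hypothetical helper name.

```typescript
// Mirrors the return type of extractText as declared above.
interface ExtractTextResult {
  totalPages: number;
  text: string | string[];
}

// Always return an array: one entry per page, or a single merged entry.
function toPageArray(result: ExtractTextResult): string[] {
  return Array.isArray(result.text) ? result.text : [result.text];
}
```

This lets downstream code iterate over the result the same way regardless of how mergePages was set.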

renderPageAsImage

Note
This method will only work in Node.js and browser environments.

To render a PDF page as an image, you can use the renderPageAsImage method. This method will return an ArrayBuffer of the rendered image.

In order to use this method, you have to meet the following requirements:

  • Use the official PDF.js build
  • Install the canvas package in Node.js environments

Example

import { readFile, writeFile } from "node:fs/promises";
import { defineUnPDFConfig, renderPageAsImage } from "unpdf";

await defineUnPDFConfig({
  // Use the official PDF.js build
  pdfjs: () => import("pdfjs-dist"),
});

const pdf = await readFile("./dummy.pdf");
const buffer = new Uint8Array(pdf);
const pageNumber = 1;

const result = await renderPageAsImage(buffer, pageNumber, {
  canvas: () => import("canvas"),
});
await writeFile("dummy-page-1.png", Buffer.from(result));

Type Declaration

declare function renderPageAsImage(
  data: BinaryData | PDFDocumentProxy,
  pageNumber: number,
  options?: {
    canvas?: () => Promise<typeof import("canvas")>;
    /** @default 1 */
    scale?: number;
    width?: number;
    height?: number;
  },
): Promise<ArrayBuffer>;
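
The example writes the returned ArrayBuffer to a .png file, which suggests the image is PNG-encoded. Treating that as an assumption, a quick sanity check before writing to disk could compare the first bytes against the PNG file signature. looksLikePng is a hypothetical helper, not part of unpdf.

```typescript
// The eight-byte signature that starts every PNG file.
const PNG_SIGNATURE = [0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a];

// Hypothetical helper: check whether a buffer starts with the PNG signature.
function looksLikePng(data: ArrayBuffer): boolean {
  const bytes = new Uint8Array(data);
  if (bytes.length < PNG_SIGNATURE.length) {
    return false;
  }
  return PNG_SIGNATURE.every((byte, index) => bytes[index] === byte);
}
```

A caller could run `looksLikePng(result)` on the value returned by renderPageAsImage and log a warning if it returns false.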

extractImages

function extractImages(
  data: BinaryData | PDFDocumentProxy,
  pageNumber: number,
): Promise<Uint8ClampedArray[]>;

FAQ

Why Is canvas An Optional Dependency?

The official PDF.js library depends on the canvas module for Node.js environments, which doesn't work inside worker threads. That's why unpdf ships with a serverless build of PDF.js that mocks the canvas module.

However, to render PDF pages as images in Node.js environments, you need to install the canvas module. That's why it is a peer dependency.
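
Application code that wants to degrade gracefully when canvas is not installed could wrap the dynamic import. This is a sketch of that idea, not something unpdf provides, and loadOptional is a hypothetical name. Note that it treats any load failure as "not available", which also hides unrelated errors.

```typescript
// Hypothetical helper: try to load an optional module, returning undefined on failure.
async function loadOptional<T>(
  loader: () => Promise<T>,
): Promise<T | undefined> {
  try {
    return await loader();
  } catch {
    // The module is missing or failed to initialize.
    return undefined;
  }
}

// Usage sketch (assumes the canvas package may or may not be installed):
// const canvas = await loadOptional(() => import("canvas"));
// if (!canvas) console.warn("canvas not installed, skipping page rendering");
```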

License

MIT License © 2023-PRESENT Johann Schopplich
