  • Stars: 164
  • Rank: 230,032 (Top 5%)
  • Language: TypeScript
  • License: MIT License
  • Created almost 5 years ago
  • Updated 6 months ago

Repository Details

🔍 Data pipeline for crawling PDFs from the Web and transforming their contents into structured data using AWS Textract. Built with AWS CDK + TypeScript.

aws-pdf-textract-pipeline

Mentioned in Awesome CDK.

This example data pipeline illustrates one possible approach to large-scale serverless PDF processing. It should serve as a good foundation to modify for your own purposes.

Getting Started

Run the following commands to install dependencies, build the CDK stack, and deploy it to AWS.

yarn install
yarn build
cdk bootstrap
cdk deploy

Overview

The following is an overview of each process performed by this CDK stack.

  1. Scrape PDF download URLs from a website

    Scraping data from the COGCC website.

  2. Store PDF download URL in DynamoDB

  3. Download the PDF to S3

    A Lambda function fires when a new PDF download URL has been created in DynamoDB.

  4. Process the PDF with AWS Textract

    Another Lambda function fires when a PDF has been downloaded to the S3 bucket.

  5. Process the AWS Textract results

    When AWS Textract publishes an SNS event, a Lambda function is fired off to process the result.

  6. Save the processed Textract result to DynamoDB

    After the full result is pruned down to the desired data structure, we save the data in DynamoDB.
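
The event hand-offs in steps 3 to 5 can be sketched as three small pure functions. This is a minimal illustration and not the repository's actual handlers: the function names, the `url` attribute name, and the choice of the `FORMS` feature type are assumptions made for this sketch. The event fields and the StartDocumentAnalysis parameter names follow the documented AWS shapes.

```typescript
// Minimal local types covering only the event fields used below.
// The `url` attribute name is an assumption made for this sketch.
interface StreamRecord {
  eventName?: "INSERT" | "MODIFY" | "REMOVE";
  dynamodb?: { NewImage?: Record<string, { S?: string }> };
}
interface S3Record {
  s3: { bucket: { name: string }; object: { key: string } };
}
interface SnsRecord {
  Sns: { Message: string };
}

// Step 3: keep only newly inserted rows and read their (hypothetical) `url` attribute.
function newPdfUrls(records: StreamRecord[]): string[] {
  const urls: string[] = [];
  for (const record of records) {
    const url = record.dynamodb?.NewImage?.url?.S;
    if (record.eventName === "INSERT" && url) urls.push(url);
  }
  return urls;
}

// Step 4: build a StartDocumentAnalysis request for an uploaded PDF.
// Object keys in S3 events are URL-encoded, with spaces encoded as "+".
function textractRequestFor(record: S3Record, snsTopicArn: string, roleArn: string) {
  const key = decodeURIComponent(record.s3.object.key.replace(/\+/g, " "));
  return {
    DocumentLocation: { S3Object: { Bucket: record.s3.bucket.name, Name: key } },
    FeatureTypes: ["FORMS"], // assumption: the pipeline may request other features
    NotificationChannel: { SNSTopicArn: snsTopicArn, RoleArn: roleArn },
  };
}

// Step 5: return the IDs of Textract jobs that completed successfully.
function succeededJobIds(records: SnsRecord[]): string[] {
  return records
    .map((r) => JSON.parse(r.Sns.Message) as { JobId: string; Status: string })
    .filter((message) => message.Status === "SUCCEEDED")
    .map((message) => message.JobId);
}
```

Keeping the parsing logic free of AWS SDK calls makes each step easy to cover with Jest. A real handler would pass the results on to S3, Textract, and DynamoDB clients.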

Scripts

  • yarn install - installs dependencies
  • yarn build - builds the production-ready CDK Stack
  • yarn test - runs Jest
  • cdk bootstrap - bootstraps AWS CloudFormation for your CDK deploy
  • cdk deploy - deploys the CDK stack to AWS

Notes

  • Warning: the AnalyzeDocument operation from AWS Textract costs $50 per 1,000 PDF pages. Be careful when deploying this CDK stack, as you could quickly rack up an expensive AWS bill if you're not paying attention.

  • If a PDF download URL has already been added to the pdfUrlsTable DynamoDB table, the pipeline will not re-execute for the PDF.

  • Includes tests with Jest.

  • Visual Studio Code with the Format on Save setting turned on is recommended.
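
The first two notes can be made concrete with a short sketch. The cost helper applies the $50 per 1,000 pages figure quoted in the warning above, so check current AWS pricing before relying on it. The deduplication helper shows one way a repeat URL could be rejected, using a DynamoDB conditional write. The function names and the `url` key attribute are assumptions for illustration, and the repository may implement the check differently.

```typescript
// Price per 1,000 pages as quoted in the warning above (not fetched from AWS).
const USD_PER_1000_PAGES = 50;

// Estimated AnalyzeDocument cost in USD for a given number of PDF pages.
function estimateTextractCost(pages: number): number {
  return (pages / 1000) * USD_PER_1000_PAGES;
}

// PutItem parameters that only succeed when the URL is not already stored.
// `pdfUrlsTable` is the table named in the notes; the deployed table name and
// the `url` key attribute are assumptions. "#u" aliases the attribute name
// because URL is a DynamoDB reserved word.
function dedupedPutParams(url: string) {
  return {
    TableName: "pdfUrlsTable",
    Item: { url: { S: url } },
    ConditionExpression: "attribute_not_exists(#u)",
    ExpressionAttributeNames: { "#u": "url" },
  };
}
```

For example, `estimateTextractCost(2500)` evaluates to 125, so a crawl that yields 2,500 pages would cost roughly $125 at the quoted rate.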

Built with

Additional Resources

License

Open source under the MIT License.

Built with ❤️ by aeksco

More Repositories

  1. react-typescript-web-extension-starter (JavaScript, 936 stars)
    🖥️ Web Extension starter kit built with React, TypeScript, TailwindCSS, Storybook, Jest, EsLint, Prettier, and Webpack. Supports Google Chrome + Mozilla Firefox + Brave Browser + Microsoft Edge + Opera 🔥
  2. nuxt-netlify-lambda-starter (Vue, 59 stars)
    🛠️ SEO-friendly website starter backed by Netlify lambda functions in a simple, friendly repo
  3. openjscad-react (TypeScript, 21 stars)
    📦 React.js component for the OpenJSCAD.org project
  4. hardcider (JavaScript, 19 stars)
    🍺 CLI for quickly generating citations for websites and books
  5. lyrebird (C, 9 stars)
    🐦 Impersonate USB devices over Bluetooth LE
  6. aws-s3-bucket-maker (TypeScript, 5 stars)
    💥 Builds a self-destructing S3 bucket and associated IAM Role for temporary file transfer workflows
  7. docker_jupyter_mongodb (Jupyter Notebook, 4 stars)
    Docker-Compose + MongoDB + Jupyter Data Science Notebook + Zipline
  8. aws-pdf-generator-pipeline (TypeScript, 4 stars)
    💼 Data pipeline for generating PDFs from HTML files. Built with AWS CDK + TypeScript.
  9. openjscad-react-next-starter (TypeScript, 4 stars)
    🏗️ A web starter project with OpenJSCAD, React, Next.js, TypeScript, TailwindCSS and Netlify
  10. movies (3 stars)
    🎥 An on-going list of movies I've watched
  11. Fe1.1-USB-Hub (3 stars)
  12. react-typescript-stripe-cognito (TypeScript, 3 stars)
    Purchase page with React.js, TypeScript, Next.js, Stripe, AWS Cognito, and TailwindCSS
  13. vue-netlify-lambda-prerender (JavaScript, 3 stars)
    🖖 Basic starter project for a prerendered Vue frontend with a Netlify lambda function backend
  14. autolock (C++, 3 stars)
    🔒 A USB device that locks your computer when it detects you've left your seat
  15. jupyter-tabula (Jupyter Notebook, 2 stars)
    Docker container image built with Jupyter Notebook and Tabula for PDF scraping
  16. github-reviewer-extension (JavaScript, 2 stars)
    Web extension to enhance large GitHub pull-request reviews
  17. aws-cdk-starter (TypeScript, 2 stars)
    A starter project using AWS CDK + TypeScript
  18. dotfiles (Shell, 2 stars)
    some sweet sweet dotfiles
  19. react-next-typescript-netlify-starter (TypeScript, 2 stars)
    🛠️ SEO-friendly React + Next.js + TypeScript + Netlify website starter in a simple, friendly repo
  20. VueJSWorkshop (JavaScript, 2 stars)
    A VueJS Workshop
  21. vuejs-simple-frontend-example (Vue, 2 stars)
    👋 A simple Vue.js CRUD frontend backed by localstorage. Great for learning and prototyping - hack it!
  22. google-homepage (HTML, 2 stars)
  23. hotsheets (Vue, 1 star)
    Flies like an app, stings like a spreadsheet
  24. codotype-plugin-starter-kit (TypeScript, 1 star)
    🌱 Write your own Codotype plugins with this starter kit
  25. breakout-boards (1 star)
    ⚡ A collection of printed circuit boards
  26. magellan (CoffeeScript, 1 star)
    Client-side ontology-driven filesystem knowledge capture and faceted search
  27. ts-find-unused (TypeScript, 1 star)
    CLI tool to find unused code in TypeScript projects
  28. agentql-docker-example (Python, 1 star)
    Example project for running the AgentQL Python SDK in a Docker container, using MongoDB for storing inputs and outputs.
  29. vercel-deploy-nextjs-plugin (JavaScript, 1 star)
  30. vercel-deploy-nextjs-base (TypeScript, 1 star)
    A repo to test non-traditional Next.js + React + TypeScript deploys on Vercel
  31. printrbot_settings (1 star)
    Printrbot Simple Metal Cura profiles for various filaments.
  32. react-node-typescript-postgres-starter (JavaScript, 1 star)
  33. aws-cdk-ecs-fargate (TypeScript, 1 star)
    Experimenting with AWS CDK + Elastic Container Service + Fargate
  34. aws-api-gateway-lambda (TypeScript, 1 star)
    A single AWS Lambda function via REST API endpoint through API Gateway. Built with AWS CDK + TypeScript.
  35. static-site-sso (1 star)
    🔒 Secure a static website behind SSO
  36. angular_crud_example (TypeScript, 1 star)
    Angular 5 CRUD Example
  37. aeksco (1 star)
    My GitHub Profile README.md
  38. hackathon_jupyterhub (Shell, 1 star)
    A hackathon-friendly JupyterHub deployment with Docker & NGINX.
  39. codotype (TypeScript, 1 star)
    🔥 Generate full-stack web applications built with tools that developers love