aeksco/aws-pdf-textract-pipeline

Stars
164
Rank 230,032 (Top 5 %)
Language
TypeScript
License
MIT License
Created almost 5 years ago
Updated 6 months ago

aws-pdf-textract-pipeline

🔍 Data pipeline for crawling PDFs from the Web and transforming their contents into structured data using AWS Textract. Built with AWS CDK + TypeScript.

This is an example data pipeline that illustrates one possible approach for large-scale serverless PDF processing - it should serve as a good foundation to modify for your own purposes.

Getting Started

Run the following commands to install dependencies, build the CDK stack, and deploy it to AWS.

yarn install
yarn build
cdk bootstrap
cdk deploy

Overview

The following is an overview of each process performed by this CDK stack.

1. Scrape PDF download URLs from a website

The example scrapes data from the COGCC website.

2. Store PDF download URL in DynamoDB

3. Download the PDF to S3

A Lambda fires off when a new PDF download URL has been created in DynamoDB.

4. Process the PDF with AWS Textract

Another Lambda fires off when a PDF has been downloaded to the S3 bucket.

5. Process the AWS Textract results

When an SNS event is detected from AWS Textract, a Lambda is fired off to process the result.

6. Save the processed Textract result to DynamoDB

After the full result is pruned down to the desired data structure, we save the data in DynamoDB.
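
The pruning in the last two steps can be sketched as follows. This is a minimal illustration, not the repository's actual code: it assumes the Textract result contains form data (KEY_VALUE_SET blocks) and that the desired data structure is a flat key/value record. The Block fields follow the Textract API response shape; the helper names (relatedIds, childText, pruneToKeyValues) are hypothetical.

```typescript
// Minimal subset of a Textract Block; only the fields used below.
interface Relationship {
  Type: string; // e.g. "CHILD" or "VALUE"
  Ids: string[];
}

interface Block {
  Id: string;
  BlockType: string; // e.g. "KEY_VALUE_SET" or "WORD"
  EntityTypes?: string[]; // ["KEY"] or ["VALUE"] on KEY_VALUE_SET blocks
  Text?: string;
  Relationships?: Relationship[];
}

// Collect the block IDs referenced by one relationship type.
function relatedIds(block: Block, type: string): string[] {
  const ids: string[] = [];
  for (const rel of block.Relationships || []) {
    if (rel.Type === type) ids.push(...rel.Ids);
  }
  return ids;
}

// Join the text of a block's WORD children (other child types are ignored).
function childText(block: Block, byId: Record<string, Block>): string {
  const words: string[] = [];
  for (const id of relatedIds(block, "CHILD")) {
    const child = byId[id];
    if (child && child.BlockType === "WORD" && child.Text) words.push(child.Text);
  }
  return words.join(" ");
}

// Hypothetical helper: reduce the full block list to a flat key/value record.
function pruneToKeyValues(blocks: Block[]): Record<string, string> {
  const byId: Record<string, Block> = {};
  for (const block of blocks) byId[block.Id] = block;

  const result: Record<string, string> = {};
  for (const block of blocks) {
    const isKey =
      block.BlockType === "KEY_VALUE_SET" &&
      (block.EntityTypes || []).indexOf("KEY") !== -1;
    if (!isKey) continue;

    const key = childText(block, byId);
    if (key === "") continue; // skip keys with no readable text

    // A KEY block points at its VALUE block(s) through a "VALUE" relationship.
    const valueParts: string[] = [];
    for (const id of relatedIds(block, "VALUE")) {
      const valueBlock = byId[id];
      if (valueBlock) valueParts.push(childText(valueBlock, byId));
    }
    result[key] = valueParts.join(" ");
  }
  return result;
}
```

The returned record is small enough to store as a single DynamoDB item, which is the point of pruning: the raw block list is much larger than the handful of fields most consumers need.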

Scripts

yarn install - installs dependencies
yarn build - builds the production-ready CDK Stack
yarn test - runs Jest
cdk bootstrap - bootstraps AWS CloudFormation for your CDK deploy
cdk deploy - deploys the CDK stack to AWS

Notes

Warning - the AnalyzeDocument process from AWS Textract costs $50 per 1,000 PDF pages. Be careful when deploying this CDK stack as you could unintentionally rack up an expensive AWS bill quickly if you're not paying attention.
If a PDF download URL has already been added to the pdfUrlsTable DynamoDB table, the pipeline will not re-execute for the PDF.
Includes tests with Jest.
Recommended to use Visual Studio Code with the Format on Save setting turned on.
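
To put the Textract pricing warning above in concrete terms, here is a small cost estimate sketch. It uses the $50 per 1,000 pages figure quoted in the note as its default rate; current AWS pricing may differ, so treat the rate as an assumption and pass your own. The function name is hypothetical.

```typescript
// Rate quoted in the note above (assumption: verify against current AWS pricing).
const USD_PER_1000_PAGES = 50;

// Hypothetical helper: estimate the AnalyzeDocument cost for a number of pages.
function estimateAnalyzeDocumentCostUsd(
  pageCount: number,
  usdPer1000Pages: number = USD_PER_1000_PAGES
): number {
  if (!isFinite(pageCount) || pageCount < 0) {
    throw new RangeError("pageCount must be a non-negative finite number");
  }
  return (pageCount * usdPer1000Pages) / 1000;
}

// Example: 200 PDFs averaging 25 pages each is 5,000 pages.
const estimatedUsd = estimateAnalyzeDocumentCostUsd(200 * 25);
console.log(`Estimated Textract cost: $${estimatedUsd.toFixed(2)}`);
```

Running an estimate like this against the number of PDFs the scraper is expected to find, before the first cdk deploy, is a cheap way to avoid a surprise bill.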

License

Open source under the MIT License.

Built with ❤️ by aeksco

react-typescript-web-extension-starter

🖥️ Web Extension starter kit built with React, TypeScript, TailwindCSS, Storybook, Jest, EsLint, Prettier, and Webpack. Supports Google Chrome + Mozilla Firefox + Brave Browser + Microsoft Edge + Opera 🔥

nuxt-netlify-lambda-starter

🛠️ SEO-friendly website starter backed by Netlify lambda functions in a simple, friendly repo

openjscad-react

📦 React.js component for the OpenJSCAD.org project

hardcider

🍺 CLI for quickly generating citations for websites and books

lyrebird

🐦 Impersonate USB devices over Bluetooth LE

aws-s3-bucket-maker

💥 Builds a self-destructing S3 bucket and associated IAM Role for temporary file transfer workflows

docker_jupyter_mongodb

Docker-Compose + MongoDB + Jupyter Data Science Notebook + Zipline

Jupyter Notebook

aws-pdf-generator-pipeline

💼 Data pipeline for generating PDFs from HTML files. Built with AWS CDK + TypeScript.

openjscad-react-next-starter

🏗️ A web starter project with OpenJSCAD, React, Next.js, TypeScript, TailwindCSS and Netlify

movies

🎥 An on-going list of movies I've watched

Fe1.1-USB-Hub

react-typescript-stripe-cognito

Purchase page with React.js, TypeScript, Next.js, Stripe, AWS Cognito, and TailwindCSS

vue-netlify-lambda-prerender

🖖 Basic starter project for a prerendered Vue frontend with a Netlify lambda function backend

autolock

🔒 A USB device that locks your computer when it detects you've left your seat

jupyter-tabula

Docker container image built with Jupyter Notebook and Tabula for PDF scraping

Jupyter Notebook

github-reviewer-extension

Web extension to enhance large GitHub pull-request reviews

aws-cdk-starter

A starter project using AWS CDK + TypeScript

dotfiles

some sweet sweet dotfiles

react-next-typescript-netlify-starter

🛠️ SEO-friendly React + Next.js + TypeScript + Netlify website starter in a simple, friendly repo

VueJSWorkshop

A VueJS Workshop

vuejs-simple-frontend-example

👋 A simple Vue.js CRUD frontend backed by localstorage. Great for learning and prototyping - hack it!

google-homepage

hotsheets

Flies like an app, stings like a spreadsheet

codotype-plugin-starter-kit

🌱 Write your own Codotype plugins with this starter kit

breakout-boards

⚡ A collection of printed circuit boards

magellan

Client-side ontology-driven filesystem knowledge capture and faceted search

ts-find-unused

CLI tool to find unused code in TypeScript projects

agentql-docker-example

Example project for running the AgentQL Python SDK in a Docker container, using MongoDB for storing inputs and outputs.

vercel-deploy-nextjs-plugin

vercel-deploy-nextjs-base

A repo to test non-traditional Next.js + React + TypeScript deploys on Vercel

printrbot_settings

Printrbot Simple Metal Cura profiles for various filaments.

react-node-typescript-postgres-starter

aws-cdk-ecs-fargate

Experimenting with AWS CDK + Elastic Container Service + Fargate

aws-api-gateway-lambda

A single AWS Lambda function via REST API endpoint through API Gateway. Built with AWS CDK + TypeScript.

static-site-sso

🔒 Secure a static website behind SSO

angular_crud_example

Angular 5 CRUD Example

aeksco

My GitHub Profile README.md

hackathon_jupyterhub

A hackathon-friendly JupyterHub deployment with Docker & NGINX.

codotype

🔥 Generate full-stack web applications built with tools that developers love