aws-pdf-textract-pipeline
This is an example data pipeline that illustrates one possible approach for large-scale serverless PDF processing - it should serve as a good foundation to modify for your own purposes.
Getting Started
Run the following commands to install dependencies, build the CDK stack, and deploy the CDK Stack to AWS.
yarn install
yarn build
cdk bootstrap
cdk deploy
Overview
The following is an overview of each process performed by this CDK stack.
- Scrape PDF download URLs from a website
  The example scrapes data from the COGCC website.
- Store PDF download URL in DynamoDB
- Download the PDF to S3
  A Lambda fires when a new PDF download URL has been created in DynamoDB.
- Process the PDF with AWS Textract
  Another Lambda fires when a PDF has been downloaded to the S3 bucket.
- Process the AWS Textract results
  When an SNS event is detected from AWS Textract, a Lambda fires to process the result.
- Save the processed Textract result to DynamoDB
  After the full result is pruned down to the desired data structure, the data is saved in DynamoDB.
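Two of the steps above come down to small data transformations: picking newly inserted URLs out of a DynamoDB stream event, and pruning a Textract result down to the text you want to keep. The following is a minimal sketch of both, not the repository's actual handlers. The attribute name `url`, the function names `newPdfUrls` and `linesByPage`, and the choice of "lines of text grouped by page" as the target data structure are all assumptions made for illustration. The types are declared locally and cover only the fields read here.

```typescript
// Local, minimal types: only the fields this sketch reads.
interface StreamRecord {
  eventName?: "INSERT" | "MODIFY" | "REMOVE";
  dynamodb?: { NewImage?: { [key: string]: { S?: string } } };
}

interface TextractBlock {
  BlockType?: string;
  Text?: string;
  Page?: number;
}

// Keep only newly inserted rows and read their URL attribute.
// "url" is an assumed attribute name, not taken from the repository.
function newPdfUrls(records: StreamRecord[], attribute = "url"): string[] {
  const urls: string[] = [];
  for (const record of records) {
    if (record.eventName !== "INSERT") continue;
    const value = record.dynamodb?.NewImage?.[attribute]?.S;
    if (value) urls.push(value);
  }
  return urls;
}

// Reduce raw Textract blocks to lines of text grouped by page number.
// Blocks without a Page field are treated as page 1 (an assumption).
function linesByPage(blocks: TextractBlock[]): Record<number, string[]> {
  const pages: Record<number, string[]> = {};
  for (const block of blocks) {
    if (block.BlockType !== "LINE" || block.Text === undefined) continue;
    const page = block.Page ?? 1;
    if (!pages[page]) pages[page] = [];
    pages[page].push(block.Text);
  }
  return pages;
}
```

Ignoring MODIFY and REMOVE events means that later updates to a row would not trigger another download in this sketch.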
Scripts
- yarn install - installs dependencies
- yarn build - builds the production-ready CDK Stack
- yarn test - runs Jest
- cdk bootstrap - bootstraps AWS CloudFormation for your CDK deploy
- cdk deploy - deploys the CDK stack to AWS
Notes
- Warning - the AnalyzeDocument process from AWS Textract costs $50 per 1,000 PDF pages. Be careful when deploying this CDK stack as you could unintentionally rack up an expensive AWS bill quickly if you're not paying attention.
- If a PDF download URL has already been added to the pdfUrlsTable DynamoDB table, the pipeline will not re-execute for the PDF.
- Includes tests with Jest.
- Recommended to use Visual Studio Code with the Format on Save setting turned on.
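The cost warning and the no-reprocessing note can both be made concrete with a few lines of code. The sketch below estimates an AnalyzeDocument bill at the $50 per 1,000 pages rate quoted above, and builds a conditional put request as one possible way to skip URLs that are already stored. It is not confirmed to be how this stack does it. The function names, the `url` key attribute, and the assumption that `url` is the table's partition key are hypothetical; `ConditionExpression`, `ExpressionAttributeNames`, and `attribute_not_exists` are standard DynamoDB PutItem features.

```typescript
// Rate taken from the warning above: $50 per 1,000 pages.
const USD_PER_1000_PAGES = 50;

// Estimate the AnalyzeDocument cost for a given number of PDF pages.
function estimateAnalyzeDocumentCostUsd(pageCount: number): number {
  if (!Number.isInteger(pageCount) || pageCount < 0) {
    throw new RangeError("pageCount must be a non-negative integer");
  }
  return (pageCount * USD_PER_1000_PAGES) / 1000;
}

// Shape of the PutItem parameters this sketch produces.
interface ConditionalPut {
  TableName: string;
  Item: { [key: string]: { S: string } };
  ConditionExpression: string;
  ExpressionAttributeNames: { [key: string]: string };
}

// Build a put that only succeeds when no item with this key exists yet.
// Assumes "url" is the partition key (hypothetical); the "#u" alias
// sidesteps any clash between the attribute name and a reserved word.
function buildPdfUrlPut(tableName: string, url: string): ConditionalPut {
  return {
    TableName: tableName,
    Item: { url: { S: url } },
    ConditionExpression: "attribute_not_exists(#u)",
    ExpressionAttributeNames: { "#u": "url" },
  };
}
```

With these parameters, DynamoDB rejects the write when the URL is already present, so no new insert event reaches the download Lambda. At the quoted rate, a 2,500-page batch comes to $125.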
Built with
Additional Resources
- CDK API Reference
- Puppeteer
- Puppeteer Lambda
- CDK TypeScript Reference
- CDK Assertion Package
- Textract Pricing Chart
- awesome-cdk repo
License
Open source under the MIT License.