• Stars: 125
• Rank: 286,335 (Top 6%)
• Language: CSS
• Created: over 8 years ago
• Updated: over 8 years ago

Repository Details

Evaluating the performance and accuracy of ABBYY FineReader's OCR on Senate Financial Disclosure scanned forms

Using ABBYY FineReader to extract tabular data from U.S. Senators' personal finance reports

Members of Congress are required to submit regular reports detailing their personal wealth. However, despite the existence of electronic filing systems, some legislators still submit via paper, which is then scanned and uploaded as images or PDFs into an online database (Senate / House).

The Senate's electronic filing system came into effect a couple years ago; Senator Bernie Sanders is one example of a Senator who has moved from paper to the electronic filing system:

Sen. Bernie Sanders' annual disclosures, 2011 and 2014

Extracting data from scanned images is one of the most common and most difficult data wrangling tasks -- so much so that OpenSecrets (aka The Center for Responsive Politics) pitched a civic hackathon challenge to build a solution for efficiently parsing Congressmembers' personal financial disclosures.

My writeup here is meant as a quick overview of how effective ABBYY FineReader for Mac is at producing usable, perhaps even delimited, data from the scanned disclosure forms. Note that I'm not attempting to solve the problem of cleaning up the imperfect OCR results and inserting them into a database, nor of automating the whole thing as a batch process. Just extracting text, even semi-accurately, from a single scanned form is a hard challenge on its own.

For a better overview of PDFs and structured data, including the different kinds of PDFs and the many challenges of and approaches to extracting structured data from them, check out Jacob Fenton's and Jeremy Singer-Vine's NICAR16 presentation on Parsing Prickly PDFs. If all you care about is the actual personal finances of Congressmembers, OpenSecrets has you covered.

Also, Robert Gebeloff of the New York Times has put together a list of the various other commercial products and their use cases in this NICAR presentation (.docx).

My initial takeaway: FineReader is remarkably good for this task; in a later walkthrough I'll explain how to apply this in semi-automated fashion across all the forms (or any other set of scanned papers).

For brevity, this writeup focuses on the Senate financial disclosures -- the OCR challenge for both chambers of Congress is fundamentally the same.

What the submitted financial disclosure forms look like

The Senate's financial disclosure database can be found here:

https://efdsearch.senate.gov/search/home/

If you want to follow the direct links I provide, you'll need to first visit the Senate site in your browser and manually agree to the site's terms of use; this starts a browser session that allows you to access the direct links.

An electronically-submitted personal finance report

Here's what an annual report on personal finances for 2014 looks like when it's electronically-submitted, courtesy of Senator Marco Rubio:

https://efdsearch.senate.gov/search/view/annual/de85e0d9-7eeb-49b6-83df-67affd2df645/

For your convenience, I've mirrored the HTML for Sen. Rubio's financial report, which you can visit here without going through the Senate site.

rubio-table.png

As you can see, the HTML is straightforward to parse as machine-readable data. So let's dispel once and for all with this fiction that Senator Rubio doesn't know what he's doing. He knows exactly what he's doing.

A personal finance report submitted as paper

And here's what that same report looks like when it's submitted on paper, courtesy of Senator Dianne Feinstein:

https://efdsearch.senate.gov/search/view/paper/B06D0983-3786-41CB-92C6-5209F288D517/

OpenSecrets has a copy of the PDF that you can view without visiting the Senate site. Here's what one of the scanned pages looks like:

The OCR challenge

It's important to note that even though Senator Rubio's electronic form is easy to read programmatically, there's still the challenge of designing a data schema that you can import his financial data into.

That same challenge exists for Senator Feinstein's paper form, except with the additional and exponentially more challenging task of just extracting the data. This challenge is what necessitates the use of optical character recognition technology, aka OCR.

Here are my desired outcomes:

1. Convert scanned English text characters into plaintext data

That is, convert a picture of the letter a.jpg into a digital plaintext representation that can be read by a text editor: a

2. Convert scanned data tables into Excel spreadsheet tables

That is, convert a picture of a table of data:

pension-table.jpg

Into something that can be read as delimited data values in a standard Excel spreadsheet:

City & County of San Francisco San Francisco, CA Pension $56,804
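
For example, that row, with column boundaries that I'm guessing from the form's layout, would look something like this as comma-delimited (CSV) text:

City & County of San Francisco,"San Francisco, CA",Pension,"$56,804"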

The process of turning images into string literals is extremely difficult, and doing it with a high rate of accuracy is beyond most development shops smaller than Google. Recognizing that the images represent tabular data, and preserving that structure, is itself another non-trivial challenge.

Using ABBYY FineReader

There are open-source OCR programs, of which Tesseract is the most well-known, but they generally don't handle the task of recognizing tabular data (note: software such as Tabula extracts tables from text-based PDFs, not scanned images).

Commercial packages -- such as ABBYY's FineReader and OmniPage -- do claim to effectively OCR tabular data. I had never used either until now, but I'm using ABBYY's product because I've heard good things about it and because I don't make enough money to afford OmniPage.

I'm on a Mac so I only have access to FineReader Pro for Mac, which is listed at $119. Windows users have access to FineReader 12 Professional and Corporate -- and, I'm betting, more tech support and updates.

ABBYY Cloud?

During the OpenSecrets PDF hackathon, developer Ross Tsiomenko tested the idea of using ABBYY's cloud service to batch process the Congressional forms (https://www.opengovfoundation.org/developers-blog-liberating-congressional-financial-disclosure-data-from-pdfs/):

Financial Disclosure Reports are not text-based PDFs, but rather scanned-in images, meaning OCR (optical character recognition) software must be used to extract the data. ABBYY Cloud OCR is the only software currently known to extract tabular data correctly; the prototype uses a shell script to upload a PDF to the Cloud API, which returns a text file with most columns and rows intact. This is then cleaned up and turned into a csv file using Python.

...Once everything is working, an alternative to the paid ABBYY Cloud OCR service should be found. Although ABBYY works great, it is not free; processing all forms filed within a calendar year would take 10,000+ page requests (not counting development trial and error), which could cost up to $900 according to ABBYY pricing.

I haven't used the cloud service and I agree that the cost is probably prohibitive for most projects. So for this writeup, I'm focusing only on the desktop application -- I imagine both the Windows and Mac versions have similar OCR effectiveness. I'll cover the process of how to use the Desktop application to perform batch OCR in another writeup.

Simple table

The OCRing of regular text is a well-understood problem -- and I show how FineReader's OCR compares to Tesseract's on a scanned cover letter later in this writeup.

So let's get right into the interesting part: the OCR of tabular data.

Here's one of the simpler variations of forms in the Senate disclosures:

000602483.gif

Note: this example and others come from Senator Lamar Alexander's 2011 Annual Report (mirrored here at OpenSecrets). The electronic system only came into effect a couple years ago and Senator Alexander's latest annual report was submitted electronically, so good on him.

What FineReader sees

When importing that single image into FineReader, this is what FineReader purports to "see", in terms of the OCRable regions of the page:

000602483-excel-abbyy-preview.jpg

The resulting Excel spreadsheet

FineReader has the ability to export the OCRed image as a PDF. But we want a table -- i.e. an Excel spreadsheet.

And here's the result:

000602483-excel.jpg

Pretty good! You can download the Excel file here. Or, if you want, here's the PDF that FineReader produces, which includes the OCRed text that you can at least highlight-copy-paste.

Less simple table

OK, now here's a much less simple table:

000602474.gif

Not only is there significantly more ink (and smudged ink) to deal with, but there are vertical table headers and other complex tabular features to process.

Here's the result of FineReader's Excel output:

000602474-excel.jpg

Definitely not as clean as the previous example, but to be honest, much better than what I had expected. I'm kind of shocked that it managed to make sense of the vertically-oriented headers. You'd still have a long way to go before you could put this into a database, but FineReader's output gives you a lot of options for heuristics to simplify the translation process.

And what if you just need to very quickly see if anyone, at any time, has ever owned assets in "Acme Co."? At the very least, FineReader provides very greppable text data.
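
For instance, a rough sketch (the filename here is hypothetical, since FineReader's output name depends on how you export): pipe FineReader's text-embedded PDF through pdftotext and grep the result:

$ pdftotext -layout 000602474-finereader.pdf - | grep -i "acme"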

Here's the Excel spreadsheet. And here's the PDF with embedded OCRed text.

A bunch of checkboxes

One more example: a bunch of checkboxes:

000602470.gif

That can't be that hard, right? But take a closer look...among other issues, we have boxes inside of boxes. And also, a bunch of hand-written scrawl that we can safely assume will not be accurately parsed.

Here's what the spreadsheet produced by FineReader looks like:

000602470-excel.jpg

Yep, that's basically unusable. Not so much because of the character accuracy (in fact, FineReader translates the X'ed boxes), but because the tabular structure isn't preserved in the way you'd hope it to be.

You can download the spreadsheet file here. And the PDF with embedded text here.

Simple letter page (FineReader vs Tesseract)

What about regular letters of prose? This is something within the featureset of open-source OCR software such as Tesseract (I'm using version 3.04.01, released in February 2016), so I'll compare it against FineReader.

Here's the original page from Sen. Alexander's report, with its original neck-wrenching orientation:

000602485.gif

FineReader + pdftotext -layout

Here's the PDF created by FineReader's OCR, which is able to detect orientation automatically. Because we don't care about tabular data, I've used Poppler's pdftotext utility to just extract the text, along with pdftotext's -layout flag to produce it in such a way that the whitespace is similar to the visual layout of the PDF.
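
The invocation looks something like this (the filename is hypothetical; I'm assuming you've exported FineReader's result as a PDF with embedded text):

$ pdftotext -layout 000602485-finereader.pdf 000602485-finereader.txt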

Not bad. There are a few problems that would be significant hurdles if you wanted to grep across the text, including a comma where a decimal point should be in the "17.1226% interest" figure, and the somewhat inexplicably consistent translation of "of" to "o f":

     LAMAR ALEXANDER
         TENNESSEE




                                              Mttd States Senate
                                                     WASHINGTON, DC 20510




                                                                 May 15,2012


           Dear Senators Boxer and Isakson,

           My wife, Leslee B. Alexander, owns 17,1226% interest in her family’    s Texas
           corporation, the Starboard Corporation. Since 2003,1have reported her ownership o f this
           interest and relied on the Starboard Corporation to provide my accountants with a list o f
           underlying assets o f the corporation so that I could also list them on my annual financial report.

                   This year, while preparing my 2011 Financial disclosure, my accountants inquired o f the
           Starboard corporation whether the list o f underlying assets was up-to-date. On May 10, 2012,
           the corporation notified my accountants that one such asset had not been included on the list— a
           piece o f commercial real estate in San Antonio purchased by the Starboard Corporation in 2004.
           My accountants say that this property had not been reported to them previously.

                   This omission did not affect the accuracy o f the “
                                                                     amount or o f type o f income”from the
           Starboard Corporation reported on my annual financial disclosures between 2004 and 2010.
           What was inaccurate was failure to report ownership o f this one underlying asset o f the
           corporation, the San Antonio property.

                  In this year’s 2011 Financial Disclosure report, the San Antonio property is included
           along with ten other underlying assets o f the Starboard Corporation (all o f which have been
           previously reported) among our non-publicly traded assets and unearned income sources. It can
           be found on page 9, line 1 o f the 2011 report.

                  Looking ahead, I have talked both with my Nashville accountants and the Texas
           accountant for the Starboard Corporation and emphasized to them the importance o f reporting
           underlying assets and o f observing the new rules concerning reporting transactions within 30
           days. I do not own any publicly traded securities.

                     Should you have additional questions regarding this matter, please contact me at 202 224
            1989.


                                                                 Sim
                                                                 Sincerely,
$0
ST
iN                                                       Lamar Alexander
©
io
o
D
P
P
O


Tesseract OCR

For this test, I used Tesseract version 3.04.01, which was released in February 2016. One thing Tesseract won't do is process GIFs (which are, for whatever reason, the preferred image format of the Senate disclosure database), so you'll need something like ImageMagick to convert them first.

And, Tesseract doesn't seem to do automatic orientation detection (or at least I don't know how to invoke it), so you'll have to reorient the image before passing it to Tesseract to OCR.

The command-line sequence with ImageMagick (which provides the convert command to do image transformations) looks like this:

$ convert 000602485.gif -rotate 270 000602485.tiff
$ tesseract 000602485.tiff 000602485-tesseract

It produces a file named 000602485-tesseract.txt. Because Tesseract, by default, produces a plaintext stream, there's no opportunity to use pdftotext -layout on its output (you can, however, configure Tesseract to output hOCR data, which gives you the option of determining spatial regions for yourself -- something projects like Jacob Fenton's whatwordwhere aim to do).
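
For reference, the hOCR invocation looks roughly like this (at least on the 3.0x versions; the exact output filename and extension can vary by version):

$ tesseract 000602485.tiff 000602485-tesseract hocr

which should produce an HTML-like file (e.g. 000602485-tesseract.hocr) containing the recognized words along with their bounding-box coordinates.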

Here's the text output from Tesseract:

LAMAR ALEXANDER
TENNESSEE

flflnittd 0%tatts :52an

WASHINGTON, DC 20510
May 15, 2012

Dear Senators Boxer and Isakson,

My wife, Leslee B. Alexander, owns 17.1226% interest in her family’s Texas
corporation, the Starboard Corporation. Since 2003, I have reported her ownership of this
interest and relied on the Starboard Corporation to provide my accountants with a list of
underlying assets of the corporation so that I could also list them on my annual financial report.

This year, while preparing my 201 1 Financial disclosure, my accountants inquired of the
Starboard corporation whether the list of underlying assets was up-to-date. On May 10, 2012,
the corporation notified my accountants that one such asset had not been included on the list—a
piece of commercial real estate in San Antonio purchased by the Starboard Corporation in 2004.
My accountants say that this property had not been reported to them previously.

This omission did not affect the accuracy of the “amount or of type of income” from the
Starboard Corporation reported on my annual financial disclosures between 2004 and 2010.
What was inaccurate was failure to report ownership of this one underlying asset of the
corporation, the San Antonio property.

In this year’s 2011 Financial Disclosure report, the San Antonio property is included
along with ten other underlying assets of the Starboard Corporation (all of which have been
previously reported) among our non-publicly traded assets and unearned income sources. It can
be found on page 9, line 1 of the 2011 report.

Looking ahead, I have talked both with my Nashville accountants and the Texas
accountant for the Starboard Corporation and emphasized to them the importance of reporting
underlying assets and of observing the new rules concerning reporting transactions within 30
days. I do not own any publicly traded securities.

Should you have additional questions regarding this matter, please contact me at 202 224
1989.

m Sincerely,
ST
N , Lamar Alexander

Ll?
CD!

(5:)
G)

For being free, Tesseract does a very capable job. I didn't bother to do a real analysis of its accuracy versus FineReader's, other than to note that it correctly rendered "of" as of, where FineReader produced o f.

Conclusion

Turning data that was "optimized" for paper -- whether it be digital PDFs or scanned images simply packaged as PDFs -- back into structured, machine-readable form will be a significant computational task for as long as humans require human-readable information. And it's safe to assume that state-of-the-art OCR will never be 100% accurate.

For now, I'm pretty satisfied with the kind of performance FineReader provides at ~$100. And I don't blame the open-source contributors of Tesseract for not having voluntarily tackled the challenge of OCRing tabular data -- Tesseract (which is also trainable) works pretty damn well on regular text.

In terms of OpenSecrets's call-to-arms to automate the processing of Congressional paper forms, good OCR is not enough; we also need a system to batch-collect and process the documents, which will be the subject of its own writeup.

It's worth noting, though, that OpenSecrets isn't waiting around for magic OCR to come around: they've processed the financial disclosure forms the old-fashioned way -- human-powered reading and data entry -- and have generously provided their results in browsable and searchable form on the Personal Finances section of their eponymous political transparency site:

https://www.opensecrets.org/pfds/

More Repositories

1. watson-word-watcher: A proof of concept using IBM's Speech-to-Text API to do quick-and-dirty transcriptions (Python, 309 stars)
2. journalism-syllabi: Computer-Assisted Reporting and Data Journalism Syllabuses, compiled by Dan Nguyen (Python, 165 stars)
3. github-for-portfolios: A layperson's step-by-step guide to building webpages with Github (CSS, 73 stars)
4. python-notebooks-data-wrangling: Python 3.x notebooks about real-world data cleaning and visualization (Jupyter Notebook, 68 stars)
5. facebook-trending-rss-fetcher: Python code to scrape and collect data from the RSS feeds Facebook uses to augment its Trending Section (Python, 56 stars)
6. smalldata_journalism: An online reference for data journalism (Ruby, 25 stars)
7. learn-data-csv-cli: A work-in-progress guide showing how and why you should learn command-line tools (xsv, csvkit) to work with data (Python, 19 stars)
8. bashfoo: My personally curated list of bash/command-line commands and snippets that are very useful yet I keep on forgetting (Python, 18 stars)
9. datajournalism-primer: a general list of resources and articles for people interested in getting into data journalism (HTML, 16 stars)
10. congress-colleges: What fancy schools do U.S. legislators go to? (HTML, 15 stars)
11. gis-geospatial-fun-python3x: Tracking my progress in doing GIS/Geospatial work in Python 3.x (Jupyter Notebook, 12 stars)
12. nicar-2019-pdfplumbing: NICAR 2019 workshop on using Python and PDFplumber to extract text from PDFs (Jupyter Notebook, 12 stars)
13. Congressmiles: A tutorial on using Face.com's and NYT Congress's API + Sunlight data (Ruby, 10 stars)
14. dannguyen.github.io: I'm making a Github Pages repo! (HTML, 9 stars)
15. scrape-senate-financial-disclosures: looking at U.S. Senators' disclosures, including how to parse and track them (HTML, 9 stars)
16. local-news-data: how hard is it to get a list of all local news sites in the United States (LOL) (Python, 8 stars)
17. python-at-stanford: Python Courses at Stanford (8 stars)
18. NICAR-Google-Refine: The lesson and source files for Dan Nguyen's NICAR 2012 lesson on Google Refine (6 stars)
19. pdftotablestable: Comparing the programs that extract tabular data from PDFs, e.g. ABBYY FineReader, Tabula, CometDocs (6 stars)
20. house-financial-disclosures: Scraping House representative financial disclosures (Python, 6 stars)
21. clinton-hillary-email-fbi-investigation-docs: OCR copy of the 2015-2016 FBI Investigation into Hillary Clinton's emails (6 stars)
22. pydataproject-template: dan's personal reference for properly creating an empty/fresh python-based data wrangling project (Python, 5 stars)
23. padjo-2017-sql-exam: PADJO 2017 SQL Exam - Now with extra election and disbursement data! (Shell, 5 stars)
24. aws-textract-pdf-to-csv-demo: Testing the new AWS Textract when it comes to extracting data tables from PDFs (pdf-to-csv) and whether it can deliver us from our endless torments (5 stars)
25. nhtsa-complaint-data: Some scripts/data description for NHTSA complaint data (5 stars)
26. quickdataproject-template: a template I use for quick data project examples where collection, wrangling, and exploration can be done by standalone shell/python scripts (Python, 5 stars)
27. screencappy: A command-line tool for making it easier to create and save screenshots as a blogger (Python, 4 stars)
28. dmv-vanity-plate-rejections: A repo of collected data and records from U.S. state DMVs regarding rejected vanity license plates (HTML, 4 stars)
29. csvkitcat: csvkitcat has been archived (Oct. 2020), and is being carted over to csvmedkit (Python, 4 stars)
30. frozen.analytics.usa.gov: A "frozen" version of https://analytics.usa.gov to practice network traffic inspection and web scraping (CSS, 4 stars)
31. writhub: A simple Python-based static post generator, because I just need to post, not make an entire website (Python, 4 stars)
32. journaling-on-github: My personal repo for doing quick journaling on Github with Markdown, plus some helper TOC scripts (Python, 4 stars)
33. acp-2017-finding-stories-in-data: "How to Find Stories in Data" for the Associated Collegiate Press 2017 San Francisco Midwinter Convention (4 stars)
34. kfc-scrape: chicken (3 stars)
35. til: A simple static Jekyll blog of things I've learned, day-to-day, particularly in programming and data journalism (Ruby, 3 stars)
36. altair-dataviz: Visualization in Python with the Altair library. Done in Jupyter Notebooks. (Jupyter Notebook, 3 stars)
37. mechanical-unmurk-ocr: For the OCRing of scanned, murky documents where privacy, speed, accuracy, and cost are all priorities (3 stars)
38. seeing-is-beliebing: Instagram util for finding photos taken shorty before and after near where another photo was taken (JavaScript, 3 stars)
39. simplestuff-sqlite: A data/lesson repo teaching SQL syntax and concepts with a very simple SQLite database (Shell, 3 stars)
40. smalldata: A list of small datasets for examples of exploration in spreadsheets (Python, 3 stars)
41. cms_medicare_fee_data: Data notebook for CMS Medicare fee data (3 stars)
42. marktoc: A Python library for generating a table of contents and anchor markup for a Markdown file (Python, 3 stars)
43. sf-shelter-waitlist-daily-snapshots: A compilation of daily snapshots of San Francisco's emergency shelter reservation wait-list during the COVID-19 pandemic (Python, 3 stars)
44. seshkit: a command-line tool for creating transcripts from audio files (Python, 3 stars)
45. excsv: goofin around with a command-line utility for quickly inspecting CSV files (Python, 3 stars)
46. merle: A command-line tool for getting meta information from a URL (Python, 2 stars)
47. DepGal: Build out a gal using RMagick (JavaScript, 2 stars)
48. csvviz: please i would like someday a tool that is like csvkit but for making charts from the command line (Python, 2 stars)
49. supcli: my personal guide to modern CLI, including third-party replacement for classic Nix tools (2 stars)
50. xkcd-on-reactjs: Just playing around with React.js to make a searchable xkcd archive (Ruby, 2 stars)
51. yearbook (Ruby, 2 stars)
52. ny-gis-cartodb-fun: Examples of GIS with New York data and CartoDB (2 stars)
53. sf-ethics-lobbyist-sql: A repo of San Francisco lobbyist data compiled into SQLite form, including data-handling scripts (Shell, 2 stars)
54. emojicsv: Machine-readable emotions in machine-readable CSV (HTML, 2 stars)
55. command-line-basics-mz2022: command line lessons for 2022 quickie repo (2 stars)
56. SCOTUS-Transcript-Viewer: A Backbone.js viewer of SCOTUS transcripts (JavaScript, 2 stars)
57. Shakyspeare: Analyzing the Bard's work with Ruby! (2 stars)
58. death-data (2 stars)
59. bts-transstats-t100-domestic-demo: Demo of data processing for BTS transtats (2 stars)
60. middleman-meta-tags: Meta and SEO tag helpers for Middleman (Ruby, 2 stars)
61. city_crime_data: collecting crime report data from cities that have it in a granular format (Makefile, 2 stars)
62. bashappy_helpers: A bunch of helper functions I wrote to use for my own macOS terminal convenience (Shell, 2 stars)
63. air_skift: Air rails (Ruby, 2 stars)
64. secdataexploring: fetching and exploring SEC structured data for fun (Python, 2 stars)
65. dod-leso-1033-data: A repo for collecting data/records regarding the Defense Logistics (Python, 2 stars)
66. matplotlib-styling-tutorial: A quick iPython notebook showing how to create and style Matplotlib charts with roughly same flexibility as ggplot2 (Jupyter Notebook, 2 stars)
67. texas-state-salaries: playing around with texas state salary data courtesy of the Texas Tribune (Python, 2 stars)
68. healthcare.gov: A copy of healthcare.gov when it was built on Jekyll, before they removed the source code (JavaScript, 2 stars)
69. jekyll-datasite-template: Trying to make a template that scaffolds a basic jekyll site with bootstrap and vendor d3v5 (JavaScript, 2 stars)
70. pgark (page archiver): Python library and CLI for archiving URLs on popular services like Wayback Machine [alpha, just spitballing] (Python, 2 stars)
71. nature-inspired-algorithms-in-python: Going through Jason Brownlee's "Clever Algorithms: Nature-Inspired Programming Recipes" http://cleveralgorithms.com/nature-inspired/stochastic/random_search.html (Python, 1 star)
72. lookups-of-note: Lookup tables and data references (1 star)
73. censusscout: making my own lightweight version of Census Explorer because y not (JavaScript, 1 star)
74. motherfuckingwebdesignguide: just do it (1 star)
75. foodscrape: A demonstration of scraping health inspection websites and doing statistical analysis (1 star)
76. nicar-2019-github-intro: Intro to git and github for journalists (Makefile, 1 star)
77. Sinatra-Fun: Testing out sinatra (Ruby, 1 star)
78. jekyll-bootstrap-starter: a basic jekyll theme that sits atop of Bootstrap 4.x. For my convenience only (HTML, 1 star)
79. data-wrangling-fakebook: The Little Data Wrangling Fakebook (Python, 1 star)
80. foiastories: a curated list of interesting foia/foil requests (1 star)
81. astronautdata: A repo of astronaut data (HTML, 1 star)
82. danssphinx-template: This is a bunch of examples of things I forget how to do in Sphinx and reST (Python, 1 star)
83. sql2md: A bash script for converting SQLite query into Markdown-ready-pastable results (Shell, 1 star)
84. poynter-census-data-2019: Poynter Census Data Workshop 2019, using Sphinx-hieroglyph slidemaker (Python, 1 star)
85. stanford-public-affairs-data-journalism (1 star)
86. sf-evictions: just collecting san francisco evictions data (Python, 1 star)
87. d3choro-template: yaddaydaydayda (CSS, 1 star)
88. merde: Shit (1 star)
89. digital-jo-2017: Quickie repo for digital journalism notes for stanford journalism 2017 (1 star)
90. twitkit: yet another attempt at making a personal twitter data exploration command-line tool (Python, 1 star)
91. wire-glossary: the fuck did I do (Ruby, 1 star)
92. high-charty (JavaScript, 1 star)
93. wikipedia-trends (1 star)
94. revelecture: A command-line tool to turn Markdown files into Reveal.js powered slideshows (JavaScript, 1 star)
95. hello-svelte: need to practice this javascript thing (HTML, 1 star)
96. ok-earthquakes-RNotebook: Using R's ggplot2 and rgdal to examine earthquake activity in Oklahoma (R, 1 star)
97. fatal-encounters-and-census-sql: SQLite database exercises for analyzing Fatal Encounters (police officer involved homicides) and Census data (Shell, 1 star)
98. python-audio-playtime: experimenting with Python audio visualizers and extraction libraries (1 star)
99. scrapespeare: A collection of The Bard's text for basic programming exercises and data mining. (XSLT, 1 star)
100. twitch-stream-exploring-ppp-with-cli: Just some notes and data and files for a twitch stream on how to data wrangle the PPP loan data (1 star)