
Newspaper Navigator

By Benjamin Charles Germain Lee (2020 Library of Congress Innovator in Residence)

This project is an experiment, and there is no active development or maintenance of the codebase. Fork at your own risk! In the unlikely event that there are further updates, the LC Labs team will announce them through our communication channels. Sign up for our listserv on labs.loc.gov, follow us on Twitter @LC_Labs, and watch #NewspaperNavigator.

Introduction

The goal of Newspaper Navigator is to re-imagine searching over the visual content in Chronicling America. The project consists of two stages:

  • Creating the Newspaper Navigator dataset by extracting headlines, photographs, illustrations, maps, comics, cartoons, and advertisements from 16.3 million historic newspaper pages in Chronicling America using emerging machine learning techniques. In addition to the visual content, the dataset includes captions and other relevant text derived from the METS/ALTO OCR, as well as image embeddings for fast similarity querying.
  • Creating an exploratory search application for the Newspaper Navigator dataset in order to enable new ways for the American public to navigate Chronicling America.

This repo contains the code for both steps of the project, as well as a list of Newspaper Navigator resources.

Updates

Update (09/10/2020):

  • The development of the Newspaper Navigator search application is complete! You can find the search application at: https://news-navigator.labs.loc.gov/search.

  • The Newspaper Navigator data archaeology is now available at: http://dx.doi.org/10.17613/k9gt-6685. The data archaeology examines the ways in which a Chronicling America newspaper page is transmuted and decontextualized during its journey from a physical artifact to a series of probabilistic photographs, illustrations, maps, comics, cartoons, headlines, and advertisements in the Newspaper Navigator dataset. You can also find the PDF of the paper in the repo here.

Update (05/05/2020):

  • The pipeline has finished processing 16,368,041 Chronicling America pages, and the Newspaper Navigator dataset is available to the public! You can find the Newspaper Navigator dataset website here: https://news-navigator.labs.loc.gov/.
  • Learn more about the dataset and its construction in a paper available here: https://arxiv.org/abs/2005.01583. You can also find the PDF of the paper in the repo here.

The Newspaper Navigator Dataset Pipeline

The sections below describe the different components of the Newspaper Navigator pipeline. Here is a diagram of the pipeline workflow:

[Figure: diagram of the Newspaper Navigator pipeline workflow]

Training Dataset for Visual Content Recognition in Historic Newspapers

The first step in the pipeline is creating a training dataset for visual content recognition. The Beyond Words dataset consists of crowdsourced locations of photographs, illustrations, comics, cartoons, and maps in World War I era newspapers, as well as corresponding textual content (titles, captions, artists, etc.). To make it easy to train a visual content recognition model for historical newspaper scans, a copy of the dataset is included in this repo (in /beyond_words_data/), formatted according to the COCO standard for object detection. The images are stored in /beyond_words_data/images/, and the JSON can be found in /beyond_words_data/trainval.json. The JSON also includes annotations of headlines and advertisements, as well as annotations for additional pages containing maps to boost the number of maps in the dataset. These additional annotations were all done by one person (myself) and are thus unverified.

The dataset contains 3,437 images with 6,732 verified annotations (downloaded from the Beyond Words site on 12/01/2019), plus an additional 32,424 unverified annotations. Here is a breakdown of categories:

| Category | # in Full Dataset |
| --- | --- |
| Photograph | 4,254 |
| Illustration | 1,048 |
| Map | 215 |
| Comics/Cartoon | 1,150 |
| Editorial Cartoon | 293 |
| Headline | 27,868 |
| Advertisement | 13,581 |
| Total | 48,409 |

If you would like to use only the verified Beyond Words data, just disregard the headline and advertisement annotations, as well as the annotations for any image added after 12/1/2019.

For an 80%-20% split of the dataset, see /beyond_words_data/train_80_percent.json and /beyond_words_data/val_20_percent.json. Lastly, the original verified annotations from the Beyond Words site can be found at beyond_words_data/beyond_words.txt.

To construct the dataset using the Beyond Words annotations added since 12/01/2019, first update the annotations file from the Beyond Words website, then run the script process_beyond_words_dataset.py. To add the additional headline and advertisement annotations, you can retrieve them from /beyond_words_data/trainval.json and add them to your dataset.
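For illustration, here is a minimal sketch of that merge, assuming standard COCO-format JSON; the category names and file paths used here are assumptions, so check /beyond_words_data/trainval.json for the actual values:

```python
import json

# Hypothetical paths: the repo's trainval.json plus a COCO file you built
# from freshly downloaded Beyond Words annotations.
with open("beyond_words_data/trainval.json") as f:
    trainval = json.load(f)
with open("beyond_words_data/my_new_dataset.json") as f:
    dataset = json.load(f)

# Find the category IDs for headlines and advertisements by name
# (assumes these names appear in the COCO "categories" list).
wanted_ids = {c["id"] for c in trainval["categories"]
              if c["name"].lower() in {"headline", "advertisement"}}

# Pull the matching annotations, plus any referenced images that the new
# dataset does not already contain.
extra_anns = [a for a in trainval["annotations"] if a["category_id"] in wanted_ids]
needed_image_ids = {a["image_id"] for a in extra_anns}
have_image_ids = {img["id"] for img in dataset["images"]}
extra_imgs = [img for img in trainval["images"]
              if img["id"] in needed_image_ids - have_image_ids]

# Note: you may also need to re-number annotation IDs to avoid collisions.
dataset["annotations"].extend(extra_anns)
dataset["images"].extend(extra_imgs)

with open("beyond_words_data/merged.json", "w") as f:
    json.dump(dataset, f)
```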

Detecting and Extracting Visual Content from Historic Newspaper Scans

With this dataset fully constructed, it is possible to train a deep learning model to identify visual content and classify the content according to 7 classes (photograph, illustration, map, comic, editorial cartoon, headline, advertisement). The approach utilized here is to finetune a pre-trained Faster-RCNN implementation in Detectron2's Model Zoo in PyTorch.

I have included scripts and notebooks designed to run out-of-the-box on most deep learning environments (tested on an AWS EC2 instance with a Deep Learning Ubuntu AMI). Below are the steps to get running on any deep learning environment with Python 3, PyTorch, and the standard scientific computing packages shipped with Anaconda:

  1. Clone this repo.
  2. Next, run /install-scripts/install_detectron_2.sh in order to install Detectron2 and all of its dependencies. Due to some deprecated code in pycocotools, you may need to change "unicode" to "bytes" on line 308 of ~/anaconda3/lib/python3.6/site-packages/pycocotools/coco.py in order for the test evaluation in Detectron2 to work correctly. If the install script fails, I recommend following the installation steps on the Detectron2 repo.
  3. For the pipeline code, you'll need to clone a forked version of img2vec that I modified to include ResNet-50 embedding functionality. Then cd img2vec and run python setup.py install.
  4. For the pipeline code, you'll also need to install graphicsmagick for converting JPEG-2000 images to JPEG images. Run sudo apt-get install graphicsmagick to install it.

To experiment with training your own visual content recognition model, run the command jupyter notebook and navigate to /notebooks/train_model.ipynb, which contains code for finetuning Faster-RCNN implementations from Detectron2's Model Zoo. The notebook is pre-populated with the output from training the model for 10 epochs (scroll down to the bottom to see some sample predictions). If everything is installed correctly, the notebook should run without any additional steps!
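For orientation, the core of the fine-tuning setup looks roughly like the following minimal sketch using Detectron2's standard API (the dataset names and hyperparameter values here are placeholders, not the notebook's exact settings):

```python
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.data.datasets import register_coco_instances
from detectron2.engine import DefaultTrainer

# Register the COCO-format Beyond Words splits shipped in this repo.
register_coco_instances("bw_train", {}, "beyond_words_data/train_80_percent.json",
                        "beyond_words_data/images")
register_coco_instances("bw_val", {}, "beyond_words_data/val_20_percent.json",
                        "beyond_words_data/images")

cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file("COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml"))
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url("COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml")
cfg.DATASETS.TRAIN = ("bw_train",)
cfg.DATASETS.TEST = ("bw_val",)
cfg.MODEL.ROI_HEADS.NUM_CLASSES = 7  # photograph, illustration, map, comic, editorial cartoon, headline, ad
cfg.SOLVER.IMS_PER_BATCH = 2         # placeholder hyperparameters; see the notebook for the real values
cfg.SOLVER.BASE_LR = 0.00025
cfg.SOLVER.MAX_ITER = 10000

trainer = DefaultTrainer(cfg)
trainer.resume_or_load(resume=False)
trainer.train()
```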

Processing Your Own Newspaper Pages

The model weights file for the finetuned visual content recognition model is available here (the file is approximately 300 MB in size).

The visual content recognition model is a finetuned Faster-RCNN implementation (the R50-FPN backbone from Detectron2's Model Zoo). The model weights are used in the Newspaper Navigator pipeline for the construction of the Newspaper Navigator dataset. The R50-FPN backbone was selected because it has the fastest inference time of the Faster-RCNN backbones, and inference time is the bottleneck in the Newspaper Navigator pipeline (approximately 0.1 seconds per image on an NVIDIA T4 GPU). Though the X101-FPN backbone reports a higher box average precision (43% vs. 37.9%), its inference time is approximately 2.5 times slower, which would drastically increase the pipeline runtime.

Here are performance metrics on the model available for use; the model consists of the Faster-RCNN R50-FPN backbone from Detectron2's Model Zoo (all training was done on an AWS g4dn.2xlarge instance with a single NVIDIA T4 GPU) finetuned on the training set described above:

| Category | Average Precision | # in Validation Set |
| --- | --- | --- |
| Photograph | 61.6% | 879 |
| Illustration | 30.9% | 206 |
| Map | 69.5% | 34 |
| Comic/Cartoon | 65.6% | 211 |
| Editorial Cartoon | 63.0% | 54 |
| Headline | 74.3% | 5,689 |
| Advertisement | 78.7% | 2,858 |
| Combined | 63.4% | 9,931 |

For slideshows showing the performance of this model on 50 sample pages from the Beyond Words test set, please see /demos/slideshow_predictions_filtered.mp4 (for the predictions filtered with a threshold cut of 0.5 on confidence score) and /demos/slideshow_predictions_unfiltered.mp4 (for the predictions with a very low, default threshold cut of 0.05 on confidence score).

Note: To use the model weights, import the model weights in PyTorch as usual, and add the following lines:

  • cfg.merge_from_file("/detectron2/configs/COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml") (note that you may need to change the filepath to navigate to detectron2 correctly)
  • cfg.MODEL.ROI_HEADS.NUM_CLASSES = 7

To see more on how to run inference using this model, take a look at the pipeline code.
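As a concrete starting point, a minimal inference sketch using Detectron2's DefaultPredictor might look like the following (the weights path and sample image are placeholders; the pipeline code remains the authoritative reference):

```python
import cv2
from detectron2.config import get_cfg
from detectron2.engine import DefaultPredictor

cfg = get_cfg()
# Adjust this path to wherever your detectron2 checkout lives.
cfg.merge_from_file("/detectron2/configs/COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml")
cfg.MODEL.ROI_HEADS.NUM_CLASSES = 7
cfg.MODEL.WEIGHTS = "model_weights.pth"       # placeholder path to the downloaded weights file
cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = 0.5   # e.g. the 0.5 confidence cut used in the filtered demo

predictor = DefaultPredictor(cfg)
image = cv2.imread("sample_page.jpg")         # placeholder: any Chronicling America page scan as JPEG
outputs = predictor(image)

instances = outputs["instances"].to("cpu")
print(instances.pred_boxes)    # predicted bounding boxes (pixel coordinates)
print(instances.pred_classes)  # indices into the 7 visual content categories
print(instances.scores)        # confidence scores
```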

Extracting Captions and Textual Content using METS/ALTO OCR

Now that we have a finetuned model for extracting visual content from newspaper scans in Chronicling America, we can leverage the OCR of each scan to weakly supervise captions and corresponding textual content. Because Beyond Words volunteers were instructed to draw bounding boxes over corresponding textual content (such as titles and captions), the finetuned model has learned how to do this as well. Thus, it is possible to utilize the predicted bounding boxes to extract textual content within each predicted bounding box from the METS/ALTO OCR XML file for each Chronicling America page. Note that this is precisely what happens in Beyond Words during the "Transcribe" step, where volunteers correct the OCR within each bounding box. The code for extracting textual content from the METS/ALTO OCR is included in the pipeline and is described below.
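As a rough illustration of the idea (not the pipeline's actual implementation), the sketch below collects the OCR words whose ALTO coordinates fall inside a predicted bounding box; it assumes the standard ALTO String attributes (CONTENT, HPOS, VPOS, WIDTH, HEIGHT) and omits the rescaling between image pixel coordinates and the ALTO coordinate space:

```python
import xml.etree.ElementTree as ET

def text_in_box(alto_xml_path, box):
    """Return the OCR words whose centers fall inside `box` (x_min, y_min, x_max, y_max).

    Coordinates are assumed to already be in the ALTO file's coordinate space;
    in practice the predicted pixel coordinates must be rescaled first.
    """
    x_min, y_min, x_max, y_max = box
    words = []
    for elem in ET.parse(alto_xml_path).getroot().iter():
        if not elem.tag.endswith("String"):      # skip everything but ALTO String elements
            continue
        if elem.get("HPOS") is None or elem.get("VPOS") is None:
            continue
        cx = float(elem.get("HPOS")) + float(elem.get("WIDTH", 0)) / 2
        cy = float(elem.get("VPOS")) + float(elem.get("HEIGHT", 0)) / 2
        if x_min <= cx <= x_max and y_min <= cy <= y_max:
            words.append(elem.get("CONTENT", ""))
    return " ".join(words)
```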

Generating Image Embeddings

In order to generate search and recommendation results over similar visual content, it is useful to have pre-computed image embeddings for fast querying. In the pipeline, I have included code for generating image embeddings using a forked version of img2vec.
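A minimal sketch of generating an embedding with img2vec follows (the upstream package installs as img2vec_pytorch; the forked version referenced above additionally exposes a ResNet-50 model option, so the exact import and model name may differ):

```python
from PIL import Image
from img2vec_pytorch import Img2Vec  # import path may differ for the forked version

# ResNet-18 embeddings (512 dimensions); the fork adds a ResNet-50 option
# per the description above.
img2vec = Img2Vec(cuda=False, model="resnet-18")

img = Image.open("cropped_photo.jpg").convert("RGB")  # placeholder: any cropped visual content
embedding = img2vec.get_vec(img)                       # numpy array of shape (512,)
print(embedding.shape)
```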

A Pipeline for Running at Scale

The pipeline code for processing 16.3 million Chronicling America pages can be found in /notebooks/process_chronam_pages.ipynb. This code relies on the repo chronam-get-images to produce manifests of each newspaper batch in Chronicling America. A .zip file containing the manifests can be found in this repo in manifests.zip. When unzipped, the manifests are separated into two folders: processed (containing the 16,368,041 pages that were successfully processed) and failed (containing the 383 pages that failed during processing).

This notebook then:

  1. downloads the image and corresponding OCR for each newspaper page in each Chronicling America batch directly from the corresponding S3 buckets (note: you can alternatively download Chronicling America pages using chronam-get-images)
  2. performs inference on the images using the finetuned visual content detection model
  3. crops and saves the identified visual content (minus headlines)
  4. extracts textual content within the predicted bounding boxes using the METS/ALTO XML files containing the OCR for each page
  5. generates ResNet-18 and ResNet-50 embeddings for each cropped image using a forked version of img2vec for fast similarity querying
  6. saves the results for each page as a JSON file in a file tree that mirrors the Chronicling America file tree

Note: to run the pipeline, you must convert the notebook to a Python script, which can be done with the command: jupyter nbconvert --to script process_chronam_pages.ipynb. This is necessary because the code is heavily parallelized using multiprocessing, which conflicts with cell execution in Jupyter notebooks.
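Conceptually, the parallelization is a map over page entries from the batch manifests; a stripped-down sketch of that pattern (not the actual pipeline code, which also handles downloads, inference, cropping, OCR extraction, embeddings, and error logging) is:

```python
from multiprocessing import Pool

def process_page(page_url):
    # Placeholder for the real per-page work: download the image and METS/ALTO OCR,
    # run the visual content model, crop and save the detections, extract text,
    # generate embeddings, and write a JSON result file.
    return {"page": page_url, "status": "ok"}

if __name__ == "__main__":
    # Hypothetical manifest format: one Chronicling America page URL per line.
    with open("manifest.txt") as f:
        pages = [line.strip() for line in f if line.strip()]

    with Pool(processes=8) as pool:  # tune to the instance's cores/GPUs
        results = pool.map(process_page, pages)
```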

Visualizing a Day in Newspaper History

Using the chronam-get-images repo, we can pull down all of the Chronicling America content for a specific day in history (or, a larger date range if you're interested - the world is your oyster!). Running the above scripts, it's possible to go from a set of scans and OCR XML files to extracted visual content. How do we then visualize this content?

One answer is to use image embeddings and t-SNE to cluster the images in 2D. To generate the embeddings, I've used img2vec. Using sklearn's implementation of t-SNE, it's easy to perform dimensionality reduction down to 2D, perfect for a visualization. We can then visualize a day in history!
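A minimal sketch of that step, assuming the embeddings for one day's extracted images are already stacked into an array (file names here are placeholders):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# One row per extracted image, e.g. 512-dimensional ResNet-18 vectors.
embeddings = np.load("embeddings_6_7_1944.npy")   # placeholder file

# Project to 2D for plotting.
coords = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(embeddings)

plt.figure(figsize=(12, 12))
plt.scatter(coords[:, 0], coords[:, 1], s=2)
plt.title("Extracted visual content, June 7th, 1944 (t-SNE)")
plt.savefig("tsne_layout.png", dpi=300)
```

The sample visualization referenced below goes further by drawing the cropped images themselves at their 2D coordinates rather than plotting points.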

For a sample visualization of June 7th, 1944 (the day after D-Day), please see /demos/visualizing_6_7_1944.png (NOTE: the image is 20 MB, so the extracted images remain high resolution even when zooming in). If you search around in this visualization, you will find clusters of maps showing the Western Front, photographs of military action, and photographs of people. Currently, the aspect ratio of the extracted visual content is not preserved; this is to be added in future iterations.

The script /demos/generate_visualization.py contains my code for generating the sample visualization, though it does not currently support out-of-the-box use.

The Newspaper Navigator Search Application

The Newspaper Navigator search application is available at: https://news-navigator.labs.loc.gov/search. With the application, you can explore 1.56 million photos from the Newspaper Navigator dataset. In addition to searching by keyword over the photos' captions (extracted from the OCR of each newspaper page as part of the Newspaper Navigator pipeline), you can search by visual similarity using machine learning. In particular, by selecting photos that you are interested in, you can train an "AI navigator" on the fly to retrieve photos for you according to visual similarity (for example: baseball players, sailboats, etc.). An AI navigator can train and predict on all 1.56 million photos in just a couple of seconds, thus facilitating re-training and tuning. To learn more about this application, please see the demo video on the landing page or read more on the 'About' page.
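Conceptually, an "AI navigator" is a lightweight classifier trained on the pre-computed photo embeddings; the sketch below illustrates the idea with scikit-learn (this is an illustration of the concept, not the app's exact model, features, or negative-sampling strategy):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Pre-computed image embeddings, one row per photo.
embeddings = np.load("photo_embeddings.npy")      # placeholder file

# The user marks a handful of photos as relevant; everything else starts as negative.
positive_ids = [12, 404, 981]                     # indices of photos the user selected
labels = np.zeros(len(embeddings), dtype=int)
labels[positive_ids] = 1

clf = LogisticRegression(max_iter=1000, class_weight="balanced")
clf.fit(embeddings, labels)

# Score every photo and surface the most similar ones first.
scores = clf.predict_proba(embeddings)[:, 1]
ranking = np.argsort(-scores)
print(ranking[:25])   # top matches to show the user for the next round of feedback
```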

You can find all of the code for the app in /news_navigator_app in this repo. The app is fully containerized in Docker and was written in Python, Flask, HTML, CSS, and vanilla JavaScript (scikit-learn was used for the machine learning component).

To launch the Docker container, follow these steps:

  1. Clone the repo
  2. Navigate to /news_navigator_app
  3. Run: docker-compose up --build
  4. Go to http://0.0.0.0:5000/search (or the appropriate port; just be sure to go to /search, which is the landing page)

The Redis usage in flaskapp.py is modeled after https://docs.docker.com/compose/gettingstarted/.

Currently, when the Dockerfile is executed, the script download_photos_and_metadata.py in /news_navigator_app/preprocessing is run before the app launches. Parameters for the pre-processing script can be found in /news_navigator_app/params.py; they control the number of photos consumed by the app.

Setting USE_PRECOMPUTED_METADATA to True tells the pre-processing script to pull down the pre-computed metadata for all 1.56 million photos, which amounts to ~6 GB of data. I recommend not doing this, as the memory consumption of the app is quite large (the app requires ~15 GB of RAM to run with all of the photos). Setting USE_PRECOMPUTED_METADATA to False instead tells the pre-processing script to pull down data from https://news-navigator.labs.loc.gov and compute the metadata on the fly. When launching the app for all 1.56 million photos, this is much slower than downloading the precomputed metadata; however, it enables you to run the app over a much smaller number of photos, which is advantageous when there are memory constraints, such as on a local machine.

To use a much smaller set of photos, you can use the default settings of USE_SAMPLE_PACKS as True, START_YEAR as 1910, and END_YEAR as 1911 (along with USE_PRECOMPUTED_METADATA as False). The USE_SAMPLE_PACKS flag tells the pre-processing script to pull down only 1,000 photos from each year in the date range; a few thousand photos is best for testing.
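For a small local test run, the settings described above would look roughly like this in /news_navigator_app/params.py (variable names as described above; check the file itself for the authoritative spelling and defaults):

```python
# Example params.py values for a local test run with a few thousand photos.
USE_PRECOMPUTED_METADATA = False  # compute metadata on the fly instead of pulling ~6 GB
USE_SAMPLE_PACKS = True           # pull only ~1,000 photos per year in the range
START_YEAR = 1910
END_YEAR = 1911
```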

All requests are handled via query strings, and there is no back-end database.

In order to replace the photos in the Newspaper Navigator dataset, you'll need to pre-compute image embeddings for all of your photos and generate the appropriate metadata. You'll also need to modify download_metadata.py.

Newspaper Navigator Resources

Related Resources

More Repositories

  1. api.congress.gov (Java, 624 stars): congress.gov API.
  2. bagit-python (Python, 210 stars): Work with BagIt packages from Python.
  3. data-exploration (Jupyter Notebook, 179 stars): Tutorials for working with Library of Congress collections data.
  4. concordia (Python, 154 stars): Crowdsourcing platform for full text transcription and tagging. https://crowd.loc.gov
  5. bagger (Java, 120 stars): The Bagger application packages data files according to the BagIt specification.
  6. bagit-java (Java, 71 stars): Java library to support the BagIt specification.
  7. citizen-dj (JavaScript, 70 stars)
  8. chronam (Python, 70 stars): This software project is no longer being actively developed at the Library of Congress. Consider using the Open-ONI (https://github.com/open-oni) fork of the chronam software. Project mailing list: http://listserv.loc.gov/archives/chronam-users.html
  9. viewshare (JavaScript, 45 stars): A web application developed by Zepheira for the Library of Congress National Digital Information Infrastructure and Preservation Program (NDIIPP) which allows users to create and share embeddable interfaces to digital cultural heritage collections. A project of the Library of Congress; the project was retired in March 2018. Note: project members may work on both official Library of Congress projects and non-LC projects.
  10. bagger-js (JavaScript, 32 stars): Upload BagIt-format deliveries to S3 entirely in the browser.
  11. coding-standards (Python, 27 stars): Library of Congress coding standards.
  12. labs-ai-framework (CSS, 24 stars): Planning framework used by LC Labs for planning AI experiments towards responsible implementation.
  13. gazetteer (Python, 23 stars): A historical gazetteer project of the Library of Congress. Note: project members may work on both official Library of Congress projects and non-LC projects.
  14. wdl-viewer (JavaScript, 22 stars): A fast, responsive HTML5 viewer for scanned items, developed for the World Digital Library. A project of the Library of Congress. Note: project members may work on both official Library of Congress projects and non-LC projects.
  15. speech-to-text-viewer (Python, 17 stars): AWS Transcribe evaluation pipeline: bulk-process audio files and view the results.
  16. django-tabular-export (Python, 15 stars): Utilities used to export data into spreadsheets from Django applications. Currently used internally at the Library of Congress in the WDL cataloging application.
  17. Exploring-ML-with-Project-Aida (Jupyter Notebook, 13 stars)
  18. bagit-conformance-suite (Python, 10 stars): Test cases for validating BagIt implementations.
  19. premis-v3-0 (10 stars): PREMIS schemas are written in XML. They are open source community tools that allow PREMIS users to validate PREMIS records against a version of the PREMIS schema.
  20. mods2bibframe (XSLT, 8 stars): mods2bibframe XSLT.
  21. MarcMods3.6xsl (XSLT, 7 stars): MARC>MODS: the mappings and corresponding XSLTs are open source community tools developed by NDMSO at LC.
  22. hitl (JavaScript, 7 stars): Code and documentation for Humans in the Loop (HITL), an LC Labs sponsored collaboration with metadata solutions provider AVP. The experiment explores a framework and considerations for integrating crowdsourcing and machine learning in ways that are ethical, engaging, and useful.
  23. embARC (HTML, 7 stars): embARC ("metadata embedded for archival content") manages internal file metadata including embedding and validation. Created by FADGI (Federal Agencies Digital Guidelines Initiative), in conjunction with AVP and PortalMedia, embARC enables users to audit and correct embedded metadata of a subset of MXF files, as well as both individual DPX files or an entire DPX sequence, while not impacting the image data.
  24. speculative-annotation (JavaScript, 7 stars): Speculative Annotation is a web browser application written in JavaScript and built with React, FabricJS, IIIF, OpenSeaDragon, and ChakraUI. Source images are hosted locally. The application uses the OpenSeadragon Viewer to render images, so your source images can be a combination of locally hosted images (within the application) or externally hosted images (for example, served from an IIIF image server). Application metadata is represented by a combination of local IIIF Presentation API 3.0 manifest files and Library of Congress hosted IIIF manifest files. The application allows users to annotate select free-to-use items from the Library of Congress and save to the browser or download locally.
  25. pimtoolbox (Ruby, 6 stars): The Library of Congress and the Florida Center for Library Automation developed the PREMIS in METS (PiM) Toolbox. The project provides PREMIS:METS conversion and validation tools that support the implementation of PREMIS in the METS container format.
  26. inside-baseball (Python, 6 stars): Explore baseball collections from the Library of Congress and the National Museum of African American History and Culture.
  27. iptables-gem (Ruby, 5 stars): A project of the Library of Congress. Note: project members may work on both official Library of Congress projects and non-LC projects.
  28. sanborn-navigator (Jupyter Notebook, 5 stars)
  29. ADCTest (C++, 5 stars): ADCTest is a desktop application, written in C++, that provides simple pass-fail reporting for the tests detailed in the FADGI Low Cost ADC Performance Testing Guidelines, as well as more detailed results.
  30. MarcMods3.5xsl (XSLT, 4 stars): MARC>MODS 3.5: the mapping and corresponding XSLT are open source community tools developed by NDMSO at LC.
  31. pairtree (CSS, 4 stars): A project of the Library of Congress. Note: project members may work on both official Library of Congress projects and non-LC projects.
  32. simple-artifact-uploader (Java, 3 stars): A plugin for the Gradle build management tool that allows us to automatically upload completed binaries to the Artifactory deployment server.
  33. a-search-for-the-heart (HTML, 3 stars)
  34. seeing-lost-enclaves (HTML, 2 stars): Seeing Lost Enclaves is an initiative by Jeffrey Yoo Warren as part of the 2023 Innovator in Residence Program at the Library of Congress.
  35. DVV (1 star): The Digital Viewer and Validator (DVV) tool is developed at the Library of Congress for use by National Digital Newspaper Program (NDNP) participants.
  36. LC_Labs (1 star)
  37. viewshare_site (Python, 1 star): Site-specific project: the retired Library of Congress instance of the Viewshare project.
  38. marc2mads20 (1 star): MARC>MADS: the mappings and corresponding XSLTs are open source community tools developed by NDMSO at LC.
  39. CCHC (1 star): Computing Cultural Heritage in the Cloud (CCHC) is our Andrew W. Mellon-funded experiment for piloting cloud solutions to enable research, including data analysis and reduction, on large-scale digital collections. Three contracted researchers (non-LC staff) will analyze large collection datasets that are stored in and accessible from AWS, likely as JSON. The contracted research experts' code will demonstrate how the datasets are gathered, transformed, and manipulated to demonstrate the needs of computational analysis. Languages used in this code may include Python and JavaScript. Code will undergo security review as it is submitted as deliverables during the contract window, with final versions to be made available in a GitHub repository by the end of Q2 FY 2022.
  40. btp-data (Jupyter Notebook, 1 star): This Python tutorial demonstrates how to process and visualize the Library of Congress' By the People transcription data using natural language processing.