
Searching Open Library by keywords to return ISBNs

Open Library database

Open Library is an online library of bibliographic data and provides full data dumps of all of its data.

This project provides instructions and scripts for importing this data into a PostgreSQL database, along with some sample queries to test the database.

Getting started

The following steps should get you up and running with a working database.

  1. Install the prerequisites so that you have the required software and a running database server.
  2. Download the data from Open Library.
  3. Run the data processing scripts to clean up the data and make it easier to import.
  4. Import the data into the database.

Prerequisites

  • Python 3 - Tested with 3.10
  • PostgreSQL - Version 15 is tested but most recent versions should work

Downloading the data

Open Library offers bulk downloads on its website, available from the Data Dumps page.

These are updated every month. The downloads available include:

  • Editions (~9GB)
  • Works (~2.5GB)
  • Authors (~0.5GB)
  • All types (~10GB)

For this project, I downloaded the Editions, Works, and Authors data. The latest can be downloaded using the following commands in a terminal:

wget https://openlibrary.org/data/ol_dump_editions_latest.txt.gz -P ~/downloads
wget https://openlibrary.org/data/ol_dump_works_latest.txt.gz -P ~/downloads
wget https://openlibrary.org/data/ol_dump_authors_latest.txt.gz -P ~/downloads

To move the data from your downloads folder, use the following commands in a terminal:

mv ~/downloads/ol_dump_authors_*txt.gz ./data/unprocessed/ol_dump_authors_.txt.gz
mv ~/downloads/ol_dump_works_*txt.gz ./data/unprocessed/ol_dump_works_.txt.gz
mv ~/downloads/ol_dump_editions_*txt.gz ./data/unprocessed/ol_dump_editions_.txt.gz

To uncompress this data, I used the following commands in a terminal:

gzip -d -c data/unprocessed/ol_dump_editions_*.txt.gz > data/unprocessed/ol_dump_editions.txt
gzip -d -c data/unprocessed/ol_dump_works_*.txt.gz > data/unprocessed/ol_dump_works.txt
gzip -d -c data/unprocessed/ol_dump_authors_*.txt.gz > data/unprocessed/ol_dump_authors.txt

Processing the data

Unfortunately the downloads provided seem to be a bit messy, or at least don't play nicely with direct importing. The Open Library files error on import because the number of columns varies between rows. Cleaning them up is difficult, as the text file for editions alone is 25GB. Note: check whether this is still the case; if so, standard Linux tools such as sed and awk might be able to do the clean-up.

This means another Python script is needed to clean up the data. The file openlibrary-data-process.py simply reads in the CSV (Python is a little more forgiving about dodgy data) and writes each row out again, but only where the row has five columns.

python openlibrary-data-process.py
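
The script itself lives in this repository; as a rough illustration of the idea only (not the exact code, and the output path is an assumption), the filtering step amounts to keeping just the lines with the expected five tab-separated columns:

# Sketch only: keep lines with exactly five tab-separated columns
# (type, key, revision, last_modified, JSON record).
with open('data/unprocessed/ol_dump_editions.txt', encoding='utf-8') as infile, \
        open('data/processed/ol_dump_editions.csv', 'w', encoding='utf-8') as outfile:
    for line in infile:
        if len(line.split('\t')) == 5:
            outfile.write(line)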

Because the download files are so huge and are only going to grow (the editions file is now 45GB+), you can use the alternative script openlibrary-data-chunk-process.py to split the data into smaller files that can be loaded sequentially. You can change the number of lines in each chunk in the script; I recommend 1-3 million.

Once the files are split, you should delete the three uncompressed .txt files in the data/unprocessed folder, because you will need around 230GB of free space to load all three files into the database without running out of disk space.

lines_per_file = 5000
python3 openlibrary-data-chunk-process.py
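
As an illustration of what the chunking does (a sketch under assumptions, not the actual openlibrary-data-chunk-process.py; the function name and output naming here are made up), splitting a dump into fixed-size pieces can be done like this:

def split_file(source, out_template, lines_per_file=1_000_000):
    # Write `source` out as numbered chunks of at most `lines_per_file` lines.
    # `out_template` is a format string, e.g. 'data/processed/ol_dump_editions_{:04d}.csv'.
    chunk, outfile = 0, None
    with open(source, encoding='utf-8') as infile:
        for count, line in enumerate(infile):
            if count % lines_per_file == 0:
                if outfile:
                    outfile.close()
                outfile = open(out_template.format(chunk), 'w', encoding='utf-8')
                chunk += 1
            outfile.write(line)
    if outfile:
        outfile.close()

split_file('data/unprocessed/ol_dump_editions.txt',
           'data/processed/ol_dump_editions_{:04d}.csv')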

This generates multiple files in the data/processed directory. One of those files is used to access the rest of them when loading the data.

Import into database

It is then possible to import the data directly into PostgreSQL tables and do complex searches with SQL.

There are a series of database scripts which will create the database and tables, and then import the data. These are in the database folder. The data files (created in the previous process) need to already be in the data/processed folder for this to work.

The command line tool psql is used to run the scripts. The following command will create the database and tables:

psql --set=sslmode=require -f openlibrary-db.sql -h localhost -p 5432 -U username postgres
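
To sanity-check the result from Python rather than psql, a quick listing of the created tables could look like the sketch below. The psycopg2 package, the database name openlibrary, and the connection details are assumptions; check openlibrary-db.sql and your own setup for the real values.

import psycopg2

# Assumed connection details, mirroring the psql command above.
conn = psycopg2.connect(host='localhost', port=5432, user='username',
                        dbname='openlibrary', sslmode='require')
with conn, conn.cursor() as cur:
    # List the tables created by the database scripts.
    cur.execute("""
        select table_name
        from information_schema.tables
        where table_schema = 'public'
        order by table_name
    """)
    for (table_name,) in cur.fetchall():
        print(table_name)
conn.close()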

Database table details

The database is split into 5 main tables:

  • Authors - The individuals who write the works
  • Works - The works as created by the authors, with titles and subtitles
  • Author Works - A table linking the works with authors
  • Editions - The particular editions of the works, including ISBNs
  • Edition_ISBNs - The ISBNs for the editions

Query the data

That's the database set up - it can now be queried using relatively straightforward SQL.

Get details for a single item using the ISBN13 9781551922461 (Harry Potter and the Prisoner of Azkaban).

select
    e.data->>'title' "EditionTitle",
    w.data->>'title' "WorkTitle",
    a.data->>'name' "Name",
    e.data->>'subtitle' "EditionSubtitle",
    w.data->>'subtitle' "WorkSubtitle",
    e.data->>'subjects' "Subjects",
    e.data->'description'->>'value' "EditionDescription",
    w.data->'description'->>'value' "WorkDescription",
    e.data->'notes'->>'value' "EditionNotes",
    w.data->'notes'->>'value' "WorkNotes"
from editions e
join edition_isbns ei
    on ei.edition_key = e.key
join works w
    on w.key = e.work_key
join author_works a_w
    on a_w.work_key = w.key
join authors a
    on a_w.author_key = a.key
where ei.isbn = '9781551922461';
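
The same kind of lookup can also be run from Python. The sketch below assumes psycopg2 and the connection details used earlier (including the assumed database name openlibrary); it is an illustration rather than part of this project.

import psycopg2

# A trimmed-down version of the ISBN query above, parameterised on the ISBN.
QUERY = """
    select e.data->>'title', w.data->>'title', a.data->>'name'
    from editions e
    join edition_isbns ei on ei.edition_key = e.key
    join works w on w.key = e.work_key
    join author_works a_w on a_w.work_key = w.key
    join authors a on a_w.author_key = a.key
    where ei.isbn = %s
"""

conn = psycopg2.connect(host='localhost', port=5432, user='username',
                        dbname='openlibrary', sslmode='require')
with conn, conn.cursor() as cur:
    cur.execute(QUERY, ('9781551922461',))
    for edition_title, work_title, author_name in cur.fetchall():
        print(edition_title, work_title, author_name)
conn.close()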

More Repositories

  1. catalogues-api - API and front-end for searching all the UK public library catalogues for ISBN results (Pug, 15 stars)
  2. catalogues-library - A JS library for searching library catalogues (JavaScript, 11 stars)
  3. wuthering-hacks - A data dashboard for Newcastle public libraries open data (JavaScript, 5 stars)
  4. library-alt-text-bot - A twitter bot to promote web accessibility for libraries on Twitter (JavaScript, 4 stars)
  5. librarieshacked-web - The (old) public website for Libraries Hacked, built using Pico CMS (HTML, 3 stars)
  6. mobilelibraries-website - A public website for displaying mobile library stops, routes, and timetables (JavaScript, 2 stars)
  7. high-streets-analysis - Scripts for analysing locations of libraries alongside high streets (Python, 2 stars)
  8. disbumptors-librarydata - A twitter list of library disruptors (JavaScript, 2 stars)
  9. api-librarydata - API to provide various open datasets for public libraries in the UK (JavaScript, 2 stars)
  10. library-carpentry - A one day set of library carpentry materials (2 stars)
  11. libraries-twitter - Displaying libraries and library services on Twitter in a gallery (JavaScript, 2 stars)
  12. libraries-at-home - Displaying useful information on UK libraries to access from home, including videos, blogs, and podcasts (JavaScript, 1 star)
  13. geography-db - A database to hold common UK geography (PLpgSQL, 1 star)
  14. mobiles-librarydata - A project to track all mobile libraries (Python, 1 star)
  15. library-renewals-axiell - Google apps script for library loans notifications and auto renewal of loans (JavaScript, 1 star)
  16. walespostcodes-librarydata - Wales library postcode lottery, showing postcodes in Wales and their relative proximity to libraries (JavaScript, 1 star)
  17. twitter-librarydata - Displaying UK library twitter accounts from SarahHLib library lists (JavaScript, 1 star)
  18. librarymap - Map and library finder for UK static and mobile libraries (JavaScript, 1 star)
  19. librarieshacked.github.io - Blogging about library data and public libraries (JavaScript, 1 star)
  20. data-treaders - Data treaders event GitBook repository, with information about the event, timetables, and data sources (1 star)
  21. mobilelibraries-database - Database schema and creation scripts for the mobile library project (PLpgSQL, 1 star)
  22. early-english-books - A Norch implementation for Early English Book data (1 star)
  23. bnb-books-pebble - A Pebble watch app for finding British National Bibliography books published in the current location (JavaScript, 1 star)
  24. plymouth-librarydata - 3D Plymouth map and library finder (JavaScript, 1 star)
  25. eeb-mongodb - Mongo database and web service for Early English Book data (Shell, 1 star)
  26. england-librarydata - Display English libraries data mixed with public libraries news in a data dashboard (JavaScript, 1 star)