• This repository has been archived on 16/Oct/2024

CAP database scripts.

Capstone

This is the source code for case.law, a website written by the Harvard Law School Library Innovation Lab to manage and serve court opinions. Other than several cases used for our automated testing, this repository does not contain case data. Case data may be obtained through the website.

Project Background

The Caselaw Access Project is a large-scale digitization project hosted by the Harvard Law School Library Innovation Lab. Visit case.law for more details.

The Data

  1. Format Documentation and Samples
  2. Obtaining Real Data
  3. Reporting Data Errors
  4. Errata

Format Documentation and Samples

The output of the project consists of page images, marked-up case XML files, ALTO XML files, and METS XML files. This repository has a more detailed explanation of the format, and two volumes' worth of sample data:

CAP Samples and Format Documentation

Obtaining Real Data

This data, with some temporary restrictions, is available to all. Please see our project site for more information about how to access the API or get bulk access to the data:

https://case.law/

Reporting Data Errors

This is a living, breathing corpus of data. While we've taken great pains to ensure its accuracy and integrity, two large components of this project, namely OCR and human review, are utterly fallible. When we were designing Capstone, we knew that one of its primary functions would be to facilitate safe, accountable updates. If you find any errors in the data, we would be extraordinarily grateful for your taking a moment to create an issue in this GitHub repository's issue tracker to report it. If you notice a large pattern of problems that would be better fixed programmatically, or have a very large number of modifications, describe it in an issue. If we need more information, we'll ask. We'll close the issue when the issue has been corrected.

Errata

These are known issues; there's no need to file an issue if you come across one of these.

  • Missing Judges Tag: In many volumes, elements which should have the tag name <judges> instead have the tag name <p>. We're working on this one.
  • Nominative Case Citations: In many cases that come from nominative volumes, the citation format is wrong. We hope to have this corrected soon.
  • Jurisdictions: Though the jurisdiction values in our API metadata entries are normalized, we have not propagated those changes to the XML.
  • Court Name: We've seen some inconsistencies in the court name. We're trying to get this normalized in the data, and we'll also publish a complete court name list when we're done.
  • OCR errors: There will be OCR errors on nearly every page. We're still trying to figure out how best to address this. If you've got some killer OCR correction strategies, get at us.

The Capstone Application

Capstone is a Django application with a PostgreSQL database which stores and manages the non-image data output of the CAP project. This includes:

  • Original XML data
  • Normalized metadata extracted from the XML
  • External metadata, such as the Reporter database
  • Changelog data, tracking changes and corrections

Installing Capstone and CAPAPI

Hosts Setup

Add the following to /etc/hosts:

127.0.0.1       case.test
127.0.0.1       api.case.test
127.0.0.1       cite.case.test
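
To confirm that the entries took effect, a small check along these lines may help. It is only a sketch: the function name missing_hosts is hypothetical and not part of Capstone, and it assumes the three hostnames listed above should all map to 127.0.0.1.

```python
REQUIRED_HOSTS = ("case.test", "api.case.test", "cite.case.test")


def missing_hosts(hosts_text, required=REQUIRED_HOSTS, address="127.0.0.1"):
    """Return the required hostnames that hosts_text does not map to address."""
    mapped = set()
    for line in hosts_text.splitlines():
        # Ignore anything after a '#' comment marker, then split on whitespace.
        fields = line.split("#", 1)[0].split()
        if len(fields) >= 2 and fields[0] == address:
            mapped.update(fields[1:])
    return [name for name in required if name not in mapped]


# Example with an incomplete hosts file: cite.case.test is not mapped yet.
sample = "127.0.0.1 localhost\n127.0.0.1 case.test\n127.0.0.1 api.case.test\n"
print(missing_hosts(sample))  # prints ['cite.case.test']
```

To check the real file, pass in the contents of /etc/hosts, for example missing_hosts(open("/etc/hosts").read()). An empty list means all three names are mapped.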

Docker Setup

We support local development via docker compose. Docker setup looks like this:

Using pull first will avoid rebuilding images locally:

$ docker-compose pull

Start docker:

$ docker-compose up -d

Set up databases:

$ docker-compose exec db psql --user=postgres -c "CREATE DATABASE capdb;"
$ docker-compose exec db psql --user=postgres -c "CREATE DATABASE capapi;"
$ docker-compose exec db psql --user=postgres -c "CREATE DATABASE cap_user_data;"

Log into web container:

$ docker-compose exec web bash
# 

From now on all commands starting with # are assumed to be run from within docker-compose exec web bash.

Load dev data:

⚠️ Note: Make sure that Docker has sufficient resources allocated to run Elasticsearch. Lower allocations may cause rebuild_search_index to crash. Recommended minimum:

  • CPUs: 6
  • Memory: 16 GB
  • Swap: 1 GB
  • Disk image: ~256 GB

# fab init_dev_db
# fab ingest_fixtures
# fab import_web_volumes
# fab refresh_case_body_cache
# fab rebuild_search_index

To get ngrams working, run:

# mkdir test_data/ngrams
# fab ngram_jurisdictions

Run the dev server:

# fab run

Capstone should now be running at 127.0.0.1:8000.

If you are working on JavaScript or other frontend files, use fab run_frontend instead:

# fab run_frontend

Administering and Developing Capstone

  • Testing
  • Requirements
  • Applying model changes
  • Stored Postgres functions
  • Running Command Line Scripts
  • Logging In
  • Local debugging tools
  • Model versioning
  • Download real data locally
  • Working with javascript
  • Elasticsearch

Testing

We use pytest for tests. Some notable flags:

Run all tests:

# pytest

Run one test:

# pytest -k test_name

Drop into pdb on test failure:

# pytest --pdb

Run tests in parallel for speed:

# pytest -n 2
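
As an illustration of how -k selects tests, consider the following hypothetical test module. Neither the file name nor the collapse_whitespace helper exists in Capstone; both are placeholders for this sketch.

```python
# test_whitespace.py: hypothetical module, not part of Capstone.


def collapse_whitespace(text):
    """Collapse runs of whitespace into single spaces and trim the ends."""
    return " ".join(text.split())


def test_collapse_whitespace_squeezes_runs():
    assert collapse_whitespace("1  Ill.   23") == "1 Ill. 23"


def test_collapse_whitespace_trims_ends():
    assert collapse_whitespace("  1 Ill. 23\n") == "1 Ill. 23"
```

With such a file in place, pytest -k squeezes would run only the first test, while pytest -k collapse_whitespace would run both, because -k matches substrings of test names.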

Requirements

Top-level requirements are stored in requirements.in. After updating that file, you should run

# fab pip_compile

to freeze all subdependencies into requirements.txt.

To upgrade a single requirement to the latest version:

# fab pip_compile:"-P package_name"

Applying model changes

Use Django to apply migrations. After you change models.py:

# ./manage.py makemigrations

This will write a migration script to cap/migrations. Then apply:

# fab migrate

This will migrate the underlying model in PostgreSQL. In order to transfer changes to Elasticsearch, apply:

# fab rebuild_search_index

Ensure that the relevant handlers to transfer this data are written in capstone/capapi/documents.py.

Stored Postgres functions

Some Capstone features depend on stored functions. See set_up_postgres.py for documentation.

Running Command Line Scripts

Command line scripts are defined in fabfile.py. You can list all available commands using fab -l, and run a command with fab command_name.

Logging In

fab init_dev_db will create a user with email [email protected] and password Password2.

You can create additional test users from ./manage.py shell_plus using the same code that is used by the init_dev_db command, or using the web frontend on the local development server.

Creating a new user through the frontend requires access to an email verification link. That link will be shown in the output of fab run or fab run_frontend in the following format:

Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
Subject: Caselaw Access Project: Verify your email address
From: [email protected]
To: [email protected]
Date: Wed, 04 Aug 2021 17:53:46 -0000
Message-ID: <162809962609.2188.6020186441304370023@63fceca6d616>

Please click here to verify your email address:

https://case.test:8000/user/verify-user/4/ffffffffffffffffff/

If you received this message in error, please ignore it.
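
When the server output is long, the link can be pulled out with a short script. The following is a minimal sketch, assuming verification URLs always follow the shape of the sample above (/user/verify-user/, a numeric user ID, then a token). The names VERIFY_LINK and find_verify_links are hypothetical and not part of Capstone.

```python
import re

# Assumed URL shape, based on the sample email above:
#   <scheme>://<host>/user/verify-user/<user id>/<token>/
VERIFY_LINK = re.compile(r"https?://\S+?/user/verify-user/\d+/[\w-]+/")


def find_verify_links(output):
    """Return every verification link found in the given server output."""
    return VERIFY_LINK.findall(output)


sample_output = """\
Please click here to verify your email address:

https://case.test:8000/user/verify-user/4/ffffffffffffffffff/

If you received this message in error, please ignore it.
"""
print(find_verify_links(sample_output))
# prints ['https://case.test:8000/user/verify-user/4/ffffffffffffffffff/']
```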

Local debugging tools

django-extensions is enabled by default, including the very handy ./manage.py shell_plus command.

django-debug-toolbar is not automatically enabled, but if you run pip install django-debug-toolbar it will be detected and enabled by settings_dev.py.

Model versioning

For database versioning we use the Postgres temporal tables approach inspired by SQL:2011's temporal databases.

See this blog post for an explanation of temporal tables and how to use them in Postgres.

We use django-simple-history to manage creation, migration, and querying of the historical tables.

Data is kept in sync through the temporal_tables Postgres extension and the triggers created in our scripts/set_up_postgres.py file.

Download real data locally

We store complete fixtures for about 1,000 cases in the case.law downloads section.

You can download and ingest all volume fixtures from that section with the command fab import_web_volumes, or ingest a single volume downloaded from that section with the command fab import_volume:some.zip.

Working with javascript

We use Vite to compile javascript files. New javascript entrypoints can be added to vite.config.js and included in templates with {% vite_asset %}.

To see javascript changes live, run the dev server with

# fab run_frontend

This will start yarn serve behind the scenes before calling fab run.

Elasticsearch

For local dev, Elasticsearch will automatically be started by docker-compose up -d. You can then run fab refresh_case_body_cache to populate CaseBodyCache for all cases, and fab rebuild_search_index to populate the search index.

For debugging, see settings.py.example for an example of how to log all requests to and from Elasticsearch.

It may also be useful to run Kibana to directly query Elasticsearch from a browser GUI:

$ brew install kibana
$ kibana -e http://127.0.0.1:9200

You can then go to Kibana -> Dev Tools to run any of the logged queries, or GET /_mapping to see the search indexes.
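
If Kibana is not installed, the same GET /_mapping request can be made from Python's standard library. This is a rough sketch, assuming Elasticsearch is reachable without authentication at the address used in the kibana command above. The function names mapping_url and list_indexes are hypothetical.

```python
import json
from urllib.request import urlopen


def mapping_url(base_url="http://127.0.0.1:9200"):
    """Build the URL for Elasticsearch's GET /_mapping endpoint."""
    return base_url.rstrip("/") + "/_mapping"


def list_indexes(base_url="http://127.0.0.1:9200", timeout=5):
    """Return the sorted index names reported by GET /_mapping."""
    with urlopen(mapping_url(base_url), timeout=timeout) as response:
        # The response body is a JSON object keyed by index name.
        return sorted(json.load(response))
```

Calling list_indexes() while the Docker services are running should return the names of the indexes created by fab rebuild_search_index.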

Code examples

We maintain a separate CAP examples repo for some ideas about using code to interact with CAP data.
