Work with BagIt packages from Python.

bagit-python

bagit is a Python library and command line utility for working with BagIt style packages.

Installation

bagit.py is a single-file Python module that you can drop into your project as needed, or you can install it globally with:

pip install bagit

Python v2.7+ is required.

Command Line Usage

When you install bagit you should get a command-line program called bagit.py which you can use to turn an existing directory into a bag:

bagit.py --contact-name 'John Kunze' /directory/to/bag

Finding Bagit on your system

The bagit.py program should be available in your normal command-line window (Terminal on OS X, Command Prompt or Powershell on Windows, etc.). If you are unsure where it was installed you can also request that Python search for bagit as a Python module: simply replace bagit.py with python -m bagit:

python -m bagit --help

On some systems Python may have been installed as python3, py, etc. – simply use the same name you use to start an interactive Python shell:

py -m bagit --help
python3 -m bagit --help

Configuring BagIt

You can pass in key/value metadata for the bag using options like --contact-name above, which are persisted to bag-info.txt. For a complete list of bag-info.txt properties you can use as command-line arguments, see --help.

Since calculating checksums can take a while when creating a bag, you may want to calculate them in parallel if you are on a multicore machine. You can do that with the --processes option:

bagit.py --processes 4 /directory/to/bag
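
To give a feel for why spreading checksum work across workers helps, here is a minimal standard-library sketch. It is not bagit's implementation: it uses a thread pool rather than the separate processes that --processes starts, and the names file_digest and checksums_in_parallel are hypothetical.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def file_digest(path, algorithm="sha256", chunk_size=1024 * 1024):
    """Hash one file in fixed-size chunks so large files never load fully into memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksums_in_parallel(directory, workers=4, algorithm="sha256"):
    """Return {relative path: hex digest} for every file under directory."""
    root = Path(directory)
    files = sorted(p for p in root.rglob("*") if p.is_file())
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each file is hashed independently, so the work splits cleanly per file.
        digests = pool.map(lambda p: file_digest(p, algorithm), files)
        return {p.relative_to(root).as_posix(): d for p, d in zip(files, digests)}
```

Because every file is hashed on its own, the speed-up depends mostly on how many files there are and how fast the disk can feed them.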

To specify which checksum algorithm(s) to use when generating the manifest, use the --md5, --sha1, --sha256 and/or --sha512 flags (MD5 is generated by default).

bagit.py --sha1 /path/to/bag
bagit.py --sha256 /path/to/bag
bagit.py --sha512 /path/to/bag
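
As a rough illustration of what each algorithm flag produces, the sketch below builds the text of one manifest per algorithm for the files under a bag's data directory. It assumes the manifest-<algorithm>.txt naming and the "checksum, whitespace, path" line layout from the BagIt specification; the function name build_manifests and the two-space separator are assumptions, not bagit's own code.

```python
import hashlib
from pathlib import Path


def build_manifests(bag_dir, algorithms=("sha256", "sha512")):
    """Return {manifest file name: manifest text} for the payload under bag_dir/data."""
    bag_root = Path(bag_dir)
    payload = sorted(p for p in (bag_root / "data").rglob("*") if p.is_file())
    manifests = {}
    for algorithm in algorithms:
        lines = []
        for path in payload:
            # read_bytes() keeps the sketch short; large payloads should be hashed in chunks.
            digest = hashlib.new(algorithm, path.read_bytes()).hexdigest()
            lines.append(f"{digest}  {path.relative_to(bag_root).as_posix()}\n")
        manifests[f"manifest-{algorithm}.txt"] = "".join(lines)
    return manifests
```

Passing several algorithm names yields one manifest per algorithm, each listing the same payload paths.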

If you would like to validate a bag you can use the --validate flag.

bagit.py --validate /path/to/bag

For a quick check that examines only the structure of the bag and compares its payload-oxum (byte count and number of files), use the --fast flag.

bagit.py --validate --fast /path/to/bag
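
The payload-oxum comparison can be sketched with the standard library. This is a minimal illustration rather than bagit's implementation: the function names payload_oxum and quick_check are hypothetical, and the "<bytes>.<files>" layout follows the Payload-Oxum convention in the BagIt specification.

```python
import os


def payload_oxum(bag_dir):
    """Return the payload summary as "<total bytes>.<file count>" for bag_dir/data."""
    total_bytes = 0
    file_count = 0
    for dirpath, _dirnames, filenames in os.walk(os.path.join(bag_dir, "data")):
        for name in filenames:
            total_bytes += os.path.getsize(os.path.join(dirpath, name))
            file_count += 1
    return f"{total_bytes}.{file_count}"


def quick_check(bag_dir, recorded_oxum):
    """Compare a recorded Payload-Oxum value with the one computed from disk."""
    return payload_oxum(bag_dir) == recorded_oxum
```

A check like this only reads file sizes, which is why it is fast, but it cannot notice a file whose content changed while its length stayed the same.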

And finally, if you'd like to parallelize validation to take advantage of multiple CPUs you can:

bagit.py --validate --processes 4 /path/to/bag

Using BagIt in your programs

You can also use BagIt programmatically in your own Python programs by importing the bagit module.

Create

To create a bag you would do this:

import bagit

bag = bagit.make_bag('mydir', {'Contact-Name': 'John Kunze'})

make_bag returns a Bag instance. If you have a bag already on disk and would like to create a Bag instance for it, simply call the constructor directly:

bag = bagit.Bag('/path/to/bag')

Update Bag Metadata

You can change the metadata persisted to the bag-info.txt by using the info property on a Bag.

# load the bag
bag = bagit.Bag('/path/to/bag')

# update bag info metadata
bag.info['Internal-Sender-Description'] = 'Updated on 2014-06-28.'
bag.info['Authors'] = ['John Kunze', 'Andy Boyko']
bag.save()
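
To picture what ends up in bag-info.txt, here is a small sketch that renders a metadata dictionary as "Label: value" lines. It assumes that a list value becomes one line per item with the label repeated, and that labels are written in sorted order; both are assumptions about the on-disk layout, and format_bag_info is a hypothetical helper, not part of bagit.

```python
def format_bag_info(info):
    """Render a metadata dict as "Label: value" lines, one line per value."""
    lines = []
    for label in sorted(info):
        values = info[label]
        # Treat a single value as a one-item list so both cases share one loop.
        if not isinstance(values, list):
            values = [values]
        for value in values:
            lines.append(f"{label}: {value}\n")
    return "".join(lines)


print(format_bag_info({
    'Internal-Sender-Description': 'Updated on 2014-06-28.',
    'Authors': ['John Kunze', 'Andy Boyko'],
}))
```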

Update Bag Manifests

By default save will not update manifests. This guards against a situation where a call to save to persist bag metadata accidentally regenerates manifests for an invalid bag. If you have modified the payload of a bag by adding, modifying or deleting files in the data directory and wish to regenerate the manifests, set the manifests parameter to True when calling save.

import os
import shutil

import bagit

# load the bag
bag = bagit.Bag('/path/to/bag')

# add a file
shutil.copyfile('newfile', '/path/to/bag/data/newfile')

# remove a file
os.remove('/path/to/bag/data/file')

# persist changes
bag.save(manifests=True)

The save method takes an optional processes parameter which will determine how many processes are used to regenerate the checksums. This can be handy on multicore machines.

Validation

If you would like to see if a bag is valid, use its is_valid method:

bag = bagit.Bag('/path/to/bag')
if bag.is_valid():
    print("yay :)")
else:
    print("boo :(")

If you'd like to get a detailed list of validation errors, call the validate method and catch the BagValidationError exception. If the bag's manifest was invalid (and the problem wasn't caught by the payload oxum check), the exception's details property will contain a list of ManifestErrors that you can inspect. Each ManifestError will be one of ChecksumMismatch, FileMissing, or UnexpectedFile.

So for example if you want to print out checksums that failed to validate you can do this:

import bagit

bag = bagit.Bag("/path/to/bag")

try:
    bag.validate()
except bagit.BagValidationError as e:
    # report only the files whose checksum did not match the manifest
    for d in e.details:
        if isinstance(d, bagit.ChecksumMismatch):
            print("expected %s to have %s checksum of %s but found %s" %
                  (d.path, d.algorithm, d.expected, d.found))
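
If you want an overview of all three error types rather than checksum mismatches alone, one option is to group the details by class name. The helper below is a hypothetical sketch, not part of bagit. It assumes each detail object exposes a path attribute, as the ChecksumMismatch example does, and falls back to a placeholder when it does not.

```python
from collections import defaultdict


def group_details_by_type(details):
    """Map each error class name to the sorted paths reported under it."""
    grouped = defaultdict(list)
    for detail in details:
        # getattr guards against detail objects that carry no path attribute.
        grouped[type(detail).__name__].append(getattr(detail, "path", "<unknown>"))
    return {name: sorted(paths) for name, paths in grouped.items()}


# Possible usage with the exception caught in the example above:
#     except bagit.BagValidationError as e:
#         for name, paths in group_details_by_type(e.details).items():
#             print(name, len(paths), paths)
```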

To iterate through a bag's manifest and retrieve checksums for the payload files use the bag's entries dictionary:

bag = bagit.Bag("/path/to/bag")

# fixity maps algorithm name to checksum; "md5" is present only if the bag has an MD5 manifest
for path, fixity in bag.entries.items():
    print("path:%s md5:%s" % (path, fixity["md5"]))
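
Since a bag may carry manifests for algorithms other than MD5, a loop that does not hard-code the algorithm name can be more robust. The sketch below works on any dictionary shaped like the entries dictionary (path mapped to a dictionary of algorithm and checksum); iter_fixity is a hypothetical helper and the sample digests are made up for illustration.

```python
def iter_fixity(entries):
    """Yield (path, algorithm, digest) for every checksum recorded in entries."""
    for path in sorted(entries):
        for algorithm, digest in sorted(entries[path].items()):
            yield path, algorithm, digest


# Made-up sample data standing in for bag.entries.
sample_entries = {
    "data/b.txt": {"sha256": "cd34"},
    "data/a.txt": {"sha512": "ef56", "sha256": "ab12"},
}

for path, algorithm, digest in iter_fixity(sample_entries):
    print("path:%s %s:%s" % (path, algorithm, digest))
```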

Contributing to bagit-python development

% git clone https://github.com/LibraryOfCongress/bagit-python.git
% cd bagit-python
# MAKE CHANGES
% python test.py

Running the tests

You can quickly run the tests by having setuptools install dependencies:

python setup.py test

Once your code is working, you can use Tox to run the tests with every supported version of Python which you have installed on the local system:

tox

If you have Docker installed, you can run the tests under Linux inside a container:

% docker build -t bagit:latest . && docker run -it bagit:latest

Benchmarks

If you'd like to see how increasing parallelization of bag creation on your system affects the time to create a bag, try using the included bench utility:

% ./bench.py

License

cc0

Note: By contributing to this project, you agree to license your work under the same terms as those that govern this project's distribution.
