• Stars
    star
    145
  • Rank 254,144 (Top 6 %)
  • Language
    Python
  • License
    Other
  • Created about 14 years ago
  • Updated 2 months ago

Repository Details

Format Identification for Digital Objects (FIDO) is a Python command-line tool to identify the file formats of digital objects. It is designed for simple integration into automated work-flows.

Format Identification for Digital Objects (fido)

By Open Preservation Foundation

FIDO is a command-line tool to identify the file formats of digital objects. It is designed for simple integration into automated work-flows.

FIDO uses the UK National Archives (TNA) PRONOM File Format and Container descriptions. PRONOM is available from http://www.nationalarchives.gov.uk/pronom/. See LICENSE for license information.

Usage

usage: fido [-h] [-v] [-q] [-recurse] [-zip] [-noextension] [-nocontainer]
            [-pronom_only] [-input INPUT] [-filename FILENAME]
            [-useformats INCLUDEPUIDS] [-nouseformats EXCLUDEPUIDS]
            [-matchprintf FORMATSTRING] [-nomatchprintf FORMATSTRING]
            [-bufsize BUFSIZE] [-sigs SIG_ACT]
            [-container_bufsize CONTAINER_BUFSIZE]
            [-loadformats XML1,...,XMLn] [-confdir CONFDIR]
            [FILE [FILE ...]]

positional arguments:

  • FILE: files to check. If the file is -, content is read from stdin. In this case, Python must be invoked with -u or it may convert the line terminators.

optional arguments:

  • -h, --help: show this help message and exit
  • -v: show version information
  • -q: run (more) quietly
  • -recurse: recurse into subdirectories
  • -zip: recurse into zip and tar files
  • -nocontainer: disable deep scan of container documents, increases speed but may reduce accuracy with big files
  • -pronom_only: disables loading of format extensions file, only PRONOM signatures are loaded, may reduce accuracy of results
  • -input INPUT: file containing a list of files to check, one per line. - means stdin
  • -filename FILENAME: filename if file contents passed through STDIN
  • -useformats INCLUDEPUIDS: comma separated string of formats to use in identification
  • -nouseformats EXCLUDEPUIDS: comma separated string of formats not to use in identification
  • -matchprintf FORMATSTRING: format string (Python style) to use on match. See nomatchprintf, README.txt.
  • -nomatchprintf FORMATSTRING: format string (Python style) to use if no match. See README.txt
  • -bufsize BUFSIZE: size (in bytes) of the buffer to match against (default=131072 bytes)
  • -sigs SIG_ACT: action to take on the signature files. "check" checks whether a new version of the signature file is available for download; "list" lists all available signature file versions; "update" automatically updates to the latest available signature file; "n" downloads and uses version n.
  • -container_bufsize CONTAINER_BUFSIZE: size (in bytes) of the buffer to match against (default=524288 bytes)
  • -loadformats XML1,...,XMLn: comma separated string of XML format files to add.
  • -confdir CONFDIR: configuration directory from which load_fido_xml loads its files, for example the format specifications.
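
When FIDO is called from a script, it can help to assemble the argument list in one place. The following is a minimal sketch, not part of FIDO itself: the helper name build_fido_command and its keyword names are hypothetical, and only flags listed in the usage text above are emitted.

```python
import shlex


def build_fido_command(paths, recurse=False, scan_archives=False,
                       nocontainer=False, bufsize=None, executable="fido"):
    """Assemble an argument list for the fido command line (hypothetical helper)."""
    cmd = [executable]
    if recurse:
        cmd.append("-recurse")      # recurse into subdirectories
    if scan_archives:
        cmd.append("-zip")          # recurse into zip and tar files
    if nocontainer:
        cmd.append("-nocontainer")  # skip the deep scan of container documents
    if bufsize is not None:
        cmd.extend(["-bufsize", str(bufsize)])
    cmd.extend(paths)               # positional FILE arguments come last
    return cmd


command = build_fido_command(["."], recurse=True, scan_archives=True)
print(shlex.join(command))
```

The resulting list can be passed to subprocess.run once FIDO is installed; keeping it as a list avoids shell quoting problems with unusual file names.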

Installation

(also see: http://wiki.opf-labs.org/display/KB/FIDO+usage+guide)

Any platform

  1. Download the latest zip release from https://github.com/openpreserve/fido/releases
  2. Unzip into some directory
  3. Open a command shell, cd to the directory that you placed the zip contents into
  4. Run python setup.py install to install FIDO and dependencies. This may require sudo on Linux/OSX or admin privileges on Windows.
  5. You should now be able to see the help text: fido -h

Using pip

  1. Run pip install opf-fido. This may require sudo on Linux/OSX or admin privileges on Windows.
  2. You should now be able to see the help text: fido -h

Updating signatures

Signatures can be updated from the OPF's signature service. The service is pull only, and its location is set in the versions.xml configuration file as

<updateSite>https://fidosigs.openpreservation.org</updateSite>

To check which version of the PRONOM signatures you are using, type fido -v. You'll see something like:

FIDO v1.6.0 (pronom-xml-95.zip, container-signature-20200121.xml, format_extensions.xml)

Here pronom-xml-95.zip denotes PRONOM version 95. To see if a more recent set of signatures is available, type fido -sigs check, which will report back:

Updated signatures v104 are available, current version is v95

if new signatures are available or

Your signature files are up to date, current version is v104

if not. To update signatures to the latest version type fido -sigs update:

Updated signatures v104 are available, current version is v95
Updating signatures
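
A script that needs to know which PRONOM version is in use can read it from the fido -v line shown earlier. This is a minimal sketch that assumes the version line keeps the layout of that example; the function name parse_fido_version is hypothetical.

```python
import re

# Pattern modelled on the example line printed by "fido -v" above.
VERSION_LINE = re.compile(
    r"FIDO v(?P<fido>[\d.]+) \(pronom-xml-(?P<pronom>\d+)\.zip, "
    r"container-signature-(?P<container>\d+)\.xml"
)


def parse_fido_version(line):
    """Return FIDO, PRONOM and container signature versions, or None."""
    match = VERSION_LINE.search(line)
    if match is None:
        return None
    return {
        "fido": match.group("fido"),
        "pronom": int(match.group("pronom")),
        "container": match.group("container"),
    }


sample = ("FIDO v1.6.0 (pronom-xml-95.zip, "
          "container-signature-20200121.xml, format_extensions.xml)")
info = parse_fido_version(sample)
print(info)
```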

If you are having trouble due to firewall restrictions, see the OPF wiki: http://wiki.opf-labs.org/display/PT/Command+Line+Interface+proxy+usage

Please note that this WILL NOT update the container signature file located in the 'conf' folder. The reason for this is that the PRONOM container signature file contains special types of sequences which need to be tested before FIDO can use them. If an update is available for the PRONOM container signature file, it will show up in a later commit.

Dependencies

FIDO 1.0 through 1.3.3 will run on Python 2.7 with no other dependencies.

FIDO 1.3.4 and later require the Python dependency 'olefile'. It can be installed directly with pip install olefile, or it is installed along with FIDO when you run python setup.py install or install FIDO with pip.

FIDO 1.3.3 and later have experimental Python 3 support.

FIDO 1.4 and later have Python 3 support.
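
The version thresholds above can be expressed as simple tuple comparisons. This is only an illustrative sketch of the rules as stated in this section; the function names are hypothetical and nothing here is part of FIDO.

```python
def parse_version(text):
    """Turn a dotted version string such as "1.3.4" into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def needs_olefile(fido_version):
    """FIDO 1.3.4 and later depend on olefile."""
    return parse_version(fido_version) >= (1, 3, 4)


def python3_support(fido_version):
    """Classify Python 3 support as "none", "experimental" or "full"."""
    version = parse_version(fido_version)
    if version >= (1, 4):
        return "full"
    if version >= (1, 3, 3):
        return "experimental"
    return "none"


for release in ("1.0", "1.3.3", "1.3.4", "1.4"):
    print(release, needs_olefile(release), python3_support(release))
```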

Format Definitions

By default, FIDO loads format information from two files: conf/formats.xml and conf/format_extensions.xml. Additional format files can be specified using the -loadformats command line argument. They should use the same syntax as conf/format_extensions.xml. If more than one format file needs to be specified, they should be comma separated, as with the -useformats argument.

Output

Output is controlled with the two parameters matchprintf and nomatchprintf. Each is a string that may contain formatting information. They have access to an object called info with the following fields:

  • printmatch: info.version (file format version X), info.alias (format also called X), info.apple_uti (Apple Uniform Type Identifier), info.group_size and info.group_index (if a file has multiple (tentative) hits), info.count (file N)

  • printnomatch: info.count (file N)

The defaults for FIDO 1.0 are:

  • printmatch:

  • "OK,%(info.time)s,%(info.puid)s,%(info.formatname)s,%(info.signaturename)s,%(info.filesize)s,\"%(info.filename)s\",\"%(info.mimetype)s\",\"%(info.matchtype)s\"\n"

  • printnomatch:

  • "KO,%(info.time)s,,,,%(info.filesize)s,\"%(info.filename)s\",,\"%(info.matchtype)s\"\n"

It can be useful to provide an empty string for either, for example to ignore all failed matches, or all successful ones (see examples below). Note that a newline needs to be added to the end of the string using \n.
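
The format strings use Python's %-style named placeholders. The sketch below fills the default match string from a plain dictionary to show what one output line looks like. It only demonstrates the placeholder syntax and does not claim to reproduce FIDO's internals; all field values are made up for illustration.

```python
# Default match format string, as quoted above.
MATCH_FORMAT = (
    'OK,%(info.time)s,%(info.puid)s,%(info.formatname)s,'
    '%(info.signaturename)s,%(info.filesize)s,"%(info.filename)s",'
    '"%(info.mimetype)s","%(info.matchtype)s"\n'
)

# Hypothetical values for a single identified file.
values = {
    "info.time": 12,
    "info.puid": "fmt/18",
    "info.formatname": "Acrobat PDF 1.4 - Portable Document Format",
    "info.signaturename": "PDF 1.4",
    "info.filesize": 4096,
    "info.filename": "report.pdf",
    "info.mimetype": "application/pdf",
    "info.matchtype": "signature",
}

# A mapping on the right-hand side of % resolves each %(name)s placeholder.
line = MATCH_FORMAT % values
print(line, end="")
```

A custom string passed to -matchprintf can use any subset of these placeholders in the same way, for example "%(info.puid)s\n" to print only the PUID.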

Matchtypes

FIDO returns the following matchtypes:

  • fail: the object could not be identified with signature or file extension
  • extension: the object could only be identified by file extension
  • signature: the object has been identified with (a) PRONOM signature(s)
  • container: the object has been identified with (a) PRONOM container signature(s)

In some cases multiple results are returned.
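
With the default format strings, the matchtype is the last of nine comma-separated columns, so the output can be summarised with Python's csv module. This is a minimal sketch that assumes the default matchprintf and nomatchprintf strings are in use; the sample rows are invented.

```python
import csv
import io
from collections import Counter


def count_matchtypes(lines):
    """Count rows per matchtype in FIDO output written with the default formats."""
    counts = Counter()
    for row in csv.reader(lines):
        if len(row) < 9:
            continue  # skip blank or unexpected lines
        counts[row[8]] += 1  # ninth column holds the matchtype
    return counts


# Invented sample output: one match and one failure.
sample = io.StringIO(
    'OK,10,fmt/18,Acrobat PDF 1.4 - Portable Document Format,PDF 1.4,'
    '4096,"a.pdf","application/pdf","signature"\n'
    'KO,3,,,,120,"notes.xyz",,"fail"\n'
)
summary = count_matchtypes(sample)
print(dict(summary))
```

Because a file can produce more than one result row, these counts are per row rather than per file.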

Examples running FIDO

Identify all files in the current directory and below, sending output into file-info.csv: python fido.py -recurse . > file-info.csv

Do the same as above, but also look inside of zip or tar files: python fido.py -recurse -zip . > file-info.csv

Take input from a list of files:

Linux:

ls > files.txt
python fido.py -input files.txt

Windows:

dir /b > files.txt
python fido.py -input files.txt

Take input from a pipe:

Linux: find . -type f | python fido.py -input -

Windows: dir /b | python fido.py -input -
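
The file list for -input can also be produced from Python, which avoids the difference between ls and dir. The sketch below walks a directory tree and writes one path per line. The helper name write_file_list is hypothetical, and the demonstration runs on a throwaway temporary directory.

```python
import os
import tempfile


def write_file_list(root, list_path):
    """Write every file path under root to list_path, one per line."""
    count = 0
    with open(list_path, "w", encoding="utf-8") as handle:
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in sorted(filenames):
                handle.write(os.path.join(dirpath, name) + "\n")
                count += 1
    return count


# Demonstration on a temporary tree; the list is written outside the tree
# so that it does not end up listing itself.
with tempfile.TemporaryDirectory() as root, tempfile.TemporaryDirectory() as out:
    os.makedirs(os.path.join(root, "sub"))
    for rel in ("a.txt", os.path.join("sub", "b.txt")):
        with open(os.path.join(root, rel), "w") as sample_file:
            sample_file.write("x")
    list_path = os.path.join(out, "files.txt")
    total = write_file_list(root, list_path)
    with open(list_path, encoding="utf-8") as handle:
        listed = [os.path.relpath(line.strip(), root) for line in handle]

print(total, sorted(listed))
```

The resulting file can then be passed to FIDO with -input files.txt, as in the examples above.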

Only show files that could not be identified: python fido.py -matchprintf "" .

Only show files that could be identified: python fido.py -nomatchprintf "" .

Deep scan of container objects

By default, when FIDO detects that a file is a container (compound) object, it will start a deep (complete) scan of the file using the PRONOM container signatures. When identifying big files, this behaviour can cause FIDO to slow down significantly. You can disable deep scanning by invoking FIDO with the -nocontainer argument. While disabling deep scan speeds up identification, it may reduce accuracy.

At the moment (version 1.0), FIDO is not yet able to scan containers that are passed through STDIN. A workaround is to save the stream to a temporary file and have FIDO identify that file.
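
The temporary-file workaround could look like the following sketch. The helper name spool_to_tempfile is hypothetical, and an in-memory stream stands in for STDIN so the example can run on its own.

```python
import io
import os
import shutil
import tempfile


def spool_to_tempfile(stream, suffix=""):
    """Copy a binary stream to a named temporary file and return its path."""
    with tempfile.NamedTemporaryFile(delete=False, suffix=suffix) as handle:
        shutil.copyfileobj(stream, handle)
        return handle.name


# Stand-in for sys.stdin.buffer; the bytes are arbitrary example data.
payload = io.BytesIO(b"PK\x03\x04 example bytes")
path = spool_to_tempfile(payload, suffix=".bin")

with open(path, "rb") as saved_file:
    saved = saved_file.read()

# At this point FIDO could be run on `path`; remove the file afterwards.
os.remove(path)
print(len(saved))
```

In real use, the stream would be sys.stdin.buffer and the returned path would be given to FIDO as its FILE argument before the file is deleted.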

License information

See the file "LICENSE.txt" for information on the history of this software, terms & conditions for usage, and a DISCLAIMER OF ALL WARRANTIES...
