  • Stars: 246
  • Rank: 164,726 (Top 4%)
  • Language: Python
  • License: MIT License
  • Created over 7 years ago
  • Updated 2 months ago

Repository Details

File identification library for Python

identify

File identification library for Python.

Given a file (or some information about a file), return a set of standardized tags identifying what the file is.

Installation

pip install identify

Usage

With a file on disk

If you have an actual file on disk, you can get the most information possible (a superset of all other methods):

>>> from identify import identify
>>> identify.tags_from_path('/path/to/file.py')
{'file', 'text', 'python', 'non-executable'}
>>> identify.tags_from_path('/path/to/file-with-shebang')
{'file', 'text', 'shell', 'bash', 'executable'}
>>> identify.tags_from_path('/bin/bash')
{'file', 'binary', 'executable'}
>>> identify.tags_from_path('/path/to/directory')
{'directory'}
>>> identify.tags_from_path('/path/to/symlink')
{'symlink'}

When using a file on disk, the checks performed are:

  • File type (file, symlink, directory, socket)
  • Mode (is it executable?)
  • File name (mostly based on extension)
  • If the file is executable, the shebang is read and its interpreter is identified
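
As one way to put these tags to use, the sketch below selects the paths that carry a given tag. The helper name `paths_with_tag` and its `tagger` parameter are hypothetical and exist only for illustration; by default the helper calls `identify.tags_from_path` as shown above.

```python
from typing import Callable, Iterable, List, Optional, Set


def paths_with_tag(
    paths: Iterable[str],
    tag: str,
    tagger: Optional[Callable[[str], Set[str]]] = None,
) -> List[str]:
    """Return the paths whose tag set contains ``tag`` (hypothetical helper)."""
    if tagger is None:
        # Imported lazily so the helper can also be exercised with a
        # stand-in tagger when identify is not installed.
        from identify import identify
        tagger = identify.tags_from_path
    return [path for path in paths if tag in tagger(path)]
```

For example, `paths_with_tag(candidates, 'python')` would keep only the entries of `candidates` that are tagged as Python.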

If you only have the filename

>>> identify.tags_from_filename('file.py')
{'text', 'python'}

If you only have the interpreter

>>> identify.tags_from_interpreter('python3.5')
{'python', 'python3'}
>>> identify.tags_from_interpreter('bash')
{'shell', 'bash'}
>>> identify.tags_from_interpreter('some-unrecognized-thing')
set()
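
The `python3.5` result above suggests that version suffixes are trimmed until a known interpreter name is found. The following is a minimal sketch of that idea, not the library's actual code; the `KNOWN_INTERPRETERS` table and the name `sketch_tags_from_interpreter` are assumptions made for illustration.

```python
from typing import Dict, Set

# Hypothetical lookup table; the real library ships its own, larger mapping.
KNOWN_INTERPRETERS: Dict[str, Set[str]] = {
    'python': {'python'},
    'python3': {'python', 'python3'},
    'bash': {'shell', 'bash'},
}


def sketch_tags_from_interpreter(interpreter: str) -> Set[str]:
    # Keep only the basename, so '/usr/bin/python3.5' becomes 'python3.5'.
    name = interpreter.rpartition('/')[2]
    # Try 'python3.5', then 'python3', dropping one dotted suffix per step.
    while name:
        if name in KNOWN_INTERPRETERS:
            return set(KNOWN_INTERPRETERS[name])
        name = name.rpartition('.')[0]
    # Nothing matched, mirroring the empty set shown above.
    return set()
```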

As a CLI

$ identify-cli --help
usage: identify-cli [-h] [--filename-only] path

positional arguments:
  path

optional arguments:
  -h, --help       show this help message and exit
  --filename-only
$ identify-cli setup.py; echo $?
["file", "non-executable", "python", "text"]
0
$ identify-cli setup.py --filename-only; echo $?
["python", "text"]
0
$ identify-cli wat.wat; echo $?
wat.wat does not exist.
1
$ identify-cli wat.wat --filename-only; echo $?
1
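
Because the CLI prints a JSON list on success and exits with a non-zero status on failure, it can be wrapped from Python. The sketch below is one possible wrapper under those assumptions; the function name `tags_via_cli` and its `run` parameter are hypothetical, and `identify-cli` must be on `PATH` for the default to work.

```python
import json
import subprocess
from typing import Any, Callable, Optional, Set


def tags_via_cli(
    path: str,
    filename_only: bool = False,
    run: Callable[..., Any] = subprocess.run,
) -> Optional[Set[str]]:
    """Run identify-cli and return its tags, or None on a non-zero exit."""
    command = ['identify-cli', path]
    if filename_only:
        command.append('--filename-only')
    # capture_output and text require Python 3.7 or newer.
    result = run(command, capture_output=True, text=True)
    if result.returncode != 0:
        return None
    # On success the CLI prints a JSON list such as ["python", "text"].
    return set(json.loads(result.stdout))
```

The `run` parameter exists only so the wrapper can be tested without the CLI installed; in normal use it can be left at its default.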

Identifying LICENSE files

identify also has an API for determining what type of license is contained in a file. This routine is roughly based on the approaches used by licensee (the Ruby gem that GitHub uses to figure out the license for a repo).

The approach that identify uses is as follows:

  1. Strip the copyright line
  2. Normalize all whitespace
  3. Return any exact matches
  4. Return the closest by edit distance (where edit distance < 5%)
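
The four steps above can be sketched with the standard library alone. This is an illustration rather than the library's implementation: the `KNOWN_LICENSES` texts are heavily shortened placeholders, the copyright pattern is a guess, and measuring the 5% threshold against the longer of the two texts is an assumption.

```python
import re
from typing import Dict, Optional

# Hypothetical, heavily shortened reference texts keyed by SPDX id.
KNOWN_LICENSES: Dict[str, str] = {
    'MIT': 'Permission is hereby granted, free of charge, to any person',
    'ISC': 'Permission to use, copy, modify, and/or distribute this software',
}
# Assumed pattern: any line that starts with the word "copyright".
COPYRIGHT_LINE = re.compile(r'^\s*copyright\b.*$', re.IGNORECASE | re.MULTILINE)


def normalize(text: str) -> str:
    # Steps 1 and 2: drop copyright lines, then collapse runs of whitespace.
    return ' '.join(COPYRIGHT_LINE.sub('', text).split())


def edit_distance(a: str, b: str) -> int:
    # Levenshtein distance computed with a rolling row.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def sketch_license_id(text: str) -> Optional[str]:
    candidate = normalize(text)
    if not candidate:
        return None
    references = {spdx: normalize(body) for spdx, body in KNOWN_LICENSES.items()}
    # Step 3: an exact match wins immediately.
    for spdx, body in references.items():
        if candidate == body:
            return spdx
    # Step 4: otherwise accept the closest text if it differs by under 5%.
    best_id = min(references, key=lambda spdx: edit_distance(candidate, references[spdx]))
    distance = edit_distance(candidate, references[best_id])
    longest = max(len(candidate), len(references[best_id]))
    return best_id if distance / longest < 0.05 else None
```

A full-length license text makes the quadratic edit distance noticeably slow in pure Python, which is one reason a dedicated edit-distance package is a sensible optional dependency for this feature.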

To use the API, install via pip install identify[license]

>>> from identify import identify
>>> identify.license_id('LICENSE')
'MIT'

The return value of the license_id function is an SPDX id. Currently licenses are sourced from choosealicense.com.

How it works

A call to tags_from_path does this:

  1. What is the type: file, symlink, or directory? If it's not a file, stop here.
  2. Is it executable? Add the appropriate tag.
  3. Do we recognize the file extension? If so, add the appropriate tags and stop here. These tags include binary/text.
  4. Peek at the first X bytes of the file. Use these to determine whether it is binary or text, and add the appropriate tag.
  5. If identified as text above, try to read and interpret the shebang, and add the appropriate tags.

By design, this means we don't need to partially read files where we recognize the file extension.
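
The five steps can be sketched as follows. This is a simplified illustration, not the library's code: the two lookup tables are tiny hypothetical stand-ins, `PEEK_BYTES` is an assumed value for "X", and treating a NUL byte as the sign of a binary file is a deliberate simplification.

```python
import os
import stat
from typing import Dict, Set

# Hypothetical extension table; the real library maintains a far larger one.
EXTENSION_TAGS: Dict[str, Set[str]] = {
    'py': {'text', 'python'},
    'png': {'binary', 'image', 'png'},
}
# Hypothetical interpreter table, keyed by interpreter basename.
INTERPRETER_TAGS: Dict[str, Set[str]] = {
    'bash': {'shell', 'bash'},
    'python3': {'python', 'python3'},
}
PEEK_BYTES = 1024  # assumed value for the "first X bytes"


def sketch_tags_from_path(path: str) -> Set[str]:
    # Step 1: classify by file type without following symlinks.
    mode = os.lstat(path).st_mode
    if stat.S_ISLNK(mode):
        return {'symlink'}
    if stat.S_ISDIR(mode):
        return {'directory'}
    if stat.S_ISSOCK(mode):
        return {'socket'}
    tags = {'file'}

    # Step 2: record whether the file is executable.
    tags.add('executable' if os.access(path, os.X_OK) else 'non-executable')

    # Step 3: a recognized extension is enough; the file is never opened.
    _, dot, extension = os.path.basename(path).rpartition('.')
    if dot and extension.lower() in EXTENSION_TAGS:
        return tags | EXTENSION_TAGS[extension.lower()]

    # Step 4: peek at the start of the file; a NUL byte is treated as binary.
    with open(path, 'rb') as f:
        head = f.read(PEEK_BYTES)
    if b'\0' in head:
        return tags | {'binary'}
    tags.add('text')

    # Step 5: for text files, interpret a shebang line if there is one.
    first_line = head.split(b'\n', 1)[0].decode('utf-8', errors='replace')
    if first_line.startswith('#!'):
        parts = first_line[2:].split()
        # Treat '#!/usr/bin/env bash' the same as '#!/bin/bash'.
        if parts and os.path.basename(parts[0]) == 'env':
            parts = parts[1:]
        if parts:
            tags |= INTERPRETER_TAGS.get(os.path.basename(parts[0]), set())
    return tags
```

Returning early in step 3 is what keeps recognized extensions cheap: the only system calls made for such files are the `lstat` and the access check.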
