• Stars: 125
• Rank: 286,335 (Top 6 %)
• Language: Python
• License: GNU General Publi...
• Created about 8 years ago
• Updated 5 months ago


Repository Details

A simple text reuse detection CLI tool.

text-matcher


A simple text reuse detection CLI tool. Given a pair of texts or directories of texts, it finds passages of similar text between them. This is useful for detecting text reuse such as citation, quotation, intertextuality, and plagiarism.

The pilot experiment that uses this tool is allusion-detection. A new project that uses this tool is middlemarch-critical-histories.

Demo

Does Milton quote from the Bible in his Areopagitica? Let’s find out.

$ text-matcher kjv.txt areopagitica.txt 

1 total matches found.

match 1:
kjv.txt: (4135539, 4135561) Spirit. 5:20 Despise not prophesyings Prove all things; hold fast that which is good. 5:22 Abstain
areopagitica.txt: (25861, 25883) answerable to that of the Apostle to the Thessalonians PROVE ALL THINGS, HOLD FAST THAT WHICH IS GOOD. And he might
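
If you want to post-process results like these, the output can be parsed with a short script. The following is only a sketch: the line pattern is inferred from the demo output above and may not hold for every version of the tool, and the name parse_match_lines is hypothetical, not part of text-matcher.

```python
import re

# Pattern inferred from the demo output ("file: (start, end) context").
# This is an assumption about the format, not a documented guarantee.
MATCH_LINE = re.compile(
    r"^(?P<file>[^:]+): \((?P<start>\d+), (?P<end>\d+)\) (?P<context>.*)$"
)


def parse_match_lines(output):
    """Return one dict per 'file: (start, end) context' line in the output."""
    records = []
    for line in output.splitlines():
        found = MATCH_LINE.match(line.strip())
        if found:
            records.append({
                "file": found.group("file"),
                "start": int(found.group("start")),
                "end": int(found.group("end")),
                "context": found.group("context"),
            })
    return records


demo = """match 1:
kjv.txt: (4135539, 4135561) Spirit. 5:20 Despise not prophesyings
areopagitica.txt: (25861, 25883) answerable to that of the Apostle
"""

for record in parse_match_lines(demo):
    print(record["file"], record["start"], record["end"])
```

Lines such as "match 1:" do not fit the pattern and are skipped, so only the two location lines are returned.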

Usage

Just run text-matcher and provide the names of the text files you want to compare. You can also provide a directory of files instead of a single file, so if you want to compare textA.txt with every text file in textdir/, run text-matcher textA.txt textdir/.
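
To make the file-or-directory behaviour concrete, here is a minimal sketch of how two arguments could be expanded into pairs of files to compare. It is not the tool's own code: the function names are hypothetical, and it assumes that only .txt files directly inside a directory are considered.

```python
from itertools import product
from pathlib import Path


def expand(path):
    """Return [path] for a file, or the sorted .txt files inside a directory.

    Assumption: only files ending in .txt, directly inside the directory,
    count as "text files".
    """
    p = Path(path)
    if p.is_dir():
        return sorted(p.glob("*.txt"))
    return [p]


def comparison_pairs(text1, text2):
    """Pair every file from the first argument with every file from the second."""
    return list(product(expand(text1), expand(text2)))


# With textA.txt and a directory textdir/, this lists one pair per file in textdir/.
for first, second in comparison_pairs("textA.txt", "textdir/"):
    print(first, second)
```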

You can also tweak the matching by setting the n-gram size to match against and the minimum match length (threshold). From the help:

$ text-matcher --help
Usage: text-matcher [OPTIONS] TEXT1 TEXT2

  This program finds similar text in two text files.

Options:
  -t, --threshold INTEGER    The shortest length of match to include in the
                             list of initial matches.
  -c, --cutoff INTEGER       The shortest length of match to include in the
                             final list of extended matches.
  -n, --ngrams INTEGER       The ngram n-value to match against.
  -m, --mindistance INTEGER  The minimum value for distance between two
                             match.
  -l, --logfile TEXT         The name of the log file to write to.
  --stops                    Include stopwords in matching.
  --verbose                  Enable verbose mode, giving more information.
  --help                     Show this message and exit.
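
To give a feel for what the --ngrams and --stops options control, here is a minimal sketch of n-gram matching in general. It illustrates the technique only and is not text-matcher's implementation: the stopword list, the tokenizer, and the function names are all assumptions made for this example.

```python
import re

# A tiny illustrative stopword list; the tool's real list is not shown here.
STOPWORDS = {"the", "a", "an", "of", "to", "and", "that", "is", "in", "which"}


def tokenize(text, keep_stops=False):
    """Lowercase the text and split it into alphabetic words."""
    words = re.findall(r"[a-z]+", text.lower())
    if keep_stops:
        return words
    return [w for w in words if w not in STOPWORDS]


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def shared_ngrams(text1, text2, n=3, keep_stops=False):
    """Return the set of n-grams that occur in both texts."""
    first = set(ngrams(tokenize(text1, keep_stops), n))
    second = set(ngrams(tokenize(text2, keep_stops), n))
    return first & second


kjv = "Prove all things; hold fast that which is good."
milton = "PROVE ALL THINGS, HOLD FAST THAT WHICH IS GOOD. And he might"
for gram in sorted(shared_ngrams(kjv, milton, n=3)):
    print(" ".join(gram))
```

A larger n demands longer exact runs of shared words before anything counts as a match, while keeping stopwords (keep_stops=True here, loosely analogous to --stops) makes common function words part of the n-grams.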

Installation

You can install text-matcher using pip:

pip3 install --user text-matcher

Or globally, with sudo:

sudo pip3 install text-matcher

Alternatively, clone this repo and install locally, using pip:

git clone https://github.com/JonathanReeve/text-matcher
cd text-matcher
pip install .

Or with Pipenv:

git clone https://github.com/JonathanReeve/text-matcher
cd text-matcher
pipenv install .
pipenv run text-matcher

Citation

If you use text-matcher in your research, you can cite it like this, for now:

@misc{Reeve2020,
  author = {Reeve, Jonathan},
  title = {Text-Matcher},
  year = {2020},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/JonathanReeve/text-matcher}},
  commit = {988d9422a63165225ea136fc31427b1e57814505},
  doi = {10.5281/zenodo.3937738}
}

More Repositories

1. chapterize (Python, 92 stars)
   A simple tool for splitting up an ebook into its chapters. Works well with Project Gutenberg texts. May also be used to clean up books for computational text analysis.

2. course-computational-literary-analysis (Jupyter Notebook, 87 stars)
   Course materials for Introduction to Computational Literary Analysis, taught at UC Berkeley in Summer 2018, 2019, and 2020, at Columbia University in Fall 2020, and again at UC Berkeley in Summer 2021 and 2022.

3. workshop-text-analysis-spacy (Jupyter Notebook, 82 stars)
   Materials for the workshop Advanced Text Analysis with SpaCy and Scikit-Learn, given at NYU during NYCDH Week 2017, at PyData NYC in Nov. 2017, and at Columbia University in 2018 and 2019.

4. corpus-db (Jupyter Notebook, 59 stars)
   A textual corpus database for the digital humanities.

5. dotfiles (Nix, 36 stars)
   My personal dotfiles, using Nix Flakes to configure my system(s).

6. macro-etym (Python, 31 stars)
   A tool for analyzing the word histories of a text.

7. gitenberg-experiments (Jupyter Notebook, 19 stars)
   Scripts for scraping metadata from Project Gutenberg books, via GITenberg.

8. corpus-list (13 stars)
   A structured list of text corpora, created for use with a corpus downloader.

9. late-style-PCA (Jupyter Notebook, 10 stars)
   An attempt to experimentally test Edward Said's claims about late style using computational text analysis and principal component analysis.

10. allusion-detection (Jupyter Notebook, 9 stars)
    Computational intertextuality detection in Python. Fuzzy string matching, approximate string matching.

11. cenlab (Jupyter Notebook, 9 stars)
    A corpus of English-language novels combining the ~250 novels of the Corpus of English Novels with the Txtlab corpus of English novels.

12. md2mla (TeX, 8 stars)
    A script and accompanying templates to make an MLA-style paper from a markdown file. Requires Pandoc and LaTeX (xetex).

13. milton-analysis (Jupyter Notebook, 7 stars)
    Text analysis of Paradise Lost and other poems by John Milton.

14. workshop-word-embeddings (Jupyter Notebook, 7 stars)
    Materials for a workshop in word embeddings, for NYC-DH Week, February 2019

15. workshop-dataviz-2017 (Jupyter Notebook, 7 stars)
    An Introduction to Text Analysis and Visualization, Art of Data Visualization Week, April 2017, Columbia University

16. dissertation (Jupyter Notebook, 7 stars)
    A dissertation in computational literary analysis, called "The Eye of Modernism: Visual Imaginations of British literature, 1880-1930"

17. template-research-paper (TeX, 6 stars)
    A template for a research paper, which compiles to many file formats.

18. template-dissertation (Haskell, 5 stars)
    A template for a modern, best-practices dissertation.

19. jonreeve.com (TeX, 5 stars)
    My personal website, jonreeve.com, written in Haskell, using Ema.

20. course-cic-compling (Jupyter Notebook, 5 stars)
    Course materials for the course Computing in Context section in Computational Linguistics. Dept. of Computer Science, Columbia University, Fall 2021. Work-in-progress.

21. course-computational-literary-analysis-readings (Haskell, 5 stars)
    Syllabus and course readings for Introduction to Computational Literary Analysis, a course taught at UC-Berkeley in Summer 2018, 2019, and 2020, and at Columbia University in Fall 2020.

22. shakespeare-dialog-extractor (Python, 4 stars)
    An application to extract dialog from Shakespeare plays, as encoded into TEI by the Folger Library.

23. book-computational-literary-analysis (Jupyter Notebook, 4 stars)
    A textbook for the course, Introduction to Computational Literary Analysis. WIP

24. docmap (PHP, 4 stars)
    A project for creating new themes and customization functionality for the Omeka content management system.

25. conference-joyce-digital (4 stars)
    Website and materials for the conference Joyce in the Digital Age, held at Columbia University on October 1st, 2017.

26. free-indirect-discourse-model (Jupyter Notebook, 3 stars)
    Modeling free indirect discourse in literature, using AI.

27. plato-analysis (Jupyter Notebook, 3 stars)
    Analyses of Platonic dialogues, including a Socratic dialogue generator.

28. course-word-embeddings (TeX, 3 stars)
    Course materials for "Meaningful Text Analysis with Word Embeddings," taught at the Digital Humanities Summer Institute, June 2021.

29. text-to-time-series (Jupyter Notebook, 2 stars)
    Experiments in text analysis, generating time series from texts.

30. occupations-experiment (Jupyter Notebook, 2 stars)
    Experiments in quantifying occupations as they're represented in fiction.

31. template-course-website (TeX, 2 stars)
    A website for a university course. Semantic by default.

32. sops (HTML, 2 stars)
    Research materials (literature review, bibliography) for the project A Safer Online Public Square

33. dissertation-prospectus (TeX, 2 stars)
    My ever-protean dissertation prospectus.

34. character-attribution (Jupyter Notebook, 2 stars)
    Probabilistic attribution of character voices in fiction.

35. htrc-experiments (Jupyter Notebook, 2 stars)
    Text analysis experiments with Hathi Trust Research Center literary datasets.

36. corpus-SHC (1 star)
    A fork of Martin Mueller's Shakespeare His Contemporaries corpus, originally located at https://github.com/martinmueller39/SHC, divided into submodules as an experiment.

37. html2tei (HTML, 1 star)
    A tool to extract structured data from novels (starting with Project Gutenberg HTML files)

38. sent2tree (Python, 1 star)
    Alternative visualizations for SpaCy-parsed sentences, using ETE3.

39. sentence-trees (Jupyter Notebook, 1 star)
    Experiments with sentences as trees.

40. pg-srp (Jupyter Notebook, 1 star)
    Stable Random Projections (SRP) of Project Gutenberg texts, for similarity tests

41. course-data-ethics (Jupyter Notebook, 1 star)
    Draft syllabus for a course in data science ethics. WIP.

42. course-nyu-pit (Jupyter Notebook, 1 star)
    Course materials for the New York University Institute in Public Interest Technology (NYU-PIT)

43. org-autolinks-mode (Emacs Lisp, 1 star)
    An emacs minor mode for automatically linking to org files, after typing the name of the file.

44. hs-tei-transform (Haskell, 1 star)
    Experiments in transforming TEI XML, using Haskell

45. workshop-intro-haskell (1 star)
    An introduction to functional programming in Haskell. A workshop given in October 2020 at Columbia University.

46. david-copperfield (HTML, 1 star)
    An annotated edition of David Copperfield

47. dataviz-workshop (Jupyter Notebook, 1 star)
    Materials for a workshop in text analysis and visualization, originally given at Columbia University in April 2016.

48. data-ethics-literature-review (HTML, 1 star)
    An automated survey of literature and curricula surrounding ethics in data science. WIP.

49. chaucer-macro-etym (Jupyter Notebook, 1 star)
    Macro-etymological analyses of the Canterbury Tales.

50. corpus-mansfield-garden-party-TEI (Jupyter Notebook, 1 star)
    A TEI edition of Katherine Mansfield's short story "The Garden Party."

51. persistent-homology (Jupyter Notebook, 1 star)
    Experiments with NLP and persistent homology.

52. course-university-writing (HTML, 1 star)
    Draft materials for the course "University Writing with Readings in the Data Sciences," taught at Columbia University in the fall of 2017. Students, please refer to CourseWorks instead of this repository.

53. course-multilingual-technologies (Haskell, 1 star)
    Course website for Multilingual Technologies and Language Diversity, taught at Columbia University by Prof. Smaranda Muresan and Dr. Isabelle Zaugg