• This repository has been archived on 04/Sep/2024
  • Stars: 115
  • Rank 305,916 (Top 7 %)
  • Language: Python
  • Created almost 7 years ago
  • Updated 4 months ago

Intertext

Detect and visualize text reuse within collections of plain text or XML documents.

Intertext uses machine learning and interactive visualizations to identify and display intertextual patterns in text collections. The text processing is based on minhashing vectorized strings and the web viewer is based on interactive React components. [Demo]
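
To make the minhashing idea concrete, here is a minimal, pure-Python sketch. It is an illustration of the general technique, not Intertext's own implementation: the function names, the 4-word window size, and the choice of BLAKE2b as the hash family are all assumptions made for this example.

```python
import hashlib
import re


def shingles(text, k=4):
    """Return the set of overlapping k-word windows in a text (k=4 is an assumed default)."""
    words = re.findall(r"\w+", text.lower())
    return {" ".join(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}


def minhash_signature(items, num_hashes=64):
    """For each salted hash function, keep the smallest hash value over all items."""
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")  # a different salt gives a different hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(item.encode("utf8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for item in items
        ))
    return signature


def estimated_similarity(sig_a, sig_b):
    """The fraction of matching signature slots estimates the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


a = "It was the best of times, it was the worst of times, it was the age of wisdom"
b = "He wrote that it was the best of times, it was the worst of times indeed"
c = "Call me Ishmael. Some years ago, never mind how long precisely"

sig_a, sig_b, sig_c = (minhash_signature(shingles(t)) for t in (a, b, c))
sim_ab = estimated_similarity(sig_a, sig_b)  # texts that share a long passage
sim_ac = estimated_similarity(sig_a, sig_c)  # unrelated texts
print("a vs b:", sim_ab)
print("a vs c:", sim_ac)
```

Texts that reuse a passage share many word windows, so their signatures agree in many slots, while unrelated texts agree in few or none. Comparing short signatures instead of full texts is what makes this approach practical on large collections.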

Installation

To install Intertext, run the steps below:

# optional: install Anaconda and set up conda virtual environment
conda create --name intertext python=3.7
conda activate intertext

# install the package
pip uninstall intertext -y
pip install https://github.com/yaledhlab/intertext/archive/master.zip

Usage

# search for intertextuality in some documents
intertext --infiles "sample_data/texts/*.txt"

# serve output
python -m http.server 8000

Then open a web browser to http://localhost:8000/output and you'll see any intertextualities the engine discovered!

CUDA Acceleration

To enable CUDA acceleration, we recommend the following steps when installing the module:

# set up conda virtual environment
conda create --name intertext python=3.7
conda activate intertext

# set up cuda and cupy
conda install cudatoolkit
conda install -c conda-forge cupy

# install the package
pip uninstall intertext -y
pip install https://github.com/yaledhlab/intertext/archive/master.zip
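
After installing, it can help to confirm that CuPy actually sees a GPU. The check below is a small sketch that relies only on CuPy's public runtime API; the helper name `gpu_available` is made up for this example and is not part of Intertext.

```python
def gpu_available():
    """Return True if CuPy imports and reports at least one CUDA device."""
    try:
        import cupy
        return cupy.cuda.runtime.getDeviceCount() > 0
    except Exception:
        # CuPy is missing, or the CUDA driver/runtime could not be initialized
        return False


print("CUDA device visible to CuPy:", gpu_available())
```

If this prints False, the cudatoolkit and cupy installation steps above are the first place to look.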

Providing Metadata

To display the author and title of matching texts, pass the path to a metadata file to the intertext command with the --metadata flag, e.g.

intertext --infiles "sample_data/texts/*.txt" --metadata "sample_data/metadata.json"

Metadata files should be JSON files with the following format:

{
  "a.xml": {
    "author": "Author A",
    "title": "Title A",
    "year": 1751,
    "url": "https://google.com?text=a.xml"
  },
  "b.xml": {
    "author": "Author B",
    "title": "Title B",
    "year": 1753,
    "url": "https://google.com?text=b.xml"
  }
}
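
For larger collections it is usually easier to generate this file than to write it by hand. The sketch below builds the structure from a list of records; the helper name, the record layout with a `filename` key, and the choice to treat all four fields from the example as expected are assumptions made for illustration, since which fields are strictly required is not stated here.

```python
import json

# Fields shown in the example above; treating all of them as expected is an assumption
EXPECTED_FIELDS = ("author", "title", "year", "url")


def build_metadata(records):
    """Map each file name to its metadata, checking that every expected field is present."""
    metadata = {}
    for record in records:
        missing = [field for field in EXPECTED_FIELDS if field not in record]
        if missing:
            raise ValueError(f"{record.get('filename')}: missing fields {missing}")
        metadata[record["filename"]] = {field: record[field] for field in EXPECTED_FIELDS}
    return metadata


# Hypothetical records, e.g. rows read from a spreadsheet export
records = [
    {"filename": "a.txt", "author": "Author A", "title": "Title A",
     "year": 1751, "url": "https://google.com?text=a.txt"},
    {"filename": "b.txt", "author": "Author B", "title": "Title B",
     "year": 1753, "url": "https://google.com?text=b.txt"},
]

metadata = build_metadata(records)
print(json.dumps(metadata, indent=2))
```

Saving the printed JSON to a file gives something that can be passed to the --metadata flag. The keys here are file names, as in the example above.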

Deeplinking

If your text documents can be read on another website, you can add a url attribute to each of your files within your metadata JSON file (see example above).

If your documents are XML files and you would like to deeplink to specific pages within a reading environment, you can use the --xml_page_tag flag to designate the tag within which page breaks are identified. Additionally, you should include $PAGE_ID in the url attribute for the given file within your metadata file, e.g.

{
  "a.xml": {
    "author": "Author A",
    "title": "Title A",
    "year": 1751,
    "url": "https://google.com?text=a.xml&page=$PAGE_ID"
  },
  "b.xml": {
    "author": "Author B",
    "title": "Title B",
    "year": 1753,
    "url": "https://google.com?text=b.xml&page=$PAGE_ID"
  }
}

If your page ids are specified within an attribute in the --xml_page_tag tag, you can specify the relevant attribute using the --xml_page_attr flag.
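
The relationship between the page tag, the page attribute, and the $PAGE_ID placeholder can be sketched as follows. This is a conceptual illustration rather than Intertext's internal code: the sample XML, the `pb` tag with an `n` attribute, and both helper functions are assumptions chosen for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical document in which page breaks are <pb> tags with the page id in "n"
SAMPLE_XML = """<text>
  <pb n="12"/><p>First page of prose.</p>
  <pb n="13"/><p>Second page of prose.</p>
</text>"""


def page_ids(xml_string, page_tag, page_attr=None):
    """Collect page ids from each page tag, reading an attribute if given, else the tag text."""
    root = ET.fromstring(xml_string)
    ids = []
    for element in root.iter(page_tag):
        value = element.get(page_attr) if page_attr else (element.text or "").strip()
        if value:
            ids.append(value)
    return ids


def deeplink(url_template, page_id):
    """Fill the $PAGE_ID placeholder in a metadata url with a concrete page id."""
    return url_template.replace("$PAGE_ID", str(page_id))


ids = page_ids(SAMPLE_XML, page_tag="pb", page_attr="n")
links = [deeplink("https://google.com?text=a.xml&page=$PAGE_ID", page_id) for page_id in ids]
for link in links:
    print(link)
```

With a document like this one, the corresponding command-line options would be --xml_page_tag pb and --xml_page_attr n.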

More Repositories

1

pix-plot

A WebGL viewer for UMAP or TSNE-clustered images
JavaScript
528
star
2

lab-workshops

Materials for workshops on text mining, machine learning, and data visualization
Jupyter Notebook
144
star
3

flask-react-boilerplate

Simple boilerplate for a Flask backend and React client
JavaScript
98
star
4

neural-neighbors

A simple web application for browsing similar images
JavaScript
32
star
5

wordmap

Visualize large text collections with WebGL
JavaScript
24
star
6

lexis-nexis-wsk

Convenience wrappers for the Lexis Nexis WSK API
Python
18
star
7

pointgrid

Transform a 2D point distribution to a hex grid to avoid overplotting in data visualizations
Python
17
star
8

iiif-downloader

A simple utility for downloading images from IIIF servers
Python
13
star
9

humanities-data-mining

Materials for YData Course "Humanities Data Mining"
8
star
10

voynich

Analyzing the Voynich Manuscript with computer vision
Jupyter Notebook
7
star
11

shears

Extract image content from historical book scans
Python
7
star
12

vertices

Extract fixed-size vertex data from .obj and .jpg files
Python
6
star
13

ensemble-at-yale

Crowdsourcing the transcription of Yale playbills - http://bit.ly/ensemble-at-yale
CoffeeScript
6
star
14

dancing-with-data

Generating 3D Dance Sequences with Neural Networks
Jupyter Notebook
6
star
15

realtime-image-layout

Visualize images from a IIIF manifest using client-side mobilenet image vectors
JavaScript
6
star
16

image-segmentation

Utilities for image segmentation tasks
Python
5
star
17

vggface

@rcmalli's keras-vggface library updated to Tensorflow 2
Python
5
star
18

lexis-bulk-api

A simple Python wrapper around the Lexis Nexis Bulk API
Python
5
star
19

realtime-layout

Boilerplate for creating TSNE and UMAP layouts with JavaScript in realtime
JavaScript
5
star
20

ani-yun-wiya

ᎠᏂᏴᏫᏯ Theme for Omeka 2.x. Based on Interactive Mechanics’ Omega Starter Theme.
PHP
5
star
21

dhlab-site

dhlab.yale.edu
HTML
4
star
22

nhba

A digital archive of New Haven's architecture
JavaScript
4
star
23

omeka-plugin-Casify

Protect restricted Omeka routes with CAS
PHP
4
star
24

mtcnn

@ipazc's mtcnn library updated to Tensorflow 2
Python
3
star
25

stylegan2-helpers

Automates a bunch of steps to go from a folder of images to a trained network using stylegan2
Python
3
star
26

intertext-client

The client application for https://github.com/YaleDHLab/Intertext
JavaScript
3
star
27

image_datasets

Image datasets for computer vision projects in Python
Python
3
star
28

jekyll-working-group

Because everyone loves a good static file site
3
star
29

voices

Semi-private archives of user-submitted materials
Ruby
3
star
30

facenet

A packaged version of David Sandberg's Facenet implementation
Python
2
star
31

chirila

A database of Australian languages
HTML
2
star
32

minimal-jekyll-starter

Minimal boilerplate for building a Jekyll site with a custom theme
CSS
2
star
33

variant-viewer

Display line-level variants across print/manuscript editions
HTML
2
star
34

dh-rees

Digital Humanities and Russian & East European Studies at Yale
PHP
2
star
35

omeka-plugin-PaginateCollections

Omeka plugin that adds pagination to collections pages within the ᎠᏂᏴᏫᏯ Theme.
PHP
2
star
36

gathering-a-building

Tracing Yale University's campus architecture
JavaScript
2
star
37

three-boilerplate

Minimal boilerplate for three.js apps
JavaScript
1
star
38

bookworm-docker

Docker image for running BookwormDB
Dockerfile
1
star
39

TensorFlow-CUDA

Setting up NVIDIA CUDA-enabled TensorFlow on Ubuntu x64
1
star
40

passages-to-freedom

Mapping the journeys slaves took to freedom
HTML
1
star
41

2018-03-12-YUL

Python
1
star
42

scroll_viewer

Support for zoomified large images + cursor positioning
CSS
1
star
43

ensembleatyale-tools

Backend tools for processing theatrical data at Yale.
Shell
1
star
44

bookworm-pq

Tools for processing vendor data for use in Bookworm
Python
1
star
45

minimal-bookworm

Visualize term distributions over time
Python
1
star
46

daily-mongo-backups

Email yourself mongo db backups every day
Python
1
star
47

development-timelines

Chart GitHub Development Timelines
Python
1
star
48

vectorized-minhash

A fork of @bradhackinen's vminhash that's installable as a module
Python
1
star
49

dante-at-hand

Reading women reading Dante
PHP
1
star
50

trails

Visualizing massive datasets with WebGL
JavaScript
1
star