• Stars
    star
    533
  • Rank 83,238 (Top 2 %)
  • Language
    Jupyter Notebook
  • License
    MIT License
  • Created over 1 year ago
  • Updated 4 months ago

Reviews

There are no reviews yet. Be the first to send feedback to the community and the maintainers!

Repository Details

String-to-String Algorithms for Natural Language Processing

string2string

PyPI - Python Version PyPI version Downloads license arXiv

Table of Contents

Getting Started | Tutorials | Example Usage | Documentation | Tests | Citation | Thanks | Contributing

Abstract

The string2string library is an open-source tool that offers a comprehensive suite of efficient algorithms for a broad range of string-to-string problems. It includes both traditional algorithmic solutions and recent advanced neural approaches to address various problems in pairwise string alignment, distance measurement, lexical and semantic search, and similarity analysis. Additionally, the library provides several helpful visualization tools and metrics to facilitate the interpretation and analysis of these methods.

The library features notable algorithms such as the Smith-Waterman algorithm for pairwise local alignment, the Hirschberg algorithm for global alignment, the Wagner-Fisher algorithm for edit distance, BARTScore and BERTScore for similarity analysis, the Knuth-Morris-Pratt algorithm for lexical search, and Faiss for semantic search. Moreover, it wraps existing highly efficient and widely-used implementations of certain frameworks and metrics, such as sacreBLEU and ROUGE, whenever it is appropriate and suitable.

In general, the string2string library seeks to provide extensive coverage and increased flexibility compared to existing libraries for strings. It can be used for many downstream applications, tasks, and problems in natural-language processing, bioinformatics, and computational social sciences. With its comprehensive suite of algorithms, visualization tools, and metrics, the string2string library is a valuable resource for researchers and practitioners in various fields.

Getting Started

Install the string2string library by running the following command in your terminal:

pip install string2string

Once the installation is complete, you can import the library and start using its functionalities.

Remark: We recommend using Python 3.7+ for the library.

Tutorials

Example Usage

Alignment

In the following example, we illustrate how to align two sequences of strings globally by using the Needleman-Wunsch algorithm. This algorithm, along with other alignment techniques, can be found in the alignment module of the library. The code snippet below demonstrates how to apply the Needleman-Wunsch algorithm to perform a global alignment of two given strings.

This example provides a practical illustration of how to use the Needleman-Wunsch algorithm to solve the problem of sequence alignment, which is a fundamental problem in bioinformatics.

>>> # Import the NeedlemanWunsch class from the alignment module
>>> from string2string.alignment import NeedlemanWunsch

>>> # Create an instance of the NeedlemanWunsch class
>>> nw = NeedlemanWunsch()

>>> # Let's create a list of strings (resembling DNA sequences), but they can be any strings (e.g., words), of course.
>>> seq1 = ['X', 'ATT', 'GC', 'GC', 'A', 'A', 'G']
>>> seq2 = ['ATT', 'G', 'GC', 'GC', 'A', 'C', 'G']

>>> # Compute the alignment between two strings
>>> aligned_seq1, aligned_seq2 = nw.get_alignment(seq1, seq2)

>>> # Print the alignment between the sequences, as computed by the Needleman-Wunsch algorithm.
>>> nw.print_alignment(aligned_seq1, aligned_seq2)
X | ATT | - | GC | GC | A | A | G
- | ATT | G | GC | GC | A | C | G

>>> alg_path, alg_seq1_parts, alg_seq2_parts = nw.get_alignment_strings_and_indices(aligned_seq1, aligned_seq2)
>>> from string2string.misc.plotting_functions import plot_pairwise_alignment
>>> plot_pairwise_alignment(seq1_pieces = alg_seq1_parts, seq2_pieces = alg_seq2_parts, alignment = alg_path, str2colordict = {'-': 'lightgray', 'ATT': 'indianred', 'GC': 'darkseagreen', 'A': 'skyblue', 'G': 'palevioletred', 'C': 'steelblue'}, title = 'Global Alignment Between Two Sequences of Strings')

Distance

The following code snippet demonstrates how to use the Levenshtein edit distance algorithm to compute the edit distance between two strings, at the character level and at the word level.

>>> # Let's create a Levenshtein edit distance class instance, with the default (unit cost) weights, from the distance module
>>> from string2string.distance import LevenshteinEditDistance
>>> edit_dist = LevenshteinEditDistance()

>>> # Let's also create a Tokenizer class instance with the default word delimiter (i.e., space)
>>> from string2string.misc import Tokenizer
>>> tokenizer = Tokenizer(word_delimiter=' ')

>>> # Let's create two strings
>>> text1 = "The quick brown fox jumps over the lazy dog"
>>> text2 = "The kuack brown box jumps over the lazy dog"

>>> # Get the edit distance between them at the character level
>>> edit_dist_score  = edit_dist.compute(text1, text2)

>>> print(f"Edit distance between these two texts at the character level is {edit_dist_score}")
# Edit distance between these two texts at the character level is 3.0

>>> # Tokenize the two texts
>>> text1_tokens = tokenizer.tokenize(text1)
>>> text2_tokens = tokenizer.tokenize(text2)

>>> # Get the distance between them at the word level
>>> edit_dist_score  = edit_dist.compute(text1_tokens, text2_tokens)

>>> print(f"Edit distance between these two texts at the word level is {edit_dist_score}")
# Edit distance between these two texts at the word level is 2.0

Search

The following code snippet demonstrates how to use the Knuth-Morrs-Pratt (KMP) search algorithm to find the index of a pattern in a text.

>>> # Let's create a KMPSearch class instance from the search module
>>> from string2string.search import KMPSearch
>>> knuth_morris_pratt = KMPSearch()

>>> # Let's define a pattern and a text
>>> pattern = Jane Austen'
>>> text = 'Sense and Sensibility, Pride and Prejudice, Emma, Mansfield Park, Northanger Abbey, Persuasion, and Lady Susan were written by Jane Austen and are important works of English literature.'

>>> # Now let's find the index of the pattern in the text, if it exists (otherwise, -1 is returned).
>>> idx = knuth_morris_pratt.search(pattern=pattern,text=text)

>>> print(f'The index of the pattern in the text is {idx}.')
# The index of the pattern in the text is 127.

Faiss Semantic Search

The example below demonstrates how to use the Faiss tool developed by FAIR to perform semantic search. First, we use the BART-Large model from Hugging Face to generate embeddings for a small corpus of 25 sentences. To perform the search, we first encode a query sentence using the same BART model and use it to search the corpus. Specifically, we search for sentences that are most similar in meaning to the query sentence. After performing the search, we print the top thre sentences from the corpus that are most similar to the query sentence.

This approach can be useful in a variety of natural-processing applications, such as question-answering and information retrieval, where it is essential to find relevant information quickly and accurately.

>>> # Let's create a FaissSearch class instance from the search module to perform semantic search
>>> from string2string.search import FaissSearch
>>> faiss_search = FaissSearch(model_name_or_path = 'facebook/bart-large')

>>> # Let's create a corpus of strings (e.g., sentences)
>>> corpus = {
        'text': [
            "Coffee is my go-to drink in the morning.", 
            "I always try to make time for exercise.", 
            "Learning something new every day keeps me motivated.", 
            "The sunsets in my hometown are breathtaking.", 
            "I am grateful for the support of my friends and family.", 
            "The book I'm reading is incredibly captivating.", 
            "I love listening to music while I work.", 
            "I'm excited to try the new restaurant in town.", 
            "Taking a walk in nature always clears my mind.", 
            "I believe that kindness is the most important trait.", 
            "It's important to take breaks throughout the day.", 
            "I'm looking forward to the weekend.", 
            "Reading before bed helps me relax.", 
            "I try to stay positive even in difficult situations.", 
            "Cooking is one of my favorite hobbies.", 
            "I'm grateful for the opportunity to learn and grow every day.", 
            "I love traveling and experiencing new cultures.", 
            "I'm proud of the progress I've made so far.", 
            "A good night's sleep is essential for my well-being.", 
            "Spending time with loved ones always brings me joy.", 
            "I'm grateful for the beauty of nature around me.", 
            "I try to live in the present moment and appreciate what I have.", 
            "I believe that honesty is always the best policy.", 
            "I enjoy challenging myself and pushing my limits.", 
            "I'm excited to see what the future holds."
        ],
    }

>>> # Next we need to initialize and encode the corpus
>>> faiss_search.initialize_corpus(
    corpus=corpus,
    section='text', 
    embedding_type='mean_pooling',
    )

>>> # Let's define a query, and the number of top results we want to retrieve; then, let's perform the semantic search.
>>> query = 'I like going for a run in the morning.'
>>> top_k = 5
>>> top_k_results = faiss_search.search(query=query, k = top_k)

>>> # Let's define a function to print the results of the search.
>>> def print_results(query, results, top_k):
        # Let's first print the query.
        print(f'Query: "{query}"\n')

        # Let's now print the top k results.
        print(f'Top {top_k} most similar sentences in the corpus to the query (smallest score is most similar):')
        for i in range(top_k):
            print(f' - {i+1}: "{results["text"][i]}" with a similarity score of {top_k_results["score"][i]:.2f}')
>>> print_results(query=query, results=top_k_results, top_k=top_k)
# Query: "I like going for a run in the morning."

# Top 3 most similar sentences in the corpus to the query (smallest score is most similar):
#  - 1: "I always try to make time for exercise." with a similarity score of 170.65
#  - 2: "The sunsets in my hometown are breathtaking." with a similarity score of 238.20
#  - 3: "Coffee is my go-to drink in the morning." with a similarity score of 238.85

Cosine Similarity (with Glove and fastText Embeddings)

The following example demonstrates how to use pre-trained GloVe embeddings to calculate the cosine similarity between different pairs of words. Specifically, we compute the cosine similarity between the embeddings of four words: "cat", "dog", "phone", and "computer". We then create a similarity matrix and use the plot_heatmap function in the visualization module to plot this matrix.

Overall, this example provides a practical demonstration of how to use pre-trained embeddings such as GloVe and fastText to quantify the semantic similarity between pairs of words, which can be useful in a variety of natural-language processing tasks.

>>> # Let's create a Cosine Similarity class instance from the similarity module
>>> from string2string.similarity import CosineSimilarity
>>> cosine_similarity = CosineSimilarity()

>>> # Let's also create an instance of the GloVeEmbeddings class from the misc module to compute the embeddings of words
>>> from string2string.misc import GloVeEmbeddings
>>> glove = GloVeEmbeddings(model='glove.6B.200d', dim=50, force_download=True, dir='./models/glove-model/')

>>> # Let's define a list of words
>>> words = ['cat', 'dog', 'phone', 'computer']

>>> # Let's create a list to store the embeddings of the words and compute them
>>> embeds = []
>>> for word in words:
>>>     embedding = glove.get_embedding(word)
>>>     embeds.append(embedding)

>>> # Let's create a similarity matrix to store the cosine similarity between each pair of embeddings
>>> similarity_matrix = np.zeros((len(words), len(words)))
>>> for i in range(len(embeds)):
        similarity_matrix[i, i] = 1
        for j in range(i + 1, len(embeds)):
            result = cosine_similarity.compute(embeds[i], embeds[j], dim=1).item()
            similarity_matrix[i, j] = result
            similarity_matrix[j, i] = result

>>> # Let's visualize the similarity matrix
>>> from string2string.misc.plotting_functions import plot_heatmap
>>> plot_heatmap(
        similarity_matrix, 
        title='Similarity Between GloVe Embeddings',
        x_ticks = words,
        y_ticks = words,
        x_label = 'Words',
        y_label = 'Words',
        valfmt = '{x:.2f}',
        cmap="Blues",
    )

Metrics (sacreBLEU and ROUGE)

In the code snippet below, you can see an example of how to employ the sacreBLEU metric for calculating the BLEU score. This metric is implemented as a wrapper around the sacreBLEU library. The computation is performed by providing a list of candidate sentences and a list of reference sentences as input and calling the compute method of the metric instance.

>>> # Let's import the sacreBLEU metric from the metrics module and create an instance of it
>>> from string2string.metrics import sacreBLEU
>>> sbleu = sacreBLEU()

>>> # Let's define a list of candidate sentences and a list of reference sentences
>>> candidates = ['The sun is shining.', 'The birds are chirping.', 'She is playing the guitar.', 'He is cooking dinner.']
>>> references = [['The sun is shining.', 'The sun is bright.'], ['The birds are singing.', 'The harold is singing.'], ['Julie is playing the flute.', 'She is playing the piano.'], ['Chef is cooking dinner.', 'He is cooking lunch.']]

>>> # Compute the sacreBLEU score
>>> result = sbleu.compute(candidates, references)
>>> print(result)
# {'score': 67.92604743211312, 'counts': [19, 13, 9, 4], 'totals': [21, 17, 13, 9], 'precisions': [90.47619047619048, 76.47058823529412, 69.23076923076923, 44.44444444444444], 'bp': 1.0, 'sys_len': 21, 'ref_len': 21}

Similarly, we can use the ROUGE metric to calculate the ROUGE score. This metric is implemented as a wrapper around the ROUGE library.

>>> # Let's import the ROUGE metric from the metrics module and create an instance of it
>>> from string2string.metrics import ROUGE
>>> rogue = ROUGE()

>>> # Let's define a list of candidate sentences and a list of reference sentences
>>> candidates = ["The cat is sitting on the mat.", "The dog is barking at the mailman.", "The bird is singing in the tree."] 
>>> references = [["The cat is sitting on the mat."], ["The dog is barking at the postman."], ["The bird sings on the tree."]]

>>> # Compute the ROUGE score
>>> result = rogue.compute(candidates, references)
>>> print(result)
# {'rouge1': 0.8241758241758242, 'rouge2': 0.7323232323232324, 'rougeL': 0.8241758241758242, 'rougeLsum': 0.8241758241758242}

BARTScore and BERTScore

This code snippet shows how to utilize the BARTScore and BERTScore metrics to compute their corresponding scores. These similarity metrics serve as wrappers around the BARTScore and BERTScore libraries.

>>> # Importing the BERTScore and BARTScore classes from the similarity module
>>> from string2string.similarity import BERTScore, BARTScore

>>> # Initializing the BARTScore metric
>>> bart_scorer = BARTScore(model_name_or_path='facebook/bart-large-cnn')

>>> # Initializing the BERTScore metric
>>> bert_scorer = BERTScore(lang="en")

>>> # Define the source and target sentences
>>> source_sentences = ["I'm super happy today.", "John von Neumann was a mathematician and physicist."]
>>> target_sentences = ["I feel pretty wonderful today.", "Kurt Vonnegut's Slaughterhouse Five had a profound impact on many people."]

>>> # Compute the BARTScore
>>> score = bart_scorer.compute(source_sentences, target_sentences, agg="mean", batch_size=4)
>>> print(score)
# {'score': array([-2.77911878, -4.60774326])}

>>> # Compute the BERTScore
>>> score = bert_scorer.compute(source_sentences, target_sentences)
>>> print(score)
# {'precision': array([0.943804 , 0.8559195], dtype=float32), 'recall': array([0.943804  , 0.85145634], dtype=float32), 'f1': array([0.943804 , 0.8536821], dtype=float32)}

Potential Future Enhancements and Improvements

  • Use the Numba library to accelerate the execution time of the algorithms.
  • Include additional string-based metrics in the miscellaneous module.
  • Implement BLAST and FASTA algorithms. [Work-in-Progress]
  • Add more customizable and useful visualization features.

Citation

@article{suzgun2023string2string,
  title={string2string: A Modern Python Library for String-to-String Algorithms},
  author={Suzgun, Mirac and Shieber, Stuart M and Jurafsky, Dan},
  journal={arXiv preprint arXiv:2304.14395},
  year={2023}
}

Thanks

Our project owes a debt of gratitude to the following individuals for their contributions, comments, and feedback: Federico Bianchi, Corinna Coupette, Sebastian Gehrmann, Tayfun Gür, Şule Kahraman, Deniz Keleş, Luke Melas-Kyriazi, Christopher Manning, Tolúlopé Ògúnrèmí, Alexander "Sasha" Rush, Kyle Swanson, and Garrett Tanzer.

Contributing

If you are interested in contributing to this library, we encourage you to review our documentation first to gain a better understanding of the codebase's format and structure. Once you are familiar with the codebase, you are welcome to submit a pull request. We are looking for new contributors to help us drive our efforts forward!

More Repositories

1

dspy

DSPy: The framework for programming—not prompting—foundation models
Python
18,220
star
2

CoreNLP

CoreNLP: A Java suite of core NLP tools for tokenization, sentence segmentation, NER, parsing, coreference, sentiment analysis, etc.
Java
9,678
star
3

stanza

Stanford NLP Python library for tokenization, sentence segmentation, NER, and parsing of many human languages
Python
7,278
star
4

GloVe

Software in C and data files for the popular GloVe model for distributed word representations, a.k.a. word vectors or embeddings
C
6,867
star
5

cs224n-winter17-notes

Course notes for CS224N Winter17
TeX
1,587
star
6

pyreft

ReFT: Representation Finetuning for Language Models
Python
1,137
star
7

treelstm

Tree-structured Long Short-Term Memory networks (http://arxiv.org/abs/1503.00075)
Lua
875
star
8

pyvene

Stanford NLP Python Library for Understanding and Improving PyTorch Models via Interventions
Python
625
star
9

python-stanford-corenlp

Python interface to CoreNLP using a bidirectional server-client interface.
Python
516
star
10

mac-network

Implementation for the paper "Compositional Attention Networks for Machine Reasoning" (Hudson and Manning, ICLR 2018)
Python
494
star
11

phrasal

A large-scale statistical machine translation system written in Java.
Java
208
star
12

spinn

SPINN (Stack-augmented Parser-Interpreter Neural Network): fast, batchable, context-aware TreeRNNs
Python
205
star
13

coqa-baselines

The baselines used in the CoQA paper
Python
176
star
14

cocoa

Framework for learning dialogue agents in a two-player game setting.
Python
158
star
15

stanza-old

Stanford NLP group's shared Python tools.
Python
138
star
16

chirpycardinal

Stanford's Alexa Prize socialbot
Python
131
star
17

stanfordnlp

[Deprecated] This library has been renamed to "Stanza". Latest development at: https://github.com/stanfordnlp/stanza
Python
114
star
18

wge

Workflow-Guided Exploration: sample-efficient RL agent for web tasks
Python
109
star
19

pdf-struct

Logical structure analysis for visually structured documents
Python
81
star
20

edu-convokit

Edu-ConvoKit: An Open-Source Framework for Education Conversation Data
Jupyter Notebook
75
star
21

cs224n-web

http://cs224n.stanford.edu
HTML
60
star
22

ColBERT-QA

Code for Relevance-guided Supervision for OpenQA with ColBERT (TACL'21)
40
star
23

stanza-train

Model training tutorials for the Stanza Python NLP Library
Python
37
star
24

phrasenode

Mapping natural language commands to web elements
Python
37
star
25

contract-nli-bert

A baseline system for ContractNLI (https://stanfordnlp.github.io/contract-nli/)
Python
29
star
26

color-describer

Code for Learning to Generate Compositional Color Descriptions
OpenEdge ABL
26
star
27

stanza-resources

23
star
28

python-corenlp-protobuf

Python bindings for Stanford CoreNLP's protobufs.
Python
20
star
29

miniwob-plusplus-demos

Demos for the MiniWoB++ benchmark
17
star
30

multi-distribution-retrieval

Code for our paper Resources and Evaluations for Multi-Distribution Dense Information Retrieval
Python
14
star
31

huggingface-models

Scripts for pushing models to huggingface repos
Python
11
star
32

nlp-meetup-demo

Java
8
star
33

sentiment-treebank

Updated version of SST
Python
8
star
34

en-worldwide-newswire

An English NER dataset built from foreign newswire
Python
7
star
35

plot-data

datasets for plotting
Jupyter Notebook
6
star
36

contract-nli

ContractNLI: A Dataset for Document-level Natural Language Inference for Contracts
HTML
4
star
37

plot-interface

Web interface for the plotting project
JavaScript
3
star
38

handparsed-treebank

Extra hand parsed data for training models
Perl
2
star
39

coqa

CoQA -- A Conversational Question Answering Challenge
Shell
2
star
40

pdf-struct-models

A repository for hosting models for https://github.com/stanfordnlp/pdf-struct
HTML
2
star
41

chirpy-parlai-blenderbot-fork

A fork of ParlAI supporting Chirpy Cardinal's custom neural generator
Python
2
star
42

wob-data

Data for QAWoB and FlightWoB web interaction benchmarks from the World of Bits paper (Shi et al., 2017).
Python
2
star
43

pdf-struct-dataset

Dataset for pdf-struct (https://github.com/stanfordnlp/pdf-struct)
HTML
1
star
44

nn-depparser

A re-implementation of nndep using PyTorch.
Python
1
star