• Stars
    star
    144
  • Rank 255,590 (Top 6 %)
  • Language
    C++
  • License
    MIT License
  • Created over 6 years ago
  • Updated 8 months ago

Reviews

There are no reviews yet. Be the first to send feedback to the community and the maintainers!

Repository Details

alignment to variation graph inducer

seqwish

build and test install with bioconda

These sequences wish they were squished into a graph.

a variation graph inducer

seqwish implements a lossless conversion from pairwise alignments between sequences to a variation graph encoding the sequences and their alignments. As input we typically take all-versus-all alignments, but the exact structure of the alignment set may be defined in an application specific way. This algorithm uses a series of disk-backed sorts and passes over the alignment and sequence inputs to allow the graph to be constructed from very large inputs that are commonly encountered when working with large numbers of noisy input sequences. Memory usage during construction and traversal is limited by the use of sorted disk-backed arrays and succinct rank/select dictionaries to record a queryable version of the graph.

Citation:

Erik Garrison, Andrea Guarracino, Unbiased pangenome graphs, Bioinformatics, Volume 39, Issue 1, January 2023, btac743, https://doi.org/10.1093/bioinformatics/btac743

squish graph induction algorithm

As input we have Q, which is a concatenation of the sequences from which we will build the graph. We build an compressed suffix array (CSA) mapping sequence names to offsets in Q, and also the inverse using a rank/select dictionary on a bitvector marking the starts of sequences in Q. This allows us to map between positions in the sequences of Q, which is the format in which alignment algorithms typically express alignments, and positions in Q itself, which is the coordinate space we will use as a basis for the generation of our graph. To relate the sequences in Q to each other we apply a function map to generate alignments A. Although these alignments tend to be represented using oriented interval pairs in Q, for simplicity and robustness to graph complexity, we describe A as a vector of pairs of bidirectional positions (sequence offsets and strands) b in Q , such that A = [(bq, br), ... ]. We sort A by the first member (bq) of each pair, ensuring that the entries in A are ordered according to their order in Q.

To query the induced graph we build a rank/select dictionary allowing efficient traversal of A, based on a bit vector Abv of the same length as A such that we record a 1 at those positions which correspond to the first instance of a given bq and record a 0 in Abv otherwise. We record which bq we have processed in the bitvector Qseen which is of the same length as Q. This allows us to avoid a quadratic penalty in the order of the size of the transitive closures in Q given by the map function.

Now we inductively derive the graph implied by the alignments. For each base bq in Q, we find its transitive closure cq := [bq, br1, ... ] via the map operation by traversing the aligned base pairs recorded in A. We write the character of the base bq to a vector S, then for each bc in cq we record a pair [si, bc] into N and its reverse, [bc, si] into P. We mark Qseen for each base in each emitted cluster, and we do not consider marked bases in subsequent transitive closures. By sorting N and P by their first entries, we can build rank/select dictionaries on them akin to that we built on A that allow random access by graph base (as given in S) or input base (as given in Q).

To fully induce the variation graph we need to establish the links between bases in S that would be required for us to find any sequence in the input as a walk through the graph. We do so by rewriting Q (in both the forward and reverse orientation) in terms of pairs of bases in S, then sorting the resulting pairs by their first element, which yields L = [(ba, bb), ... ]. These pairs record the links and their frequencies, which we can emit or filter (such as by frequency) as needed given particular applications. In typical use we take the graph to be given by the unique elements of L.

Our data model encodes the graph using single-base nodes, but often downstream use requires identifying nodes and thus we benefit from compressing the unitigs of the graph into single nodes, which reduces memory used by identifiers in analysis. We can compress the node space of the graph by traversing S, and for each base querying the inbound links. Maintaining a bitvector Sid of length equal to S we mark each base at which we see any link other than one from or to the previous base on the forward or reverse strand, or at bases where we have no incoming links. By building a rank/select dictionary on Sid we can assign a smaller set of node ids to the sequence space of the graph.

Given the id space encoded by Sid we can materialize the graph in a variety of interchange formats, or provide id-based interfaces to the indexed squish graph. To generate graphs in .vg or GFAv1 format, we want to decompose the graph into its nodes (S), edges (L) and paths (P). The nodes are given by S and Sid, and similarly we project L and P through Sid to obtain a compressed variation graph.

using our graph

At present we rest on existing tools, such as vg, GCSA2, GBWT, and other GFAv1-compatible tools to support downstream uses for the graph. We observe that it will be trivial to use this data model as the basis for various graph traversal and modification algorithms. Implementation of these features will follow. We plan to link into the handle graph model developed in the vg project, which should allow the application of any algorithms defined using that interface on the backing graph developed here. Users familiar with concepts in assembly graphs will notice many similarities between the graph induction algorithm and typical steps in assembly algorithms, and we intend to explore these as well using this platform.

building

dependencies

You'll need basic C++ build tools, cmake, and zlib. On Ubuntu (and probably Debian) systems, these can be installed with:

sudo apt install build-essential cmake zlib1g-dev libjemalloc-dev

On Arch Linux, the jemalloc dependency can be installed with:

sudo pacman -S jemalloc     # arch linux

build process

seqwish uses cmake to build itself and its dependencies.

git clone --recursive https://github.com/ekg/seqwish.git
cd seqwish
cmake -H. -Bbuild && cmake --build build -- -j 3

To clean up simply remove build/ and bin/:

rm -rf build bin

A static build can be obtained by setting a flag in the cmake build setup.

cmake -DBUILD_STATIC=1 -H. -Bbuild && cmake --build build -- -j 3

You'll need to set this flag to 0 or remove and rebuild your build directory if you want to unset this behavior. Static builds are unlikely to be supported on OSX, and require appropriate static libraries on linux.

clang

If you want to use clang, be sure to install the correct version of OpenMP. For example, if you have clang version 14, you have to install libomp-14-dev:

sudo apt -y install libomp-14-dev

To build seqwish with clang, execute:

cmake -H. -Bbuild -DCMAKE_BUILD_TYPE=Release -DCMAKE_C_COMPILER=/usr/bin/clang -DCMAKE_CXX_COMPILER=/usr/bin/clang++ && cmake --build build -- -j 3

Notes for distribution and ARM64 systems

If you need machine-specific optimizations, use -DEXTRA_FLAGS to specify your architecture.

For example, on Linux ARM64 systems, do:

cmake -H. -Bbuild -DCMAKE_BUILD_TYPE=Generic -DEXTRA_FLAGS='-march=armv8-a' && cmake --build build -- -j 3

For Docker image distribution, -march=haswell works decently:

cmake -H. -Bbuild -DCMAKE_BUILD_TYPE=Generic -DEXTRA_FLAGS='-march=haswell' && cmake --build build -- -j 3

Docker

Alternatively, you may build a Docker image that contains seqwish.

docker build -t seqwish .

Bioconda

seqwish recipes for Bioconda are available at https://bioconda.github.io/recipes/seqwish/README.html. To install the latest version using Conda execute:

conda install -c bioconda seqwish

Guix

installing via the guix-genomics git repository

First, clone the guix-genomics repository:

git clone https://github.com/ekg/guix-genomics

And install the seqwish package to your default GUIX environment:

GUIX_PACKAGE_PATH=. guix package -i seqwish

Now seqwish is available as a global binary installation.

installing via the guix-genomics channel

Add the following to your ~/.config/guix/channels.scm:

  (cons*
(channel
  (name 'guix-genomics)
  (url "https://github.com/ekg/guix-genomics.git")
  (branch "master"))
%default-channels)

First, pull all the packages, then install seqwish to your default GUIX environment:

guix pull
guix package -i seqwish

If you want to build an environment only consisting of the seqwish binary, you can do:

guix environment --ad-hoc seqwish

For more details about how to handle Guix channels, go to https://git.genenetwork.org/guix-bioinformatics/guix-bioinformatics.git.

usage

seqwish supports PAF format output of several sequence aligners, like wfmash and minimap2. It requires the CIGAR string of the alignment to be provided in the cg:z: optional field. It uses large temporary files during the construction. By default, these are prefixed with the output GFA file name, but this can be changed with the -b[base], --base=[base] command line argument. The input sequences can be in FASTA or FASTQ format, either in plain text or gzipped. It writes GFA1 on its standard output.

wfmash

single PAF file

wfmash x.fa x.fa -X > x.paf
seqwish -s x.fa -p x.paf -g x.gfa

multiple PAF file

wfmash c.fa a.fa > a.paf
wfmash c.fa b.fa > b.paf
cat a.fa b.fa c.fa > abc.fa
seqwish -s abc.fa -p a.paf,b.paf -g abc.gfa

minimap2

minimap2 does not emit the CIGAR string in PAF output by default. To do this, specify the -c flag:

minimap2 x.fa x.fa -c -X > x.paf
seqwish -s x.fa -p x.paf -g x.gfa

TODO

  • describe algorithm
  • implement rank/select dictionary class based on disk-backed radix sort of a binary array
  • implement squish graph induction algorithm
  • explore extensions via graph rewriting, and the handle graph concept
  • explore assembly problems via graph filtering and cleaning operations

More Repositories

1

intervaltree

a minimal C++ interval tree implementation
C++
208
star
2

alignment-and-variant-calling-tutorial

basic walk-throughs for alignment and variant calling from NGS sequencing data
191
star
3

fraction.js

A fraction math library in javascript.
JavaScript
163
star
4

fastahack

utilities for indexing and sequence extraction from FASTA files
C++
57
star
5

bamaddrg

adds sample names and read-group (RG) tags to BAM alignments
C++
48
star
6

glia

a string to graph aligner
C++
40
star
7

thesis

my PhD thesis
TeX
34
star
8

edyeet

base-accurate DNA sequence alignments using edlib and mashmap2
C++
33
star
9

mmmulti

memory mapped multimap, multiset, and implicit interval tree based on an in-place parallel sort
C++
27
star
10

split

split and join for C++ strings
C++
26
star
11

gimbricate

recompute GFA link overlaps
C++
25
star
12

pafplot

base-level dotplots from PAF alignments
Rust
23
star
13

multichoose

generate multiset combinations (n multichoose k)
C++
23
star
14

guix-genomics

guix packages for bioinformatics software
Scheme
22
star
15

HLA-zoo

genome variation graphs constructed from HLA GRCh38 ALTs
21
star
16

gafpack

convert variation graph alignments to coverage maps over nodes
Rust
20
star
17

hhga

haplotypes genotypes and alleles example decision synthesizer
C++
20
star
18

multipermute

efficient multiset permutations
TypeScript
20
star
19

1000G-integration

variant integration methods for the 1000 Genomes Project
Shell
20
star
20

ogap

gap opening realigner for BAM data streams
C++
18
star
21

viral-assembly

exploring viral genome assembly with variation graph tools
Shell
18
star
22

shuntingyard

implementations of Dijkstra's shunting-yard algorithm for infix notation parsing
Python
18
star
23

smithwaterman

smith-waterman-gotoh alignment algorithm
C++
16
star
24

drawbwt

vector illustrations of BWT based searching on small strings
C++
16
star
25

openconnect-sso-docker

connect to a cisco anyconnect server with 2FA via web SSO on a headless server
Shell
16
star
26

mutatrix

genome simulation across a population with zeta-distributed allele frequency, snps, insertions, deletions, and multi-nucleotide polymorphisms
C++
14
star
27

pca

PCA in rust
Rust
14
star
28

fastix

Prefix-renaming FASTA records really fast.
Rust
13
star
29

yllm

Yet another LLM command line interface
Shell
12
star
30

ssh.py

synchronous, serial, and parallel ssh, scp, and ping methods in python
Python
12
star
31

bitcoin-fs

encode arbitrary data in bitcoin transactions
JavaScript
11
star
32

wflign

the we-flyin WFA-guided ultralong tiling sequence aligner
C++
11
star
33

ACAD18

Day 2 of ACAD's 2018 Advanced Bioinformatics Workshop
11
star
34

dozyg

sequence to graph mapper
C++
11
star
35

succinct-graph

compressed, queryable variation graphs
C++
11
star
36

hapviz

indel haplotype visualization on the command line from BAM files
C++
11
star
37

pafcheck

PAF (pairwise alignment format) validator based on extended CIGAR strings
Rust
11
star
38

USDA-SR22-importer

assists in importing the USDA's SR22 nutritional database from MS Access to MySQL
Shell
10
star
39

splitfa

split a FASTA sequence file into shorter sequences
Rust
10
star
40

foxsamply

algorenderer / a script to convert FoxDot code to recorded samples
Shell
10
star
41

yeast-pangenome

yeast pangenome
Shell
9
star
42

loq

talk to type
Shell
9
star
43

wflambda

WFΞ» - wavefront alignment with callback match function
C++
8
star
44

bkmers

binary kmers written numerically
Rust
7
star
45

hamming-fasta

brute force online sequence similarity search
Rust
7
star
46

gaffy

GAF (graph alignment format) command line utility
Rust
7
star
47

dirtyzipf

zipfian int distributions using a fast approximation for pow
C++
7
star
48

interleave-fastq

interleaves fastq files (also gzipped ones!)
Python
7
star
49

beats

snippets of music in foxdot
Python
7
star
50

rsvd

rSVD in rust
Rust
6
star
51

arfer

annotation of variants using genotype likelihoods
C++
6
star
52

kmers

generate kmer frequency information from text streams
C++
6
star
53

xbzipLib

XBzip/XBWT implementation from "Compressing and indexing labeled trees, with applications," P Ferragina et al.
C
6
star
54

jvcf

Joint Variant Call Format -- JSON notation for genetic variant annotation
6
star
55

bamprofile

profiles alignment mismatches and gaps
C++
5
star
56

atomicbitvector

atomic bitset/bitvector with size determined at runtime
C++
5
star
57

sgd2

header-only large graph layout via stochastic gradient descent
C++
4
star
58

3q29

3q29 pangenome build from HPRC year 1 samples
4
star
59

pafcov

pairwise alignment file coverage metric
Rust
4
star
60

pafnet

PAF alignment to network format converter
Rust
4
star
61

poa-motifs

exploring variation graph models for motifs based on partial order alignment
4
star
62

paryfor

parallel_for based on atomic queues and C++11 threads
C++
4
star
63

phred.py

phred quality conversion to and from float and log space
Python
4
star
64

intervalstab

interval stabbing (pointwise range lookup) using the FastStabbing algorithm
C++
4
star
65

edlign

exploring identity-based alignment chaining
C++
4
star
66

pafciggy

use the cigar string in the PAF to correct target and query endpoint issues
Rust
4
star
67

graphappy

a variation graph-oriented fork of WhatsHap
C++
4
star
68

lru_cache

A thread-safe LRU cache implementation in C++
C++
4
star
69

aptertree

Apter Trees in rust
Rust
3
star
70

kmerj

kmer counting comparisons in low memory
C++
3
star
71

entropy

shannon entropy
Python
3
star
72

ssw

striped smith waterman implementation
C
3
star
73

simgenie

diplotype caller simulation on pangenie and freebayes
Shell
3
star
74

datscriptish

it's kind've like make, but not really. and it's for streams!
3
star
75

drosophila

drosophila genome assembly and analysis
Shell
3
star
76

tsvtools

tools for manipulating tab-separated-values files, mostly in C++
C++
3
star
77

endian

c++ header library for manipulating endianness
C
3
star
78

uniqprimers

determines if the primers required to assay VCF records are unique (provided BAM alignments of the primers and a VCF file of the variants)
C++
3
star
79

shastaGFA

scripts for interpreting GFAs made by the shasta assembler
Shell
2
star
80

wikiq

an xml to .tsv converter for wikimedia XML data dumps
C++
2
star
81

leveldb-snappy

demonstrates how to build leveldb with snappy support using the static libraries for each and git submodules
C++
2
star
82

gprof-compare

compare call counts in two gprof profile outputs
Python
2
star
83

vatfilter

extensions for manipulating VCF files annotated with the Variant Annotation Tool (VAT)
Python
2
star
84

qllm

query local LLM openai-API-compatible endpoints
Rust
2
star
85

tsvsplit

split tsv files with headers
2
star
86

nuller

null genotype caller which generates depth of coverage VCF
C++
2
star
87

genmusic

Python
2
star
88

subarch-select

pick an executable based on CPU capabilities
C++
2
star
89

bad-python-list

equal lists should serialize equally, but don't
Python
2
star
90

phage

phage assembly and analysis scripts
R
2
star
91

gatk-swe

Perl
2
star
92

factorial-ln

compute log(n!) of huge numbers
JavaScript
2
star
93

bamquality

read quality distributions for BAM and FASTQ files
C++
2
star
94

strswap

command line tool to swap string patterns in text files
Rust
2
star
95

diatoms

contig ordering using linkage disequilibrium
R
2
star
96

embeddna

test of DNA embeddings using a transformer
Python
2
star
97

vgp

analysis sketches for vertebrate genomes project assemblies
R
2
star
98

protobufstream

protocol buffers streams
C++
1
star
99

tidalcycles-stream

stream a multiuser tidalcycles session
Shell
1
star
100

fasta-utilities

utility scripts and programs for manipulating fasta files
Python
1
star