• Stars
    star
    107
  • Rank 323,587 (Top 7 %)
  • Language
    C
  • License
    GNU General Publi...
  • Created over 7 years ago
  • Updated 7 months ago

Reviews

There are no reviews yet. Be the first to send feedback to the community and the maintainers!

Repository Details

An efficient index for the colored, compacted, de Bruijn graph

C/C++ CI

Table of Contents

What is Puffaligner

Puffaligner is a fast, sensitive and accurate aligner built on top of the Pufferfish index. It tries to occupy a less-well-explored position in the space of read aligners, typically using more memory than BWT-based approaches (unless there are highly repetitive references), but considerably less than very fast but memory-hungry aligners like STAR. Puffaligner is based on hashing relatively long seeds and then extending them to MEMs, and so it is very fast (typically much faster than approaches based on arbitrary pattern matching in the BWT). It takes a seed -> chain -> align approach similar to BWA-MEM and minimap2.

It supports aligning to transcriptome as well as genome, but currently supports only contiguous alignments (i.e. spliced-alignment is not yet implemented). However, the design means that one can use Puffalign to align reads to a collection of genomes (or/and transcriptomes), as well as to a joint index that contains both the genome and the spliced transcripts.

Another feature of the aligner is introducing a list of decoys along with the main sequences to the index. A read is discarded if it aligns better to a decoy sequence rather than a sequence in the main list. One of the usecases of such feature is in improving the transcript alignment accuracy in case of retained introns, processed pseudogenes, etc..

The main steps in the aligner are:

  1. find the first unmapped kmer from the read in the pufferfish index
  2. extend the mapping to a uni-MEM (Maximal Extended Match on a unitig) between read and index
  3. repeat 1 and 2 for the next uni-MEM until reaching the end of the read
  4. project uni-MEMs to reference-based MEMs and compact them.
  5. find the best chain of the MEMs (adopted from minimap2 chaining)
  6. align the gaps between the MEMs and at the edges of the read
  7. find the best pair of the reads in case of paired-end
  8. recover orphans

There are a series of heuristics and best-practices used to improve both the performance and accuracy of the results.

What is Pufferfish

short answer : Pufferfish is a new time and memory-efficient data structure for indexing a compacted, colored de Bruijn graph (ccdBG). You can read more about pufferfish in the paper, which appeared at ISMB 2018.

long answer : Though the de Bruijn Graph (dBG) has enjoyed tremendous popularity as an assembly and sequence comparison data structure, it has only relatively recently begun to see use as an index of the reference sequences (e.g. deBGA, kallisto). Particularly, these tools index the compacted dBG (cdBG), in which all non-branching paths are collapsed into individual nodes and labeled with the string they spell out. This data structure is particularly well-suited for representing repetitive reference sequences, since a single contig in the cdBG represents all occurrences of the repeated sequence. The original positions in the reference can be recovered with the help of an auxiliary "contig table" that maps each contig to the reference sequence, position, and orientation where it appears as a substring. The deBGA paper has a nice description how this kind of index looks (they call it a unipath index, because the contigs we index are unitigs in the cdBG), and how all the pieces fit together to be able to resolve the queries we care about.  Moreover, the cdBG can be built on multiple reference sequences (transcripts, chromosomes, genomes), where each reference is given a distinct color (or colour, if you're of the British persuasion). The resulting structure, which also encodes the relationships between the cdBGs of the underlying reference sequences, is called the compacted, colored de Bruijn graph (ccdBG).  This is not, of course, the only variant of the dBG that has proven useful from an indexing perspective. The (pruned) dBG has also proven useful as a graph upon which to build a path index of arbitrary variation / sequence graphs, which has enabled very interesting and clever indexing schemes like that adopted in GCSA2 (which we won't discuss further here, but which I hope to cover in more detail in a future post).  Also, thinking about sequence search in terms of the dBG has led to interesting representations for variation-aware sequence search backed by indexes like the vBWT (implemented in the excellent gramtools package).

While existing hash-based indices based on the cdBG (and ccdBG) are very efficient for search, they typically occupy a large amount of space in memory (both during construction and even when built). As a result, to make use of such data structures on large reference sequences (e.g., the human genome) or collections of reference sequences (e.g., in a metagenomic context), one typically requires a very large memory machine — if the structures can be built at all. Pufferfish implements a new and much more compact data structure for indexing the ccdBG. While maintaining very efficient queries, this allows Pufferfish to index reference sequences while reducing the memory requirements considerably (by an order-of-magnitude or more). This greatly reduces the memory burden for indexing reference sequences and makes it possible to build hash-based indexes of sequences of size that were not previously feasible.

about pufferfish development: Currently, Pufferfish is the software implementing this efficient ccdBG index, and allowing point (i.e., k-mer) queries. Pufferfish is under active development, but we want to be as open (and as useful to as many people) as possible early on.

branches: The master branch of pufferfish is not necessarily stable, but it should, at any given time contain a working version of the index. That is, breaking changes should not be pushed to master. The develop branch of pufferfish is guaranteed to be neither stable nor working at any given point, but a best-faith effort will be made to not commit broken code to this branch. For feature branches, all bets are off.

For more details about pufferfish, please check out our paper, as well as the blog post here.

How to Install

To build the pufferfish/puffaligner do the following,

> git clone [email protected]:COMBINE-lab/pufferfish.git
> cd pufferfish
> mkdir build
> cd build
> cmake ../
> make

How to Use

Programs used within pufferfish

Building a pufferfish index requires first having a compacted de Bruijn graph, for which we use a modified version of TwoPaCo. However, some modification of the TwoPaCo output is required for pufferfish to properly index the graph (e.g. a k-mer must appear at most once in the graph and palindromic contigs output by TwoPaCo must be removed). Thus we rely on a modified version of TwoPaCo which we bundle with pufferfish in the external directory.

To choose an appropriate filter size to pass to TwoPaCo to build the compacted dBG, we make use the the hyper-log-log implementation of ntCard. Because we use this as a library instead of an executable, and to avoid an external dependency to simply call one function, we bundle a modified version of that code with pufferfish and also include it in the external directory.

We are also dependent on SeqLib and hence all the libraries that it is dependent on such as bz2, lzma, and z for mapping part. So it is required to install these libraries on the system. However, we also have the selected libraries from seqlib that we use bundled with pufferfish repo, so the installation should work without any difficulties.

Core Operations

Building a pufferfish index

To build a pufferfish index, you can use the index command. It is used like so:

pufferfish index -r <fasta_file_to_index> -o <pufferfish index directory>

There are also optional parameters including -k (setting the kmer size -- default:31) , -s (the ability to build a sparser and smaller index), -p (control the number of threads used during construction), and -f (to provide an explicit filter size for TwoPaCo dBG construction).

Aligning via Puffaligner

To align a set of paired-end reads to the reference one can use the following command:

pufferfish align -i <pufferfish_index> -1 <readfile1> -2 <readfile2> -o <outputfile> 

The input read files can be compressed or uncompressed fastq files

Puffaligner can generate different types of output including SAM format. There is also an efficient binary format, which we call pam that can be generated using the option -p.

There are a variety of optional choices for changing the default thresholds for allowing more alignments, higher or lower scored alignments, only the best, or only one best alignment, orphans, discordants etc.


Pufferfish is now the main (and only) index used in Salmon when it is run in mapping-based mode (i.e. with selective-alignment).

More Repositories

1

salmon

🐟 🍣 🍱 Highly-accurate & wicked fast transcript-level quantification from RNA-seq reads using selective alignment
C++
765
star
2

alevin-fry

🐟 🔬🦀 alevin-fry is an efficient and flexible tool for processing single-cell sequencing data, currently focused on single-cell transcriptomics and feature barcoding.
Rust
159
star
3

RapMap

Rapid sensitive and accurate read mapping via quasi-mapping
C++
89
star
4

cuttlefish

Building the compacted de Bruijn graph efficiently from references or reads.
C++
79
star
5

oarfish

long read RNA-seq quantification
Rust
66
star
6

terminus

Rust
57
star
7

kmers

A bit-packed k-mer representation (and relevant utilities) for rust
Rust
47
star
8

simpleaf

A rust framework to make using alevin-fry even simpler
Rust
44
star
9

wasabi

Prepare Sailfish and Salmon output for downstream analysis
R
43
star
10

SalmonTools

Useful tools for working with Salmon output
C++
36
star
11

RapClust

Accurate, Lightweight Clustering of de novo Transcriptomes using Fragment Equivalence Classes
Python
30
star
12

piscem

Rust wrapper for the next generation (still currently in C++)
Rust
20
star
13

shoal

Improved multi-sample transcript abundance estimates using adaptive priors
C++
20
star
14

grangers

Rust
16
star
15

pufferfish2

Rust
15
star
16

EDS

💡 💾 💽 A simple, intuitive and Efficient single cell binary Data Storage format
Rust
15
star
17

pyroe

Python
15
star
18

grouper

Python
15
star
19

piscem-infer

Rust
14
star
20

mazu

A Rust library for building modular, fast and compact indexes over genomic data
Rust
13
star
21

rainbowfish

A succinct colored dBG representation
C++
12
star
22

quark

semi-reference-based short read compression
C++
11
star
23

minnow

C++
10
star
24

usefulaf

Useful scripts and tools related to alevin-fry
Rust
9
star
25

measuresmatter

A treatise on quantification and differential expression from RNA-seq data
8
star
26

combine-lab.github.io

HTML
8
star
27

quantaf

Nextflow
8
star
28

seine-rs

A (rust)🦀 library and suite of tools for manipulating and processing the output of salmon, alevin, and alevin-fry 🐟
Rust
7
star
29

piscem-cpp

A small sparse and fast reference index based on SShash and Tiling encoding
C
6
star
30

sc-census

R
6
star
31

GRASS

Graph-Regularized Annotation via Semi-Supervised learning
Python
6
star
32

TreeTerminus

C
5
star
33

COMBINE-lab.github.io-OLD

Lab website
HTML
5
star
34

matryoshka

Methods for the automated discovery of hierarchically-structured chromatin domains
C++
5
star
35

forseti

Python
5
star
36

pcalib

A small "lightweight" implementation of PCA in C++ using the Eigen library
C++
5
star
37

perplexity

Rust
5
star
38

alevin-paper-pipeline

A simple pipeline using CGAT framework for benchmarking and analysis.
Python
5
star
39

seq_geom_xform

A crate to convert "complex" sequence library geometries to "simple" geometries
Rust
4
star
40

efgdl-spec

Specification for the extended fragment geometry description language
4
star
41

seqproc

Rust
4
star
42

seq_geom_parser

Testing out rust parsing of single-cell library geometry specifications
Rust
4
star
43

roe

R
3
star
44

roers

Rust
3
star
45

protocol-estuary

Jsonnet
3
star
46

xz

A GitHub clone of the XZ repo (http://git.tukaani.org/xz.git) — not from the original author
C
3
star
47

LabRules

A list of rules and hints (on various different subjects) for members of the COMBINE lab. Others may also find them useful.
3
star
48

txome-clustering

Meaningful and efficient clustering of de novo transcriptome assembly results (better name pending).
Python
3
star
49

cqf-rust

An attempt at rustification of the CQF (https://github.com/splatlab/cqf)
C
2
star
50

libgff

C++
2
star
51

splitp

split-seq preprocessing
Rust
2
star
52

quant-tx-diversity

Jupyter Notebook
2
star
53

alevin-fry-paper-scripts

Jupyter Notebook
2
star
54

scrna-ambiguity

Jupyter Notebook
2
star
55

alevin-tutorial

This is a support website for Alevin-tool (part of Salmon).
1
star
56

radicl-cpp

Internal library for basic reading and writing of .rad files in C++14
C++
1
star
57

FastaDigest

Get the "signature" of fasta files
Python
1
star
58

rsrs

Reference Signatures (rs) in Rust (rs)
Rust
1
star
59

libradicl

Rust
1
star
60

simplr

1
star
61

KallistoFormatDescription

Documenting the (binary) output format used by Kallisto
1
star
62

pufferfish_experiments

Jupyter Notebook
1
star
63

CSE549-txpCoverage

C++
1
star
64

QuantAnalysis

Jupyter Notebook
1
star
65

radtk

Various tools for working with RAD files
Rust
1
star
66

lr_quant_benchmarks

A replicable and modular benchmark for long-read RNA transcript quantification methods
Python
1
star