• Stars
    star
    402
  • Rank 107,380 (Top 3 %)
  • Language
    C
  • License
    MIT License
  • Created almost 6 years ago
  • Updated 6 months ago

Reviews

There are no reviews yet. Be the first to send feedback to the community and the maintainers!

Repository Details

Sequence-to-graph mapper and graph generator

Build Status

Getting Started

git clone https://github.com/lh3/minigraph
cd minigraph && make
# Map sequence to sequence, similar to minimap2 without base alignment
./minigraph test/MT-human.fa test/MT-orangA.fa > out.paf
# Map sequence to graph
./minigraph test/MT.gfa test/MT-orangA.fa > out.gaf
# Incremental graph generation (-l10k necessary for this toy example)
./minigraph -cxggs -l10k test/MT.gfa test/MT-chimp.fa test/MT-orangA.fa > out.gfa
# Call per-sample path in each bubble/variation (-c not needed for this)
./minigraph -xasm -l10k --call test/MT.gfa test/MT-orangA.fa > orangA.call.bed

# The lossy FASTA representation (requring https://github.com/lh3/gfatools)
gfatools gfa2fa -s out.gfa > out.fa
# Extract localized structural variations
gfatools bubble out.gfa > SV.bed

Table of Contents

Introduction

Minigraph is a sequence-to-graph mapper and graph constructor. For graph generation, it aligns a query sequence against a sequence graph and incrementally augments an existing graph with long query subsequences diverged from the graph. The figure on the right briefly explains the procedure.

Minigraph borrows ideas and code from minimap2. It is fairly efficient and can construct a graph from 90 human assemblies in a couple of days using 24 CPU cores. Older versions of minigraph was unable to produce base alignment. The latest version can. Please add option -c for graph generation as it generally improves the quality of graphs.

Users' Guide

Installation

To install minigraph, type make in the source code directory. The only non-standard dependency is zlib. For better performance, it is recommended to compile with recent compliers.

Sequence-to-graph mapping

To map sequences against a graph, you should prepare the graph in the GFA format, or preferrably the rGFA format. If you don't have a graph, you can generate a graph from multiple samples (see the Graph generation section below). The typical command line for mapping is

minigraph -cx lr graph.gfa query.fa > out.gaf

You may choose the right preset option -x according to input. Minigraph output mappings in the GAF format, which is a strict superset of the PAF format. The only visual difference between GAF and PAF is that the 6th column in GAF may encode a graph path like >MT_human:0-4001<MT_orang:3426-3927 instead of a contig/chromosome name.

The minigraph GFA parser seamlessly parses FASTA and converts it to GFA internally, so you can also provide sequences in FASTA as the reference. In this case, minigraph will behave like minimap2, though likely producing different alignments due to differences between the two implementations.

Graph generation

The following command-line generates a graph in rGFA:

minigraph -cxggs -t16 ref.fa sample1.fa sample2.fa > out.gfa

which is equivalent to

minigraph -cxggs -t16 ref.fa sample1.fa > sample1.gfa
minigraph -cxggs -t16 sample1.gfa sample2.fa > out.gfa

File ref.fa is typically the reference genome (e.g. GRCh38 for human). It can also be replaced by a graph in rGFA. Minigraph assumes sample1.fa to be the whole-genome assembly of an individual. This is an important assumption: minigraph only considers 1-to-1 orthogonal regions between the graph and the individual FASTA. If you use raw reads or put multiple individual genomes in one file, minigraph will filter out most alignments as they cover the input graph multiple times.

The output rGFA can be converted to a FASTA file with gfatools:

gfatools gfa2fa -s graph.gfa > out.stable.fa

The output out.stable.fa will always include the initial reference ref.fa and may additionally add new segments diverged from the initial reference.

Calling structural variations

A minigraph graph is composed of chains of bubbles with the reference as the backbone. Each bubble represents a structural variation. It can be multi-allelic if there are multiple paths through the bubble. You can extract these bubbles with

gfatools bubble graph.gfa > var.bed

The output is a BED-like file. The first three columns give the position of a bubble/variation and the rest of columns are:

  • (4) # GFA segments in the bubble including the source and the sink of the bubble
  • (5) # all possible paths through the bubble (not all paths present in input samples)
  • (6) 1 if the bubble involves an inversion; 0 otherwise
  • (7) length of the shortest path (i.e. allele) through the bubble
  • (8) length of the longest path/allele through the bubble
  • (9-11) please ignore
  • (12) list of segments in the bubble; first for the source and last for the sink
  • (13) sequence of the shortest path (* if zero length)
  • (14) sequence of the longest path (NB: it may not be present in the input samples)

Given an assembly, you can find the path/allele of this assembly in each bubble with

minigraph -cxasm --call graph.gfa sample-asm.fa > sample.bed

On each line in the BED-like output, the last colon separated field gives the alignment path through the bubble, the path length in the graph, the mapping strand of sample contig, the contig name, the approximate contig start and contig end. The number of lines in the file is the same as the number of lines in the output of gfatools bubble. You can use the paste Unix command to piece multiple samples together.

Prebuilt graphs

Prebuilt human graphs in the rGFA format can be found at Zenodo.

Algorithm overview

In the following, minigraph command line options have a dash ahead and are highlighted in bold. The description may help to tune minigraph parameters.

  1. Read all reference bases, extract (-k,-w)-minimizers and index them in a hash table.

  2. Read -K [=500M] query bases in the mapping mode, or read all query bases in the graph construction mode. For each query sequence, do step 3 through 5:

  3. Find colinear minimizer chains using the minimap2 algorithm, assuming segments in the graph are disconnected. These are called linear chains.

  4. Perform another round of chaining, taking each linear chain as an anchor. For a pair of linear chains, minigraph tries to connect them by doing graph wavefront alignment algorithm (GWFA). If minigraph fails to find an alignment within an edit distance threshold, it will find up to 15 shortest paths between the two linear chains and chooses the path of length closest to the distance on the query sequence. Chains found at this step are called graph chains.

  5. Identify primary chains and estimate mapping quality with a method similar to the one used in minimap2. Perform base alignment.

  6. In the graph construction mode, collect all mappings longer than -d [=10k] and keep their query and graph segment intervals in two lists, respectively.

  7. For each mapping longer than -l [=100k], finds poorly aligned regions. A region is filtered if it overlaps two or more intervals collected at step 6.

  8. Insert the remaining poorly aligned regions into the input graph. This constructs a new graph.

Limitations

  • A complex minigraph subgraph is often suboptimal and may vary with the order of input samples. It may not represent the evolution history or the functional relevance at the locus. Please do not overinterpret complex subgraphs. If you are interested in a particular subgraph, it is recommended to extract the input contig subsequences involved in the subgraph with the --call option and manually curated the results.

  • Minigraph needs to find strong colinear chains first. For a graph consisting of many short segments (e.g. one generated from rare SNPs in large populations), minigraph will fail to map query sequences.

  • The base alignment in the current version of minigraph is slow for species of high diversity.

More Repositories

1

minimap2

A versatile pairwise aligner for genomic and spliced nucleotide sequences
C
1,754
star
2

bwa

Burrow-Wheeler Aligner for short-read alignment (see minimap2 for long-read alignment)
C
1,504
star
3

seqtk

Toolkit for processing sequences in FASTA/Q formats
C
1,363
star
4

bioawk

BWK awk modified for biological data
C
590
star
5

miniprot

Align proteins to genomes with splicing and frameshift
C
316
star
6

miniasm

Ultrafast de novo assembly for long noisy reads (though having no consensus step)
TeX
302
star
7

wgsim

Reads simulator
C
258
star
8

gfatools

Tools for manipulating sequence graphs in the GFA and rGFA formats
C
208
star
9

biofast

Benchmarking programming languages/implementations for common tasks in Bioinformatics
C
175
star
10

readfq

Fast multi-line FASTA/Q reader in several programming languages
C
170
star
11

cgranges

A C/C++ library for fast interval overlap queries (with a "bedtools coverage" example)
C
165
star
12

kmer-cnt

Code examples of fast and simple k-mer counters for tutorial purposes
C++
162
star
13

pangene

Constructing a pangenome gene graph
C
158
star
14

psmc

Implementation of the Pairwise Sequentially Markovian Coalescent (PSMC) model
C
147
star
15

bedtk

A simple toolset for BED files (warning: CLI may change before bedtk becomes stable)
C
134
star
16

ksw2

Global alignment and alignment extension
C
127
star
17

yak

Yet another k-mer analyzer
C
111
star
18

fermikit

De novo assembly based variant calling pipeline for Illumina short reads
TeX
108
star
19

minimap

This repo is DEPRECATED. Please use minimap2, the successor of minimap.
C
106
star
20

hickit

TAD calling, phase imputation, 3D modeling and more for diploid single-cell Hi-C (Dip-C) and general Hi-C
C
105
star
21

bgt

Flexible genotype query among 30,000+ samples whole-genome
C
96
star
22

dipcall

Reference-based variant calling pipeline for a pair of phased haplotype assemblies
JavaScript
92
star
23

srf

SRF: Satellite Repeat Finder
TeX
86
star
24

unimap

A EXPERIMENTAL fork of minimap2 optimized for assembly-to-reference alignment
C
85
star
25

minipileup

Simple pileup-based variant caller
C
79
star
26

fermi

A WGS de novo assembler based on the FMD-index for large genomes
C
74
star
27

dna-nn

Model and predict short DNA sequence features with neural networks
C
72
star
28

fermi-lite

Standalone C library for assembling Illumina short reads in small regions
C
72
star
29

bfc

High-performance error correction for Illumina resequencing data
TeX
68
star
30

ropebwt2

Incremental construction of FM-index for DNA sequences
TeX
67
star
31

tabtk

Toolkit for processing TAB-delimited format
C
59
star
32

gwfa

Proof-of-concept implementation of GWFA for sequence-to-graph alignment
C
56
star
33

CHM-eval

TeX
49
star
34

miniwfa

A reimplementation of the WaveFront Alignment algorithm at low memory
C
49
star
35

jstreeview

Interactive phylogenetic tree viewer/editor
JavaScript
46
star
36

samtools

This is *NOT* the official repository of samtools.
C
46
star
37

etrf

Exact Tandem Repeat Finder (not a TRF replacement)
C
45
star
38

ref-gen

Human reference genome analysis sets
Makefile
44
star
39

bioseq-js

For live demo, see http://lh3lh3.users.sourceforge.net/bioseq.shtml
HTML
37
star
40

lv89

C implementation of the Landau-Vishkin algorithm
C++
35
star
41

partig

An experimental tool to estimate the similarity between all pairs of contigs
C
35
star
42

asub

A unified array job submitter for LSF, SGE/UGE and Slurm
Perl
32
star
43

klib.nim

Experimental getopt, gzip reader, FASTA/Q parser and interval queries in nim-lang
Nim
32
star
44

calN50

Compute N50/NG50 and auN/auNG
JavaScript
31
star
45

sdust

Symmetric DUST for finding low-complexity regions in DNA sequences
C
31
star
46

gffio

C
31
star
47

pre-pe

Preprocessing paired-end reads produced with experiment-specific protocols
C
31
star
48

hapdip

The CHM1-NA12878 benchmark for single-sample SNP/INDEL calling from WGS Illumina data
JavaScript
30
star
49

fermi2

C
26
star
50

misc

Useful small programs
C
26
star
51

varcmp

The first CHM1 paper (Li, 2014)
TeX
25
star
52

minisv

Lightweight mosaic/somatic SV caller for long reads (WIP)
JavaScript
24
star
53

lianti

Tools to process LIANTI sequence data
C
23
star
54

rtgeval

Wrapper for RTG's vcfeval; DEPRECATED!
Shell
21
star
55

nasw

Dynamic programming for aa-to-nt alignment with affine gap, splicing and frameshift
C
18
star
56

sgdp-fermi

FermiKit small variant calls for public SGDP samples
17
star
57

gfa1

This repo is deprecated. Please use gfatools instead.
C
16
star
58

pubLRasm

16
star
59

PortableCrystal

Portable Crystal binary distributions for Linux on x86_64
15
star
60

foreign

Modified or extracted from other programs
C
15
star
61

trimadap

Fast but inaccurate adapter trimmer for Illumina reads
C
14
star
62

lh3-snippets

C
14
star
63

ropebwt3

Construction and utility of BWT for DNA string sets
C
13
star
64

treebest

TreeBeST: Tree Building guided by Species Tree
C
13
star
65

unicall

A wrapper for calling small variants from human germline high-coverage single-sample Illumina data
Perl
12
star
66

fastARG

Fast heuristic ARG construction
C
12
star
67

proot-wrapper

Demonstrating the PRoot program
Perl
12
star
68

rmaxcut

An experimental tool to find approximate max-cuts in a large graph
C
11
star
69

bwa-docker

Minimal docker image for bwa. Not developed any more.
11
star
70

sdg

EXPERIMENTAL implementation of side graph
C
10
star
71

naivepca

Naive PCA for genotype data
C
10
star
72

mdust

mdust from DFCI Gene Indices Software Tools (archived for a historical record only)
C
10
star
73

editdist-U85

Fast implementation of Ukkenon's O(ND) algorithm for computing edit distance
C
9
star
74

lh3.github.com

TeX
9
star
75

libdivsufsort

Automatically exported from code.google.com/p/libdivsufsort
C
8
star
76

mem-paper

Manuscript for BWA-MEM
6
star
77

bcf2

Experimental bcftools port to support BCF2; DEPRECATED by htslib and htsbox
C
6
star
78

thesis

PhD thesis
TeX
5
star
79

ropebwt

C
4
star
80

fermi-paper

The first fermi paper (Li, 2012)
3
star
81

crlf

Concise Run-Length Format for small alphabets; DEPRECATED
C
3
star
82

psnw

prototype
C
2
star
83

centos5-vm

Instructions on how to deploy CentOS 5 virtual machines
2
star
84

mag2gfa

DEPRECATED. Code has been moved to lh3/gfa1/misc
C
2
star
85

ibsget

Download files from Illumina BaseSpace (*OUTDATED* as BaseSpace has changed APIs)
C
2
star
86

smtl-paper

Samtools statistics paper (Li, 2011)
Lua
1
star
87

mssa-bench

Evaluating the performance of multi-string SA construction
C
1
star
88

samtools-legacy

For testing only. DON'T USE!
C
1
star