• Stars
    star
    148
  • Rank 249,983 (Top 5 %)
  • Language
    C
  • License
    MIT License
  • Created over 8 years ago
  • Updated almost 3 years ago

Reviews

There are no reviews yet. Be the first to send feedback to the community and the maintainers!

Repository Details

Detect and visualize target mutations by scanning FastQ files directly

install with conda

MutScan

Detect and visualize target mutations by scanning FastQ files directly

Features

  • Ultra sensitive, guarantee that all reads supporting the mutations will be detected
  • Can be 50X+ faster than normal pipeline (i.e. BWA + Samtools + GATK/VarScan/Mutect).
  • Very easy to use and need nothing else. No alignment, no reference genome, no variant call, no...
  • Contains built-in most actionable mutation points for cancer-related mutations, like EGFR p.L858R, BRAF p.V600E...
  • Beautiful and informative HTML report with informative pileup visualization.
  • Multi-threading support.
  • Supports both single-end and pair-end data.
  • For pair-end data, MutScan will try to merge each pair, and do quality adjustment and error correction.
  • Able to scan the mutations in a VCF file, which can be used to visualize called variants.
  • Can be used to filter false-positive mutations. i.e. MutScan can handle highly repetive sequence to avoid false INDEL calling.

Application scenarios:

  • you are interested in some certain mutations (like cancer drugable mutations), and want to check whether the given FastQ files contain them.
  • you have no enough confidence with the mutations called by your pipeline, so you want to visualize and validate them to avoid false positive calling.
  • you worry that your pipeline uses too strict filtering and may cause some false negative, so you want to check that in a fast way.
  • you want to visualize the called mutation and take a screenshot with its clear pipeup information.
  • you called a lot of INDEL mutations, and you worry that mainly they are false positives (especially in highly repetive region)
  • you want to validate and visualize every record in the VCF called by your pipeline.
  • ...

Take a quick glance

mutscan -1 R1.fq.gz -2 R2.fq.gz

Get MutScan

install with Bioconda

install with conda

conda install -c bioconda mutscan

download binary

This binary is only for Linux systems: http://opengene.org/MutScan/mutscan

# this binary was compiled on CentOS, and tested on CentOS/Ubuntu
wget http://opengene.org/MutScan/mutscan
chmod a+x ./mutscan

or compile from source

# get source (you can also use browser to download from master or releases)
git clone https://github.com/OpenGene/MutScan.git

# build
cd mutscan
make

# Install
sudo make install

Windows version (may be not the latest version)

If you want to compile MutScan on Windows, you should use cygwin. We already built one with cygwin-2.6.0/g++ 5.4, and it can be downloaded from:   http://opengene.org/MutScan/windows_mutscan.zip

HTML report

  • A HTML report will be generated, and written to the given filename. See http://opengene.org/MutScan/report.html for an example.
  • If you run the command in your Linux server and want to view the HTML report on your local system. DO remember to copy all of the xxxx.html and xxxx.html.files and keep them in the same folder, then click xxxx.html to view it in browser.
  • The default file name is mutscan.html, and a folder mutscan.html.files will be also generated.
  • By default, an indivudal HTML file will be generated for each found mutation. But you can specify -s or --standalone to contain all mutations in a single HTML file. Be caution with this mode if you are scanning too many records (for example, scanning VCF), it will give you a very big HTML file and is not loadable by browser.
  • Here is a screenshot for the pileup of a mutation (EGFR p.T790M) generated by MutScan:

image

  • An pileup of EGFR p.T790M mutation is displayed above. EGFR p.T790M is a very important drugable mutation for lung cancer.
  • The color of each base indicates its quality, and the quality will be shown when mouse over.
  • In first column, d means the edit distance of match, and --> means forward, <-- means reverse

JSON report

JSON report is disabled by default. You can enable it by specifying a JSON file name using -j or --json. A JSON report is like this:

{
	"command":"./mutscan -1 /Users/shifu/data/fq/S010_20170320003-4_ffpedna_pan-cancer-v1_S10_R1_001.fastq -2 /Users/shifu/data/fq/S010_20170320003-4_ffpedna_pan-cancer-v1_S10_R2_001.fastq -h z.html -j z.json -v --simplified=off ",
	"version":"1.14.0",
	"time":"2018-05-15  15:48:21",
	"mutations":{
		"NRAS-neg-1-115258747-2-c.35G>C-p.G12A-COSM565":{
			"chr":"chr1",
			"ref":["TGGATTGTCAGTGCGCTTTTCCCAACACCA","G","CTGCTCCAACCACCACCAGTTTGTACTCAG"],
			"reads":[
				{
					"breaks":[31,61,62,76], 
					"seq":"ATATTCATCTACAAAGTGGTTCTGGATTAGCTGGATTGTCAGTGCGCTTTTCCCAACACCAGCTGCTCCAACCACC",
					"qual":"eeeeeiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiieiiiiiiiiiiieieeeee"
				},
				{
					"breaks":[31,61,62,76], 
					"seq":"ATATTCATCTACAAAGTGGTTCTGGATTAGCTGGATTGTCAGTGCGCTTTTCCCAACACCAGCTGCTCCAACCACC",
					"qual":"eeeeeiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiieeeee"
				}
			]
		},
		"PIK3CA-pos-3-178936082-9-c.1624G>A-E542K-COSM760":{
			"chr":"chr3",
			"ref":["AAAGCAATTTCTACACGAGATCCTCTCTCT","A","AAATCACTGAGCAGGAGAAAGATTTTCTAT"],
			"reads":[
				{
					"breaks":[22,52,53,83], 
					"seq":"GGAAAATGACAAAGAACAGCTCAAAGCAATTTCTACACGAGATCCTCTCTCTAAAATCACTGAGCAGGAGAAAGATTTTCCAAAGATGTTTCTCAGAACGCTGCAGTCTGCAATTTGTATGAATTCCC",
					"qual":"eeeeeiiiQiiiiiieiiiieiSeiiiiiie`iiii`i`iiiiiiiiiiiiii`iiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiaiiiiiiiiiiiiiiiiiieiiiiiieeeee"
				},
				{
					"breaks":[0,27,28,58], 
					"seq":"GCAATTTCTACACGAGATCCTCTCTCTAAAATCACTGCGCAGGAGAAAGATTTTCTATGGACCACAGGTAAGTGCTAAAATGGAGATTCTCTGTTTCTTTTTCTTTATTACAGAAAAAATAACTGACTTTGGCTGATCTCAGCATGTTTTTACCATACC",
					"qual":"AAAAAEEEEiieiiieiiiiiiiiiieiiiiiiiie``iiiiiieiiiiiiiiiieiiiieiieieeiiiSiiiiiieiiiiiiiiiiiiiieiiiiiSiiiiiiiiiiiiieiiiiiiiiiiii`ieiiieiii`ieiiiii`eS``eieEEEAAAAA"
				}
			]
		}
	}
}

All options

usage: mutscan -1 <read1_file> -2 <read2_file> [options]...
options:
  -1, --read1                read1 file name, required
  -2, --read2                read2 file name
  -m, --mutation             mutation file name, can be a CSV format or a VCF format
  -r, --ref                  reference fasta file name (only needed when mutation file is a VCF)
  -h, --html                 filename of html report, default is mutscan.html in work directory
  -j, --json                 filename of JSON report, default is no JSON report (string [=])
  -t, --thread               worker thread number, default is 4
  -S, --support              min read support required to report a mutation, default is 2.
  -k, --mark                 when mutation file is a vcf file, --mark means only process the records with FILTER column is M
  -l, --legacy               use legacy mode, usually much slower but may be able to find a little more reads in certain case
  -s, --standalone           output standalone HTML report with single file. Don't use this option when scanning too many target mutations (i.e. >1000 mutations)
  -n, --no-original-reads    dont output original reads in HTML and text output. Will make HTML report files a bit smaller
  -?, --help                 print this message

The plain text result, which contains the detected mutations and their support reads, will be printed directly. You can use > to redirect output to a file, like:

mutscan -1 <read1_file_name> -2 <read2_file_name> > result.txt

MutScan generate a very informative HTML file report, default is mutscan.html in the work directory. You can change the file name with -h argument, like:

mutscan -1 <read1_file_name> -2 <read2_file_name> -h report.html

single-end and pair-end

For single-end sequencing data, -2 argument is omitted:

mutscan -1 <read1_file_name>

multi-threading

-t argument specify how many worker threads will be launched. The default thread number is 4. Suggest to use a number less than the CPU cores of your system.

Mutation file

  • Mutation file, specified by -m, can be a CSV file, or a VCF file.
  • If no -m specified, MutScan will use the built-in default mutation file with about 60 cancer related mutation points.
  • If a CSV is provided, no reference genome assembly needed.
  • If a VCF is provided, corresponding reference genome assembly should be provided (i.e. ucsc.hg19.fasta), and should not be zipped.

CSV-format mutation file

A CSV file with columns of name, left_seq_of_mutation_point, mutation_seq, right_seq_of_mutation_point and chromosome(optional)

#name, left_seq_of_mutation_point, mutation_seq, right_seq_of_mutation_point, chromosome
NRAS-neg-1-115258748-2-c.34G>A-p.G12S-COSM563, GGATTGTCAGTGCGCTTTTCCCAACACCAC, T, TGCTCCAACCACCACCAGTTTGTACTCAGT, chr1
NRAS-neg-1-115252203-2-c.437C>T-p.A146V-COSM4170228, TGAAAGCTGTACCATACCTGTCTGGTCTTG, A, CTGAGGTTTCAATGAATGGAATCCCGTAAC, chr1
BRAF-neg-7-140453136-15-c.1799T>A -V600E-COSM476, AACTGATGGGACCCACTCCATCGAGATTTC, T, CTGTAGCTAGACCAAAATCACCTATTTTTA, chr7
EGFR-pos-7-55241677-18-c.2125G>A-p.E709K-COSM12988, CCCAACCAAGCTCTCTTGAGGATCTTGAAG, A, AAACTGAATTCAAAAAGATCAAAGTGCTGG, chr7
EGFR-pos-7-55241707-18-c.2155G>A-p.G719S-COSM6252, GAAACTGAATTCAAAAAGATCAAAGTGCTG, A, GCTCCGGTGCGTTCGGCACGGTGTATAAGG, chr7
EGFR-pos-7-55241707-18-c.2155G>T-p.G719C-COSM6253, GAAACTGAATTCAAAAAGATCAAAGTGCTG, T, GCTCCGGTGCGTTCGGCACGGTGTATAAGG, chr7

testdata/mutations.csv gives an example of CSV-format mutation file

VCF-format mutation file

A standard VCF can be used as a mutation file, with file extension .vcf or .VCF. If the mutation file is a VCF file, you should specify the reference assembly file by -r <ref.fa>. For example the command can be:

mutscan -1 R1.fq -2 R2.fq -m target.vcf -r hg19.fa

Work with BAM/CRAM

If you want to run MutScan with BAM/CRAM files, you can use samtools to convert them to FASTQ files using samtools fastq command, both single-end and paired-end data are supported by latest version of samtools fastq.

Remarks

  • MutScan requires at least 50 bp long reads, if your reads are too short, do not use it
  • If you want to extract mutations even with only one read support, add -S 1 or --support=1 in the command
  • Feel free to raise an issue if you meet any problem

Cite MutScan

Shifu Chen, Tanxiao Huang, TieXiang Wen, Hong Li, Mingyan Xu and Jia Gu. MutScan: fast detection and visualization of target mutations by scanning FASTQ data. BMC Bioinformatics. https://doi.org/10.1186/s12859-018-2024-6

More Repositories

1

fastp

An ultra-fast all-in-one FASTQ preprocessor (QC/adapters/trimming/filtering/splitting/merging...)
C++
1,840
star
2

awesome-bio-datasets

awesome-bio-datasets
211
star
3

AfterQC

Automatic Filtering, Trimming, Error Removing and Quality Control for fastq data
Python
203
star
4

repaq

A fast lossless FASTQ compressor with ultra-high compression ratio
C
122
star
5

GeneFuse

Gene fusion detection and visualization
C
114
star
6

gencore

Generate duplex/single consensus reads to reduce sequencing noises and remove duplications
C++
111
star
7

fastv

An ultra-fast tool for identification of SARS-CoV-2 and other microbes from sequencing data. This tool can be used to detect viral infectious diseases, like COVID-19.
C++
110
star
8

scrnapip

A Systematic and Dynamic Pipeline for Single-Cell RNA Sequencing Analysis
HTML
98
star
9

OpenGene.jl

(No maintenance) OpenGene, core libraries for NGS data analysis and bioinformatics in Julia
Julia
64
star
10

CfdnaPattern

Pattern Recognition for Cell-free DNA
Python
58
star
11

UniqueKMER

Generate unique KMERs for every contig in a FASTA file
C
43
star
12

ctdna-pipeline

A simplified pipeline for ctDNA sequencing data analysis
Shell
36
star
13

VisualMSI

Detect and visualize microsatellite instability(MSI) from NGS data
C++
31
star
14

defq

Please switch to https://github.com/OpenGene/defastq
C
28
star
15

MrBam

Query Mutated Reads from a Bam
Python
26
star
16

FusionDirect.jl

(No maintenance) Detect gene fusion directly from raw fastq files
Julia
25
star
17

SeqMaker.jl

(No maintenance) Next Generation Sequencing Simulation with SNP, Variation and Sequencing Error Integrated
Julia
24
star
18

dedup

Deduplication for cfDNA sequencing data
Python
10
star
19

defastq

Ultra-fast Multi-threaded FASTQ Demultiplexing
C++
7
star
20

pecheck

check paired-end FASTQ data integrity
C
6
star
21

slicer

Slice a text file (like FastQ) to smaller files by lines, with gzip supported
C
6
star
22

ACMSI

The shiny-based app for Fragment Analysis, especially for MSI analysis
R
4
star
23

novelbio-bioinfo

Java
2
star
24

novelbio-base

Java
2
star
25

IRDProc

Process genomic data downloaded from influenza research database for unique k-mer generating
Python
1
star