  • Language
    C
  • License
    MIT License
  • Created almost 8 years ago
  • Updated almost 3 years ago

Gene fusion detection and visualization

GeneFuse

A tool to detect and visualize target gene fusions by scanning FASTQ files directly. It accepts FASTQ files and a reference genome as input, and outputs the detected fusions in TEXT, JSON and HTML formats.

Take a quick glance at the informative report

Get genefuse program

install with Bioconda

conda install -c bioconda genefuse

download binary

This binary is for Linux systems only: http://opengene.org/GeneFuse/genefuse

# this binary was compiled on CentOS, and tested on CentOS/Ubuntu
wget http://opengene.org/GeneFuse/genefuse
chmod a+x ./genefuse

or compile from source

# get source (you can also download it with a browser from master or releases)
git clone https://github.com/OpenGene/genefuse.git

# build
cd genefuse
make

# Install
sudo make install

Usage

Provide the following arguments to run genefuse:

  • the reference genome FASTA file, specified by -r or --ref=
  • the fusion setting file, specified by -f or --fusion=
  • the FASTQ file(s), specified by -1 or --read1= for single-end data. For paired-end data, also specify the read2 file by -2 or --read2=
  • use -h or --html= to specify the file name of the HTML report
  • use -j or --json= to specify the file name of the JSON report
  • the plain text result is printed directly to STDOUT; you can redirect it to a file using >

Example

genefuse -r hg19.fasta -f genes/druggable.hg19.csv -1 genefuse.R1.fq.gz -2 genefuse.R2.fq.gz -h report.html > result
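
For scripted runs, the same command line can be assembled programmatically. The following is a minimal sketch, not part of GeneFuse: the function names build_genefuse_command and run_genefuse are hypothetical, it uses only the options documented in this README, and it assumes that the genefuse binary is on PATH and exits with a non-zero status on failure.

```python
import subprocess
from typing import List, Optional


def build_genefuse_command(
    ref: str,
    fusion: str,
    read1: str,
    read2: Optional[str] = None,
    html: Optional[str] = None,
    json_report: Optional[str] = None,
    threads: Optional[int] = None,
) -> List[str]:
    """Assemble a genefuse argument list (hypothetical helper)."""
    cmd = ["genefuse", "-r", ref, "-f", fusion, "-1", read1]
    if read2 is not None:  # only needed for paired-end data
        cmd += ["-2", read2]
    if html is not None:
        cmd += ["-h", html]
    if json_report is not None:
        cmd += ["-j", json_report]
    if threads is not None:
        cmd += ["-t", str(threads)]
    return cmd


def run_genefuse(cmd: List[str], result_path: str) -> None:
    """Run genefuse and save its plain-text STDOUT to result_path.

    Assumption: a non-zero exit status signals failure, so check=True
    raises CalledProcessError in that case.
    """
    with open(result_path, "w") as out:
        subprocess.run(cmd, stdout=out, check=True)
```

Calling build_genefuse_command("hg19.fasta", "genes/druggable.hg19.csv", "genefuse.R1.fq.gz", read2="genefuse.R2.fq.gz", html="report.html") yields the same arguments as the example above, and run_genefuse(cmd, "result") plays the role of the > redirection.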

Reference genome

The reference genome should be a single FASTA file containing all chromosome data. This file must not be compressed. For human data, the hg19/GRCh37 or hg38/GRCh38 assembly is typically used, which can be downloaded from the following sites:

Fusion file

The fusion file lists the target genes with their genomic coordinates, each followed by its exons. A sample is:

>EML4_ENST00000318522.5,chr2:42396490-42559688
1,42396490,42396776
2,42472645,42472827
3,42483641,42483770
4,42488261,42488434
5,42490318,42490446
...

>ALK_ENST00000389048.3,chr2:29415640-30144432
1,30142859,30144432
2,29940444,29940563
3,29917716,29917880
4,29754781,29754982
5,29606598,29606725
...
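
The layout of the sample can be read back with a few lines of Python. This is a rough sketch inferred only from the sample shown here, not GeneFuse's own parser: the function name parse_fusion_file and the dictionary keys are hypothetical, and the "..." lines are treated as the elision markers of this README rather than as part of the format.

```python
from typing import Dict, List


def parse_fusion_file(text: str) -> List[Dict]:
    """Parse fusion-file text into a list of gene records.

    Assumed layout, based on the sample only:
      >NAME_TRANSCRIPT,chrom:start-end
      exon_number,exon_start,exon_end
    """
    genes: List[Dict] = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line == "...":  # blank line or README elision
            continue
        if line.startswith(">"):
            name, region = line[1:].split(",", 1)
            chrom, span = region.split(":", 1)
            start, end = span.split("-", 1)
            genes.append({
                "name": name,
                "chrom": chrom,
                "start": int(start),
                "end": int(end),
                "exons": [],
            })
        else:
            if not genes:
                raise ValueError("exon line found before any gene header")
            number, start, end = line.split(",")
            genes[-1]["exons"].append((int(number), int(start), int(end)))
    return genes
```

Each record keeps its exons in file order, so for ALK in the sample the exon starts decrease as the exon number increases.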

The coordinate system must be consistent with the reference genome.
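
One part of this requirement that is easy to check is sequence naming, for example chr2 in the fusion file versus 2 in the FASTA file. The sketch below is an illustration under stated assumptions, not a GeneFuse feature: all function names are hypothetical, and the header pattern is inferred from the sample above. It cannot tell hg19 coordinates from hg38 coordinates, so an empty result does not prove the assemblies match.

```python
import re
from typing import Iterable, Set

# Gene header as in the sample: >NAME,chrom:start-end
FUSION_HEADER = re.compile(r"^>[^,]+,([^:]+):\d+-\d+\s*$")


def fusion_chromosomes(fusion_lines: Iterable[str]) -> Set[str]:
    """Collect the chromosome names used in fusion-file gene headers."""
    chroms = set()
    for line in fusion_lines:
        match = FUSION_HEADER.match(line)
        if match:
            chroms.add(match.group(1))
    return chroms


def fasta_sequence_names(fasta_lines: Iterable[str]) -> Set[str]:
    """Collect sequence names: the first word after '>' on each header."""
    names = set()
    for line in fasta_lines:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                names.add(fields[0])
    return names


def missing_chromosomes(
    fusion_lines: Iterable[str], fasta_lines: Iterable[str]
) -> Set[str]:
    """Chromosomes named in the fusion file but absent from the FASTA."""
    return fusion_chromosomes(fusion_lines) - fasta_sequence_names(fasta_lines)
```

Both arguments can be open file handles, so a whole-genome FASTA file is streamed line by line rather than loaded into memory.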

Fusion files provided in this package

Four fusion files are provided with genefuse:

  1. genes/druggable.hg19.csv: all druggable fusion genes based on the hg19/GRCh37 reference assembly.
  2. genes/druggable.hg38.csv: all druggable fusion genes based on the hg38/GRCh38 reference assembly.
  3. genes/cancer.hg19.csv: all COSMIC curated fusion genes (http://cancer.sanger.ac.uk/cosmic/fusion) based on the hg19/GRCh37 reference assembly.
  4. genes/cancer.hg38.csv: all COSMIC curated fusion genes (http://cancer.sanger.ac.uk/cosmic/fusion) based on the hg38/GRCh38 reference assembly.

Notes:

  • genefuse runs much faster with the druggable gene lists than with the cancer gene lists, since the druggable genes are only a small subset of the cancer genes. Use a druggable list if you only care about fusions relevant to personalized cancer medicine.
  • The cancer gene lists should be sufficient for most cancer-related studies, since they include all COSMIC curated fusion genes.
  • To create a custom gene list, follow the instructions in the next section.

Create a fusion file based on hg19 or hg38

If you'd like to create a custom fusion file, you can use scripts/make_fusion_genes.py.
The script uses a refFlat.txt file to determine the genomic coordinates of exons, so you need to download refFlat.txt from the UCSC Genome Browser in advance. Whether you use hg19 or hg38 is up to you.

Make sure to decompress the file to plain text before you continue.

In the input gene list file, list one gene per line. By default, the longest transcript is used. To select a different transcript, add its transcript ID after the gene, separated by a tab or a space. Each gene must be given as its official HGNC gene symbol, and each transcript as an NCBI RefSeq transcript ID.

An example of gene list file:

BRCA2	NM_000059
FAM155A
IRS2
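
The gene list format described above is simple enough to validate before running the script. A minimal sketch, assuming nothing beyond that description (the function name parse_gene_list is hypothetical and is not part of scripts/make_fusion_genes.py):

```python
from typing import List, Optional, Tuple


def parse_gene_list(text: str) -> List[Tuple[str, Optional[str]]]:
    """Return (gene, transcript) pairs; transcript is None when omitted."""
    entries: List[Tuple[str, Optional[str]]] = []
    for number, raw in enumerate(text.splitlines(), start=1):
        fields = raw.split()  # splits on tabs and spaces alike
        if not fields:
            continue  # skip blank lines
        if len(fields) > 2:
            raise ValueError(
                f"line {number}: expected 'GENE' or 'GENE TRANSCRIPT'"
            )
        gene = fields[0]
        transcript = fields[1] if len(fields) == 2 else None
        entries.append((gene, transcript))
    return entries
```

For the example above, BRCA2 is paired with NM_000059, while FAM155A and IRS2 carry no transcript and would therefore fall back to the longest one.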

Once both the input gene list file (gene_list.txt) and the refFlat.txt file are prepared, use the following command to generate a user-defined fusion file (fusion.csv):

python3 scripts/make_fusion_genes.py gene_list.txt -r /path/to/refflat -o fusion.csv
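
To illustrate the transcript selection rule, here is a rough sketch of how one record per gene could be picked from refFlat.txt. It is not the code of make_fusion_genes.py. It relies on the standard UCSC refFlat column order (geneName, name, chrom, strand, txStart, txEnd, cdsStart, cdsEnd, exonCount, exonStarts, exonEnds), and it assumes that "longest" means the largest txEnd - txStart span, which may differ from how the script measures length. The function name pick_transcript is hypothetical.

```python
from typing import Dict, Iterable, Optional


def pick_transcript(
    refflat_lines: Iterable[str],
    gene: str,
    transcript: Optional[str] = None,
) -> Optional[Dict]:
    """Pick the requested transcript, or the longest one for the gene.

    Assumption: length is measured as txEnd - txStart.
    Returns None when no matching record is found.
    """
    best: Optional[Dict] = None
    for line in refflat_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11 or fields[0] != gene:
            continue
        record = {
            "gene": fields[0],
            "transcript": fields[1],
            "chrom": fields[2],
            "strand": fields[3],
            "tx_start": int(fields[4]),
            "tx_end": int(fields[5]),
        }
        if transcript is not None:
            if record["transcript"] == transcript:
                return record  # explicit transcript requested
            continue
        length = record["tx_end"] - record["tx_start"]
        if best is None or length > best["tx_end"] - best["tx_start"]:
            best = record
    return best
```

A None result for a listed gene usually points to a symbol that is not spelled as in refFlat.txt, which is worth checking before generating the fusion file.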

HTML report

GeneFuse can generate informative, interactive HTML pages that visualize the fusions with the following information:

  • the fusion genes, along with their transcripts.
  • the inferred break point, with reference genome coordinates.
  • the inferred fusion protein, with all exons and the transcription direction.
  • the supporting reads, with all bases colorized according to their quality scores.
  • the number of supporting reads, and how many of them are unique (the rest may be duplicates).

An HTML report example

See the HTML page of this example: http://opengene.org/GeneFuse/report.html

All options

options:
  -1, --read1       read1 file name (string)
  -2, --read2       read2 file name (string [=])
  -f, --fusion      fusion file name, in CSV format (string)
  -r, --ref         reference fasta file name (string)
  -u, --unique      specify the least supporting read number is required to report a fusion, default is 2 (int [=2])
  -d, --deletion    specify the least deletion length of a intra-gene deletion to report, default is 50 (int [=50])
  -h, --html        file name to store HTML report, default is genefuse.html (string [=genefuse.html])
  -j, --json        file name to store JSON report, default is genefuse.json (string [=genefuse.json])
  -t, --thread      worker thread number, default is 4 (int [=4])
  -?, --help        print this message

Cite GeneFuse

If you used GeneFuse in your work, you can cite it as:

Shifu Chen, Ming Liu, Tanxiao Huang, Wenting Liao, Mingyan Xu and Jia Gu. GeneFuse: detection and visualization of target gene fusions from DNA sequencing data. International Journal of Biological Sciences, 2018; 14(8): 843-848. doi: 10.7150/ijbs.24626
