bioinformatics-one-liners

Bioinformatics one-liners from Ming Tang: my collection of bioinformatics one-liners that are useful in my day-to-day work.

I came across many of these one-liners on the Biostars forum and gathered them here. I have also added some of my own tricks.

05/21/2015.

get the sequence length distribution from a fastq file using awk

zcat file.fastq.gz | awk 'NR%4 == 2 {lengths[length($0)]++} END {for (l in lengths) {print l, lengths[l]}}'  

add barcode to 10x single cell R1 read

cat test.fq | awk 'NR%4 == 2 {$0="xxx"$0}{print}'
@D00365:1187:HMM2FBCX2:1:1103:1258:2132 1:N:0:CGTGCAGA
xxxTATTACCAGATGAGAGCATGGTTAGG
+
DDDDDIIIIIIIHIIIIIIIIIIIII
@D00365:1187:HMM2FBCX2:1:1103:1472:2136 1:N:0:CGTGCAGA
xxxAACCATGAGTGTCCCGCTGGCATCGC
+
DDDADGHHIIHIIGIHHHFCHHIIII
@D00365:1187:HMM2FBCX2:1:1103:1822:2139 1:N:0:CGTGCAGA
xxxGTGCATATCATGTAGCGTATTATACT
+
DDDDDIIIIIIIIIIIIIIIIIIIII
@D00365:1187:HMM2FBCX2:1:1103:1943:2145 1:N:0:CGTGCAGA
xxxGATTCAGTCTCCAACCTCTCCTTTGT
+
DDDDDHIIIIIIIIIIHIIIIHIIII
@D00365:1187:HMM2FBCX2:1:1103:1917:2147 1:N:0:CGTGCAGA
xxxCCTTCGACAAGTTGTCAGGTGCGGTC
+
DDDDDHIIIIIIIIIIIIIIGIIHHH

Reverse complement a sequence (I use that a lot when I need to design primers)

echo 'ATTGCTATGCTNNNT' | rev | tr 'ACTG' 'TGAC'
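
If the sequence might contain lowercase bases, widening the tr sets (a small variation I added) handles them too; ambiguous bases like N pass through unchanged:

echo 'ATTGCTatgctNNNT' | rev | tr 'ACGTacgt' 'TGCAtgca'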

split a multifasta file into single ones with csplit:

csplit -z -q -n 4 -f sequence_ sequences.fasta /\>/ {*}  

Split a multi-FASTA file into individual FASTA files by awk

awk '/^>/{s=++d".fa"} {print > s}' multi.fa

linearize multiline fasta

cat file.fasta | awk '/^>/{if(N>0) printf("\n"); ++N; printf("%s\t",$0);next;} {printf("%s",$0);}END{printf("\n");}'
awk 'BEGIN{RS=">"}NR>1{sub("\n","\t"); gsub("\n",""); print RS$0}' file.fa

fastq2fasta

zcat file.fastq.gz | paste - - - - | perl -ane '$F[0] =~ s/^@/>/; print "$F[0]\n$F[2]\n";' | gzip -c > file.fasta.gz

bam2bed

samtools view file.bam | perl -F'\t' -ane '$strand=($F[1]&16)?"-":"+";$length=1;$tmp=$F[5];$tmp =~ s/(\d+)[MD]/$length+=$1/eg;print "$F[2]\t$F[3]\t".($F[3]+$length)."\t$F[0]\t0\t$strand\n";' > file.bed

bam2wig

samtools mpileup -BQ0 file.sorted.bam | perl -pe '($c, $start, undef, $depth) = split;if ($c ne $lastC || $start != $lastStart+1) {print "fixedStep chrom=$c start=$start step=1 span=1\n";}$_ = $depth."\n";($lastC, $lastStart) = ($c, $start);' | gzip -c > file.wig.gz

Number of reads in a fastq file

cat file.fq | echo $((`wc -l`/4))
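
For a gzipped fastq, a similar count (my variant, not in the original) is:

zcat file.fq.gz | awk 'END{print NR/4}'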

Single-line fasta file to multi-line fasta with 60 characters per line

awk -v FS= '/^>/{print;next}{for (i=0;i<=NF/60;i++) {for (j=1;j<=60;j++) printf "%s", $(i*60 +j); print ""}}' file

fold -w 60 file

Sequence length of every entry in a multifasta file

awk '/^>/ {if (seqlen){print seqlen}; print ;seqlen=0;next; } { seqlen = seqlen +length($0)}END{print seqlen}' file.fa

Reproducible subsampling of a FASTQ file. srand() seeds the random number generator, which keeps the subsampling the same when the script is run multiple times. 0.01 is the fraction of reads to output (1%).

cat file.fq | paste - - - - | awk 'BEGIN{srand(1234)}{if(rand() < 0.01) print $0}' | tr '\t' '\n' > out.fq

or look at Heng Li's seqtk
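
Assuming seqtk is installed, a typical subsampling call looks roughly like this (file names are placeholders; -s sets the random seed so the subsample is reproducible):

seqtk sample -s100 file.fq.gz 0.01 > subsample.fq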

Deinterleaving a FASTQ:

cat file.fq | paste - - - - - - - - | tee >(cut -f1-4 | tr '\t' '\n' > out1.fq) | cut -f5-8 | tr '\t' '\n' > out2.fq

Using mpileup on a whole genome can take forever, so handle each chromosome separately and run them in parallel on several cores to speed up your pipeline. xargs makes this easy.

Example usage of xargs (-P is the number of parallel processes started; don't use more than the number of cores you have available):

samtools view -H yourFile.bam | grep "\@SQ" | sed 's/^.*SN://g' | cut -f 1 | xargs -I {} -n 1 -P 24 sh -c "samtools mpileup -BQ0 -d 100000 -uf yourGenome.fa -r {} yourFile.bam | bcftools view -vcg - > tmp.{}.vcf"

To merge the results afterwards, you might want to do something like this:

samtools view -H yourFile.bam | grep "\@SQ" | sed 's/^.*SN://g' | cut -f 1 | perl -ane 'system("cat tmp.$F[0].vcf >> yourFile.vcf");'

split large file by id/label/column

awk '{print >> $1; close($1)}' input_file

split a bed file by chromosome:

cat nexterarapidcapture_exome_targetedregions_v1.2.bed | sort -k1,1 -k2,2n | sed 's/^chr//' | awk '{close(f);f=$1}{print > f".bed"}'

#or
awk '{print $0 >> $1".bed"}' example.bed

sort vcf file with header

cat my.vcf | awk '$0~"^#" { print $0; next } { print $0 | "sort -k1,1V -k2,2n" }'
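
If bcftools is available, its built-in sort also keeps the header in place (an alternative I added; requires a reasonably recent bcftools):

bcftools sort my.vcf -o my_sorted.vcf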

Rename files using bash string manipulation

for file in *gz
do zcat $file > ${file/bed.gz/bed}
done

gnu sed: print invisible characters

cat my_file | sed -n 'l'
cat -A my_file

exit a dead ssh session

~.

copy large files; copy the from_dir directory inside the to_dir directory

rsync -av from_dir  to_dir

## copy every file inside from_dir into to_dir
rsync -av from_dir/ to_dir

##re-copy the files avoiding completed ones:

rsync -avhP /from/dir /to/dir

make directory using the current date

mkdir $(date +%F)

all the folders' size in the current folder (GNU du)

du -h --max-depth=1

this one is a bit different, try it and see the difference

du -ch

the total size of the current directory

du -sh .

disk usage

df -h

print the column names of a csv file; install csvkit https://csvkit.readthedocs.org/en/0.9.1/

csvcut -n file.csv

open top with human-readable sizes in MB/GB; install htop for better visualization

top -M

how much memory is used, in GB

free -mg
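
On newer procps versions, free -h prints human-readable units directly (an alternative I find handy; not in the original list):

free -h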

print out unique rows based on the first and second column

awk '!a[$1,$2]++' input_file

sort -u -k1,2 file   ## sorts and keeps rows that are unique with respect to the first and second columns

do not wrap long lines in less

less -S

pretty output

fold -w 60
cat file.txt | column -t | less -S

pass tab as delimiter http://unix.stackexchange.com/questions/46910/is-it-a-bug-for-join-with-t-t

-t $'\t'
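
For example, joining two tab-delimited files on their first columns (a.tsv and b.tsv are placeholder files; join needs both inputs sorted on the join field):

join -t $'\t' -1 1 -2 1 <(sort -t $'\t' -k1,1 a.tsv) <(sort -t $'\t' -k1,1 b.tsv)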

awk, always printing the first line (header)

awk ' NR ==1 || ($10 > 1 && $11 > 0 && $18 > 0.001)' input_file

delete blank lines with sed

sed '/^$/d' file

delete the last line

sed '$d' file
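
With GNU head (not BSD/macOS), deleting the last line can also be written as:

head -n -1 file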

awk to join files based on several columns

my github repo

### select lines from a file based on columns in another file
## http://unix.stackexchange.com/questions/134829/compare-two-columns-of-different-files-and-print-if-it-matches
awk -F"\t" 'NR==FNR{a[$1$2$3]++;next};a[$1$2$3] > 0' file2 file1 

Finally learned about !$ in unix: it expands to the last word of the previous command.
For example, after running echo hello, world, then echo !$ prints 'world'.

Create a script of the last executed command:
echo "!!" > foo.sh

Reuse all parameters of the previous command line:
!*

find bam files in the current folder (searching recursively) and copy them to a new directory using 5 parallel processes
find . -name "*bam" | xargs -P5 -I{} rsync -av {} dest_dir

ls -X will group files by extension.

loop through all the chromosomes

for i in {1..22} X Y 
do
  echo $i
done

for i in {01..22} will expand to 01 02 ... (zero-padded)

change every other newline to tab:

paste is used to concatenate corresponding lines from files: paste file1 file2 file3 .... If one of the "file" arguments is "-", then lines are read from standard input. If there are 2 "-" arguments, then paste takes 2 lines from stdin. And so on.

cat test.txt  
0    ATTTTATTNGAAATAGTAGTGGG
0    CTCCCAAAATACTAAAATTATAA
1    TTTTAGTTATTTANGAGGTTGAG
1    CNTAATCTTAACTCACTACAACC
2    TTATAATTTTAGTATTTTGGGAG
2    CATATTAACCAAACTAATCTTAA
3    GGTTAATATGGTGAAATTTAAT
3    ACCTCAACCTCNTAAATAACTAA

cat test.txt| paste - -                               
0    ATTTTATTNGAAATAGTAGTGGG    0    CTCCCAAAATACTAAAATTATAA
1    TTTTAGTTATTTANGAGGTTGAG    1    CNTAATCTTAACTCACTACAACC
2    TTATAATTTTAGTATTTTGGGAG    2    CATATTAACCAAACTAATCTTAA
3    GGTTAATATGGTGAAATTTAAT     3    ACCTCAACCTCNTAAATAACTAA

ORS is the output record separator in awk. var = condition ? value_if_true : value_if_false is the ternary operator.

cat test.txt| awk 'ORS=NR%2?"\t":"\n"'          

0    ATTTTATTNGAAATAGTAGTGGG    0    CTCCCAAAATACTAAAATTATAA
1    TTTTAGTTATTTANGAGGTTGAG    1    CNTAATCTTAACTCACTACAACC
2    TTATAATTTTAGTATTTTGGGAG    2    CATATTAACCAAACTAATCTTAA
3    GGTTAATATGGTGAAATTTAAT     3    ACCTCAACCTCNTAAATAACTAA

awk

We can also use the conditional (ternary) operator in a print statement of the form print CONDITION ? PRINT_IF_TRUE_TEXT : PRINT_IF_FALSE_TEXT. For example, in the code below, we label sequences with lengths > 14:

cat data/test.tsv
blah_C1	ACTGTCTGTCACTGTGTTGTGATGTTGTGTGTG
blah_C2	ACTTTATATATT
blah_C3	ACTTATATATATATA
blah_C4	ACTTATATATATATA
blah_C5	ACTTTATATATT	

awk '{print (length($2)>14) ? $0">14" : $0"<=14";}' data/test.tsv
blah_C1	ACTGTCTGTCACTGTGTTGTGATGTTGTGTGTG>14
blah_C2	ACTTTATATATT<=14
blah_C3	ACTTATATATATATA>14
blah_C4	ACTTATATATATATA>14
blah_C5	ACTTTATATATT<=14

awk 'NR==3{print "";next}{printf $1"\t"}{print $1}' data/test.tsv
blah_C1	blah_C1
blah_C2	blah_C2

blah_C4	blah_C4
blah_C5	blah_C5

You can also use getline to load the contents of another file in addition to the one you are reading. In the statement below, the while loop loads each line of test.tsv into k until there are no more lines to read:

awk 'BEGIN{while((getline k <"data/test.tsv")>0) print "BEGIN:"k}{print}' data/test.tsv
BEGIN:blah_C1	ACTGTCTGTCACTGTGTTGTGATGTTGTGTGTG
BEGIN:blah_C2	ACTTTATATATT
BEGIN:blah_C3	ACTTATATATATATA
BEGIN:blah_C4	ACTTATATATATATA
BEGIN:blah_C5	ACTTTATATATT
blah_C1	ACTGTCTGTCACTGTGTTGTGATGTTGTGTGTG
blah_C2	ACTTTATATATT
blah_C3	ACTTATATATATATA
blah_C4	ACTTATATATATATA
blah_C5	ACTTTATATATT

merge multiple fasta sequences in two files into a single file line by line

see post

linearize.awk:

/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}
paste <(awk -f linearize.awk file1.fa ) <(awk -f linearize.awk file2.fa  )| tr "\t" "\n"

grep fastq reads containing a pattern but maintain the fastq format

grep -A 2 -B 1 'AAGTTGATAACGGACTAGCCTTATTTT' file.fq | sed '/^--$/d' > out.fq

# or
zcat reads.fq.gz \
| paste - - - - \
| awk -v FS="\t" -v OFS="\n" '$2 ~ "AAGTTGATAACGGACTAGCCTTATTTT" {print $1, $2, $3, $4}' \
| gzip > filtered.fq.gz

count how many columns a tsv file has:

cat file.tsv | head -1 | tr "\t" "\n" | wc -l  
csvcut -n -t  file.tsv (from csvkit)
awk '{print NF; exit}' file.tsv
awk -F "\t" 'NR == 1 {print NF}' file.tsv

add info to the fasta header

from a biostar post

cat myfasta.txt 
>Blap_contig79
MSTDVDAKTRSKERASIAAFYVGRNIFVTGGTGFLGKVLIEKLLRSCPDVGEIFILMRPKAGLSI
>Bluc_contig23663
MSTNVDAKARSKERASIAAFYVGRNIFVTGGTGFLGKVLIEKLLRSCPDVGEIFILMRPKAGLSI
>Blap_contig7988
MSTDVDAKTRSKERASIAAFYVGRNIFVTGGTGFLGKVLIEKLLRSCPDVGEIFILMRPKAGLSI
>Bluc_contig1223663
MSTNVDAKARSKERASIAAFYVGRNIFVTGGTGFLGKVLIEKLLRSCPDVGEIFILMRPKAGLSI

cat my_info.txt 
info1
info2
info3
info4

paste <(cat my_info.txt) <(cat myfasta.txt| paste - - | cut -c2-) | awk '{printf(">%s_%s\n%s\n",$1,$2,$3);}'
>info1_Blap_contig79
MSTDVDAKTRSKERASIAAFYVGRNIFVTGGTGFLGKVLIEKLLRSCPDVGEIFILMRPKAGLSI
>info2_Bluc_contig23663
MSTNVDAKARSKERASIAAFYVGRNIFVTGGTGFLGKVLIEKLLRSCPDVGEIFILMRPKAGLSI
>info3_Blap_contig7988
MSTDVDAKTRSKERASIAAFYVGRNIFVTGGTGFLGKVLIEKLLRSCPDVGEIFILMRPKAGLSI
>info4_Bluc_contig1223663
MSTNVDAKARSKERASIAAFYVGRNIFVTGGTGFLGKVLIEKLLRSCPDVGEIFILMRPKAGLSI

count how many columns in a tsv file

cat file.tsv | head -1 | tr "\t" "\n" | wc -l  

## from csvkit
csvcut -n -t file.tsv

## emulate csvcut -n -t
less file.tsv | head -1 | tr "\t" "\n" | nl

awk -F "\t" 'NR == 1 {print NF}' file.tsv
awk '{print NF; exit}' file.tsv

change fasta header

see https://www.biostars.org/p/53212/

The fasta header looks like >7 dna:chromosome chromosome:GRCh37:7:1:159138663:1; convert it to >7:

cat Homo_sapiens_assembly19.fasta | gawk '/^>/ { b=gensub(" dna:.+", "", "g", $0); print b; next} {print}' > Homo_sapiens_assembly19_reheader.fasta

mkdir and cd into that dir shortcut

mkdir blah && cd $_

cut out columns based on column names in another file

http://crazyhottommy.blogspot.com/2016/10/cutting-out-500-columns-from-26g-file.html

#! /bin/bash

set -e
set -u
set -o pipefail

#### Author: Ming Tang (Tommy)
#### Date 09/29/2016
#### I got the idea from this stackOverflow post http://stackoverflow.com/questions/11098189/awk-extract-columns-from-file-based-on-header-selected-from-2nd-file

# show help
show_help(){
cat << EOF
  This is a wrapper for extracting columns of a (big) dataframe based on a list of column names in another
  file. The column names must be one per line. The output goes to stdout. For small files (< 2G) one
  can load the data into R and do it easily, but when the file is big (> 10G) R is quite cumbersome.
  Using unix commands, on the other hand, is better because the files do not have to be loaded into memory at once.
  e.g. subsetting a 26G file for 700 columns takes around 30 minutes; the memory footprint is very low (~4MB).

  usage: ${0##*/} -f <a dataframe> -c <colNames> -d <delimiter of the file>
        -h display this help and exit.
        -f the file you want to extract columns from. It must contain a header with column names.
        -c a file with one column name per line.
        -d delimiter of the dataframe: , or \t. The default is tab.

        e.g.

        for a tsv file:
            ${0##*/} -f mydata.tsv -c colnames.txt -d $'\t' or simply omit the -d; the default is tab.

        for a csv file: note that you have to specify -d , if your file is csv, otherwise all columns will be cut out.
            ${0##*/} -f mydata.csv -c colnames.txt -d ,
        
EOF
}

## if there are no arguments provided, show help
if [[ $# == 0 ]]; then show_help; exit 1; fi

while getopts ":hf:c:d:" opt; do
  case "$opt" in
    h) show_help;exit 0;;
    f) File2extract=$OPTARG;;
    c) colNames=$OPTARG;;
    d) delim=$OPTARG;;
    '?') echo "Invalid option $OPTARG"; show_help >&2; exit 1;;
  esac
done
	

## set the default delimiter to tab; note how the tab is specified

delim=${delim:-$'\t'}

## get the column numbers in the dataframe that match the column names in the colNames file.
## format the output as 2,5,6,22,... and strip the trailing comma so it can be fed to cut -f
 
cols=$(head -1 "${File2extract}" | tr "${delim}" "\n" | grep -nf "${colNames}" | sed 's/:.*$//' | tr "\n" "," | sed 's/,$//')

## cut out the columns 
cut -d"${delim}" -f"${cols}" "${File2extract}"

or use csvtk from Shen Wei:

csvtk cut -t -f $(paste -s -d , list.txt) data.tsv

merge all bed files and add a column for the filename.

awk '{print $0 "\t" FILENAME}' *bed 

add or remove chr from the start of each line

# add chr
sed 's/^/chr/' my.bed

# or
awk 'BEGIN {OFS = "\t"} {$1="chr"$1; print}'

# remove chr
sed 's/^chr//' my.bed

check if a tsv file has the same number of columns in all rows

awk '{print NF}' test.tsv | sort -nu | head -n 1
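
The one-liner above prints only the smallest column count. Dropping the head shows every distinct count, which makes ragged rows obvious (a small variation I added; -F '\t' avoids counting spaces inside fields):

awk -F '\t' '{print NF}' test.tsv | sort -nu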

Parallelized samtools mpileup

https://www.biostars.org/p/134331/

BAM="yourFile.bam"
REF="reference.fasta"
samtools view -H $BAM | grep "\@SQ" | sed 's/^.*SN://g' | cut -f 1 | xargs -I {} -n 1 -P 24 sh -c "samtools mpileup -BQ0 -d 100000 -uf $REF -r \"{}\" $BAM | bcftools call -cv > \"{}\".vcf"

convert multiple lines to a single line

This is better than tr "\n" "\t" because sometimes I do not want to convert the last newline to a tab.

cat myfile.txt | paste -s 
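
paste -s joins lines with tabs by default; -d changes the delimiter, e.g. to commas (a minor variant I added):

paste -s -d, myfile.txt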

merge multiple files with same header by keeping the header of the first file

I usually do it in R, but I like this quick solution.

https://stackoverflow.com/questions/16890582/unixmerge-multiple-csv-files-with-same-header-by-keeping-the-header-of-the-firs

awk 'FNR==1 && NR!=1{next;}{print}' *.csv 

# or

awk '
    FNR==1 && NR!=1 { while (/^<header>/) getline; }
    1 {print}
' file*.txt >all.txt

insert a field into the first line

cut -f1-4 F5.hg38.enhancers.expression.usage.matrix | head
CNhs11844	CNhs11251	CNhs11282	CNhs10746
chr10:100006233-100006603	1	0	0
chr10:100008181-100008444	0	0	0
chr10:100014348-100014634	0	0	0
chr10:100020065-100020562	0	0	0
chr10:100043485-100043744	0	0	0
chr10:100114218-100114567	0	0	0
chr10:100148595-100148922	0	0	0
chr10:100182422-100182522	0	0	0
chr10:100184498-100184704	0	0	0

sed '1 s/^/enhancer\t/' F5.hg38.enhancers.expression.usage.matrix | cut -f1-4 | head
enhancer	CNhs11844	CNhs11251	CNhs11282
chr10:100006233-100006603	1	0	0
chr10:100008181-100008444	0	0	0
chr10:100014348-100014634	0	0	0
chr10:100020065-100020562	0	0	0
chr10:100043485-100043744	0	0	0
chr10:100114218-100114567	0	0	0
chr10:100148595-100148922	0	0	0
chr10:100182422-100182522	0	0	0
chr10:100184498-100184704	0	0	0

extract PASS calls from vcf file

cat my.vcf | awk -F '\t' '{if($0 ~ /\#/) print; else if($7 == "PASS") print}' > my_PASS.vcf

replace a pattern in a specific column

## replace in column 5
awk '{gsub(/pattern/, "replacement", $5)}1' in.file

## http://bioinf.shenwei.me/csvtk/usage/#replace
csvtk replace -f 5 -p pattern -r replacement 

move a process to a screen session

https://www.linkedin.com/pulse/move-running-process-screen-bruce-werdschinski/

1. Suspend the job: Ctrl+z
2. Resume it in the background: bg
3. Disown it: disown %1
4. Launch screen
5. Find the pid: pgrep BLAH
6. Reparent the process: reptyr ### (see the sketch below)
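
Putting the steps together as shell commands (a sketch; BLAH and the PID 12345 are placeholders):

## in the terminal where the job is running: press Ctrl+z to suspend it, then
bg            # resume the job in the background
disown %1     # detach it from the current shell so it survives logout
## start (or attach to) a screen session, then:
pgrep -f BLAH # find the PID of the process (BLAH is a placeholder name)
reptyr 12345  # re-attach that PID to the current terminal (replace 12345 with the real PID)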

count unique values in a column and put the count in a new column

https://www.unix.com/unix-for-beginners-questions-and-answers/270526-awk-count-unique-element-array.html

# input
blabla_1 A,B,C,C
blabla_2 A,E,G
blabla_3 R,Q,A,B,C,R,Q

# output
blabla_1 3
blabla_2 3
blabla_3 5


awk '{split(x,C); n=split($2,F,/,/); for(i in F) if(C[F[i]]++) n--; print $1, n}' file

get the promoter regions from a gtf file

https://twitter.com/David_McGaughey/status/1106371758142173185

Create TSS bed from GTF in one line:

zcat gencode.v29lift37.annotation.gtf.gz | awk '$3=="gene" {print $0}' | grep protein_coding | awk -v OFS="\t" '{if ($7=="+") {print $1, $4, $4+1} else {print $1, $5-1, $5}}' > tss.bed

or 5kb flanking tss

zcat gencode.v29lift37.annotation.gtf.gz | awk '$3=="gene" {print $0}' | grep protein_coding | awk -v OFS="\t" '{if ($7=="+") {print $1, $4, $4+5000} else {print $1, $5-5000, $5}}' > promoters.bed

caveat: some genes are at the ends of chromosomes, so adding or subtracting 5000 may go past the chromosome boundaries; use bedtools slop with a genome size file to avoid that.

download fetchChromSizes from http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/

fetchChromSizes hg19 > chrom_size.txt

zcat gencode.v29lift37.annotation.gtf.gz | awk '$3=="gene" {print $0}' |  awk -v OFS="\t" '{if ($7=="+") {print $1, $4, $4+1} else {print $1, $5-1, $5}}' | bedtools slop -i - -g chrom_size.txt -b 5000 > promoter_5kb.bed

reverse one column of a txt file

reverse column 3 and put the result in column 5

awk -v OFS="\t" '{"echo "$3 "| rev" | getline $5}{print $0}' input_file

## or use perl to reverse the third column ($F[2]) in place
perl -lane 'BEGIN{$,="\t"}{$rev=reverse $F[2];print $F[0],$F[1],$rev,$F[3]}' input_file

get the full path of a file

realpath file.txt
readlink -f file.txt 

pugz: unzip (gzip) in parallel

https://github.com/Piezoid/pugz

Unlike pigz, which does single-threaded decompression (see https://github.com/madler/pigz/blob/master/pigz.c#L232), pugz implements truly parallel decompression.

run singularity on a multi-user HPC

#! /bin/bash
set -euo pipefail

module load singularity
# Need a unique /tmp for this job for /tmp/rstudio-rsession & /tmp/rstudio-server
WORKDIR=/liulab/${USER}/singularity_images
mkdir -m 700 -p ${WORKDIR}/tmp2
mkdir -m 700 -p ${WORKDIR}/tmp

PASSWORD='xyz' singularity exec --bind "${WORKDIR}/tmp2:/var/run/rstudio-server" --bind "${WORKDIR}/tmp:/tmp" --bind="/liulab/${USER}" geospatial_4.0.2.simg rserver --www-port 8888 --auth-none=0  --auth-pam-helper-path=pam-helper  --www-address=127.0.0.1

add ServerAliveInterval 60 to avoid dropping from your ssh session

Add the following at the top of your ~/.ssh/config to prevent the ssh session from dropping:

Host *
 ServerAliveInterval 60
 

I use screen/tmux, and mosh as well.
