Unix, R and python tools for genomics and data science
Table of contents

General

Courses

Some biology

If you are from fields outside of biology, places to get you started:

Some statistics

linear algebra

Bayesian Statistics

Learning Latex

Linux commands

Theory and quick reference

There are 3 file descriptors, stdin, stdout and stderr (std=standard).

Basically you can:

  • redirect stdout to a file
  • redirect stderr to a file
  • redirect stdout to stderr
  • redirect stderr to stdout
  • redirect stderr and stdout to a file
  • redirect stderr and stdout to stdout
  • redirect stderr and stdout to stderr

1 'represents' stdout and 2 stderr. A little note for seeing these things: with the less command you can view both stdout (which will remain in the buffer) and stderr, which will be printed on the screen but erased as you try to 'browse' the buffer.

  • stdout 2 file

This will cause the output of a program to be written to a file.

     ls -l > ls-l.txt

Here, a file called 'ls-l.txt' will be created and it will contain what you would see on the screen if you type the command 'ls -l' and execute it.

  • stderr 2 file

This will cause the stderr output of a program to be written to a file.

     grep da * 2> grep-errors.txt

Here, a file called 'grep-errors.txt' will be created and it will contain the stderr portion of the output of the 'grep da *' command.
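A minimal sketch to see the two streams land in different places. The function `emit` is a hypothetical stand-in for any program that writes to both stdout and stderr, and the file names are arbitrary.

```shell
# emit is a hypothetical helper that writes one line to each stream
emit() {
    echo "normal output"         # goes to stdout (file descriptor 1)
    echo "error message" >&2     # goes to stderr (file descriptor 2)
}

workdir=$(mktemp -d)             # scratch directory for the demo files

# stdout goes to one file, stderr to another
emit > "$workdir/out.txt" 2> "$workdir/err.txt"

out_content=$(cat "$workdir/out.txt")
err_content=$(cat "$workdir/err.txt")
echo "out.txt: $out_content"
echo "err.txt: $err_content"

rm -r "$workdir"
```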

  • stdout 2 stderr

This will cause the stdout output of a program to be written to the same file descriptor as stderr.

     grep da * 1>&2

Here, the stdout portion of the command is sent to stderr; you may notice that in different ways.
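One common place to notice it is in your own scripts, where diagnostic messages are sent to stderr so they do not mix with the real result. The sketch below assumes two hypothetical helpers, `log` and `compute`; neither comes from the original notes.

```shell
# log is a hypothetical helper: it sends its message to stderr via 1>&2
log() {
    echo "[log] $*" 1>&2
}

# compute prints its result on stdout and a progress note on stderr
compute() {
    log "computing..."
    echo 42
}

# Command substitution captures stdout only, so the log line is not
# part of the captured value (it still shows up on the terminal)
result=$(compute)
echo "result is $result"
```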

  • stderr 2 stdout

This will cause the stderr output of a program to be written to the same file descriptor as stdout.

     grep da * 2>&1

Here, the stderr portion of the command is sent to stdout. If you pipe to less, you'll see that lines that normally 'disappear' (as they are written to stderr) are now kept (because they're on stdout).
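The effect on a pipe can be checked by counting lines. This is a small sketch in which `both_streams` is a hypothetical stand-in for a program that uses both streams.

```shell
# both_streams is a hypothetical stand-in for a program that uses both streams
both_streams() {
    echo "line on stdout"
    echo "line on stderr" >&2
}

# Without 2>&1 only stdout enters the pipe (stderr is discarded here for clarity)
stdout_only=$(both_streams 2>/dev/null | wc -l | tr -d ' ')

# With 2>&1 both lines travel through the pipe
merged=$(both_streams 2>&1 | wc -l | tr -d ' ')

echo "lines without 2>&1: $stdout_only"
echo "lines with 2>&1:    $merged"
```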

  • stderr and stdout 2 file

This will place every output of a program in a file. This is sometimes suitable for cron entries, if you want a command to run in absolute silence.

     rm -f $(find / -name core) &> /dev/null

This (thinking of the cron entry) will delete every file called 'core' in any directory. Notice that you should be pretty sure of what a command is doing if you are going to wipe its output.
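In bash, `&> file` is shorthand for `> file 2>&1`; the longer form also works in plain sh, but the order of the two redirections matters. The sketch below illustrates this with a hypothetical function `noisy` and arbitrary file names.

```shell
# noisy is a hypothetical stand-in for a program that uses both streams
noisy() {
    echo "to stdout"
    echo "to stderr" >&2
}

workdir=$(mktemp -d)

# Correct order: point stdout at the file first, then copy stderr onto it
noisy > "$workdir/both.txt" 2>&1

# Reversed order: stderr is duplicated onto the old stdout before stdout
# moves, so only the stdout line reaches the file; the stderr line is
# captured in $stray instead
stray=$(noisy 2>&1 > "$workdir/stdout_only.txt")

both_lines=$(wc -l < "$workdir/both.txt" | tr -d ' ')
only_lines=$(wc -l < "$workdir/stdout_only.txt" | tr -d ' ')
echo "both.txt has $both_lines lines, stdout_only.txt has $only_lines"

rm -r "$workdir"
```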

  • change permissions of files
    each digit is for: user, group and other.

chmod 754 myfile: this means the user has read, write and execute permission; members of the same group have read and execute permission but no write permission; everyone else has only read permission.

4 stands for "read",
2 stands for "write",
1 stands for "execute", and
0 stands for "no permission."
So 7 is the combination of permissions 4+2+1 (read, write, and execute), 5 is 4+0+1 (read, no write, and execute), and 4 is 4+0+0 (read, no write, and no execute).
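The arithmetic can be written out as a short sketch. `perm_digit` is a hypothetical helper introduced here for illustration; it is not a standard command.

```shell
# perm_digit is a hypothetical helper: it converts a triplet such as "r-x"
# into its octal digit by adding 4 (read), 2 (write) and 1 (execute)
perm_digit() {
    digit=0
    case "$1" in r??) digit=$((digit + 4)) ;; esac
    case "$1" in ?w?) digit=$((digit + 2)) ;; esac
    case "$1" in ??x) digit=$((digit + 1)) ;; esac
    echo "$digit"
}

# user = rwx, group = r-x, other = r--
mode="$(perm_digit rwx)$(perm_digit r-x)$(perm_digit r--)"
echo "rwx r-x r-- -> $mode"
```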

The numbers are sometimes hard to remember, so one can use letters instead: u, g, and o stand for "user", "group", and "other"; r, w, and x stand for "read", "write", and "execute", respectively.

chmod u+x myfile
chmod g+r myfile
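A quick way to check what the two commands do is to compare the permission string from `ls -l` before and after. This is a sketch on a scratch file; the starting mode 600 is an assumption chosen so that both changes are visible.

```shell
workdir=$(mktemp -d)
touch "$workdir/myfile"

chmod 600 "$workdir/myfile"      # start from rw------- (user read/write only)
before=$(ls -l "$workdir/myfile" | cut -c1-10)

chmod u+x "$workdir/myfile"      # add execute for the user
chmod g+r "$workdir/myfile"      # add read for the group
after=$(ls -l "$workdir/myfile" | cut -c1-10)

echo "before: $before"
echo "after:  $after"

rm -r "$workdir"
```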

Do not give me excel files!

How to name files

It is really important to name your files correctly! See the slides by Jenny Bryan.

Three principles for (file) names:

  • Machine readable (do not put special characters and space in the name)
  • Human readable (Easy to figure out what the heck something is, based on its name, add slug)
  • Plays well with default ordering:
  1. Put something numeric first
  2. Use the ISO 8601 standard for dates (YYYY-MM-DD)
  3. Left pad other numbers with zeros
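The three rules can be sketched in the shell with `printf`. The helper `make_name` and the slug `plate-reads` are hypothetical and only serve the illustration.

```shell
# make_name is a hypothetical helper that assembles
# <ISO date>_<slug>_<zero-padded number>.csv
make_name() {
    printf '%s_%s_%03d.csv\n' "$1" "$2" "$3"
}

# Zero padding keeps 2, 10 and 100 in numeric order under a plain sort
names=$(for i in 10 2 100; do make_name 2013-06-26 plate-reads "$i"; done | sort)
echo "$names"

first=$(echo "$names" | head -n 1)
last=$(echo "$names" | tail -n 1)
```

Without the padding, a plain sort would place `10` and `100` before `2`.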

If you have to rename the files...

  • brename: a cross-platform command-line tool for safely batch renaming files/directories via regular expression (supporting Windows, Linux and OS X) by Shen Wei. It is very useful!

Good naming of your files can help you extract metadata from the file names.

  • dirdf Create tidy data frames of file metadata from directory and file names.
> dir("examples/dataset_1/")
[1] "2013-06-26_BRAFWTNEG_Plasmid-Cellline-100_A01.csv"
[2] "2013-06-26_BRAFWTNEG_Plasmid-Cellline-100_A02.csv"
[3] "2014-02-26_BRAFWTNEG_FFPEDNA-CRC-1-41_D08.csv"
[4] "2014-03-05_BRAFWTNEG_FFPEDNA-CRC-REPEAT_H03.csv"
[5] "2016-04-01_BRAFWTNEG_FFPEDNA-CRC-1-41_E12.csv"

> library("dirdf")
> dirdf("examples/dataset_1/", template="date_assay_experiment_well.ext")
        date     assay           experiment well ext                                          pathname
1 2013-06-26 BRAFWTNEG Plasmid-Cellline-100  A01 csv 2013-06-26_BRAFWTNEG_Plasmid-Cellline-100_A01.csv
2 2013-06-26 BRAFWTNEG Plasmid-Cellline-100  A02 csv 2013-06-26_BRAFWTNEG_Plasmid-Cellline-100_A02.csv
3 2014-02-26 BRAFWTNEG     FFPEDNA-CRC-1-41  D08 csv     2014-02-26_BRAFWTNEG_FFPEDNA-CRC-1-41_D08.csv
4 2014-03-05 BRAFWTNEG   FFPEDNA-CRC-REPEAT  H03 csv   2014-03-05_BRAFWTNEG_FFPEDNA-CRC-REPEAT_H03.csv
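The same idea works in the shell without any extra package. The following is a rough sketch that splits one of the file names from the listing above on underscores; the variable names simply mirror the template fields.

```shell
# File name taken from the listing above
file="2014-02-26_BRAFWTNEG_FFPEDNA-CRC-1-41_D08.csv"

ext=${file##*.}        # text after the last dot
base=${file%.*}        # name without the extension

# Split the base name on underscores into the four template fields
old_ifs=$IFS
IFS=_
set -- $base
IFS=$old_ifs

date=$1; assay=$2; experiment=$3; well=$4
printf '%s\t%s\t%s\t%s\t%s\n' "$date" "$assay" "$experiment" "$well" "$ext"
```

This only works because the fields never contain an underscore themselves, which is one more reason to reserve `_` as the field separator and use `-` inside fields.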

parallelization

Using these tools will greatly improve your working efficiency and get rid of most of your for loops.

  1. xargs
  2. GNU parallel. One of my posts is here.
  3. gxargs by Brent Pedersen. Written in Go.
  4. rush A cross-platform command-line tool for executing jobs in parallel by Shen Wei. I use his other tools such as brename and csvtk.
  5. future: Unified Parallel and Distributed Processing in R for Everyone
  6. furrr Apply Mapping Functions in Parallel using Futures
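As a small illustration of replacing a for loop, here is a sketch with `xargs`. It assumes an `xargs` that supports the `-P` option (GNU and BSD versions do; it is not required by POSIX) and that `gzip` is installed. The sample file names are made up.

```shell
workdir=$(mktemp -d)
for s in sampleA sampleB sampleC sampleD; do
    echo "reads for $s" > "$workdir/$s.txt"
done

# Serial version would be: for f in "$workdir"/*.txt; do gzip "$f"; done
# Parallel version: at most 2 gzip processes at a time, one file per process
printf '%s\n' "$workdir"/*.txt | xargs -n 1 -P 2 gzip

compressed=$(ls "$workdir" | grep -c '\.gz$')
echo "compressed files: $compressed"

rm -r "$workdir"
```

Feeding names one per line breaks on file names that contain spaces or newlines, which is another argument for machine-readable file names.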

Statistics

Data transfer

a blog post by Mark Ziemann http://genomespot.blogspot.com/2018/03/share-and-backup-data-sets-with-dat.html

Website

updating R

# Install new version of R (let's say 3.5.0 in this example)

# Create a new directory for the version of R
fs::dir_create("~/Library/R/3.5/library")

# Re-start R so the .libPaths are updated

# Lookup what packages were in your old package library
# path_file() keeps only the last path component, i.e. the package name
pkgs <- fs::path_file(fs::dir_ls("~/Library/R/3.4/library"))

# Filter these packages as needed

# Install the packages in the new version
install.packages(pkgs)

Better R code

Shiny App

profile R code

  • profvis Interactive Visualizations for Profiling R Code.
  • proffer The proffer package profiles R code to find bottlenecks.
  • rco - The R Code Optimizer Make your R code run faster! rco analyzes your code and applies different optimization strategies that return an R code that runs faster.

R tools for data wrangling, tidying and visualizing.

If you already know the mapping in advance (like the above example) you should use the .data pronoun from rlang to make it explicit that you are referring to the drv in the layer data and not some other variable named drv (which may or may not exist elsewhere). To avoid a similar note from the CMD check about .data, use #' @importFrom rlang .data in any roxygen code block (typically this should be in the package documentation as generated by usethis::use_package_doc()).

  • If you know the mapping or facet specification is col in advance, use aes(.data$col) or vars(.data$col).
  • If col is a variable that contains the column name as a character vector, use aes(.data[[col]]) or vars(.data[[col]]).
  • If you would like the behaviour of col to look and feel like it would within aes() and vars(), use aes({{ col }}) or vars({{ col }}).

Genomic data visualization

  • karyoploteR Really powerful and versatile tool.
  • Bentobox BentoBox empowers users to programmatically and flexibly generate multi-panel figures.

Sankey graph

Handling big data in R

Write your own R package

Documentation

  • This is a must-read for writing good documentation: A blog post. I saved it as a PDF and uploaded it to this repo.

handling arguments at the command line

visualization in general

Javascript

python tips and tools

machine learning

Amazon cloud computing

Intro to AWS Cloud Computing

Genomics-visualization-tools

There are many online web-based tools for visualization of (cancer) genomic data. I put my collection here. I use R for visualization. See also a nice post on using Python by Radhouane Aniba: Genomic Data Visualization in Python

  • UCSC cancer genome browser. It has many data sets built in, including TCGA data, and can be very handy for both bench scientists and bioinformaticians.
  • UCSC Xena. A new tool developed by the UCSC team as well. Potentially very useful, but it needs more tutorials to follow.
  • UCSC genome browser. One of the most famous genome browsers and my favorite. Every person studying genetics, genomics and molecular biology needs to know how to use it. Tutorials from OpenHelix.
  • Epiviz 3 is an interactive visualization tool for functional genomics data. It supports genome navigation like other genome browsers, but allows multiple visualizations of data within genomic regions using scatterplots, heatmaps and other user-supplied visualizations.
  • Mutation Annotation & Genome Interpretation TCGA: MAGI
  • GeneProteinViz (GPViz) is a versatile Java-based software for dynamic gene-centered visualization of genomic regions and/or variants.
  • ProteinPaint: Web Application for Visualizing Genomic Data The software developed for this project highlights critical attributes about the mutations, including the form of protein variant (e.g. the new amino acid as a result of missense mutation), the name of sample from which the mutation was identified, whether the mutation is somatic or germline,

Databases

Large data consortium data mining

Integrative analysis

Interactive visualization

Tutorials

See https://t.co/yxCb85ctL1: "MDS best choice for preserving outliers, PCA for variance, & T-SNE for clusters" @mikelove @AndrewLBeam

— Rileen Sinha (@RileenSinha) August 25, 2016

paper: Outlier Preservation by Dimensionality Reduction Techniques

"MDS best choice for preserving outliers, PCA for variance, & T-SNE for clusters"

MOOC (Massive Open Online Courses)

git and version control

blogs

data management

Automate your workflow, open science and reproducible research

Automation wins in the long run.

STEP 6 is usually missing!

The pic was downloaded from http://biobungalow.weebly.com/bio-bungalow-blog/everybody-knows-the-scientific-method

Workflow languages

Reviews
Snakemake

I am using Snakemake and so far I am very happy with it!

Nextflow

Reproducible research

As an early adopter of the Figshare repository, I came up with a strategy that serves both our open-science and our reproducibility goals, and also helps with this problem: for the main results in any new paper, we would share the data, plotting script and figure under a CC-BY license, by first uploading them to Figshare.

Survival curve

Organize research for a group

  • Slack: a messaging app for teams.
  • Ryver.
  • Trello lets you work more collaboratively and get more done.

Clustering

CRISPR related

vector arts for life sciences
