• Stars: 161
• Rank: 225,026 (Top 5%)
• Language: Python
• License: GNU General Publi...
• Created over 4 years ago
• Updated 12 months ago

Repository Details

🎼 Integrate multiple high-dimensional datasets with fuzzy k-means and locally linear adjustments.

harmonypy

Harmony is an algorithm for integrating multiple high-dimensional datasets.

harmonypy is a port of the harmony R package by Ilya Korsunsky.
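To give a feel for the "fuzzy k-means" part of the description, here is a minimal sketch of soft cluster assignment, where each point receives a membership weight for every cluster instead of a single hard label. This is one common form (a softmax over negative squared distances) and is not taken from harmonypy's source; the function name `soft_assign` and the `sigma` parameter are hypothetical names chosen for illustration.

```python
import numpy as np

def soft_assign(points, centroids, sigma=1.0):
    """Return an (n_points, n_clusters) matrix of soft memberships.

    Hypothetical helper for illustration only; not part of harmonypy.
    """
    # Squared Euclidean distance from every point to every centroid.
    diff = points[:, None, :] - centroids[None, :, :]
    dist2 = (diff ** 2).sum(axis=2)
    # Subtract the row minimum before exponentiating for numerical stability.
    logits = -(dist2 - dist2.min(axis=1, keepdims=True)) / sigma
    weights = np.exp(logits)
    # Normalize so that each row sums to 1.
    return weights / weights.sum(axis=1, keepdims=True)

# Toy data: two points sitting on the centroids and one halfway between them.
points = np.array([[0.0, 0.0], [1.0, 0.0], [0.5, 0.0]])
centroids = np.array([[0.0, 0.0], [1.0, 0.0]])
memberships = soft_assign(points, centroids, sigma=0.5)
```

Smaller values of `sigma` push the memberships toward hard assignments, while larger values spread each point more evenly across clusters. The point halfway between the two centroids is shared equally between them.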

Example

This animation shows the Harmony alignment of three single-cell RNA-seq datasets from different donors.

→ How to make this animation.

Installation

This package has been tested with Python 3.7.

Use pip to install:

pip install harmonypy

Usage

Here is a brief example using the data that comes with the R package:

# Load data
import numpy as np
import pandas as pd

meta_data = pd.read_csv("data/meta.tsv.gz", sep = "\t")
vars_use = ['dataset']

# meta_data
#
#                  cell_id dataset  nGene  percent_mito cell_type
# 0    half_TGAAATTGGTCTAG    half   3664      0.017722    jurkat
# 1    half_GCGATATGCTGATG    half   3858      0.029228      t293
# 2    half_ATTTCTCTCACTAG    half   4049      0.015966    jurkat
# 3    half_CGTAACGACGAGAG    half   3443      0.020379    jurkat
# 4    half_ACGCCTTGTTTACC    half   2813      0.024774      t293
# ..                   ...     ...    ...           ...       ...
# 295  t293_TTACGTACGACACT    t293   4152      0.033997      t293
# 296  t293_TAGAATTGTTGGTG    t293   3097      0.021769      t293
# 297  t293_CGGATAACACCACA    t293   3157      0.020411      t293
# 298  t293_GGTACTGAGTCGAT    t293   2685      0.027846      t293
# 299  t293_ACGCTGCTTCTTAC    t293   3513      0.021240      t293

data_mat = pd.read_csv("data/pcs.tsv.gz", sep = "\t")
data_mat = np.array(data_mat)

# data_mat[:5,:5]
#
# array([[ 0.0071695 , -0.00552724, -0.0036281 , -0.00798025,  0.00028931],
#        [-0.011333  ,  0.00022233, -0.00073589, -0.00192452,  0.0032624 ],
#        [ 0.0091214 , -0.00940727, -0.00106816, -0.0042749 , -0.00029096],
#        [ 0.00866286, -0.00514987, -0.0008989 , -0.00821785, -0.00126997],
#        [-0.00953977,  0.00222714, -0.00374373, -0.00028554,  0.00063737]])

# meta_data.shape # 300 cells, 5 variables
# (300, 5)
#
# data_mat.shape  # 300 cells, 20 PCs
# (300, 20)

# Run Harmony
import harmonypy as hm
ho = hm.run_harmony(data_mat, meta_data, vars_use)

# Write the adjusted PCs to a new file.
res = pd.DataFrame(ho.Z_corr)
res.columns = ['X{}'.format(i + 1) for i in range(res.shape[1])]
res.to_csv("data/adj.tsv.gz", sep = "\t", index = False)
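Before running the example on your own data, it can help to confirm that the matrix and the metadata describe the same cells in the same order of magnitude. The sketch below assumes the layout shown above (one row per cell in both `data_mat` and `meta_data`); `check_inputs` is a hypothetical helper written for this illustration, not a harmonypy function, and the toy inputs are randomly generated.

```python
import numpy as np
import pandas as pd

def check_inputs(data_mat, meta_data, vars_use):
    """Raise ValueError if the inputs do not line up; return (n_cells, n_pcs).

    Hypothetical helper for illustration only; not part of harmonypy.
    """
    data_mat = np.asarray(data_mat)
    if data_mat.ndim != 2:
        raise ValueError("data_mat must be 2-dimensional (cells x PCs)")
    # Assumes one row per cell in both inputs, as in the example above.
    if data_mat.shape[0] != meta_data.shape[0]:
        raise ValueError(
            "data_mat has {} rows but meta_data has {}".format(
                data_mat.shape[0], meta_data.shape[0]
            )
        )
    # Every batch variable must be a column of the metadata table.
    missing = [v for v in vars_use if v not in meta_data.columns]
    if missing:
        raise ValueError("vars_use not found in meta_data: {}".format(missing))
    return data_mat.shape

# Toy inputs: 6 cells, 3 PCs, two datasets.
rng = np.random.default_rng(0)
toy_mat = rng.normal(size=(6, 3))
toy_meta = pd.DataFrame({"dataset": ["a", "a", "a", "b", "b", "b"]})
n_cells, n_pcs = check_inputs(toy_mat, toy_meta, ["dataset"])
```

A check like this catches a misspelled column name in `vars_use` or a metadata table filtered differently from the PC matrix before any computation starts.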

More Repositories

1. ggrepel (R, 1,169 stars): 📍 Repel overlapping text labels away from each other in your ggplot2 figures.
2. awesome-vdj (167 stars): 📚 Tools and databases for analyzing HLA and VDJ genes.
3. snakefiles (Python, 87 stars): 🐍 Snakefiles for common RNA-seq data analysis workflows (STAR and Kallisto).
4. tftargets (R, 84 stars): 🎯 Human transcription factor target genes from 6 databases in convenient R format.
5. pytabix (C, 81 stars): 📇 Retrieve data in genomic intervals with a Python interface for tabix.
6. picardmetrics (Shell, 39 stars): 🚦 Run Picard on BAM files and collate 90 metrics into one file.
7. snpsea (C++, 33 stars): 📊 Identify cell types and pathways affected by genetic risk loci.
8. proxysnps (R, 27 stars): 🔖 Get SNP proxies from the 1000 Genomes Project.
9. CENTIPEDE.tutorial (R, 25 stars): 🐛 How to use CENTIPEDE to determine if a transcription factor is bound.
10. cellguide (JavaScript, 25 stars): 🧭 Navigate single-cell RNA-seq datasets in your web browser.
11. homerkit (R, 15 stars): Read HOMER motif analysis output in R.
12. allelefrequencies (Python, 14 stars): 📂 HLA allele frequencies in tab-delimited format, downloaded from AFND.
13. hlabud (R, 13 stars): 🐶 hlabud: HLA genotype analysis in R
14. snpbook (JavaScript, 8 stars): 📙 Explore 1000 Genomes variant data with JavaScript.
15. fibrotime (Jupyter Notebook, 6 stars): 🍊 View the gene expression response to TNF and IL-17A with vanilla Javascript and HTML.
16. circles (HTML, 5 stars): 🌈
17. xlmhg (C++, 5 stars): 📉 Non-parametric rank enrichment test for binary data.
18. saturation (R, 3 stars): 🧽 Estimate sequencing saturation for GEX, VDJ, and ADT data from the 10x Genomics platform.
19. dotfiles (Vim Script, 3 stars): Kamil's dotfiles
20. slowkow.com (HTML, 3 stars): Personal website.
21. pubmed-pairs (JavaScript, 3 stars): ✌️ Search PubMed for each pair of terms from two lists.
22. arrayqc (R, 2 stars): Quality control metrics for microarray data.
23. covid-stats (R, 1 star): New daily Covid cases and deaths in USA. Data from usafacts.org
24. fern (HTML, 1 star): 🌿 Barnsley fern