The complete sequence of a human genome

Telomere-to-telomere consortium CHM13 project

We have sequenced the CHM13hTERT human cell line with a number of technologies. Human genomic DNA was extracted from the cultured cell line. As the DNA is native, modified bases will be preserved. The data includes 30x PacBio HiFi, 120x coverage of Oxford Nanopore, 70x PacBio CLR, 50x 10X Genomics, as well as BioNano DLS and Arima Genomics HiC. Most raw data is available from this site, with the exception of the PacBio data which was generated by the University of Washington/PacBio and is available from NCBI SRA.

A UCSC browser is available for v2.0 (as well as legacy v1.0 and v1.1 versions). An interactive dotplot visualization of all genomic repeats is also available from resgen.io. Known issues identified in the assembly are tracked at CHM13 issues.

Latest assembly release

T2T-CHM13v2.0 (T2T-CHM13+Y)

Complete T2T reconstruction of a human genome with Y. The change from v1.1 is the addition of a finished chromosome Y from the GIAB HG002/NA24385 sample, sequenced by both GIAB and HPRC. This genome is also available at NCBI (GCA_009914755.4) and at UCSC. Note that although the UCSC browser displays the GenBank accessions as sequence names, it can load annotations in BED/bigBed/BAM/CRAM/bigWig and other formats, and can search, using the "chr1/2/etc" names.

Previous assembly releases are available below:

Downloads

Sequencing data

The sequencing dataset generated for CHM13 has been moved to this page.

Analysis set

An analysis set for using T2T-CHM13v2.0 (T2T-CHM13+Y) as a reference for mapping-based research is available at AWS with a README.

  • chm13v2.0.fa.gz: T2T-CHM13v2.0 assembly with sequences soft-masked using the repeat models discovered by the T2T team. The original sequence accession numbers are shown in the FASTA header.
  • chm13v2.0_noY.fa.gz: the assembly excluding the Y chromosome. This file contains only sequences derived from the CHM13 cell line and is identical to T2T-CHM13v1.1. Use this file for benchmarking assemblies of CHM13.
  • chm13v2.0_PAR.bed: pseudoautosomal regions (PARs)
  • chm13v2.0_maskedY.fa.gz: PARs on chrY hard masked to "N"
  • chm13v2.0_maskedY.rCRS.fa.gz: PARs on chrY hard masked to "N" and mitochondrion replaced with rCRS (AC:NC_012920.1)

Sep. 28 2022 update: all analysis-set fa.gz files have been re-compressed with bgzip. Index files are available at AWS, with updated md5s in the README.
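Because the files were re-compressed, checksums recorded before the update no longer match. A minimal sketch of checking a downloaded file against the md5 listed in the README follows. The helper name verify_md5 is hypothetical, the expected digest must be copied by hand from the README, and the sketch assumes the GNU coreutils md5sum utility is installed (macOS ships md5 instead).

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of any T2T tooling): compare a file's MD5
# digest with an expected value copied by hand from the README.
# Assumes GNU coreutils `md5sum` is on the PATH.
verify_md5() {
  local file="$1" expected="$2" actual
  # md5sum prints "<digest>  <filename>"; bail out if the file cannot be read.
  actual="$(md5sum "$file")" || return 2
  actual="${actual%% *}"   # keep only the digest field
  if [ "$actual" = "$expected" ]; then
    echo "OK: $file"
  else
    echo "MISMATCH: $file (expected $expected, got $actual)" >&2
    return 1
  fi
}

# Example call; replace the placeholder with the digest from the README:
#   verify_md5 chm13v2.0.fa.gz "<md5 from README>"
```

The function returns 0 on a match, 1 on a mismatch, and 2 if the file cannot be read, so it can be used directly in an `if` or chained with `&&` in a download script.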

Gene annotation

Repeat annotation

Epigenetic profile

Variant calls

Liftover resources

Notes on downloading files

Files are generously hosted by Amazon Web Services under s3://human-pangenomics/T2T/CHM13 and through this web interface.

Although the files are available as straightforward HTTP links, download performance is improved by using the Amazon Web Services command-line interface. Links should be rewritten to use the s3:// addressing scheme, that is, replace https://s3-us-west-2.amazonaws.com/human-pangenomics/T2T/ with s3://human-pangenomics/T2T/ to download. For example, to download CHM13_prep5_S13_L002_I1_001.fastq.gz to the current working directory, use the following command.

aws s3 --no-sign-request cp s3://human-pangenomics/T2T/CHM13/10x/CHM13_prep5_S13_L002_I1_001.fastq.gz .

or to download the full dataset use the following command.

aws s3 --no-sign-request sync s3://human-pangenomics/T2T/CHM13/ .
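The prefix substitution described above can be sketched as a small shell function. The name to_s3_uri is hypothetical and is not part of the AWS CLI. It only rewrites links that begin with the HTTPS prefix quoted in this document, and passes s3:// URIs through unchanged.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the AWS CLI): rewrite an HTTPS link to the
# human-pangenomics bucket into the equivalent s3:// URI.
to_s3_uri() {
  local url="$1"
  local https_prefix="https://s3-us-west-2.amazonaws.com/human-pangenomics/"
  case "$url" in
    "$https_prefix"*)
      # Strip the HTTPS prefix and prepend the bucket in s3:// form.
      printf 's3://human-pangenomics/%s\n' "${url#"$https_prefix"}"
      ;;
    s3://*)
      # Already an S3 URI; return it unchanged.
      printf '%s\n' "$url"
      ;;
    *)
      printf 'unrecognized URL: %s\n' "$url" >&2
      return 1
      ;;
  esac
}

# Example call (requires the AWS CLI and network access):
#   aws s3 --no-sign-request cp "$(to_s3_uri "$link")" .
```

Rejecting unrecognized URLs, rather than guessing, keeps a typo in a link from silently turning into a request against the wrong bucket.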

The s3 command can also be used to get information on the dataset, for example reporting the size of every file in human-readable format.

aws s3 --no-sign-request ls --recursive --human-readable --summarize s3://human-pangenomics/T2T/CHM13/ 

or to obtain technology-specific sizes.

aws s3 --no-sign-request ls --recursive --human-readable --summarize s3://human-pangenomics/T2T/CHM13/nanopore/fast5
aws s3 --no-sign-request ls --recursive --human-readable --summarize s3://human-pangenomics/T2T/CHM13/nanopore/rel2
aws s3 --no-sign-request ls --recursive --human-readable --summarize s3://human-pangenomics/T2T/CHM13/assemblies

Amending the max_concurrent_requests etc. settings as per this guide will improve download performance further.

Contact

Please raise issues concerning this dataset on this GitHub repository.

Data reuse and license

All data is released to the public domain (CC0) and we encourage its reuse. We would appreciate it if you would acknowledge and cite the "Telomere-to-Telomere" (T2T) Consortium for the creation of this data. More information about our consortium can be found on the T2T homepage and a list of related citations is available below:

T2T-CHM13v2.0, datasets released along with v2.0, and the T2T-Y chromosome

The complete sequence of a human genome and companion papers (T2T-CHM13v0.9-v1.1):

  1. Nurk S, Koren S, Rhie A, Rautiainen M, et al. The complete sequence of a human genome. Science, 2022.
  2. Vollger MR, et al. Segmental duplications and their variation in a complete human genome. Science, 2022.
  3. Gershman A, et al. Epigenetic Patterns in a Complete Human Genome. Science, 2022.
  4. Aganezov S, Yan SM, Soto DC, Kirsche M, Zarate S, et al. A complete reference genome improves analysis of human genetic variation. Science, 2022.
  5. Hoyt SJ, et al. From telomere to telomere: the transcriptional and epigenetic state of human repeat elements. Science, 2022.
  6. Altemose N, et al. Complete genomic and epigenetic maps of human centromeres. Science, 2022.
  7. Wagner J, et al. Curated variation benchmarks for challenging medically relevant autosomal genes. Nat Biotechnol, 2022.
  8. McCartney AM, Shafin K, Alonge M, et al. Chasing perfection: validation and polishing strategies for telomere-to-telomere genome assemblies. Nat Methods, 2022.
  9. Formenti G, Rhie A, et al. Merfin: improved variant filtering, assembly evaluation and polishing via k-mer validation. Nat Methods, 2022.
  10. Jain C, et al. Long-read mapping to repetitive reference sequences using Winnowmap2. Nat Methods, 2022.
  11. Altemose N, Maslan A, Smith OK, et al. DiMeLo-seq: a long-read, single-molecule method for mapping protein–DNA interactions genome wide. Nat Methods, 2022.

Earlier citations:

  1. Vollger MR, et al. Improved assembly and variant detection of a haploid human genome using single-molecule, high-fidelity long reads. Annals of Human Genetics, 2019.
  2. Miga KH, Koren S, et al. Telomere-to-telomere assembly of a complete human X chromosome. Nature, 2020.
  3. Nurk S, Walenz BP, et al. HiCanu: accurate assembly of segmental duplications, satellites, and allelic variants from high-fidelity long reads. Genome Research, 2020.
  4. Logsdon GA, et al. The structure, function, and evolution of a complete human chromosome 8. Nature, 2021.

History

* rel1 and 2: 2nd March 2019. Initial release.
* asm v0.6 and canu rel2 assembly: 28th May 2019. Assembly update.
* Hi-C data added: 25th July 2019. Data update.
* asm v0.6 alignments of rel2 added: 30th Aug 2019. Data update.
* rel3: 16th Sept 2019. Data update.
* chrX v0.7, canu 1.9 and flye 2.5 rel3 assembly: 24th Oct 2019. Assembly update.
* shasta rel3 assembly: 20th Dec 2019. Assembly update.
* chr8 v3, rel4 data: 21 Feb 2020. Data and assembly update.
* update rel3 partition names since some tars included more than a single partition. 16 Apr 2020.
* add CLR/HiFi mappings to chrX v0.7. 8 May 2020.
* update partitions 23,28,30,53,55 and add 227-231 (data was missing from upload). 13 May 2020. Data update.
* add rel5 guppy 3.6.0 data: 4 Jun 2020. Data update.
* add chr8 v9. Aug 26 2020. Assembly update.
* add v0.9/v1.0 genome releases. Sept 22 2020. Assembly update.
* add v0.9/v1.0 alignment files. Sept 29 2020. Assembly update.
* add new UW data. Oct 6 2020. Data update.
* add rna-seq data. Dec 4 2020. Data update.
* add repeat and telomere annotations for v1.0. Dec 17 2020. Assembly annotation update.
* v1.1 assembly and related files. May 7 2021. Assembly update.
* v2.0 assembly and related files. Dec 2 2022. Assembly and annotation update.
* 1KGP variant calls for all chromosomes. Jan. 3 2023. Annotation update.
* 1KGP and SGDP bam / vcf released publicly on [AnVIL_T2T_CHRY](https://anvil.terra.bio/#workspaces/anvil-datastorage/AnVIL_T2T_CHRY). May 23, 2023. Data Update.
