VDJDB: A curated database of T-cell receptor sequences of known antigen specificity
The primary goal of VDJdb is to facilitate access to existing information on T-cell receptor antigen specificities, i.e. the ability to recognize certain epitopes in certain MHC contexts.
Our mission is to both aggregate the scarce TCR specificity information available so far and to create a curated repository to store such data.
In addition to routine database updates providing the most up-to-date information, we make our best to ensure data consistency and fight irregularities in TCR specificity reporting with a complex database validation scheme:
- We take into account all available information on experimental setup used to identify antigen-specific TCR sequences and assign a single confidence score to highligh most reliable records at the database generation stage.
- Each database record is also automatically checked against a database of V/J segment germline sequences to ensure standardized and consistent reporting of V-J junctions and CDR3 sequences that define T-cell clones.
This repository hosts the submissions to database and scripts to check, fix and build the database itself.
To build database directly from submissions, go to src
directory and run groovy -cp . BuildDatabase.groovy
script (requires Groovy).
To query the database for your immune repertoire sample(s) use the vdjmatch software.
A web-based GUI for the database can be found in VDJdb-web repository.
Citing
Please cite the database using the most recent NAR paper Dmitry V Bagaev, Renske M A Vroomans, Jerome Samir, Ulrik Stervbo, Cristina Rius, Garry Dolton, Alexander Greenshields-Watson, Meriem Attaf, Evgeny S Egorov, Ivan V Zvyagin, Nina Babel, David K Cole, Andrew J Godkin, Andrew K Sewell, Can Kesmir, Dmitriy M Chudakov, Fabio Luciani, Mikhail Shugay, VDJdb in 2019: database extension, new analysis infrastructure and a T-cell receptor motif compendium, Nucleic Acids Research, gkz874.
doi:10.1093/nar/gkz874.
Submission guide
To submit previously published sequence follow the steps below:
-
Create an issue(s) labeled as
paper
and named by the paper pubmed id,PMID:XXXXXXX
. Note that if paper is a meta-study, you can mark it asmeta-paper
and link issues for its references in a reply to this issue. Also note that in case submitting unpublished sequences, choose any appropriate issue name with details on submitter (name, organization, etc) in issue comments. -
Create new branch and add chunk(s) for corresponding papers named as
PMID_XXXXXXX
. Don't forget to close/reference corresponding issues in the commit message. -
Create a pull request for the branch and check if it passes the CI build. If there are any issues, modify them by fixing/removing entries as necessary.
The structure of submission chunk is provided below, but first a couple of notes:
STYLE Try avoiding spaces (e.g.
TRBV7,TRBV5
, notTRBV7, TRBV5
) and leave fields that have no information as blank (don't use any placeholder). Stick to listed field values at all cost! In case a critical part of your submission doesn't fit in current specification: 1) Create an issue in the issues section (and tag it asmaintainance
), 2) provide us with an example (e.g. open a pull request). Do not insert critical information into the comment field.
FORMAT Please ensure that Variable/Joining and MHC names in your submission come from IMGT nomenclature (this does not apply to donor MHC typing fields).
The BuildDatabase
routine will be executed during CI tests upon each submission and prior to every database release implements table format checks, CDR3 sequence checks and fixes (if possible), and VDJdb confidence score assignment (see below).
To view the list of papers that were not yet processed follow here.
An XLS template is available here.
CAUTION make sure that nothing is messed up (
x/X
frequencies are transformed to dates, bad encoding, etc) when importing from XLS template. The format of all fields is pre-set to text to prevent this case.
Database specification
Each database submission in chunks/
folder should have the following header and columns:
Complex information columns (required)
These columns convey full information about TCR:peptide:MHC complex and are mandatory for any submission.
column name | description |
---|---|
cdr3.alpha | TCR alpha CDR3 amino acid sequence. Complete sequence starting with C and ending with F/W should be provided if possible. Trimmed sequences will be fixed at database building stage in case sufficient V/J germline parts are present |
v.alpha | TCR alpha Variable (V) segment id, up to best resolution possible (TRAVX*XX , e.g. TRAV7 , TRAV7*01 , TRAV7*02 ...). Strictly IMGT nomenclature. Can be left blank if unknown. |
j.alpha | TCR alpha Joining (J) segment id |
cdr3.beta | TCR beta CDR3 amino acid sequence |
v.beta | TCR beta V segment id |
j.beta | TCR beta J segment id |
species | TCR parent species (HomoSapiens , MusMusculus ,...) |
mhc.a | First MHC chain allele, to best resolution possible, HLA-X*XX:XX , e.g. HLA-A*02:01 |
mhc.b | Second MHC chain allele (B2M for MHCI) |
mhc.class | MHCI or MHCII |
antigen.epitope | Amino acid sequence of the epitope |
antigen.gene | Parent gene of the epitope sequence (e.g. pp24 ) |
antigen.species | Parent species of the antigen, to the best clade resolution possible (e.g. HIV-1 , HIV-1*HXB2 ) |
reference.id | Pubmed id, doi, etc |
submitter | Name of submitting person/organization |
Notes:
In case given record represents a clonotype with either TCR alpha or beta sequence unknown, missing CDR3/V/(D)/J fields should be left blank.
V/(D)/J fields can be left blank, however this will abrogate CDR3 fixing/verification procedure for a given record.
Any record should have at least one of CDR3 alpha/beta fields that are not blank.
Method information columns (optional)
Optional columns (i.e. it is not required to fill them, but they should be present in table header) that ensure correct confidence ranking of a given entry. Used to calculate a single confidence score based on various factors, e.g. fraction of a given TCRab sequence among tetramer+ clones sequenced and verification experiments performed.
column name | description |
---|---|
method.identification | tetramer-sort , dextramer-sort , pelimer-sort , pentamer-sort , etc for sorting-based identification. For molecular assays use: antigen-loaded-targets (if T cells specificity was analysed against cells incubatetd with antigenic peptide), antigen-expressing-targets (if T cells specificity was analysed against cells tranformed with antigenic organism, protein or peptide, e.g. BCL transformed with EBV). For magnetic cell separation use beads keyword. Add cultured-T-cells or limiting-dilution-cloning if T cells were cultured before sequencing as in this case method.frequency will have completely different meaning. Use comma to separate phrases. |
method.frequency | Frequency in isolated antigen-specific population, reported as X/X if possible, e.g. 7/30 if a given V/D/J/CDR3 is encountered in 7 out of 30 tetramer+ clones. Formats X% , X.X% and X.X are also supported. |
method.singlecell | yes if single cell sequencing was performed, blank otherwise |
method.sequencing | Sequencing method: sanger , rna-seq or amplicon-seq |
method.verification | tetramer-stain , dextramer-stain , pelimer-stain , pentamer-stain , etc for methods that include TCR cloning and re-staining with multimers. For magnetic cell separation use beads keyword. antigen-loaded-targets , antigen-expressing-targets for molecular assays that validate specificity of cloned T-cell receptors. direct in case pMHC binding T-cells are directly subject to single-cell sequencing. Several comma-separated verification methods can be specified. |
Notes:
In case
method.identification
is left blank, the record is automatically assigned with a lowest confidence score possible.
For special cases such as CD8-null tetramers that utilize HLA with mutated residues that abrogate CD8 binding, specify
cd8null-tetramer
inmethod.identification
field rather than usingmhc.a
field.
During database build phase, the information from columns mentioned above is collapsed to a JSON string and stored in a single method
column, e.g.:
{
"identification":"tetramer-sort",
"frequency":"5/13",
"sequencing":"sanger",
"verification":"antigen-loaded-targets"
}
Meta-information columns (optional)
column name | description |
---|---|
meta.study.id | Internal study id |
meta.cell.subset | T-cell subset, free style, e.g. CD8+ , CD4+CD25+ |
meta.subset.frequency | Frequency of a given TCR sequence in specified cell subset, e.g. 5% means that the TCR sequence represents an expanded clone occupying 5% of CD8+ cells |
meta.subject.cohort | Subject cohort, free style, e.g. healthy or HIV+ . If possible, specify to what extent a healthy donor is healthy, e.g. CMV-seronegative . |
meta.subject.id | Subject id (e.g. donor1 , donor2 ,...) |
meta.replica.id | Replicate sample coming from the same donor, also applies for different time points, etc (e.g. 5mo ) |
meta.clone.id | T-cell clone id |
meta.epitope.id | Epitope id (e.g. FL10 ) |
meta.tissue | Tissue used to isolate T-cells: PBMC , spleen , etc. or TCL (T-cell culture) if isolated from re-stimulated T-cells |
meta.donor.MHC | Donor MHC list if available, blank otherwise. IMGT nomenclature (e.g. HLA-A*02:01) is preferable. Allele group names (e.g. A02 , B18 ) is also acceptable (don't use asterisk in such cases). Use comma to separate alleles. |
meta.donor.MHC.method | Donor MHC typing method if available, blank otherwise |
meta.structure.id | PDB structure ID if exists, or blank. Records having a structural data associated with them will automatically get the highest confidence score. |
comment | Plain text comment, maximum 140 characters |
Note:
While these columns are optional, subject identifier, replica identifier, etc are used when scanning submission for duplicates. Normally duplicate records (with identical complex information columns) are not allowed, but they will not be considered as duplicates in case they have distinct id fields mentioned above.
During database build phase, the information from columns mentioned above is collapsed to a JSON string and stored in a single meta
column, e.g.:
{
"cell.subset":"CD8+",
"subject.cohort":"HSV-2+",
"subject.id":12,
"clone.id":46,
"tissue":"PBMC"
}
Database processing
CDR3 sequence fixing
At this stage, a series of checks is performed for CDR3 sequence and reported V/J segments:
- In case of canonical (starting with conserved
C
and ending withF/W
) CDR3 sequences: checks if 5' and 3' germline parts match corresponding V/J segment sequences. - In case of truncated CDR3 sequences: adds conserved
C/F/W
residues. Can add more missing residues in case a relatively large contiguous V/J germline match is present. - In case excessive germline part is reported (e.g.
FGXG
instead of simplyF
at CDR3 3' part), excessive residues are removed. - Can correct mismatches in V/J germline regions in case a reliable non-contiguous V/J match is found.
The main reason behind that is that current immune repertoire sequencing (RepSeq) data processing software reports canonical clonotype sequences, high number antigen-specific TCR sequences present in literature are reported inconsistently. The latter greatly complicates annotation of RepSeq data using known antigen-specific TCR sequences.
In case of good V/J germline matching and errors in CDR3 sequence, the final CDR3 sequence in the database is replaced by its fixed version. The following report of CDR3 fixer is placed under cdr3fix.alpha
and cdr3fix.beta
columns, e.g.
{
"fixNeeded":true,
"good":false,
"cdr3":"CASSQDVGTGGVFALYF",
"cdr3_old":"CASSQDVGTGGVFALY",
"jFixType":"FixAdd",
"jId":"TRBJ1-6*01",
"jCanonical":true,
"jStart":14,
"vFixType":"FailedBadSegment",
"vId":null,
"vCanonical":true,
"vEnd":-1
}
and
{
"fixNeeded":true,
"good":true,
"cdr3":"CASSLSRGGNQPQYF",
"cdr3_old":"CASSLSRGGNQPQY",
"jFixType":"FixAdd",
"jId":"TRBJ1-5*01",
"jCanonical":true,
"jStart":9,
"vFixType":"NoFixNeeded",
"vId":"TRBV14*01",
"vCanonical":true,
"vEnd":4
}
Field descriptions:
field | description |
---|---|
fixNeeded |
true if corrected CDR3 sequence differs from the original one, false otherwise |
good |
true if the fix can be applied, false if the fix cannot be applied due to bad V/J entry or no V/J matching |
cdr3 |
Fixed CDR3 sequence |
cdr3_old |
Original CDR3 sequence |
jFixType |
Type of fix applied to CDR3 J germline part |
jCanonical |
true if CDR3 ends with F or W , false otherwise |
jId |
J segment identifier |
jStart |
A 0-based index of first CDR3 amino acid that belongs to J segment |
vFixType |
Type of fix applied to CDR3 V germline part |
vCanonical |
true if CDR3 starts with C , false otherwise |
vId |
V segment identifier |
vEnd |
A 0-based index of the last CDR3 amino acid of V segment plus one |
Note:
Possible V and J fix types:
NoFixNeeded
,FixAdd
,FixReplace
,FixTrim
,FailedReplace
(too many mismatches),FailedBadSegment
(bad segment entry),FailedNoAlignment
(no alignment at all)
VDJdb scoring
At the final stage of database processing, TCR:peptide:MHC complexes are assigned with confidence scores. Scores are computed according to reported method entries.
VDJdb scoring is performed by evaluating TCR sequence, identification and verification confidence based on the following criteria:
- Ensuring TCR sequence is correctly identified according to
method.sequencing
andmethod.singlecell
(1-3 points)- sanger - several cells sequenced (2+ cells sequenced according to
method.frequency
) - 2 points, otherwise 1 - amplicon-seq - frequency is higher than
0.01
- 2 points, otherwise 0 - single-cell - 3 points if performed
- sanger - several cells sequenced (2+ cells sequenced according to
- Initial identification of TCR:pMHC is correct according to
method.identification
(0-1 point)- sort-based - frequency is higher than
0.1
according tomethod.frequency
) - culture-based - frequency is higher than
0.5
- limiting dilution/culture prior to sequencing - the
method.frequency
becomes somewhat ambigous, check if it is higher than0.5
- sort-based - frequency is higher than
- Verification T-cell specificity (0-3 points)
- direct method - 3 points, e.g. has PDB id (
meta.structure.id
is not empty) or some other method that directly evaluates TCR:pMHC binding - target stimulation-based - 2 points
- staining-based - 1 points
- If verification is performed, then the TCR sequence is assumed to be known, so score from
1.
is set to 3
- direct method - 3 points, e.g. has PDB id (
The final score is then calculated as minimal between score from part 1.
and sum of scores from part 2.
and part 3.
.
Maximal score is then selected among different records (independent submissions, replicas, etc) pointing to the same unique complex entry (i.e. set of unique complex fields).
score | description |
---|---|
0 | Low confidence/no information - a critical aspect of sequencing/specificity validation is missing |
1 | Moderate confidence - no verification / poor TCR sequence confidence |
2 | High confidence - has some specificity verification, good TCR sequence confidence |
3 | Very high confidence - has extensive verification or structural data |
Database build contents
The final database assembly can be found in the database/
folder upon execution of BuildDatabase.groovy
script:
vdjdb_full.txt
- combined chunks with TCRalpha/beta records, antigen information, etc. All method and meta information are collapsed into two columns with corresponding names. VDJdb scores and CDR3 fixing information for TCR alpha and beta are given in separate columns. This is the raw version of VDJdb.vdjdb.txt
- a collapsed version of database used for annotation of single-chain TCR sequencing data by VDJdb-standalone software. Each line corresponds to either TCR alpha or TCR beta record as specified by thegene
column. TCR records coming from the same alpha-beta pair have the same index incomplex.id
column. In casecomplex.id
is equal to0
a record doesn't have either TCRalpha or TCRbeta chain information. This table is used by VDJdb-standalone and VDJdb-server.vdjdb.meta.txt
- metadata forvdjdb.txt
table, used by VDJdb-standalone and VDJdb-server.vdjdb.slim.txt
- a slim database used for annotation of single-chain TCR sequencing data by VDJdb-standalone software. This is a collapsed version ofvdjdb.txt
containing unique records for each CDR3:antigen pair and comma-separated lists of values for other columns (*.segm
,mhc.*
,complex.id
andreference.id
). This table can be easily parsed with R and Python/Pandas, it is intended for end users exploring VDJdb.vdjdb.slim.meta.txt
- metadata forvdjdb.slim.txt
table.motif_pwms.txt
andcluster_members.txt
- position-weight matrices for antigen-specific TCR motifs and representative sets of TCR sequences that constitute them. These tables are computed separately using code from vdjdb-motifs repository.
Note that some statistics can be generated by running R markdown templates in summary/
folder.
Building database release and generating summary figures
First make sure that you clone both vdjdb-db repo and vdjdb-motifs repo to the same folder, say ~/vcs
.
Then navigate to vdjdb-db
and run bash release.sh
. You can then find the output in ~/vcs/vdjdb-db/database
, ~/vcs/vdjdb-db/summary
and ~/vcs/vdjdb-motifs
folders. Note that you have to check .Rmd
files that will be executed and manually install missing R packages, as well as get VDJtools binary and place it in the path specified in ~/vcs/vdjdb-motifs/compute_vdjdb_motifs.Rmd
.
Database build process with Docker
The repository contains Dockerfile
to simplify the database building process. Dockerfile
instantiates the correct environment needed to build the database.
If you have Docker Desktop installed and running on your machine use the following command to build local Docker image:
docker build -t vdjdbdb .
In order to build the database using the newly created local Docker image create some folder (e.g. /tmp/output
) and use it as a external volume when running Docker image. Docker image always puts the result in /root/output
folder within docker container.
NOTE: Host path, e.g. /tmp/output
, should be absolute.
NOTE: Database building process requires at least 64GB of RAM.
mkdir -p /tmp/output
docker run -v /tmp/output:/root/output vdjdbdb