T and B cell repertoires are collections of lymphocytes, each characterized by its antigen-specific receptor. The resources available to generate the potential repertoires are described by the genomic T cell receptor (TR) and immunoglobulin (IG) loci. TR and IG are produced by random somatic rearrangements of V, D, and J genes during lymphocyte differentiation. The product of the V-(D)-J joining, called the complementarity determining region 3 (CDR3) and corresponding to the signature of the rearrangement, binds the antigen and is responsible for the specificity of the recognition. During their differentiation, lymphocytes are subjected to selective processes, which lead to deletion of most auto-reactive cells, selection, export, and expansion, of mature T and B cells to the periphery. Primary IG and TR repertoires are therefore shaped to generate the available peripheral or mucosal repertoires. In addition, several different functional T and B cells subsets have been identified, with differential dynamics and antigen-specific patterns. These available repertoires are dramatically modified during antigen-driven responses especially in the inflammatory context of pathogen infections, autoimmune syndromes, and cancer to shape actual repertoires. When considering the importance of efficient adaptive immune responses to get rid of infections naturally or to avoid auto-reactive damages, but also for therapeutic purposes such as vaccination or cell therapy, one realizes the relevance of understanding how lymphocyte repertoires are selected during differentiation, from ontogeny to aging, and upon antigenic challenge. However, immune repertoires of expressed antigen receptors are built by an integrated system of genomic recombination and controlled expression, and follow complex time-space developmental patterns. Thus, an efficient repertoire analysis requires both (1) methods that sample and describe the diversity of receptors at different levels for an acceptable cost and from a little amount of material and (2) analysis strategies that reconstitute the best multidimensional picture of the immune diversity from the partial information provided by the repertoire description as reviewed in Ref. (1). In the following sections, we summarize technologies developed over the past decades to describe lymphocyte repertoires and we present the growing number of analysis tools, evolving from basic to sophisticated statistics and modeling strategies with regards to the level of complexity of the data produced.
Methods Developed to Describe the IG and TR Repertoires
B and T lymphocyte repertoires can be studied from different lymphoid tissues and at various biological levels, such as cell membrane or secreted proteins, transcripts or genes, according to the techniques used. Fluorescence microscopy or flow cytometry techniques allow to track and sort particular cell phenotypes and to quantify the expressed repertoire at the single-cell level with V subgroup-specific monoclonal antibodies. Alternatively, the IG or TR diversity may be also analyzed using proteomics methods from either the serum (for IG) or dedicated cell extracts. Finally, molecular biology techniques assess the repertoire at the genomic DNA or transcriptional levels, qualitatively and/or quantitatively.
Analysis of IG and TR Repertoires at the Protein Level
Flow cytometry single-cell repertoire analysis
The frequency of lymphocytes expressing a given IG or TR can be determined using flow cytometry when specific monoclonal antibodies are available. This technique allows for the combined analysis of the antigen receptor and of other cell surface markers. Currently, using flow cytometry, up to 13 parameters can be routinely studied at once, reaching 20 parameters with the last generation flow cytometers and 70–100 parameters with mass cytometry (2). Seminal studies in mice using specific anti-TRBV antibodies have led to the characterization of the central tolerance selection processes that occur in thymus (3–5). Later on, a comprehensive description of the human TRBV repertoire was setup (6), when monoclonal antibodies became available for most of the TRBV subgroups. Repertoire analysis with flow cytometry provides a qualitative and quantitative analyses of the variable region, often done on heterogeneous cell populations, in order to decipher, for example, selection events related to aging, perturbations, and treatments (7). However, this technology is naturally limited by the availability of specific monoclonal antibodies, and does not address more detailed issues such as junction diversity. Furthermore, polymorphism of the IG or TR genes (8, 9) may constitute a serious limitation for a systematic survey using these approaches.
Proteomic repertoire analysis for serum immunoglobulins
Recent developments of proteomics tools now offer sensitivity levels applicable to IG repertoire analysis. Such a description at the protein level takes into account all post transcriptional and translational modifications.
PANAMA-blot technology. A semi-quantitative immunoblot, called the PANAMA-blot technique (10), allows for the identification of the antibody reactivities present in collection of sera (or cell culture supernatant) against a given source of antigens (10–12). Briefly, a selected source of antigens is subjected to preparative SDS-PAGE, transferred onto nitrocellulose membranes, then incubated with the serum to be tested allowing for the revelation of the bound antibodies using an appropriate secondary antibody coupled to alkaline phosphatase. Computer-assisted analysis of the densitometric profiles allows for the rescaling and the quantitative comparison of patterns of antibody reactivity from individuals in different groups. A large amount of data is generated when testing a range of sera against various sources of antigens. Statistical analyses are included in the PANAMA-Blot approach (as described further). This global analysis helped to reveal that the IgM repertoire in mice is selected by internal ligands and independent of external antigens (13).
This method can also lead to identify IG reactivity patterns specific for a type of pathology or clinical status and has been applied to both fundamental and clinical analysis. In particular, it was used to analyze human self-reactive antibody repertoires and their potential role for down-modulating autoimmune processes (14–16).
Antigen micro-array chips. More recently, antigen micro-array-based technology coupled to a complex two-way clustering bioinformatics analysis was developed to evaluate the serum repertoire antibodies from diabetes-prone individuals and revealed their predictive or diagnostic value. In brief, a range of antigens (proteins, peptides, nucleotides, phospholipids…) were plated onto glass plates and incubated with sera from individuals (human diabetes patients or mice in an experimental model of diabetes). The intensity of reactivity of the serum IG for each peptide was determined and scored against the control reactivity. Clustering analysis was then implemented to determine a potential antigen signature that significantly sorts out diabetes from non-diabetes individuals. In this way, it was found that the patterns of IgG antibodies expressed early in male NOD mice can mark susceptibility or resistance to diabetes induced later and that it is different than the pattern characteristic of healthy or diabetic mice after disease induction (17). Similarly, this clustering approach was applied in humans to successfully separate human subjects that are already diabetic from healthy people (18).
Repertoire Analysis at the Genomic DNA Level
Other strategies that cover IG or TR repertoire analyses have been developed at the genomic DNA level. Firstly, CDR3 spectratyping studies (detailed in the following section) have been carried out at the DNA level mostly to address issues related to B or T cell development (19, 20). More recently, an original multiplex genomic PCR assay coupled to real-time PCR analysis was developed to provide a comprehensive description of the mouse T cell receptor alpha (TRA) repertoire during development (21). Although these approaches can be applied to all IG isotypes and TR, they have not been used as much as transcript CDR3 spectratyping due to sensitivity and heterozygosity issues.
Immunoglobulin or T cell receptor repertoires can also be assessed by following the diversity of rearrangement deletion circles. Since they are produced by the V-(D)-J recombination machinery when the joint signal is formed and diluted in daughter cells, they give a good representation of recently generated T or B cells. This technique has been particularly useful for describing the restoration of T cell diversity following highly active antiretroviral therapy in HIV-infected patients (22) and has been used to model thymic export (23, 24) as well as to demonstrate continued contribution of the thymus to repertoire diversity, even in older individuals (25). It also reveals that thymic output is genetically determined, and related to the extent of proliferation of T cells at DN4 stage in mice (26). However, their analysis does not provide much insight into the level of diversity since the signal joint does not vary for a given combination of genes. Therefore, the interest of such analyses is reached when combined with CDR3 spectratyping analyses to know whether a repertoire perturbation is rather attributable to newly produced T cells or peripheral T cell proliferation.
V-(D)-J Junction Analysis of IG and TR Transcript Repertoires
Original molecular-based strategies for analyzing repertoire diversity relied on cloning and hybridization of molecular probes specific for IGHV gene subgroups first by RNA colony blot assay (27). This led to the observation that IGHV gene usage is characteristic of mouse strain and is a process of random genetic combination by equiprobable expression of IGHV genes (28). The study of selection processes revealed that the IGHV region-dependent selection determines clonal persistence of B cells (29) and that selection with age leads to biased IGHV gene expression (30).
In situ hybridization on single-cells revealed that during mouse ontogeny and early development of B cells in bone marrow, there is a non-random position-dependent IGHV gene expression, favoring D-proximal IGHV gene subgroup usage (31). Thereafter, sequencing of PCR-amplified cDNA collections were obtained from samples of interest. Although fastidious, these early studies have been useful in defining the basis of human IG and TR repertoires in terms of overall distribution, CDR3-length distribution, and V-(D)-J use (32–35), sometimes leading to the identification of new IG or TR genes. Later, more practical techniques have been developed for large-scale analysis of lymphocyte repertoires, such as quantitative PCR, micro-array, and junction length spectratyping, as described below.
Quantitative RT-PCR for repertoire analysis
In parallel to qualitative CDR3 spectratyping techniques (see section below), quantitative PCR strategies were developed (36). Coupling the two techniques for all V domain-C region combinations provides a complete qualitative and quantitative picture of the repertoire (37–39) described by up to 2,000 measurements per IG isotype or TR for one sample. With the development of real-time quantitative PCR, this approach opened the possibility for a more precise evaluation of repertoire diversity (39–41). Complementary tools have been also developed in order to allow normalization of spectratype analysis such as studies by Liu et al. (42) and Mugnaini et al. (43).
Matsutani et al. (44) developed another method to quantify the expression of the human TRAV and TRBV repertoires based on hybridization with gene specific primers coated plates. The cDNA from PBMC extracted RNA are ligated to a universal adaptor which allows for a global amplification of all TRAV or TRBV cDNAs. The PCR products are then transferred onto microplates coated with oligonucleotides specific for each TRAV or TRBV regions, and the amount of hybridized material is quantified. This technique was used to analyze the TR repertoire diversity of transplanted patients (45) and adapted to the study of mouse TRAV and TRBV repertoires (46). VanderBorght et al. also developed a semi-quantitative PCR-ELISA-based method for the human TRAV and TRBV repertoire analysis (38). The combined usage of digoxigenin (DIG)-coupled nucleotides and DIG-coupled reverse TRAC or TRBC primers allowed for a quantitative measurement of the amount of amplified DNA by a sandwich ELISA.
Du et al. (47) later setup a megaplex PCR strategy to characterize the antigen-specific TRBV repertoire from sorted IFNγ-producing cells after Mycobacterium infection. The clonotypic TRBV PCR products were used for Taqman probes design to quantify the expression of the corresponding clonotypes from ATLAS-amplified SMART cDNAs.
Direct measurement of lymphocyte diversity using micro-arrays
Another technology, similar to the one just discussed, has been developed by the group of Cascalho et al. which allows for a direct measurement of the entire population of lymphocyte-receptors. This is accomplished by hybridization of lymphocyte-receptor specific cRNA of a lymphocyte population of interest to random oligonucleotides on a gene chip; the number of sites undergoing hybridization corresponds to the level of diversity. This method was validated and calibrated using control samples of random oligonucleotides of known diversity (1, 103, 106, 109) (48, 49) and successfully demonstrated that central and peripheral diversification of T lymphocytes is dependent on the diversity of the circulating IG repertoire (49, 50). Similarly, a highly sensitive micro-array-based method has been proposed to monitor TR repertoire at the single-cell level (51).
CDR3 spectratyping techniques
Immunoscope technology. Among various techniques used to analyze the T or B cell repertoires, Immunoscope, also known as CDR3 spectratyping (52, 53) consists in the analysis of the CDR3-length usage so that antigen-specific receptor repertoires can be described by thousands of measurements. In the case of naive murine repertoires, T cell populations are polyclonal and analysis typically yields eight-peak regular bell-shaped CDR3 displays (wrongly assumed to be Gaussian), each peak corresponding to a given CDR3-length. When an immune response occurs, this regular polyclonal display can be perturbed: one can see one or several prominent peaks that correspond to the oligoclonal or clonal expansion of lymphocytes. A complete description of this technique and its applications to clinical studies has been published elsewhere (54).
In the original Immunoscope publication, Cochet et al. (55) analyzed the T cell repertoire after the immunization of mice with the pigeon cytochrome c. They provided the first description of an ex vivo follow-up of a primary T cell specific response in a mouse model. Their second paper analyzed the average CDR3-lengths as a function of TRBV-TRBJ combinations. In particular, the authors found a correlation between TRBV CDR1 and major histocompatibility (MH) haplotype (52). This group later published a large amount of original studies in various models such as lymphocyte development (40, 56–63), kinetics of antigen-specific responses (64–67), viral infection (68, 69), autoimmunity (70, 71), tumor-associated disease (72), and analysis of allogeneic T cell response and tolerance after transplantation (73). Notably, the combination of CDR3 spectratyping with flow cytometry-based IG or TR V frequency analysis provides a more comprehensive assessment, such as in Pilch et al. (74). For example, such an approach revealed the constriction of repertoire diversity through age-related clonal CD8 expansion (75). Similarly, a combination of CDR3 spectratyping, flow cytometry, and TR deletion circle analysis has allowed to define age-dependent incidence on thymic renewal in patients (76) or to evaluate the effects of caloric restriction in monkeys to preserve repertoire diversity (77). CDR3-length spectratyping was also used in other models, such as rainbow trout, to analyze TRB repertoire and its modifications induced by viral infection (78–80). While no tool such as monoclonal antibodies to T cell marker(s) was available in this model, this approach demonstrated that fish could mount specific T cell responses against virus, which could be found in all individuals (public clonotypes) or not (private clonotypes). Similar strategies, developed by other groups (81) and following the same approach in parallel, analyzed the IG repertoire in Xenopus at different stages of development, describing a more restricted IG junction diversity in the tadpole compared to the adult.
Gorski et al. (82) developed their own CDR3 spectratyping technique to analyze the complexity and stability of circulating αβT cell repertoires in patients following bone marrow transplantation as compared to normal adults. They showed that repertoire complexity of bone marrow recipients correlates with their state of immune function; in particular, individuals suffering from recurrent infections associated with T cell impairment exhibited contractions and gaps in repertoire diversity. The detailed procedure for this technique has been published in Maslanka et al. (83). A variation of this technique has been reported later by Lue et al. (84), relying on a compact glass cassette, a simpler device than the usual automated plate DNA sequencers.
Alternative technologies. Alternative CDR3 spectratyping techniques have been described such as single-strand conformation polymorphism (85–87) and heteroduplex analysis (88–91). These methods differ from the CDR3 spectratyping/Immunoscope technique mostly in the way PCR products are analyzed by performing non-denaturing polyacrylamide electrophoresis. The main advantage of these techniques is a more direct assessment of clonal expansion since PCR products migrate according to their conformation properties; therefore, presence of a predominant peak is strongly indicative of clonality when a smear migration pattern indicates polyclonality. However, these techniques have been less widely used probably because of the difficulty to make clear correlations between the expanded peaks across samples.
Another original alternative technique has been described by Bouffard et al. (92), analyzing products obtained after in vitro translation of PCR-amplified TR-specific products by isoelectric focusing. With this technique, clonality can also directly be assessed by looking at the obtained migration profile.
IG/TR Rearrangement Sequencing: From Cloning-Based- to Next-Generation-Sequencing
In order to get a better description of IG/TR diversity at the nucleotide sequence level, thus providing fine-tuned description of the actual diversity, Sanger sequencing approaches relying on bacterial cloning of rearrangements were performed in physiological conditions globally (60, 93–99) or partially to characterize particular expansions identified by other technologies such as CDR3 spectratyping (40, 59, 100–102), flow cytometry (103). They were also used in pathological/infectious conditions (104–107) sometimes leading to antigen-specific T cell TR identification and quantification through the combination of antigen-specific T cell stimulation and cytometry-based cell sorting, anchor-PCR, and bacterial cloning-based sequencing (108).
These studies pioneered the description of the repertoire and provided fruitful information regarding the extent and modification of the diversity. However, besides being time and cost-extensive, such approaches have allowed for the analysis of 102–103 sequences, far under the estimated diversity reaching 106–107 unique clonotypes in mice and humans (40, 59, 109).
In the last decade, DNA sequencing technologies have made tremendous progresses (110) with the development of so called next-generation sequencers, already reaching four generations (111). Those instruments are designed to sequence mixtures of up to millions of DNA molecules simultaneously, instead of individual clones separately. Second generation sequencers became affordable in the last 5 years and have been used for immune repertoire analysis, starting with the seminal work of Weinstein et al. (112) where the IG repertoire of Zebrafish has been described by large-scale sequencing. Consequently, exploratory works by other groups provided an overview of the complex sequence landscape of immune repertoires in humans (113–118). More recent work aimed at addressing fundamental questions such as lineage cells commitment (119–122), generation of the diversity processes (123–125), and diversity sharing between individuals (126, 127). Finally, the power of this technology has been validated in the clinic as well (128, 129).
As seen above for other technologies, combinations of approaches have been applied to NGS. Notably, deep sequencing has been used in combination with CDR3-length spectratyping by some groups to study human (130) or rainbow trout IG (131) repertoire modifications after vaccination against bacteria or viruses. In the latter, pyrosequencing performed for relevant VH/Cμ or VH/Cτ junctions identified the clonal structure of responses, and showed, for example, that public responses are made of different clones identified by (1) distinct V-(D)-J junctions encoding the same protein sequence or (2) distinct V-(D)-J sequences differing by one or two conservative amino acid changes (131) as described for public response in mammals (132, 133). These studies showed that NGS and traditional spectratyping techniques lead to remarkably similar CDR3 distributions.
Several NGS have been developed in the past years using different sequencing technologies characterized with different speed, deepness and read length. Metzker thoroughly reviewed their principles and properties (134). Among them, three platforms, all offering benchtop sequencers with reduced cost and setup, fit with immune repertoire analysis in terms of read length and deepness. The 454/Roche platform uses pyrosequencing technology (135), which combines single nucleotide addition (SNA) with chemoluminescent detection on templates that are clonally amplified by emulsion PCR and loaded on a picotiter plate. Pyrosequencing currently has a 500 bp (GS Junior) to 700 bp (GS FLX) sequencing capacity with a respective deepness of 150,000–3,000,000 reads per run (134). The Illumina/Solexa platform technology is based on cyclic reversible termination (CRT) sequencing (an adaptation of Sanger sequencing) performed on templates clonally amplified on solid-phase bridge PCR. Protected fluorescent nucleotides are added, imaged, delabeled, and deprotected cyclically (134), providing a deeper sequencing (from 15 to 6 billion reads per run for the MiSeq to the HiSeq2500/2000) of shorter reads (100–250 bp for the very recent MiSeq) with the possibility to perform pair-end sequencing (two-side sequencing) to increase the read length after aligning the generated complementary sequences. A more recent platform, Ion Torrent/Life Technologies using an imaging free detection system may open a new era in terms of deepness (one billion reads per run) of 200 bp reads (136) in a very short time and on a benchtop sequencer. Importantly, depending on the technology, errors due to the PCR-based sample preparation and the sequencing are of major concern. Bolotin et al. (137) evaluated this issue on TR repertoire analysis of the same donor performed on the three platforms described previously; algorithms for error correction have been developed. Indeed, PCR- and sequencing-related errors represent the major concern for immune repertoire diversity analysis as they may generate artificial diversity. Illumina and 454 appear to be the most robust technologies, with Illumina having the highest throughput and 454 generating the longest reads. The currently available Ion Torrent platform, although very promising, has been shown to display the highest rate of errors in TR (137) and bacterial DNA (138) sequencing. However, such error corrections must be used with caution since they may inadvertently underestimate repertoire diversity by removing rare sequences.
With the power of such approach for genomics and transcriptomics studies in general, constant improvements are achieved to increase the sequencing deepness and read length as well as to reduce the cost, therefore offering multitude of biological explorations (139). NGS now permits a comprehensive and quantitative view of IG and TR diversity by combining and improving the sensitivity of classical approaches with accurate and large-scale sequencing. NGS has the power to identify IG or TR specific for given antigens (in combination with antigen-specific assays) and to define more complex signatures (i.e., TR sets) related to disease and/or treatment from heterogeneous T and B cell populations. Still, most of the deep sequencing efforts have been limited to only one chain of the receptor at the repertoire level (usually the β chain for TR and the heavy chain for IG). Indeed, current high-throughput approaches do not allow one to assign which combination of chains (TRA and TRB, or IGH and IGK or IGL) belong to which cell (140). A recent development by DeKosky et al. proposed a reasonably high-throughput technology to assess massively paired IG VH and VL from bulk population (141). In parallel, Turchaninova et al. (142) have proposed a similar approach for the paired analysis of the TRA and TRB chains. The parallel development of high-throughput microfluidic-based single-cell sorting will certainly push forward new developments in the field (143).
However, despite the technological advance, studies so far have mainly reported CDR3 counting and identification of major expansions. The complexity of immune repertoires is still a matter that such approach cannot completely overcome, due to the paucity of powerful analytical methods. Besides data management tools, studies are now starting to extract most of the benefit from such approach to model the immune repertoire diversity and dynamics (144), an approach that may help in understanding the interplay between cells and repertoire shaping. Accurate and powerful statistical analyses are required to manage such amount of information. Current state will be reviewed in the following sections.
Potential and Genomic Repertoires: A Question of Ontology and Orthology
Immune repertoires sensu stricto are expressed by lymphocyte clones, each carrying a single receptor for the antigen. Such receptors comprise IG and TR in jawed vertebrates (8, 9) and VLR in Agnathans (145). The sequences of these receptors are available in databases such as GenBank or EMBL, which are difficult to use for transversal studies due to inconsistent annotation. The IMGT® information system (see below) has largely solved this problem setting standardized gene nomenclatures, ontologies and a universal numbering of the IG/TR V and C domains, thus giving a common access to standardized data from genome, proteome, genetics, two-dimensional, and three-dimensional structures (146). The accuracy and the consistency of the IMGT® data are based on IMGT-ONTOLOGY, the first, and so far, unique ontology for immunogenetics and immunoinformatics (147).
With the development of high-throughput sequencing, large numbers of new sequences of antigen receptor genes have become available, which can be classified into different categories: genomic sequences of IG or TR (in germline configuration in genome assemblies) or fragments of IG/TR transcripts, containing the CDR3 or not. Also, these datasets can be produced from species newly sequenced, as well as from new haplotypes of well-described species.
The annotation of such sequences remains an open question. Manual annotation is not applicable, and no good automated approach has been validated yet. A relevant annotation of these massive datasets will require the integration of genomic and expression data with existing standardized description charts, as offered by IMGT®. A standardized annotation is an important issue since it facilitates the re-utilization of datasets and comparison of analyses. Thus, the description of IG and TR polymorphisms, the integration of repertoire studies with structural features of antigen-specific domains, and even the usage of new genes in genetic engineering rely on a common standard for nomenclature, numbering, and annotation (147).
To take advantage of the current standards that have been established from classical sequencing data during the last 25 years, new, fast, reliable, and human-supervised annotation methods will have to be developed, integrating directly high-throughput sequence information from the increasing number of deep sequencing platforms and technologies, at different genetic levels (genome, transcriptome, clonotype repertoires). Along this line, IMGT/HighV-QUEST offers online tools to the scientific community for the analysis of long IG and TR sequences from NGS (148).
Special attention can be paid to the orthology/paralogy relationships between similar antigen receptor genes from different species. These characteristics are essential to understand the dynamics of IG and TR loci. In fact, with many important lymphocyte subsets characterized by canonical/invariant antigen receptors, such relationships are critical to transfer functional knowledge between models. Importantly, the phylogenetic analyses required to reconstitute the evolution of antigen receptor genes are based on multiple alignments, the quality of which is highly dependent on common numbering and precise annotation of sequences.
As far as immune repertoires are characterized by the diversity of receptors specifically binding antigen/pathogen motifs to initiate a defense response, they might not be limited to lymphocyte diversifying receptors, e.g., IG, TR, and VLR. The particularity of these systems is a somatic diversification combined to a clonal structure of the repertoire, each lymphocyte clone expressing the product of a recombination/hypermutation and/or conversion process. However, many other arrays of diverse receptors binding or sensing pathogens have been discovered in metazoans, in invertebrates as well as in vertebrates.
In some cases, their diversity is really “innate,” i.e., encoded in the genome as multiple genes produced by duplications. Fish NLR, finTRIMs, and NITR, primate KIR, chicken CHIR, or TLR in sea urchin, constitute good examples of such situations. While these repertoires may appear as relatively limited, polymorphism within populations, and differential expression of receptors per cell upon stimulation represent complex issues, which fall well into “traditional” repertoire approaches.
In other cases, receptors are subject to diversification processes much faster than gene duplication, which does not comply with a clonal selection pattern. The best examples are probably the DSCAM in arthropods, which hugely diversify by alternative splicing of exons encoding half-IgSF domains (149, 150), and the FREP lectins in mollusks, of which sequences are highly variable at the population level, and even between parents and offspring produced by auto-fecundation (151).
The number of such “innate” repertoires which are not expressed by clonally selected lymphocytes will likely increase with deep sequencing of new genomes/transcriptomes, as illustrated by a recent report from mussel (152). A good example of the importance of a proper structural description of key domains of receptors is provided by the extensive analysis of LRR motifs in studies on TLR evolution (153, 154). Further insights into the functions of such diverse proteins will be provided by the characterization of their expressed (available) repertoire, at different levels such as single-cells, cell populations, and animal populations.
Such analyses will require precise identification of genes and sequences as well as mutations, and a standardized approach of nomenclature and structural description will be as useful as it is for the vertebrate IG and TR sequences. Importantly, these receptors are made of a small number of structural units, such as IgSF domain or LRR domains, which suggests that standardized system(s) for sequence annotation could be developed following IMGT standards (155).
Statistical Analysis and Modeling of Immune Repertoire Data
Statistical Repertoire Analysis
The description of the repertoire modifications using flow cytometry or Immunoscope provided clear-cut and detailed insight into the clonal expansion processes during the responses against a defined antigen (64, 66). However, it is difficult to identify the relevant alterations of the repertoires in more complex situations such as pathogen infections or variable genetic backgrounds. For example, it appeared impossible to identify all significant modifications of TRB Immunoscope profiles during cerebral malaria by direct ocular comparison (107). Different methods were therefore developed to extract from IG and TR repertoire descriptions the relevant information, to encode it as numerical tables and to analyze them with statistical models.
CDR3 spectratype perturbation indices
Since the initial description of the CDR3 spectratyping technique, different scoring indices were developed or derived from the literature: “relative index of stimulation” (RIS) (55), “overall complexity score” (156), Reperturb (157), “complexity scoring system” (158), COPOM (159), Oligoscore (160), TcLandscape (161), “spectratype diversity scoring system” (162), Morisita-Horn index and Jaccard index (95–97), “absolute perturbance value” (163). A comparative review of such scoring strategies was published by Miqueu et al. (164).
In particular, the perturbation index Reperturb was developed by Gorochov et al. to perform TR repertoire analysis in HIV patient during progression to AIDS and under antiretroviral therapy. They could show drastic restrictions in the CD8+ T cell repertoire at all stages of natural progression that persisted during the first 6 months of treatment. In contrast, CD4+ T cell repertoire perturbations correlated with progression to AIDS with a return to a diversified repertoire in good responders to treatment (157).
Soulillou et al. refined this approach by combining the qualitative information obtained with usual CDR3 spectratyping with quantitative information of TRBV usage obtained by real-time quantitative PCR. They devised a four-dimension representation that represents TRBV subgroups, CDR3-length and percentage of TRBV use on three axis chart in addition to a color-coded representation of the CDR3 profile perturbation. Using this original approach, they were able to show that graft rejection is associated with a vigorous polyclonal accumulation of TRBV mRNA among graft-infiltrating T lymphocytes, whereas in tolerated grafts T cell repertoire is strongly altered (161, 165). Their study puts the emphasis on the importance of not only qualitative but also quantitative analysis of lymphocyte repertoires.
Platforms for repertoire data management and statistical analysis
Several platforms have been developed and rely mostly on CDR3 spectratyping and sequencing data, with recent developments to manage and analyze NGS data.
The ISEApeaks strategy and software were developed in order to satisfy the needs for efficient automated electrophoresis data retrieval and management (160, 166). ISEApeaks extracts peak area and length data generated by software used to determine fragment intensity and size. CDR3 spectratype raw data, consisting of peak areas and nucleotide lengths for each V-(D)-J-C combination, is extracted, smoothed, managed, and analyzed. The repertoires of different samples are gathered in a peak database and CDR3 spectratypes can be analyzed by different perturbation indices and multivariate statistical methods implemented in ISEApeaks. We have applied our ISEApeaks strategy in several studies. In an experimental model of cerebral malaria, we established a correlation between the quality of TR repertoire alterations and the clinical status of infected mice, whether they developed cerebral malaria or not (107). We contributed to the characterization of the membrane-associated Leishmania antigens (MLA) that stimulates a large fraction of naive CD4 lymphocytes. Repertoire analyses showed that MLA-induced T cell expansions used TR with various TRBV rearrangements and CDR3 lengths, a feature closer to that of polyclonal activators than of a classic antigen (167). We also revealed repertoire age-related perturbations in mice (7). ISEApeaks functions for statistical analysis was successfully applied to analyze the TR repertoire in fish as shown by our detailed analysis of the TRB repertoire of rainbow trout IELs, performed in both naive and virus-infected animals. Rainbow trout IEL TRBV transcripts were highly diverse and polyclonal in adult naive individuals, in sharp contrast with the restricted diversity of IEL oligoclonal repertoires described in birds and mammals (102). More recently, our study of the CD8+ and CD8− αβ T cell repertoire suggests different regulatory patterns of those T cell patterns in fish and in mammals (168). ISEApeaks was also used to implement a new statistically based strategy for quantification of repertoire diversity (159).
Kepler et al. described another original statistical approach for CDR3 spectratype analysis, using complex procedures for testing hypotheses regarding differences in antigen receptor distribution and variable repertoire diversity in different treatment groups. This approach is based on the derivation of probability distributions directly from spectratype data instead of using ad hoc measures of spectratype differences (169). A software (called SpA) implementing this method has been developed and made available online (170). This approach has been used in a longitudinal analysis of TRBV repertoire during acute GvHD after stem cell transplantation (171).
Another group (163) reported the development of a new software platform, REPERTOIRE, which allows handling of CDR3 spectratyping data. This software implements a perturbation index based upon an expected normal Gaussian distribution of CDR3 length profiles.
Owing to the complexity and diversity of the immune system, immunogenetics represents one of the greatest challenges for data interpretation: a large biological expertise, a considerable effort of standardization, and the elaboration of an efficient system for the management of the related knowledge were required. To answer that challenge, IMGT®, the international ImMunoGeneTics information system®(http://www.imgt.org), was created in 1989 by one of the authors (146). Overtime, it developed standards that, since 1995, have been endorsed by the World Health Organization-International Union of Immunological Societies (WHO-IUIS) Nomenclature Committee and by the WHO-International Nonproprietary Names (INN) (172–175). IMGT® comprises seven databases (sequence, gene, and structure databases), 17 online tools and more than 15,000 pages of web resources. Among the databases, IMGT/LIGM-DB, the database for nucleotide sequences (170,685 sequences from 335 species as of July 2013) and IMGT/GENE-DB, the gene database (3,081 genes and 4,687 alleles) are of great interest for repertoire analysis. Freely available since 1997, IMGT/V-QUEST is an integrated system for the standardized analysis of collections of IG and TR rearranged nucleotide sequences (176, 177). A high-throughput version, IMGT/HighV-QUEST (148), has been released in 2010 for the analysis of long IG and TR sequences from NGS using the 454 Life Sciences technology. In the same line, other analysis tools are becoming available showing the renewed interest for repertoire analyses and modeling consecutive to NGS technology developments (178–181).
Altogether, these efforts highlight the relevance of developing more efficient and powerful technologies for the evaluation of repertoire diversity. Notably, two successful French biotech companies (TcLand, Nantes; ImmunID, Grenoble) were created in the field of repertoire analysis, using different technologies. In collaboration with ImmunID, we have proposed a novel strategy for statistical modeling of T lymphocyte repertoire data obtained in humans and humanized mice. With this model, we revealed that half of the human TRB repertoire, in terms of proportion of TRBV-TRBJ combinations, is genetically determined, the other half occurring stochastically (182). In addition, the biotechnology company “Adaptive” and the “Repertoire 10K (R10K) Project” have been recently founded by researchers respectively from the Fred Hutchinson Cancer Research Center (Seattle and Washington) and the HudsonAlpha Institute (Huntsville). Both have developed platforms (immunoSEQ®, iRepertoire®) providing researchers with a global analysis of the T or B cell receptor sequence repertoires (183). However, despite the power of this technology, studies are still limited by the ability to process the complexity of the information provided. Specific software developments for the automatic treatment and annotation of IG and TR sequences and the statistical modeling of repertoire diversity can still be improved.
As mentioned above, the PANAMA-Blot technique also includes statistical analysis of the data. Multi-parametric analysis was introduced to compare the global reactivity of antibodies of different individuals in different groups with a given antigenic extract. This analysis has been successfully implemented to identify reactivity patterns specific for a given pathology or clinical status (10–12, 14, 15, 184). Similarly, multi-parametric analysis was also applied to TRBV spectratype analysis in an experimental cerebral malaria model (107).
Hierarchical clustering or classification algorithms have become very popular with the growing of micro-array-based transcriptome analysis. Although still uncommon for immune repertoire analysis, such approaches have been employed to categorize large sets of repertoire data without a priori (17, 102, 107).
The concept of immune repertoire has been devised to describe the diversity of cells involved in the immune system of an individual (1). As described above, different scoring systems were developed to assess this diversity, some are heuristics but others have been borrowed from theoretical ecology and evolution. As reviewed by Magurran (185
Despite the growing number of immune repertoire sequencing studies, the field still lacks software for analysis and comprehension of this high-dimensional data. Here we report VDJtools, a complementary software suite that solves a wide range of T cell receptor (TCR) repertoires post-analysis tasks, provides a detailed tabular output and publication-ready graphics, and is built on top of a flexible API. Using TCR datasets for a large cohort of unrelated healthy donors, twins, and multiple sclerosis patients we demonstrate that VDJtools greatly facilitates the analysis and leads to sound biological conclusions. VDJtools software and documentation are available at https://github.com/mikessh/vdjtools.
High-throughput profiling of T- and B-cell antigen receptor repertoires promises great advances in our understanding of the mechanisms underlying adaptive immune system function, treatment of autoimmune and infectious diseases, and development of novel approaches in cancer immunotherapy. A number of recently developed software tools aim at processing immune repertoire data by mapping Variable (V), Diversity (D) and Joining (J) antigen receptor segments to sequencing reads and assembling T- and B-cell clonotypes. Nevertheless, there still exists a major gap in common methods of data post-analysis in the field: there is no standardized data format so far, and most of data comparative analysis is carried out using a variety of in-house scripts. Here we present VDJtools, a software framework that can analyze output of most commonly used TCR repertoire processing tools and allows applying a diverse set of post-analysis strategies. The main aims of our framework are: To ensure consistency of post-analysis methods and reproducibility of obtained results; to save the time of bioinformaticians analyzing TCR repertoire data by providing comprehensive tabular output and open-source API; and to provide a simple enough command line tool so that immunologists and biologists with little computational background could use it to generate publication-ready results.
Citation: Shugay M, Bagaev DV, Turchaninova MA, Bolotin DA, Britanova OV, Putintseva EV, et al. (2015) VDJtools: Unifying Post-analysis of T Cell Receptor Repertoires. PLoS Comput Biol 11(11): e1004503. https://doi.org/10.1371/journal.pcbi.1004503
Editor: Paul P. Gardner, University of Canterbury, NEW ZEALAND
Received: May 7, 2015; Accepted: August 13, 2015; Published: November 25, 2015
Copyright: © 2015 Shugay et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited
Data Availability: Raw data for multiple sclerosis patients is deposited at http://www.ncbi.nlm.nih.gov/bioproject/?term=PRJNA280417
Funding: This work was supported by the Russian Science Foundation project №14-14-00533 (VDJtools development) and RFBR grant 13-04-00998 (cDNA libraries preparation). The work was carried out in part using equipment provided by the Shemyakin– Ovchinnikov Institute of Bioorganic Chemistry Core Facility. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.
This is a PLOS Computational Biology Software paper
The advent of high throughput sequencing (HTS) has opened a new venue for the studies of genomics of adaptive immunity that involve deep profiling of T-cell receptor (TCR) and B-cell receptor (BCR) gene repertoires encoding a myriad of antigen specificities.
Huge volumes of complex data produced by the immune repertoire profiling have led to the development of a diverse set of software tools, which often complement each other. We [1–3] and others [4–7] have recently contributed several tools that handle large amounts of raw HTS data to process it into a human-readable list of clonotypes characterized by Variable (V), Diversity (D), Joining (J) segments and V-(D)-J junction sequences of receptor genes. While such processed data carry nearly exhaustive information on the sampled immune repertoire, this information yet needs to be convolved, scaled and compared across various samples to result in sound biological conclusions.
Post-analysis of immune repertoire data is a challenging task owing to extreme diversity of TCR and BCR sequences. For example, in technically similar microbiome profiling by 16S rRNA sequencing one deals with thousands of operational taxonomic units that represent various species , while typical TCR repertoire samples may contain hundreds of thousands [9,10] of clonotypes. Moreover, the species phylogeny and annotation is well developed in the field of microbiology , while immune repertoires remain poorly annotated. To illustrate this, a simple query with “16S rRNA” currently yields more than 8 million records in GenBank, while there are only 37 thousand records annotated as “T-cell receptor”. However, unsupervised methods of studying repertoires, for example based on sample overlap, could turn out very promising, as there exists a relatively limited diversity of overlapping clonotypes [12–15].
In the light of recent advances in storage and processing of immunological big data , community-driven initiatives for immune repertoire data sharing and analysis are likely to emerge, for example VDJserver portal  which is currently under development. There are several commonly used ways to survey immune repertoire information obtained from HTS, such as tracking individual clonotypes [18,19], comparing immune receptor segment usage [20,21] and comparing repertoire diversity . Still those are overwhelmingly performed using in-house scripts or even manually. This is becoming a major obstacle, as comparison and annotation of samples based on data generated in other studies is critical for comprehensive analysis of immune repertoire sequencing data. In contrast, similar fields, such as metagenomics, have a plethora of such instruments .
The VDJtools software package presented here aims at filling this gap by incorporating a comprehensive set of routines for analysis of TCR repertoire sequencing data (Fig 1). The variety of implemented algorithms range from basic statistics calculation and clonotype table filtering to advanced routines such as repertoire clustering and computationally intensive routines such as clonotype table joining.
Fig 1. Overview of VDJtools software package.
VDJtools analysis routines can be grouped into 6 modules and are applicable to results produced by commonly used immune repertoire sequencing processing software. Basic statistics and segment usage module include general statistics (clonotype and read count, number and frequency of non-coding clonotypes, convergent recombination of CDR3 amino acid sequences, insert size statistics, etc), spectratyping (distribution of clonotype frequency by CDR3 length), Variable and Joining segment usage profiles and their pairing frequency in re-arranged receptor junction sequences. Repertoire overlap module includes routines for computing sets of overlapping clonotypes and their characteristics, and scatter plots of clonotype frequencies. Diversity analysis includes routines for visualizing clonotype frequency distribution, computing repertoire diversity estimates and rarefaction plots. The fourth set of routines can be used to create clonotype abundance profiles and track clonotypes in time course of vaccination, myeloablation and blood cell transplant. Sample clustering is implemented based on computed repertoire similarity measures and could be used to distinguish various biological conditions, cell subsets and tissues. Auxiliary routines provide means for clonotype table filtering (e.g. by segment usage or non-coding CDR3 sequence) as well as annotation with custom or pre-built pathogen-specific clonotype database. VDJtools can be incorporated in Java programming language-based pipelines as demonstrated by VDJviz clonotype browser.
VDJtools can calculate basic immune repertoire statistics that were commonly used in pre-HTS era repertoire analysis. Those include in silico spectratype (the distribution of lengths of CDR3 nucleotide sequences) that was first introduced with corresponding molecular biology assay , and various Variable/Joining segment usage statistics that root in flow cytometry analysis of T- and B-cell populations.
The framework provides means for analyzing the diversity of immune repertoires, such as normalized unique clonotype counts (with an option to account for convergent recombination), clonotype frequency distribution, as well as rarefaction curves and lower bound estimates of total repertoire diversity widely applied in ecology field . The concept of repertoire diversity is of great importance, as it reflects the ability of our immune system to effectively withstand a multitude of encountered pathogens . By applying computational methods one could estimate how the diversity is influenced by various processes, such as aging , vaccination, and infection . Diversity measures could also be used to compare the structure of T- and B-cell repertoires in samples derived from a variety of tissues and subjects .
Advanced set of VDJtools methods includes cluster analysis of repertoire samples and clonotype tracking which have a wide range of applications. Machine learning methods such as hierarchical clusterization and multi-dimensional scaling can aid in learning T-cell antigen specificities and disease biomarker patterns from high-dimensional TCR data . Clonotype tracking is useful in studying immune repertoire dynamics associated with vaccination , autologous hematopoietic stem cell transplantation (HSCT) [19,30,31], checkpoint inhibitors , etc., as well as in detection of minimal residual disease in lymphoid malignancies [33–37].
An overview of 20 recently published immune repertoire studies (S1 Table) demonstrates that VDJtools can perform most of emerging post-analysis tasks therefore greatly facilitating the analysis process and removing the need to develop multiple custom scripts. Currently there are few software tools capable to perform post-analysis of immune repertoire data [7,38,39], all of which provide less functionality when compared to VDJtools (S2 Table). Moreover, in contrast to VDJtools which can handle output generated by various pre-processing software, these tools only support datasets in their internal formats.
Design and Implementation
The study was approved by ethics committee of the Russian Children's Hospital from January 20, 2011.
The core API of the software is implemented in Java/Groovy languages and automatically resolves all dependencies during compilation using Maven. The API includes generalized entities, such as Clonotype, Sample and SampleCollection classes, and allows storing sample metadata using MetadataTable class. The API also contains a comprehensive set of routines for computing sample-specific and cross-sample statistics, which are optimized for parallel computation. VDJtools API can be easily integrated in any software written in Java or related programming languages (e.g. Groovy, Scala and Clojure). VDJtools is an open-source software, the source code can be accessed at GitHub .
Comprehensive software documentation is hosted at ReadTheDocs  and contains basic usage guidelines (including the description of common pitfalls), a summary of implemented algorithms, as well as examples that cover some typical VDJtools usage cases. The documentation also contains step-by-step instructions for reproducing the analysis described in present paper.
VDJtools has a command line interface that allows executing analysis routines that produce tabular and publication-ready graphical output. Tabular output can be used for post-hoc analysis in R or explored in spreadsheet software such as Excel. Plotting parameters are optimized to provide the most intuitive and comprehensive graphical representation for most usage cases while users can specify their own sample groups and factors to be visualized.
VDJtools accepts tabular output of commonly used pre-processing software: MIGEC , MiTCR , ImmunoSEQ , IMGT/HighV-QUEST , and MiXCR . VDJtools also supports IgBlast  software format. Of note, using IgBlast requires a considerable amount of parsing and post processing, as it only reports Variable segment alignment and doesn’t provide the CDR3 sequence. Moreover, vanilla IgBlast doesn’t accept FASTQ format input, does not provide clonotype assembling (grouping of sequencing reads with identical Variable segment, Joining segment and CDR3 sequence) and is not optimized for parallel computations. We have implemented all those features in our wrapper for IgBlast software, MIGMAP, that could be downloaded from . VDJtools converts all input datasets to its own internal format, which is a tab-delimited table containing abundance, CDR3 sequence, V, D and J segment names and markup of CDR3 sequence germline regions.
An immune repertoire browser VDJviz which serves as a lightweight GUI for VDJtools was built using Play framework and VDJtools API and could be accessed at .
Raw data for multiple sclerosis patients is deposited at SRA (PRJNA280417). Pre-processed clonotype tables can be found in a separate GitHub repository , which also contains shell scripts that can be used to reproduce the analysis.
To demonstrate the efficiency of VDJtools, we have analyzed TCR beta repertoires for the peripheral blood samples of 13 young (6–15 years old) individuals diagnosed with multiple sclerosis (MS1-13), and 6–25 years old control group (C1-11) described in Ref. . The multiple sclerosis dataset was prepared and sequenced using the same protocol as the control one. We have also included a sample from the MS8 patient after hematopoietic stem cell transplant (MS8HSCT). The list of samples is provided in S3 Table.
To remove quantitative biases and reduce possible impact from PCR and sequencing artifacts, we have utilized unique molecular identifiers [10,45,46]. Analysis of raw molecular barcoded data was performed using our MIGEC software. Molecular identifier groups represented by a single read were discarded, and the remaining groups were subjected to cDNA consensus assembly and CDR3 extraction as previously described . Hereafter we will use the term T-cell receptor beta chain cDNA molecules (TRBM) for describing clonotype count units. Note that in these experiments we obtained ~0.5 mln cDNA molecules per ~1–10 mln starting T-cells, so we can assume that each TRBM roughly represents a single T cell.
Estimating repertoire diversity
We have started our analysis by comparing the repertoire diversity of MS and control samples. To support the diversity measure choice and check for possible biases we have performed a benchmark on previously published T cell immunity aging data  and additional ANOVA analysis to identify factors that bias diversity estimates (S1 Text, S1 Fig, S4 Table). We have used common diversity measures: the observed diversity (number of unique clonotypes), Chao  and Efron  estimates for lower bound on total species diversity, Shannon  and Simpson  indices, as well as extrapolated Chao estimate .
The benchmark, in which correlation with a physiological (age) and immune status (naïve T-cell count) factors was compared for various diversity estimates, has shown that best correlation can be achieved when samples are normalized to the same size (TRBM count). Correspondingly, ANOVA analysis suggests a strong sampling-related bias. Accounting for this bias is especially important in present case as the rarefaction curves are far from saturation (Fig 2A). Notably, lower bound estimates of total repertoire diversity that are especially affected by sampling bias were applied in some recent studies for the comparison of TCR repertoire diversity under uneven sample sizes [9,52].
Fig 2. Estimation of repertoire diversity using multinomial model.
A. Rarefaction analysis of repertoire samples from healthy donors and multiple sclerosis patients. The number of unique clonotypes in a sub-sample plotted against its size (number of T-cell receptor cDNA molecules, TRBM). Solid and dashed lines are diversity estimates computed by interpolating and extrapolating using a multinomial model respectively . Note that generally rarefaction curves for MS samples go below those of control donors. Post-HSCT sample (MS8-HSCT) displays the lowest diversity. B. Comparison of repertoire diversity using normalized Chao1 estimate. Normalization is performed by down-sampling datasets to the size of smallest dataset and computing the estimate for resulting datasets (mean estimate value from n = 3 re-samples is used). MS8-HSCT sample is discarded from calculations. *—P = 0.022, two-tailed T-test; effect size estimated by Cohen’s d is 0.98.
Using Chao1 estimate  for normalized datasets that has shown the best performance together with Efron estimate (yet is far simpler to compute) in the aforementioned benchmark, one can discover that MS samples have a significantly lower diversity than the controls (Fig 2B). This suggests a substantial expansion of T-cell clones in peripheral blood of MS patients, an observation previously supported only by local measurements such as Sanger sequencing of individual T-cells and spectratyping assays . As control population is slightly older than MS group one can expect even more profound difference in case exact age matching is achieved for the control group . Still, there is no significant difference for the directly observed sample diversity (S2 Fig), which is likely due to the fact that this estimate doesn’t account for the clonotype frequency distribution in sample and thus is less sensitive.
Cluster analysis of repertoires
As there is currently no study describing an application of cluster analysis to a large set of immune repertoire datasets coming from different individuals, we have performed a benchmark of various clustering strategies using a recently published twins TCR repertoires study . We have tested the ability to distinguish TCR repertoires of identical twins from those of unrelated individuals for several commonly used similarity measures, correlation of overlapping clonotype frequencies (R), geometric mean of total frequencies of overlapping clonotypes (F), normalized number of overlapping clonotypes (D, ), Jaccard  and Morisita-Horn indices . Only the F similarity measure showed significant difference for both TCR alpha and beta chain datasets (S1 Text, S5 Table and S3 Fig). At the same time, it should be noted that R and D measures also proved to be useful in other experimental setups. For example, R measure accurately separated TCR alpha repertoires for the T cell subsets and tissues, as well as mutant and control mice Treg repertoires .
We have next used cluster analysis to explore whether TCR beta repertoires of MS patients can be distinguished from healthy controls. As some samples were prepared in parallel with single-end sample barcoding, joined and then co-amplified after Illumina adapter ligation, we first checked for the possibility of cross-sample contamination (S4 and S5 Figs). It turned out that direct clustering of samples with F measure resulted in a strong co-clustering of samples prepared in the same batch. To correct for batch effect, we have selected “amino acid NOT nucleotide” clonotype intersection matching rule, i.e. matching of CDR3 amino acid, but not the nucleotide sequences.
Hierarchical clustering with F similarity measure and “amino acid NOT nucleotide” clonotype matching rule showed some co-clustering for control but not MS datasets (Fig 3A). Further exploration with multidimensional scaling (MDS) method showed that control repertoires of healthy children are more similar to each other according to F similarity measure, while MS repertoires are all different (Fig 3B and 3C). This result is quite similar to our observations of age-related changes in TCR repertoires (our unpublished data). With aging, expansion of antigen-specific clones moves away native repertoires that are initially more close to each other due to the public clonotypes that are frequently produced in recombination . This is in line with observation of early clonal T-cell expansions in MS children (see “Estimating repertoire diversity” section above). Since those expanding T-cell clones, including potentially autoreactive ones, are predominantly private to an MS-affected person [59–61] this leads to the decrease of the overlap between MS repertoires according to the clonotype size-weighted F similarity measure.
Fig 3. Overlap and clustering of TCR repertoires.
A. Hierarchical clustering of healthy donor and multiple sclerosis (MS) patient samples using F pairwise similarity metric (the geometric mean of the total frequency of overlapping clonotypes in first and second sample in pair). B. Multi-dimensional scaling (MDS) plot. Samples were projected onto two-dimensional plane based on pairwise similarities (F metric). C. Permutation testing for closeness of samples coming from the same group based on MDS plot. The plot shows observed (dashed red lines) and permuted (histograms) average within-group sample distance. In contrast to control group, MS group displays highly dissimilar T-cell repertoires. N = 10,000 permutations of group labels were performed. D. Hierarchical clustering of samples based on the Euclidean distance between Variable segment frequency vectors. Note that the clustering provides a nice separation between sample groups (Control and MS, P = 0.013, Fisher’s exact test).
T-cell receptor segment usage signatures
Keeping in mind that MS was shown to have a Type I-II TCR repertoire bias , i.e. the same prominent Variable segment is used, yet only limited homology between CDR3 region is present in disease specific T-cells, we have performed hierarchical clustering of Variable segment usage profiles (Fig 3D, note that profiles are weighted by TRBM count). The resulting dendrogram distinguishes MS patients and healthy donors with 91% sensitivity and 77% specificity (P = 0.013, Fisher’s exact test for cluster—group association).
A post-hoc testing was then performed to find out which Variable segments were more abundant in MS donors than in healthy controls (S6 Table). We have determined that 5 Variable segments had a statistically significant increase in frequency, including TRBV5-6 (1.6-fold, P = 2x10−5) and TRBV5-1 (1.5-fold, P = 5x10−4), which were previously reported to have a genetic association with MS [61,63]. Of note, TRBV20-1 (1.3-fold, P = 2x10−3) which has also emerged in our results was recently shown to have no genetic association with MS in a Sicilian population carrying null allele . This suggests that the observed TRBV20-1—MS association could be either specific for Russian population or represent an indirect biomarker.
Tracking repertoire changes induced by hematopoietic stem cell transplantation
Further we have compared TCR repertoires of blood samples taken from a single MS patient (MS8) before and after HSCT (see Fig 4). We have first tracked the clonotypes present before HSCT procedure to the post-transplantation repertoire (Fig 4A). The resulting plot clearly shows that pre-transplantation clones greatly expand (from ~25% of TRBMs to 75%) and occupy most of homeostatic space in post-HSCT repertoire. The magnitude of this effect resembles the one we previously observed in an ankylosing spondylitis patient HSCT case  and in adult MS autologous HSCT study .
Fig 4. Analysis of autologous HSCT-driven changes in T-cell repertoire.
A. Stacked clonotype frequency plot highlighting the details of overlap between sample MS8 (before autologous HSCT) and MS8-HSCT (post HSCT). Top 100 clonotypes based on their average frequency in those samples are shown, while other clonotypes that were observed in both samples are marked as “Not shown”. The frequency of remaining clonotypes is marked as “Not in overlap”. B. Changes in Variable-Joining segment pairing in CDR3 junctions changes induced by HSCT. Chord diagram is used for visualization, ribbons connecting segment pairs are scaled by corresponding V-J pair frequency. “TRB” prefix is stripped from segment names for simplicity.
Another peculiar finding is that a strong shift in Variable segment usage is observed, while no such change is present for J segment usage (Fig 4B). TRBV15 and TRBV7-8 ranking 10 and 5 replaced the top two Variable segments TRBV20-1 and TRBV29-1, while top two Joining segments TRBJ2-7 and TRBJ2-1 remained the same. This could not be attributed to CD4/CD8 balance alone, as there is strong differential Joining segment usage between those two populations . Interestingly, a significant HSCT-induced decrease was observed for TRBV5-6, TRBV5-1, TRBV5-8, TRBV7-6 and TRBV20-1 (P = 0.008, two-tailed paired T-test for log TRBM frequencies) segments that were enriched in MS patients compared to healthy controls (see previous section). The total frequency of those segments dropped from 20% of TRBMs to 14%.
Comparing bulk characteristics of CDR3 regions for MS patients and healthy donors
Finally, we have compared CDR3 regions of MS patients to healthy donors using a set of basic features: the length of Variable and Joining segment germline parts remaining within CDR3 region, and VJ junction (NDN) size. The length of CDR3 segment itself is a potent marker of antigen receptor reactivity. For example, longer CDR3 sequences may be more characteristic for potentially cross- and self-reactive immune receptors , while CDR3 variants with low number of randomly added “N” nucleotides are characteristic for public clonotypes, including variants specific to common pathogens such as EBV and CMV . As our analysis shows, MS patients are characterized by longer VJ junction region (Fig 5A). To check whether it is due to specific segment usage profile we have compared VJ junctions from all clonotypes of normal donors to the ones coming from clonotypes that have one of Variable segments previously shown to be over-expressed in MS patients (Fig 5B). We have found that aforementioned TRBV5-6, TRBV5-1, TRBV5-8, TRBV7-6 and TRBV20-1 are intrinsically characterized by longer VJ inserts. However, there is still a significant difference in VJ junction size between MS patients and controls for this subset of TRBV segments (Fig 5C). These results may indicate that clonal expansions in MS patients are characterized by more self-reactive T-cell clonotypes than in healthy donors. Alternatively, this could be a more general hallmark of chronic inflammation associated with MS.
Fig 5. CDR3 junction features.
MS patient-derived repertoire is enriched for TCR sequences with long VJ insert, partially due to high abundance of specific Variable segment regions. A. Length of Variable and Joining segment germline parts within CDR3 (V-germ and J-germ) and of VJ insert (VJ-junc) compared between MS donors and healthy controls. B. Average length of VJ junctions among all and selected V-segments (TRBV5-6,5–1,5–8,7–6 and 20–1, shown to be over-expressed in MS patients compared to controls, see main text) according to TCR sequences from repertoires of healthy donors. C. Comparison of VJ insert lengths between control and MS donors for clonotypes with TRBV5-6,5–1,5–8,7–6 and 20–1 segments. P-values computed using two-tailed unpaired T-test (A, C) and paired T-test (B).
Availability and Future Directions
A cross-platform binary version of software in a form of executable JAR file is available from . VDJtools software is free for scientific and non-profit use. The source code is available at GitHub repositories  and .
One important aspect of VDJtools usage not mentioned in the results section is the benchmark of pre-processing software (S6 Fig) and library preparation protocols. For this purposes we plan to constantly update VDJtools so it is able to handle the output of newly developed pre-processing software.
In future we plan extending VDJtools software to address another highly important problem in the field, the analysis of antibody repertoire . While being applicable to the analysis of BCR clonotypes, VDJtools currently doesn’t account for somatic hypermutations and therefore yet cannot offer a comprehensive analysis for the antibody repertoires. This task requires us to implement algorithms for computing statistics of hypermutation transition patterns and reconstruction of B-cell clonal lineages and visualization of hypermutation graphs. We are also looking forward for the feedback from the community to meet the demand for some exciting novel features that will surely arise in this rapidly growing field.
S1 Table. Overview of 20 recently published T-cell repertoire sequencing studies.
Primary analysis software and post-analysis methods that are supported by VDJtools are highlighted in green. Note that none of these papers indicate using specialized software for analysis of clonotype tables, therefore post-analysis in each case was performed either manually or using in-house scripts developed from a scratch.
S5 Table. Repertoire similarity.
The ability of repertoire similarity measures to distinguish identical twins (n = 3 pairs) from unrelated individuals (n = 12) for TCR alpha and beta chain samples. Statistical significance and effect size were assessed using two-tailed T-test P-values and Cohen’s d.
S6 Table. Variable segments that are highly used in MS patients.
In order to determine the possible Type I-II bias according to Ref. , i.e. sharing of common repertoire features under the absence of common clonotypes, in TCR repertoires of MS patients we have performed multiple testing for TRBV frequency difference using one-tailed T-test. One-tailed T-test was chosen to increase the power as we a priori search for an expansion in the T-cell compartment. Appropriate correction for multiple testing was applied (Benjamini-Hockberg correction). Variable segments that are significantly over-represented in MS samples comparing to control are shown.
S1 Fig. Repertoire diversity estimator performance.
This plot shows Spearman correlation of diversity estimate with age and naïve T-cell count. Unmodified samples (exact) and samples normalized to the same size (resampled) from the “aging” study were used (n = 39). Note that ChaoE is omitted from the “resampled” plot, as it equals observed diversity when samples are of the same size.
S2 Fig. Difference in repertoire diversity between Control and MS.
Difference was measured using four repertoire diversity estimates considered in present study (separate panels). The effect sizes are 1.21, 0.98, 0.95 and 0.46 respectively (Cohen’s d). **—P < 0.01, *—P < 0.05, ns—non-significant, two-tailed T-test.
S4 Fig. Possible biases in sample clustering in present study.
A. Hierarchical clustering of repertoires based on two distinct clonotype matching rules: matching CDR3 amino-acid sequences (left panel) and matching of CDR3 amino acid sequences but distinct CDR3 nucleotide sequences (right panel). Batch effect for samples on the same sequencing lane is shown with vertical lines. B. Checking for possible sex bias in repertoire clustering. Multi-dimensional scaling (MDS) plot is shown for healthy donors of various ages and sexes from the aging study (n = 39, left panel). Statistical significance of co-clustering for same sex samples (low within and high between cluster distance) was performed using random permutation of factor levels between samples, red line shows observed values, P-values are shown as numbers near red lines (n = 10,000 permutations, right panel).
S5 Fig. In-depth analysis of the cross-sample contamination issue.
A. Example of three top clonotypes coming from different batches (A2, A3 and A4) clearly shows presence of intra-batch contamination. B. Frequency of parent clonotypes (x axis) and their contamination traces (y axis) in the pooled samples of aging study. Top 100 clonotypes having the largest frequency in pooled samples were analyzed. C. Input of cross-sample contamination to the observed inter-sample overlap (F measure) for samples coming from the same (red) and different (green) sequencing lane.
S6 Fig. An example of Rep-Seq processing software comparison.
Comparison of clonotype extraction efficiency on A4-i107 sample from the “aging” study described in S1 Text. Note that error correction in current case was performed using unique molecular identifiers, therefore this figure only deals with CDR3 mapping and clonotype assembly capabilities of software tools. MiTCR and MIGEC identified 95% and 98% of clonotypes found by IgBlast. False clonotype rate was 0.2% and 2.7% respectively.
Conceived and designed the experiments: MS DMC. Performed the experiments: MAT OVB EVP IVZ VIK KIK EVS. Analyzed the data: MS DVB DAB MVP VIN. Wrote the paper: MS DMC.
- 1. Bolotin DA, Shugay M, Mamedov IZ, Putintseva EV, Turchaninova MA, et al. (2013) MiTCR: software for T-cell receptor sequencing data analysis. Nat Methods 10: 813–814. pmid:23892897
- 2. Shugay M, Britanova OV, Merzlyak EM, Turchaninova MA, Mamedov IZ, et al. (2014) Towards error-free profiling of immune repertoires. Nat Methods 11: 653–655. pmid:24793455
- 3. Bolotin DA, Poslavsky S, Mitrophanov I, Shugay M, Mamedov IZ, et al. (2015) MiXCR: software for comprehensive adaptive immunity profiling. Nat Methods 12: 380–381. pmid:25924071
- 4. Alamyar E, Giudicelli V, Li S, Duroux P, Lefranc MP (2012) IMGT/HighV-QUEST: the IMGT(R) web portal for immunoglobulin (IG) or antibody and T cell receptor (TR) analysis from NGS high throughput and deep sequencing. Immunome Res 8: 26.
- 5. Ye J, Ma N, Madden TL, Ostell JM (2013) IgBLAST: an immunoglobulin variable domain sequence analysis tool. Nucleic Acids Res 41: W34–40. pmid:23671333
- 6. Yang X, Liu D, Lv N, Zhao F, Liu F, et al. (2015) TCRklass: A New K-String-Based Algorithm for Human and Mouse TCR Repertoire Characterization. J Immunol 194: 446–454. pmid:25404364
- 7. Giraud M, Salson M, Duez M, Villenet C, Quief S, et al. (2014) Fast multiclonal clusterization of V(D)J recombinations from high-throughput sequencing. BMC Genomics 15: 409. pmid:24885090
- 8. Rasheed Z, Rangwala H, Barbara D (2013) 16S rRNA metagenome clustering and diversity estimation using locality sensitive hashing. BMC Syst Biol 7 Suppl 4: S11.
- 9. Robins HS, Campregher PV, Srivastava SK, Wacher A, Turtle CJ, et al. (2009) Comprehensive assessment of T-cell receptor beta-chain diversity in alphabeta T cells. Blood 114: 4099–4107. pmid:19706884
- 10. Britanova OV, Putintseva EV, Shugay M, Merzlyak EM, Turchaninova MA, et al. (2014) Age-related decrease in TCR repertoire diversity measured with deep and normalized sequence profiling. J Immunol 192: 2689–2698. pmid:24510963
- 11. De Filippo C, Ramazzotti M, Fontana P, Cavalieri D (2012) Bioinformatic approaches for functional annotation and pathway inference in metagenomics data. Brief Bioinform 13: 696–710. pmid:23175748
- 12. Venturi V, Price DA, Douek DC, Davenport MP (2008) The molecular basis for public T-cell responses? Nat Rev Immunol 8: 231–238. pmid:18301425
- 13. Robins HS, Srivastava SK, Campregher PV, Turtle CJ, Andriesen J, et al. (2010) Overlap and effective size of the human CD8+ T cell receptor repertoire. Sci Transl Med 2: 47ra64. pmid:20811043