The Immune Epitope Database and Analysis Resource: From Vision to Blueprint
Peptide binding to Major Histocompatibility Complex (MHC) molecules is a prerequisite for T-cell recognition of antigens, a deciding step in the activation of immune cascades in cellular immunity. The use of motifs, which define conserved features such as peptide length and unique residues, to characterize peptide ligands is thus well established. Current methods for classifying MHC-peptide binding specificities are based largely on MHC allele and, more recently, MHC superfamily specificity. They rest on findings that individual MHC alleles or allele groups show a strong preference for peptides with specific features. These methods focus on the diversity of MHC molecules arising from genetic polymorphism of the alleles, but leave pathogen diversity unexplored. In this study, a large collection of binding and non-binding peptides derived from a variety of pathogens was used to investigate the value of peptide motifs in accurately predicting pathogen-specific epitopes. The results show promise for characterizing peptide ligands at the pathogen-specific level: motifs predicted binding peptides with accuracy rates of up to 76.9%.

INTRODUCTION

Human Leukocyte Antigen (HLA) molecules (the human MHC) are crucial to the adaptive immune system. They invoke immune cascades by binding antigenic peptide ligands and presenting them to T-cells for recognition. Motifs reflect conserved features of peptide ligands, such as peptide length and unique residues. The clustering of peptide binding characteristics by MHC allele specificity has been well studied (Falk et al, 1991). In addition, peptide motifs related to a particular MHC allele and associated with specific diseases have been characterized. More recently, the ability of peptide ligands to bind several different MHC allele groups, called MHC superfamilies (Threlkeld et al, 1997), has been established.
Nevertheless, no work has been done to identify and evaluate pathogen-specific peptide motifs across a spectrum of pathogens, although such an approach carries great promise. Since peptide binding to MHC is a prerequisite for T-cell recognition of antigens, such findings would prove useful in T-cell epitope prediction. This study seeks a pathogen-specific perspective on motifs in T-cell epitopes. It aims to explore the possibility of characterizing motifs discovered from pathogen-specific peptides and to examine the value of these motifs in identifying binding peptides. The results shed new light on peptide motif definition, suggesting that it need not be limited to MHC allele or MHC superfamily specificities and can be characterized at the pathogen-specific level.

MATERIALS AND METHODS

Dataset

A total of 11799 HLA class I binding and non-binding linear peptide sequences from the Immune Epitope Database and Analysis Resource (IEDB) (Peters et al, 2005) were used in the final analysis. The epitopes were manually curated into groups according to their source pathogen. In all, 14 pathogens were included in this study.

Motif Discovery and Assessment

The Multiple EM for Motif Elicitation (MEME) algorithm (Bailey and Elkan, 1994) was applied to identify statistically significant motifs for each pathogen. The algorithm estimates the parameters of a probabilistic model that could have generated the given dataset, yielding a motif and the relative frequency of its occurrences. In this study, the width of each reported motif was set to between two and thirteen. The Motif Alignment and Search Tool (MAST), based on the QFAST algorithm (Bailey and Gribskov, 1998), was used to assess the performance of the motifs.
By searching the binding and non-binding peptide datasets of the 14 pathogens, performance measures such as sensitivity, specificity and accuracy were derived to evaluate how successfully the motifs predicted binders (peptides containing a motif) and non-binders (peptides not containing the said motifs).

Sequence Analysis

WebLogo (Crooks et al, 2004), a web-based application that displays consensus sequences as sequence logos, was used to create sequence logos of all nonamers (global analysis) and of the length-specific epitopes of each pathogen (pathogen-specific analysis). This analysis gives an indication of sequence similarity and thus a preliminary idea of the position specificity of the motifs found by MEME. Each logo consists of stacks of letters, one stack for each position in the sequence. The overall height of each stack indicates the sequence conservation at that position, whereas the height of each symbol within the stack reflects the relative frequency of the corresponding amino acid at that position.

RESULTS AND DISCUSSION

Motif Discovery and Assessment

A total of 1336 potential motifs were found by MEME in the epitope sequences of the 14 pathogens. Motifs of width two made up the largest percentage of potential motifs across the pathogens. These are of little significance, as shown by the high average E-values of these two-residue motifs. The E-values decreased sharply from motifs of width three onwards; thus, wider motifs are statistically more significant. Motifs of width four, six and eight made up a substantial share of the motifs found. Of the 1336 motifs found, 445 were assessed for performance using MAST. Overall, the accuracy of the motifs for each pathogen in predicting binders and non-binders ranged from 50% to 77% (Figure 1). Thus, the use of pathogen-specific motifs to predict MHC class I binding peptides is feasible.
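The performance measures named above (sensitivity, specificity and accuracy) could be computed along the following lines. This is a minimal sketch, not the procedure used in the study: it assumes that a peptide containing a motif is read as a predicted binder, and the names `Counts`, `tally` and `leucine_at_p2`, as well as the toy peptides, are hypothetical. The motif test is a simple stand-in and does not reproduce MAST scoring.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    tp: int = 0  # binders predicted as binders (motif present)
    fn: int = 0  # binders predicted as non-binders (motif absent)
    tn: int = 0  # non-binders predicted as non-binders (motif absent)
    fp: int = 0  # non-binders predicted as binders (motif present)


def tally(binders, non_binders, has_motif):
    """Count outcomes when 'motif present' is read as 'predicted binder'."""
    c = Counts()
    for peptide in binders:
        if has_motif(peptide):
            c.tp += 1
        else:
            c.fn += 1
    for peptide in non_binders:
        if has_motif(peptide):
            c.fp += 1
        else:
            c.tn += 1
    return c


def ratio(numerator, denominator):
    """Return numerator / denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def sensitivity(c):
    """Fraction of true binders that were predicted as binders."""
    return ratio(c.tp, c.tp + c.fn)


def specificity(c):
    """Fraction of true non-binders that were predicted as non-binders."""
    return ratio(c.tn, c.tn + c.fp)


def accuracy(c):
    """Fraction of all peptides that were predicted correctly."""
    return ratio(c.tp + c.tn, c.tp + c.fn + c.tn + c.fp)


def leucine_at_p2(peptide):
    """Toy stand-in for a motif hit (NOT MAST): leucine at position two."""
    return len(peptide) > 1 and peptide[1] == "L"


# Invented peptides for illustration only; they are not taken from IEDB.
toy_binders = ["ALAAAAAAV", "KLAAAAAAL", "GAAAAAAAV"]
toy_non_binders = ["AAAAAAAAA", "QLAAAAAAD", "DDDDDDDDD"]

counts = tally(toy_binders, toy_non_binders, leucine_at_p2)
print(sensitivity(counts), specificity(counts), accuracy(counts))
```

Under these definitions, a motif that matches nearly every peptide would score high on sensitivity and low on specificity, which is why the two measures are worth reporting separately from accuracy.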
An interesting observation is that the sensitivity of the motifs was always extremely high across different widths and pathogens, while the specificity was always much lower. This shows that the motifs identify binders much better than non-binders, which may limit their usefulness.

Figure 1: Overall accuracy of all motifs from each pathogen (x-axis: pathogen; y-axis: accuracy, 40% to 80%). Legend: F.t = Francisella tularensis, HBV = Hepatitis B Virus, HCV = Hepatitis C Virus, HIV = Human Immunodeficiency Virus, HHV = Human Herpesvirus, Flu = Influenza Virus, Las = Lassa Virus, M.t = Mycobacterium tuberculosis, Pla = Plasmodium spp., Pse = Pseudomonas spp., SARS = SARS coronavirus, T.c = Trypanosoma cruzi, Vac = Pox Viruses, Vib = Vibrio spp.

Sequence Analysis

For the global analysis, 3219 nonameric peptide sequences from the 14 pathogens were examined for sequence conservation. There is evident sequence conservation at positions two and nine, which could account for the significant number of motifs of width eight found. These positions consisted mostly of leucine, consistent with previous findings that the carboxyl terminus shows a strong preference for hydrophobic anchor residues and that position two is another anchor position (Rammensee, 1995). All length-specific epitopes within each pathogen were also analyzed. They showed alignment at position two and the carboxyl terminus, regardless of epitope length and pathogen. These aligned positions usually consisted of amino acids such as leucine, proline and valine. However, MEME picked up no motifs of width nine at all, which leaves the sequence consensus among 10-mers unexplained even though they formed the next largest dataset.
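The per-position conservation that a sequence logo displays can be illustrated with a short calculation. The sketch below assumes equal-length peptides, a 20-letter amino acid alphabet and the usual information-content measure in bits; it omits the small-sample correction that logo tools may apply, so it is an approximation and not a reimplementation of WebLogo. The function names and example peptides are hypothetical.

```python
import math
from collections import Counter

ALPHABET_SIZE = 20  # standard amino acids


def column_information(column):
    """Information content (bits) of one position, without small-sample correction."""
    n = len(column)
    counts = Counter(column)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return math.log2(ALPHABET_SIZE) - entropy


def logo_heights(peptides):
    """For each position, map each residue to its height: frequency x information."""
    length = len(peptides[0])
    if any(len(p) != length for p in peptides):
        raise ValueError("all peptides must have the same length")
    heights = []
    for i in range(length):
        column = [p[i] for p in peptides]
        info = column_information(column)
        n = len(column)
        heights.append({aa: (c / n) * info for aa, c in Counter(column).items()})
    return heights


# Invented nonamers for illustration only; they are not taken from IEDB.
toy_nonamers = ["ALAKDEFGV", "KLCDEFGHL", "GLMNPQRSV", "TLWYACDEL"]
for position, stack in enumerate(logo_heights(toy_nonamers), start=1):
    total = sum(stack.values())
    print(position, round(total, 2), sorted(stack, key=stack.get, reverse=True))
```

In this toy set, position two is fully conserved and therefore gets the tallest stack, while position nine is split between two residues and loses one bit of height, mirroring the kind of anchor-position signal described in the sequence analysis.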