Improving the Results of De novo Peptide Identification via Tandem Mass Spectrometry Using a Genetic Programming-based Scoring Function for Re-ranking Peptide-Spectrum Matches

Samaneh Azari, Bing Xue, Mengjie Zhang, Lifeng Peng
De novo peptide sequencing algorithms have been widely used in proteomics to analyse tandem mass spectra (MS/MS) and assign them to peptides, but quality-control methods to evaluate the confidence of de novo peptide sequencing are lagging behind. A fundamental part of a quality-control method is the scoring function used to evaluate the quality of peptide-spectrum matches (PSMs). Here, we propose a genetic programming (GP) based method, called GP-PSM, to learn a PSM scoring function for… 


GA-Novo: De Novo Peptide Sequencing via Tandem Mass Spectrometry using Genetic Algorithm
GA-Novo outperforms PEAKS, the most commonly used software for this task, on the testing dataset: it constructs 8% more fully matched peptide sequences and achieves 4% higher recall on partially matched sequences.
PEAKS: powerful software for peptide de novo sequencing by tandem mass spectrometry.
A new de novo sequencing software package, PEAKS, is described. It extracts amino acid sequence information without the use of databases, using a new model and a new algorithm to efficiently compute the peptide sequences whose fragment ions best interpret the peaks in the MS/MS spectrum.
pSite: Amino Acid Confidence Evaluation for Quality Control of De Novo Peptide Sequencing and Modification Site Localization.
The effective and universal model together with the extensive use of spectral information makes pSite an excellent quality control tool for both de novo peptide sequencing and modification site localization.
Probability‐based protein identification by searching sequence databases using mass spectrometry data
A new computer program, Mascot, is presented. It integrates all three types of search for protein identification against a sequence database using mass spectrometry data, and its scoring algorithm is probability based.
An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database
The approach described in this manuscript provides a convenient method to interpret tandem mass spectra with known sequences in a protein database.
Empirical statistical model to estimate the accuracy of peptide identifications made by MS/MS and database search.
A statistical model is presented to estimate the accuracy of peptide assignments to tandem mass (MS/MS) spectra made by database search applications such as SEQUEST, demonstrating that the computed probabilities are accurate and have high power to discriminate between correctly and incorrectly assigned peptides.
A method for assessing the statistical significance of mass spectrometry-based protein identifications using general scoring schemes.
This paper investigates the use of survival functions and expectation values to evaluate the results of protein identification experiments, finding accurate survival functions that are specific to any combination of scoring algorithms, sequence databases, and mass spectra.
A comprehensive full factorial LC‐MS/MS proteomics benchmark data set
A data set is presented that consists of 59 LC‐MS/MS analyses of 50 protein samples, extracted individually from Escherichia coli K12 and spiked with different concentrations of bovine carbonic anhydrase II and/or chicken ovalbumin according to a 2 × 3 full factorial design.
Introduction to Computational Proteomics
Most of the currently available technology for identifying proteins from biological samples simply cannot contend with the complexity, and the majority of the low-abundance proteins are not observed.
Building credit scoring models using genetic programming
In this paper, genetic programming (GP) is used to build credit scoring models and it is concluded that GP can provide better performance than other models.