A primer in macromolecular linguistics

  • D. Searls
  • Published 2013
  • Biology, Medicine
  • Biopolymers
Polymeric macromolecules, when viewed abstractly as strings of symbols, can be treated in terms of formal language theory, providing a mathematical foundation for characterizing such strings both as collections and in terms of their individual structures. In addition, this approach offers a framework for analysis of macromolecules by tools and conventions widely used in computational linguistics. This article introduces the ways that linguistics can be and has been applied to molecular biology…
Learning the Language of Biological Sequences
Surveys the main ideas and concepts behind the approaches developed in pattern/motif discovery and grammatical inference to successfully characterize biological sequences and their specificities.
Native Chemical Computation. A Generic Application of Oscillating Chemistry Illustrated With the Belousov-Zhabotinsky Reaction. A Review
Offers a new interpretation of the recognition of a sequence of chemicals representing words in the machine's language as an illustration of the “Maximum Entropy Production Principle”, concluding that word recognition by the Belousov-Zhabotinsky Turing machine is equivalent to extremal entropy production by the automaton.
How Chemistry Computes: Language Recognition by Non-Biochemical Chemical Automata. From Finite Automata to Turing Machines
The Turing machine uses the Belousov-Zhabotinsky chemical reaction and checks the same symbol in an Avogadro's number of processors; the work has implications for chemical and general computing, artificial intelligence, bioengineering, the study of the origin and presence of life on other planets, and for artificial biology.
Estimating probabilistic context-free grammars for proteins using contact map constraints
This work develops the theory behind the introduction of contact constraints in maximum-likelihood and contrastive estimation schemes and implements it in a machine learning framework for protein grammars, a significant step towards more flexible and accurate modeling of collections of protein sequences.
A vocabulary of ancient peptides at the origin of folded proteins
Compares domains representative of known folds and identifies 40 fragments whose similarity is indicative of common descent, yet which occur in domains currently not thought to be homologous; these fragments are proposed to represent the observable remnants of a primordial RNA-peptide world.
Quantiprot - a Python package for quantitative analysis of protein sequences
Three main fields of application are proposed for the Quantiprot package, which provides a simple and consistent interface to multiple methods for quantitative characterization of protein sequences and can be used in alignment-free similarity searches and in clustering of large and/or divergent sequence sets.
Common substructures and sequence characteristics of sandwich-like proteins from 42 different folds
Comparison of the sequence fragments corresponding to strands that make up the common substructures revealed specific rules of distribution of hydrophobic residues within these strands that can be conceptualized as grammatical rules of beta protein linguistics.
Ant Colony Optimization for Construction of Common Pattern of the Protein Motifs
This work presents an approach for constructing common patterns of amyloid protein motifs, extracted from the database AMYPdb, denoted as regular expressions.
Glycosphingolipids: synthesis and functions
Although GSLs are dispensable for cellular life, they are indeed collectively required for the development of multicellular organisms and are thus considered to be key molecules in ‘cell sociology’.
Hemoglobin state-flux: A finite-state model representation of the hemoglobin signal for evaluation of the resting state and the influence of disease
A weak-model approach for examination of the intrinsic time-varying properties of the hemoglobin signal is introduced, with the aim of advancing the application of functional near infrared spectroscopy (fNIRS) for the detection of breast cancer, among other potential uses.


Computational linguistics: A new tool for exploring biopolymer structures and statistical mechanics
Unlike homopolymers, biopolymers are composed of specific sequences of different types of monomers. In proteins and RNA molecules, one-dimensional sequence information encodes a three-dimensional…
Grammatical Representations of Macromolecular Structure
It is shown how nearly all of these methods to model RNA and protein structure are based on the same core principles and can be converted into equivalent approaches in the framework of tree-adjoining grammars and related formalisms.
A stochastic context free grammar based framework for analysis of protein sequences
The work introduces a new Stochastic Context Free Grammar based framework allowing the production of binding site descriptors for analysis of protein sequences and suggests that this system may be particularly suited to deal with patterns shared by non-homologous proteins.
The language of genes
Many techniques used in bioinformatics, even if developed independently, may be seen to be grounded in linguistics, and further interweaving of these fields will be instrumental in extending the understanding of the language of life.
Gene structure prediction by linguistic methods.
A grammar and parser for eukaryotic protein-encoding genes is described, which by some measures is as effective as current connectionist and combinatorial algorithms in predicting gene structures for sequence database entries.
Recursive domains in proteins
The extent to which four simple rules can generate the known all‐β folds is explored, using tools from graph theory.
Protein linguistics — a grammar for modular protein assembly?
  • M. Gimona
  • Medicine
  • Nature Reviews Molecular Cell Biology
  • 2006
The correspondence between biology and linguistics at the level of sequence and lexical inventories, and of structure and syntax, has fuelled attempts to describe genome structure by the rules of…
Reading the book of life
  • D. Searls
  • Computer Science, Medicine
  • Bioinform.
  • 2001
With the publication of the human genome sequence, the notion of genome as literature may be seen as an extension of the linguistic metaphor that has dominated molecular biology from its inception, as is evident from the terminology used in the field.
Routes are trees: The parsing perspective on protein folding
This work identifies all direct folding route trees to the native state and allows us to construct a simple model of the folding process, which provides an account for the fact that folding rates depend only on the topology of the native state but not on sequence composition.
Theory for the folding and stability of globular proteins.
  • K. Dill
  • Chemistry, Medicine
  • Biochemistry
  • 1985
Using lattice statistical mechanics, theory is developed to account for the folding of a heteropolymer molecule such as a protein to the globular and soluble state and the number of accessible conformations is calculated to be an exceedingly small fraction of the number available to the random coil.