• Corpus ID: 44450514

Automated DNA Motif Discovery

William B. Langdon, Olivia Sánchez Graillet, Andrew P. Harrison
Ensembl’s human non-coding and protein-coding genes are used to automatically find DNA pattern motifs. The Backus-Naur form (BNF) grammar for regular expressions (RE) is used by genetic programming to ensure that the generated strings are legal. The evolved motif suggests that the presence of Thymine followed by one or more Adenines, etc., early in transcripts indicates a non-protein-coding gene.
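To illustrate the idea in the abstract, here is a minimal Python sketch of grammar-constrained generation of regular expressions, followed by a check for a "T followed by one or more A" motif near the start of a transcript. The toy grammar, the names `GRAMMAR`, `derive` and `matches_early`, and the 60-base window are all assumptions made for illustration; they are not the grammar, code, or parameters used in the paper, and no evolutionary search is shown.

```python
import random
import re

# Hypothetical toy BNF grammar for regular expressions over DNA bases.
# Each nonterminal maps to a list of alternative productions; the first
# alternative of every rule is non-recursive so derivation can terminate.
GRAMMAR = {
    "<re>": [["<elem>"], ["<elem>", "<re>"]],
    "<elem>": [["<atom>"], ["<atom>", "<quant>"]],
    "<atom>": [["<base>"], ["[", "<base>", "<base>", "]"], ["."]],
    "<quant>": [["+"], ["*"], ["?"]],
    "<base>": [["A"], ["C"], ["G"], ["T"]],
}


def derive(symbol, rng, depth=0, max_depth=6):
    """Expand a grammar symbol into a string by random derivation."""
    if symbol not in GRAMMAR:
        return symbol  # terminal symbol: emit as-is
    options = GRAMMAR[symbol]
    if depth >= max_depth:
        options = options[:1]  # force the non-recursive alternative
    production = rng.choice(options)
    return "".join(derive(s, rng, depth + 1, max_depth) for s in production)


def matches_early(motif, transcript, window=60):
    """Return True if the motif occurs within the first `window` bases.

    The window size is an assumed stand-in for "early in transcripts".
    """
    return re.search(motif, transcript[:window]) is not None


if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(5):
        candidate = derive("<re>", rng)
        re.compile(candidate)  # grammar-derived strings compile as regexes
        print(candidate)
    # "TA+" is one reading of "Thymine followed by one or more Adenines".
    print(matches_early("TA+", "GGCTAAAGC"))
```

Because a quantifier can only follow an atom in this grammar, every derived string is a syntactically legal regular expression, which is the property the abstract attributes to the BNF grammar. In a genetic programming setting, crossover and mutation would operate on the derivation rather than on raw characters to preserve that property.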


Evolving gzip matches Kernel from an nVidia CUDA Template

Genetic interface programming is demonstrated by automatically generating a parallel CUDA kernel with functionality identical to that of existing, highly optimised, ancient sequential C code; this is done by automatically converting nVidia GPGPU kernel C++ code into a BNF grammar.

Evolving a CUDA kernel from an nVidia template

This work demonstrates genetic interface programming (GIP) by automatically generating a parallel CUDA kernel with identical functionality to existing highly optimised ancient sequential C code (gzip), which is converted into a BNF grammar.

Genetic programming needs better benchmarks

This paper argues that defining standard benchmarks is an essential step in the maturation of the field; it motivates the development of a benchmark suite and defines its goals.



The evaluation of a stochastic regular motif language for protein sequences

This research establishes the viability of SRE-DNA as a new representation language for protein sequence identification by evaluating its expressive merits and the practicality of using grammatical genetic programming in stochastic biosequence expression classification.

Informatic Resources for Identifying and Annotating Structural RNA Motifs

This paper focuses on the task of structural motif discovery and surveys the informatics resources geared towards it, giving a snapshot of what is currently available.

Ab initio identification of human microRNAs based on structure motifs

MiRPred is a novel method for ab initio prediction of miRNAs by genome scanning that only relies on (predicted) secondary structure to distinguish miRNA precursors from other similar-sized segments of the human genome.

A stochastic context free grammar based framework for analysis of protein sequences

A new framework based on Stochastic Context Free Grammars is introduced that allows the production of binding site descriptors for the analysis of protein sequences; the results suggest that this system may be particularly suited to patterns shared by non-homologous proteins.

Combinatorial motif analysis and hypothesis generation on a genomic scale

Tests using 10 previously identified regulons from budding yeast and 14 artificial families of sequences demonstrated the effectiveness of the new motif-finding method, using multiple objective functions and an improved stochastic iterative sampling strategy.

Evolving Regular Expressions for GeneChip Probe Performance Prediction

The evolved, data-mined motif is better at predicting poor DNA sequences than an existing human-generated RE, suggesting that runs of Cytosine and Guanine, and mixtures of them, should all be avoided.

Searching for IRES.

Examination of the available RNA structure prediction software and RNA motif searching programs indicates that while these programs are useful tools to fine tune the empirically determined RNA secondary structure, the accuracy of de novo secondary structure prediction of large RNA molecules and subsequent identification of new IRES elements by computational approaches is still not possible.

NucPred - Predicting nuclear localization of proteins

NucPred analyzes patterns in eukaryotic protein sequences and predicts if a protein spends at least some time in the nucleus or no time at all, based on regular expression matching and multiple program classifiers induced by genetic programming.

Repeated Sequences in Linear Genetic Programming Genomes

Mackey-Glass chaotic time series prediction and eukaryotic protein localisation demonstrate evolution of Shannon information (entropy) and lead to models capable of lossy Kolmogorov compression.