Identification of minimal eukaryotic introns through GeneBase, a user-friendly tool for parsing the NCBI Gene databank

@article{Piovesan2015IdentificationOM,
  title={Identification of minimal eukaryotic introns through GeneBase, a user-friendly tool for parsing the NCBI Gene databank},
  author={Allison Piovesan and Maria Caracausi and Marco Ricci and Pierluigi Strippoli and Lorenza Vitale and Maria Chiara Pelleri},
  journal={DNA Research: An International Journal for Rapid Publication of Reports on Genes and Genomes},
  year={2015},
  volume={22},
  pages={495--503}
}
We have developed GeneBase, a full parser of the National Center for Biotechnology Information (NCBI) Gene database, which generates a fully structured local database with an intuitive user-friendly graphic interface for personal computers. Features of all the annotated eukaryotic genes are accessible through three main software tables, including for each entry details such as the gene summary, the gene exon/intron structure and the specific Gene Ontology attributions. The structuring of the… 

GeneBase 1.1: a tool to summarize data from NCBI gene datasets and its application to an update of human gene statistics
TLDR
GeneBase 1.1 is released, a local tool with a graphical interface useful for parsing, structuring and indexing data from the National Center for Biotechnology Information (NCBI) Gene data bank, offering unique functionalities not provided by the NCBI Gene itself.
Human protein-coding genes and gene feature statistics in 2019
TLDR
Using GeneBase, a software tool with a graphical interface able to import and process National Center for Biotechnology Information (NCBI) Gene database entries, tabulated spreadsheets on the human nuclear protein-coding gene data set, updated to 2019, are provided ready to be used for any type of analysis of genes, transcripts and gene organization.
Understanding the causes of errors in eukaryotic protein-coding gene prediction: a case study of primate proteomes
TLDR
This work investigated the prevalence of gene prediction errors in a large set of 176,478 proteins from ten primate proteomes available in public databases, and focused on the mismatched sequence errors that cause particular problems for downstream applications.
Systematic identification of human housekeeping genes possibly useful as references in gene expression studies
TLDR
The present study conducted a meta-analysis of a pool of 646 expression profile data sets from 54 different human tissues and identified actin γ 1 as the HK gene that best fits the combination of all the traditional criteria to be used as a reference gene for general use.
Genome-wide analyses supported by RNA-Seq reveal non-canonical splice sites in plant genomes
TLDR
It is concluded that bona fide non-canonical splice sites are present and appear to be functionally relevant in most plant genomes, although at low abundance.
On the length, weight and GC content of the human genome
TLDR
Following analysis in different transcriptomes and species, it was shown that the greatest deviation was observed in the pathological condition analysed (trisomy 21 leukaemic cells) and in Caenorhabditis elegans.
SPLICE-q: a Python tool for genome-wide quantification of splicing efficiency
TLDR
It is illustrated that SPLICE-q is suitable to detect a progressive increase of splicing efficiency throughout a time course of nascent RNA-seq and it might be useful when it comes to understanding cancer progression beyond mere gene expression levels.
A molecular view of the normal human thyroid structure and function reconstructed from its reference transcriptome map
TLDR
This study provides a quantitative global reference portrait of gene expression in the normal human thyroid and highlights differential expression between normal human thyroid and a pool of non-thyroid tissues useful for modeling correlations between thyroidal gene expression and specific thyroid functions and diseases.
Integrated Transcriptome Map Highlights Structural and Functional Aspects of the Normal Human Heart
TLDR
A systematic meta‐analysis of the available gene expression profiling datasets for the whole normal human heart generated a quantitative transcriptome reference map of this organ, illustrating the structural and functional aspects of the whole organ and is a general model to understand the mechanisms underlying heart pathophysiology.

References

SHOWING 1-10 OF 62 REFERENCES
Xpro: database of eukaryotic protein-encoding genes
TLDR
Xpro is a relational database that contains all the eukaryotic protein-encoding DNA sequences contained in GenBank with associated data required for the analysis of eukaryotic gene architecture and provides annotations on the splice sites and intron phases.
Advances in the Exon-Intron Database (EID)
TLDR
The latest data are presented on the comparison of intron positions in 11,025 orthologous genes of human, mouse and rat; no convincing cases of intron gain are found, and relevant data-quality issues of genomic databases are discussed.
Gene: a gene-centered information resource at NCBI
TLDR
The National Center for Biotechnology Information's (NCBI) Gene database integrates gene-specific information from multiple data sources and represents the integration of curation and automated processing from RefSeq, collaborating model organism databases, consortia such as Gene Ontology, and other databases within NCBI.
EID: the Exon-Intron Database, an exhaustive database of protein-coding intron-containing genes
TLDR
An Exon-Intron Database in FASTA format is constructed and it is inferred that there is a 2% rate of errors or other deviations from the standard GT...AG motif in nuclear genes, which can be used to eliminate 4921 genes from the overall database.
Intron-exon structures of eukaryotic model organisms.
TLDR
The variable intron-exon structures of the 10 model organisms reveal two interesting statistical phenomena, which cast light on some previous speculations about genome size and intron size.
Fast parsers for Entrez Gene
TLDR
This work presents four parsers that were developed using several parsing approaches (Parse::RecDescent, Parse::Yapp, Perl-byacc and Perl 5 regular expressions) and provides the first in-depth comparison of these sophisticated Perl tools.
Genome-wide bioinformatic and molecular analysis of introns in Saccharomyces cerevisiae.
TLDR
It is suggested that the current set of 228 yeast introns is still not complete, and that additional intron-containing genes remain to be discovered in yeast, but that splicing in yeast may not be as rigidly determined by splice-site conservation as had previously been thought.
The IDB and IEDB: intron sequence and evolution databases
TLDR
A non-redundant database of nuclear, protein-encoding, genomic DNA sequences highlighting nuclear pre-mRNA introns was constructed using information contained in the SWISS-PROT and GenBank sequence databases, and a statistical analysis of the exon and intron sequences catalogued in IDB is provided.
Statistical features of human exons and their flanking regions.
TLDR
It is shown that human exons with flanking genomic DNA sequences can be classified into 12 mutually exclusive categories, which could serve as a standard for future studies so that direct comparisons of results can be made.
Gene Ontology Annotations and Resources
TLDR
The Gene Ontology (GO) Consortium is a community-based bioinformatics resource that classifies gene product function through the use of structured, controlled vocabularies and has been expanded not only to cover new areas of biology through focused interaction with experts, but also to capture greater specificity in all areas of the ontology.