Domain combinations in archaeal, eubacterial and eukaryotic proteomes.

Gordana Apic, Julian Gough, Sarah A. Teichmann. Journal of Molecular Biology, 310(2).
There is a limited repertoire of domain families that are duplicated and combined in different ways to form the set of proteins in a genome. Proteins are gene products, and at the level of genes, duplication, recombination, fusion and fission are the processes that produce new genes. We attempt to gain an overview of these processes by studying the evolutionary units in proteins, domains, in the protein sequences of 40 genomes. The domain and superfamily definitions in the Structural… 

Genomic and structural aspects of protein evolution.

This review discusses the number of currently known superfamilies, their size and distribution, and superfamily expansions related to biological complexity and to specific lineages, and the extraordinary variety of the domain combinations found in different genomes.

Protein families and their evolution-a structural perspective.

It is shown that about two thirds of the sequences from completed genomes can be assigned to as few as 1400 domain families for which structures are known and thus more ancient evolutionary relationships established.

Multi-domain protein families and domain pairs: comparison with known structures and a random model of domain recombination

It is established here that all the domain families with more than three members in genomes are duplicated more frequently than would be expected by chance considering their number of neighbouring domains.

Analysis of Domain Combinations in Eukaryotic Genomes

Here, whole protein sets from completely sequenced and semi-completely sequenced genomes, including draft eukaryotic genomes, are collected, and the domain combinations are analyzed to obtain an overview of eukaryotic genomes.

This Déjà Vu Feeling—Analysis of Multidomain Protein Evolution in Eukaryotic Genomes

This work assembled a collection of 172 complete eukaryotic genomes that is not only the largest, but also the most phylogenetically complete set of genomes analyzed so far, and shows that independent evolution of domain combinations is significantly more prevalent than previously thought.

Evolution of the PWWP-domain encoding genes in the plant and animal lineages

It is found that as a single module the PWWP domain occurs only in proteins with a limited, mainly species-specific distribution, and these results do not support models wherein more complex protein architectures involving the PWWP domain occur with the appearance of more evolutionarily advanced life forms.

Protein Family Expansions and Biological Complexity

The identity of those superfamilies whose relative sizes in different organisms are highly correlated to the complexity of the organisms is determined and one explanation of the discrepancy between the total number of genes and the apparent physiological complexity of eukaryotic organisms is provided.

Comprehensive analysis of co-occurring domain sets in yeast proteins

This work designs a novel representation of proteins and their constituent domains as a protein-domain network, and provides a comprehensive list of co-occurring domain sets in yeast, and sheds light on their function and evolution.

Distribution of Protein Superfamilies in the Three Superkingdoms of Life

The Superfamily database provides structural assignments to protein sequences to analyze the distribution of specific superfamilies within and across the genomes, and here, this work focuses on the distributions of superfamilies in archaeal, bacterial and eukaryotic genomes.



Structural assignments to the Mycoplasma genitalium proteins show extensive gene duplications and domain rearrangements.

It is shown that the domains in the matched M. genitalium sequences come from 114 superfamilies and that 58% of them have arisen by gene duplication, more than twice that found by using pairwise sequence comparisons.

Immunoglobulin superfamily proteins in Caenorhabditis elegans.

This study describes the repertoire of proteins that are members of the immunoglobulin superfamily (IgSF) in Caenorhabditis elegans, a framework for refinement and extension of the repertoire as gene and protein definitions improve, and the basis for investigations of their function and for comparisons with the repertoires of other organisms.

Genome‐wide analysis of integral membrane proteins from eubacterial, archaean, and eukaryotic organisms

Detailed statistical analyses of integral membrane proteins of the helix‐bundle class from eubacterial, archaean, and eukaryotic organisms for which genome‐wide sequence data are available suggest that uni‐cellular organisms appear to prefer proteins with 6 and 12 transmembrane segments, whereas Caenorhabditis elegans and Homo sapiens have a slight preference for proteins with seven transmembrane segments.

Patterns of protein‐fold usage in eight microbial genomes: A comprehensive structural census

Eight microbial genomes are compared in terms of protein structure and patterns of fold usage (whether a given fold occurs in a particular organism); all the genomes appear to have similar usage patterns for these folds, according to a “Zipf‐like” law.

Protein evolution viewed through Escherichia coli protein sequences: introducing the notion of a structural segment of homology, the module.

It is confirmed that E. coli contains a very high proportion of paralogous proteins and found that the segments of homology fell into 352 sequence-related groups or families, which strongly suggests that the 1404 present-day modules and proteins derive from a minimal set of 352 ancestral modules.

Cadherin superfamily proteins in Caenorhabditis elegans and Drosophila melanogaster.

The identification and analysis of the cadherin repertoires in the genomes of Caenorhabditis elegans and Drosophila melanogaster are presented and it is shown that three pairs of genes, and two triplets, should be merged to form five single genes.

Estimating the number of protein folds and families from complete genome data.

Using the data on proteins encoded in complete genomes, combined with a rigorous theory of the sampling process, we estimate the total number of protein folds and families, as well as the number of

Sequence comparisons using multiple sequences detect three times as many remote homologues as pairwise methods.

The extent to which the SAM-T98 implementation of a hidden Markov model procedure, PSI-BLAST, and the intermediate sequence search (ISS) procedure can detect evolutionary relationships between the members of the sequence database PDBD40-J is determined.