Gene expression profiles in normal and cancer cells.
Serial analysis of gene expression, or SAGE, is a technique designed to take advantage of high-throughput sequencing technology to obtain a quantitative profile of cellular gene expression. Essentially, the SAGE technique does not measure the expression level of a gene directly; it quantifies a “tag” that represents the transcription product of a gene. A tag, for the purposes of SAGE, is a nucleotide sequence of a defined length, directly 3’-adjacent to the 3’-most restriction site for a particular restriction enzyme. As originally described, the tag was nine bases long and the restriction enzyme was NlaIII. Current SAGE protocols produce a ten to eleven base tag, and, although NlaIII remains the most widely used restriction enzyme, enzyme substitutions are possible. The data product of the SAGE technique is a list of tags with their corresponding count values, and is thus a digital representation of cellular gene expression. However, to say that SAGE produces a digital output is not to imply that no loss of fidelity occurs in the conversion of an actual transcript and its expression level to a tag and its count value. Accuracy in both the assignment of tags to genes and the quantification of a gene’s expression level is sacrificed in order to increase throughput, and thereby to increase the speed and lower the cost of analysis. A ten base tag is by no means a perfect representation of a gene’s entire transcript. There will be instances in which two or more genes share the same tag (i.e., the tag to gene assignment is ambiguous), and instances in which one gene has more than one tag (i.e., through alternate termination in an individual, and polymorphism in a population, the gene to tag assignment is not specific).
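The tag definition above, and the two ways in which tag to gene assignment can fail, can be illustrated with a minimal Python sketch. It assumes NlaIII (recognition sequence CATG), a ten base tag, and transcripts supplied as 5’-to-3’ strings; the function names and the input layout are hypothetical and chosen only for illustration, not taken from any SAGE software.

```python
from collections import defaultdict

SITE = "CATG"     # NlaIII recognition sequence
TAG_LENGTH = 10   # assumed tag length in bases


def extract_tag(transcript, site=SITE, tag_length=TAG_LENGTH):
    """Return the tag directly 3'-adjacent to the 3'-most site, or None.

    None is returned when the transcript has no site, or when fewer than
    tag_length bases follow the 3'-most site.
    """
    seq = transcript.upper()
    pos = seq.rfind(site)            # 3'-most occurrence of the site
    if pos == -1:
        return None
    start = pos + len(site)
    tag = seq[start:start + tag_length]
    return tag if len(tag) == tag_length else None


def tags_to_genes(transcripts):
    """Map each tag to the set of genes whose transcripts produce it.

    transcripts is an iterable of (gene_name, sequence) pairs; a gene may
    appear more than once (e.g. alternately terminated transcripts).
    """
    mapping = defaultdict(set)
    for gene, seq in transcripts:
        tag = extract_tag(seq)
        if tag is not None:
            mapping[tag].add(gene)
    return dict(mapping)


def ambiguous_tags(tag_map):
    """Tags shared by two or more genes (ambiguous tag to gene assignment)."""
    return {tag: genes for tag, genes in tag_map.items() if len(genes) > 1}


def nonspecific_genes(tag_map):
    """Genes that yield more than one tag (gene to tag assignment not specific)."""
    gene_tags = defaultdict(set)
    for tag, genes in tag_map.items():
        for gene in genes:
            gene_tags[gene].add(tag)
    return {gene: tags for gene, tags in gene_tags.items() if len(tags) > 1}
```

Given a list of (gene, sequence) pairs, `ambiguous_tags(tags_to_genes(pairs))` lists the tags that cannot be assigned to a single gene, and `nonspecific_genes` lists the genes represented by more than one tag. The sketch ignores strand orientation and poly(A) tails, which a real pipeline would need to handle.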
As if this inherent difficulty in making specific and unambiguous tag to gene assignments were not enough, a sequencing error rate that is entirely acceptable for most sequencing tasks can have several disturbing effects on SAGE tag data, because the sequences involved are so short. There are, then, two problems to be tackled when dealing with SAGE data in the form of tags and counts. The first is ensuring that the tags and their counts are a valid representation of transcripts and their levels of expression; the second is making valid tag to gene assignments. For the first problem (the valid data problem), sequencing error has the greatest effect. Assuming an average per-base sequencing error rate of 1%, the chance of one or more errors occurring in a ten base tag is roughly 10%. An error, if it occurs, will of course lower the count of the correct tag by one, but it will also either increase the count of an already established tag by one, or establish and count a tag that does not, in reality, exist. The shifting of established counts is not of great concern when drawing conclusions from tags with relatively high counts, since raising or lowering a tag count by one or two should, overall, have no great effect. Both the shifted counts and the spurious tags, on the other hand, do much to increase suspicion of tags with low counts, particularly those with a count of 1. To date, the only way this suspicion has been dealt with is to remove from the data all tags counted only once. This may not be an optimal approach, and investigations are currently underway to discover whether a better approach exists. For the second problem (making valid tag to gene assignments), nonspecific and ambiguous tag to gene assignments, as well as sequencing error, all play a role in creating confusion. In making tag to gene assignments, a certain degree of messiness is encountered.
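The error estimate and the singleton-removal step described above can be written out as a short Python sketch. It assumes that sequencing errors occur independently at each base, which the text does not state explicitly; the function names are hypothetical.

```python
def prob_tag_error(per_base_error=0.01, tag_length=10):
    """Probability that a tag contains one or more sequencing errors.

    Assumes errors are independent across bases, so the probability of an
    error-free tag is (1 - per_base_error) ** tag_length.
    """
    return 1.0 - (1.0 - per_base_error) ** tag_length


def drop_singletons(tag_counts):
    """Remove tags counted only once, the filter described in the text."""
    return {tag: count for tag, count in tag_counts.items() if count > 1}
```

With the default values, `prob_tag_error()` evaluates 1 - 0.99^10, which is a little under 10% and agrees with the rough figure quoted above. `drop_singletons` discards every tag with a count of 1, whether it arose from a sequencing error or from a genuinely rare transcript, which is one reason the approach may not be optimal.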
It would be preferable if specific and unambiguous gene assignments could be made for every experimentally derived tag, but this is definitely not the case. The difficulties are several, and begin with the set of sequences from which tags are derived. The transcriptome of Homo sapiens has yet to be entirely sequenced, let alone characterized. Until it is, there is only an incomplete set of sequences from which to derive tags. Next, of the roughly 1.5 million transcript-source sequences stored in GenBank, only about 18,000 are well-characterized cDNA/mRNA sequences, while the vast majority are expressed sequence tag (EST) sequences. The problem with using EST sequences for the derivation of 10 base tags is that they are usually only single-pass sequenced and therefore have, roughly, an average 1%