Computing the Estimated EST Expression

**** This page is under construction *****

The CGAP tissue and histologic database

For each tissue, the NCI Cancer Genome Anatomy Program (CGAP) database may be queried by histological state, source, extraction method, and cloning method. In the initial query on the CGAP Web site, selecting the option "ANY" for all fields provides an overview of the available libraries. The more restrictive the search, the fewer libraries are selected. Within each library, transcripts are listed along with the number of times they were detected after a fixed number of PCR cycles.

Because we were primarily interested in computing protein maps, we used UniGene to extract the gene symbols associated with those CGAP ESTs that were clustered to a gene of known function. To restate: the CGAP Web site contains library-specific expression data, and the UniGene site contains the correspondence between gene clusters and gene symbols.

Finally, the Expasy SwissProt/trEMBL database contains gene symbols and protein sequence data. From the sequence data, one can compute the isoelectric point (pI) and molecular mass (Mw).

1. Mapping gene symbols between the CGAP, UniGene and SwissProt databases

A Perl script outputs these gene symbols from the CGAP-UniGene derived data. The symbols are cross-referenced against the Expasy SwissProt/trEMBL Homo sapiens data set to produce a list of corresponding SwissProt accession numbers (SP-ACC). This list can then be input to the Expasy pI/Mw tool server to produce tab-delimited data containing the pI (isoelectric point), molecular mass (Mw), and SwissProt ID (SP-ID) for the mature, unmodified proteins [Medjahed03a]. The following summarizes the steps in the annotation mapping.

Remove all EST and empty gene symbol entries
Sort by the Hs. UniGene identifiers
Look up the gene symbols in the sorted UniGene data
Using the gene symbols, look up (SP-ACC,SP-ID,pI,Mw) on the Expasy.org Web site
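
The four steps above can be sketched as follows. This is only an illustration in Python, not the Perl script the authors used. The record layouts, the function names, and the rule that an unnamed cluster carries the symbol "EST" or "ESTs" are all assumptions made for the example, and a plain dictionary stands in for the query to the Expasy Web site.

```python
def unigene_number(hs_id):
    """Return the numeric part of a UniGene identifier such as 'Hs.1234'."""
    return int(hs_id.split(".", 1)[1])


def build_symbol_table(unigene_rows):
    """Filter and sort (hs_id, gene_symbol) pairs (hypothetical layout)."""
    # Step 1: drop entries with an empty symbol or labeled only as ESTs
    # (the "EST"/"ESTs" labels are an assumption about the input).
    kept = [
        (hs_id, symbol.strip())
        for hs_id, symbol in unigene_rows
        if symbol.strip() and symbol.strip().upper() not in {"EST", "ESTS"}
    ]
    # Step 2: sort by the numeric part of the Hs. identifier.
    kept.sort(key=lambda row: unigene_number(row[0]))
    return kept


def annotate(cgap_hits, symbol_table, swissprot_by_symbol):
    """Attach (SP-ACC, SP-ID, pI, Mw) to each CGAP (hs_id, hits) pair."""
    # Step 3: look up the gene symbol for each Hs. identifier.
    # A dict is used here, so the sort above is kept only to mirror the text.
    symbol_of = dict(symbol_table)
    annotated = []
    for hs_id, hits in cgap_hits:
        symbol = symbol_of.get(hs_id)
        if symbol is None:
            continue
        # Step 4: the dict stands in for the Expasy lookup by gene symbol.
        annotation = swissprot_by_symbol.get(symbol)
        if annotation is None:
            continue
        sp_acc, sp_id, pi, mw = annotation
        annotated.append((hs_id, symbol, hits, sp_acc, sp_id, pi, mw))
    return annotated
```

Entries that cannot be mapped at either lookup are silently skipped in this sketch; a real pipeline might instead log them for review.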

2. Computing the estimated EST expression

We then needed to compute the estimated EST expression. In the case of a single library, this information was computed from the expression-detection counts. The number of hits for each CGAP EST was first divided by the total number of sequences within that library to give a relative expression for each transcript.

Then, the results were renormalized by dividing the relative expression levels by the maximum relative expression level, so that the maximum expression was normalized to 1.0. Expression values are greater than 0.0 (least abundant) and less than or equal to 1.0 (most abundant).
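
As a minimal sketch of this two-step normalization for a single library, assuming the hit counts are held in a dictionary keyed by an EST identifier (the function name and the data layout are hypothetical):

```python
def normalized_expression(hits_by_est):
    """Map raw hit counts to expression values in the range (0.0, 1.0]."""
    total = sum(hits_by_est.values())
    if total <= 0:
        raise ValueError("library contains no sequences")
    # Relative expression: hits divided by the library total.
    relative = {est: hits / total for est, hits in hits_by_est.items()}
    # Renormalize so that the most abundant transcript has expression 1.0.
    max_relative = max(relative.values())
    return {est: rel / max_relative for est, rel in relative.items()}
```

For example, counts of 2, 8 and 10 in a library of 20 sequences give relative expressions of 0.1, 0.4 and 0.5, which renormalize to 0.2, 0.8 and 1.0.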

A tissue search may find several libraries fulfilling the requirements of the initial query. Therefore, to improve the signal-to-noise ratio, the search results were pooled to generate a non-redundant list of entries. This leads to a more comprehensive expression map for that tissue corresponding to that histological state.

Pool and add CGAP libraries corresponding to the same tissue and histological state
Compute the relative EST frequencies from this pooled data
Compute the maximum relative EST frequency (i.e., MaxESTexpr)
Merge the (SP-ACC,SP-ID,pI,Mw) data with the (SP-ID,MaxESTexpr) data to generate the (SP-ACC,SP-ID,pI,Mw,MaxESTexpr) data used for the ProtPlot master protein index.
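
The pooling and merging steps above might look like the following sketch. It assumes that each library is a dictionary of hit counts already keyed by SP-ID, and it reads MaxESTexpr as the relative frequency after division by the maximum; both are interpretations made for the example rather than details given in the text. The function names are hypothetical.

```python
from collections import Counter


def pooled_expression(libraries):
    """Pool hit counts across libraries and normalize to a maximum of 1.0."""
    pooled = Counter()
    for hits_by_sp_id in libraries:
        # Adding the counts yields one non-redundant entry per identifier.
        pooled.update(hits_by_sp_id)
    total = sum(pooled.values())
    if total <= 0:
        raise ValueError("pooled libraries contain no sequences")
    relative = {sp_id: hits / total for sp_id, hits in pooled.items()}
    max_relative = max(relative.values())
    return {sp_id: rel / max_relative for sp_id, rel in relative.items()}


def merge_annotation(annotation_rows, expr_by_sp_id):
    """Join (SP-ACC, SP-ID, pI, Mw) rows with expression values on SP-ID."""
    merged = []
    for sp_acc, sp_id, pi, mw in annotation_rows:
        if sp_id in expr_by_sp_id:
            merged.append((sp_acc, sp_id, pi, mw, expr_by_sp_id[sp_id]))
    return merged
```

Annotation rows with no expression value are dropped here, which is one possible choice; keeping them with a value of 0.0 would be another.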

3. Generating the ProtPlot .prp files

The resulting data is a tab-delimited ".prp" formatted file that contains expression levels ranging from 0.0 (undetected) to 1.0 (most abundant). The following sequence of operations is performed on each (tissue, histological state) to create a ProtPlot sample. Each ProtPlot sample is saved in a tab-delimited ".prp" formatted file containing the (SP-ACC,SP-ID,pI,Mw,MaxESTexpr) data.
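
Writing such a file could be sketched as below. Only the tab delimiter and the five fields come from the description above; the absence of a header row, the column order, and the number formatting are assumptions, and the function name is hypothetical.

```python
import csv


def write_prp(rows, handle):
    """Write (SP-ACC, SP-ID, pI, Mw, MaxESTexpr) rows as tab-delimited text."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    for sp_acc, sp_id, pi, mw, expr in rows:
        # Number formatting is an assumption, not part of the .prp description.
        writer.writerow([sp_acc, sp_id, f"{pi:.2f}", f"{mw:.0f}", f"{expr:.6f}"])
```

A file would be produced with, for instance, `with open("sample.prp", "w", newline="") as f: write_prp(rows, f)`, where `sample.prp` is a placeholder name.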

Additional details on these methods are available in [Medjahed02], [Medjahed03a], and [Medjahed03b].

Djamel Medjahed, LMT, SAIC-Frederick
Peter Lemkin, LECB, NCI-Frederick

Revised: 08-26-2004