Abstract
33
Microsatellites, or simple sequence repeats (SSRs), are widely present in the human genome. About 3% of the human genome is composed of such repeats, in the form of mono-, di-, tri-, and up to hexanucleotide repeats. Repeat sequences occurring in the coding regions of DNA are likely to influence the function of the encoded proteins. Trinucleotide repeats have been associated with the inheritance of several genetic disorders, and to date at least 30 diseases have been linked to trinucleotide repeat expansions (TRE). The purpose of this study is to systematically identify trinucleotide repeats in the coding regions of all cancer related genes and to study their potential influence on protein structure and function using bioinformatics and computational approaches. The coding sequences of 2245 cancer related genes involved in 18 different pathways listed on the CGAP web site (cgap.nci.nih.gov) were retrieved from the Ensembl database (www.ensembl.org) using the EnsMart batch data/sequence retrieval system (www.ensembl.org/EnsMart). Trinucleotide repeat sequences were mapped using the UNIX version of the Perfect Tandem Repeat Finding Program (PTRF) (ncisgi.ncifcrf.gov/~collinsj/Tandem_Repeats/downloads). For further systematic analysis of the repeats, a local relational database containing repeat, gene, protein domain, and pathway data was created. Since longer repeats are more likely to be unstable in the genome, we focused further analysis on repeats with a length of six or more units (n=95). For these repeats, the exact exonic location of the repeat and the encoded amino acid sequence were determined and recorded using the transcript structure data from the Ensembl database. Using the peptide structure data in the same database, the functional domain in which each repeated sequence occurs was determined and added to the summary table.
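The repeat-mapping and filtering step described above can be illustrated with a minimal Python sketch. This is not the PTRF algorithm; it is a simple regular-expression scan, and the function name `find_trinucleotide_repeats`, the returned field names, and the choice to skip homopolymer units such as AAA are assumptions made for illustration only. It assumes the input is a coding sequence that starts in frame at position 0.

```python
import re
from itertools import product

# Standard genetic code, built in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def find_trinucleotide_repeats(cds, min_units=6):
    """Return perfect trinucleotide repeat tracts with at least min_units copies.

    Hypothetical helper: assumes cds is in frame from position 0.
    """
    cds = cds.upper()
    # One 3-base unit followed by at least (min_units - 1) identical copies.
    pattern = re.compile(r"([ACGT]{3})\1{%d,}" % (min_units - 1))
    hits = []
    for m in pattern.finditer(cds):
        unit = m.group(1)
        if len(set(unit)) == 1:
            # Assumption: AAA, CCC, etc. count as mononucleotide repeats.
            continue
        start, end = m.start(), m.end()
        # First codon boundary at or after the start of the tract.
        frame_start = start + (-start % 3)
        codon = cds[frame_start:frame_start + 3]
        hits.append({
            "unit": unit,
            "units": (end - start) // 3,
            "start": start,  # 0-based position in the coding sequence
            "end": end,      # exclusive
            "amino_acid": CODON_TABLE[codon],
        })
    return hits


if __name__ == "__main__":
    # Toy coding sequence: start codon, six GCC (alanine) codons, stop codon.
    for hit in find_trinucleotide_repeats("ATG" + "GCC" * 6 + "TGA"):
        print(hit)
```

Because the scan reports the leftmost start of each tract, the reported unit may be a rotation of the in-frame codon (for example GCA for an in-frame CAG tract); the encoded amino acid is therefore read from the first in-frame codon inside the tract rather than from the unit itself.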
In this study we have shown that the majority (59%) of the repeats are located in the first exon of the cancer genes. Alanine (24%) and glutamine (16%) were the most abundant amino acid repeats in cancer genes. Interestingly, 47% of the repeats were found to occur within a known functional protein domain. A comprehensive PubMed literature search for each repeat revealed that 20% of these repeats have been studied and found to be polymorphic, indicating that they are likely to interfere with the function of the protein domains in the polymorphic state. We believe that screening the potentially functional repeats in cancer patients and comparing their frequencies with those in the normal population is necessary and might have a great impact on elucidating the molecular basis of cancers related to trinucleotide repeats. This strategy provides candidate cancer genes whose functions may be influenced by polymorphism of trinucleotide tracts and which are therefore more likely to contribute to the development of cancer.
- American Association for Cancer Research