We have developed a computer-based screening strategy to search the dbEST database to find differentiation antigens that are expressed by cancers arising in nonessential normal tissues such as prostate, breast, and ovary (G. Vasmatzis et al., Proc. Natl. Acad. Sci. USA, 95: 300–304, 1998). Here, we report the identification of three new members of the GAGE/PAGE family, termed XAGEs. XAGE-1 and XAGE-2 are expressed in Ewing’s sarcoma, rhabdomyosarcoma, a breast cancer, and a germ cell tumor. We also describe the relationship of the XAGEs to the GAGE/PAGE family. XAGE-1 and XAGE-2 should be evaluated as possible targets for vaccine-based therapies of cancer.
Expressed sequence tags (ESTs) are partial sequences of clones randomly selected from various cDNA libraries (1, 2, 3, 4, 5, 6). Each of these clones is generated from a single transcript. The number of ESTs from a given gene in the library of a given tissue therefore reflects the tissue-specific expression level of that gene. This provides valuable information on the expression patterns of genes in different tissues. The publicly available EST sequences are collected in the dbEST database, maintained by the National Center for Biotechnology Information. The Cancer Genome Anatomy Project of the National Cancer Institute uses laser capture microdissection techniques to generate EST libraries (4, 5, 6) from normal and malignant tissues. The Cancer Genome Anatomy Project has accumulated an enormous amount of tissue-specific sequence data, which keeps growing by the addition of more sequences and libraries from different tissues and tumor types. These EST sequences can be clustered, sorted, and filtered to screen and identify genes that are specifically expressed in certain tissues. Such “database mining” provides sequences that are expressed preferentially or exclusively in malignant tissues and, thus, provides lists of genes that may be useful targets for the diagnosis and therapy of diseases (4, 5, 6, 7). We have recently reported on a computer screening strategy to identify genes that are preferentially expressed in tumors, e.g., prostate tumors. Using this procedure, we identified several new transcripts that may be useful targets for the therapy of prostate cancer (7, 8). One set of novel genes that we identified by this method consists of three related “PAGE” genes, which are homologous to the GAGE family of tumor antigens. One of these, which was named PAGE-1 in the main text of Brinkmann et al. (8) but was renamed PAGE-4 to distinguish it from PAGE-1 of Chen et al. (9), is expressed in prostate, testis, uterus, fallopian tube, and placenta as well as in prostate, testicular, and uterine cancers.
PAGE-4 is being evaluated as a target for vaccine therapy of prostate cancer. Here, we report the identification of three other PAGE-GAGE-like genes. We extended our EST search and clustering methods to include a computational module, termed “homology walking,” that finds relatives of a gene. Using an EST from the PAGE-4 EST cluster (8) as the starting sequence, this procedure led to the identification of three novel PAGE-GAGE-related genes termed XAGE-1, -2, and -3. The GAGEs, PAGEs, and XAGEs form one large supercluster of related genes, which are expressed in various reproductive tissues as well as in different tumors.
Materials and Methods
Homology walking is an iterative procedure that finds members of a gene family using the EST database (Fig. 1). It begins with a given set of one or more gene sequences (gene set 1). Genes are often represented in dbEST by more than one EST (7, 10). Therefore, the first step of our procedure was to find ESTs for the given gene set by running BLASTN (11) for each of the sequences in gene set 1 against the human ESTs in dbEST. The homology stringency was set higher (BLAST parameters: S = 600, V = 200, B = 200, N = −30, W = 20) than in the previous study (7) because we were interested in distinguishing small sequence differences between genes that belong to the same family. All ESTs found for each gene formed the EST cluster for that gene. To find ESTs from related but not necessarily identical genes, we also ran FASTA (Ref. 12; default parameters) for each of the gene sequences against all human ESTs in dbEST, and the ESTs with an E-score of <0.01 were kept (FASTA hits). The ESTs identified in this way for all the genes in set 1 were pooled to form the FASTA hit list 1. This hit list contained ESTs for the genes in set 1 and possibly some from other related genes. To find these other genes, we removed ESTs that belong to clusters of gene set 1 from the FASTA hit list 1. If any ESTs remained after this procedure, they represented new gene(s) (however, see below). The EST clusters corresponding to these new gene(s) were constructed by running BLASTN against the dbEST database (Fig. 1). The multiple EST sequences for each cluster were then aligned using CLUSTAL in the GCG-Lite package.
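The filtering step described above (keep FASTA hits below the E-score cutoff, then remove ESTs already assigned to a known cluster) can be illustrated with a minimal Python sketch. This is not the code used in the study; the function name `remaining_hits`, its data layout, and the EST identifiers in the usage example are hypothetical, and the FASTA and BLASTN searches themselves are assumed to have been run separately.

```python
from typing import Dict, Iterable, Set


def remaining_hits(fasta_hits: Dict[str, float],
                   known_clusters: Iterable[Set[str]],
                   e_cutoff: float = 0.01) -> Set[str]:
    """Return EST ids that pass the E-score cutoff and lie in no known cluster.

    fasta_hits     maps an EST id to its FASTA E-score (assumed layout).
    known_clusters holds one set of EST ids per already identified gene.
    """
    # Keep only hits below the cutoff (E-score < 0.01 in the text).
    kept = {est for est, e_score in fasta_hits.items() if e_score < e_cutoff}
    # Pool every EST that already belongs to a cluster of the current gene set.
    known = set().union(*known_clusters)
    # Whatever is left is a candidate EST from a new, related gene.
    return kept - known


# Made-up identifiers and scores, for illustration only.
hits = {"est_a": 1e-20, "est_b": 1e-5, "est_c": 0.5}
print(remaining_hits(hits, [{"est_a"}]))
```

In this toy input, `est_a` is removed because it belongs to a known cluster and `est_c` is removed because its E-score exceeds the cutoff, so only `est_b` remains as a candidate.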
In some clusters, the 3′ end sequences were very similar to each other, but the 5′ ends differed. These clusters could be separated into subclusters, and multiple alignment was performed for each subcluster separately. Such divergent transcripts could arise from several causes, including alternative splicing, a ligation error, or the presence of unspliced nuclear mRNAs.
A consensus sequence was then produced from the multiple alignment for each cluster or subcluster. For most aligned positions, the sequences were identical. At disputed positions, the most common nucleotide was chosen for the consensus. These sequences represent the first neighbors of the original gene set. Each of the consensus sequences was run against GenBank with BLASTN to see if it represented a known gene. If so, the GenBank sequence replaced the consensus sequence to represent the cluster.
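The consensus rule described above (the most common nucleotide at each aligned position) can be sketched in Python as follows. This is an illustration only, not the GCG procedure used in the study. Two choices are assumptions that the text does not specify: gap characters are treated as missing coverage and ignored, and ties are broken in favor of the nucleotide that appears first in the column.

```python
from collections import Counter
from typing import List


def consensus(aligned: List[str], gap: str = "-") -> str:
    """Majority-vote consensus of equal-length, pre-aligned sequences."""
    if not aligned:
        return ""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("aligned sequences must all have the same length")
    out = []
    for column in zip(*aligned):
        # Assumption: a gap means "no coverage here" and does not vote.
        counts = Counter(base for base in column if base != gap)
        if counts:
            # Assumption: ties go to the base seen first in the column.
            out.append(counts.most_common(1)[0][0])
    return "".join(out)


print(consensus(["ACGT-", "ACCTA", "ACGTA"]))  # ACGTA
```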
The procedure was then repeated using the augmented set of genes to find the next neighbor set of genes. Many of the steps were skipped in the second cycle because EST clusters were already known for all the genes and FASTA needed to be run only with the new gene sequences.
The iteration was terminated when no new genes could be identified. In principle, the termination point was reached when there was no EST left in the combined FASTA hit list after the ESTs from already identified genes had been removed. In practice, however, we found that some ESTs were always left, which, nevertheless, did not represent new genes in the family. Some of these were very short sequences that should have been included in one of the clusters but were excluded because of their short length. Others were longer but contained a long stretch of unrelated sequence joined by a short stretch that had high homology with one of the identified gene sequences. These were not considered to represent a new gene and were ignored. In other cases, the sequence had a homologous stretch only in the untranslated region, whereas the translated region shared no homology with the other genes in the family; these were also ignored.
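The overall iteration can be summarized by the following Python sketch, which is a simplified model rather than the implementation used here. It assumes that the FASTA hits for each gene and the assignment of ESTs to gene clusters are already available as lookup tables; the names `homology_walk`, `fasta_hits`, and `est_to_gene` are hypothetical. ESTs that cannot be assigned to any gene (for example, very short or chimeric sequences) are simply skipped, mirroring the manual exclusions described above.

```python
from typing import Dict, List, Set


def homology_walk(seeds: Set[str],
                  fasta_hits: Dict[str, Set[str]],
                  est_to_gene: Dict[str, str],
                  max_rounds: int = 10) -> List[Set[str]]:
    """Return the set of newly found genes for each productive round.

    fasta_hits  maps a gene name to the EST ids its sequence hits (assumed).
    est_to_gene maps an EST id to the gene whose cluster contains it (assumed).
    """
    known = set(seeds)
    frontier = set(seeds)      # genes whose sequences still need a FASTA run
    found_per_round: List[Set[str]] = []
    for _ in range(max_rounds):
        hits: Set[str] = set()
        for gene in frontier:  # only new gene sequences are searched
            hits |= fasta_hits.get(gene, set())
        # Drop ESTs of known genes; ignore ESTs that map to no gene at all.
        new_genes = {est_to_gene[e] for e in hits if e in est_to_gene} - known
        if not new_genes:      # termination: nothing new was identified
            break
        found_per_round.append(new_genes)
        known |= new_genes
        frontier = new_genes
    return found_per_round


# Toy data with placeholder names: A reaches B, and B reaches C.
toy_hits = {"A": {"a1", "b1"}, "B": {"b1", "c1", "junk"}, "C": {"c1"}}
toy_map = {"a1": "A", "b1": "B", "c1": "C"}
print(homology_walk({"A"}, toy_hits, toy_map))  # [{'B'}, {'C'}]
```

As in the procedure described above, later rounds run the search only with the gene sequences added in the previous round, and the walk stops at the first round that adds no gene.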
The GCG program PILEUP was used to generate homology relationship dendrograms. The GCG program MAP was used to obtain deduced protein translations from nucleotide sequences.
Results and Discussion
Homology Walking in the EST Database
The results presented below were obtained by using the dbEST file provided by the National Center for Biotechnology Information as of August 1, 1998. The sequence homology programs FASTA and BLASTN were used to perform homology walking in the human dbEST database (see “Materials and Methods”). The starting sequence was the nh32c06.s1 EST, which spanned the whole consensus sequence of the PAGE-4 EST cluster.
In the first step of homology walking with only PAGE-4 as the lead sequence, seven separate EST clusters were identified (Table 1): four for already defined genes (GAGEs and PAGE 2–4) and three that corresponded to new genes (XAGE 1–3). For the second step of the homology walking, the nucleotide sequences of PAGE-2, PAGE-3, XAGE-1, XAGE-2, and XAGE-3 and the known sequences of the seven GAGE genes were used as the lead sequences (see “Materials and Methods”). This second step brought in one more EST (op90e12.s1) that represented another known gene, PAGE-1 (9). A third step of homology walking, using the PAGE-1 gene sequence retrieved from GenBank, failed to identify any additional genes.
Thus, the homology walking procedure came to an end after two steps, with PAGE-4 as the starting point. However, an examination of the FASTA hit list and the EST clusters of all identified genes indicates that one step would have sufficed if the procedure had been started from any other identified gene. The MAGEs and GAGEs have been related in the literature because of some common functional characteristics (13), and in an earlier paper (8), we showed a very weak sequence homology between MAGEs and GAGE/PAGE genes at the amino acid level. However, the procedure used here did not bring the MAGEs into the family.
Fig. 2 shows a dendrogram constructed from the nucleotide sequence homologies of the 14 genes (PAGE 1–4, XAGE 1–3, and GAGE 1–7) found here by the homology walking procedure. The GAGE genes are clustered along with PAGE-1 in one branch; PAGE-3, XAGE-1, XAGE-2, and XAGE-3 genes are clustered together in another branch, closer to the GAGE genes than are PAGE-4 and PAGE-2a/b. The PAGE-2a and PAGE-2b genes are clustered together because they represent sequence variants from the same gene (see below). PAGE-4 is the most distant relative in this family of GAGE/PAGE/XAGE genes. A dendrogram of the putative amino acid sequences of the same genes shows a similar relationship between the members of this gene family.
Other Methods for Finding Related Genes
Relatives of a gene can also be found by means of PSI-BLAST (14) and by ENTREZ. However, both use full-sequence databases rather than the dbEST database. The use of the dbEST database has advantages because many more genes are represented in the dbEST database, but it is also more complicated because EST sequences must be clustered to find a consensus gene sequence. When PSI-BLAST was run using a GAGE sequence, the PAGE-1 and PAGE-4 sequences were identified as well as other GAGE genes. As expected, the PAGE-2, PAGE-3, and XAGE genes were not found because these are not in the full-sequence databases. ENTREZ appears to find only close neighbors: queries with GAGE sequences return only other GAGE genes, and queries with PAGE-1 or PAGE-4 return neither GAGE nor other PAGE genes.
Known PAGE-related Genes That Were Detected by Homology Walking
Three EST clusters represent the three PAGE genes, PAGE 2–4, that we described previously (8). Multiple alignment of the PAGE-2 cluster shows a deletion of 51 nucleotides in two ESTs (zv62h08.r1 and ai61a04.s1). The reading frames with and without the deletion were the same except for a deletion of 17 amino acids (Fig. 3). The two sequences were denoted PAGE-2a and PAGE-2b.
Another cluster includes ESTs from the GAGE genes (15, 16). Because these genes are highly homologous to each other, they gather in one cluster according to our BLASTN criteria. The fifth cluster includes a single EST (op90e12.s1) corresponding to the PAGE-1 gene, which was previously identified as another GAGE-related prostate antigen (9).
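The reading frame is preserved in PAGE-2 because the deletion length is a multiple of three: 51 nucleotides correspond to exactly 17 codons. The short Python check below illustrates this arithmetic; the function name `codons_removed` is hypothetical and is used only for illustration.

```python
from typing import Optional


def codons_removed(deleted_nt: int) -> Optional[int]:
    """Return the number of whole codons lost, or None if the frame shifts."""
    if deleted_nt < 0:
        raise ValueError("deletion length must be non-negative")
    # A deletion keeps the downstream frame only if it removes whole codons.
    return deleted_nt // 3 if deleted_nt % 3 == 0 else None


print(codons_removed(51))  # 17
```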
New Genes Identified by Homology Walking
Three EST clusters represent genes that have not been described previously. We named them XAGE-1, XAGE-2, and XAGE-3.
The XAGE-1 cluster contains 13 ESTs from testis, bone sarcoma, and muscle cancer libraries. This cluster could be separated into two subclusters, one with eight and the other with five ESTs. The sequence of the smaller subcluster contains a segment that resembles the end of an intron and does not contain an open reading frame upstream of that position. The larger subcluster contains an open reading frame from the start of the composite sequence and another frame that codes for a protein sequence that is homologous to that of the GAGE proteins. But this open reading frame does not contain an initiation codon (ATG) until about halfway into the sequence. Because of the uncertainty with translation, this gene was omitted from Fig. 3.
The XAGE-2 cluster contains 19 ESTs from fetal liver-spleen, placenta, uterus, fetal heart, and breast tumor libraries and 1 EST (nw51g04.s1) from a bone sarcoma library (Table 1). The EST from the bone sarcoma library has a 5′ end sequence that is different from that of the others, but there is no obvious splice site at the point of the sequence deviation. This EST was initially removed from the cluster and a composite sequence was constructed from the other sequences. A translation of the composite sequence indicated an open reading frame of 111 amino acids that was closely related to XAGE-3 (Fig. 3). The 5′ deviant nw51g04.s1 sequence encodes the same protein, XAGE-2, because the point of deviation is upstream of the initiation codon.
The XAGE-3 cluster contains 8 ESTs, 6 from placenta and 2 from liver-spleen libraries (Table 1). Two of these ESTs, both from clone yw86a06, have a 5′ end sequence that is different from that of other ESTs. At the position where the sequences begin to diverge, the deviant sequence has a CT stretch and an AG, which is consistent with a splice site (17). It is likely that this sequence represents either an unspliced nuclear RNA or an alternatively spliced form of the same gene. A consensus sequence was constructed after omitting the deviant ESTs from the alignment. This consensus encodes a deduced open reading frame of 111 amino acids (Fig. 3).
XAGE Proteins Might Be Useful Tumor Targets
MAGE, BAGE, and GAGE genes encode antigens that are specifically expressed in certain tumors, such as melanomas (13, 15, 18, 19), and PAGE-4 is expressed in prostate and prostate tumors (8). Some of the genes are presently being evaluated as targets for cancer therapy (20). The expression specificity of the various XAGE genes can be deduced from the tissue/tumor distribution of the corresponding transcripts in the dbEST database (Table 1). These expression patterns indicate that some of the novel XAGE genes are expressed, like many other MAGE/GAGE genes, in tumors and in some fetal and reproductive tissues, but not in essential human tissues. XAGE-3 is expressed in placenta and fetal liver/spleen, and XAGE-2 appears in fetal tissues, placenta, and uterus as well as in breast tumor and bone sarcoma libraries. Of particular interest may be XAGE-1, which is found in various bone and muscle cancer libraries. The EST frequency in these cancer libraries suggests not only that this gene is specifically expressed in bone and muscle cancer but also that its expression level is high. Thus, XAGE-1 may be an attractive target for the therapy of muscle and bone tumors.
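The tissue/tumor distribution of a cluster can be tabulated by counting its ESTs per source library, as in the following minimal Python sketch. The function name `ests_per_tissue`, the input layout, and the identifiers in the example are hypothetical placeholders, not data from Table 1. Raw counts are only a rough guide to expression level because libraries differ in size.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def ests_per_tissue(cluster: Iterable[Tuple[str, str]]) -> Dict[str, int]:
    """Count the ESTs of one cluster by the tissue of their source library.

    cluster is an iterable of (EST id, library tissue) pairs (assumed layout).
    """
    return dict(Counter(tissue for _est_id, tissue in cluster))


# Placeholder cluster, for illustration only.
example = [("est_1", "placenta"), ("est_2", "placenta"), ("est_3", "uterus")]
print(ests_per_tissue(example))  # {'placenta': 2, 'uterus': 1}
```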
The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.
↵1 Present address: Epidauros Biotechnology, Am Neuland 1, D-82347 Bernried, Germany.
↵2 The first two authors contributed equally to this work.
↵3 To whom requests for reprints should be addressed, at Laboratory of Molecular Biology, Division of Basic Sciences, National Cancer Institute, NIH, Building 37, Room 4E16, 37 Convent Drive, MSC 4255, Bethesda, MD 20892-4255. Phone: (301) 466-4797; Fax: (301) 402-1344.
↵4 The abbreviation used is: EST, expressed sequence tag.
- Received January 12, 1999.
- Accepted February 15, 1999.
- ©1999 American Association for Cancer Research.