We conducted a large-scale association study to identify genes that influence nonfamilial breast cancer risk using a collection of German cases and matched controls and >25,000 single nucleotide polymorphisms located within 16,000 genes. One of the candidate loci identified was located on chromosome 19p13.2 [odds ratio (OR) = 1.5, P = 0.001]. The effect was substantially stronger in the subset of cases with reported family history of breast cancer (OR = 3.4, P = 0.001). The finding was subsequently replicated in two independent collections (combined OR = 1.4, P < 0.001) and was also associated with predisposition to prostate cancer in an independent sample set of prostate cancer cases and matched controls (OR = 1.4, P = 0.002). High-density single nucleotide polymorphism mapping showed that the extent of association spans 20 kb and includes the intercellular adhesion molecule genes ICAM1, ICAM4, and ICAM5. Although genetic variants in ICAM5 showed the strongest association with disease status, ICAM1 is expressed at highest levels in normal and tumor breast tissue. A variant in ICAM5 was also associated with disease progression and prognosis. Because ICAMs are suitable targets for antibodies and small molecules, these findings may not only provide diagnostic and prognostic markers but also new therapeutic opportunities in breast and prostate cancer.
Breast cancer is the most common form of cancer in women and accounts for over 400,000 deaths each year worldwide. Among the risk factors identified are family history, diet, reproductive history, exogenous estrogen exposure, smoking, and alcohol consumption. The heritability of breast cancer is estimated at about 25% (1), but germline mutations in so-called high-penetrance susceptibility genes, such as BRCA1 and BRCA2 (2), explain <10% of all cases. Therefore, it is expected that more common genetic variations with low penetrance account for a high proportion of breast cancer cases.
Family-based linkage studies have proven very successful for identifying genes with highly penetrant alleles that segregate in a simple Mendelian fashion within families (3). However, successes in which linkage methods are used have largely been restricted to rare diseases and to selected “familial” subsets of common disorders, accounting for only a small proportion of disease in the population. At least two significant problems are associated with using traditional linkage approaches to map complex traits. First, linkage mapping has low resolution, and the genetic loci implicated in such studies can include hundreds of genes. Second, complex diseases are determined by the interaction of multiple genes throughout the genome (4). Within different families, the contributions of the different genes will likely vary, thus complicating the ability to detect and replicate linkage.
For common disease susceptibility genes, direct association approaches that compare the distribution of genetic marker frequencies between groups of unrelated individuals are expected to have greater power than traditional linkage studies (5). Recently, there has been increasing interest in the use of whole-genome association methods to identify genes that are involved in complex trait variation (6, 7, 8). To date, however, few large-scale studies have been reported (9, 10, 11), and the search for common polymorphisms that influence breast cancer risk has yet to produce useful markers. In an effort to identify genes and variants that influence breast cancer susceptibility, we conducted a large-scale case-control study using >25,000 single nucleotide polymorphisms (SNPs) located within approximately 16,000 genes. We report the identification of intercellular adhesion molecule variants on chromosome 19p13.2 that confer increased breast and prostate cancer risk and might predict disease outcome.
MATERIALS AND METHODS
Study Subjects and Preparation.
The discovery sample comprised 254 breast cancer patients attending the Frauenklinik Innenstadt, Munich, Germany. Lymph node status was positive at time of assessment in 94 cases (37%), and 18 cases (7%) had known distant metastases. Twenty-four cases (11%) reported a positive family history of breast cancer. The median age at diagnosis was 56 years (range = 23–87 years). Two hundred and sixty-eight controls with a median age of 57 years (range = 17–88 years) were recruited from patients with benign disease attending the clinic during the same period. Controls with a family history of breast or ovarian cancer were excluded. Both parents of each study participant were reported to be of German descent.
The German replication sample consisted of 188 cases and 150 controls recruited at the Department of Obstetrics and Gynecology, Technical University of Munich. The majority of breast cancer cases were recruited at preoperative visits, and female controls were recruited from healthy individuals or patients with nonmalignant diagnoses. Median age of diagnosis for cases was 59 years (range = 22–87 years), and median age of controls was 50 years (range = 19–91 years). All but two participants reported both parents were of German descent. The two exceptions each reported one parent of non-German, Eastern European origin.
The Australian replication sample comprised 180 breast cancer cases recruited by the Pathology Department of Gold Coast Hospital or by the Genomics Research Center, Southport. Median age of diagnosis was 50 years (range = 24–74 years). Controls consisted of 180 healthy volunteers recruited through the Genomics Research Center. Only controls with no family history of cancer or precancerous conditions were included. Controls were individually age matched to cases (±5 years). Median age for controls was 60 years (range = 28–94 years).
The prostate cancer sample consisted of 368 German patients with a median age of diagnosis of 65 years (range = 43–90 years) recruited at the Urology Clinic Munich-Planegg, and 368 controls without symptomatic prostate disease with a median age of 68 years (range = 25–92 years) collected at the University Hospital of Tuebingen. All subjects involved reported both parents to be of German ancestry.
All subjects involved in our studies gave written informed consent, and the institutional ethics committees of the participating institutions approved the experimental protocols. For discovery samples, we extracted DNA from 5 mL of blood from each subject using a desalting method and quantitated it fluorometrically (Fluoroskan Ascent CF, Labsystems, Helsinki, Finland) using PicoGreen. DNA pools were generated by combining equimolar amounts of each sample as described elsewhere (12, 13).
SNP Markers and Genotyping.
A set of 25,494 SNP markers was selected from a collection of 125,799 experimentally validated polymorphic variations (14). This set was limited to SNPs located within gene coding regions, with minor allele frequencies >0.02 (95% have frequencies >0.1) and a target inter-marker spacing of 40 kb. SNP annotation is based on the National Center for Biotechnology Information dbSNP database, refSNP, build 118. Genomic annotation is based on National Center for Biotechnology Information genome build 34. Gene annotation is based on LocusLink genes for which the National Center for Biotechnology Information was providing positions on the Mapview FTP site.
For pooled DNA assays, 25 ng of case and control DNA pools were used for amplification at each site. We conducted all PCR and MassEXTEND (Sequenom, Inc., San Diego, CA) reactions using standard conditions (13) . Relative allele frequency estimates were derived from area under the peak calculations of mass spectrometry measurements from four analyte aliquots as described elsewhere (13) . For individual genotyping, the same procedure was applied except only 2.5 ng of DNA was used, and only one mass spectrometry measurement was taken.
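The pooled allele frequency estimate described above can be sketched as follows. This is a minimal illustration that assumes the relative frequency is the ratio of one allele's peak area to the summed peak areas, averaged over the four aliquots. The function name and the peak-area values are hypothetical, and any corrections applied in the published procedure (13) are not modeled.

```python
from statistics import mean


def relative_allele_frequency(peak_areas):
    """Estimate the frequency of allele 1 in a DNA pool.

    peak_areas: list of (area_allele1, area_allele2) tuples, one tuple per
    mass spectrometry measurement (aliquot) of the same extension product.
    Assumption: frequency = area1 / (area1 + area2), averaged over aliquots.
    """
    per_aliquot = [a1 / (a1 + a2) for a1, a2 in peak_areas]
    return mean(per_aliquot)


# Hypothetical peak areas for four aliquots of one pool (not study data).
case_pool_areas = [(620.0, 380.0), (600.0, 400.0), (610.0, 390.0), (630.0, 370.0)]
case_pool_frequency = relative_allele_frequency(case_pool_areas)
print(f"Estimated allele frequency in pool: {case_pool_frequency:.3f}")
```

Averaging the per-aliquot ratios, rather than pooling the raw areas, keeps one unusually strong spectrum from dominating the estimate.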
We carried out tests of association between disease status and each SNP using pooled DNA in a fashion similar to that explained elsewhere (15). Sources of measurement variation included pool formation, PCR/mass extension, and chip measurement. When three or more replicate measurements of a SNP were available within a model level, the corresponding variance component was estimated from the data. Otherwise, the following historical laboratory averages were used: pool formation = 5.0 × 10⁻⁵, PCR/mass extension = 1.7 × 10⁻⁴, and chip measurement = 1.0 × 10⁻⁴. Tests of association using individual genotypes used a χ² test of heterogeneity based on allele and genotype frequencies. The DerSimonian-Laird random effects meta-analysis method (16) was used for the analysis of replication samples to test for the consistency of association while permitting allele frequencies to differ among samples. All tests of allele frequencies involving only replication samples are one-sided, confirming the effect observed in the discovery sample. We derived P values using the log odds of each contrast and their standard errors. Multiple approaches were explored in an effort to identify haplotypes demonstrating a stronger association with disease status than single sites. These included analyses of 15 SNP haplotypes and subsets thereof using the coalescent theory-based PHASE v2.0 (17) and the score method that relies on the expectation-maximization algorithm (18). No attempt was made to correct P values for multiple testing. Rather, P values are provided to compare the relative strength of association from multiple dependent (e.g., SNPs within samples) and independent (e.g., SNPs between samples) sources of information. P values <0.05 are referred to as statistically significant.
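The combination of replication samples can be illustrated with a short sketch of the DerSimonian-Laird estimator applied to per-sample log odds ratios. It assumes that each sample contributes a 2 × 2 table of allele counts, that the standard error is the usual Woolf (inverse-count) approximation, and that the one-sided P value comes from a normal approximation. The function names and allele counts are hypothetical and are not taken from the study's tables.

```python
import math
from statistics import NormalDist


def log_odds_ratio(case_risk, case_other, ctrl_risk, ctrl_other):
    """Log odds ratio and Woolf standard error from 2x2 allele counts."""
    log_or = math.log((case_risk * ctrl_other) / (case_other * ctrl_risk))
    se = math.sqrt(1 / case_risk + 1 / case_other + 1 / ctrl_risk + 1 / ctrl_other)
    return log_or, se


def dersimonian_laird(estimates):
    """Pool (log_or, se) pairs with DerSimonian-Laird random effects.

    Returns the pooled log odds ratio, its standard error, and the
    estimated between-sample variance tau^2.
    """
    y = [log_or for log_or, _ in estimates]
    w = [1.0 / se ** 2 for _, se in estimates]
    k = len(estimates)
    fixed = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
    # Cochran's Q measures heterogeneity around the fixed-effect mean.
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, y))
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (k - 1)) / c) if k > 1 and c > 0 else 0.0
    w_star = [1.0 / (se ** 2 + tau2) for _, se in estimates]
    pooled = sum(wi * yi for wi, yi in zip(w_star, y)) / sum(w_star)
    pooled_se = math.sqrt(1.0 / sum(w_star))
    return pooled, pooled_se, tau2


def one_sided_p(log_or, se):
    """P value for the one-sided alternative that the log odds ratio is > 0."""
    return 1.0 - NormalDist().cdf(log_or / se)


# Hypothetical allele counts for two replication samples (not study data).
studies = [log_odds_ratio(120, 80, 100, 100), log_odds_ratio(110, 90, 100, 100)]
pooled_log_or, pooled_se, tau2 = dersimonian_laird(studies)
print(f"pooled OR = {math.exp(pooled_log_or):.2f}, "
      f"one-sided P = {one_sided_p(pooled_log_or, pooled_se):.4f}")
```

When the samples agree closely, tau² is truncated at zero and the result reduces to the fixed-effect, inverse-variance estimate.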
RESULTS
We carried out a genome-wide association study using 25,494 SNPs located within 10 kb of 15,995 LocusLink annotated genes and a sample consisting of 254 breast cancer cases and 268 age-matched controls of German descent. We used a high-throughput approach using DNA pools, chip-based mass spectrometry (11, 12, 13, 19), and a three-step SNP selection strategy. In the first step, we did a single PCR and primer extension reaction for each SNP on one DNA pool consisting of cases and on one consisting of controls. Relative allele frequencies obtained from four mass spectrometry measurements of the extension products were compared between pools. In the second step, the 1,619 SNPs (∼5%) with the most statistically significant associations were remeasured in triplicate on each DNA pool. In the third step, the 74 most significant SNPs (∼5%) from step two were individually genotyped in each sample. Fifty-two SNPs were confirmed to have statistically significant differences between cases and controls (P < 0.05).
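The comparison of case and control pools in the first screening step can be sketched as a z test in which the variance of the frequency difference combines binomial sampling variance with the technical variance components quoted in the Methods. The nested averaging scheme, the function names, and the example frequencies below are assumptions made for illustration. The published model (15) may differ in its details.

```python
import math
from statistics import NormalDist

# Historical laboratory variance components quoted in the Methods.
VAR_POOL = 5.0e-5   # pool formation
VAR_PCR = 1.7e-4    # PCR/mass extension
VAR_CHIP = 1.0e-4   # chip measurement


def measurement_variance(n_pools=1, n_pcr=1, n_chip=4):
    """Technical variance of a mean pool frequency.

    Assumption: a nested design in which each pool has n_pcr reactions
    and each reaction has n_chip mass spectrometry measurements.
    """
    return (VAR_POOL / n_pools
            + VAR_PCR / (n_pools * n_pcr)
            + VAR_CHIP / (n_pools * n_pcr * n_chip))


def pooled_association_test(p_case, p_ctrl, n_case, n_ctrl, n_pcr=1, n_chip=4):
    """Two-sided z test comparing allele frequencies of two DNA pools."""
    # Binomial sampling variance: each individual contributes two alleles.
    sampling = (p_case * (1 - p_case) / (2 * n_case)
                + p_ctrl * (1 - p_ctrl) / (2 * n_ctrl))
    # Each pool is measured separately, so technical variance enters twice.
    technical = 2 * measurement_variance(n_pcr=n_pcr, n_chip=n_chip)
    z = (p_case - p_ctrl) / math.sqrt(sampling + technical)
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical pool frequencies (not study data) with the discovery sample sizes.
z, p_value = pooled_association_test(0.45, 0.38, n_case=254, n_ctrl=268)
print(f"z = {z:.2f}, P = {p_value:.4f}")
```

Under these assumptions, raising `n_pcr` to 3, as in the triplicate remeasurement of step two, shrinks the PCR and chip terms but leaves the pool-formation term unchanged.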
Genome-wide case-control studies that use tens of thousands of SNPs and liberal statistical selection criteria are expected to yield a high proportion of false-positive associations. To distinguish the true genetic effects from the false positives, the 52 SNPs were genotyped in two independent collections of breast cancer cases and controls from Germany and Australia. The most consistent and statistically significant association was observed with a SNP located on chromosome 19p13.2. The marker SNP rs1056538, a nonsynonymous variation (V301I) in exon 5 of intercellular adhesion molecule 5 (ICAM5), had a P value of 0.001 [odds ratio (OR) = 1.5] in the discovery sample and a P value of 0.03 (OR = 1.4) and a P value of 0.07 (OR = 1.3) in the German and Australian replication samples, respectively (Table 1). The analysis of all three samples resulted in a combined significance of P < 0.001 (OR = 1.4) and a significance of P = 0.01 (OR = 1.3) within the replication samples only.
We tested an additional 60 SNPs located within 50 kb of the initial marker using the discovery pools to fine map the region of association (Fig. 1). The region of highest significance extended approximately 20 kb, spanning the 3′-end of ICAM1 and the ICAM4 and ICAM5 genes. We selected 15 SNPs spaced throughout this region for genotyping in all three samples (Fig. 2). The SNPs most strongly associated in the discovery (Fig. 2A) and combined samples (Fig. 2B) were two nonsynonymous SNPs in ICAM5, the original marker rs1056538 (V301I) and rs2228615 (A348T), which were in near complete linkage disequilibrium (Fig. 2E). A nonsynonymous SNP in ICAM1, rs5030382 (K469E), was significantly associated in the discovery sample but not in the replication samples. Analyses of haplotypes consisting of subsets of the 15 genotyped SNPs did not reveal any haplotype with stronger association than individual SNPs (data not shown).
Because there are several common features between breast and prostate cancer (20), we also tested this locus for association with prostate cancer susceptibility in an independent collection of German cases and controls. Thirteen of the 15 SNPs genotyped in the breast cancer sample were genotyped in the prostate cancer sample. The resulting association pattern was similar to that observed for breast cancer. The nonsynonymous SNPs, rs1056538 and rs2228615, in ICAM5 were most strongly associated (OR = 1.4, P = 0.002; Fig. 2C).
The discovery collection of German breast cancer patients included information related to family history, age of onset, and severity of disease. The genotyped SNPs were tested for association with these variables. The SNPs rs1056538 and rs2228615 were most strongly associated with a positive family history of breast cancer (P = 0.0065, Fisher’s exact test). Fifteen percent of those homozygous for the susceptible allele (C) had a positive family history, compared with 9% of the heterozygotes and none of those homozygous for the protective allele. Comparing allele frequencies between cases with a positive family history and controls results in a substantially larger estimated effect size than in all cases (OR = 3.4, P = 0.001). There was no association between rs1056538 and age of diagnosis. Also noteworthy, the SNP most strongly associated with breast cancer status in the replication samples, rs281439 located 542 bp 5′ of ICAM5 (Table 1), was also associated with indicators of cancer severity (Table 2). Patients carrying the allele associated with breast cancer susceptibility (G) had a significantly shorter time span between diagnosis and recruitment, suggesting shorter survival time, and higher rates of metastases to other organs. These results suggest that one or more variants in this region are risk factors for breast and prostate cancer and may influence disease progression and prognosis.
Using semiquantitative RT-PCR, we found expression of ICAM1 in a variety of normal tissues, including breast and prostate, and in several breast tumor samples tested (data not shown). As described previously, ICAM5 showed strongest expression in brain and, along with ICAM4, was expressed at very low levels in various other tissues, including breast and prostate, and in a number of breast tumor tissues and cell lines.
DISCUSSION
In an association study using SNPs in nearly 16,000 genes, we obtained evidence that a region on chromosome 19p13.2 containing the genes ICAM1, ICAM4, and ICAM5 influences breast and prostate cancer risk. The identification of a susceptibility region influencing both breast and prostate cancer is particularly interesting because breast and prostate cancer have many common features, such as hormone-sensitivity, parallel incidence rates in various countries, and common genetic alterations (20). The role of intercellular adhesion molecules in cancer has been described (21, 22). ICAM1 is constitutively expressed in cells involved in immune response and induced on other cells, including endothelial and epithelial cells. It has a well-known role in inflammation-related processes and immune surveillance, and derangements of its expression have been implicated in the development of a variety of inflammatory diseases (23, 24, 25) as well as in tumor progression of several cancers, including breast cancer (26, 27, 28). ICAM1 is also involved in transmembrane signal transduction after binding to β2 integrin ligands and multimer formation (29), activating the mitogen-activated protein kinase pathway (30) and eventually transcription factors like AP-1 that regulate cell proliferation events (31). The inhibition of AP-1 has been shown to inhibit breast cancer cell growth (32). Relatively little is known about the roles of ICAM4 and ICAM5 in cell signaling events and tumor surveillance, but their involvement in similar pathways is likely. ICAM4 has been reported to be exclusively expressed in erythrocytes and has a suggested role in cell interaction events, including hemostasis and thrombosis (33). ICAM5 is mainly expressed in specific areas of the brain (34) and has been implicated in dendritic outgrowth (35) and rapid cell spreading of microglia (36).
Although the reported expression patterns and described functions of ICAM4 and ICAM5 are not indicative of a role in breast and prostate cancer susceptibility, their roles in cell adhesion and cell signaling together with their low level expression in cancer-relevant tissues leave the possibility that their dysregulation or dysfunction may increase cancer risk.
Our findings are in agreement with previous reports on the involvement of ICAM1 in tumor progression, and the K469E variant is a potential candidate to influence this process. However, the data also suggest an influence of other variants in this region on breast and prostate cancer susceptibility. Although the genetic evidence favors ICAM5, higher relative expression levels of ICAM1 make it a more favorable candidate for predisposition and potentially tumor progression. However, current data cannot exclude any of the three ICAM genes as biologically responsible for the observed association.
The route by which these genetic associations were arrived at and the potential for spurious association must be considered. Recently published work has highlighted the need for proper validation of genetic findings for complex traits (37, 38, 39). In the current study, the initial association found between the ICAM5 marker and breast cancer status was one result from over 25,000 hypothesis tests. A conservative Bonferroni adjustment to yield an experiment-wide type I error rate of 0.05 would demand a test-wise P value on the order of 10⁻⁶. Given the modest sample size, only common variations with relatively large effects (OR > 2) would reach such significance levels. Instead, we chose to be more mindful of the role of type II error rates and apply a more liberal set of criteria in the initial phases of the study and verify true genetic effects by independent replication. The analysis of 52 selected markers in the German and Australian replication samples resulted in multiple associations of continuing interest, with the ICAM5 marker providing the most consistent and statistically significant association (P = 0.01, OR = 1.3). This would not be considered significant on an experiment-wide level after Bonferroni adjustment, which would require a test-wise P value on the order of 0.001. Indeed, if the true size of the genetic effect is 1.4, then the total replication sample size of 368 cases and 330 controls has <50% power to reject the false null hypothesis at this level.
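The thresholds and the power statement above can be explored with a small calculation. The sketch below computes Bonferroni test-wise thresholds and a normal-approximation power for a one-sided allele frequency test. The control allele frequency of 0.3 is a hypothetical value chosen for illustration, not a figure from the study, and the power approximation is a simplification rather than the authors' method.

```python
import math
from statistics import NormalDist


def bonferroni_threshold(alpha, n_tests):
    """Test-wise P value needed for an experiment-wide error rate alpha."""
    return alpha / n_tests


def allele_test_power(odds_ratio, control_freq, n_cases, n_controls, alpha):
    """Approximate power of a one-sided two-sample allele frequency test.

    Assumptions: normal approximation, two alleles per individual, and a
    true allelic odds ratio acting on the given control allele frequency.
    """
    case_freq = (odds_ratio * control_freq
                 / (1 - control_freq + odds_ratio * control_freq))
    se = math.sqrt(case_freq * (1 - case_freq) / (2 * n_cases)
                   + control_freq * (1 - control_freq) / (2 * n_controls))
    z_alpha = NormalDist().inv_cdf(1 - alpha)
    return NormalDist().cdf((case_freq - control_freq) / se - z_alpha)


# 25,494 SNPs in the discovery screen; 52 SNPs carried into replication.
genome_wide = bonferroni_threshold(0.05, 25494)   # on the order of 1e-6
replication = bonferroni_threshold(0.05, 52)      # on the order of 1e-3

# Hypothetical control allele frequency of 0.3 (an assumption, not study data).
power = allele_test_power(1.4, 0.3, n_cases=368, n_controls=330, alpha=replication)
print(f"genome-wide threshold = {genome_wide:.2e}")
print(f"replication threshold = {replication:.2e}")
print(f"approximate power = {power:.2f}")
```

The power estimate depends strongly on the assumed allele frequency, so it is worth rerunning the calculation over a range of plausible values.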
It should be noted that we observed two additional lines of evidence supporting the validity of the association of this region with breast cancer susceptibility. First, the risk allele for coding SNP rs1056538 is significantly more common in cases with a family history of breast cancer. Second, the independent identification of the same variation with remarkably similar effects observed in a larger prostate cancer study gives us added reason to believe that one or more variations in these ICAM genes are influencing breast and prostate cancer susceptibility in populations of European origin. These are characteristics that have also been observed for the germ line variations in CHEK2 (40, 41).
The successful identification of a small gene region associated with breast and prostate cancer susceptibility through large-scale association supports the promise of more comprehensive genome-wide studies. The approach presented here using DNA pools and genome-wide gene-based SNPs is very useful to quickly discover susceptibility genes for various cancer types and other complex disorders. However, given the complexity of the genetic architecture underlying complex trait variation, genetic analyses alone are not likely to unambiguously identify the genes or genetic variations responsible. Further validation in independent samples must be carried out to support genetic evidence, and functional experiments must be conducted to identify which variations in which genetic and environmental contexts are associated with disease susceptibility and prognosis. The observed association of genetic variants with disease severity and progression has promising implications for patient management. Because ICAMs are cell surface molecules and therefore obvious targets for antibodies and small molecules, these findings also offer new opportunities for therapeutic intervention.
We would like to thank Richard Kolodner for critically reviewing the report and making helpful comments. We would also like to thank all patients participating in this study and the members of Sequenom’s high throughput center team for generating the data.
Requests for reprints: Andreas Braun, Sequenom Inc., 3595 John Hopkins Court, San Diego, CA 92121. Phone: 858-202-9018, Fax: 858-202-9020; E-mail:
- Received May 20, 2004.
- Revision received September 1, 2004.
- Accepted October 3, 2004.
- ©2004 American Association for Cancer Research.