Although “response” has been an attractive term for oncologists and patients, oncologists really want to know which therapy to start for a given patient and when to discontinue that therapy in favor of an alternative. In efficacy trials, cancer therapeutics have conventionally been assessed by endpoints that are based on the categorical Response Evaluation Criteria In Solid Tumors (RECIST) system. In this article, we make the case for a new paradigm in which therapeutics are assessed on a continuous scale by evidence of efficacy, using a variety of quantitative tools that take advantage of technologic innovations and increasing understanding of cancer biology. The new paradigm relies on randomized comparisons between investigational arms and control arms, as historical controls are unavailable or unreliable for these quantitative measures. We discuss multiple limitations of RECIST, including its overemphasis on tumor regression, concerns about the accuracy of tumor measurements and the validity of comparisons with historical controls, and its inadequacy in disease settings in which tumor measurements on cross-sectional imaging are difficult or uninformative. We discuss how the new paradigm overcomes these limitations and provides a framework for answering the key questions of the oncologist and improving patient outcomes. Cancer Res; 72(20); 5145–9. ©2012 AACR.
The art of oncology involves a balance between providing hope and conveying realistic expectations for cancer patients. The term “response” evolved in this context, an optimistic and reassuring term that served as a goal for both oncologists and patients. For solid tumors, response has historically been defined as a substantial reduction in the size of measurable lesions. In an era before modern imaging, Moertel and Hanley conducted a study in which 16 oncologists were asked to measure the size of 12 solid spheres “on a soft mattress and covered with a layer of foam rubber (1).” They observed a high degree of interobserver and intraobserver variability in measurements and recommended that an objective response be defined as “a reduction in the product of the longest perpendicular diameters of the most clearly measurable tumor mass by at least 50% … at least 2 months after the onset of therapy.” This recommendation served as the basis for the first standardized response criteria (the World Health Organization criteria; ref. 2) in which patients were grouped into categories of “complete response,” “partial response,” “no change,” or “progressive disease.” Although the first and second versions of the Response Evaluation Criteria In Solid Tumors (RECIST; refs. 3, 4) updated these criteria to reflect modern imaging and use unidimensional measurements, the underlying concept has remained unchanged for 30 years.
Nonsurvival endpoints in efficacy trials are conventionally based on the categorical RECIST system. In phase II trials, the Investigational Drug Steering Committee of the National Cancer Institute generally recommends the use of randomized designs and a progression-free survival (PFS) endpoint, while acknowledging that single-arm designs and a response rate endpoint may be appropriate in certain settings (5). The response rate is the proportion of patients with a complete or partial response as defined by RECIST (≥30% decrease in sum of longest diameters of target lesions and no new lesions). PFS is a time-to-event endpoint in which progression is defined by RECIST (≥20% increase in sum of longest diameters of target lesions or any new lesions). Recently, many trials have used disease control rate or clinical benefit as endpoints, both of which refer to the proportion of patients with either stable disease or a response (at a specified time point) as defined by RECIST. Although these terms imply drug effect, such endpoints include patients who may not actually be benefiting but simply have relatively indolent disease. Furthermore, these RECIST-based endpoints do not consistently correlate with survival or improved quality of life, the outcomes of greatest importance to patients (and regulators).
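To make the categorical thresholds concrete, the following is a minimal sketch of a RECIST-style call based on target lesions only. It is a simplification rather than the full criteria: the function name and arguments are hypothetical, the cutoffs are taken as at least a 30% decrease and at least a 20% increase, and requirements such as nontarget lesion assessment, confirmation scans, and use of the nadir sum as the reference for progression are omitted.

```python
def recist_category(reference_sum_mm, current_sum_mm, new_lesions=False):
    """Simplified categorical call from the sum of longest diameters (mm).

    Assumption: a single reference sum is used for both response and
    progression; the published criteria are more detailed than this.
    """
    if new_lesions:
        return "progressive disease"
    if current_sum_mm == 0:
        return "complete response"
    change = (current_sum_mm - reference_sum_mm) / reference_sum_mm
    if change >= 0.20:
        return "progressive disease"
    if change <= -0.30:
        return "partial response"
    return "stable disease"


# Hypothetical patients with nearly identical tumor shrinkage.
patient_a = recist_category(100.0, 71.0)  # 29% decrease
patient_b = recist_category(100.0, 69.0)  # 31% decrease
```

In this sketch, two patients whose tumors shrink by 29% and 31% fall into different categories, although the difference between them is within the range of measurement variability.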
From a practical point of view, the oncologist wants to know which therapy to start for a given patient and when to discontinue that therapy in favor of an alternative. Rather than argue about where the lines should be drawn to define response and progression, we contend that there should be no lines at all, but that efficacy should be viewed on a continuous (rather than categorical) scale. Cancer therapeutics should be evaluated by evidence of efficacy, which is best established by randomized comparisons between investigational arms and control arms.
The new paradigm
The volume of potential therapies/combinations and emerging data on interindividual and intraindividual differences in treatment effects present challenges that cannot be addressed by a categorical, response-based system. To address these challenges, we propose a new paradigm of using evidence of efficacy to evaluate new therapies. An intentionally vague term, evidence of efficacy would be established by one or more quantitative tools that take advantage of technologic innovations and increasing understanding of cancer biology. In this section, we highlight a number of reasons why this new paradigm would increase the efficiency of drug development.
In the past 2 decades, we have witnessed an explosion in the availability of diagnostics that divide cancers into molecular subtypes (6). Some of these diagnostics have led to predictive biomarkers for therapeutics, such as vemurafenib for melanoma harboring the V600E BRAF mutation (7). Molecular diagnostics have shown that clinically relevant differences exist in histologically identical tumors between patients, between tumors in the same patient, in the same tumor over time, and even in distinct clonal populations from the same tumor at the same time (8). When developing cancer therapeutics, it is reasonable to expect that different measures of efficacy may be needed for different molecular subtypes, due to differences in tumor biology and in the prognoses with available therapies. The new paradigm would allow for the flexibility needed to study subpopulations of patients with different efficacy measures.
Better resolution on cross-sectional imaging has improved our capacity to measure tumors precisely, making it possible to use change in tumor size as a measure of efficacy in patients with measurable disease. Lavin first pointed out that the use of continuous tumor size data to compare randomized groups of patients is more statistically efficient than the use of categorical assessments in nonrandomized patients (9). Karrison and colleagues took this concept one step further by showing how the ratio of tumor size at the first on-treatment computed tomography (CT) scan compared with baseline on a logarithmic scale (log ratio) could be used as the primary endpoint for a randomized phase II trial of erlotinib and sorafenib in non–small cell lung cancer (NSCLC) with a feasible sample size (10). We simulated phase II trials using data from a positive phase III trial of sorafenib in renal cancer and found that a randomized design with log ratio at 6 weeks as the endpoint has greater power than more conventional designs with response rate by RECIST or PFS endpoints to predict the known phase III results (11). Although all of the aforementioned studies used longest diameter tumor measurements, Zhao and colleagues showed that semiautomated volumetric measurements are better than unidimensional measurements at an early time point for distinguishing NSCLC with and without sensitizing mutations to gefitinib (12). The establishment of centralized image banks will make broad implementation of semiautomated volumetric measurements of tumors a feasible option for use in future trials. The new paradigm would use tumor size as a continuous variable, taking advantage of the technologic improvements in imaging-based tumor measurements.
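As an illustration of a continuous tumor size endpoint, the sketch below computes the log ratio of tumor size at the first on-treatment scan to baseline for each patient and compares two randomized arms with a Welch t statistic. The data are invented for illustration, the function names are hypothetical, and the choice of a Welch t statistic is an assumption, not the analysis used in the cited trials.

```python
import math
from statistics import mean, variance


def log_ratios(baseline_mm, followup_mm):
    """Per-patient log(follow-up size / baseline size)."""
    return [math.log(f / b) for b, f in zip(baseline_mm, followup_mm)]


def welch_t(x, y):
    """Welch t statistic for the difference in means of x and y."""
    se = math.sqrt(variance(x) / len(x) + variance(y) / len(y))
    return (mean(x) - mean(y)) / se


# Hypothetical sums of longest diameters (mm); not data from any trial.
control = log_ratios([50, 62, 40, 75, 58], [55, 66, 46, 80, 61])
treated = log_ratios([52, 60, 45, 70, 55], [47, 57, 40, 68, 50])

t_stat = welch_t(treated, control)  # negative if treated tumors grew less
```

No treated patient in this invented data set would meet a 30% shrinkage threshold, yet the arms differ clearly on the continuous scale, which is the situation the log ratio endpoint is meant to capture.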
Technologic innovations and outcomes research have also given rise to a number of efficacy measures that might complement or supersede tumor size. Positron emission tomography (PET), dynamic contrast-enhanced MRI, or other functional imaging studies may detect drug effects and help to distinguish viable tumor and necrotic tissue. Serum biomarkers, such as circulating tumor cells (CTC), prostate-specific antigen (PSA), or cancer antigen 125 (CA-125), may also detect drug effects and may be especially useful in patients with nonmeasurable disease on cross-sectional imaging. For example, Scher and colleagues presented results of an effort to use enumeration of CTC as an efficacy–response biomarker in castration-resistant prostate cancer (CRPC) and as part of a biomarker panel that is prognostic for survival (13). Tissue biomarkers, such as genomic or proteomic assays with minimal measurement error, could be used in a similar way. Patient-reported symptoms or quality of life, measured with previously validated scales, can show that patients are deriving meaningful benefits from therapy.
Advanced computational methods have made it possible to use quantitative data collected at various time points before and during therapy (i.e., longitudinal data) to develop mathematical models of disease progression in a population of patients. In an example using tumor size, Wang and colleagues developed a nonlinear mixed effects model of tumor growth using longitudinal data from 3,398 patients on 4 registry trials in NSCLC. Their model was able to quantitatively measure the effects of various drugs and drug combinations on the growth of NSCLC, as well as predict overall survival on the basis of an early change in tumor size (14). Similarly, Stein and colleagues fit longitudinal PSA data from individual patients with CRPC on 5 trials to an equation that involves an exponential decrease and simultaneous exponential increase in PSA. They showed that the growth rate constant correlates with survival and concluded that this constant may be a novel endpoint for efficacy trials (15). These types of models, with tumor size or any of the other quantitative efficacy measures discussed above, can be used to compare drug effects between a control arm and an investigational arm on a randomized trial. The new paradigm would allow for these advanced computational methods to leverage data from completed trials without the use of previously treated patients as historical controls.
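A minimal sketch of a model with simultaneous exponential decrease and exponential increase is shown below. The functional form, f(t) = exp(-d*t) + exp(g*t) - 1 for a marker normalized to its baseline value, is one form consistent with the description above and is assumed here for illustration; the parameter values are hypothetical.

```python
import math


def marker_ratio(t, d, g):
    """Marker level relative to baseline at time t.

    d: decay rate constant of the treatment-sensitive fraction (assumed).
    g: growth rate constant of the treatment-resistant fraction (assumed).
    """
    return math.exp(-d * t) + math.exp(g * t) - 1.0


def time_of_nadir(d, g):
    """Time at which the modeled marker reaches its minimum."""
    if d <= g:
        return 0.0  # growth dominates from the start; no decline
    return math.log(d / g) / (d + g)


# Hypothetical rate constants (per day).
d, g = 0.05, 0.005
t_nadir = time_of_nadir(d, g)
nadir_level = marker_ratio(t_nadir, d, g)
```

Under this form, the growth rate constant g governs the late rise in the marker, which is why it can be estimated separately from the initial decline and compared between arms.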
Regardless of the endpoint(s) selected, a few guiding principles are required for successful implementation of the new paradigm. First, the durability of drug effect is an essential component of the evidence of efficacy. A drug that has a small effect with long duration might be a better initial therapy than one that has a larger effect with shorter duration. For example, everolimus for advanced pancreatic neuroendocrine tumors does not cause substantial tumor regression in many patients (RECIST response rate of 5% vs. 2% with placebo) but inhibits tumor growth for an extended period of time in many patients (PFS rate of 34% at 18 months vs. 9% with placebo; ref. 16). On the other hand, chemotherapy (5-fluorouracil, cisplatin, and streptozocin) for advanced neuroendocrine tumors causes substantial tumor regression in many patients (RECIST response rate of 33%), but these effects are of relatively short duration (median, 36 weeks; ref. 17). Second, because these endpoints rely on technologic innovations that make historical data unavailable or unreliable, randomization and blinding between an investigational arm and a control arm are essential for declaring that a novel therapy has evidence of efficacy compared with alternatives. Furthermore, randomized allocation of treatment and objective, quantitative measurement by individuals blinded to the assigned treatment can overcome biases introduced by recognition of drug-specific toxicities. Figure 1 shows how the new paradigm offers multiple approaches for answering the key questions of the treating oncologist. In a randomized phase II trial with a control arm, comparisons based on one or more of the proposed endpoints would help sponsors decide whether or not to conduct a phase III trial. Investigators would prospectively define a clinically meaningful benefit based on these endpoints, considering the disease and patient population.
RECIST falls short of the new paradigm
In recent years, most novel therapies have been developed based on activity against a molecular target or pathway that is a putative driver of the malignant phenotype. As Ratain and Eckhardt have pointed out, such drugs may be active without meeting criteria for a RECIST response (18). As of December 2011, we are aware of at least 4 drugs in 5 indications with evidence of efficacy as monotherapy leading to U.S. Food and Drug Administration approval despite RECIST response rates of 10% or less in phase III trials: sorafenib and everolimus in renal cancer, panitumumab in colorectal cancer, and sunitinib and everolimus in pancreatic neuroendocrine cancer (16, 19–22). Given these examples and the multitude of drugs in development that might also cause growth inhibition or slight tumor regression, the wisdom of continuing to use RECIST response rate as an endpoint for clinical trials is questionable. The increasing use of “waterfall plots” to depict the percentage change in tumor size for individual patients reflects the understanding that RECIST response rates do not tell the whole story, but it remains impossible to make valid conclusions from these data in the absence of a comparator arm. In contrast to RECIST, the new paradigm treats tumor size as a continuous variable, so a drug that causes a mean 0% change in tumor size at a certain time point would have evidence of efficacy relative to a control arm with a mean 10% increase in tumor size.
Concerns have been raised about the accuracy of tumor measurements and the validity of conclusions in trials using RECIST-based endpoints. Oxnard and colleagues conducted a study in patients with NSCLC showing high intermeasurement variability in the size of lung target lesions on CT scans conducted within 15 minutes of each other and concluded that increases or decreases of less than 10% might be the result of this variability alone (23). Given that sharp contrasts between lung tumors and the surrounding air-filled parenchyma allow for easier identification of tumor boundaries, this variability might be even higher for tumors in other visceral organs. The use of categorical response criteria magnifies the impact of measurement variability, whereas comparisons of continuous data minimize this impact. Tang and colleagues showed that, in single-arm phase II trials with RECIST response rate as the primary endpoint, small errors in historical response rate assumptions can result in a 2- to 4-fold increase in the rate of phase II trials that are falsely interpreted as positive (24). Randomization could minimize erroneous conclusions based on RECIST-based endpoints, but the power of such trials would lag behind the power of trials that use comparisons of continuous data.
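The sensitivity of single-arm designs to the assumed historical response rate can be illustrated with an exact binomial calculation. The sketch below is not the method of the cited study: the sample size, the assumed and actual historical rates, and the function names are all hypothetical, and the design is a simple one-stage test.

```python
from math import comb


def upper_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def critical_count(n, p0, alpha=0.05):
    """Smallest responder count k with P(X >= k | p0) <= alpha."""
    for k in range(n + 2):
        if upper_tail(n, k, p0) <= alpha:
            return k


n = 40                # patients in the single-arm trial (hypothetical)
assumed_rate = 0.10   # historical response rate used in the design
actual_rate = 0.15    # response rate an inactive drug would truly achieve

k = critical_count(n, assumed_rate)
nominal = upper_tail(n, k, assumed_rate)   # intended false-positive rate
inflated = upper_tail(n, k, actual_rate)   # false-positive rate in reality
```

Because the trial is declared positive whenever at least k responses are seen, an inactive drug tested in a population whose true background response rate exceeds the assumed rate will cross that bar more often than the nominal error rate suggests. A concurrent randomized control arm removes the need for the historical assumption.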
The “one size fits all” philosophy of RECIST limits its applicability to trials in a number of disease settings. Cancer is a biologically heterogeneous group of diseases, and the idea that a single set of criteria can assess the response to therapy for different cancers is analogous to the suggestion that the same could be done for distinct diseases in other medical specialties (e.g., rheumatoid arthritis and systemic lupus erythematosus). The response to therapy in Hodgkin lymphoma and diffuse large B-cell lymphoma is increasingly assessed by PET/CT rather than conventional CT because of increased sensitivity at early time points (25). Similarly, gastrointestinal stromal tumors may respond to therapy without RECIST responses, resulting in a new set of CT response criteria for this disease that correlate better with meaningful clinical outcomes (26, 27). In ovarian cancer, the challenge of measuring tumor deposits in the peritoneum has led to alternative approaches using CA-125 levels to assess response (28). The absence of reliable measurable disease is also true of CRPC with predominantly bony and lymph node metastases, leading to expert recommendations on biochemical (PSA) and other non-RECIST endpoints for clinical trials (29). The Response Assessment in Neuro-Oncology Working Group was recently formed to overcome the limitations of RECIST and the closely related Macdonald criteria for assessing responses in primary brain tumors and brain metastases and has published new criteria for high-grade gliomas that more precisely define measurable disease and incorporate contrast enhancement, treatment with corticosteroids, and clinical status as part of the assessment (30). Whereas RECIST-based endpoints have little or no value for clinical trials in each of these tumor types, the new paradigm would encourage measures of efficacy that reflect the biology of each disease and that can be measured accurately with modern technology.
Randomization and blinding between an investigational arm and a control arm would preclude the need to validate each endpoint prospectively in each disease setting, as they would minimize the risk of making false conclusions (i.e., type I and type II errors).
Although RECIST served a historical purpose and our colleagues who have worked tirelessly to develop and refine it should be applauded, it falls short of the ideal as an assessment of efficacy for cancer therapeutics in development. Rather than suggest a modification to RECIST or an alternative set of criteria that can be applied broadly to solid tumors, we have argued for a new paradigm in which the tools to establish evidence of efficacy are chosen to reflect the biology of the disease in the population being studied. This new paradigm takes advantage of the state of the science in oncology, including knowledge about the biologic heterogeneity of cancer, improved precision in cross-sectional imaging, the availability of functional imaging studies, biomarkers that have emerged from technologic innovations, and advanced computational methods for modeling longitudinal data. Randomization and blinding between investigational arms and control arms are essential components of this new paradigm, allowing the use of endpoints that are not previously validated. Adoption of this new paradigm will more efficiently answer questions about which therapy to start and when to discontinue therapy and consider alternatives to maximize outcomes for our patients.
Disclosure of Potential Conflicts of Interest
M.L. Maitland has received confidential data and reimbursement for travel expenses from GlaxoSmithKline for related research. No potential conflicts of interest were disclosed by the other authors.
Authors' Contributions
Conception and design: M.R. Sharma, M.L. Maitland, M.J. Ratain
Development of methodology: M.R. Sharma
Acquisition of data (provided animals, acquired and managed patients, provided facilities, etc.): M.R. Sharma
Analysis and interpretation of data (e.g., statistical analysis, biostatistics, computational analysis): M.R. Sharma
Writing, review, and/or revision of the manuscript: M.R. Sharma, M.L. Maitland, M.J. Ratain
Grant Support
The work was supported by NIH grant T32GM007019 and a Conquer Cancer Foundation Translational Research Professorship award to M.J. Ratain; National Cancer Institute Mentored Career Development Award K23CA124802 to M.L. Maitland.
- Received January 6, 2012.
- Revision received April 9, 2012.
- Accepted April 11, 2012.
- ©2012 American Association for Cancer Research.