Because assessing the efficacy of a cancer therapeutic is an integral part of its path to regulatory approval, we review the history that led to our current assessment method, Response Evaluation Criteria in Solid Tumors (RECIST). We describe the efforts of Moertel and Hanley to standardize response assessments in lymphoid malignancies and how their work was adapted in the World Health Organization (WHO) criteria. Two decades later, RECIST was advanced to streamline the WHO criteria and improve their reproducibility. We describe how the thresholds established by Moertel and Hanley to provide accuracy and reproducibility evolved to become measures of efficacy, and why they have been valuable. While we recognize RECIST is far from perfect—in need of modification as a measure of efficacy for some agents and in some diseases—for the majority of solid tumors, it is very valuable. We argue that over time, the efficacy thresholds established by WHO and then RECIST have proved their worth, and we summarize 10 years of U.S. Food and Drug Administration (FDA) approvals in solid tumors to support our position that current RECIST thresholds should be retained. Cancer Res; 72(20); 5151–7. ©2012 AACR.
In recent decades, the practice of cancer medicine and the technology of experimental cancer therapy have reached progressively higher levels of scientific sophistication.… There is, however, one essential element of this experimentation which is perhaps too frequently forgotten amidst such technical sophistry, namely, the actual measurement of the study end point. The culmination of most experimental therapeutic trials for solid tumors occurs when a man places a ruler or caliper over a lump and attempts to estimate its size. With this is introduced the inevitable factor of human error. Although the ultimate aim of therapy is increased survival, only few of our current approaches achieve that goal. To search for antitumor activity in a new modality by using survival as an endpoint is a far too complex and time-consuming effort and one that is frequently confused by the multiple therapies that may be attempted in any single patient (1).
The quotation above is from a classic article by Moertel and Hanley that appeared in the journal Cancer in 1976. It is a must read for anyone interested in the history of clinical trials in oncology and provides the background for how Response Evaluation Criteria in Solid Tumors (RECIST) came to be. Although much (all) of the paragraph above could be written today, it was in this Cancer submission more than 36 years ago that the "modern era" of drug assessment began. Those were exciting times. Nearly a decade earlier in Cancer Research, Moxley and colleagues had first reported on "Intensive combination chemotherapy and X-irradiation in Hodgkin's disease" (2). Not long after the Cancer Research submission, the use of combination chemotherapy was put on solid footing when De Vita and colleagues at the National Cancer Institute reported on the use of MOPP [Mustargen (mechlorethamine; Lundbeck) + Oncovin (vincristine; Eli Lilly) + procarbazine + prednisone] chemotherapy in advanced Hodgkin disease (3). But as therapies and regimens began to proliferate, it became apparent that one investigator's assessment of response (benefit) might differ greatly from that of a colleague, and hence a "common language" was needed. Thus it was that 16 experienced oncologists treating lymphoma (all men!) gathered to decide what would be considered a reliable measure of response to a therapy (1). Their task that day was simple—measure 12 solid spheres representing "simulated tumor masses" ranging in size from 1.8 to 14.5 cm in diameter using "usual clinical methods," which in that era meant calipers or rulers. Their goal was to establish the amount of shrinkage that could not be ascribed to operator error and that would not be found if a placebo was administered. Importantly, the goal was not to decide what would be clinically meaningful or important, but rather what could be measured reliably.
The "masses" were arranged randomly on a soft mattress and covered with a layer of foam rubber measuring either 0.5 or 1.5 inches in thickness. The participants were unaware that 2 pairs of these masses were identical in size, allowing Moertel and Hanley to gauge how internally reproducible each investigator was. The masses were measured in 2 perpendicular dimensions, and the product of the perpendicular diameters was calculated. The results were as follows: when the same investigator measured 2 identical-size spheres, he incorrectly concluded that the products of the perpendicular diameters differed by 50% in only 6.8% of comparisons. When 2 investigators measured the same sphere, Moertel and Hanley found that their calculated products of the perpendicular diameters differed by 50% in only 7.8% of comparisons (1). Thus, in only 6.8% to 7.8% of instances did the measurements of 2 identical masses differ by more than 50%. In comparison, differences of 25% in the product of the perpendicular diameters were found in an unacceptably high 19% and 24.9% of comparisons, respectively. Thus, Moertel and Hanley concluded that "In the clinical setting, it is recommended that the 50% reduction criterion be employed and that the investigator should anticipate an objective response rate of 5% to 10% due to human error in tumor measurement."
Within 5 years, the 50% difference in the product of the perpendicular diameters was endorsed when Miller and colleagues reported the recommendations of a World Health Organization (WHO) initiative to develop standardized approaches for the "reporting of response, recurrence and disease-free interval" (4). Propagating the practice begun by Moertel and Hanley, the WHO criteria recommended that tumors be "measured in two dimensions by ruler or caliper with surface area determined by multiplying the longest diameter by the greatest perpendicular diameter." Disappearance of all known disease was scored as a complete response (CR), whereas a designation of partial response (PR) was allowed if, in either a single mass or multiple masses, there was a "50% decrease in the sum of the products of the perpendicular diameters of the multiple lesions." Both required confirmation by "2 observations not less than 4 weeks apart." Note here that describing this as a "50% or more decrease in total tumor load" is not quite accurate: the sum of the products of the perpendicular diameters, although "shorthand" for volume, does not in fact measure volume or "tumor load." But the important point is that the 50% threshold Moertel and Hanley had chosen as an operationally optimal value had evolved to become the threshold for efficacy. And in turn, it was further endorsed in 2000 when the now widely used RECIST was proposed as a replacement for the WHO criteria (5, 6). These authors recognized the "arbitrary origins" of the 50% WHO criterion when they said that "the definition of a partial response, in particular, is an arbitrary convention—there is no inherent meaning for an individual patient of a 50% decrease in overall tumor load." But despite this, they chose in RECIST a nearly identical threshold for response—a 30% reduction in one dimension. Instead of calculating the product of the perpendicular diameters, RECIST measures only one dimension—the longest diameter.
If one thinks of a sphere, a 30% reduction from 1 to 0.7 in the longest diameter yields a product of the perpendicular diameters of 0.49 (0.7 × 0.7 = 0.49)—a value that is essentially indistinguishable from the 0.5 proposed by Moertel and Hanley and adopted by WHO.
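The arithmetic equating the two thresholds can be checked in a few lines. A minimal illustrative sketch (the function name is ours, and a perfectly spherical lesion is assumed):

```python
# Relate the RECIST one-dimensional threshold to the WHO bidimensional one,
# assuming a spherical lesion so both perpendicular diameters shrink equally.

def who_product_fraction(diameter_fraction):
    """Remaining fraction of the WHO product of perpendicular diameters
    when each diameter shrinks to `diameter_fraction` of baseline."""
    return diameter_fraction ** 2

# RECIST partial response: longest diameter down 30%, i.e., at 0.70 of baseline.
remaining = who_product_fraction(0.70)
print(f"Product of perpendicular diameters at {remaining:.2f} of baseline")  # 0.49
```

A 30% one-dimensional reduction thus lands almost exactly on the 50% bidimensional reduction of WHO, which is why the two criteria select essentially the same responders.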
With this as background, one can understand how, some might argue, we took an operational definition and adapted it as a measure of efficacy—without supporting data—and that we must establish measures of efficacy that are based on data. But as we will argue, while it may have been fortuitous, Moertel and Hanley, the authors of WHO, and those who developed RECIST got it right.
Desist Using RECIST? The Issues
But then why do some investigators, such as Benjamin and colleagues (7), advocate discarding RECIST? Some complain that in phase I/II trials, the 30% arbitrary cutoff point to score a response leads us to “miss” some active agents. They question why we should deem an agent ineffective simply because it cannot shrink tumor size by 30% in one dimension. Others object that in phase III trials, the RECIST definition of progression (20% above baseline or above nadir) underestimates benefit by shortening progression-free survival (PFS). Are these valid arguments, or will we be discarding a measure of efficacy validated time and again and replacing it with an alternative that will be, in all likelihood, equally or more flawed? We will look at these individually, but first let us be clear about the issues. In a phase I/II trial, a drug might be exceptionally active in one patient but never score a PR in another. Such a drug might have value in a very specific patient population but in an unselected group of patients would be considered by most to be ineffective. This is not an argument about the validity of RECIST, but rather an argument about whether drugs should be developed in selected cohorts or in the general population. Similarly, in a phase III study, “relaxing the RECIST criteria” could result in longer PFS values, allowing one to argue that a drug “helps patients live longer without suffering disease progression.” The merits or lack thereof of such an adjustment can be discussed, but again this is not an argument about the validity of RECIST, but rather an argument about what should be considered disease progression.
Let us also accept that RECIST is not perfect. There are specific cases—diseases, even therapies—in which RECIST underperforms. Not all cancers behave equally, and how drugs kill cancer cells varies greatly. We may need to make adjustments for some cancers such as gastrointestinal stromal tumor (GIST) or mesothelioma (8–11). In GIST, for example, investigators incorporated tumor density on computed tomographic (CT) scans and one-dimensional measurement of tumor lesions into their assessment of a population of patients who had been divided into good and poor responders by the criterion of a decrease in standardized uptake value (SUV) to <2.5 measured by 2[18F]fluoro-2-deoxy-d-glucose (FDG)-positron emission tomography (PET). These modified criteria identified an 83% response rate, in contrast to the 45% response rate obtained with RECIST. In addition, the responders identified by these criteria had a longer time to progression than nonresponders (10). In mesothelioma, investigators have developed and validated a modified RECIST adapted to the unique growth pattern of this pleural-based disease (8). But note here that while we may assess response with methodologies other than conventional imaging studies such as CT and magnetic resonance imaging (MRI), or by adjusting how we measure tumors, thresholds will nevertheless have to be established and validated, much as we do with RECIST.
We may also need to make adjustments for certain therapies, notably vaccines and even some immunomodulatory therapies that do not bring about immediate effects but that may have delayed responses (12–15). Ipilimumab, for example, is a fully human, IgG1 monoclonal antibody that blocks cytotoxic T-lymphocyte–associated antigen 4 (CTLA-4), a negative regulator of T cells, augmenting T-cell activation and proliferation. With this agent, delayed responses as well as short-term tumor progression before delayed regression have been observed, albeit in an uncertain percentage of patients (12, 15). And with vaccines, one would not expect an immediate response that manifests as a difference in PFS but rather a difference in overall survival (OS; ref. 14). Such was the case with sipuleucel-T, where the median for time to disease progression (TTP) was 11.7 weeks compared with 10.0 weeks for placebo (P = 0.052), whereas median survival was 25.9 months for sipuleucel-T and 21.4 months for placebo (P = 0.01; ref. 13). Thus, for certain tumor types and particular classes of therapy, adjustments may need to be made.
RECIST and the Definition of Response
So let us then turn to the first concern: Is the RECIST threshold of 30% that defines response in phase I/II studies too rigid? Are we discarding "active agents" after phase I/II studies because we use RECIST? The answer to these questions is clearly no—at least with regard to agents that are worth pursuing. While Moertel and Hanley were looking for an "operational definition," they were aware of the importance of identifying a measure of efficacy (1). Moertel had extensive experience caring for patients, and he and Hanley concluded, in our opinion correctly, that "certainly under real life circumstances, a 25% regression must be considered meaningless, and even a 50% regression is questionable." And why did they think this? Because they noted that "the natural course of a malignant solid tumor is to grow. This rate of growth will be variable, and in a review of published human data…a reasonable median for tumor-volume doubling time…could be estimated as approximately 60 days." Given this doubling time, Moertel and Hanley likely reasoned that 25% to 50% regressions would not prolong survival by much more than 2 months—nowhere near the cures so many thought the future promised. Were they correct in their estimates? In fact, their estimate of a tumor doubling time of 60 days seems not to have been too far off, as documented by a more recent and more extensive review (16). And so Moertel and Hanley would argue, and we would agree, that drugs that reduce tumors by less than 30% in a majority of patients should not be pursued in an unselected population, as they would likely not have a great impact on survival.
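Their reasoning can be made concrete with a back-of-the-envelope calculation. Assuming simple exponential regrowth at the estimated 60-day volume-doubling time (an illustrative model of ours, not their published computation), the time for a tumor to regrow from a given fraction of baseline back to baseline is:

```python
import math

def regrowth_days(remaining_fraction, doubling_days=60.0):
    """Days for an exponentially growing tumor to return to baseline volume
    from `remaining_fraction` of baseline, given a fixed doubling time."""
    return doubling_days * math.log2(1.0 / remaining_fraction)

# A 50% regression is erased in one doubling time (~2 months);
# a 25% regression in under a month.
print(round(regrowth_days(0.50)))  # 60
print(round(regrowth_days(0.75)))  # 25
```

On this model, even a 50% regression buys roughly the 2 months the authors anticipated, which is why they viewed smaller regressions as unlikely to affect survival meaningfully.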
To be sure, many disagree with this and believe that less than 30% is meaningful, arguing that stable disease (SD) may be valuable and should not be ignored (ref. 17; by definition, SD is a change in tumor size that qualifies as neither response nor progression, the latter being defined in RECIST as a 20% increase in the longest diameter). The U.S. Food and Drug Administration (FDA), correctly in our view, has not been willing to include the rate of SD as part of the overall response rate (ORR; defined as the sum of CR and PR), considering that indolence is often indicative of the underlying disease biology rather than a consequence of any drug effect (18, 19). Those who disagree have usurped the term "clinical benefit" and used it to define a "clinical benefit rate" (CBR) that includes CR + PR + SD. But clinical benefit was a term coined as a "patient-centered endpoint" designed to assess the benefit of gemcitabine in pancreatic cancer. Clinical benefit tells us nothing about tumor size but informs us about something very important: how well a patient feels. As defined for pancreatic cancer, clinical benefit includes a composite of measurements of pain (analgesic consumption and pain intensity), Karnofsky performance status, and weight. Furthermore, this definition of clinical benefit required a sustained (≥4 weeks) improvement in at least one parameter without worsening in any other (20). As a patient-centered endpoint, "clinical benefit" is something a patient experiences, not something a physician discerns. For patients who enroll in clinical trials, the overwhelming majority of whom have few, if any, symptoms at enrollment, the suggestion that they have derived "clinical benefit" from a therapy that has not shrunk their tumor, and in fact may not have even slowed its inherent biology, is scientifically bankrupt (21). Indeed, there is evidence that SD is of little or no value.
An analysis of more than 140 phase II clinical trials that examined both "targeted" and "cytotoxic" therapies found that, unlike ORR (CR + PR), SD rates did not correlate with either PFS or OS (22). The lack of correlation with survival endpoints occurs because SD is not a measure of substantive tumor reduction and consequently does not alter the survival trajectory. Thus, SD as a measure of drug efficacy has no value and should not be used as a response endpoint unless standardized definitions are developed and shown to correlate with meaningful changes in clinical outcome, especially survival. Note here that correlations of ORR with survival endpoints could be strengthened if the response threshold were greater than 30% shrinkage. Similarly, if SD were more narrowly defined (for example, as reductions in the range of −20% to −29%, or as stability lasting for many months), it might correlate with survival endpoints, although never as well as ORR (22–24).
RECIST and the Definition of Progression
And what about concerns that RECIST defines progression too rigidly? The current RECIST definition of progression—a 20% increase above baseline or nadir—has merit: for a spherical lesion, a 20% increase in one dimension is a 73% increase in volume (1.2 × 1.2 × 1.2 ≈ 1.73). Especially for a patient whose tumor does not shrink, but only grows, a 73% increase in volume is not a trivial amount. Unfortunately, most of us are lulled by the thought that the tumor has "only increased by 20%." But should we be complacent if the tumor load has almost doubled? To be sure, one can persuasively argue whether a value that is only 8% above the original measurement (20% above a nadir 10% below baseline) should be classified as "progression"—especially if it has occurred over 6 months. But 84% of original from a nadir of 70% is concerning, and if this amount of change was observed over a 6-week interval, it is especially so. A discussion of what constitutes "disease progression" is beyond the scope of this opinion, but it is important to recognize that progression is not only about quantity; it is about quantity over time—and we believe that new paradigms that consider both quantity and time need to be pursued (25, 26). Discussions about RECIST thresholds for progression that do not consider the time element overlook a very important variable.
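The diameter-to-volume arithmetic behind these figures is easy to verify under the usual spherical-lesion assumption (an illustrative sketch; the helper name is ours):

```python
def volume_multiple(diameter_multiple):
    """Factor by which a spherical lesion's volume changes when its
    diameter is multiplied by `diameter_multiple` (volume scales as d**3)."""
    return diameter_multiple ** 3

# RECIST progression: longest diameter 20% above its reference value.
print(f"{volume_multiple(1.20):.2f}x baseline volume")  # ~1.73x: nearly doubled

# Progression called from a nadir 10% below baseline sits only 8% above
# the original measurement...
print(f"{0.90 * 1.20:.2f}")  # 1.08
# ...while from a nadir at 70% of baseline it sits at 84% of original.
print(f"{0.70 * 1.20:.2f}")  # 0.84
```

The same 20% threshold thus represents very different absolute tumor burdens depending on the nadir, which is part of why the time over which the change occurs matters.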
Validation of RECIST
But more important with regard to the value of RECIST, whether referring to response or progression, is the fact that it has been validated time and again by oncology drug development. Before advocating a "clinical benefit rate" that includes SD as a measure of efficacy, and before lowering the bar so as not to miss "active agents," the sobering data in Table 1 should be examined. Summarized therein are drugs approved by the FDA for metastatic solid tumors over the past 10 years. Cytotoxic and targeted therapies are itemized, including many of the therapies that promised so much but ultimately disappointed. The 29,312 patients enrolled represent a fraction of those evaluated, as this number does not include patients enrolled in the phase II trials that preceded these phase III studies or the thousands enrolled in confirmatory phase III trials. The median gain in PFS achieved with our new therapies in these past 10 years is a marginal 2.15 months. The median gain in OS has been 2.16 months. No one can look at these data, which include 48 FDA drug approvals over 10 years, and argue that our "screening threshold" is too high and that we are missing active agents. Do we really want to lower the threshold and spend the next 10 years achieving even less than we have in the 10 years just completed? Do we want to enroll thousands of patients in trials that prolong life by 1 month? We think there is unanimity on this point, and the answer is NO! We wanted RECIST to guide drug development and approval. We wanted it to discriminate the good from the not so good. We wanted a valid and meaningful threshold. We may not yet have achieved the ideal, but one thing is clear: lowering the efficacy bar will not get us closer.
To be sure, RECIST is not perfect. As we noted, there are specific cases where RECIST underperforms. But for the majority of metastatic solid tumors, the proving ground for most cancer therapeutics, RECIST has been and will continue to be valuable. We would argue that while RECIST has been and will always remain an imperfect tool, it and WHO before it have proved their worth over time and should not be changed. Unless of course we want to raise the bar for efficacy and, like Moertel, look for much greater benefits. In that effort, count us in!
Disclosure of Potential Conflicts of Interest
No potential conflicts of interest were disclosed.
Writing, review, and/or revision of the manuscript: A.T. Fojo, A. Noonan
Conception and design: A.T. Fojo
- Received February 27, 2012.
- Revision received May 16, 2012.
- Accepted May 16, 2012.
- ©2012 American Association for Cancer Research.