Quantities, measured and analyzed with numbers and statistics, are the foundation of our ability to communicate the attributes of much of the physical world. Mathematical formulas and calculus enable us to predict what will happen when we manipulate matter in motion. Notations measuring differing lengths of vibrating strings can create the exquisite harmonies of the “music of the spheres” or a Bach cantata. And varied linear dimensions in building construction can be arranged to form the ‘golden section’ of classical architecture, thus yielding harmonious proportions that please the eye.
Small wonder, then, that the new models of representative government that emerged from the Atlantic basin at the close of the 18th century would rely on achieving harmony by balancing quantities. Co-opting human passions and desires with numbers fixes them as universally measurable and thus, we want to believe, manageable. The belief that what can be measured can be managed has carried over into modern economic and social policy, for which statistics have become essential ingredients.
Scientists and statisticians attempt to hold inessential variables constant when comparing quantifiable phenomena. In the realms of nature and society, however, few important variables can be held constant for more than a moment in time. The social and natural phenomena that often interest us—e.g., disease and its course—happen over time. The passage of time, during which nature is in constant flux, is what most often confounds our efforts to measure causes and effects in the realm of biology and medicine.
Medical protocols for the detection and treatment of disease normally rely on efforts either to measure the course of disease in large, defined populations (epidemiology) or to evaluate cause-and-effect relationships for diagnostic or therapeutic purposes (trials). Virtually all investigations of diagnostic and treatment options today assume that randomized controlled trials are the ‘gold standard’ for medical research.
Since it is practically impossible to test a treatment option against its alternatives using a nation’s entire population, test subjects are identified through random sampling, which, as when a dealer shuffles a deck of cards, is intended to eliminate all bias in their selection.
In addition to randomization, the best research trials are blinded. The purpose of single- and double-blinded trials is to eliminate bias from entering into the identification and measurement of causes and effects. If a test subject in a single-blind trial is unaware whether she is receiving the medication being tested or a placebo, for example, she will be less likely to stint (or exceed) the instructed dosage.
In double-blind trials both the test subjects and those conducting the experiment are unaware of whether individuals belong to the test subject group or to the randomly chosen control group not using the tested drug or medical device. Double-blinding ensures that researchers as well will not, however unconsciously, allow their own biases to influence how they characterize a cause or an effect.
Thus when evaluating the quality of evidence used to support a medical protocol, physicians and medical researchers normally consider the evidence’s ‘internal’ validity. Was the supporting trial or study randomized? Was it double blinded? But they must also consider a study’s ‘external’ validity. How reliably can a study’s findings be generalized to the population being sampled? Are those findings consistent with what epidemiological data tells us about that larger population?
Newspaper outlets occasionally bring us fresh stories of studies proving or challenging the benefits of mammography screening.  Still influential, however, is the first and purportedly randomized large controlled study of mammography screening performed in the United States, the mammography trial conducted by the Health Insurance Plan (HIP) of Greater New York (GNY) in the 1960s.
Statistician Sam Shapiro; Dr. Philip Strax, a radiologist; and Dr. Louis Venet, a surgeon, led the study, which was designed to test “the hypothesis that…periodic breast cancer screening with mammography and clinical examination…leads to earlier detection of breast cancer than ordinarily experienced and that mammography contributes significantly to detection.” Their results were published in the Journal of the American Medical Association (JAMA) in 1966 and 1971.
Strax, Shapiro, and Venet initiated their study as a randomized controlled trial of clinical breast cancer screening (i.e., palpation, or CBE), breast x-ray (mammography) screening, and mammography screening in combination with CBE. They had ready access to a defined population of 85,000 women enrolled in 23 of the many group medical practices affiliated with the Health Insurance Plan of Greater New York, where Shapiro was employed as Director of Research and Statistics.
Thirty-one thousand women selected at random from HIP-GNY patients aged 40 to 64 years were designated for the study group, and an identical number in the same age group were selected for the control group. The study-group women were “offered a screening examination in their medical group centers, and 64 percent cooperate [sic].” Meanwhile, patients “in the control group follow their usual practices in receiving medical care. No special effort is made to encourage them to have general physical examinations…. Mammography is not routinely included in the general physical examination.”
Clinical breast examination remained a significant diagnostic variable throughout the trial: “Even under the best conditions for applying mammography, certain types of breast cancer that are palpable are missed with this technique.” Nonetheless, most reports of the HIP-GNY study represented its findings as proof of the value of x-ray mammography alone. Data for breast self-examination, which had been encouraged by the American Cancer Society and the National Cancer Institute since the 1950s, were not included in the study.
Cautioning that their findings were “preliminary,” the study’s authors acknowledged that their data applied to a “more limited period” (1963 – 1969) than the “minimum of five to ten years experience … necessary to answer the question definitively.”  More than the limited period of time, however, compromised the study.
While the initial total of randomly sampled patients was 62,000, with 31,000 assigned to each group, in the end only “about 20,000 women,” or 65%, “of the study group … appeared for their initial screening examinations, and of these, 80% participated in the first annual examination, 74% in the second annual examination, and 69% in the third annual examination.” In other words, of the original 31,000 women assigned to the study group, barely half (52%) underwent their first annual examination, and only a quarter (26.5%, or 8,215 women) appeared for all three annual examinations.
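The attrition described above can be checked with a few lines of arithmetic. The following is only a sketch using the figures quoted in this post; the variable names and the `share` helper are illustrative and do not come from the study itself.

```python
# Figures as quoted in the text above (HIP-GNY study group).
assigned = 31_000             # women randomly assigned to the study group
initial_screened = 20_000     # "about 20,000" appeared for the initial screening
first_annual_rate = 0.80      # share of those who returned for the first annual exam
completed_all_three = 8_215   # women who appeared for all three annual exams


def share(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


initial_pct = share(initial_screened, assigned)
first_annual_pct = share(initial_screened * first_annual_rate, assigned)
all_three_pct = share(completed_all_three, assigned)

print(f"Initial screening:    {initial_pct}% of those assigned")
print(f"First annual exam:    {first_annual_pct}% of those assigned")
print(f"All three annual exams: {all_three_pct}% of those assigned")
```

Measured against the women originally assigned rather than against those who showed up, participation falls from roughly two-thirds to about one-quarter.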
The applicability of the study to all asymptomatic women aged 40 to 64 was further compromised by the fact that the population sampled was more urban and more ethnically and racially diverse than the larger population of the nation’s women. In 1970 over two-thirds of the U.S. population lived in areas characterized by the Census Bureau as “outside central cities” or “nonmetropolitan areas.” Members of the HIP also represented only women who, as a result of their income or employment, were able to afford membership.
Such variables matter. In the epidemiology of breast cancer, higher socioeconomic status, residence in the urban and northern United States, and a first full-term pregnancy after age 30 all increase a woman’s relative risk for breast cancer. Moreover, while women might be randomly selected for the study and control groups, in practice they could not be forced to undergo screening, whether X-ray or palpation or both. Nor could women in the control group, whose HIP coverage included general physical exams, be denied breast palpation by individual physicians who might have performed it as a matter of course or at a patient’s request. How many control-group women were examined in this way, and how thorough the palpation or CBE was, we don’t know.
We do know that some women declined to be screened (whether they declined X-ray screening or clinical breast exam, or both, is not clear). And we know that once they declined, those who remained in the study group were not even representative of the women of New York:
…study women who refused screening differed from those examined in several respects: eg, they are slightly older; have lower educational attainment; are less likely to be Jewish, to have been married, or to be multiparous or premenopausal; and a lower proportion report ever having had a lump in the breast. Also, they differ markedly from the others in basic attitudes toward preventive health examinations. They are more apt to avoid physical examinations in general, to feel that people should wait until they have symptoms before seeing a physician, and to believe that their physicians “know all my health conditions without . . . more special tests.”
The numbers of biopsies recommended as a result of mammographies, mammographies in combination with clinical breast examination (palpation), and clinical breast examination alone (all three of which were part of the study group screening) were combined with the numbers of biopsies performed as a result of “changes in opinion that occurred later on the basis of additional information obtained through reexamination of the patient by her own physician. About one out of six of the cases classified as biopsy recommendation … were affected by these changes.”
In short, both the study and control groups selected from among HIP-GNY participants lost their randomly chosen composition as the study progressed. Some women declined to be screened after all, and some failed to participate in the full protocol of three annual screenings. Physician recommendations to biopsy on the basis of “additional information” served as another confounding variable.
Ultimately x-ray screening alone accounted for the detection of 60% of the 127 confirmed cancers found in the 8,215 women who completed three annual examinations. Palpation alone accounted for 56 confirmed cancers, and a combination of x-ray and palpation accounted for 29 confirmed cancers. This is the essence of the first of the two principal claims made for the HIP-GNY study, namely, that mammography is an improvement over palpation for the detection of breast cancer.
The second most-repeated claim for mammography screening attributed to the HIP-GNY trial is that mammography saves lives. On close inspection, however, one finds more questionable numbers. According to the study’s authors, after the six-year period of the trial (1963-1969), there were 52 deaths among the control women “in which breast cancer was the underlying cause,” and 31 deaths among the study women, “screened and not screened combined.” [Italics added.] These numbers, which do not compare breast cancer mortality among women who received screening mammography with women who did not, are the source of claims that screening mammography can reduce breast cancer deaths by as much as 40 percent.
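The arithmetic behind the “40 percent” figure can be sketched as follows. This is an illustration only: it uses the two death counts quoted above and assumes, as the text states, that roughly 31,000 women were assigned to each group. The variable names are hypothetical.

```python
# Breast cancer deaths as quoted above, counted by group assignment
# (the study-group figure includes women who were never screened).
control_deaths = 52
study_deaths = 31
group_size = 31_000  # assumption: about 31,000 women assigned to each group

fewer_deaths = control_deaths - study_deaths

# Relative reduction: the difference expressed as a share of control-group deaths.
relative_reduction_pct = round(100 * fewer_deaths / control_deaths, 1)

# Absolute difference: fewer deaths per 1,000 women assigned to the study group.
absolute_per_1000 = round(1000 * fewer_deaths / group_size, 2)

print(f"Relative reduction: {relative_reduction_pct}%")
print(f"Absolute difference: {absolute_per_1000} deaths per 1,000 women")
```

Both results describe women as they were assigned, not women as they were actually screened, so neither isolates the effect of mammography itself.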
Then there were the study’s reported “case fatality rates,” defined as the “probabilities of death among women with breast cancer during specified periods following histologic confirmation of breast cancer.” While this phrasing invites the inference that the probable deaths would have been from breast cancer, the authors themselves stated otherwise: These case fatality rates did not necessarily represent breast cancer deaths because they included “all deaths among women with breast cancer regardless of the cause of death [italics added].”  (Autopsy studies have found breast cancer in approximately 10% of women who die of non-cancer causes.)
Since an unknown quantity of the deaths had not yet occurred, their causes could not be given. Therefore, to posit case fatality rates, the nature of the quantitative data the authors presented necessarily changed from data descriptive of actual events to data resulting from the calculation of mathematical probabilities based on statistical projections. This was a subtle but crucial epistemic shift that may have been lost on readers ready for the study’s “preliminary…optimism.” Shapiro continued to track and report breast cancer deaths from the HIP-GNY study’s patients until 1985. Given the serious shortcomings of the study, the usefulness of this information is debatable.
Because the HIP-GNY trial was later used to justify a nation-wide program of x-ray breast cancer screening sponsored, and largely funded, by the federal government, it is important to recognize what it did, and did not, do. It did not prove that mammography alone substantially improves the detection of clinically significant breast cancers. It did not prove that mammography “saves lives.” It was unable to prove those things because the trial’s internal and external validity were so compromised that it could not justify a medical screening protocol for the general adult female population.
Not only did the HIP-GNY trial lack both internal and external validity; it was also conducted with women who had every reason to believe that their individual health needs were paramount in their own clinical setting. Therefore it was conducted in violation of the standards of medical research ethics current at the time.
Tension between the ethical obligation of physicians to consider first and foremost the interests of individual patients, and a desire to advance medical research, is evident throughout the men’s description of the manner in which the study was conducted. For example:
The patient appears for a screening examination in a familiar setting, the medical group center, and she knows that the group’s surgeon and radiologist are participating directly in the program. Her responsiveness to the initial invitation to participate is increased …. The surgeon who functions as the examining physician is a member of the medical team in the group responsible for the patient’s care, and the findings of examinations in which he has been a critical participant are more readily translated into action than might ordinarily be true in a screening program.
Meanwhile, in order to “reduce … the amount of time required to screen a patient,” imaging was limited to “a cephalocaudad [vertical] and mediolateral [horizontal] view of each breast and … slightly increasing the kilovoltage. [Thus] it was possible to screen women at the rate of 13 patients per three-hour session.”
In the event that a patient whose mammography revealed suspicious white spots or shadows hesitated to proceed with a biopsy, “special measures” were “introduced to increase the likelihood that…the woman will accept hospitalization.”
A team approach has been developed to minimize the likelihood of ‘losing’ the patient. Responsibility for communicating with the patient is assumed by the medical group surgeon assisted by a registered nurse on the central staff who has wide experience in the fields of health education and research.
Just what these patients were told to persuade them to proceed with hospitalization was not reported in the JAMA articles. Whether, if a biopsy and histology revealed cancerous cells, the patient was awakened and given a reasonable and informed opportunity to decide what sort of treatment she chose to receive (a lumpectomy, modified mastectomy, radical mastectomy, or no surgery at all) was also not reported. Of the 23 women whose suspicious results were confirmed as cancers during the first two years of the trial, all but 3 underwent radical mastectomies. Whether the surgeries were performed and billed by HIP-GNY affiliated surgeons, a potential conflict of interest, we do not know.
In the end what the HIP-GNY study did do was subject the breasts of thousands of female patients to ionizing radiation, with unknown consequences. Furthermore, while over 1,800 women suffered the anxieties of biopsy recommendations, over a thousand women were actually subjected to the risks—principally infection and tumor metastasis—of their biopsies. With the exception of the 212 women whose biopsies confirmed cancers, the rest underwent these risks unnecessarily.
If relying on quantifiable evidence in setting medical protocols has complicated the challenge of proving the value of x-ray breast cancer screening, it has also impeded efforts to quantify mammography’s risks. These risks were known even before “over-diagnosis” began to receive general media coverage in the late 1990s, as we shall see in a subsequent post.
The frequency of, and ages at which, breasts are x-rayed are easily counted. Mortality likewise provides a quantifiable point of comparison. But, as we age, the biological processes of health and disease are unique to individuals. Medical efforts to defeat cancer are experienced differently by individuals as well, for whom the miseries of cancer treatment may become just as important as the year in which they expire. Hence the ‘quality of life’ considerations with which all patients and good physicians must grapple when confronting death.
Finally, just as perilous as the common confusion of statistical correlation and causation is the failure to appreciate the profound moral distinction between description and prescription. This failure is a serious obstacle to those intent on ensuring that competent professional judgment is also ethically defensible. Deep within the epistemic chasm between description and prescription—between what “is” and what “ought to be”–lies a subjective medical bias toward intervention, whatever its risks.
When the body’s failure to suppress cancerous cell growth demands interference, good medicine must be ready to interfere. But when screening technologies entail probable and perhaps irreparable harm—as all screening involving x-radiation does—one must ask how many women might be harmed so that one might be saved? Like most ethical questions, this one does not lend itself to a numerical solution.
(Published May 15, 2015. The next post takes a hard look at the American Cancer Society’s 1970s campaign to market screening mammography to healthy women.)
 See, for example: Gina Kolata, “New Mammogram Studies Divided on Benefits,” The New York Times (September 3, 2002), “Vast Study Casts Doubts on Value of Mammograms,” The New York Times (February 11, 2014), and other reporting by Kolata and Jane Brody in the Times.
 Handel Reynolds, M.D., The Big Squeeze: A Social and Political History of the Controversial Mammogram (Cornell University Press, 2012), Chapter 2 (13%).
 Sam Shapiro, Philip Strax, MD, Louis Venet, MD, “Evaluation of Periodic Breast Cancer Screening with Mammography: Methodology and Early Observations,” Journal of the American Medical Association, Vol. 196 (1966). Reprinted in CA-A Cancer Journal for Clinicians, Vol. 40, No. 2 (March-April 1990), pp. 111-125. A pilot study to determine the feasibility of such a trial had been supported by the National Cancer Institute; the HIP-GNY trial itself was partly funded by a contract with the U.S. Public Health Service.
 Sam Shapiro, Philip Strax, and Louis Venet, “Periodic Breast Cancer Screening in Reducing Mortality from Breast Cancer,” Journal of the American Medical Association, Vol. 215, (1971), pp. 1777-1785.
 Sam Shapiro, Philip Strax, MD, Louis Venet, MD, “Evaluation of Periodic Breast Cancer Screening . . .,” (1966, 1990), p.113.
 Sam Shapiro, Philip Strax, MD, Louis Venet, MD, “Evaluation of Periodic Breast Cancer Screening . . . .” (1966, 1990) p.115.
 Sam Shapiro, Philip Strax, and Louis Venet, “Periodic Breast Cancer Screening . . .,” (1971), p.1788.
 Barron H. Lerner, The Breast Cancer Wars: Hope, Fear, and the Pursuit of a Cure in Twentieth Century America (Oxford University Press, 2001), p. 55.
 Sam Shapiro, Philip Strax, and Louis Venet, “Periodic Breast Cancer Screening . . .,” (1971), pp. 1777, 1785.
 Sam Shapiro, Philip Strax, and Louis Venet, “Periodic Breast Cancer Screening . . .,” (1971), p. 1780.
 Jennifer L. Kelsey and Marilie D. Gammon, “The Epidemiology of Breast Cancer,” CA-A Cancer Journal for Clinicians, Vol. 41, No. 3 (May-June 1991), p. 157.
 Sam Shapiro, Philip Strax, and Louis Venet, “Periodic Breast Cancer Screening . . .,” (1971), p. 1780.
 Sam Shapiro, Philip Strax, and Louis Venet, “Periodic Breast Cancer Screening . . .,” (1971), p. 1781.
 Sam Shapiro, Philip Strax, and Louis Venet, “Periodic Breast Cancer Screening in Reducing Mortality from Breast Cancer,” Journal of the American Medical Association, Vol. 215 (1971), p. 1783.
 Welch, H.G. and W.C. Black, “Using Autopsy Series to Estimate the Disease ‘Reservoir’ for ductal carcinoma in situ of the Breast: How Much More Cancer Can We Find?” Annals of Internal Medicine, Vol. 127, (1997), pp. 1023-1028; W.C. Black and H.G. Welch, “Advances in Diagnostic Imaging and Overestimations of Disease Prevalence and the Benefits of Therapy,” New England Journal of Medicine, Vol. 328 (1993), pp. 1237-43.
 S. Shapiro, W. Venet, P. Strax, L. Venet, and R. Roeser, “Selection, follow-up and analysis in the Health Insurance Plan Study: A Randomized trial with breast cancer screening,” National Cancer Institute Monographs, Vol. 67 (May 1985), pp. 65-74.
 Sam Shapiro, Philip Strax, MD, Louis Venet, MD, “Evaluation of Periodic Breast Cancer Screening . . .,” (1966, 1990), p.116.
 Sam Shapiro, Philip Strax, MD, Louis Venet, MD, “Evaluation of Periodic Breast Cancer Screening . . .,” (1966, 1990), pp.116-117.
 Sam Shapiro, Philip Strax, MD, Louis Venet, MD, “Evaluation of Periodic Breast Cancer Screening . . .,” (1966, 1990), p.116.
 S. Shapiro, W. Venet, P. Strax, L. Venet, and R. Roeser, “Selection, follow-up and analysis in the Health Insurance Plan Study: A Randomized trial with breast cancer screening,” National Cancer Institute Monographs, Vol. 67 (May 1985), pp. 65-74.
 Of the 1,840 biopsy recommendations arising from all screening, only 1,056 (57%) were actually performed, and only 212 biopsies (11.5% of those recommended, or about 20% of those performed) confirmed cancers. Sam Shapiro, Philip Strax, and Louis Venet, “Periodic Breast Cancer Screening . . .,” (1971), Table 3, p. 1780.
 Johannes P. van Netten, Stephen A. Cann and James G. Hall, “Mammography Controversies: Time for Informed Consent?” Journal of the National Cancer Institute, Vol. 89, Issue 15 (August 6, 1997); on “overdiagnosis,” see for example H. Gilbert Welch, “Breast Cancer Screenings: What We Still Don’t Know,” New York Times (December 29, 2013).