During World War II, the US Army launched a seemingly routine experiment to find the ideal way to screen soldiers for tuberculosis. Jacob Yerushalmy, the statistician in charge of this project, would succeed at this task — and end up fundamentally changing our conception of medical diagnosis in the process. This episode features Dr. Shani Herzig, as well as a new segment featuring Dr. Umme H. Faisal on Yellapragada Subbarow and his discovery of ATP.
Bedside Rounds store: https://www.teepublic.com/stores/bedsiderounds
Umme H. Faisal on Twitter: https://twitter.com/stethospeaks
Please note that transcripts are based off of earlier editorial copies, and are not necessarily the same as the final episode.
This is Adam Rodman, and you’re listening to Bedside Rounds, a monthly podcast on the weird, wonderful, and intensely human stories that have shaped modern medicine, brought to you in partnership with the American College of Physicians. This episode is called A Vicious Circle, and it’s the second of my two-parter on the development of signal detection theory and the formal integration of uncertainty into the idea of diagnosis. In this episode, we’re going to finish the story about Jacob Yerushalmy and the integration of uncertainty into diagnosis. And as in the previous episode, I am joined by Dr. Shani Herzig, who will forever carry the epithet “smartest person I know”, and we are going to discuss some of the downstream implications of his work — what does it mean to redefine a diagnosis in terms of uncertainty? How do we as physicians actually carry this out in our day-to-day lives? How does this affect the tests and studies that we use to try to figure out what is going on with people? And most importantly, what does that mean for our patients? And this is a somewhat shorter episode — so keep on listening at the end, because I have a new segment about medical history to introduce!
Previously, on Bedside Rounds …
Oh, you have no idea how long I’ve wanted to say that — really brings me back to end-of-season cliffhangers from TV shows from the mid 90s, when I was a kid. By the onset of WW2, immunologists were routinely using the concepts of sensitivity and specificity of serological tests, especially in regards to syphilis testing — which increasingly referred to the modern definitions: the ability of a diagnostic test to correctly detect patients with the disease, or correctly detect patients without the disease. At the same time, vast radar arrays had been set up, especially in Great Britain, and military statisticians had developed something called the receiver operating characteristic curve — a way to objectively determine the best threshold to minimize both false negatives and false positives and actually detect a German bomber and not, say, a flock of geese. Into this epistemological milieu, the US Army launched a program to find the best radiographic method to screen for tuberculosis. And the statistician at the head of it — Jacob, or Yak, Yerushalmy — would end up fundamentally changing our understanding of diagnosis.
So it’s now 1944. The US Army convened a group of esteemed radiologists and pulmonologists — or phthisists as they were called at the time — to find the ideal method to screen Army recruits for tuberculosis. They chose to study populations at two VA Hospitals — effectively screening all the patients, mostly older WWI veterans, and all employees and staff. Every patient would have four different films taken. The first was a 35 mm photofluorogram of the chest. Think of a Kodak carousel slide projector, basically. The second was a 4 x 10” stereophotofluorogram — basically the paper printout of fluoroscopy, and what was used in the early 20th century. It’s not a negative, so the heart and bones were actually black and the aerated lungs white. The third was the same paper printout, but the negative. And the final was a 14” x 17” celluloid film, which is what we normally think of as a chest x-ray today. I realize a lot of my younger colleagues have probably never seen a celluloid x-ray, since everything has been on PACS for years, so think about the opening scene of Scrubs, even though they frustratingly put the x-ray on the lightbox backwards, or the patient has situs inversus. As you can imagine, there were a lot of studies here — 1,256 films in total. Yerushalmy was deeply influenced by his work in classifying cancer, and he developed a systematic method that all five readers would follow. Each film was interpreted separately by each of the radiologists and pulmonologists. Then, after a lapse of 2-3 months, the films were circulated again, meaning the same reader interpreted each film twice.
I asked Shani to break down this experiment for us:
Jacob Yerushalmy, otherwise known as Yak — yeah, that’s what he went by. Um, he was basically, well, the task was to compare the diagnostic, I think they call it the diagnostic efficiency, of four different types of chest imaging, basically. Um, but in the process of that activity, it became apparent that, um, there were problems in doing that, in fulfilling their objectives. Um, and he goes through and identifies a number of different problems, um, with respect to doing so. And you know, the main, the main things are basically like, there is, um, something, well, there’s the problem with defining what is the gold standard? So like, how do you say which test is most accurate? Like, which one’s right? And like, is there some other gold standard that those tests can be compared against? In order to do so, you kind of have to say that one of the tests is the gold standard — and how do you actually choose that?
So that’s a really big problem. And then the other major problems that he goes into, I’m going to say, are kind of along the lines of what you could call measurement error. And, um, inherent in measurement error is instrument error, which is like the actual thing being used to, um, assess disease presence or absence may be flawed. So there may be flaws in the instrument itself, um, kind of the technology of the instrument. And then there’s like interpreter errors, or human errors, um, or errors in interpretation — which, essentially, what he finds is that there’s a lot of both inter-individual and intra-individual variation, which basically says there are problems with the ability of humans to decide whether or not the test is positive.
After a few months, Yerushalmy sat down to analyze this massive set of data. He immediately noted two fundamental problems — the first, as Shani points out, was that there’s no “gold standard” to compare the techniques against. There was no way to definitively know who does and doesn’t have tuberculosis. And the second was that the doctors didn’t agree with each other — for example, for the 14x17 celluloid films, the positive rate ranged from 56 out of 1,256 to as high as 100. And the doctors didn’t even agree with themselves! Many times a physician would read a film as positive, only to read it as negative 3 months later.
One of the physicians disagreed with himself almost half the time. Shani was not impressed.
I mean, that’s horrible. That’s like, you’d be better off having a coin that you flipped as a doctor than having that actual doctor.
The first thing Yerushalmy did was try to solve the “gold standard” problem. He decided to use group concordance — he treated a film as positive for TB if three or more of the five reviewers had read it as TB. By using this method, he was able to show that none of the four methods was any better than the others — which was the conclusion of the paper, published in JAMA in 1947 with the title Tuberculosis Case Finding: A Comparison of the Effectiveness of Various Roentgenographic and Photofluorographic Methods. There are some interesting observations in this paper — most notably group concordance — but this paper and its conclusions alone weren’t exactly groundbreaking.
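The group concordance rule is simple enough to sketch in a few lines of code. This is a hypothetical illustration of the majority-vote idea, not Yerushalmy's actual tabulation method:

```python
# A hypothetical sketch of the "group concordance" gold standard:
# a film counts as positive for TB if at least 3 of the 5 readers
# independently read it as positive.
def group_concordance(readings, threshold=3):
    """readings: one boolean per reader (True = read as positive)."""
    return sum(readings) >= threshold

# Example: three of five readers call this film positive,
# so it is treated as a true case for comparison purposes.
group_concordance([True, True, False, True, False])  # True
```

Note the elegance of the move: with no biopsy or autopsy to appeal to, the panel's majority itself becomes the reference standard against which each individual reading, and each imaging method, can be scored.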
Shani was a little more impressed with his use of group concordance to determine a gold standard.
I think it’s an elegant answer. Um, I think it definitely helps to minimize the problems that we’ve been talking about. Um, one thing that, uh, he also goes on to point out, which I thought was really interesting is that once you have defined a gold standard, it becomes really hard then to identify additional new techniques that are better than the gold standard, because if you’re calling something a gold standard and then you have this new technique, that’s actually picking up more positives, but the gold standard isn’t, you’re going to call those false positives initially, potentially. Um, until you kind of, you know, until you define a new gold standard, but you know, you may not initially recognize that those aren’t actually false positives. Those are actually, that’s actually the new gold standard doing a better job.
In my opinion, though, it’s the second paper from this study that would be groundbreaking. I don’t know the details behind the publication, but Yerushalmy published his more heretical analysis the next month, in a solo report in the Tuberculosis Control Issue of the government’s Public Health Reports, entitled “Statistical Problems in Assessing Methods of Medical Diagnosis, With Special Reference to X-ray Findings.” The entire issue of this journal is dedicated to his paper, sandwiched between commentary that is remarkably prescient about what his paper means. And yes, I think this paper is so important that Shani and I are going to walk you through it.
This paper opens with the fundamental problem that he has identified in diagnosis: “the process of medical diagnosis involves the application to a specific case of the knowledge accumulated from a large number of similar cases… Since no two cases are exactly alike, the resulting diagnoses are not absolute but involve some uncertainty and better be thought of in terms of probabilities… One of the fundamental objectives of improved medical care is to increase the probabilities of correct diagnoses.”
Tests will perform differently, whether due to fundamentals in the test method, like a biopsy, or due to imperfect humans. But as the number of different tests has markedly increased, a fundamental problem has developed — how do you compare different tests for the same condition? In order to do this, you must have a way to clearly identify positive cases. “It is,” he writes, “a vicious circle.” And how can we do it in such a way that allows us to not only choose the best test, but also be open-minded to new diagnostics that come along?
This seems to be a fairly straightforward statement. But the implications are pretty deep.
Um, I mean, I think the implications are well, there, there are a few implications. Um, one that all of what we do in medicine is we have to recognize that there’s so much subjectivity in it. We think of things as objective when in reality, they’re all of these things that I just mentioned are impacting our ability to diagnose disease and even to define disease. So everything upon which medicine is based is predicated upon having some gold standard, knowing whether there’s disease or not, when you develop all these tests. And if we can’t even agree on whether or not there is actually disease or how to assess whether there actually is disease, it really kind of throws everything else into the air.
And this is where I wish I knew more about the connections between Yerushalmy, who was working with the USPHS, and the statisticians working on signal detection. Because in order to answer these questions, he imports ideas from the receiver operating characteristic curve as well as the previous work done on serologic sensitivity and specificity. Using his large dataset of x-ray images, he takes these concepts previously only used in serology, and applies them to diagnosis writ large — so, as a reminder: the sensitivity, which he defines as the probability of a correct diagnosis of positive cases, and then the specificity, the probability of a correct diagnosis of negative cases. A very sensitive test will have few false negatives; a very specific test will have few false positives. He then runs through his numbers again, calculating the sensitivity and specificity for the four different x-ray methods, showing that they were all equally sensitive, and justifying the conclusion in the original article published in JAMA. I know that I’m harping a lot on sensitivity and specificity — but for the past almost thirty years, this has been a huge deal that affects day-to-day medical care in ways both mundane and profound. I’ll give you two examples that come up all the time in my life. Anemia — low levels of hemoglobin — is incredibly common across the general population, but especially in adults with chronic medical conditions, whom I largely see. If that anemia is caused by low iron levels, it can be easily treated with oral iron. But there are other causes of anemia — most commonly just inflammatory effects from chronic illness itself, as well as kidney disease — that would make treatment with iron ineffective. So how to best test for iron deficiency? There’s a pathological gold standard test, of course — a bone marrow biopsy, literally pushing a large needle into the hip to aspirate the bone marrow, and then staining the marrow for hemosiderin to estimate iron levels.
You can imagine that this is suboptimal, to say the least — incredibly uncomfortable for patients, taking a ton of time, and costing a lot of money. Instead, we look at blood tests. But there are SO many different tests. Which one to use? Fortunately, this has been studied quite extensively. Take, for example, a study of 101 patients at a VA hospital, which compares four common tests — the ferritin, the total iron binding capacity or TIBC, the mean corpuscular volume, and the transferrin saturation. From this study, the authors calculated different sensitivities and specificities at different cutoffs, and plotted them on a receiver operating characteristic curve — which clearly shows one test performs much better under every condition — and that’s ferritin, at least in hospitalized male veterans who undergo bone marrow biopsy.
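To make the ROC idea concrete, here is a minimal sketch of how such a curve is built: sweep a cutoff across a continuous test and compute the true and false positive rates at each one. The ferritin values below are invented for illustration and are not from the cited VA study.

```python
# Sketch of building an ROC curve for a continuous test like ferritin.
# For iron deficiency, LOWER ferritin suggests disease, so a value at or
# below the cutoff counts as a positive test.
def roc_points(diseased, healthy, cutoffs):
    points = []
    for c in cutoffs:
        sens = sum(v <= c for v in diseased) / len(diseased)  # true positive rate
        fpr = sum(v <= c for v in healthy) / len(healthy)     # false positive rate
        points.append((fpr, sens))
    return points

deficient = [5, 8, 12, 20, 30]     # hypothetical ferritin, iron-deficient patients
replete = [40, 60, 90, 150, 300]   # hypothetical ferritin, iron-replete patients
roc_points(deficient, replete, cutoffs=[15, 35, 100])
# -> [(0.0, 0.6), (0.0, 1.0), (0.6, 1.0)]
```

Plotting sensitivity against the false positive rate for every cutoff traces the curve; the test whose curve hugs the upper-left corner performs best, which is exactly how the ferritin won out.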
Let’s take another, higher stakes example. In the emergency room, a pulmonary embolism — that is, a blood clot that often travels from large veins in the leg to the pulmonary vasculature — is a can’t-miss diagnosis, since it can lead to serious complications and death, even in young patients, has a reasonably straightforward diagnostic test, a CT scan, and an effective and reasonably safe treatment — blood thinners. But a PE can present very subtly — sometimes with just shortness of breath. So how can you possibly tell whether or not you should be worried about pulmonary embolism? Emergency room physicians in particular have developed different “tests” to see if they need to further evaluate someone for a PE — one of those is the pulmonary embolism rule-out criteria, or the PERC, a series of low-risk criteria including age, vital signs, lack of medications or surgery that make PE more likely, and lack of exam findings. This combination of findings is between 96 and 97.5% sensitive at ruling out a PE in low-risk people, depending on the study. Sensitivity and specificity give us the language and conceptual framing to go beyond “clinical suspicion”.
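Yerushalmy's two definitions reduce to simple ratios over the familiar two-by-two table. A minimal sketch, with made-up counts rather than data from any of the studies above:

```python
# Yerushalmy's definitions as ratios over a 2x2 table (counts invented).
def sensitivity(true_pos, false_neg):
    # probability of correctly diagnosing the positive cases
    return true_pos / (true_pos + false_neg)

def specificity(true_neg, false_pos):
    # probability of correctly diagnosing the negative cases
    return true_neg / (true_neg + false_pos)

sensitivity(90, 10)  # 0.9  -- few false negatives: a very sensitive test
specificity(95, 5)   # 0.95 -- few false positives: a very specific test
```

This is why a highly sensitive rule like the PERC is useful for ruling disease out: when false negatives are rare, a negative result carries real weight.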
While this sort of thinking isn’t particularly controversial now, that wasn’t the case at the time. Francis Weber, the medical director of the Tuberculosis Control Division, dedicated the entire issue of Public Health Reports to defending Yerushalmy, writing that his findings had “in some quarters given rise to a dissent not unrelated to apprehension,” though he offers cautious optimism because, as he writes, “unlike an Achilles heel, the discrepancies cited in these studies are in no sense indigenous defects but phenomena susceptible to remedy.”
If you just literally read his paper, Yerushalmy is very careful to scope his conclusions. In particular, he argues that sensitivity and specificity are really only useful in disease control — screening tests of large numbers of low-risk people — careful to place his conclusions in the same tradition as the Wassermann reaction and syphilis screening. But Weber’s introductory editorial lays out the undeniable conclusion. All diagnosis, including the methods of “classical” diagnosis, has the same limitations as chest x-rays. In fact, Weber suggests, at least radiologists are now aware of their limitations. The Oslerian attending rounding with adoring students hasn’t the slightest clue. That “prominent aorta” you thought you heard? Those end-expiratory crackles? That subtle lateral rectus palsy? They’re all subject to a certain amount of fundamental uncertainty. The statistician Jerzy Neyman has the final piece in this issue, and wastes no time cutting to the chase:
“Dr Yerushalmy’s data and the discussion suggest that a diagnosis is always subject to error. Therefore, we have to consider the probabilities of the correct outcomes of the diagnosis.”
This insight — that a certain amount of uncertainty is baked into every diagnosis — would become a new paradigm in diagnosis in the second half of the 20th century, especially as the new field of “informatics” started down an ambitious goal of building a machine that “thought” like a doctor — and might eventually replace them. We’re going to dig into that story in a future episode. But I wanted to ask Shani more about the immediate implications of Yerushalmy’s work — especially of this idea of group concordance as a gold standard.
I mean, it really depends on what the topic is. Right. Um, so there are, it really depends. I mean, I’m sure that there are some diseases where — so you know what, rheumatologic diseases are a great example, right? So in rheumatology, how does one actually define if someone has lupus? So what ends up happening is a lot of doctors get together and devise standard criteria. And these research definitions, um, which require like four out of six of whatever set of symptoms, um, become the way that we diagnose disease. And that method is settled upon by multiple, multiple physicians, many physician experts. Now that’s a little different than what he’s talking about. Um, Yak, when he talks about group opinion, that’s actually, um, more applied to the case of actually managing patients, which is to say that it sounds like he’s advocating for having multiple physicians interpreting every, uh, test in some ways. And, you know, we do that to some extent as a quality check — for example, in radiology, some portion of all of the films that are read get re-read by a second radiologist.
No, but what percentage of my diagnoses get rediagnosed?
Yeah. So we don’t do it in other areas, and particularly not in medicine. I mean, we do to some extent, in that our EKGs, for example — when we get EKGs on patients as hospitalists, we read them at the bedside, and then eventually we get a read from a cardiologist. Um, but we’re taking action on our own read. And ideally every case would probably be discussed with at least one other colleague, um, to, you know, both kind of move to the median or mean, but also to have this, um, you know, additional check on — like, there’s just this inherent variability.
It raises an interesting question of course. Could Yerushalmy’s ideas about group concordance help improve our diagnostic abilities?
I don’t think — I mean, I think they are translatable to diagnosis, and in an ideal healthcare system they would be translated to clinical diagnosis, but I don’t think it’s feasible, probably. I mean, we already have like a shortage of physicians. I don’t think it’s feasible to have multiple physicians weighing in on every single diagnosis or every single test interpretation. Um, so it’s definitely easier from a screening standpoint. Um, you just have a panel of doctors that review like tons of things.
Whenever I have Shani on the show, we usually descend into a philosophical deconstruction of widely-accepted truths. But medicine is not a particularly philosophical field, so I always have to ask her, how is this useful to practicing physicians — and even more importantly, our patients?
I mean, I wish I knew that I don’t think as we were saying before, I don’t know that anyone really knows that. Like, I mean, what do you do then? Um, I mean, I think you, you have to rely on the best information that you have, but I think you have to take everything with a grain of salt and really, I mean, therein lies the art of medicine, right. I think you have to make care patient centered and be upfront with patients about these facts and kind of lean on them for how, uh, are they the kind of person who would want to know no matter what and would want to take the risk of a false positive, or are they someone who is willing to accept the risk of a false negative, or not doing the test at all and missing the fact that they have disease? And, you know, so a lot of, I think all of our decisions need to be based not only on, you know, the test characteristics, but there’s a whole scheme of things around those tests that we do that need to be factored in including patient preferences and whether or not the results are actually going to have clinical implications and influence the care of the patient.
If you’ve ever heard me speak about diagnosis, you know that I feel rather strongly about epistemology, and the uses — and limits — of different types of knowledge when taking care of actual human beings. But one thing I’ve always struggled with — and also been publicly called out for, and one that I still don’t have a particularly good answer for — is the best way to actually teach all of this nuance in a way that is productive, and not just nihilistic.
Like you, Adam, you’re just coming in and you’re just like dropping bombs and walking out. Yeah. You’re like,
See you later, guys. So how should we teach this?
I think all of medicine is about humility and knowing one’s limits, right? Like, knowing what you don’t know is really important. And so I think just knowing what you don’t know does a huge service to the practice of medicine. Um, because then, knowing that these tests are all worse than they are portrayed to be, you’ll make the extra effort to actually get the information from the patient, take a better history, weigh the patient’s preferences. It’s kind of like the fallibility of these tests helps us to potentially move away from such strict reliance on them and move back to what medicine was founded upon, which is incorporating all of these things in the absence of those tests. Tests should be like, you know, the cherry on the sundae; they should not be the sundae. Yeah.
I think I’m going to steal that quote — test characteristics are the cherry on top of the sundae, not the sundae itself. But this is getting ahead of ourselves a little bit. At this point, it’s still the 1940s. This new focus on sensitivity and specificity was a fundamental redefinition to the diagnostic model that had persisted at this point for a century and a half. You could do all the lung auscultation, abdominal percussion, urinalyses, and chest roentgenograms you wanted. A diagnosis was always subject to error, only as good as the sensitivity and specificity of the individual tests. But for the most part, the implications of Yerushalmy’s work went unheeded for the next couple decades outside of disease control. What Shani and I are alluding to — how test characteristics go “mainstream” to change clinical practice, and how many of the problems with them that Yerushalmy and his contemporaries identified would be downplayed and ignored — will have to be told in future episodes of the series as we continue to explore the evolution of the concept of diagnosis.
That is it for the show! Because this was a shorter episode, I have two special features! The first will be a new recurring segment, since I’m so terrible at doing #AdamAnswers — Stethospeaks, with Dr. Umme H. Faisal, who is going to discuss some of the amazing historical threads she posts on Twitter. And if you are deeply interested in some of the philosophical implications of test characteristics — and whether it even makes sense to talk about unchanging “characteristics” — keep listening after the end of this episode, because Shani and I have a lovely conversation about the “diagnostic paradigm” and a new way to think about the sensitivity and specificity of diagnostic tests.