The Challenges of the Medical Gold Standard Test

Long before there were lab tests and x-rays and CT scans, doctors were diagnosing disease. Diseases were described – as they still are today – as a collection of signs and symptoms. (A syndrome is technically any collection of signs and symptoms, with the term disease suggesting “disorder” or derangement from normal.) At some point, doctors started cutting the dead open to see what was actually happening to these patients on the inside, and then describing those findings as well. Medical school is still taught this way. You study a disease’s pathology at length, including what it looks like and what’s happening at a cellular level.

And since the beginning of time, patients have always come with only their signs and symptoms; patients don’t carry around a placard telling you what their disease is. So doctors started wondering, “There must be some way to figure out which of these patients with vomiting and abdominal pain have appendicitis, without having to do surgery on them.” And as testing began, so did the idea of a gold standard: the best, most absolute proof that a patient has a certain disease. If you have the gold standard, you’ve got the disease. In the case of appendicitis, it’s an inflamed, infected appendix when the surgeon cuts you open, with signs of appendicitis when the pathologist looks under the microscope after surgery.

But there are often a few problems with the gold standard concept when you apply it to us humans:

  • First, the gold standard isn’t always as physically apparent as a swollen appendix. To take the most abstract example, how do you come up with a gold standard for, say, depression or alcoholism? (They exist, but they’re obviously not based on what depression looks like under a microscope.)
  • Second, the gold standard test is very often very invasive, so you don’t always want to use it to diagnose every single disease. Imagine if we had to cut open everyone who has vomiting and abdominal pain? Or cut into the brain of every person with a headache to see which ones have a brain tumor?
  • Next, even the gold standard test can be imperfect. Gout’s gold standard is joint fluid showing monosodium urate crystals, but even experts admit that this test isn’t 100% reliable. Maybe the fluid you get just happens to not have any crystals in it by pure luck. Or maybe there are too few crystals to find.
  • Not only can the gold standard be invasive, it can also be really resource-intensive. Take, for example, the gold standard for knowing whether a patient has bacteria growing in their blood. It can sometimes take 3-5 days (and almost always at least 24 hours) for these tests to give results. Who can wait five days for a test when the disease might leave the patient dead in two?
  • Finally, depending on the disease, sometimes we don’t even need the gold standard. The disease is so mild and temporary that the gold standard is just a waste. Take the common cold: while there are certainly tests we can do to confirm that a patient with a runny nose has a cold… who cares? It’s a cold!

And thus, other testing was born: lab tests, CT scans, MRIs, EKGs, and even scores and calculators like we have on MDCalc. These were great, but it’s taken decades (and this work is ongoing) to figure out how good these tests are compared to that “gold standard.” And to do this work, we have to perform both the new test and the gold standard test, and then see how well the new test did. For example, a CT scan is a test we often order for patients we think have appendicitis. And while CT is an excellent test for picking out which patients will have gold-standard-confirmed appendicitis, even CT isn’t perfect. (It’s probably about 95-98% accurate, which is pretty incredible, but still not perfect.)
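To make that comparison concrete, here is a minimal sketch (in Python, using entirely hypothetical counts, not real study data) of how a test’s results get tallied against the gold standard. The two numbers that come out of this tally are the usual measures of “how good” a test is: sensitivity and specificity.

```python
# Minimal sketch: comparing a test (e.g., CT) against the gold standard
# (e.g., surgical/pathology findings). All counts below are hypothetical.

def test_performance(true_pos, false_pos, false_neg, true_neg):
    """Compute basic accuracy measures from a 2x2 comparison
    of a test's results against the gold standard."""
    # Of patients who truly have the disease, how many does the test catch?
    sensitivity = true_pos / (true_pos + false_neg)
    # Of patients who truly don't have it, how many does the test correctly clear?
    specificity = true_neg / (true_neg + false_pos)
    return sensitivity, specificity

# Hypothetical study: 100 patients with gold-standard-confirmed appendicitis and
# 100 without; CT flags 96 of the true cases and wrongly flags 5 disease-free patients.
sens, spec = test_performance(true_pos=96, false_pos=5, false_neg=4, true_neg=95)
print(f"Sensitivity: {sens:.0%}, Specificity: {spec:.0%}")
# Sensitivity: 96%, Specificity: 95%
```

The same tally works whether the “test” is a CT scan, a lab value, or a clinical score: every patient gets both the test and the gold standard, and the 2x2 table tells you how often the test got it right.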

The scores on MDCalc are evaluated just like lab tests or CT scans: how good are they at predicting which diseases or outcomes a patient has, compared to the gold standard?