What Makes a Good Clinical Decision Instrument?

We come across a lot of academic papers and research at MDCalc when figuring out what to add to the site next. There’s a huge range of information that we’ll add to MDCalc, including scores, algorithms, “decision rules,” referenced lists of accepted information (like exclusion criteria for TPA), and actual math equations. (We end up referring to these all as “calculators,” just so that it’s easy to know what we’re referring to.)

But not all “calculators” are created equal, of course. Some are better than others, for a number of reasons.

  1. How strong is its evidence? Probably first and most importantly, does the calculator appear to do what it’s supposed to do? If the paper states its job is to figure out who has right ear pain vs who has left ear pain, did it do that according to the results? And, taking it an important step further – something we typically require on MDCalc – did it get validated?
  2. Is it solving or helping in a clinical conundrum? You could imagine someone coming up with a clinical decision instrument for ear pain:
    • Which ear does the patient have pain in?
    • Does that ear look red?
    • Is that ear tender?

    But obviously no one needs a score for this, because that’s just what you do as a clinician. It’s obvious. It’s one of the criticisms people have of some of our calculators, including the HEART Score for Major Cardiac Events, specifically its elevated troponin criterion. We all know that patients with chest pain and an elevated troponin are much more likely to have a poor outcome, so obviously those patients require admission to the hospital – no one needs a rule or instrument for that.

  3. Are terms well-defined? It often takes detective work to figure out where a particular criterion is defined in the paper; often terms are not clear at all, and we end up contacting authors to figure out exactly what they meant by “Heart Rate > 100,” or “Recent Surgery.” Heart Rate > 100 initially, or ever? How recent is recent?
  4. Is it reasonably easy to perform? While hopefully MDCalc makes it much easier to use any decision instrument and takes away your mnemonics and rote memorization, it’s really important that a user can move through the score with relative ease. For example, the APACHE II Score is widely criticized for being incredibly complex, long, and requiring a huge number of data points. And if you’re missing one of them, you then have to potentially order additional laboratory tests to calculate it. When possible, scores should be straightforward and easy to perform with as few pieces of clinical data as possible.

Those are some of the criteria that help us determine if a piece of research should join the MDCalc reference list. Next, we’ll dive deeper into some of these categories and talk more about poor clinical decision instruments.

Discover: the qSOFA Score for Sepsis!

By now, hopefully you’ve heard that new sepsis definitions and criteria were released last month, and we wanted to take a moment to dive a little deeper into one of the big new additions: quickSOFA (qSOFA).

Wait, why did sepsis get re-defined?

If you recall, we’d been using the SIRS Criteria and Sepsis definitions for many years, Continue reading “Discover: the qSOFA Score for Sepsis!”

The Challenges of the Medical Gold Standard Test

Long before there were lab tests and x-rays and CT scans, doctors were diagnosing disease. Diseases were described – as they still are today – as a collection of signs and symptoms. (A syndrome is technically any collection of signs and symptoms, with the term disease suggesting “disorder” or derangement from normal.) At some point, doctors started cutting the dead open to see what was actually happening to these patients on the inside, and then describing those findings as well. Medical school is still taught this way. You study a disease’s pathology at length, including what it looks like and what’s happening at a cellular level.

And since the beginning of time, patients have always come with only their signs and symptoms; patients don’t carry around a placard telling you what their disease is. So doctors started wondering, “There must be some way to figure out which of these patients with vomiting and abdominal pain have appendicitis, without having to do surgery on them.” And as testing began, so did the idea of a gold standard: the best, most absolute proof that a patient has a certain disease. If you have the gold standard, you’ve got the disease. In the case of appendicitis, it’s an inflamed, infected appendix when the surgeon cuts you open, with signs of appendicitis when the pathologist looks under the microscope after surgery.

But there are often a few problems with the gold standard concept when you apply it to us humans:

  • First, the gold standard isn’t always so physically apparent as a swollen appendix. To take the most abstract example, how do you come up with a gold standard for say, depression, or alcoholism? (They exist, but they’re obviously not based on what depression looks like under a microscope.)
  • Second, the gold standard test is very often very invasive, so you don’t always want to use the gold standard to diagnose every single disease. Imagine if we just had to cut open everyone who has vomiting and abdominal pain? Or if we cut into the brain of every person with a headache to see which ones have a brain tumor?
  • Next, even the gold standard test can be imperfect. Gout’s gold standard is joint fluid showing monosodium urate crystals, but even experts admit that this test isn’t 100% reliable. Maybe the fluid you get just happens to not have any crystals in it by pure luck. Or maybe there are too few crystals to find.
  • Not only can the gold standard be invasive, but it can be really resource intensive. Take for example the gold standard for knowing if a patient has bacteria growing in their blood. It can sometimes take 3-5 days (and almost always at least 24 hours) for these tests to give results. Who can wait five days for a test when the disease might leave the patient dead in two?
  • Finally, depending on the disease, sometimes we don’t even need to use the gold standard. The disease is so mild and temporary that the gold standard is just a waste. Take the common cold: while there’s certainly tests we can do to confirm that a patient with a runny nose has a cold… who cares? It’s a cold!

And thus, other testing was born. Lab tests, CT scans, MRIs, EKGs, and even scores and calculators like we have on MDCalc. These were great, but it’s taken decades (and this work is on-going) to figure out how good these tests are compared to that “gold standard.” And to do this work, we have to run both the test and the gold standard on the same patients and then see how often the test agreed with the gold standard. For example, a CT scan is a test we often order for patients we think have appendicitis. And while CT is an excellent test to see which patients will turn out to have gold standard appendicitis, even CT isn’t perfect. (It’s probably about 95-98% accurate, which is pretty incredible, but still not perfect.)
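To make that comparison concrete, here is a minimal Python sketch of how a test gets scored against the gold standard. The patient counts are invented for illustration (they are not from any appendicitis study), and `compare_to_gold_standard` is just a hypothetical helper name.

```python
# Hypothetical paired results: (CT said appendicitis?, pathology confirmed appendicitis?)
# These counts are made up for illustration and do not come from a real study.
paired_results = (
    [(True, True)] * 95      # true positives
    + [(False, True)] * 5    # false negatives
    + [(True, False)] * 4    # false positives
    + [(False, False)] * 96  # true negatives
)


def compare_to_gold_standard(results):
    """Return (sensitivity, specificity) of a test versus the gold standard."""
    tp = sum(1 for test, gold in results if test and gold)
    fn = sum(1 for test, gold in results if not test and gold)
    fp = sum(1 for test, gold in results if test and not gold)
    tn = sum(1 for test, gold in results if not test and not gold)
    sensitivity = tp / (tp + fn)  # of patients with the disease, fraction the test caught
    specificity = tn / (tn + fp)  # of patients without it, fraction the test cleared
    return sensitivity, specificity


sensitivity, specificity = compare_to_gold_standard(paired_results)
print(f"Sensitivity: {sensitivity:.2f}, Specificity: {specificity:.2f}")
```

Every patient needs both results for this to work, which is exactly why an invasive or slow gold standard makes this kind of research hard.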

The scores on MDCalc are used just like lab tests or CT scans: how good are they at predicting which diseases or outcomes a patient has compared to the gold standard?

Discover the Acute Gout Diagnosis Rule!

As part of our mission to help physicians discover new calculators that may help them in their practice (and potentially reduce unnecessary testing and provide better, more evidence-based, efficient and safer care to patients) we’re starting a new series to update physicians about new calculators added to the site that they may not be aware of.

First up, a disease that is near and dear to many physicians’ joints: gout!

TL;DR: The Acute Gout Diagnosis Rule is a validated decision instrument that aims to reduce unnecessary testing for gout (we’re talking joint aspiration), encourage appropriate testing, and prevent other critical arthritis diagnoses from being missed. That’s pretty much everything you want from a score like this.

The Background and Goals: Gout is often diagnosed clinically by physicians, so researchers wanted to know — how good are physicians at this? They also wanted to see if they could improve this diagnosis and help risk stratify patients into high risk groups that could be safely started on gout treatment, medium risk groups that would benefit the most from joint aspiration (which is often painful, and does carry some risk), and low risk groups where it’s probably not gout and other causes of joint pain should be explored.

The Study: They took patients with monoarthritis and asked them a bunch of questions, examined them, took blood work, and then tapped everyone’s painful joint (the gold standard) and then looked to see which criteria predicted gout. They also asked physicians to predict which patients had gout, to see how good physicians are compared to the score.

The Results: The variables in the score were obviously the most associated with gout, with a high serum uric acid level being the most predictive, followed by the affected joint being the big toe’s metatarsophalangeal joint. (Tophus presence was actually the most predictive, at 100%, but was a pretty uncommon finding (12.9%).)

So they pulled out tophus (figuratively; that would hurt otherwise), and then ran a bunch of statistical analyses, and found that the rule was very good, with an AUC of 0.85 if you used labs, and 0.82 if you didn’t have lab results.

This score then got validated in another (ethnically similar) population. Gout was very likely in patients with a score of ≥8 (80% of these patients had gout), and was very unlikely in patients with a score ≤4 (only 2.8%). (The score did better than the family physicians, by the way.)
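As a rough sketch, those cutoffs can be written as a tiny Python function. It takes an already-calculated total (the individual scoring items aren’t listed in this post), the function name is hypothetical, and treating everything between 4 and 8 as the “intermediate” group that benefits most from joint aspiration is our reading of the study goals rather than a quoted result.

```python
def interpret_gout_score(total_score):
    """Map an already-calculated rule total to the three groups described in the post."""
    if total_score >= 8:
        return "high"          # 80% of validation patients in this range had gout
    if total_score <= 4:
        return "low"           # only 2.8% had gout, so broaden the differential
    return "intermediate"      # assumed: the group most likely to benefit from aspiration


for total in (2, 6, 9):
    print(total, interpret_gout_score(total))
```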

Our Take: If you’re sure it’s gout, you’re probably right. But if you have any concerns or thoughts that it might not be, or something isn’t totally fitting, try this score. It can help you figure out who you should probably tap, or at least follow closely if they’re not improving or are getting worse, and in which patients you should broaden your differential, because it probably isn’t gout at all.

How-To: Display Calculators Outside Your Specialty, and Do Complex Searches in MDCalc iOS

We had an outstanding review from the iMedicalApps team (thank you, Douglas!) and quickly realized our extensive search feature might not be getting all the love we think it deserves. We spent a lot of time trying to figure out how to make filters and searches intuitive and easy to understand while still maintaining some powerful filtering abilities, so I figured I’d show a few examples.

The search and filter system lets you display calculators outside of your specialty, and can even combine specialties. Just start typing the specialty you want, and then tap it when it appears in the quick filter bar below:
Continue reading “How-To: Display Calculators Outside Your Specialty, and Do Complex Searches in MDCalc iOS”

What’s a Receiver Operating Characteristic (ROC) Curve? What’s the Area Under the Curve (AUC)? And why do I care?

Or: How to tell if a test is helpful or not.

TL;DR: A really good, accurate test has a ROC line that hugs the upper left corner of the graph and has an AUC very close to 1.0, and a worthless one has an AUC of 0.5.

I want to give you a simple way to tell if the scores and tests that you rely on (and many of which we publish on MDCalc) are good, and how good they are at separating patients with the disease you’re worried about from those without it.

That simple way is called the Area Under the Curve (AUC), or the c-statistic, and you get it from the Receiver Operating Characteristic (ROC) curve. We’ll talk about the ROC curves you might see in papers, but first we have to go back to diseases, testing, sensitivity, and specificity.

We all know that sensitivity and specificity are almost always at odds. In almost all diseases, there’s some overlap in patients between health and disease when we try to apply a test to them. If we tried to make a rule for myocardial infarction based only on “Does the patient have chest pain?” we know that many patients with myocardial infarction (but not all) have chest pain. So we’re going to miss some patients with MI if they don’t have chest pain, using that simple rule.

This graph summarizes this well:

Sensitivity and Specificity Curves

From StomponStep1.com

So what we really want to know is: If I’m going to use a test to determine if someone has a disease I’m worried about, is that a good test? And that’s called accuracy. Accuracy says how well a test separates people into groups with the disease, and groups without the disease.

Would “Does the patient have chest pain?” be a good test for myocardial infarction? No, of course not. Because it doesn’t separate people into “Having MI” and “Not having MI” very well.

But there are lots of other tests for myocardial infarction. How bad is “Does the patient have chest pain?” compared to other tests? And that’s where the ROC and the AUC come in. They let you quantify and compare how good or how bad diagnostic tests are (how accurate they are).

One final issue: to build a ROC curve, the test has to give a continuous (or at least graded) result. So “Does the patient have chest pain? Yes/No” actually wouldn’t work, but “How bad is your chest pain, on a scale of 0-10?” would work just fine. (One way people get around this with labs that use cut-offs is to run the numbers with multiple cut-offs: Lactate <2, Lactate 2-4, or Lactate >4, for example.)
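The multiple cut-off trick can be sketched in a few lines of Python. The function name is hypothetical, and the lactate value is assumed to be in the same units as the cut-offs above.

```python
def lactate_level(lactate):
    """Bin a lactate value into the three ordered levels mentioned above."""
    if lactate < 2:
        return 0  # Lactate <2
    if lactate <= 4:
        return 1  # Lactate 2-4
    return 2      # Lactate >4


# Each level can then be treated as its own cutoff when plotting the ROC.
print([lactate_level(value) for value in (1.1, 2.0, 3.5, 4.0, 6.2)])
```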

The ROC curve plots true positives against false positives. Y axis: true positive rate (sensitivity). X axis: false positive rate (1 - specificity). You want lots of the former and none of the latter, so if you just plot these out at different cutoffs or levels, you get points on the graph. Connect those points, and that makes the curve. That’s it.

Let’s say you’re looking at troponin for diagnosing myocardial infarction. A cutoff of 0.01 catches nearly all of the true positives but also picks up a lot of false positives: it’s really sensitive but not very specific at all.

A cutoff of 0.5 is going to be less sensitive but more specific:

And a troponin of 25 is very specific but not very sensitive. Or: it’s really rare to have a false positive with a troponin of 25, but it’s going to miss a lot of the true positives if your cutoff is 25, too.
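Here is a small Python sketch of those three cutoffs in action. The troponin values are invented for illustration (they are not patient data), and `roc_point` is a hypothetical helper that treats any value at or above the cutoff as a positive test.

```python
# Hypothetical troponin values, made up for illustration only.
troponin_with_mi = [0.02, 0.3, 0.8, 2.0, 6.0, 12.0, 30.0, 40.0]
troponin_without_mi = [0.0, 0.0, 0.005, 0.02, 0.03, 0.05, 0.2, 0.6]


def roc_point(cutoff, diseased, healthy):
    """Return (false positive rate, true positive rate) when value >= cutoff counts as positive."""
    true_positive_rate = sum(v >= cutoff for v in diseased) / len(diseased)
    false_positive_rate = sum(v >= cutoff for v in healthy) / len(healthy)
    return false_positive_rate, true_positive_rate


for cutoff in (0.01, 0.5, 25):
    fpr, tpr = roc_point(cutoff, troponin_with_mi, troponin_without_mi)
    print(f"cutoff {cutoff}: sensitivity {tpr:.2f}, specificity {1 - fpr:.2f}")
```

Each cutoff gives one (x, y) point; raising the cutoff walks you down and to the left along the curve, trading sensitivity for specificity.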

Now let’s take it one step further: if you calculate how much area on the graph is under the curve, that’s the AUC (area under curve). And the AUC lets you compare tests easily by seeing how much area each test takes up on that standard graph.
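One common way to get that area is to add up the trapezoids between neighboring points. Below is a minimal Python sketch of that idea; the ROC points are hypothetical, and statistical packages may compute the AUC differently.

```python
def area_under_curve(points):
    """Trapezoidal area under ROC points given as (false positive rate, true positive rate)."""
    # Anchor the curve at the corners (0, 0) and (1, 1), then sort left to right.
    pts = sorted(set(points) | {(0.0, 0.0), (1.0, 1.0)})
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2  # width times average height
    return area


# Hypothetical ROC points, invented for illustration.
decent_test = [(0.0, 0.25), (0.125, 0.75), (0.625, 1.0)]
coin_flip = [(0.25, 0.25), (0.5, 0.5), (0.75, 0.75)]

print(area_under_curve(decent_test))
print(area_under_curve(coin_flip))
```

The coin-flip points sit on the diagonal, so their area comes out to one half, which is why 0.5 is the floor for a worthless test.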

Here’s a rough way of categorizing AUCs, which range from 0.5 – 1.

  • 0.90-1.0 = Excellent Test and Accuracy
  • 0.80-0.90 = Good Test and Accuracy
  • 0.70-0.80 = Fair Test and Accuracy
  • 0.60-0.70 = Poor Test and Accuracy
  • 0.50-0.60 = Failed Test and Accuracy
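If you want that rough scale as code, here is a short Python sketch. The function name is hypothetical, and because the ranges above share their endpoints, it assumes a boundary value (like exactly 0.80) belongs to the better category.

```python
def describe_auc(auc):
    """Translate an AUC into the rough categories listed above."""
    if not 0.5 <= auc <= 1.0:
        raise ValueError("this rough scale only covers AUCs from 0.5 to 1.0")
    if auc >= 0.9:
        return "Excellent"
    if auc >= 0.8:
        return "Good"
    if auc >= 0.7:
        return "Fair"
    if auc >= 0.6:
        return "Poor"
    return "Failed"


for auc in (0.95, 0.85, 0.72, 0.5):
    print(auc, describe_auc(auc))
```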

For you visual learners, we’ve got a chart! Let’s look at a few tests for diagnosing myocardial infarction:

  1. Worthless Test: “How Bad Does Your Ankle Hurt?”
  2. Slightly Better Test: “How Bad Does Your Chest Hurt?”
  3. Better Test: “How Bad Does Your Chest Hurt And Is Your EKG Concerning for Heart Attack?”
  4. Good Test: “Is Your EKG Concerning and What is Your Troponin Level?”
  5. Very Good Test: “Is Your EKG Concerning and What is Your Troponin Level and Repeat Troponin Level at 6 Hours?”

And each curve for each test:

And now, each area for each test:

Hopefully we’ve shed some light on what can often be a pretty confusing topic. Our goal is to start documenting and categorizing AUCs for tests for calculators on MDCalc, so that we can compare apples to apples when users are trying to evaluate how accurate a test on the site is.

Next up: The Problems of the Gold Standard!

Looking for more? The University of Nebraska Medical Center has a great overview of ROCs and AUCs, and Rahul Patwari has an excellent YouTube video.

Upgrading the MDCalc App to v1.1

Note: This only applies to users who already downloaded the app this first week! (Thank you thank you thank you!) If you’re newly downloading the app (or you delete the app from your phone and then re-install) you should not have these issues.

Update Saturday March 12: Some people are reporting favorites removed with this update. We’re looking into this.

As we were preparing to release the app outside of the US, we had to make some changes to the app to distribute it internationally. These changes unfortunately broke the download mechanism and so our early adopter users may see a blank screen briefly. Very sorry!

To upgrade:

  1. Download the new version of the app from the iTunes App Store (still free!)
  2. When you open the app you may see a blank list without equations. Do not panic.
  3. Go to the Settings icon in the upper right hand corner, and swipe down to the “Reload Calculators” button and tap it, and then hit “Okay.”
  4. Once all the calculators have downloaded, you’re as good as new.

A GIF version below:

Continue reading “Upgrading the MDCalc App to v1.1”

Welcome to Paging MDCalc!

Users are always emailing us asking, “Hey, which calculator is best for chest pain,” or “When shouldn’t I be using a particular calculator?”

We also wanted a place for further discussions about decision support in medicine, as well as a place to provide tips and tricks to using our brand new iPhone app, so we thought we’d make a separate content site for extended prose, discussion, and support.

Welcome to Paging MDCalc!