Mistakes make it into many radiology reports. Deep learning can help fix that.

John Zech
6 min read · Dec 23, 2018


This post discusses our open access journal paper.

Imagine you are in an emergency department taking care of a patient who is having difficulty speaking.

You want to ensure that their symptoms are not due to a stroke. You order a CT angiogram of the head and neck to evaluate if there are any obstructed blood vessels in the brain that could have caused a stroke. A negative angiogram would not fully rule out a stroke, but any positive findings would be immediately concerning, and you would need to act on them quickly.

Example CT angiogram of the head courtesy of Mikael Häggström

The scan is completed, the radiologist reviews it, and they share their report with you electronically. You quickly scan down to the Impression section at the bottom to see what their conclusion was:

“THERE IS EVIDENCE OF A SIGNIFICANT STENOSIS, DISSECTION, OR THROMBOSIS.”

How do you react? You’re concerned but confused by how vague the report is. Why didn’t they say what the specific problem was? You scroll up and down the full report but can’t find any more detail on what is wrong: everything specifically commented on is negative.

You call the radiologist, and after re-reviewing the imaging, they tell you that no abnormalities were present and this line was mistaken. They amend their report:

“ADDENDUM: PLEASE NOTE CORRECTED IMPRESSION: THERE IS NO EVIDENCE OF A SIGNIFICANT STENOSIS, DISSECTION, OR THROMBOSIS.”

Mistakes like this happen

A 2015 review at the Mayo Clinic found error rates in radiology reports as high as 19.7% in neuroradiology and as low as 3.2% in chest x-rays, with a 9.7% overall rate.

A radiologist dictating a radiology report. Photo by w:User:Zackstarr (w:File:Radiologist in San Diego CA 2010.jpg), CC BY-SA 3.0

Radiologists dictate their reports directly into the medical record. They are under pressure to work quickly and are frequently interrupted while creating reports. All of that leads to mistakes.

Some may just be embarrassing, like inadvertently dictating into a report when talking to a colleague:

“THANK YOU HAVE A GOOD NIGHT”

Others, like the imagined example at the beginning of this article, can create serious clinical confusion. The review at Mayo found that nearly 20% of such errors were clinically significant.

Deep NLP for radiology

Researchers have published on deep learning methods to help radiologists interpret images (e.g., here, here, here). In our paper, we show they’re also useful for correcting the text of radiology reports.

It’s crazy that my text messages get corrected for errors with deep learning, but radiologists have no proofreading tools available except a spell-checker, which won’t catch the errors shown above. Radiology reports often have an immediate impact on patient care, like determining if a patient can be safely sent home from the emergency room or needs to be admitted to the hospital. Mistakes can have serious consequences.

We might want to use existing grammar-checking software, but it cannot parse the highly specialized language and syntax of radiology reports. I know this because I tried; it didn't work.

We need an approach customized to the strange lexical domain of radiology.

Our approach

To train a model that could identify and correct sentences containing errors, we needed training examples. We took radiology reports (head CTs and chest x-rays) and artificially created errors by randomly inserting, substituting, or deleting words in sentences from each report. Pairing each corrupted sentence with its original, uncorrupted version gave us the “typo” and “corrected” sentence pairs we used to train a sequence-to-sequence model: the model takes the corrupted sentence as input and learns to predict the original, uncorrupted sentence.
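To make this concrete, here is a minimal sketch of how such corrupted/clean sentence pairs might be generated. The tokenization, the 15% corruption rate, and the equal weighting of the three error types are illustrative assumptions, not the exact settings from the paper.

```python
import random

def corrupt_sentence(tokens, vocab, p=0.15, rng=random):
    """Randomly insert, substitute, or delete words to simulate dictation errors.

    tokens: a tokenized sentence from a report; vocab: words to sample
    insertions/substitutions from. The rate and weighting are illustrative.
    """
    corrupted = []
    for tok in tokens:
        if rng.random() < p:
            op = rng.choice(["insert", "substitute", "delete"])
            if op == "insert":
                corrupted.append(tok)
                corrupted.append(rng.choice(vocab))   # extra word slips in
            elif op == "substitute":
                corrupted.append(rng.choice(vocab))   # wrong word dictated
            # "delete": the token is simply dropped
        else:
            corrupted.append(tok)
    return corrupted

# Build a ("typo", "corrected") training pair from a clean report sentence.
clean = "there is no evidence of acute intracranial hemorrhage".split()
vocab = ["no", "mild", "left", "right", "evidence", "stenosis", "hemorrhage"]
pair = (corrupt_sentence(clean, vocab), clean)
```

Each clean sentence can be corrupted many times, giving the model plenty of examples of every error type.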

At a high level, this approach is similar to the one used by tools like Grammarly to correct e-mails. We believed it would work well on radiology reports because these reports are actually quite simple lexically (discussed here). Things tend to be said in similar ways, but in ways that differ from typical English prose, which is what makes pre-existing tools a poor fit for this problem.
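For readers who want to see what a model like this looks like in code, below is a bare-bones encoder-decoder (sequence-to-sequence) sketch in PyTorch with a toy training step. The GRU architecture, layer sizes, and dummy tensors are assumptions for illustration, not the configuration used in the paper.

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Minimal GRU encoder-decoder: corrupted sentence in, clean sentence out."""
    def __init__(self, vocab_size, emb_dim=128, hid_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.decoder = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.proj = nn.Linear(hid_dim, vocab_size)

    def forward(self, src_ids, tgt_in_ids):
        # Encode the (possibly corrupted) input sentence.
        _, hidden = self.encoder(self.embed(src_ids))
        # Teacher forcing: decode the clean sentence given the encoder state.
        dec_out, _ = self.decoder(self.embed(tgt_in_ids), hidden)
        return self.proj(dec_out)           # per-position logits over the vocabulary

VOCAB = 10_000                              # illustrative vocabulary size
model = Seq2Seq(VOCAB)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# Dummy batch standing in for tokenized (corrupted, clean) sentence pairs.
src = torch.randint(0, VOCAB, (32, 20))     # corrupted sentences
tgt_in = torch.randint(0, VOCAB, (32, 20))  # clean sentences, shifted right
tgt_out = torch.randint(0, VOCAB, (32, 20)) # clean sentences, shifted left

logits = model(src, tgt_in)
loss = loss_fn(logits.reshape(-1, VOCAB), tgt_out.reshape(-1))
loss.backward()
optimizer.step()
```

One simple way to turn a model like this into an error detector at inference time is to flag any sentence whose decoded output differs from its input, offering the decoded output as the suggested correction.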

Results

We found that our approach worked pretty well at correcting these random insertions, substitutions, and deletions of words in radiology reports:

“Seq2seq detected 90.3% and 88.2% of corrupted sentences with 97.7% and 98.8% specificity in same-site, same-modality test sets for head CTs and chest radiographs, respectively.”

In other words, our model, trained only on randomly introduced errors, was able to detect around 90% of such errors on new reports with very high specificity. That is an encouraging demonstration that an approach like this can catch errors in radiology reports.

We then took this model — again, trained on randomly created errors — and tried to use it to catch actual typographical errors in reports made by radiologists on a subset of our data. We found that 157/400 sentences it flagged as having errors actually had one (PPV 38.6%), and that the vast majority of sentences it deemed error-free were in fact correct (789/800, NPV 98.6%).
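As a quick reference for these metrics, the small sketch below shows how sensitivity, specificity, PPV, and NPV relate to the underlying counts; the only number plugged in is the 789/800 figure quoted above.

```python
def sensitivity(tp, fn):
    """Of the sentences that truly contain an error, the fraction flagged."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """Of the error-free sentences, the fraction left unflagged."""
    return tn / (tn + fp)

def ppv(tp, fp):
    """Of the flagged sentences, the fraction that truly contain an error."""
    return tp / (tp + fp)

def npv(tn, fn):
    """Of the unflagged sentences, the fraction that are truly error-free."""
    return tn / (tn + fn)

# Real-error evaluation above: 789 of the 800 sentences the model passed as
# error-free really were error-free.
print(f"NPV = {npv(789, 800 - 789):.1%}")   # -> NPV = 98.6%
```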

Generalization

We were curious to see how well this approach would generalize. If we trained a model on head CT reports at one hospital and used it at another, would it still work? And if we trained a model on head CT reports and used it on chest x-ray reports, would it work?

We found that, basically, it did not: performance degraded significantly. We really needed to train the model on the specific type of report (chest x-ray, head CT, etc.) from the specific hospital to get good performance. We could roll reports from different hospitals into one model and still do well, but the model had to have seen the type of report it was being asked to correct.

This made sense to us — reporting templates are often very different across institutions, and certainly very different between different types of exams (e.g., head CT vs chest x-ray).

Conclusion and going forward

tl;dr: a deep learning model trained to identify errors in radiology reports worked well on simulated data, and worked okay on real-world data — but it needs to see reports from your specific hospital to work.

There are some clear directions to go in next to keep improving this approach.

It should be extended to consider the report in its entirety rather than modeling each sentence in isolation. Treating each sentence separately means we lose information that would be useful in catching errors (e.g., inconsistencies between the findings and impression sections).

Additional work should be done to thoughtfully introduce specific types of important errors into the training data, so the model learns to catch the most dangerous mistakes, e.g., incorrect negation and laterality errors.
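As a toy illustration of what such targeted corruption could look like, the sketch below drops negations or flips laterality terms. The word lists and the 50/50 choice between the two error types are hypothetical, not taken from the paper.

```python
import random

LATERALITY_SWAPS = {"left": "right", "right": "left"}   # illustrative word lists
NEGATION_WORDS = {"no", "not", "without"}

def inject_dangerous_error(tokens, rng=random):
    """Corrupt a sentence with a clinically dangerous error: drop a negation
    or flip a laterality term. Returns (corrupted_tokens, error_type)."""
    neg_idx = [i for i, t in enumerate(tokens) if t.lower() in NEGATION_WORDS]
    lat_idx = [i for i, t in enumerate(tokens) if t.lower() in LATERALITY_SWAPS]
    if neg_idx and (not lat_idx or rng.random() < 0.5):
        i = rng.choice(neg_idx)
        return tokens[:i] + tokens[i + 1:], "negation_dropped"
    if lat_idx:
        i = rng.choice(lat_idx)
        flipped = tokens[:i] + [LATERALITY_SWAPS[tokens[i].lower()]] + tokens[i + 1:]
        return flipped, "laterality_flipped"
    return tokens, None                      # nothing dangerous to corrupt here

# One possible outcome:
# "no acute infarct in the left mca territory" -> "acute infarct in the left mca territory"
corrupted, kind = inject_dangerous_error("no acute infarct in the left mca territory".split())
```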

Finally, we need to harness the fact that radiologists frequently do correct many of their own errors in real-time, providing an amazing resource to learn how they want their reports corrected. Those corrections are much more useful training examples — the model would adapt to your specific corrections and learn what you need. That would be straightforward to do technically, and would lead to strong real-world usefulness.

All of these refinements could improve the accuracy of this approach, raising sensitivity and PPV to levels where most errors are caught and most suggested corrections are correct.

Even though it’s still very much a work in progress, I would want to have this available to me when I’m dictating reports to help me avoid these kinds of mistakes. With continued refinement, I expect there will come a day in the not too distant future when radiologists will demand this type of software and will balk at the idea of working without it. I hope companies like Nuance start to incorporate this type of approach into their dictation software and that they do it in a way that is effective rather than frustrating. If they don’t, I’ll be running this on top of their software myself.

To learn more, read the full research article here.
