MIT News - Computer Science and Artificial Intelligence Laboratory · September 25
Study reveals problems with the confidence radiologists express when describing findings

A new study finds that when radiologists describe specific pathologies in medical images such as X-rays, the confidence their wording conveys often does not match their actual diagnostic accuracy. For example, physicians tend to be overconfident when using phrases like "very likely" and underconfident when using words like "possibly." MIT researchers, in collaboration with colleagues at Harvard Medical School, developed a framework to quantify how reliable radiologists are when they express certainty in natural language. The framework can help radiologists choose more accurate wording to improve the reliability of their reports, and it can also measure and improve how accurately large language models express confidence, better aligning the confidence phrases a model generates with the accuracy of its predictions. By improving the accuracy of clinical reports, the research stands to ultimately benefit patients.

🔬 **Radiologists' confidence diverges from diagnostic accuracy**: The study shows that the confidence conveyed by the words radiologists use (such as "possibly" and "likely") when describing pathologies in X-rays and other medical images is not always consistent with their actual diagnostic accuracy. For example, "very likely" is sometimes overconfident and "possibly" sometimes underconfident, and this mismatch can affect downstream clinical decisions.

💡 **A novel framework quantifies the reliability of language**: The MIT team developed a framework that quantifies how reliable radiologists are when they express certainty in natural language. The framework treats uncertainty phrases as probability distributions rather than single numbers, capturing their meaning in finer detail and assessing how well it matches reality.

✅ **Improving report accuracy and model calibration**: Beyond suggesting more accurate wording to improve the reliability of clinical reports, the framework can be applied to large language models, helping them better calibrate the confidence they express so that it matches the accuracy of their predictions, improving the reliability of AI in medicine.

🏥 **Better clinical decisions, benefiting patients**: By making radiologists' descriptions of the likelihood of pathologies more accurate, the research stands to improve the trustworthiness of clinical information. This is critical for guiding physicians toward better-informed treatment decisions, and patients will be the ultimate beneficiaries.

Due to the inherent ambiguity in medical images like X-rays, radiologists often use words like “may” or “likely” when describing the presence of a certain pathology, such as pneumonia.

But do the words radiologists use to express their confidence level accurately reflect how often a particular pathology occurs in patients? A new study shows that when radiologists express confidence about a certain pathology using a phrase like "very likely," they tend to be overconfident, and vice versa when they express less confidence using a word like "possibly."

Using clinical data, a multidisciplinary team of MIT researchers in collaboration with researchers and clinicians at hospitals affiliated with Harvard Medical School created a framework to quantify how reliable radiologists are when they express certainty using natural language terms.

They used this approach to provide clear suggestions that help radiologists choose certainty phrases that would improve the reliability of their clinical reporting. They also showed that the same technique can effectively measure and improve the calibration of large language models by better aligning the words models use to express confidence with the accuracy of their predictions.

By helping radiologists more accurately describe the likelihood of certain pathologies in medical images, this new framework could improve the reliability of critical clinical information.

“The words radiologists use are important. They affect how doctors intervene, in terms of their decision making for the patient. If these practitioners can be more reliable in their reporting, patients will be the ultimate beneficiaries,” says Peiqi Wang, an MIT graduate student and lead author of a paper on this research.

He is joined on the paper by senior author Polina Golland, the Sunlin and Priscilla Chou Professor of Electrical Engineering and Computer Science (EECS), a principal investigator in the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL), and the leader of the Medical Vision Group; as well as Barbara D. Lam, a clinical fellow at the Beth Israel Deaconess Medical Center; Yingcheng Liu, an MIT graduate student; Ameneh Asgari-Targhi, a research fellow at Massachusetts General Brigham (MGB); Rameswar Panda, a research staff member at the MIT-IBM Watson AI Lab; William M. Wells, a professor of radiology at MGB and a research scientist in CSAIL; and Tina Kapur, an assistant professor of radiology at MGB. The research will be presented at the International Conference on Learning Representations.

Decoding uncertainty in words

A radiologist writing a report about a chest X-ray might say the image shows a “possible” pneumonia, which is an infection that inflames the air sacs in the lungs. In that case, a doctor could order a follow-up CT scan to confirm the diagnosis.

However, if the radiologist writes that the X-ray shows a “likely” pneumonia, the doctor might begin treatment immediately, such as by prescribing antibiotics, while still ordering additional tests to assess severity.

Trying to measure the calibration, or reliability, of ambiguous natural language terms like “possibly” and “likely” presents many challenges, Wang says.

Existing calibration methods typically rely on the confidence score provided by an AI model, which represents the model’s estimated likelihood that its prediction is correct.

For instance, a weather app might predict an 83 percent chance of rain tomorrow. That model is well-calibrated if, across all instances where it predicts an 83 percent chance of rain, it rains approximately 83 percent of the time.
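
To make the idea concrete, here is a minimal Python sketch of this classical notion of calibration: predictions are grouped into buckets by confidence score, and each bucket's average confidence is compared with its empirical rate of positives. The function name and the synthetic data are illustrative, not part of the study.

```python
import numpy as np

def expected_calibration_error(confidences, outcomes, n_bins=10):
    """Group predictions into confidence buckets and compare each
    bucket's mean confidence with its empirical rate of positives."""
    confidences = np.asarray(confidences, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences >= lo) & (confidences < hi)
        if mask.any():
            gap = abs(confidences[mask].mean() - outcomes[mask].mean())
            ece += mask.mean() * gap  # weight the gap by bucket size
    return ece

# A well-calibrated forecaster: on days with an "83 percent chance
# of rain," it should rain about 83 percent of the time.
rng = np.random.default_rng(0)
conf = rng.uniform(0.1, 0.9, size=5000)
rain = rng.random(5000) < conf  # outcomes occur at the stated rates
print(f"ECE of a calibrated forecaster: "
      f"{expected_calibration_error(conf, rain):.3f}")
```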

“But humans use natural language, and if we map these phrases to a single number, it is not an accurate description of the real world. If a person says an event is ‘likely,’ they aren’t necessarily thinking of the exact probability, such as 75 percent,” Wang says.

Rather than trying to map certainty phrases to a single percentage, the researchers’ approach treats them as probability distributions. A distribution describes the range of possible values and their likelihoods — think of the classic bell curve in statistics.

“This captures more nuances of what each word means,” Wang adds.

Assessing and improving calibration

The researchers leveraged prior work that surveyed radiologists to obtain probability distributions that correspond to each diagnostic certainty phrase, ranging from “very likely” to “consistent with.”

For instance, since most radiologists believe the phrase "consistent with" means a pathology is present in a medical image, its probability distribution climbs sharply to a high peak, with most values clustered around the 90 to 100 percent range.

In contrast, the phrase "may represent" conveys greater uncertainty, leading to a broader, bell-shaped distribution centered around 50 percent.
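
As an illustration of this view, the sketch below models a few certainty phrases as Beta distributions with roughly the shapes described above. The specific phrases and parameters are assumptions for the example, not the survey-derived distributions from the study.

```python
from scipy import stats

# Hypothetical phrase-to-distribution mapping. The Beta parameters
# below are illustrative assumptions, not the study's survey results.
PHRASE_DISTRIBUTIONS = {
    "consistent with": stats.beta(28, 2),  # sharp peak above 0.9
    "likely": stats.beta(6, 2),            # centered around 0.75
    "may represent": stats.beta(4, 4),     # broad, centered near 0.5
    "unlikely": stats.beta(2, 8),          # mass in the low range
}

for phrase, dist in PHRASE_DISTRIBUTIONS.items():
    lo, hi = dist.interval(0.8)  # interval holding the central 80% of mass
    print(f"{phrase!r}: mean {dist.mean():.2f}, "
          f"80% of mass in [{lo:.2f}, {hi:.2f}]")
```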

Typical methods evaluate calibration by comparing how well a model’s predicted probability scores align with the actual number of positive results.

The researchers’ approach follows the same general framework but extends it to account for the fact that certainty phrases represent probability distributions rather than probabilities.
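
One simple way to realize that extension, sketched below under the same assumed phrase distributions: for each phrase, compute the observed rate of positive findings among the reports that used it, and ask where that rate falls under the phrase's distribution. Quantiles near 0 suggest the phrase is used overconfidently; quantiles near 1 suggest underconfidence. This is an illustrative stand-in, not the paper's exact estimator.

```python
import numpy as np
from scipy import stats

# Illustrative phrase distributions (same assumption as the sketch above).
PHRASES = {"likely": stats.beta(6, 2), "may represent": stats.beta(4, 4)}

def phrase_calibration(reports, phrases=PHRASES):
    """reports: iterable of (phrase, outcome) pairs, outcome in {0, 1}.
    For each phrase, return the observed positive rate and its quantile
    under the phrase's assumed distribution."""
    summary = {}
    for phrase, dist in phrases.items():
        outcomes = [y for p, y in reports if p == phrase]
        if outcomes:
            rate = float(np.mean(outcomes))
            summary[phrase] = (rate, float(dist.cdf(rate)))
    return summary

# "likely" puts its mass around 0.75, but only half are confirmed:
reports = [("likely", 1)] * 50 + [("likely", 0)] * 50
print(phrase_calibration(reports))  # low quantile -> overconfident usage
```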

To improve calibration, the researchers formulated and solved an optimization problem that adjusts how often certain phrases are used, to better align confidence with reality.

They derived a calibration map that suggests certainty terms a radiologist should use to make the reports more accurate for a specific pathology.

“Perhaps, for this dataset, if every time the radiologist said pneumonia was ‘present,’ they changed the phrase to ‘likely present’ instead, then they would become better calibrated,” Wang explains.
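
The sketch below shows a greedy, hypothetical version of such a calibration map, mirroring that example: each phrase is remapped to whichever phrase's distribution places the most density at the rate actually observed when the original phrase was used. The paper solves a proper optimization problem over phrase usage; this only illustrates the idea, again with assumed distributions.

```python
from scipy import stats

# A greedy, hypothetical calibration map (the paper formulates and
# solves an optimization problem; this only illustrates the idea).
PHRASES = {
    "present": stats.beta(24, 2),        # near-certain
    "likely present": stats.beta(8, 3),  # confident but hedged
    "may represent": stats.beta(4, 4),   # genuinely uncertain
}

def suggest_phrase(observed_rate, phrases=PHRASES):
    """Suggest the phrase whose distribution places the most density
    at the positive rate actually observed for the original phrase."""
    return max(phrases, key=lambda p: phrases[p].pdf(observed_rate))

# If findings reported as "present" were confirmed only 72 percent
# of the time, the map suggests softening the wording:
print(suggest_phrase(0.72))  # -> 'likely present'
```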

When the researchers used their framework to evaluate clinical reports, they found that radiologists were generally underconfident when diagnosing common conditions like atelectasis, but overconfident with more ambiguous conditions like infection.

In addition, the researchers evaluated the reliability of language models using their method, providing a more nuanced representation of confidence than classical methods that rely on confidence scores. 

“A lot of times, these models use phrases like ‘certainly.’ But because they are so confident in their answers, it does not encourage people to verify the correctness of the statements themselves,” Wang adds.

In the future, the researchers plan to continue collaborating with clinicians in the hopes of improving diagnoses and treatment. They are working to expand their study to include data from abdominal CT scans.

In addition, they are interested in studying how receptive radiologists are to calibration-improving suggestions and whether they can mentally adjust their use of certainty phrases effectively.

“Expression of diagnostic certainty is a crucial aspect of the radiology report, as it influences significant management decisions. This study takes a novel approach to analyzing and calibrating how radiologists express diagnostic certainty in chest X-ray reports, offering feedback on term usage and associated outcomes,” says Atul B. Shinagare, associate professor of radiology at Harvard Medical School, who was not involved with this work. “This approach has the potential to improve radiologists’ accuracy and communication, which will help improve patient care.”

The work was funded, in part, by a Takeda Fellowship, the MIT-IBM Watson AI Lab, the MIT CSAIL Wistron Research Collaboration, and the MIT Jameel Clinic.
