Towards Fairer Health Recommendations: finding informative unbiased samples via Word Sense Disambiguation
Format: Journal Article
Language: English
Published: 11-09-2024
Summary: There have been growing concerns around high-stakes applications that rely on models trained with biased data, which consequently produce biased predictions, often harming the most vulnerable. In particular, biased medical data could cause health-related applications and recommender systems to create outputs that jeopardize patient care and widen disparities in health outcomes. A recent framework titled Fairness via AI posits that, instead of attempting to correct model biases, researchers must focus on their root causes by using AI to debias data. Inspired by this framework, we tackle bias detection in medical curricula using NLP models, including LLMs, and evaluate them on a gold-standard dataset of 4,105 excerpts drawn from a large corpus and annotated for bias by medical experts. We build on previous work by co-authors that augments the set of negative samples with non-annotated text containing social identifier terms. However, some of these terms, especially those related to race and ethnicity, can carry different meanings (e.g., "white matter of spinal cord"). To address this issue, we propose the use of Word Sense Disambiguation models to refine dataset quality by removing irrelevant sentences. We then evaluate fine-tuned variants of BERT models as well as GPT models with zero- and few-shot prompting. We find that LLMs, considered SOTA on many NLP tasks, are unsuitable for bias detection, while fine-tuned BERT models generally perform well across all evaluated metrics.
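To illustrate the disambiguation step described in the summary, the sketch below uses NLTK's classic Lesk algorithm as a stand-in for the paper's WSD models; the identifier-term list, the gloss-keyword heuristic, and the sample sentences are assumptions made for this example, not the authors' actual setup.

```python
# Illustrative sketch: drop candidate negative sentences in which a social
# identifier term is not used in its demographic sense. The Lesk algorithm,
# the term list, and the keyword heuristic are stand-ins, not the paper's setup.
import nltk
from nltk.wsd import lesk

nltk.download("wordnet", quiet=True)

IDENTIFIER_TERMS = {"white", "black"}  # toy subset of social identifier terms
DEMOGRAPHIC_KEYWORDS = ("race", "ethnic", "african", "caucasian", "people")

def uses_demographic_sense(sentence: str) -> bool:
    """Keep a sentence only if every identifier term it contains is
    disambiguated to a WordNet sense whose gloss looks demographic."""
    tokens = sentence.lower().split()
    for term in IDENTIFIER_TERMS & set(tokens):
        sense = lesk(tokens, term)  # synset chosen from the sentence context
        if sense is None:
            return False
        gloss = sense.definition().lower()
        if not any(k in gloss for k in DEMOGRAPHIC_KEYWORDS):
            # intended to drop uses like "white matter of spinal cord"
            return False
    return True

corpus = [
    "Demyelination affects the white matter of the spinal cord.",
    "Black patients were less likely to receive adequate pain management.",
]
negative_candidates = [s for s in corpus if uses_demographic_sense(s)]
```

In practice a purpose-built WSD model would replace the Lesk baseline; the filtering logic stays the same: keep a sentence only when its identifier terms resolve to demographic senses.

For the classification step, a minimal fine-tuning sketch with Hugging Face Transformers follows; the model checkpoint, hyperparameters, and toy examples are placeholders rather than the configurations evaluated in the paper.

```python
# Illustrative sketch: fine-tune a BERT-style sequence classifier for binary
# bias detection. Checkpoint, hyperparameters, and data are placeholders.
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

MODEL_NAME = "bert-base-uncased"  # stand-in for the BERT variants evaluated

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)

# Toy stand-ins for expert-annotated excerpts (1 = biased, 0 = unbiased).
train = Dataset.from_dict({
    "text": ["<excerpt annotated as biased>", "<excerpt annotated as unbiased>"],
    "label": [1, 0],
})
train = train.map(
    lambda batch: tokenizer(batch["text"], truncation=True,
                            padding="max_length", max_length=128),
    batched=True,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bias-detector", num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=train,
)
trainer.train()
```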
DOI: 10.48550/arxiv.2409.07424