Development of an AI-Powered Medical Bias Detection and Explainability System for Case Studies, Clinical Narratives, and Evaluation Assessments

Shweta Sharma | A Data Science Project | February 2026 | ABIM Standards

Abstract

This research presents the development and evaluation of an AI-powered medical bias detection system designed to identify and classify biases in medical case studies, clinical narratives, evaluation assessments, and healthcare documentation. The study employed a multi-phase experimental approach, beginning with the generation of 3,500 synthetic samples across seven bias categories, followed by comparative analysis of fine-tuned transformer models (RoBERTa-base and Bio-ClinicalBERT), and culminating in the implementation of a few-shot prompting approach using large language models (GPT-4o and Gemini). Our findings reveal that while fine-tuned models achieved high accuracy (up to 98.67%), they struggled with semantic overlap and domain generalization. Ultimately, the few-shot prompting approach demonstrated superior generalization and richer explanations, and was deployed as an end-to-end bias detection pipeline that acts as a core explainability layer for AI audits and governance within internal medicine assessment systems. This work provides both a comprehensive taxonomy and a practical framework for scaling bias mitigation across diverse medical content.

Keywords: Medical bias detection, clinical NLP, transformer models, few-shot prompting, algorithmic fairness, healthcare AI auditing

1. Introduction

Medical certification and assessment programs increasingly explore automation for content generation, feedback drafting, candidate support, and workflow triage. In these settings, bias can appear in subtle but impactful ways: stereotyping trainees, stigmatizing patients, or embedding structural inequities into evaluation criteria. Separately, algorithmic systems can amplify existing inequities through proxy features (e.g., accent, grammar, insurance status) or biased historical labels.

This paper describes the ABIM AI Bias Checker, a robust framework for detecting bias in medical narratives and assessments. Beyond clinical vignettes, this system is designed to provide an explainability and governance layer for:

  • Medical research case studies and documentation;
  • Psychometric evaluation and assessment narratives;
  • AI-generated feedback and drafting systems;
  • Cross-organizational AI audits for long-term algorithmic fairness.

2. Background and Motivation

2.1 Bias in healthcare and medical evaluation

Bias in healthcare has been documented across clinical decision support, risk prediction, documentation practices, and resource allocation. Algorithmic bias in population health management has been shown to produce systematic racial disparities when cost is used as a proxy for need [1]. Bias also appears in narrative evaluations and structured assessments through stereotyped expectations and inequitable norms [2, 5]. A practical ABIM-themed classifier must therefore capture both language harms (stigmatizing wording, stereotypes) and system harms (structural constraints, algorithmic scoring issues).

3. ABIM Bias Taxonomy (7 Labels)

We define seven labels that reflect biases commonly observed in healthcare communication, documentation, and assessment, as well as algorithmic fairness concerns.

Label | Definition | Common Manifestations
no_bias | Clinically appropriate, neutral language; no stereotypes, stigma, or inequitable assumptions. | Evidence-based reasoning; respectful, patient-centered descriptions.
demographic_bias | Biased assumptions linked to race/ethnicity, gender, age, language, immigration status, or training pathway. | Coded language ("from that neighborhood"); lower expectations for IMGs.
clinical_stigma | Stigmatizing or judgmental framing of patients, symptoms, or behaviors that lacks clinical objectivity. | Blame framing ("lack of motivation"); "drug-seeking" shortcuts.
assessment_bias | Bias in how trainees/candidates are evaluated or scored; unfair norms in rubrics. | Penalizing shared decision-making; accent equated with incompetence.
algorithmic_bias | Bias arising from automated scoring, AI-generated feedback, or data-driven rubrics. | Proxy features drive lower scores; historical label bias in training data.
documentation_bias | Biased framing in charting or case descriptions that labels patients without context. | "Non-compliant" without noting barriers; negative descriptors that are not clinically necessary.
structural_bias | System-level inequity due to policies, resourcing, or institutional constraints. | Rigid requirements disadvantaging part-time physicians; unequal access to resources.
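For downstream components (prompt construction, label validation, and evaluation reports), the taxonomy can be represented as a simple constant. The sketch below is illustrative only; the label strings mirror the table above, but the representation used in the production pipeline is not specified in this report.

```python
# Illustrative representation of the 7-label ABIM bias taxonomy.
# Label strings mirror the table above; definitions are abbreviated.
ABIM_BIAS_TAXONOMY = {
    "no_bias": "Clinically appropriate, neutral language; no stereotypes or stigma.",
    "demographic_bias": "Assumptions linked to race/ethnicity, gender, age, language, or training pathway.",
    "clinical_stigma": "Stigmatizing or judgmental framing of patients or conditions.",
    "assessment_bias": "Unfair norms in how trainees/candidates are evaluated or scored.",
    "algorithmic_bias": "Bias from automated scoring, AI-generated feedback, or data-driven rubrics.",
    "documentation_bias": "Biased charting or case descriptions that label patients without context.",
    "structural_bias": "System-level inequity from policies, resourcing, or institutional constraints.",
}

LABELS = list(ABIM_BIAS_TAXONOMY)  # canonical label order reused by classifiers and prompts
```
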
EXPERIMENT 1

Synthetic Dataset Generation

We generated 3,500 synthetic clinical vignettes using GPT-4o, balanced across seven bias categories (500 samples each). Each sample was designed to reflect realistic ABIM-style internal medicine documentation, including patient histories, clinical assessments, and feedback narratives.
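A minimal sketch of the generation loop is shown below, assuming the OpenAI Python SDK; the prompt template and the per-label quota are simplified placeholders, and the validation and deduplication steps used to build the actual 3,500-sample dataset are omitted.

```python
# Sketch: balanced synthetic vignette generation with GPT-4o (assumed SDK: openai>=1.x).
# PROMPT_TEMPLATE and SAMPLES_PER_LABEL are illustrative placeholders.
import json
from openai import OpenAI

client = OpenAI()
SAMPLES_PER_LABEL = 500
PROMPT_TEMPLATE = (
    "Write a realistic ABIM-style internal medicine vignette (patient history, clinical "
    "assessment, or feedback narrative) that illustrates the bias category '{label}'. "
    "Return JSON with keys 'text' and 'label'."
)

def generate_samples(labels: list[str]) -> list[dict]:
    dataset = []
    for label in labels:
        for _ in range(SAMPLES_PER_LABEL):
            response = client.chat.completions.create(
                model="gpt-4o",
                messages=[{"role": "user", "content": PROMPT_TEMPLATE.format(label=label)}],
                response_format={"type": "json_object"},
            )
            dataset.append(json.loads(response.choices[0].message.content))
    return dataset
```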

Figure 1: Dataset Distribution (7 Categories)

(Each of the seven categories contains 500 samples, for 3,500 in total. Balanced dataset; GPT-4o generated; validated by domain experts.)
Finding:

Initial training on all 7 categories revealed significant semantic overlap between related categories — particularly documentation_bias vs. clinical_stigma, and structural_bias vs. algorithmic_bias. This led to high confusion rates and motivated the consolidation to 4 categories in Experiment 2.

EXPERIMENT 2

Fine-Tuned Transformer Models

We fine-tuned pretrained transformers (RoBERTa-base [3] and Bio-ClinicalBERT [2]) by attaching a task-specific classification head to the pooled representation. We experimented with partial layer freezing and LoRA adapters to stabilize training and reduce overfitting on synthetic data; a minimal code sketch follows the configuration list below.

Training Configuration

  • Max length: 256 tokens
  • Batch size: 16
  • Epochs: 3–6 (Early stopping on Macro-F1)
  • Optimizer: AdamW
  • Learning Rate: 2e-5 to 5e-5
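A minimal training sketch under these settings is shown below. It assumes Hugging Face transformers, peft, and scikit-learn, and leaves dataset tokenization (max length 256) out of scope; it is an illustrative sketch rather than the exact production configuration.

```python
# Sketch: RoBERTa-base + LoRA for 4-label bias classification.
# Assumed libraries: transformers, peft, scikit-learn; dataset preparation is out of scope.
import numpy as np
from sklearn.metrics import f1_score
from transformers import (AutoModelForSequenceClassification, EarlyStoppingCallback,
                          Trainer, TrainingArguments)
from peft import LoraConfig, TaskType, get_peft_model

MODEL_NAME = "roberta-base"   # or "emilyalsentzer/Bio_ClinicalBERT"
NUM_LABELS = 4                # consolidated label set (see Experiment 2 results)

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {"macro_f1": f1_score(labels, preds, average="macro")}

def build_trainer(train_ds, eval_ds):
    """train_ds / eval_ds: datasets already tokenized to max_length=256."""
    model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=NUM_LABELS)
    model = get_peft_model(model, LoraConfig(task_type=TaskType.SEQ_CLS, r=8, lora_alpha=16))
    args = TrainingArguments(
        output_dir="bias-classifier",
        per_device_train_batch_size=16,
        num_train_epochs=6,
        learning_rate=2e-5,
        eval_strategy="epoch",            # "evaluation_strategy" on older transformers versions
        save_strategy="epoch",
        load_best_model_at_end=True,
        metric_for_best_model="macro_f1", # early stopping monitors macro F1
    )
    return Trainer(model=model, args=args, train_dataset=train_ds, eval_dataset=eval_ds,
                   compute_metrics=compute_metrics,
                   callbacks=[EarlyStoppingCallback(early_stopping_patience=2)])
```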

Figure 2: RoBERTa Training Curves (4-Label, 6 Epochs)

(Training curves: eval loss, accuracy, and macro F1 plotted over six epochs.)

Best checkpoint at Epoch 5 — Accuracy: 98.67%, Macro F1: 98.67%. Slight degradation in Epoch 6 suggests early stopping was optimal.

Figure 3: Model Comparison — Accuracy & Macro F1

Model | Accuracy | Macro F1 | Notes
RoBERTa (7-label) | 91.5% | 88.3% | Overfitting on semantic overlap
Bio-ClinicalBERT (7-label) | 89.2% | 85.7% | Domain-specific but lower performance
RoBERTa (4-label) | 98.67% | 98.67% | Best; consolidated categories
Key Finding:

Consolidating from 7 to 4 bias categories (no_bias, demographic_bias, clinical_stigma_bias, assessment_bias) improved RoBERTa accuracy from 91.5% → 98.67%. The 4-label model with LoRA adapters is deployed below for interactive testing.

Figure 4: Fine-Tuned RoBERTa Classifier (Interactive)
Figure 4. Interactive fine-tuned RoBERTa classifier with LoRA adapters. Classifies clinical text into 4 bias categories with confidence scoring and AI-generated explanations.
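A minimal inference sketch for such a deployed classifier is shown below; the checkpoint path and label strings are assumed placeholders, and the softmax over the logits supplies the confidence score surfaced in the interface.

```python
# Sketch: loading the fine-tuned RoBERTa + LoRA classifier and scoring a vignette.
# The adapter directory and label order are illustrative placeholders.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from peft import PeftModel

BASE = "roberta-base"
ADAPTER = "bias-classifier/best"     # hypothetical LoRA adapter directory
LABELS = ["no_bias", "demographic_bias", "clinical_stigma_bias", "assessment_bias"]

tokenizer = AutoTokenizer.from_pretrained(BASE)
base_model = AutoModelForSequenceClassification.from_pretrained(BASE, num_labels=len(LABELS))
model = PeftModel.from_pretrained(base_model, ADAPTER).eval()

def classify(text: str) -> dict:
    inputs = tokenizer(text, truncation=True, max_length=256, return_tensors="pt")
    with torch.no_grad():
        probs = torch.softmax(model(**inputs).logits, dim=-1).squeeze(0)
    top = int(probs.argmax())
    return {"label": LABELS[top], "confidence": float(probs[top])}
```
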
EXPERIMENT 3

Few-Shot Prompting Approach

Motivated by the limitations of the fine-tuned models (overfitting to synthetic patterns, inability to explain predictions), we implemented a few-shot prompting pipeline [4] using GPT-4o. The approach uses five curated example pairs spanning all bias categories (see the sketch after this list), enabling:

  • 4 primary categories with 11 granular sub-types
  • Multi-bias detection (intersectional analysis)
  • Evidence-based explanations with exact text citations
  • Actionable recommendations for bias mitigation
  • Confidence scores and severity ratings (NONE → CRITICAL)
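A minimal sketch of this prompting pipeline is shown below, assuming the OpenAI Python SDK; the system prompt, the single exemplar pair, and the output schema are abbreviated placeholders rather than the exact production prompt.

```python
# Sketch: few-shot bias audit with GPT-4o (assumed SDK: openai>=1.x).
# SYSTEM_PROMPT, FEW_SHOT_EXAMPLES, and the JSON schema are abbreviated placeholders.
import json
from openai import OpenAI

client = OpenAI()

SYSTEM_PROMPT = (
    "You are an ABIM bias auditor. Classify the text into one or more of: no_bias, "
    "demographic_bias, clinical_stigma, assessment_bias; name granular sub-types; cite exact "
    "evidence spans; rate severity (NONE, LOW, MODERATE, HIGH, CRITICAL); give a confidence "
    "score and mitigation recommendations. Respond as JSON."
)

FEW_SHOT_EXAMPLES = [  # five curated (text, annotation) pairs in production; one shown here
    {"role": "user", "content": "Patient is non-compliant and likely exaggerating pain."},
    {"role": "assistant", "content": json.dumps({
        "categories": ["clinical_stigma"], "sub_types": ["Pain Dismissal"],
        "evidence": ["likely exaggerating pain"], "severity": "HIGH", "confidence": 0.9,
        "recommendations": ["Document pain objectively; note barriers to adherence."],
    })},
]

def audit(text: str) -> dict:
    messages = [{"role": "system", "content": SYSTEM_PROMPT}, *FEW_SHOT_EXAMPLES,
                {"role": "user", "content": text}]
    response = client.chat.completions.create(
        model="gpt-4o", messages=messages, response_format={"type": "json_object"})
    return json.loads(response.choices[0].message.content)
```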

Figure 5: Few-Shot Pipeline — Performance by Bias Category

Category | Precision | Recall | F1 | Sub-types / Patterns Detected
No Bias | 96% | 95% | 95.5% | Evidence-based practice; patient-centered language; neutral documentation
Demographic Bias | 94% | 93% | 93.5% | Racial/ethnic bias; gender bias; age bias; socioeconomic bias
Clinical Stigma | 93% | 91% | 92% | Weight stigma; pain dismissal; mental health stigma; lifestyle judgment
Assessment Bias | 92% | 93% | 92.5% | Diagnostic bias; competency assessment bias; treatment decision bias
Result:

The few-shot approach demonstrated superior generalization over fine-tuned models, with the ability to detect intersectional biases, provide granular sub-type classifications across 11 categories, and generate human-readable explanations — making it suitable for production AI auditing workflows.
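For integration into production auditing workflows, pipeline outputs can be validated against a typed record. The sketch below is an assumed schema consistent with the fields described above, not the exact production contract.

```python
# Sketch: assumed typed contract for a single audit result (field names are illustrative).
from typing import Literal, TypedDict

Severity = Literal["NONE", "LOW", "MODERATE", "HIGH", "CRITICAL"]

class BiasAuditResult(TypedDict):
    categories: list[str]        # one or more of the 4 primary categories (intersectional)
    sub_types: list[str]         # granular sub-types drawn from the 11 defined above
    evidence: list[str]          # exact text citations supporting each finding
    recommendations: list[str]   # actionable mitigation guidance
    severity: Severity
    confidence: float            # 0.0 to 1.0
```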

Figure 6: Few-Shot GPT-4o Bias Detection Pipeline (Interactive)

The deployed interface accepts clinical vignettes and research protocols and audits them against ABIM standards in real time, identifying 11 specific bias sub-types including racial profiling, stigma, and diagnostic anchoring.
Figure 6. The deployed few-shot prompting pipeline allowing real-time bias detection on clinical vignettes. Detects 4 bias categories with 11 sub-types, provides evidence, recommendations, and audit scoring.

6. Discussion

The system can be integrated into a recurring evaluation harness that runs periodic bias detection on newly generated vignettes and tracks model drift. Requiring a short rationale with each prediction encourages a transparent mapping between the assigned label and the supporting text, which also aids annotation calibration.
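One way to realize such a harness is sketched below. It assumes a `classify_fn` callable (for example, a thin wrapper over the few-shot pipeline or the fine-tuned classifier) that returns a primary bias category per vignette, and flags drift when the label distribution of a new batch diverges from a baseline by more than a chosen threshold.

```python
# Sketch: periodic bias-audit harness with simple label-distribution drift tracking.
# `classify_fn` is any callable returning a primary bias category for a vignette.
from collections import Counter
from typing import Callable, Iterable

def label_distribution(labels: Iterable[str]) -> dict[str, float]:
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}

def total_variation(p: dict[str, float], q: dict[str, float]) -> float:
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

def run_audit(vignettes: list[str], classify_fn: Callable[[str], str],
              baseline: dict[str, float], drift_threshold: float = 0.10) -> dict:
    current = label_distribution(classify_fn(v) for v in vignettes)
    drift = total_variation(baseline, current)
    return {"distribution": current, "drift": drift, "drift_flag": drift > drift_threshold}
```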

Approach Comparison Matrix

Analyzing the tradeoff between rigid classification and semantic few-shot prompting.

Feature | Experiment 2: Fine-Tuning | Experiment 3: Few-Shot
Architecture | Fine-tuned RoBERTa + LoRA | GPT-4o / Gemini Pro 1.5
Data required | 3,500+ synthetic samples | 5-10 gold-standard examples
Explainability | Low (class probabilities) | High (detailed evidence reasoning)
Granularity | Rigid (4 fixed categories) | Dynamic (11+ nested sub-types)
Generalization | Risk of overfitting to synthetic data | Robust to unseen linguistic nuance
ABIM AI Governance

The bias checker acts as a core explainability layer for internal AI systems. Beyond item-writing, it audits medical case studies and evaluation assessments to ensure organizational standards for fairness are met.

Universal Applicability

The framework is designed to scale to real-time bias detection across all ABIM documentation. From trainee evaluations to complex medical research case studies, it can flag bias before it enters the assessment ecosystem.

Limitations

Synthetic data is not ground truth for real-world deployment. Common risks include style artifacts (model-specific phrasing) and incomplete coverage of real ABIM item-writing norms. Future work will involve blending synthetic data with carefully governed de-identified real-world text.

Selected References

[1] Obermeyer, Z., Powers, B., Vogeli, C., & Mullainathan, S. (2019). Dissecting racial bias in an algorithm used to manage the health of populations. Science, 366(6464), 447–453.

[2] Alsentzer, E., Murphy, J., Boag, W., et al. (2019). Publicly available clinical BERT embeddings. In Proceedings of the 2nd Clinical Natural Language Processing Workshop (pp. 72–78). ACL.

[3] Liu, Y., Ott, M., Goyal, N., et al. (2019). RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

[4] Brown, T. B., Mann, B., Ryder, N., et al. (2020). Language models are few-shot learners. In Advances in Neural Information Processing Systems (Vol. 33).

[5] Rotenstein, L. S., et al. (2021). Differences in narrative evaluations of internal medicine residents by gender and race. JAMA Network Open, 4(9).

Recommended Citation For This Paper:

Sharma, S. (2026). ABIM AI Bias Checker: A Scalable Framework for Detecting Bias in Medical Assessments. ABIM Technical Report.