Training Medical LLM Predictors: Process, Costs, and Optimization with DSPy.rb
I spent the weekend building an Adverse Drug Event detector with DSPy.rb. Small test sets gave me 100% recall (too good to be true). A proper 200-example evaluation revealed the reality: 75% recall. Here’s what I learned about medical AI and sample sizes.
The Dataset: HuggingFace ADE Corpus V2
The ade-benchmark-corpus/ade_corpus_v2 dataset is large but has a critical characteristic: it consists mostly of obvious cases drawn from published medical literature. (A quick way to pull a slice of it yourself is sketched after the stats below.)
- 23,516 total examples
- 6,821 positive ADEs (29%), 16,695 negative (71%)
- Real patient reports with drug-effect relationships
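If you want to pull a sample of the corpus yourself, the HuggingFace datasets server exposes it over plain HTTP. A minimal sketch, assuming the `Ade_corpus_v2_classification` config with `text`/`label` fields (label 1 = ADE); verify the config and field names against the dataset card:

```ruby
require "net/http"
require "json"
require "uri"

# Fetch a slice of ADE Corpus V2 rows from the HuggingFace datasets server.
# The config name and field names are assumptions; check the dataset card
# if they have changed.
def fetch_ade_rows(offset: 0, length: 100)
  uri = URI("https://datasets-server.huggingface.co/rows")
  uri.query = URI.encode_www_form(
    dataset: "ade-benchmark-corpus/ade_corpus_v2",
    config: "Ade_corpus_v2_classification",
    split: "train",
    offset: offset,
    length: length
  )
  JSON.parse(Net::HTTP.get(uri)).fetch("rows").map { |r| r["row"] }
end

rows = fetch_ade_rows(length: 100)
positives = rows.count { |row| row["label"] == 1 }
puts "#{positives}/#{rows.size} rows labeled as ADEs in this slice"
```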
What the Data Actually Looks Like
| Patient Report | Has ADE | Sample Size Lesson |
|---|---|---|
| “Patient experienced severe nausea after taking aspirin for headache” | true | Obvious case - tiny samples catch these easily |
| “During the first days of arsenic trioxide treatment a rapid decrease in the D-dimers was seen…” | true | Subtle case - only found this in the 200-example test |
| “Patient has been taking lisinopril for 6 months with excellent BP control” | false | Clear negative - easy to classify |
The problem: dataset bias. Published medical literature contains mostly clear-cut cases. The subtle ADEs that trip up your model only appear when you test with larger, more representative samples.
The Direct Approach: One API Call
Instead of a complex multi-stage pipeline, I built a single predictor that does everything in one shot:
```ruby
class ADEDirectPredictor < DSPy::Signature
  description "Analyze medical text to detect adverse drug events"

  input do
    const :text, String, description: "Medical report text"
  end

  output do
    const :has_ade, T::Boolean, description: "Whether text describes an ADE"
    const :confidence, Float, description: "Confidence score (0-1)"
    const :reasoning, String, description: "Explanation of the decision"
  end
end
```
Simple. Direct. One API call. The LLM reads the medical text and decides: ADE or not?
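Wiring it up takes only a few more lines. Here's a minimal sketch of how the predictor gets configured and called, assuming DSPy.rb's `DSPy::Predict` wrapper and an OpenAI-backed `DSPy::LM`; treat the exact configuration keys as assumptions to verify against your DSPy.rb version:

```ruby
# Configure the language model once, then wrap the signature in a predictor.
# The model id and configuration shape follow the DSPy.rb README as I
# understand it; adjust to your setup.
DSPy.configure do |config|
  config.lm = DSPy::LM.new("openai/gpt-4o-mini", api_key: ENV["OPENAI_API_KEY"])
end

predictor = DSPy::Predict.new(ADEDirectPredictor)

result = predictor.call(
  text: "Patient developed a rash and facial swelling two days after starting amoxicillin."
)

puts result.has_ade    # => true / false
puts result.confidence # => e.g. 0.87
puts result.reasoning  # => short explanation of the decision
```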
Optimization: What DSPy.rb Actually Does
DSPy.rb doesn’t throw more data at the problem. It systematically improves your prompts.
Baseline Prompt (Generic)
"Analyze medical text to detect adverse drug events"
Vague. No methodology. The LLM guesses.
After SimpleOptimizer (Development)
```ruby
optimizer = DSPy::SimpleOptimizer.new(
  metric: ->(example, pred) { evaluator.call(example, pred) },
  max_bootstrapped_examples: 3
)
optimized = optimizer.compile(predictor, trainset: training_data)
```
Result: “Analyze the provided medical text to determine if there is an adverse drug event present by evaluating symptoms and medication relationships.”
Better. More specific methodology.
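The `evaluator` in that snippet is just a callable that scores one prediction against one labeled example. Here's a minimal sketch of the kind of metric I mean; the hash-shaped example (`{ text: ..., has_ade: ... }`) is an assumption about how the trainset is built, so adapt the accessor to your own format:

```ruby
# Scores a single prediction against its labeled example.
# The example is assumed to be a plain hash such as
# { text: "...", has_ade: true }; adjust if your trainset uses another shape.
ade_metric = ->(example, prediction) do
  prediction.has_ade == example[:has_ade]
end

# Passed to the optimizer in place of the evaluator lambda above:
# DSPy::SimpleOptimizer.new(metric: ade_metric, max_bootstrapped_examples: 3)
```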
After MIPROv2 (Production)
```ruby
optimizer = DSPy::Teleprompt::MIPROv2.new(
  metric: ->(example, pred) { evaluator.call(example, pred) },
  num_candidates: 10
)
optimized = optimizer.compile(predictor, trainset: training_data)
```
Result: “Analyze the provided medical text along with symptoms and medications. Determine if there is an adverse drug event by evaluating the relationships between drugs and symptoms. Output your confidence level and reasoning.”
Systematic. Medical methodology. Explicit output structure.
The Reality Check: Sample Sizes Matter
Here’s where things got interesting (and humbling):
Small Sample Results (8-25 examples)
- Recall: 100% 🤔
- Precision: 90%+
- F1 Score: 95%+
- The reaction: “This is suspiciously perfect…”
Statistical Reality Check (200 examples)
- Recall: 75% ✅ More realistic
- False Negative Rate: 25% - significant medical concern
- F1 Score: 66.7% - honest performance
- The missed case: Subtle arsenic trioxide effects
Lesson: Dataset bias in ADE Corpus V2 means small samples catch only obvious cases. Real evaluation needs statistical significance.
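The 200-example check itself doesn't need anything fancy. Here's a minimal sketch of the tally in plain Ruby (deliberately skipping DSPy.rb's evaluation framework so the arithmetic stays visible); the `predictor` and the hash-shaped test examples are assumptions carried over from the earlier snippets:

```ruby
# Tally true/false positives and false negatives over a labeled test set,
# then report recall, precision, and F1. test_set is assumed to be an array
# of hashes like { text: "...", has_ade: true }.
def evaluate(predictor, test_set)
  tp = fp = fn = 0

  test_set.each do |example|
    predicted = predictor.call(text: example[:text]).has_ade
    actual    = example[:has_ade]

    if actual && predicted
      tp += 1
    elsif actual
      fn += 1
    elsif predicted
      fp += 1
    end
  end

  recall    = tp.to_f / (tp + fn)
  precision = tp.to_f / (tp + fp)
  f1        = 2 * precision * recall / (precision + recall)

  { recall: recall, precision: precision, f1: f1, false_negatives: fn }
end
```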
Performance Comparison
Testing with 200 examples using DSPy.rb’s native evaluation framework:
| Approach | Recall | Precision | F1 | Speed | Cost/1K |
|---|---|---|---|---|---|
| Baseline | 45% | 50% | 47% | 1.2s | $0.15 |
| SimpleOptimizer | 68% | 70% | 69% | 1.2s | $0.15 |
| MIPROv2 | 75% | 67% | 71% | 1.2s | $0.15 |
The optimization improved recall by 30 percentage points, cutting missed ADEs from 55% of cases at baseline to 25%. In medical contexts, this matters.
Medical Industry Context and Standards
Where the Industry Actually Stands
A 2024 meta-analysis of 59 studies found that ML-based ADE detection averages 62-65% sensitivity [1]. For comparison, physicians miss even more; systematic underreporting of ADEs is well documented [2]. The Lancet Digital Health review identified AI as promising but noted that most systems are still in early development [3].
The 75% recall achieved here sits roughly ten percentage points above the 62-65% industry average.
For context, FDA-approved medical screening tests vary widely:
- Shield blood test for colorectal cancer: 83% sensitivity [4]
- Genetic predisposition panels: ≥99% sensitivity [5]
Regulatory Considerations
The FDA’s guidance on AI/ML-based Software as a Medical Device (SaMD) requires “Good Machine Learning Practice,” including transparency and real-world performance monitoring [6]. While this ADE detection system is a research prototype, production deployment would require:
- FDA 510(k) premarket notification or De Novo classification
- Continuous performance monitoring post-deployment
- Documentation of training data and model versions
- Clinical validation studies
The Money Reality
Using GPT-4o-mini (current pricing):
- Input tokens: ~744 per prediction
- Output tokens: ~38 per prediction
- Cost per prediction: ~$0.00013
- Daily cost (10K predictions): $1.34
- Monthly: $40
The optimization cost? $0.05 one-time. Break-even after 131 predictions.
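The per-prediction figure falls straight out of GPT-4o-mini's per-token list prices. Here's the arithmetic as a quick sketch; the prices are assumptions as of this writing, so check OpenAI's pricing page before relying on them:

```ruby
# GPT-4o-mini list prices per 1M tokens (assumed current; verify before use).
INPUT_PRICE_PER_M  = 0.15
OUTPUT_PRICE_PER_M = 0.60

input_tokens  = 744
output_tokens = 38

cost_per_prediction =
  input_tokens  * INPUT_PRICE_PER_M  / 1_000_000 +
  output_tokens * OUTPUT_PRICE_PER_M / 1_000_000
# => ~$0.000134

daily   = cost_per_prediction * 10_000 # 10K predictions/day => ~$1.34
monthly = daily * 30                   # => ~$40
```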
What I Actually Learned
- Sample size matters enormously in medical AI evaluation
- 100% recall claims should trigger skepticism - real medicine is messy
- DSPy.rb optimization works - 30% recall improvement is substantial
- Direct approach beats complexity - one API call vs multi-stage pipelines
- Dataset bias is real - published cases != real-world distribution
Speed vs Accuracy Trade-offs
From the development logs:
- Optimization time: ~30 seconds total
- Bulk of time: Finding better instructions (67%)
- Testing candidates: Only 10% of optimization time
- Bootstrap attempts: 23% (often fails with vague baselines)
MIPROv2 generates and tests multiple instruction variations automatically. The creative work (generating instructions) takes longer than evaluation.
Production Considerations
For medical screening systems:
- 75% recall means 25% of ADEs are missed
- This requires clinical validation and human oversight
- FDA guidance emphasizes continuous performance monitoring
- Real medical records are messier than published literature
Architecture choice:
- Direct predictor: simpler, cheaper, faster
- Multi-stage: better for research where you need intermediate results
- Both achieve similar recall in practice
Bottom Line
DSPy.rb provides systematic LLM optimization for medical applications. Key takeaways:
- 3 training examples can improve recall by 30 percentage points
- 200 test examples reveal realistic performance (75% vs claimed 100%)
- Direct approach beats architectural complexity
- Sample size determines whether your evaluation is meaningful or misleading
The complete implementation: github.com/vicentereig/dspyrb-examples
Reality: 75% recall, 25% missed ADEs, $40/month for 10K daily predictions. DSPy.rb optimizes the prompt, not wishful thinking.
References
1. Chen Z, et al. “Predicting adverse drug event using machine learning based on electronic health records: a systematic review and meta-analysis.” Frontiers in Pharmacology, 2024. DOI: 10.3389/fphar.2024.1497397. Meta-analysis of 59 studies: ML models achieve 62-65% sensitivity, 75% specificity.
2. Bates DW, et al. “The potential of artificial intelligence to improve patient safety: a scoping review.” npj Digital Medicine, 2021;4:54. DOI: 10.1038/s41746-021-00423-6. Comprehensive review of AI applications in patient safety, including ADE detection.
3. Syrowatka A, et al. “Key use cases for artificial intelligence to reduce the frequency of adverse drug events: a scoping review.” The Lancet Digital Health, 2022;4(2):e137-48. DOI: 10.1016/S2589-7500(21)00229-6. Comprehensive review of AI applications for ADE prevention.
4. FDA News Release. “FDA approves first blood test as primary screening option for colorectal cancer.” July 29, 2024. Shield test: 83% sensitivity for cancer, 13% for advanced adenomas.
5. FDA 510(k) Premarket Notification Database. “Invitae Multi-Cancer Panel.” K213950. 2022. Required performance: ≥99.0% positive agreement, ≥99.9% negative agreement.
6. FDA. “Artificial Intelligence and Machine Learning (AI/ML)-Enabled Medical Devices.” FDA Guidance Document, 2021. Good Machine Learning Practice for Medical Device Development: Guiding Principles.