Training Medical LLM Predictors: Process, Costs, and Optimization with DSPy.rb
I spent the weekend building an Adverse Drug Event detector with DSPy.rb. Small test sets gave me 100% recall (too good to be true). A proper 200-example evaluation revealed the reality: 75% recall. Here’s what I learned about medical AI and sample sizes.
The Dataset: HuggingFace ADE Corpus V2
The ade-benchmark-corpus/ade_corpus_v2 dataset is large but has a critical characteristic: it consists mostly of obvious cases drawn from published medical literature. (A quick way to pull a slice of it yourself is sketched after the stats below.)
- 23,516 total examples
- 6,821 positive ADEs (29%), 16,695 negative (71%)
- Real patient reports with drug-effect relationships
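If you want to pull a sample of the corpus yourself, the HuggingFace datasets server exposes it over plain HTTP. A minimal sketch, assuming the `Ade_corpus_v2_classification` config with `text`/`label` fields (label 1 = ADE); verify the config and field names against the dataset card:

```ruby
require "net/http"
require "json"
require "uri"

# Fetch a slice of ADE Corpus V2 rows from the HuggingFace datasets server.
# The config name and field names are assumptions; check the dataset card
# if they have changed.
def fetch_ade_rows(offset: 0, length: 100)
  uri = URI("https://datasets-server.huggingface.co/rows")
  uri.query = URI.encode_www_form(
    dataset: "ade-benchmark-corpus/ade_corpus_v2",
    config: "Ade_corpus_v2_classification",
    split: "train",
    offset: offset,
    length: length
  )
  JSON.parse(Net::HTTP.get(uri)).fetch("rows").map { |r| r["row"] }
end

rows = fetch_ade_rows(length: 100)
positives = rows.count { |row| row["label"] == 1 }
puts "#{positives}/#{rows.size} rows labeled as ADEs in this slice"
```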
What the Data Actually Looks Like
| Patient Report | Has ADE | Sample Size Lesson |
|---|---|---|
| “Patient experienced severe nausea after taking aspirin for headache” | true | Obvious case - tiny samples catch these easily |
| “During the first days of arsenic trioxide treatment a rapid decrease in the D-dimers was seen…” | true | Subtle case - only found this in the 200-example test |
| “Patient has been taking lisinopril for 6 months with excellent BP control” | false | Clear negative - easy to classify |
The problem: dataset bias. Published medical literature contains mostly clear-cut cases. The subtle ADEs that trip up your model only appear when you test with larger, more representative samples.
The Direct Approach: One API Call
Instead of a complex multi-stage pipeline, I built a single predictor that does everything in one shot:
```ruby
class ADEDirectPredictor < DSPy::Signature
  description "Analyze medical text to detect adverse drug events"

  input do
    const :text, String, description: "Medical report text"
  end

  output do
    const :has_ade, T::Boolean, description: "Whether text describes an ADE"
    const :confidence, Float, description: "Confidence score (0-1)"
    const :reasoning, String, description: "Explanation of the decision"
  end
end
```
Simple. Direct. One API call. The LLM reads the medical text and decides: ADE or not?
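Wiring it up takes only a few more lines. Here's a minimal sketch of how the predictor gets configured and called, assuming DSPy.rb's `DSPy::Predict` wrapper and an OpenAI-backed `DSPy::LM`; treat the exact configuration keys as assumptions to verify against your DSPy.rb version:

```ruby
# Configure the language model once, then wrap the signature in a predictor.
# The model id and configuration shape follow the DSPy.rb README as I
# understand it; adjust to your setup.
DSPy.configure do |config|
  config.lm = DSPy::LM.new("openai/gpt-4o-mini", api_key: ENV["OPENAI_API_KEY"])
end

predictor = DSPy::Predict.new(ADEDirectPredictor)

result = predictor.call(
  text: "Patient developed a rash and facial swelling two days after starting amoxicillin."
)

puts result.has_ade    # => true / false
puts result.confidence # => e.g. 0.87
puts result.reasoning  # => short explanation of the decision
```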
Optimization: What DSPy.rb Actually Does
DSPy.rb doesn’t throw more data at the problem. It systematically improves your prompts.
Baseline Prompt (Generic)
"Analyze medical text to detect adverse drug events"
Vague. No methodology. The LLM guesses.
After SimpleOptimizer (Development)
```ruby
optimizer = DSPy::SimpleOptimizer.new(
  metric: ->(example, pred) { evaluator.call(example, pred) },
  max_bootstrapped_examples: 3
)
optimized = optimizer.compile(predictor, trainset: training_data)
```
Result: “Analyze the provided medical text to determine if there is an adverse drug event present by evaluating symptoms and medication relationships.”
Better. More specific methodology.
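The `evaluator` in that snippet is just a callable that scores one prediction against one labeled example. Here's a minimal sketch of the kind of metric I mean; the hash-shaped example (`{ text: ..., has_ade: ... }`) is an assumption about how the trainset is built, so adapt the accessor to your own format:

```ruby
# Scores a single prediction against its labeled example.
# The example is assumed to be a plain hash such as
# { text: "...", has_ade: true }; adjust if your trainset uses another shape.
ade_metric = ->(example, prediction) do
  prediction.has_ade == example[:has_ade]
end

# Passed to the optimizer in place of the evaluator lambda above:
# DSPy::SimpleOptimizer.new(metric: ade_metric, max_bootstrapped_examples: 3)
```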
After MIPROv2 (Production)
```ruby
optimizer = DSPy::Teleprompt::MIPROv2.new(
  metric: ->(example, pred) { evaluator.call(example, pred) },
  num_candidates: 10
)
optimized = optimizer.compile(predictor, trainset: training_data)
```
Result: “Analyze the provided medical text along with symptoms and medications. Determine if there is an adverse drug event by evaluating the relationships between drugs and symptoms. Output your confidence level and reasoning.”
Systematic. Medical methodology. Explicit output structure.
The Reality Check: Sample Sizes Matter
Here’s where things got interesting (and humbling):
Small Sample Results (8-25 examples)
- Recall: 100% 🤔
- Precision: 90%+
- F1 Score: 95%+
- The reaction: “This is suspiciously perfect…”
Statistical Reality Check (200 examples)
- Recall: 75% ✅ More realistic
- False Negative Rate: 25% - significant medical concern
- F1 Score: 66.7% - honest performance
- The missed case: Subtle arsenic trioxide effects
Lesson: Dataset bias in ADE Corpus V2 means small samples catch only obvious cases. Real evaluation needs statistical significance.
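The 200-example check itself doesn't need anything fancy. Here's a minimal sketch of the tally in plain Ruby (deliberately skipping DSPy.rb's evaluation framework so the arithmetic stays visible); the `predictor` and the hash-shaped test examples are assumptions carried over from the earlier snippets:

```ruby
# Tally true/false positives and false negatives over a labeled test set,
# then report recall, precision, and F1. test_set is assumed to be an array
# of hashes like { text: "...", has_ade: true }.
def evaluate(predictor, test_set)
  tp = fp = fn = 0

  test_set.each do |example|
    predicted = predictor.call(text: example[:text]).has_ade
    actual    = example[:has_ade]

    if actual && predicted
      tp += 1
    elsif actual
      fn += 1
    elsif predicted
      fp += 1
    end
  end

  recall    = tp.to_f / (tp + fn)
  precision = tp.to_f / (tp + fp)
  f1        = 2 * precision * recall / (precision + recall)

  { recall: recall, precision: precision, f1: f1, false_negatives: fn }
end
```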
Performance Comparison
Testing with 200 examples using DSPy.rb’s native evaluation framework:
| Approach | Recall | Precision | F1 | Speed | Cost/1K |
|---|---|---|---|---|---|
| Baseline | 45% | 50% | 47% | 1.2s | $0.15 |
| SimpleOptimizer | 68% | 70% | 69% | 1.2s | $0.15 |
| MIPROv2 | 75% | 67% | 71% | 1.2s | $0.15 |
The optimization improved recall by 30 percentage points, cutting missed ADEs from 55% of cases at baseline to 25%. In medical contexts, this matters.
Medical Industry Context and Standards
Where the Industry Actually Stands
A 2024 meta-analysis of 59 studies found that ML-based ADE detection averages 62-65% sensitivity [1]. For comparison, physicians miss even more; systematic underreporting of ADEs is well documented [2]. The Lancet Digital Health review identified AI as promising but noted that most systems are still in early development [3].
The 75% recall achieved here sits roughly ten percentage points above the 62-65% industry average.
For context, FDA-approved medical screening tests vary widely:
- Shield blood test for colorectal cancer: 83% sensitivity [4]
- Genetic predisposition panels: ≥99% sensitivity [5]
Regulatory Considerations
The FDA’s guidance on AI/ML-based Software as a Medical Device (SaMD) requires “Good Machine Learning Practice,” including transparency and real-world performance monitoring [6]. While this ADE detection system is a research prototype, production deployment would require:
- FDA 510(k) premarket notification or De Novo classification
- Continuous performance monitoring post-deployment
- Documentation of training data and model versions
- Clinical validation studies
The Money Reality
Using GPT-4o-mini (current pricing):
- Input tokens: ~744 per prediction
- Output tokens: ~38 per prediction
- Cost per prediction: ~$0.00013
- Daily cost (10K predictions): $1.34
- Monthly: $40
The optimization cost? $0.05 one-time. Break-even after 131 predictions.
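The per-prediction figure falls straight out of GPT-4o-mini's per-token list prices. Here's the arithmetic as a quick sketch; the prices are assumptions as of this writing, so check OpenAI's pricing page before relying on them:

```ruby
# GPT-4o-mini list prices per 1M tokens (assumed current; verify before use).
INPUT_PRICE_PER_M  = 0.15
OUTPUT_PRICE_PER_M = 0.60

input_tokens  = 744
output_tokens = 38

cost_per_prediction =
  input_tokens  * INPUT_PRICE_PER_M  / 1_000_000 +
  output_tokens * OUTPUT_PRICE_PER_M / 1_000_000
# => ~$0.000134

daily   = cost_per_prediction * 10_000 # 10K predictions/day => ~$1.34
monthly = daily * 30                   # => ~$40
```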
What I Actually Learned
- Sample size matters enormously in medical AI evaluation
- 100% recall claims should trigger skepticism - real medicine is messy
- DSPy.rb optimization works - 30% recall improvement is substantial
- Direct approach beats complexity - one API call vs multi-stage pipelines
- Dataset bias is real - published cases != real-world distribution
Speed vs Accuracy Trade-offs
From the development logs:
- Optimization time: ~30 seconds total
- Bulk of time: Finding better instructions (67%)
- Testing candidates: Only 10% of optimization time
- Bootstrap attempts: 23% (often fails with vague baselines)
MIPROv2 generates and tests multiple instruction variations automatically. The creative work (generating instructions) takes longer than evaluation.
Production Considerations
For medical screening systems:
- 75% recall means 25% of ADEs are missed
- This requires clinical validation and human oversight
- FDA guidance emphasizes continuous performance monitoring
- Real medical records are messier than published literature
Architecture choice:
- Direct predictor: simpler, cheaper, faster
- Multi-stage: better for research where you need intermediate results
- Both achieve similar recall in practice
Bottom Line
DSPy.rb provides systematic LLM optimization for medical applications. Key takeaways:
- 3 training examples can improve recall by 30 percentage points
- 200 test examples reveal realistic performance (75% vs claimed 100%)
- Direct approach beats architectural complexity
- Sample size determines whether your evaluation is meaningful or misleading
The complete implementation: github.com/vicentereig/dspyrb-examples
Reality: 75% recall, 25% missed ADEs, $40/month for 10K daily predictions. DSPy.rb optimizes the prompt, not wishful thinking.
References
1. Chen Z, et al. “Predicting adverse drug event using machine learning based on electronic health records: a systematic review and meta-analysis.” Frontiers in Pharmacology, 2024. DOI: 10.3389/fphar.2024.1497397. Meta-analysis of 59 studies: ML models achieve 62-65% sensitivity, 75% specificity.
2. Bates DW, et al. “The potential of artificial intelligence to improve patient safety: a scoping review.” npj Digital Medicine, 2021;4:54. DOI: 10.1038/s41746-021-00423-6. Comprehensive review of AI applications in patient safety, including ADE detection.
3. Syrowatka A, et al. “Key use cases for artificial intelligence to reduce the frequency of adverse drug events: a scoping review.” The Lancet Digital Health, 2022;4(2):e137-48. DOI: 10.1016/S2589-7500(21)00229-6. Comprehensive review of AI applications for ADE prevention.
4. FDA News Release. “FDA approves first blood test as primary screening option for colorectal cancer.” July 29, 2024. Shield test: 83% sensitivity for cancer, 13% for advanced adenomas.
5. FDA 510(k) Premarket Notification Database. “Invitae Multi-Cancer Panel.” K213950. 2022. Required performance: ≥99.0% positive agreement, ≥99.9% negative agreement.
6. FDA. “Artificial Intelligence and Machine Learning (AI/ML)-Enabled Medical Devices.” FDA Guidance Document, 2021. Good Machine Learning Practice for Medical Device Development: Guiding Principles.