I spent the weekend building an Adverse Drug Event detector with DSPy.rb. Small test sets gave me 100% recall—too good to be true. A 200-example evaluation revealed the reality: 75% recall. Scaling to 1,200 examples exposed the full precision/recall tug-of-war.
This post documents the trade-offs. Same code, different sample sizes, wildly different conclusions.
## The Dataset
The ade-benchmark-corpus/ade_corpus_v2 dataset has 23,516 examples from published medical literature—6,821 positive ADEs (29%), 16,695 negative (71%).
The catch: published literature contains mostly obvious cases.
| Patient Report | Has ADE | The Lesson |
|---|---|---|
| “Patient experienced severe nausea after taking aspirin for headache” | true | Obvious case—tiny samples catch these easily |
| “During the first days of arsenic trioxide treatment a rapid decrease in the D-dimers was seen…” | true | Subtle case—only surfaced in 200+ example tests |
| “Patient has been taking lisinopril for 6 months with excellent BP control” | false | Clear negative—easy to classify |
Dataset bias means small samples inflate your metrics. The subtle ADEs that trip up your model only appear when you test with larger, more representative samples.
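One way to keep small runs at least honest about class composition is stratified sampling, so every evaluation sample preserves the corpus's 29/71 split. A minimal pure-Ruby sketch; the `stratified_sample` helper and the `:label` hash key are my own illustration, not the repo's loader:

```ruby
# Stratified sampling sketch: preserve the dataset's class ratio in every
# evaluation sample, so even small runs see both classes in realistic
# proportions. `examples` is assumed to be an array of hashes with a
# :label key ("1" = ADE, "0" = not ADE).
def stratified_sample(examples, size, seed: 42)
  rng = Random.new(seed)
  by_label = examples.group_by { |ex| ex[:label] }
  by_label.flat_map do |_label, group|
    take = (size * group.size.to_f / examples.size).round
    group.sample(take, random: rng)
  end.shuffle(random: rng)
end
```

This fixes the class ratio but not the difficulty distribution: subtle ADEs are still rare, which is why sample size matters even after stratifying.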
## The Direct Approach
Instead of a multi-stage pipeline, I built a single predictor:
```ruby
class ADETextClassifier < DSPy::Signature
  description "Determine if a clinical sentence describes an adverse drug event (ADE)"

  class ADELabel < T::Enum
    enums do
      NotRelated = new("0")
      Related = new("1")
    end
  end

  input do
    const :text, String, description: "Clinical sentence or patient report"
  end

  output do
    const :label, ADELabel, description: "Whether the sentence is ADE-related"
  end
end
```
One API call. The LLM reads the medical text and outputs a structured 0/1 label. The training script wires this up as `DSPy::Predict.new(ADETextClassifier)`.
## What DSPy.rb Does
DSPy.rb doesn’t throw more data at the problem. It systematically improves your prompts.
Baseline prompt:
```
Analyze medical text to detect adverse drug events
```
Vague. No methodology. The LLM guesses.
After MIPROv2 optimization:
```
In a high-stakes clinical scenario, assess the provided clinical sentence
and accurately identify whether it describes an adverse drug event (ADE).
Your determination should clearly indicate '0' for a stable treatment
response or '1' for a serious adverse reaction. Use context clues from
the clinical scenario to inform your decision-making process.
```
Systematic. Medical methodology. Explicit output structure.
## The Metric
Accuracy alone hid the class-imbalance problem, so the metric is asymmetric: 1.0 for a true positive, 0.5 for a true negative, 0.2 for a false positive, and 0.0 for a false negative. That asymmetry is why the optimizer gladly trades precision for recall: the scoring surface values catching ADEs far more than avoiding noisy flags.
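The scoring can be sketched as a tiny pure function. This is a hypothetical helper mirroring the payoffs described above, not necessarily the repo's exact metric code; labels are the signature's "0"/"1" strings:

```ruby
# Asymmetric scoring: true positives are worth the most, false negatives
# nothing, so the optimizer is pushed toward recall over precision.
def ade_metric(expected:, predicted:)
  if expected == "1"
    predicted == "1" ? 1.0 : 0.0   # true positive : false negative
  else
    predicted == "0" ? 0.5 : 0.2   # true negative : false positive
  end
end
```

Note the gap that drives behavior: a missed ADE scores 0.0 while a noisy flag still earns 0.2, so the optimizer always prefers over-flagging to missing.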
## The Precision/Recall Map
I re-ran the optimizer five different ways. Every number comes from the examples or the reproducibility log.
| Run | Config | Baseline | Optimized | Lesson |
|---|---|---|---|---|
| GPT-5, manual 12 trials | 1,200 examples | 79.6 / 58.2 / 82.8 / 68.4 | 76.7 / 53.6 / 92.2 / 67.8 | Scoring that penalizes false negatives trades precision for recall |
| GPT-5, manual 6 trials | 1,200 examples | 79.6 / 58.2 / 82.8 / 68.4 | 80.4 / 59.8 / 81.3 / 68.9 | Low trial count barely moves the needle |
| GPT-5, manual 18 trials | 1,200 examples | 78.8 / 57.1 / 81.3 / 67.1 | 78.3 / 55.5 / 95.3 / 70.1 | More search space found 95% recall, sacrificing accuracy |
| Claude Sonnet 4.5, auto-medium | 600 examples | 72.5 / 48.4 / 100 / 65.3 | 83.3 / 72.0 / 58.1 / 64.3 | Auto presets prefer “explicit causal verb” instructions that clamp recall |
| Claude Sonnet 4.5, auto-light | 10 examples | 33.3 / 33.3 / 100 / 50.0 | 100 / 100 / 100 / 100 | Tiny validation sets create misleading scores—it just memorized the sample |
Format: Accuracy / Precision / Recall / F1
Trial count decides where you land on the precision/recall curve. Auto presets follow whatever the scoring signal rewards. And validation size matters more than the preset does.
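To read the table's quadruples back out of raw confusion counts, the arithmetic is the standard one. A small helper (my own, for illustration; the counts in the usage note are hypothetical, not the actual run's):

```ruby
# Compute the Accuracy / Precision / Recall / F1 quadruple from a
# confusion matrix (tp = true positives, fp = false positives,
# fn = false negatives, tn = true negatives).
def classification_metrics(tp:, fp:, fn:, tn:)
  accuracy  = (tp + tn).to_f / (tp + fp + fn + tn)
  precision = tp.to_f / (tp + fp)
  recall    = tp.to_f / (tp + fn)
  f1        = 2 * precision * recall / (precision + recall)
  { accuracy: accuracy, precision: precision, recall: recall, f1: f1 }
end
```

For example, `classification_metrics(tp: 75, fp: 25, fn: 25, tn: 75)` yields 0.75 across the board; skewing `fn` down and `fp` up reproduces the high-recall/low-precision shape of the optimized runs.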
## The Reality Check
At 8-25 examples, I got 100% recall and 90%+ precision—suspiciously perfect. At 200 examples, recall dropped to 75% and I started missing subtle cases like arsenic trioxide effects. At 1,200 examples (720/240/240 split), the baseline landed at 79.6% accuracy / 58.2% precision / 82.8% recall; the optimized prompt pushed recall to 92.2% at the cost of precision dropping to 53.6%.
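The 100%-recall mirage at tiny sample sizes is exactly what sampling variance predicts. A quick simulation, assuming a classifier whose true recall is 75% (the seed and helper name are mine):

```ruby
# Simulate how evaluation-set size affects the spread of measured recall.
# Each trial draws n positive examples from a model with 75% true recall
# and reports the fraction it "caught".
def recall_estimates(n_examples, trials:, true_recall: 0.75, seed: 42)
  rng = Random.new(seed)
  Array.new(trials) do
    caught = n_examples.times.count { rng.rand < true_recall }
    caught.to_f / n_examples
  end
end

small = recall_estimates(10, trials: 1_000)   # tiny eval sets
large = recall_estimates(200, trials: 1_000)  # 200-example eval sets

# Tiny sets regularly report perfect recall; 200-example sets essentially never do.
puts format("10-example runs at 100%%: %d of 1000", small.count(1.0))
puts format("200-example runs at 100%%: %d of 1000", large.count(1.0))
```

With true recall at 75%, a 10-example set reports 100% recall about 5.6% of the time (0.75^10), while a 200-example set practically never does. Perfect scores on tiny sets are noise, not signal.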
Larger, balanced splits expose subtle ADEs. The custom metric happily pays for recall with a few extra false positives.
## Production Reality
Cost per prediction on GPT-4o-mini runs about $0.00013—roughly $1.34/day for 10K predictions, $40/month. The full optimization run (238k tokens) cost $0.039.
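The cost figures are back-of-envelope arithmetic. The per-prediction value below is the unrounded number implied by the $1.34/day figure (the post rounds it to $0.00013):

```ruby
# Back-of-envelope cost projection for GPT-4o-mini at 10K predictions/day.
cost_per_prediction = 0.000134   # USD; rounds to the $0.00013 quoted above
predictions_per_day = 10_000

daily   = cost_per_prediction * predictions_per_day
monthly = daily * 30

puts format("daily: $%.2f, monthly: $%.0f", daily, monthly)
```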
A 2024 meta-analysis of 59 ADE-detection studies reports 62-65% sensitivity for ML systems. With 92-95% recall, this optimized prompt operates in the same band as cleared screening tools—but even at 95% you still miss 5% of ADEs. Clinical oversight stays mandatory. Real medical records are messier than published literature, and a production detector needs FDA clearance, real-world performance tracking, and clinical validation.
## Bottom Line
Three things:

1. Sample size determines whether your evaluation is meaningful or misleading. The 10-sample run “achieved” 100/100/100/100 by memorizing four sentences.
2. Trial budgets and scoring just move you along the precision/recall curve. Pick the point that matches your clinical tolerance.
3. The direct approach beats complexity. One API call plus prompt search handles these swings better than multi-stage pipelines.
The complete implementation: github.com/vicentereig/dspy.rb/tree/main/examples/ade_optimizer_miprov2
## References
- Chen Z, et al. “Predicting adverse drug event using machine learning based on electronic health records.” Frontiers in Pharmacology, 2024.
- Bates DW, et al. “The potential of artificial intelligence to improve patient safety.” npj Digital Medicine, 2021.
- Syrowatka A, et al. “Key use cases for artificial intelligence to reduce the frequency of adverse drug events.” The Lancet Digital Health, 2022.
- FDA. “Artificial Intelligence and Machine Learning (AI/ML)-Enabled Medical Devices.” FDA Guidance Document, 2021.