Turning Messy Prompts into Repeatable Reasoning Systems
December 2025 – Vicente Reig Rincón de Arellano
Key Takeaways
- Stop string-interpolating prompts. Use typed Signatures to define clear contracts between your app and the LLM—specify inputs, outputs, and docstrings that guide the model’s behavior.
- Separate the “what” from the “how.” Predictors like Predict, ChainOfThought, and ReAct let you swap reasoning strategies without rewriting your entire prompt.
- Measure before you optimize. Evals give you a decision framework: in our summarization evals, ChainOfThought scored 3 points higher overall than Predict, with the gains concentrated in faithfulness and coherence.
- Ruby-first, batteries included. DSPy.rb ships with OpenTelemetry observability, tool-wielding agents, and prompt optimization out of the box.
Turning Messy Prompts into Repeatable Reasoning Systems
In a world that changes every other week
Vicente Reig, https://vicente.services
I am a consultant helping founders solve the challenge of building and scaling reasoning systems.
Vicente Reig, https://vicente.services
(sounds like bee-then-te)
Hi, good evening everyone. Thank you for coming and for having me today. It’s really cold out there, and I appreciate everyone leaving the comfort of their homes to join us.
My name is Vicente. I am a consultant helping founders build AI workflows and agents that are predictable and easy to evolve.
Today, I am walking you through the tools and systems I use to build better products in a world where everything changes every other week.
export const generateEmailPrompt = ({ requesterName, sellerCompanyName, prospectName,
responsibilities, prospectTitle, successCriteria, reference, pains, gains, salesTip,
useCase, pitch }: EmailPromptParams) => {
return `
<context>
You are ${requesterName}, a sales enablement expert with deep understanding of Value Selling.
You work for ${sellerCompanyName}.
</context>
<style>
Use a warm conversational voice and tone. Use natural language appropriate for sales
prospecting. Ensure grammatically correct sentences.
</style>
<pitch>${pitch}</pitch>
<useCase>${useCase}</useCase>
<news>${reference}</news>
<pains>${pains}</pains>
<gains>${gains}</gains>
<responsibilities>${responsibilities}</responsibilities>
<successCriteria>${successCriteria}</successCriteria>
<tasks>
1. Compose a mobile friendly email highlighting how the pitch addresses ${prospectName}'s
responsibilities, pains, gains, and success criteria.
2. Create a personal connection in the first sentence.
3. The salutation must be informal.
4. Include a call to action as a P.S.
</tasks>
<output>
Sales tip: ${salesTip}
Your response must be valid json with "emailSubject" and "emailBody" keys.
Include no additional text, just the json:
{ "reasoning": "...", "emailSubject": "", "emailBody": "Hi...\\n\\nP.S. call to action" }
<emailSubject>
- no more than 5 words, tailored to individual, succinctly create anxiety
- must not contain emojis or special characters
</emailSubject>
<emailBody>
- use a 5th grade reading level, MUST be less than 100 words
- no more than 40 words per paragraph, no more than 9 words per sentence
- take less than 15 seconds to read, active voice, easy to scan
- Use 2 line breaks between paragraphs
- Each thought should be a new paragraph. Use forceful and decisive language.
- First paragraph: personal and relevant based on recent research
- Second paragraph: specific to fears and pain points
- Third paragraph: seller's value proposition (pitch) for a 5th grader
<callToAction>
- Do not ask for a meet or a time to chat
- Ask if it's worth learning more / Ask for their expert opinion / Clarify if priority
</callToAction>
</emailBody>
</output>
`};
Raise your hand if you’ve seen something like this in production. And don’t worry, it doesn’t fit on screen on purpose.
Now raise your hand if you’ve had to modify this when a new model came out. Or when your PM asked for a new feature. Or when the model started hallucinating last Tuesday.
This is the problem. Prompts are load-bearing infrastructure, and we treat them like a string concatenation problem.
And by the way, this is an actual prompt from a recent project I worked on. Never mind the TypeScript.
What if there was a better way to work with prompts?
What if we modeled prompts after functions?
f(x) = y
Input types. Output types. Testable. Composable.
You define what goes in. You define what comes out. The framework handles the rest. You detach yourself from the nuances of dealing with an LLM.
You can compose prompts into workflows. You can start modeling and evolving your products using a higher level vocabulary.
This is the core idea behind the DSPy Paradigm. DSPy stands for: Declarative Self-Improving Python, which I’ve reframed from a Ruby perspective.
From now on, we are going to refer to these functions as Signatures.
A Contract between you and your LLM
# Before: String interpolation
prompt = "Summarize this text: #{text}. Keep it to 2-3 sentences..."
# After: Typed contract
class Summarize < DSPy::Signature
description "Summarize text concisely while preserving key concepts"
input do
const :text, String
end
output do
const :summary, String, description: 'Keep it to 2-3 sentences'
end
end
DSPy.configure do |c|
c.lm = DSPy::LM.new('anthropic/claude-opus-4-5-20251101', api_key: ENV['ANTHROPIC_API_KEY'])
end
summarizer = DSPy::Predict.new(Summarize)
summarizer.call(text: 'In a place in La Mancha, whose name I do not wish to remember...').summary
We used to live in a world where we’d write these monolithic prompts by hand.
Now, we write the contract that governs the relationship between your app and the LLM. You describe the goal of the task, how you are going to provide the information needed to complete it, and how you want the result back.
The framework compiles this into a prompt, so you never write one by hand again. You get automatic input validation and structured outputs via Sorbet types (and a few more things out of the box that we will see later).
When the models change, your contract stays the same. When the prompting technique changes, the compiled prompt adapts.
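For example, here is a minimal sketch of what that buys you: swapping providers is a one-line configuration change, while the Signature and the calling code stay untouched (the gpt-4o-mini line is just an illustration; any model DSPy::LM supports works the same way).
# Same contract, different provider: only the LM configuration changes.
DSPy.configure do |c|
  c.lm = DSPy::LM.new('openai/gpt-4o-mini', api_key: ENV['OPENAI_API_KEY'])
end
summarizer = DSPy::Predict.new(Summarize)
summarizer.call(text: 'In a place in La Mancha, whose name I do not wish to remember...').summary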
[pointing at slides] So what happens when we want better summaries?
Better Summaries!
One Signature, Two Predictors
# Same contract, different strategies
predict = DSPy::Predict.new(Summarize)
cot = DSPy::ChainOfThought.new(Summarize) # adds reasoning output field
# Both produce the same output type
article = 'In a place in La Mancha, whose name I do not wish to remember...'
predict.call(text: article).summary # => String
cot.call(text: article).summary # => String
cot.call(text: article).reasoning # => String (bonus)
Same Signature. Different strategies. Same typed output.
You learn that Chain of Thought is a good and simple technique to improve the results.
We can compare how the same contract performs under different prompting techniques.
Predict goes straight to the answer. ChainOfThought adds a reasoning step.
And we can run one after the other and compare them. Now is it better?
Now, is it better for real?
(heckler in the background) Aren’t we just vibing all this?
The truth is that we really don’t know!
What are we gonna do, test it in production? We can do better!
That’s where Evals come in.
Are these Summaries any good? (1/4)
Evals as a Decision Framework
class EvaluatorMindset < T::Enum
  enums do
    Critical = new('critical')
    Balanced = new('balanced')
    Generous = new('generous')
  end
end
class GroundedSummary < T::Struct
  const :source_text, String
  const :summary, String
end
class EvaluateSummary < DSPy::Signature
  description "Evaluate summary quality"
  input do
    const :grounded_summary, GroundedSummary
    const :mindset, EvaluatorMindset
  end
  output do
    const :faithfulness, Integer,
      description: "Score 1-5: factually accurate?"
    const :relevance, Integer,
      description: "Score 1-5: captures key info?"
    const :coherence, Integer,
      description: "Score 1-5: well-structured?"
    const :fluency, Integer,
      description: "Score 1-5: readable?"
  end
end
We rolled up our sleeves and wrote another Signature. We can model the prompt and the context using plain Ruby structures.
An evaluating function takes the summary we generated alongside the original text, and we ask it to score it across the four G-Eval-style dimensions you see in the output: faithfulness, relevance, coherence, and fluency.
Now we’ve got something we can work with.
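Before wiring this into a full eval run, a quick sanity check helps: here is a minimal sketch that scores a single summary by hand, reusing the article excerpt from earlier.
# Score one summary by hand before running full evals.
judge = DSPy::ChainOfThought.new(EvaluateSummary)
evaluation = judge.call(
  grounded_summary: GroundedSummary.new(
    source_text: article,
    summary: DSPy::Predict.new(Summarize).call(text: article).summary
  ),
  mindset: EvaluatorMindset::Balanced
)
evaluation.faithfulness # => Integer, 1-5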
Which predictor is better?
Let’s do a quick poll first!
- Who thinks Chain of Thought produces better output? [Wait for hands]
- Who thinks just asking for the straight answer is better? [Wait for hands]
Alright, how do we measure quality?
Are these Summaries any good? (2/4)
Metrics
def create_judge_metric
  judge = DSPy::ChainOfThought.new(EvaluateSummary)
  ->(_example, prediction) do
    evaluation = judge.call(
      grounded_summary: prediction.grounded_summary,
      mindset: EvaluatorMindset::Critical
    )
    # Average the four dimensions into a single 1-5 score
    overall_score = [
      evaluation.faithfulness, evaluation.relevance,
      evaluation.coherence, evaluation.fluency
    ].sum / 4.0
    {
      passed: overall_score >= 3.5,
      score: overall_score / 5.0, # Normalize to 0-1
      faithfulness: evaluation.faithfulness,
      # ...
    }
  end
end
LLMs judging free form text
We need a metric to score the result.
Defining these metrics is where teams spend a good chunk of the effort.
There’s no one metric to rule them all. It really depends on the problem you are solving.
Some examples:
- Here we are using an LLM as a judge to evaluate free-form text.
- LLMs shine when classifying unstructured information; in that case we’d use accuracy, precision, and recall (see the sketch after this list).
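In contrast to the LLM-as-judge lambda above, a classification metric can be plain Ruby with no extra LLM call. This is a sketch under one assumption: each example exposes an expected_label accessor, a name I made up for illustration.
# Deterministic metric for a classification Signature.
# `expected_label` is a hypothetical accessor on your example objects.
exact_match_metric = ->(example, prediction) do
  correct = prediction.label == example.expected_label
  {
    passed: correct,
    score: correct ? 1.0 : 0.0
  }
end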
Are these Summaries any good? (3/4)
Metrics
llm_judge_metric = create_judge_metric
articles = [ ... ]
# Evaluate Predict
puts "Evaluating DSPy::Predict..."
predict = DSPy::Predict.new(Summarize)
predict_evaluator = DSPy::Evals.new(predict, metric: llm_judge_metric)
predict_metrics = predict_evaluator.evaluate(articles)
# Evaluate ChainOfThought
puts "Evaluating DSPy::ChainOfThought..."
cot = DSPy::ChainOfThought.new(Summarize)
cot_evaluator = DSPy::Evals.new(cot, metric: llm_judge_metric)
cot_metrics = cot_evaluator.evaluate(articles)
print_results(predict_metrics, cot_metrics)
A natural sequence of low-risk and reversible decisions.
Evals guide your team’s course of action, informing next steps in the development.
What I really like is that they introduce planned decision checkpoints: a sequence of low-risk, reversible decisions that create opportunities to reassess and change course as needed.
They tie everything together: predictor, examples, and metric in a single API. You can run both approaches through the same evaluator and compare the numbers.
Alright, let’s take a look at the results!
Are these Summaries any good? (4/4)
The Results
Overall Scores
Predict: 93.0%
ChainOfThought: 96.0%
---
Improvement: +3.0 pts
Per Dimension
| Dimension | Delta |
|---|---|
| Faithfulness | +0.4 |
| Coherence | +0.2 |
| Relevance | +0.0 |
| Fluency | +0.0 |
CoT wins on faithfulness and coherence. Now you know.
CoT barely wins. The interesting part: it wins on faithfulness and coherence, not relevance or fluency. So there’s a trade-off for y’all to make between the extra reasoning tokens and those gains. Now you know.
The real win isn’t the number. It’s that you can measure it.
Now I mentioned DSPy.rb came with a few tools out of the box.
Batteries Included 1/3
Built-in Observability

Observability is one of those batteries: you shouldn’t need Neo’s Matrix vision to understand what is going on in production.
Every call is traced and can be shipped to any OpenTelemetry sink, like Langfuse here.
When something breaks in production, you can see exactly which LLM call failed and review inputs, outputs, costs, and response times.
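As a sketch, and assuming the exporter reads Langfuse’s standard environment variables (verify the exact setup against the DSPy.rb observability docs), enabling tracing can be as small as setting credentials:
# Assumed setup: standard Langfuse credentials read from the environment.
# Double-check the variable names against the DSPy.rb observability docs.
ENV['LANGFUSE_PUBLIC_KEY'] ||= 'pk-lf-...'
ENV['LANGFUSE_SECRET_KEY'] ||= 'sk-lf-...'
ENV['LANGFUSE_HOST']       ||= 'https://cloud.langfuse.com'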
Batteries Included 2/3
Tool wielding agentic loops
require 'dspy'
DSPy.configure do |c|
c.lm = DSPy::LM.new('openai/gpt-4o-mini', api_key: ENV['OPENAI_API_KEY']) # global
end
research_agent = DSPy::ReAct.new(ResearchAssistant, tools: [
WebSearchTool.new,
CalculatorTool.new,
DataAnalysisTool.new
])
research_agent.configure do |c|
c.lm = DSPy::LM.new('anthropic/claude-opus-4-5-20251101', api_key: ENV['ANTHROPIC_API_KEY'])
end
result = research_agent.call(
topic: "Heating up frozen pork buns with a Dutch oven",
depth: ResearchDepth::Detailed
)
puts "Summary: #{result.summary} Confidence: #{result.confidence}"
DSPy.rb comes with three core Predictors, so you have a place to start. They are the building blocks for your workflows and agents.
We will cover ReAct in a separate talk. In the meantime, you can visit the project’s blog, where I walk through a bunch of examples!
Batteries Included 3/3
Let the Model Write Your Prompts
class ADETextClassifier < DSPy::Signature
description "Determine if a clinical sentence describes an adverse drug event (ADE)"
input { const :text, String }
output { const :label, ADELabel }
end
# config/optimized_prompts/mipro_v2/20250721112200_ade_text_classifier.txt
- Determine if a clinical sentence describes an adverse drug event (ADE)
+ Evaluate the provided clinical sentence and classify it as describing an adverse
+ drug event (ADE) by identifying key indicators of negative drug reactions. Assign
+ a label of '1' for adverse events and '0' for cases without adverse effects. Be
+ thorough in your analysis, considering the context of medications and pre-existing
+ health issues.
Accuracy : +3.33 pp
Precision: +4.17 pp
Recall : +6.67 pp
F1 Score : +5.13 pp
Last but not least: I said you’d never write a prompt again!
Now we can leverage Evals to let the model optimize our prompts. Once you’ve defined your metrics, you can steer the model toward improving your prompts.
This piece is adapted from a classifier project I worked on in the past.
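For shape only, here is a hypothetical sketch of that loop. The DSPy::MIPROv2 class name and the compile/trainset interface are assumptions modeled on the Python DSPy optimizer (the mipro_v2 directory above suggests the Ruby port mirrors it); check the DSPy.rb optimization docs for the real names.
# Hypothetical sketch; the class name and interface below are assumptions.
classifier = DSPy::Predict.new(ADETextClassifier)
accuracy = ->(example, prediction) { prediction.label == example.expected_label } # expected_label is illustrative
optimizer = DSPy::MIPROv2.new(metric: accuracy)               # assumed class name
optimized = optimizer.compile(classifier, trainset: examples) # `examples` = your labeled training set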
The MVC Moment for AI
in a world where everything changes.
<div class="text-4xl mb-4">Signatures</div>
<div class="text-gray-400">Contracts that survive model changes</div>
<div class="text-4xl mb-4">Predictors</div>
<div class="text-gray-400">Strategies you can swap</div>
<div class="text-4xl mb-4">Evals</div>
<div class="text-gray-400">Decisions you can measure</div>
gem install dspy
We have seen Contracts that survive model changes. Strategies you can swap. Decisions you can measure. Prompts that self-improve.
20 years ago the Model View Controller Architecture gave us structure to build web apps and web services. The DSPy Paradigm gives us today the structure for AI.
Thank you all for your time.
Happy to take questions if there’s time!
© 2025 Vicente Reig Rincón de Arellano
