A Data Scientist RLM That Lives in Your Program
… or how to process DataFrames with RLMs and DSPy
Background: Recursive Language Models
After experimenting with Recursive Language Models (RLMs) since their publication, I’ve been continually impressed by the powerful simplicity of the approach. For some of my existing workflows, it’s been a drop-in replacement, dramatically simplifying tasks like extracting long-form data from 100+ page documents.
The core premise of an RLM is to embed an LLM inside a REPL, giving it symbolic access to its inputs. The key distinction from typical coding agents is interactivity versus programmability: coding agents provide interactivity out of the box, while RLMs and DSPy signatures let you “inline” intelligence within a data pipeline.
According to Omar Khattab (the paper’s co-author), a model qualifies as an RLM if:
- The user prompt is a symbolic object (variable, file, etc.) rather than just a sequence of tokens in the Transformer context window,
and
- The model must interact with that symbolic object by writing code in a persistent REPL environment,
and
- The code that the model writes must be able to invoke an LLM/RLM inside the REPL (e.g., within loops or recursive functions), and—crucially—not as a discrete sub-agent tool.
This concept is straightforward. The initial RLM implementation focused on strings accessed as variables in the REPL. But, of course, a REPL can support all sorts of types. Allowing an LLM to work with direct access to native variables lets it freely navigate and explore the contours of the data without you having to specify everything up front. Depending on your use case, this can be incredibly powerful.
The first use case that came to mind for extending RLMs beyond strings was DataFrames. Most data science work involves cleaning, combining, analyzing, or reviewing multiple DataFrames at once, so this seemed like a natural fit. Today, effectively the only way to process structured data with LLMs is to point a code-writing agent (like Claude Code or Codex) at it and have it generate custom scripts for specific tasks. Depending on your objective, this approach can be brittle (for example, encoding a data-cleaning approach as a skill), and while it can work, it’s not exactly ergonomic for integrating into larger systems or pipelines. DSPy shines here again thanks to its flexible abstractions that don’t “get in your way”; without it, you would likely be doing a ton of custom parsing and wrangling to get everything into the format you want to work with.
DSPy, RLMs, and the Sandbox
DSPy added RLM support the same week the work was published (Alex Zhang is a PhD student working with Omar Khattab, the creator of DSPy, both at MIT CSAIL).
To add DataFrame support, I have proposed a PR that establishes a protocol for defining how custom types—like DataFrames—can be exposed to the RLM sandbox.
After much discussion, a protocol was introduced that enables anyone to define their own custom type for use with RLMs. The PR implements a new SandboxSerializable protocol (using typing.Protocol) with four abstract methods: sandbox_setup, to_sandbox, sandbox_assignment, and rlm_preview. Any type implementing this protocol automatically inherits a concrete to_repl_variable() method. This standardizes how types interact with the REPL while also allowing the flexibility to specify imports and fine-tune serialization logic as needed.
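As a rough sketch of the shape this takes (the method signatures and behaviors below are my assumptions; the real definitions live in the PR), a toy type implementing the protocol might look like:

```python
import json
from typing import Protocol


class SandboxSerializable(Protocol):
    """Sketch of the protocol described above; exact signatures are assumptions."""

    def sandbox_setup(self) -> str: ...       # imports the sandbox must run first
    def to_sandbox(self) -> str: ...          # the value serialized to a payload
    def sandbox_assignment(self, name: str) -> str: ...  # code that rebuilds it
    def rlm_preview(self) -> str: ...         # summary shown in the LLM prompt

    def to_repl_variable(self, name: str) -> str:
        # Concrete default inherited by explicit subclasses:
        # setup imports followed by the rebuilding assignment.
        return f"{self.sandbox_setup()}\n{self.sandbox_assignment(name)}"


class WordList(SandboxSerializable):
    """Toy implementation: a list of strings shipped into the sandbox as JSON."""

    def __init__(self, words: list[str]):
        self.words = words

    def sandbox_setup(self) -> str:
        return "import json"

    def to_sandbox(self) -> str:
        return json.dumps(self.words)

    def sandbox_assignment(self, name: str) -> str:
        return f"{name} = json.loads({self.to_sandbox()!r})"

    def rlm_preview(self) -> str:
        return f"WordList: {len(self.words)} items, e.g. {self.words[:3]}"
```

Executing the string returned by `to_repl_variable("words")` inside the sandbox recreates the list as a native variable named `words`.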
This matters because, under the hood, DSPy uses Deno and Pyodide to actually execute REPL code. For types that need extra imports, this lets us specify what’s needed in the sandbox at setup. Fortunately for us, Pyodide supports pyarrow and pandas out of the box.
With this protocol in hand, defining an RLM-compatible DataFrame object in DSPy is straightforward: we implement the four required methods on a thin wrapper class (boilerplate omitted below for brevity; see the full implementation here).
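A minimal sketch of what those methods look like for pandas (round-tripping through CSV for brevity; the real implementation linked above uses the Parquet/pyarrow path and more careful dtype handling):

```python
import io

import pandas as pd


class DataFrame:
    """Sketch of an RLM-compatible DataFrame wrapper (simplified)."""

    def __init__(self, df: pd.DataFrame):
        self.df = df

    def sandbox_setup(self) -> str:
        # Imports the sandbox needs before the assignment code runs;
        # Pyodide ships pandas out of the box.
        return "import io\nimport pandas as pd"

    def to_sandbox(self) -> str:
        # Serialize the frame into a transferable string payload.
        return self.df.to_csv(index=False)

    def sandbox_assignment(self, name: str) -> str:
        # Code that rebuilds the frame inside the sandbox under `name`.
        return f"{name} = pd.read_csv(io.StringIO({self.to_sandbox()!r}))"

    def rlm_preview(self) -> str:
        # Compact summary the LLM sees in its prompt.
        rows, cols = self.df.shape
        schema = "\n".join(f"  {c}: {t}" for c, t in self.df.dtypes.items())
        return (
            f"DataFrame: {rows:,} rows x {cols} columns\n"
            f"Columns:\n{schema}\n"
            f"Sample (first 3 rows):\n{self.df.head(3)}"
        )

    def to_repl_variable(self, name: str) -> str:
        return f"{self.sandbox_setup()}\n{self.sandbox_assignment(name)}"
```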
Now we can pass a native DataFrame object into a DSPy signature and use it with an RLM module.
Example Usage
For a full implementation of this example, see here.
Let’s set up a simulated scenario using mock data generated by generate_cohort_data.py. In this example, we create a bunch of fake data for users, events, and subscriptions, and embed signals we expect the RLM agent to discover (e.g., the worst channel should be paid_campaign_x with the highest churn of ~45%).
With the data saved as Parquet files, we load our DataFrame class and read the files with pandas as usual:
import pandas as pd
import dspy
# Import the DataFrame class we just built
from dataframe import DataFrame
users = pd.read_parquet("users.parquet")
events = pd.read_parquet("events.parquet")
subscriptions = pd.read_parquet("subscriptions.parquet")
Next, we wrap each of these DataFrames so they are recognized as SandboxSerializable:
wrapped_users = DataFrame(users)
wrapped_events = DataFrame(events)
wrapped_subscriptions = DataFrame(subscriptions)
# This is what the LLM sees in its prompt:
print(wrapped_users.rlm_preview())
The LLM will see a preview like this:
DataFrame: 10,000 rows x 7 columns
Columns:
user_id: int64
email: str
name: str
signup_date: datetime64[us]
acquisition_channel: str
plan_at_signup: str
country: str
Sample (first 3 rows):
user_id email name signup_date acquisition_channel plan_at_signup country
0 1 [email protected] Brian Yang 2024-01-29 organic free US
1 2 [email protected] Jonathan Johnson 2024-01-27 p...
With the data loaded, we can define the DSPy signature as usual:
class CohortRetentionAnalysis(dspy.Signature):
    """You are a data analyst investigating why user retention is dropping.

    Investigate the data step by step. Compute retention by cohort, segment by acquisition channel, compare feature usage between retained and churned users, and identify the root cause of churn.
    """

    users: DataFrame = dspy.InputField(desc="User profiles with signup_date, acquisition_channel, country")
    events: DataFrame = dspy.InputField(desc="Feature usage events with user_id, event_type, timestamp")
    subscriptions: DataFrame = dspy.InputField(desc="Subscription records with status, plan, mrr, cancellation_reason")

    overall_churn_rate: float = dspy.OutputField(desc="Overall churn rate as a decimal (e.g. 0.25 for 25%)")
    worst_channel: str = dspy.OutputField(desc="Acquisition channel with highest churn rate")
    key_finding: str = dspy.OutputField(desc="The main insight about what differentiates churned users")
    recommendations: str = dspy.OutputField(desc="2-3 actionable recommendations based on the analysis")
Notice that we can define very specific output fields: a churn rate (float), the worst channel, and free-form recommendations. You can be as precise or loose here as you’d like.
That’s pretty much it—you now have an in-line RLM that will recursively explore the data until it is ready to produce your typed responses.
I recommend enabling verbose output; it’s fascinating to watch the LLM reason through the data step by step:
RLM step-by-step reasoning (excerpts):
... 2026/03/08 15:52:10 INFO dspy.predict.rlm: RLM iteration 1/15 Reasoning: First step is to explore the data structure and key distributions: shapes, date ranges, status values, acquisition channels, plans, nulls, and some sample cancellation reasons/event types. Then I can define churn/retention properly and compute cohort analyses in later steps. Code:
print("users shape:", users.shape)
print("events shape:", events.shape)
print("subscriptions shape:", subscriptions.shape)
... 2026/03/08 15:52:21 INFO dspy.predict.rlm: RLM iteration 2/15 Reasoning: Next I should define churn/retention precisely and compute the main aggregates. Since `subscriptions.status` already indicates active vs cancelled, overall churn rate can be cancelled / total users. For retention by cohort, monthly signup cohorts with retained = active likely works, but I should also be careful about censoring because later cohorts have less time to churn. A good first pass is to compute churn/retention by signup month and by acquisition channel, plus a cohort-by-channel pivot. Then I can identify the worst channel and inspect whether its churn differs by plan mix. After that I’ll compare feature usage between retained and churned users, ideally normalizing by account age / pre-churn activity windows. First, compute merged base table, churn by cohort/channel/plan, and some cross-tabs. Code:
import pandas as pd
import numpy as np
base = users.merge(subscriptions, on="user_id", how="inner", suffixes=("_user", "_sub"))
base["is_churned"] = base["status"].eq("cancelled")
base["is_retained"] = base["status"].eq("active")
base["signup_month"] = base["signup_date"].dt.to_period("M").astype(str)
overall_churn_rate = base["is_churned"].mean()
... 2026/03/08 15:53:15 INFO dspy.predict.rlm: RLM iteration 5/15 Reasoning: We have a very strong signal, but the perfect 0% churn among paid_campaign_x users who used advanced_reports could be suspicious or still valid. Before submitting, I should verify counts/consistency and formulate a concise key finding. I also want to see if paid_campaign_x differs in advanced_reports adoption over cohorts and whether no users who churned ever used advanced_reports in that channel. Then I can safely submit with overall churn rate, worst channel, root cause, and recommendations focused on acquisition quality/onboarding to advanced_reports. Code:
# Final verification and compact summary stats for submission
print("Overall churn rate check:", base["is_churned"].mean())
worst_channel = channel_stats["churn_rate"].idxmax()
print("Worst channel:", worst_channel)
print(channel_stats.loc[worst_channel])
# Verify the striking advanced_reports pattern
pcx = fa[fa["acquisition_channel"] == "paid_campaign_x"].copy()
print("\nPaid Campaign X advanced_reports usage vs churn crosstab:")
print(pd.crosstab(pcx["used_advanced_reports"], pcx["status"], margins=True))
...
Results
After instantiating `dspy.RLM(CohortRetentionAnalysis)` and calling it with the three wrapped DataFrames, you get a typed `result` like the following:
print(f"Overall Churn Rate: {result.overall_churn_rate:.1%}")
print(f"Worst Channel: {result.worst_channel}")
print(f"\nKey Finding:\n {result.key_finding}")
print(f"\nRecommendations:\n {result.recommendations}")
print(f"\nRLM completed in {len(result.trajectory)} iterations")
Overall Churn Rate: 23.8%
Worst Channel: paid_campaign_x
Key Finding:
Churn is 23.77% overall, and paid_campaign_x is the clear outlier with 44.76% churn across every signup cohort and plan. The strongest behavioral signal is early adoption of advanced_reports: users who adopt it churn far less overall (14.7% vs 27.7%), while paid_campaign_x has by far the lowest advanced_reports adoption (18.7% vs ~33-35% for other channels). Within paid_campaign_x, churn is concentrated among users who never adopt advanced_reports, indicating the retention drop is primarily an acquisition quality/onboarding problem rather than a broad product usage decline.
Recommendations:
Pause or tighten paid_campaign_x targeting; redesign onboarding for campaign_x users to drive advanced_reports activation in the first 14 days; align ad and landing page messaging to the product's core reporting use case; and track weekly activation-to-retention funnels by channel with advanced_reports adoption as the leading indicator.
RLM completed in 7 iterations
This is a straightforward example, but the RLM accurately recovered the expected churn rate and correctly identified the worst channel in our mock data.
| Metric | Expected | Actual |
|---|---|---|
| Overall Churn Rate | ~23–25% | 23.8% |
| Worst Channel | paid_campaign_x (≈45% churn) | paid_campaign_x (44.76% churn) |
| Advanced Reports Use | ~33–35% adoption outside PCX | 33–35% for other channels, 18.7% for PCX |
| PCX w/ Advanced Rep. | Should have lowest adoption | 18.7% (lowest, as expected) |
| PCX Churn among Non-Users of Advanced Reports | Highest | Churn concentrated among non-users of feature |
| RLM Reasoning | Finds root cause; links advanced_reports adoption to retention drop in PCX | Correctly identifies retention problem as onboarding/feature adoption |
| Recommendations | Focus on channel and onboarding improvements | Pausing/restricting channel and improving onboarding to boost advanced_reports activation |
The RLM matched the expected churn rates, pinpointed the problematic channel, and accurately traced retention issues to behavioral patterns and onboarding—a strong demonstration of DSPy’s recursive capabilities.
Next Steps
There are several promising directions to continue exploring with RLMs:
- Support for Additional Types: Extend the protocol to other complex data structures—such as images, geospatial data, or domain-specific objects—to enable more advanced analysis.
- Prompt Optimization with GEPA: Use GEPA’s `optimize_anything` to evolve the solver code itself—the prompt template, signature, RLM parameters, and helper tools—optimizing for accuracy across the benchmark. With 257 tasks and a proper train/validation split, there’s room to discover domain-specific improvements that generalize across question types. There’s an interesting branch here where one tries to optimize the RLM trajectories themselves.
- Interactive Debugging & Traceability: Explore improved debugging tools for RLM sessions, add support for step-by-step execution tracebacks, and enable interactive inspection of intermediate results within the sandbox environment. There’s plenty of ongoing work in this area.
- End-to-End Pipelines: Integrate RLM-empowered objects into larger data pipelines or external orchestration systems, automating complex multi-stage analyses.
Appendix: Benchmarking DABench
To measure how well this approach works at scale, I ran the RLM solver against InfiAgent-DABench, which is a benchmark of 257 data analysis questions across 68 CSV files, spanning summary statistics, correlation analysis, distribution analysis, feature engineering, outlier detection, and machine learning. Each question provides a CSV file, a question, constraints, and an expected answer in @field[value] format.
The solver is simple: load the CSV as a DataFrame, pass it to the RLM along with the question and constraints, and let the model write Python code iteratively in the sandbox to arrive at the answer.
class DataAnalysisTask(dspy.Signature):
    """You are a data analyst. Given a dataset and a question, write Python code
    to analyze the data and produce the answer.

    The `data` variable is a pandas DataFrame already loaded in memory.
    Read the constraints carefully for methodology requirements.
    Format your answer exactly as specified in format_spec using @field[value] notation.
    """

    data: DataFrame = dspy.InputField(desc="The dataset as a pandas DataFrame")
    question: str = dspy.InputField(desc="The data analysis question to answer")
    constraints: str = dspy.InputField(desc="Methodology constraints and requirements")
    format_spec: str = dspy.InputField(desc="Required answer format using @field[value] notation")
    answer: str = dspy.OutputField(desc="The answer formatted per format_spec")

def run_task(question, constraints, format_spec, csv_path, verbose=False):
    data = DataFrame(pd.read_csv(csv_path))
    rlm = dspy.RLM(DataAnalysisTask, max_iterations=15, verbose=verbose)
    result = rlm(data=data, question=question, constraints=constraints, format_spec=format_spec)
    return result.answer
That’s the entire solver. No special prompting for different question types, no retry logic, no post-processing—just a generic signature and the RLM.
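Scoring answers in @field[value] notation is mechanical. A minimal checker sketch (my own, not the official DABench harness):

```python
import re


def parse_fields(answer: str) -> dict[str, str]:
    """Extract '@field[value]' pairs from a model answer."""
    return dict(re.findall(r"@(\w+)\[([^\]]*)\]", answer))


def score(predicted: str, expected: str) -> float:
    """Fraction of expected fields matched exactly by the prediction."""
    pred, exp = parse_fields(predicted), parse_fields(expected)
    if not exp:
        return 0.0
    return sum(pred.get(k) == v for k, v in exp.items()) / len(exp)
```

In practice you would also want numeric tolerance (comparing floats to the precision the constraints specify) rather than exact string equality.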
Results
I tested two models across all 257 questions with four parallel workers:
| Model | Easy (82) | Medium (87) | Hard (88) | Total (257) | Avg Iters | Avg Time |
|---|---|---|---|---|---|---|
| Qwen 3.5 397B | 72 (88%) | 79 (91%) | 72 (82%) | 223 (86.8%) | 2.8 | 24.4s |
| MiniMax M2.7 | 75 (91%) | 75 (86%) | 72 (82%) | 222 (86.4%) | 6.1 | 73.7s |
Somewhat surprisingly, both models achieve approximately 87% accuracy with the same solver code. The notable difference is efficiency: Qwen solves tasks in 2.8 iterations on average (24s), while MiniMax requires 6.1 iterations (74s) to reach the same accuracy. This suggests Qwen is better at planning its analysis upfront, while MiniMax takes a more exploratory approach.
The tasks that both models struggle with tend to involve ambiguous ground truth (e.g., population vs. sample standard deviation, or questions where the expected answer depends on specific preprocessing choices not fully specified in the constraints). A handful of questions appear to have debatable answers, which is expected in any benchmark of this size.
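The standard-deviation ambiguity is a concrete example: pandas defaults to the sample estimator (`ddof=1`) while NumPy defaults to the population estimator (`ddof=0`), so an expected answer keyed to one convention marks the other wrong:

```python
import numpy as np
import pandas as pd

s = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

sample_std = s.std()                        # pandas default: ddof=1, ~2.138
population_std = float(np.std(s.to_numpy()))  # NumPy default: ddof=0, 2.0

print(f"sample: {sample_std:.3f}, population: {population_std:.3f}")
```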
Running Your Own Benchmarks
The evaluation harness supports parallel execution and structured JSON output for easy comparison:
# Run a baseline
uv run python eval_with_solver.py --model openrouter/qwen/qwen3.5-397b-a17b -p 4
# Compare results across models
uv run python compare_results.py eval_results/*/
Each run saves structured results to eval_results/<timestamp>/results.json, making it easy to track improvements over time or compare models.
The full code is available here.