I used a network of collaborative agents to solve a Jane Street puzzle

A new shift in agentic use: welcome to the outer-loop era
AI is going through many shifts in how it's used. Until recently, many believed that custom pre-training on company data was the way to go, sending hordes of forward-deployed engineers to fine-tune LLMs on their clients' data.
This method has had very limited success: pre-training is not well understood, and it remains costly and compute-intensive.
Then came RAG, tooling, and MCPs, which let an agent retrieve and use information from a myriad of sources. Bingo. No need for intensive pre-training on costly GPUs; everything happens on demand. The agent needs to access your email server? Just ask the email MCP server. It needs information from that long-forgotten Excel spreadsheet you wrote two years ago? Just ask the Google Drive MCP server!
How do we make it do math?
AI tooling is not limited to the corporate sphere. If I asked you to draw a perfect 90-degree angle with your bare hands, you would probably struggle for a long time. Give you a pen and a set square, and the job becomes much easier!
That is precisely what we want to do with AI agents: give them the right tooling to solve a research problem. We drew inspiration from Stanislas Polu's talk (https://www.youtube.com/watch?v=9OjcAYsncpw&list=PLMW8Xq7bXrG5IWMNP9xWe4K-AzOL5jDlQ&index=5) and from this paper on a workflow that let Gemini solve IMO problems (https://arxiv.org/abs/2507.15855). The goal is to create a microcosm that replicates a real-life research lab.
Here is the list of tools we give the agents, each backed by its own MCP server:
- publication MCP server: lets an agent list, publish, and submit publications, and list review requests.
- goal solution MCP server: lets an agent report a publication as the final or current best solution to the research goal.
- system prompt self-edit MCP server: lets an agent edit and append data to its own system prompt. In practice, we observe that agents mainly use it to store a TODO list of upcoming tasks (such as reviewing another publication) along with notes about their findings on the problem so far.
- scripting MCP server: lets an agent write and run Python scripts in a sandboxed environment.
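To make the first of these concrete, here is a minimal in-memory sketch of what the publication server's surface might look like. This is purely illustrative: `PublicationStore` and its method names are assumptions, not the repo's actual API, and the real server exposes these operations as MCP tools backed by a database.

```typescript
// Illustrative sketch of the publication server's operations (not the real API).
type Publication = {
  ref: string;
  title: string;
  body: string;
  status: "submitted" | "published";
};

class PublicationStore {
  private pubs = new Map<string, Publication>();

  // Submit a draft; returns a short reference id (e.g. `ab3x`).
  submit(title: string, body: string): string {
    const ref = Math.random().toString(36).slice(2, 6);
    this.pubs.set(ref, { ref, title, body, status: "submitted" });
    return ref;
  }

  // List every publication, whatever its status.
  list(): Publication[] {
    return [...this.pubs.values()];
  }

  // Mark a submission as published (in the real system, after peer review).
  publish(ref: string): void {
    const pub = this.pubs.get(ref);
    if (!pub) throw new Error(`unknown publication ${ref}`);
    pub.status = "published";
  }
}
```

The other three servers have similarly small surfaces; the point is that each tool does one narrow thing, and the interesting behavior emerges from how agents combine them.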
Multi-Agent Collaboration Protocol:
Agents operate in an asynchronous, publication-based collaboration model with no direct inter-agent communication. The orchestration loop (`src/runner.ts`) runs each agent through a continuous tick cycle:
1. Context Check: Inject automated messages with timestamp, submitted publication statuses, and pending review requests
2. Prompt Rendering: Combine the problem goal with the agent's evolved system prompt
3. Model Call: Invoke the LLM with available MCP tools
4. Concurrent Tool Execution: Execute all tool calls in parallel for efficiency
5. Result Storage: Persist messages and tool results in the database
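The five steps above can be sketched as a single `tick` function. This is a simplified stand-in for the real loop in `src/runner.ts`: the types, `callModel`, and `executeTool` are placeholders, and messages go to an in-memory history rather than a database.

```typescript
// Hypothetical sketch of the per-agent tick cycle; names are illustrative.
type ToolCall = { name: string; args: Record<string, unknown> };
type ToolResult = { name: string; output: string };

interface Agent {
  systemPrompt: string;
  history: string[];
}

// Stand-in for the real LLM call: returns the tool calls the model wants to make.
async function callModel(prompt: string): Promise<ToolCall[]> {
  return [{ name: "scripting-run_script", args: { code: "print(1 + 1)" } }];
}

// Stand-in for dispatching a call to the matching MCP server.
async function executeTool(call: ToolCall): Promise<ToolResult> {
  return { name: call.name, output: `ran ${call.name}` };
}

async function tick(agent: Agent, goal: string): Promise<ToolResult[]> {
  // 1. Context check: inject an automated status message.
  agent.history.push(`[status] time=${Date.now()} pendingReviews=0`);
  // 2. Prompt rendering: combine the problem goal with the evolved system prompt.
  const prompt = `${goal}\n\n${agent.systemPrompt}`;
  // 3. Model call.
  const calls = await callModel(prompt);
  // 4. Concurrent tool execution: all calls run in parallel.
  const results = await Promise.all(calls.map(executeTool));
  // 5. Result storage (here: in-memory history instead of the database).
  results.forEach((r) => agent.history.push(`[tool] ${r.name}: ${r.output}`));
  return results;
}
```

Running many agents just means running many independent `tick` loops; no agent ever awaits another.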
Tools are namespaced using the pattern `{server_name}-{tool_name}` (e.g., `publications-submit_publication`), with each agent receiving isolated MCP client instances for the 4 core servers.
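A pair of tiny helpers shows the namespacing convention, assuming server names never contain hyphens so the first `-` is the separator (an assumption on my part, not something the repo guarantees):

```typescript
// Build and parse `{server_name}-{tool_name}` identifiers.
function namespaceTool(server: string, tool: string): string {
  return `${server}-${tool}`;
}

function parseToolName(qualified: string): { server: string; tool: string } {
  const i = qualified.indexOf("-"); // split on the FIRST hyphen only
  if (i < 0) throw new Error(`not a namespaced tool name: ${qualified}`);
  return { server: qualified.slice(0, i), tool: qualified.slice(i + 1) };
}
```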
This architecture creates an emergent research environment where agents coordinate solely through the publication system - submitting papers, citing prior work, peer reviewing submissions, and building on each other's findings without direct messaging.
Solving the Jane Street Problem
Using those tools, the network of agents has been able to collaborate and work together to find the answer to this Jane Street maths puzzle problem: https://www.janestreet.com/puzzles/robot-baseball-index/
The peer review system plays a crucial role in filtering out incorrect approaches. Safety mechanisms prevent agents from submitting new work when they have pending reviews, ensuring they remain engaged with the collaborative process. When reviewing publications, agents grade them on a 4-point scale, and the consensus logic ensures only well-vetted solutions get published.
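One plausible shape for that consensus logic is a simple threshold on the 4-point grades. The rule below (minimum review count, average grade threshold) is my guess at the mechanism, not the repo's actual implementation:

```typescript
// Hypothetical consensus rule: a submission is published once it has at least
// `minReviews` grades and their average on the 1-4 scale meets `threshold`.
function reviewConsensus(
  grades: number[],
  minReviews = 2,
  threshold = 3
): "published" | "rejected" | "pending" {
  if (grades.some((g) => g < 1 || g > 4)) {
    throw new Error("grades must be on the 1-4 scale");
  }
  if (grades.length < minReviews) return "pending";
  const avg = grades.reduce((a, b) => a + b, 0) / grades.length;
  return avg >= threshold ? "published" : "rejected";
}
```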
The citation system automatically extracts references from publication content, building a knowledge graph that shows how solutions evolved from earlier work. Each publication receives a unique 4-character reference ID (e.g., `ab3x`), and agents can cite prior work using bracket notation `[ab3x]` or multiple citations `[ab3x,cd4y]`.
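Given the bracket notation above, citation extraction can be a single regular expression pass. This sketch assumes reference IDs are exactly four lowercase alphanumeric characters, as in the `ab3x` example:

```typescript
// Extract citation refs like [ab3x] or [ab3x,cd4y] from publication text,
// deduplicating while preserving first-seen order.
function extractCitations(text: string): string[] {
  const refs = new Set<string>();
  for (const match of text.matchAll(/\[([a-z0-9]{4}(?:,[a-z0-9]{4})*)\]/g)) {
    match[1].split(",").forEach((ref) => refs.add(ref));
  }
  return [...refs];
}
```

Running each new publication through this gives the edges of the knowledge graph: publication → cited refs.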
Beyond peer review, the solution consensus mechanism via the `goal_solution` MCP allows agents to explicitly mark publications as the current best answer. Agents provide rationale for their choice (whether this is the first solution, improves on a previous one, corrects a wrong approach, or introduces a new method), creating an audit trail of how the network's understanding evolved.
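The audit trail can be modeled as an append-only list of reports, where the latest entry is the current best answer. The class and rationale labels below are illustrative; only the four rationale categories come from the description above:

```typescript
// Hypothetical sketch of the goal-solution audit trail.
type Rationale = "first_solution" | "improvement" | "correction" | "new_method";

interface SolutionReport {
  ref: string;        // publication reference id, e.g. "ab3x"
  rationale: Rationale;
  note: string;       // the agent's stated reason for this choice
}

class GoalSolution {
  private trail: SolutionReport[] = [];

  // Record a new "current best answer" claim.
  report(ref: string, rationale: Rationale, note: string): void {
    this.trail.push({ ref, rationale, note });
  }

  // The most recent claim wins.
  current(): SolutionReport | undefined {
    return this.trail[this.trail.length - 1];
  }

  // Full history of how the network's answer evolved.
  history(): SolutionReport[] {
    return [...this.trail];
  }
}
```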
Self-Evolution in Action
The system prompt self-edit capability enables agents to maintain persistent memory and evolve their reasoning strategies. The `append` method lets agents add notes, while the `edit` method performs surgical modifications with safety checks - requiring exact string matches and validating that the expected number of replacements occurs.
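The two methods can be sketched as follows. The exact-match and replacement-count checks mirror the safety behavior described above; the class itself and its signatures are assumptions for illustration:

```typescript
// Hypothetical sketch of the prompt self-edit tools.
class SelfPrompt {
  constructor(public text: string) {}

  // `append`: add a note on a new line.
  append(note: string): void {
    this.text += `\n${note}`;
  }

  // `edit`: replace an exact string, validating the occurrence count first.
  edit(target: string, replacement: string, expectedCount = 1): void {
    const count = this.text.split(target).length - 1;
    if (count !== expectedCount) {
      throw new Error(
        `expected ${expectedCount} occurrence(s) of target, found ${count}`
      );
    }
    this.text = this.text.split(target).join(replacement);
  }
}
```

Failing loudly on a count mismatch is what makes the edit "surgical": if the prompt has drifted since the agent last read it, the edit is rejected instead of silently corrupting memory.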
In practice, we observe agents using their evolving prompts to:
- Maintain prioritized TODO lists of publications to review
- Record key insights about the problem structure
- Track which approaches have been tried and failed
- Store intermediate results and conjectures
These modifications take effect immediately on the next agent tick, with full version history preserved in the `evolutions` table. This creates a form of working memory that persists across the agent's entire lifecycle in an experiment.
So AI can do research now?
This is certainly an exciting time: instead of merely interpolating answers from existing human knowledge, AI is starting to make new findings. Similar work is being pursued by top mathematicians such as Terence Tao, aiming at scientific discovery at scale: https://terrytao.wordpress.com/2025/11/05/mathematical-exploration-and-discovery-at-scale/.
Link to the code: https://github.com/RichaoAlexandre/srchd_puzzles
