We implemented the first autonomous agent-to-agent payment system for training feedback, where AI agents directly compensate specialist evaluators for improving their outputs.
Why this matters for scalable AI training: As language models approach human-level performance on complex tasks, the bottleneck shifts from compute to high-quality, specialized evaluation. No single team can provide the breadth of expertise required across domains like medicine, law, engineering, and science. Our work demonstrates that models can autonomously seek out and pay for expert feedback, creating the infrastructure for scalable, diverse evaluation at low cost.
We demonstrate that autonomous agents can engage in peer-to-peer evaluation and compensation, facilitating iterative improvement through economic incentives. This approach enables real-time evaluation markets at a cost orders of magnitude below the $2-5 typically paid for a single human annotation, with settlement times under 3 seconds on Base. In our framework, student agents submit outputs for evaluation, receive structured feedback from specialist evaluators, and autonomously process micropayments as compensation for the evaluative service.
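As a minimal sketch of this loop, the exchange can be modeled as three messages: an evaluation request, structured feedback, and a payment receipt. The class and field names below are illustrative choices of ours, not a fixed spec.

```python
from dataclasses import dataclass

@dataclass
class EvaluationRequest:
    """What a student agent submits to a specialist evaluator."""
    output: str                 # the model output to be evaluated
    task_context: str           # the prompt or task description
    payment_commitment: float   # USDC the student commits to pay

@dataclass
class EvaluationFeedback:
    """What the evaluator returns."""
    critique: str               # detailed natural-language critique
    reward: float               # scalar reward signal for training
    confidence: float           # evaluator's confidence in its own judgment

@dataclass
class PaymentReceipt:
    """Settlement record for the micropayment."""
    tx_hash: str                # on-chain transaction hash on Base
    amount_usdc: float          # amount transferred, e.g., 0.001
```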
Large language model training has been limited by the cost and scalability of human feedback. RLHF requires expensive human annotators, Constitutional AI needs massive annotation efforts, and self-evaluation creates echo chambers that plateau after 2-3 iterations. The primary limitation in scaling model performance is not computational resources, but rather the availability of high-quality, diverse evaluation signals.
Our approach introduces economic incentives to create markets for AI evaluation. Instead of training static reward models once, specialist evaluators compete in real-time to provide the highest quality feedback, creating selection pressure for evaluation quality while achieving orders-of-magnitude cost reduction.
Recent research on multi-agent fine-tuning has highlighted a critical failure mode: when a single model generates its own training data, it rapidly converges to its own patterns, leading to diminished diversity and stagnation after a few iterations (Xu et al., 2024 [2]). Their CGPO framework mitigates reward hacking with a "Mixture of Judges," employing specialized judges for each task type, though judge assignments remain static.
Our advancement: Models in our system dynamically select evaluators based on published agent cards, allowing for adaptive specialization. For instance, a coding model may select security evaluators for cryptography problems and performance specialists for optimization tasks. Mathematical models may choose formal logic evaluators for proofs and computational specialists for numerical work.
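To make the selection step concrete, here is a simplified sketch of an agent card and a matching heuristic. The schema and scoring function are hypothetical; real cards can carry richer metadata such as pricing tiers or evaluation history.

```python
from dataclasses import dataclass

@dataclass
class AgentCard:
    """Published description of an evaluator's expertise (hypothetical schema)."""
    name: str
    domains: list[str]      # e.g., ["security", "cryptography"]
    price_usdc: float       # fee per evaluation
    avg_rating: float       # running quality score from past students

def select_evaluators(cards, task_tags, k=2):
    """Rank evaluators by domain overlap with the task, then by rating and price."""
    def score(card):
        overlap = len(set(card.domains) & set(task_tags))
        return (overlap, card.avg_rating, -card.price_usdc)
    return sorted(cards, key=score, reverse=True)[:k]

# A coding model facing a cryptography problem prefers the security specialist.
cards = [
    AgentCard("sec-eval", ["security", "cryptography"], 0.001, 4.8),
    AgentCard("perf-eval", ["performance", "optimization"], 0.001, 4.6),
    AgentCard("style-eval", ["readability"], 0.0005, 4.2),
]
print([c.name for c in select_evaluators(cards, ["cryptography"])])
# -> ['sec-eval', 'perf-eval']
```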
This dynamic selection mechanism introduces evolutionary pressure: high-quality evaluators attract more students and receive greater compensation, while underperforming evaluators must improve or exit. Unlike fixed assignments, our marketplace structure enables the discovery of optimal evaluator combinations through economic selection, maintaining diversity and driving continual improvement.
Lee et al. (2023)[1] identified a core limitation of static reward models: they become misaligned as the policy improves. For example, consider training a coding model. Early in training, the model generates simple functions with basic loops. The reward model is trained on these outputs. As training progresses, the model produces more sophisticated code with advanced design patterns, error handling, and optimizations. However, the reward model, having only seen early-stage outputs, cannot reliably evaluate these advanced patterns. This distribution shift renders the reward model unreliable precisely when robust evaluation is most needed.
Our solution: Instead of static judges, we employ specialist evaluators that adapt in real time. They observe the model's actual outputs as capabilities evolve, ensuring that evaluation criteria remain relevant and robust.
Our implementation demonstrates a multi-agent system where student models query specialist evaluator models for training signals. Each specialist publishes an "agent card" describing their expertise. Student models analyze these cards and dynamically select which evaluators will provide the most valuable feedback for each training episode.
When a student model requires evaluation, it submits the model output, task context, and payment commitment. Evaluator models return structured feedback including detailed critique, scalar reward signals, and confidence scores, receiving $0.001 USDC payments that settle in approximately 3 seconds on Base.
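For the settlement leg, the rough sketch below shows the shape of a $0.001 USDC transfer on Base using web3.py. The token address and key handling are placeholders, and our production payment path may differ; this is only meant to illustrate the on-chain call, not document our exact implementation.

```python
# Rough sketch of settling the $0.001 USDC micropayment on Base with web3.py.
# The token address and private key are placeholders; do not use as-is.
from web3 import Web3

BASE_RPC = "https://mainnet.base.org"   # public Base RPC endpoint
USDC_ADDRESS = "0x..."                  # USDC token contract on Base (placeholder)
ERC20_ABI = [{                          # minimal ABI: just transfer()
    "name": "transfer", "type": "function", "stateMutability": "nonpayable",
    "inputs": [{"name": "to", "type": "address"},
               {"name": "amount", "type": "uint256"}],
    "outputs": [{"name": "", "type": "bool"}],
}]

def pay_evaluator(private_key, evaluator_address, amount_usdc=0.001):
    w3 = Web3(Web3.HTTPProvider(BASE_RPC))
    account = w3.eth.account.from_key(private_key)
    usdc = w3.eth.contract(address=USDC_ADDRESS, abi=ERC20_ABI)
    amount = int(amount_usdc * 10**6)   # USDC has 6 decimals: 0.001 USDC = 1000 units
    tx = usdc.functions.transfer(evaluator_address, amount).build_transaction({
        "from": account.address,
        "nonce": w3.eth.get_transaction_count(account.address),
    })
    signed = account.sign_transaction(tx)
    raw = signed.raw_transaction        # .rawTransaction on older web3.py versions
    tx_hash = w3.eth.send_raw_transaction(raw)
    # Base blocks land roughly every 2 seconds, so receipts arrive quickly.
    return w3.eth.wait_for_transaction_receipt(tx_hash)
```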
This addresses three fundamental limitations in current post-training approaches:
Distribution shift resilience: Rather than training reward models on fixed datasets, specialist evaluators adapt their criteria as they observe evolving policy outputs, maintaining relevance across capability improvements.
Diversity preservation: Dynamic, multi-agent evaluation can help maintain diversity of feedback by exposing models to a broader range of evaluative criteria and perspectives, mitigating some of the echo chamber effects associated with self-evaluation (see the aggregation sketch after this list).
Scalable quality: Market dynamics create selection pressure for evaluation quality while achieving orders-of-magnitude cost reduction over human annotation.
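As a toy illustration of the diversity point above, feedback from several specialists can be folded into a single training signal. The confidence-weighted average here is just one possible design choice, not a claim about our production setup.

```python
def aggregate_feedback(feedbacks):
    """Combine scalar rewards from several specialist evaluators into one
    training signal, weighting each reward by the evaluator's confidence."""
    total = sum(f["confidence"] for f in feedbacks)
    if total == 0:
        return 0.0
    return sum(f["reward"] * f["confidence"] for f in feedbacks) / total

# Three specialists score the same output from different perspectives.
feedbacks = [
    {"evaluator": "sec-eval",   "reward": 0.40, "confidence": 0.9},
    {"evaluator": "perf-eval",  "reward": 0.75, "confidence": 0.6},
    {"evaluator": "style-eval", "reward": 0.80, "confidence": 0.5},
]
print(round(aggregate_feedback(feedbacks), 3))  # -> 0.605
```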
To demonstrate the flexibility and real-world applicability of our model-to-model payment system, we integrated it into Nous Research's Atropos project. Atropos is an open-source framework for collecting and evaluating LLM trajectories through diverse reinforcement learning environments. By embedding our payment protocol, Atropos environments can now support automated economic incentives—where specialist evaluators are compensated in real time for providing high-quality feedback to student models.
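The schematic below shows where a payment hook could sit in an environment's scoring step. The class and method names are simplified stand-ins of ours, not the actual Atropos BaseEnv interface; see the pull request linked below for the real integration.

```python
# Schematic of a pay-to-play scoring step inside an Atropos-style environment.
# Names are simplified stand-ins, not the real Atropos API.

class PayToPlayEnv:
    def __init__(self, evaluators, wallet, fee_usdc=0.001):
        self.evaluators = evaluators   # specialist evaluator clients
        self.wallet = wallet           # settles USDC payments on Base
        self.fee_usdc = fee_usdc

    def score(self, trajectory):
        """Buy feedback from each specialist, pay them, and return the mean reward."""
        rewards = []
        for evaluator in self.evaluators:
            feedback = evaluator.evaluate(trajectory)           # critique, reward, confidence
            self.wallet.pay(evaluator.address, self.fee_usdc)   # compensate in USDC
            rewards.append(feedback["reward"])
        return sum(rewards) / len(rewards)
```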
You can view our full implementation, our main contributions to Atropos, and the accompanying discussion in the Atropos repository PR here: Pay-to-Play Environment Pull Request.
This work enables Atropos to support automated, economically incentivized evaluation—paving the way for scalable, market-driven AI training environments.
What we demonstrated—the first model-to-model payment for training feedback—is just the starting point. We've shown that AI models can form economic relationships to improve themselves. This fundamentally changes how we think about scalable AI development.
Consider what this enables: Instead of companies spending millions on human annotation or being limited by static reward models, any organization can now access specialized evaluation expertise on-demand. A team training a medical model can instantly hire evaluators with deep expertise in clinical reasoning. A robotics group can access evaluators specialized in safety. The bottleneck shifts from "who can afford the most human annotators" to "who can design the best incentives for quality evaluation."
More importantly, this creates evolutionary pressure in AI training itself. As specialist evaluators compete for payment, they're incentivized to become genuinely better at their domains. The evaluation quality improves over time, which means the models being trained get better feedback, which produces better models. We're moving from static training pipelines to adaptive, market-driven improvement loops.
This isn't just cheaper training—it's the foundation for AI training that gets better at getting better.
[1] Lee, H., Phatale, S., Mansoor, H., Mesnard, T., Ferret, J., Lu, K., ... & Prakash, S. (2023). RLAIF vs. RLHF: Scaling Reinforcement Learning from Human Feedback with AI Feedback. arXiv preprint arXiv:2309.00267. https://arxiv.org/abs/2309.00267
[2] Xu, T., Helenowski, E., Sankararaman, K. A., Jin, D., Peng, K., Han, E., ... & Fang, H. (2024). The Perfect Blend: Redefining RLHF with Mixture of Judges. arXiv preprint arXiv:2409.20370. https://arxiv.org/pdf/2409.20370