Researchers at the University of Science and Technology of China have developed a new reinforcement learning (RL) framework that helps train large language models (LLMs) for complex agentic tasks beyond well-defined problems such as math and coding.
Their framework, Agent-R1, is compatible with popular RL algorithms and shows considerable improvement on reasoning tasks that require multiple retrieval stages and multi-turn interactions with tools.
The framework is built on a redefinition of the RL paradigm that takes into account the dynamic nature of agentic applications, which must interact with evolving environments and work from imperfect information. This framing is closer to real-world applications and could prove useful for agentic tasks in enterprise settings.
RL has become a cornerstone of training LLMs for well-defined reasoning tasks. In areas like mathematics and coding, the model receives a clear signal: The answer is either right or wrong. This makes it relatively straightforward to reward or penalize its behavior.
But this approach struggles with agentic tasks that require models to work in interactive environments, develop dynamic memories across conversations, perform multi-step reasoning and respond to unpredictable feedback. Training agents with RL for these scenarios presents unique challenges, especially in multi-turn interactions where designing effective rewards is complex and the trained agent often fails to generalize to the messy, unpredictable nature of real-world environments.
To address these challenges, the researchers revisited the fundamental framework of RL, known as the Markov Decision Process (MDP). An MDP models decision-making using four key components: a state space (the set of possible states an agent can be in); an action space (what the agent can do); a state transition probability (how likely an action is to lead to each possible next state); and a reward function (whether the outcome is good or bad). The paper proposes extending this framework to better suit LLM agents.
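In standard notation, the four components above can be written compactly. The symbols follow common textbook convention and are not taken from the paper; the finite horizon T in the objective is likewise an assumption made for illustration.

```latex
\mathcal{M} = (\mathcal{S}, \mathcal{A}, P, R), \qquad
P(s' \mid s, a) = \Pr\left(s_{t+1} = s' \mid s_t = s,\ a_t = a\right), \qquad
R : \mathcal{S} \times \mathcal{A} \to \mathbb{R}

J(\pi) = \mathbb{E}_{\pi}\left[ \sum_{t=0}^{T} R(s_t, a_t) \right]
```

Here J is the expected total reward that training tries to maximize for a policy π, which in this setting is the LLM itself.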
In the new formulation, the state space is expanded to include not just the current state (the current sequence of tokens generated by the model) but the entire history of interactions and environmental feedback. Actions are still fundamentally about generating text, but specific sequences of text can now trigger external tools, like an API call. State transitions become unpredictable, or "stochastic," because the outcome depends not just on the tokens the model predicts but also on the environment's response, which depends on external factors. Finally, the reward system becomes more granular, incorporating intermediate "process rewards" for successfully completing steps along the way, rather than just a single reward at the very end. This provides more frequent and precise guidance to the agent during training.
This last change is especially important because it addresses the “sparse reward” problem that most RL frameworks face. When the agent receives a single reward signal based on the final outcome, it gets no direct feedback on which of its intermediate steps were right and which were wrong. Process rewards address this problem by providing feedback signals on those intermediate steps, making the learning process much more efficient.
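A small worked example can make the difference concrete. The sketch below computes the "reward-to-go" for each step of a hypothetical four-step episode under both schemes. The reward values, the step descriptions and the function name are invented for illustration and do not come from the paper.

```python
from typing import List


def rewards_to_go(rewards: List[float], gamma: float = 1.0) -> List[float]:
    """Return, for every step, the (discounted) sum of rewards from that step onward."""
    returns: List[float] = []
    running = 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


# Hypothetical episode: useful search, wasted search, useful search, correct answer.
sparse_rewards = [0.0, 0.0, 0.0, 1.0]    # outcome reward only
dense_rewards = [0.25, -0.25, 0.25, 1.0]  # outcome reward plus process rewards

sparse_returns = rewards_to_go(sparse_rewards)
dense_returns = rewards_to_go(dense_rewards)

print(sparse_returns)  # [1.0, 1.0, 1.0, 1.0]
print(dense_returns)   # [1.25, 1.0, 1.25, 1.0]
```

With the outcome reward alone, every step sees the same return, so nothing in the signal singles out the wasted search. With process rewards, the per-step rewards themselves mark the second step as the weak one, and the returns start to differ across steps.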
“These extensions are crucial for enabling reinforcement learning algorithms to train sophisticated Agents capable of complex, multi-step reasoning and interaction within dynamic environments,” the researchers write in their paper.
Based on the extended MDP definition, the researchers developed Agent-R1, a flexible and user-friendly training platform for RL-based LLM agents. It extends traditional single-turn RL frameworks to handle the multi-turn, interactive nature of agentic tasks, allowing for seamless integration with diverse environments.
The most significant difference lies in the "rollout phase," where the agent generates responses. In single-turn RL, the model generates a response once. In multi-turn RL, the process involves a series of complex back-and-forth interactions.
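The contrast between the two rollout styles can be sketched in a few lines. This is a minimal illustration, not Agent-R1's code: the `Policy` and `EnvStep` type aliases, both function names and the toy policy and environment are all hypothetical stand-ins.

```python
from typing import Callable, List, Tuple

# Hypothetical type aliases, for illustration only.
Policy = Callable[[str], str]                # context -> generated text
EnvStep = Callable[[str], Tuple[str, bool]]  # action text -> (feedback, done)


def single_turn_rollout(policy: Policy, prompt: str) -> str:
    """Single-turn RL: one generation, no environment feedback."""
    return policy(prompt)


def multi_turn_rollout(policy: Policy, env_step: EnvStep,
                       prompt: str, max_turns: int = 4) -> List[str]:
    """Multi-turn RL: alternate model generations with environment feedback."""
    transcript = [prompt]
    context = prompt
    for _ in range(max_turns):
        action = policy(context)
        transcript.append(action)
        feedback, done = env_step(action)
        if done:
            break
        transcript.append(feedback)
        context = "\n".join(transcript)  # the next turn sees the full history
    return transcript


def toy_policy(context: str) -> str:
    """Stand-in for an LLM: search first, answer once a result is visible."""
    return "ANSWER: Paris" if "RESULT" in context else "SEARCH: capital of France"


def toy_env_step(action: str) -> Tuple[str, bool]:
    """Stand-in environment: answering ends the episode, searching returns text."""
    if action.startswith("ANSWER"):
        return "", True
    return "RESULT: Paris is the capital of France.", False


prompt = "Q: What is the capital of France?"
transcript = multi_turn_rollout(toy_policy, toy_env_step, prompt)
```

In the multi-turn case the transcript grows with each exchange, which is why the extended formulation treats the whole interaction history as the state.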
Agent-R1 achieves this flexible multi-turn rollout with two core modules: Tool and ToolEnv. The Tool module acts as an executor for specific actions such as calling an API or accessing a database. When invoked, a Tool performs its action and returns the direct, raw outcome. In contrast, the ToolEnv module is the orchestrator and interpreter. It takes the output from the Tool and determines how that outcome affects the agent's state and the overall task progress. ToolEnv manages state transitions, calculates reward signals based on tool outcomes and packages the new state information for the agent.
In short, when an action is complete, the Tool reports "what happened," while ToolEnv dictates "what this outcome means for the agent and the task."
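The division of labor might look something like the sketch below. It is an illustration of the idea only: the class names `SearchTool` and `ToyToolEnv`, the `execute` and `step` methods, the `search:`/`answer:` action format and the reward values are all assumptions, not Agent-R1's actual interface.

```python
from typing import Dict, List, Tuple


class SearchTool:
    """Executor: performs one action and returns the raw outcome, nothing more."""

    def __init__(self, corpus: Dict[str, str]) -> None:
        self.corpus = corpus

    def execute(self, query: str) -> str:
        return self.corpus.get(query, "")


class ToyToolEnv:
    """Orchestrator: interprets tool output, updates state and assigns rewards."""

    def __init__(self, tool: SearchTool, gold_answer: str) -> None:
        self.tool = tool
        self.gold_answer = gold_answer
        self.history: List[str] = []  # state = full interaction history

    def step(self, action: str) -> Tuple[List[str], float, bool]:
        """Return (new state, reward, done) for one agent action."""
        self.history.append(action)
        if action.startswith("search:"):
            raw = self.tool.execute(action[len("search:"):].strip())
            self.history.append(f"observation: {raw or '<no results>'}")
            reward = 0.1 if raw else 0.0  # hypothetical process reward
            return list(self.history), reward, False
        if action.startswith("answer:"):
            correct = action[len("answer:"):].strip() == self.gold_answer
            return list(self.history), (1.0 if correct else 0.0), True
        return list(self.history), 0.0, False  # unrecognized action, no reward


corpus = {"capital of France": "Paris is the capital of France."}
env = ToyToolEnv(SearchTool(corpus), gold_answer="Paris")
state1, r1, done1 = env.step("search: capital of France")
state2, r2, done2 = env.step("answer: Paris")
```

The tool knows nothing about rewards or episodes; swapping in a different tool leaves the environment's bookkeeping untouched, which is the kind of separation the article describes.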
The researchers tested Agent-R1 on the challenging task of multi-hop question answering, which requires complex reasoning, information retrieval across multiple documents and multi-step decision-making. They trained Qwen2.5-3B-Instruct on QA datasets and evaluated its performance on the HotpotQA and 2WikiMultihopQA datasets. They also tested it on the MuSiQue dataset, which was outside the domain of the tasks the agent was trained on.
They compared various RL algorithms trained with Agent-R1 against two baselines: Naive RAG, a single-pass retrieval method where an LLM answers based on one set of retrieved documents, and Base Tool Call, which uses the model's native function-calling ability without specialized RL training.
The results demonstrated that all RL-trained agents substantially outperformed the baselines. GRPO, an RL algorithm used in advanced reasoning models like DeepSeek-R1, delivered the best overall performance.
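For readers unfamiliar with GRPO, its central idea is to score each sampled response relative to the other responses generated for the same prompt, rather than against a separately learned value estimate. The sketch below shows only that normalization step, under stated assumptions: the function name and the example rewards are hypothetical, and it uses the population standard deviation, a detail on which implementations differ.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize each reward against the mean and spread of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four responses sampled from the same question.
group_rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(group_rewards)
```

Responses that beat the group average receive positive advantages and are reinforced, while below-average responses are discouraged.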
“These results robustly validate Agent-R1’s efficacy in training powerful LLM agents via end-to-end RL, showing consistent, substantial gains over baselines across diverse datasets and RL algorithms,” the researchers write.
These findings could be significant for enterprises, where there is a strong push to apply RL and reasoning beyond well-defined domains. A framework designed to handle messy, multi-turn interactions with users and dynamic environments can pave the way for new agents capable of solving complex problems in real-world settings.
“We hope Agent-R1 provides a foundation for future work on scalable and unified RL training for agentic LLMs,” the researchers conclude.