Exploring DeepSeek-R1's Agentic Capabilities Through Code Actions
I ran a quick experiment investigating how DeepSeek-R1 performs on agentic tasks, despite not supporting tool use natively, and I was quite impressed by initial results. This experiment runs DeepSeek-R1 in a single-agent setup, where the model not only plans the actions but also formulates the actions as executable Python code. On a subset¹ of the GAIA validation split, DeepSeek-R1 outperforms Claude 3.5 Sonnet by 12.5% absolute, from 53.1% to 65.6% correct, and other models by an even larger margin:
The experiment followed the model usage guidelines from the DeepSeek-R1 paper and the model card: don't use few-shot examples, avoid adding a system prompt, and set the temperature to 0.5 - 0.7 (0.6 was used). You can find further evaluation details here.
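As a minimal sketch, a request following these guidelines could look like this against an OpenAI-compatible endpoint (base URL, model name, and prompt are assumptions for illustration, not the exact evaluation code):

```python
from openai import OpenAI

# Base URL and model name are assumptions; substitute the values
# of whatever DeepSeek-R1 provider you use.
client = OpenAI(base_url="https://api.deepseek.com", api_key="<your-api-key>")

response = client.chat.completions.create(
    model="deepseek-reasoner",
    # No system message and no few-shot examples, per the usage guidelines.
    messages=[{"role": "user", "content": "<task description>"}],
    temperature=0.6,  # within the recommended 0.5 - 0.7 range
)
print(response.choices[0].message.content)
```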
Approach
DeepSeek-R1's strong coding capabilities enable it to act as an agent without being explicitly trained for tool use. By allowing the model to generate actions as Python code, it can flexibly interact with environments through code execution.
Tools are implemented as Python code that is included directly in the prompt. This can be a simple function definition or a module of a larger package - any valid Python code. The model then generates code actions that call these tools.
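For illustration, a tool and a code action calling it might look like the sketch below. The tool name, signature, and behavior are made up for this example; the actual tool interface may differ.

```python
# Tool definition included verbatim in the prompt (hypothetical example;
# a real implementation would call an actual search API).
def search_web(query: str, max_results: int = 5) -> list[dict]:
    """Return search results as a list of {'title': ..., 'url': ...} dicts."""
    return [{"title": f"Result for {query!r}", "url": "https://example.com"}][:max_results]

# A code action the model might generate in response to a user question.
# The agent executes it and feeds the printed output back to the model.
results = search_web("GAIA benchmark validation split")
for r in results:
    print(r["title"], "-", r["url"])
```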
Results from executing these actions are fed back to the model as follow-up messages, driving the next actions until a final answer is reached. The agent framework is a simple iterative coding loop that mediates the conversation between the model and its environment.
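The loop itself can be sketched roughly as follows. Here, `generate` and `execute` are hypothetical callables standing in for the model client and a sandboxed code executor; they are not freeact's actual API.

```python
import re

def extract_code(message: str) -> str | None:
    """Extract the first fenced Python block from a model message, if any."""
    fence = "`" * 3  # avoids embedding a literal code fence in this example
    match = re.search(fence + r"python\n(.*?)" + fence, message, re.DOTALL)
    return match.group(1) if match else None

def run_agent(task: str, generate, execute, max_steps: int = 10) -> str:
    """Iterative coding loop mediating between the model and its environment."""
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        reply = generate(messages)  # model plans and emits the next code action
        messages.append({"role": "assistant", "content": reply})
        code = extract_code(reply)
        if code is None:            # no code action means a final answer was reached
            return reply
        result = execute(code)      # run the code action, e.g. in a sandbox
        messages.append({"role": "user", "content": f"Execution result:\n{result}"})
    raise RuntimeError("no final answer within max_steps")
```

A message without a code action serves as the termination condition here; in practice, a framework may use an explicit final-answer marker instead.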
Conversations
DeepSeek-R1 is used as a chat model in my experiment, where the model autonomously pulls additional context from its environment by using tools, e.g. by querying a search engine or fetching data from web pages. This drives the conversation with the environment that continues until a final answer is reached.
In contrast, o1 models are known to perform poorly when used as chat models, i.e. they do not attempt to pull context during a conversation. According to the linked article, o1 models perform best when they have the full context available, with clear instructions on what to do with it.
Initially, I also tried a full context in a single prompt approach at each step (with results from previous actions included), but this resulted in significantly lower scores on the GAIA subset. Switching to the conversational approach described above, I was able to reach the reported 65.6% performance.
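Schematically, the difference between the two approaches looks like this (message contents are placeholders, not the actual prompts):

```python
# Single-prompt approach: the full history is re-packed into one user
# message at every step (this scored significantly lower on the subset).
single_prompt = [
    {"role": "user", "content": "Task: ...\nPrevious code actions and results:\n1. ...\n2. ..."},
]

# Conversational approach: code actions and execution results alternate
# as separate messages, so the model pulls context incrementally.
conversation = [
    {"role": "user", "content": "Task: ..."},
    {"role": "assistant", "content": "<code action 1>"},
    {"role": "user", "content": "Execution result: ..."},
    {"role": "assistant", "content": "<code action 2 or final answer>"},
]
```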
This raises an interesting question about the claim that o1 isn't a chat model - perhaps this observation was more relevant to older o1 models that didn't have tool use capabilities? After all, isn't tool use support an important mechanism for enabling models to pull additional context from their environment? This conversational approach certainly seems effective for DeepSeek-R1, though I still need to conduct similar experiments with o1 models.
Generalization
Although DeepSeek-R1 was mainly trained with RL on math and coding tasks, it is remarkable that generalization to agentic tasks with tool use via code actions works so well. This ability to generalize to agentic tasks is reminiscent of recent research by DeepMind showing that RL generalizes whereas SFT memorizes, although generalization to tool use wasn't investigated in that work.
Despite its ability to generalize to tool use, DeepSeek-R1 often produces very long reasoning traces at each step, compared to other models in my experiments, limiting the usefulness of this model in a single-agent setup. Even simpler tasks sometimes take a long time to complete. Further RL on agentic tool use, be it via code actions or not, could be one option to improve efficiency.
Underthinking
I also observed the underthinking phenomenon with DeepSeek-R1. This is when a reasoning model frequently switches between different reasoning thoughts without sufficiently exploring promising paths to reach a correct solution. This was a major reason for the overly long reasoning traces produced by DeepSeek-R1. It can be seen in the recorded traces that are available for download.
Future experiments
Another common application of reasoning models is to use them for planning only, while using other models for generating code actions. This could be a potential new feature of freeact, if this separation of roles proves useful for more complex tasks.
I'm also curious about how reasoning models that already support tool use (like o1, o3, ...) perform in a single-agent setup, with and without generating code actions. Recent developments like OpenAI's Deep Research or Hugging Face's open-source Deep Research, which also uses code actions, look interesting.