Exploring DeepSeek-R1's Agentic Capabilities Through Code Actions
I ran a fast experiment investigating how DeepSeek-R1 carries out on agentic tasks, regardless of not supporting tool use natively, and I was quite impressed by initial results. This experiment runs DeepSeek-R1 in a setup, where the model not only prepares the actions but also creates the actions as executable Python code. On a subset1 of the GAIA recognition split, DeepSeek-R1 surpasses Claude 3.5 Sonnet by 12.5% absolute, from 53.1% to 65.6% proper, and other designs by an even larger margin:
The experiment followed model usage standards from the DeepSeek-R1 paper and the model card: Don't utilize few-shot examples, prevent including a system prompt, and set the temperature to 0.5 - 0.7 (0.6 was utilized). You can find more examination details here.
Approach
DeepSeek-R1's strong coding capabilities enable it to serve as an agent without being explicitly trained for tool use. By allowing the model to produce actions as Python code, it can flexibly engage with environments through code execution.
Tools are carried out as Python code that is consisted of straight in the prompt. This can be a basic function definition or a module of a bigger bundle - any valid Python code. The design then creates code actions that call these tools.
Results from carrying out these actions feed back to the design as follow-up messages, driving the next steps up until a final answer is reached. The agent structure is an easy iterative coding loop that mediates the discussion between the model and its environment.
Conversations
DeepSeek-R1 is used as chat model in my experiment, where the model autonomously pulls additional context from its environment by utilizing tools e.g. by utilizing an online search engine or fetching information from web pages. This drives the conversation with the environment that continues until a last answer is reached.
On the other hand, o1 models are known to perform poorly when utilized as chat designs i.e. they do not try to pull context throughout a conversation. According to the connected post, o1 models perform best when they have the complete context available, with clear guidelines on what to do with it.
Initially, I likewise attempted a full context in a single prompt approach at each step (with outcomes from previous steps consisted of), however this led to substantially lower ratings on the GAIA subset. Switching to the conversational approach explained above, I had the ability to reach the reported 65.6% performance.
This raises an interesting question about the claim that o1 isn't a chat model - possibly this observation was more pertinent to older o1 models that did not have tool usage abilities? After all, isn't tool use support an important system for allowing models to pull extra context from their environment? This conversational approach certainly appears effective for DeepSeek-R1, though I still require to perform similar experiments with o1 models.
Generalization
Although DeepSeek-R1 was mainly trained with RL on math and coding jobs, it is impressive that generalization to agentic jobs with tool use via code actions works so well. This ability to generalize to agentic tasks advises of recent research by DeepMind that shows that RL generalizes whereas SFT remembers, although generalization to tool usage wasn't investigated because work.
Despite its capability to generalize to tool usage, DeepSeek-R1 often produces extremely long thinking traces at each action, compared to other models in my experiments, limiting the usefulness of this model in a single-agent setup. Even simpler tasks often take a long time to complete. Further RL on agentic tool usage, be it through code actions or not, might be one option to improve performance.
Underthinking
I also observed the underthinking phenomon with DeepSeek-R1. This is when a thinking model often switches between various reasoning thoughts without sufficiently checking out appealing courses to reach a right option. This was a significant factor for excessively long thinking traces produced by DeepSeek-R1. This can be seen in the taped traces that are available for download.
Future experiments
Another common application of thinking models is to utilize them for preparing just, while utilizing other models for bybio.co generating code actions. This could be a possible brand-new function of freeact, if this separation of roles proves beneficial for more complex jobs.
I'm also curious about how thinking designs that already support tool usage (like o1, o3, ...) perform in a single-agent setup, with and without producing code actions. Recent advancements like OpenAI's Deep Research or Hugging Face's open-source Deep Research, which also utilizes code actions, look intriguing.