Exploring DeepSeek-R1's Agentic Capabilities Through Code Actions
I ran a quick experiment investigating how DeepSeek-R1 performs on agentic tasks, despite not supporting tool usage natively, and I was rather satisfied by preliminary results. This experiment runs DeepSeek-R1 in a single-agent setup, where the model not just plans the actions however also develops the actions as executable Python code. On a subset1 of the GAIA validation split, DeepSeek-R1 surpasses Claude 3.5 Sonnet by 12.5% absolute, from 53.1% to 65.6% correct, setiathome.berkeley.edu and other designs by an even larger margin:
The experiment followed design use guidelines from the DeepSeek-R1 paper and the model card: Don't use few-shot examples, avoid adding a system prompt, and set the temperature level to 0.5 - 0.7 (0.6 was used). You can discover additional evaluation details here.
Approach
DeepSeek-R1's strong coding capabilities allow it to act as a representative without being clearly trained for tool use. By allowing the design to generate actions as Python code, it can flexibly engage with environments through code execution.
Tools are carried out as Python code that is included straight in the timely. This can be a basic function definition or a module of a larger bundle - any legitimate Python code. The model then produces code actions that call these tools.
Arise from executing these actions feed back to the design as follow-up messages, driving the next steps until a last answer is reached. The representative structure is a simple iterative coding loop that moderates the discussion in between the model and its environment.
Conversations
DeepSeek-R1 is utilized as chat model in my experiment, where the design autonomously pulls extra context from its environment by utilizing tools e.g. by utilizing an online search engine or bring data from websites. This drives the conversation with the environment that continues until a last response is reached.
In contrast, o1 models are known to perform badly when used as chat models i.e. they don't attempt to pull context throughout a conversation. According to the linked post, o1 designs carry out best when they have the full context available, with clear guidelines on what to do with it.
Initially, I likewise tried a complete context in a approach at each step (with outcomes from previous steps included), wiki.eqoarevival.com but this caused significantly lower scores on the GAIA subset. Switching to the conversational method explained above, I had the ability to reach the reported 65.6% performance.
This raises an interesting question about the claim that o1 isn't a chat design - possibly this observation was more pertinent to older o1 designs that did not have tool use abilities? After all, isn't tool use support an important mechanism for pipewiki.org making it possible for designs to pull extra context from their environment? This conversational technique certainly seems reliable for DeepSeek-R1, though I still need to perform similar experiments with o1 designs.
Generalization
Although DeepSeek-R1 was mainly trained with RL on math and coding tasks, wiki.woge.or.at it is impressive that generalization to agentic tasks with tool use through code actions works so well. This capability to generalize to agentic jobs reminds of recent research by DeepMind that shows that RL generalizes whereas SFT memorizes, although generalization to tool use wasn't investigated in that work.
Despite its capability to generalize to tool use, DeepSeek-R1 frequently produces extremely long reasoning traces at each action, compared to other models in my experiments, limiting the usefulness of this model in a single-agent setup. Even simpler tasks in some cases take a long period of time to complete. Further RL on agentic tool usage, be it through code actions or not, imoodle.win could be one alternative to enhance performance.
Underthinking
I also observed the underthinking phenomon with DeepSeek-R1. This is when a thinking model often switches in between different reasoning thoughts without adequately checking out promising courses to reach a correct solution. This was a significant factor for excessively long thinking traces produced by DeepSeek-R1. This can be seen in the recorded traces that are available for download.
Future experiments
Another typical application of reasoning models is to use them for preparing only, while using other models for generating code actions. This might be a possible new feature of freeact, if this separation of roles shows useful for more complex tasks.
I'm likewise curious about how reasoning models that already support tool usage (like o1, islider.ru o3, ...) perform in a single-agent setup, with and akropolistravel.com without producing code actions. Recent advancements like OpenAI's Deep Research or Hugging Face's open-source Deep Research, which likewise uses code actions, look intriguing.