Exploring DeepSeek-R1's Agentic Capabilities Through Code Actions
I ran a quick experiment investigating how DeepSeek-R1 performs on agentic tasks, despite not supporting tool use natively, and I was quite impressed by the initial results. This experiment runs DeepSeek-R1 in a single-agent setup, where the model not only plans the steps but also formulates them as executable Python code. On a subset¹ of the GAIA validation split, DeepSeek-R1 outperforms Claude 3.5 Sonnet by 12.5% absolute, from 53.1% to 65.6% correct, and other models by an even larger margin:
The experiment followed the model usage recommendations from the DeepSeek-R1 paper and the model card: don't use few-shot examples, avoid adding a system prompt, and set the temperature to 0.5 - 0.7 (0.6 was used). You can find more evaluation details here.
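For reference, a minimal sketch of how these settings might look when calling DeepSeek-R1 through an OpenAI-compatible endpoint; the base URL and model name are assumptions for illustration, not the actual experiment code:

```python
# Sketch of the generation settings used in the experiment: no system prompt,
# no few-shot examples, temperature 0.6. Endpoint URL and model name are
# illustrative assumptions for an OpenAI-compatible DeepSeek-R1 host.
from openai import OpenAI

client = OpenAI(base_url="https://api.deepseek.com", api_key="...")

response = client.chat.completions.create(
    model="deepseek-reasoner",              # DeepSeek-R1 behind an OpenAI-compatible API
    messages=[
        {"role": "user", "content": "..."}  # task prompt only, no system message
    ],
    temperature=0.6,                        # within the recommended 0.5 - 0.7 range
)
print(response.choices[0].message.content)
```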
Approach
DeepSeek-R1's strong coding capabilities enable it to act as an agent without being explicitly trained for tool use. By allowing the model to generate actions as Python code, it can flexibly interact with environments through code execution.
Tools are implemented as Python code that is included directly in the prompt. This can be anything from a simple function definition to a module of a larger package. The model then generates code actions that call these tools, as illustrated below.
Results from executing these actions are fed back to the model as follow-up messages, driving the next actions until a final answer is reached. The agent framework is a simple iterative coding loop that mediates the conversation between the model and its environment, sketched after this paragraph.
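As a minimal illustration (not the actual tool set used in the experiment), a tool can be as simple as a plain Python function whose source code is pasted into the prompt:

```python
# Hypothetical example of a tool: a plain Python function whose source is
# included verbatim in the prompt so the model can call it in its code actions.
import urllib.request

def fetch_page(url: str, max_chars: int = 4000) -> str:
    """Download a web page and return its (truncated) text content."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")[:max_chars]

# The prompt then contains this source plus an instruction such as:
# "You can use fetch_page(url) in your Python code actions."
```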
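The loop itself can be sketched in a few lines; the helper names (`generate`, `extract_code`, `run`) are placeholders for whatever model client and sandboxed executor is used, not freeact's actual API:

```python
# Sketch of the iterative coding loop (helper names are placeholders): the model
# emits a Python code action, the environment executes it, and the execution
# result is appended to the conversation as a follow-up message.
def agent_loop(task: str, generate, extract_code, run, max_steps: int = 10) -> str:
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        reply = generate(messages)                  # model response: reasoning + code, or final answer
        messages.append({"role": "assistant", "content": reply})
        code = extract_code(reply)                  # e.g. the first fenced Python block, or None
        if code is None:                            # no code action -> treat the reply as the final answer
            return reply
        result = run(code)                          # execute the code action in a sandbox
        messages.append({"role": "user", "content": f"Execution result:\n{result}"})
    return "No final answer within the step budget."
```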
Conversations
DeepSeek-R1 is used as a chat model in my experiment, where the model autonomously pulls additional context from its environment by using tools, e.g. by querying a search engine or fetching data from web pages. This drives the conversation with the environment until a final answer is reached.
In contrast, o1 models are known to perform poorly when used as chat models, i.e. they don't attempt to pull context during a conversation. According to the linked article, o1 models perform best when they have the full context available, with clear instructions on what to do with it.
Initially, I also tried a full-context-in-a-single-prompt approach at each step (with results from previous steps included), but this resulted in significantly lower scores on the GAIA subset. Switching to the conversational approach described above, I was able to reach the reported 65.6% performance. The difference between the two styles is sketched below.
This raises an interesting question about the claim that o1 isn't a chat model: perhaps this observation was more relevant to older o1 models that lacked tool use capabilities? After all, isn't tool use support an essential mechanism for enabling models to pull additional context from their environment? This conversational approach certainly seems effective for DeepSeek-R1, though I still need to run comparable experiments with o1 models.
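To make the difference concrete, a schematic comparison of the two prompting styles (illustrative message layouts only, not the actual prompts used):

```python
# Full context in a single prompt: each step sends one user message that
# re-embeds the task and all previous execution results.
single_prompt_step = [
    {"role": "user", "content": "Task: ...\n\nPrevious results:\nStep 1: ...\nStep 2: ...\n\nWhat next?"}
]

# Conversational approach: the task and each execution result remain separate
# turns, so the model sees the interaction history as a multi-turn chat.
conversational = [
    {"role": "user", "content": "Task: ..."},
    {"role": "assistant", "content": "<reasoning + code action>"},
    {"role": "user", "content": "Execution result: ..."},
    {"role": "assistant", "content": "<reasoning + next code action>"},
]
```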
Generalization
Although DeepSeek-R1 was mainly trained with RL on math and coding tasks, it is remarkable that generalization to agentic tasks with tool use via code actions works so well. This ability to generalize to agentic tasks is reminiscent of recent research by DeepMind showing that RL generalizes whereas SFT memorizes, although generalization to tool use wasn't investigated in that work.
Despite its ability to generalize to tool use, DeepSeek-R1 often produces very long reasoning traces at each step, compared to other models in my experiments, limiting the usefulness of this model in a single-agent setup. Even simpler tasks sometimes take a long time to complete. Further RL on agentic tool use, whether via code actions or not, could be one option to improve efficiency.
Underthinking
I also observed the underthinking phenomenon with DeepSeek-R1. This is when a reasoning model frequently switches between different reasoning thoughts without sufficiently exploring promising paths to reach a correct solution. This was a major cause of the very long reasoning traces produced by DeepSeek-R1. It can be seen in the recorded traces that are available for download.
Future experiments
Another common application of reasoning models is to use them for planning only, while other models generate the code actions. This could be a potential new feature of freeact, if this separation of roles proves useful for more complex tasks.
I'm also curious about how reasoning models that already support tool use (like o1, o3, ...) perform in a single-agent setup, with and without generating code actions. Recent developments like OpenAI's Deep Research or Hugging Face's open-source Deep Research, which also uses code actions, look interesting.
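A rough sketch of what such a separation of roles could look like; the function names and the idea of passing the plan as plain text between two models are assumptions for illustration, not an existing freeact feature:

```python
# Hypothetical planner/coder split: a reasoning model produces a plan, a second
# model turns each plan step into a code action that is then executed.
def plan_then_act(task: str, plan_with, code_with, run) -> str:
    plan = plan_with(f"Break this task into concrete steps:\n{task}")
    results = []
    for step in plan.splitlines():
        if not step.strip():
            continue
        code = code_with(f"Write Python code for this step:\n{step}\n\nPrior results:\n{results}")
        results.append(run(code))
    return code_with(f"Summarize the final answer for:\n{task}\n\nResults:\n{results}")
```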