Exploring the smolagents Library: A Deep Dive into MultiStepAgent, CodeAgent, and ToolCallingAgent

23 min read · Feb 8, 2025

In the realm of artificial intelligence, agents are entities that interact with environments to achieve specific goals. The smolagents library from HuggingFace provides a robust framework for building such agents. It offers three primary agent types: MultiStepAgent, CodeAgent, and ToolCallingAgent. Each of these agents is designed to solve tasks in unique ways, leveraging language models and tools to execute actions and interpret results.

Note: This blog post is based on smolagents v1.8.0.

This blog post will explore the functionality, implementation, and use cases of each agent type. We will delve into the code to highlight key sections and explain how they contribute to the overall behaviour of the agents. Additionally, we will discuss when to use one agent over another and the differences between them.

Here is the repo for the smolagents library: https://github.com/huggingface/smolagents

And here is the documentation: https://huggingface.co/docs/smolagents

Here is how it works in a simple GIF:

Understanding the MultiStepAgent

The MultiStepAgent is the foundational class for all agents in the smolagents library. It implements the ReAct framework, which involves cycles of action and observation until the objective is achieved or the maximum number of steps is reached. This agent serves as a blueprint for more specialized agents like CodeAgent and ToolCallingAgent.

Key Features

  1. ReAct Framework: The agent performs a sequence of actions based on input from a language model and observes the environment’s response.
  2. Customizable Tools: Developers can define and integrate custom tools that the agent can use during its execution.
  3. Planning Capability: The agent periodically plans its next steps using a planning interval, ensuring it stays on track toward achieving its goal.
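
MultiStepAgent itself is an abstract base class; in practice you instantiate one of its subclasses. To make the features above concrete, here is a minimal sketch of the typical entry point, assuming an HF API token is configured; the web-search tool, step counts, and task are only examples:

from smolagents import CodeAgent, DuckDuckGoSearchTool, HfApiModel

model = HfApiModel()  # a hosted model behind the HF Inference API
agent = CodeAgent(
    tools=[DuckDuckGoSearchTool()],  # tools the agent may use while solving the task
    model=model,
    max_steps=6,           # upper bound on ReAct think-act-observe cycles
    planning_interval=3,   # optional: re-plan periodically (see Planning Capability above)
)
print(agent.run("How many seconds does light take to travel from the Sun to Earth?"))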

Code Highlights

To understand what happens when a user asks the agent to solve a task, let's trace the flow from input to output. The MultiStepAgent class is designed to solve tasks step by step using the ReAct framework, which involves cycles of thinking (actions generated by a language model) and observing (results obtained from the environment or tools). Below, I will explain how the run() method works, breaking it down into its main steps, including planning, tool calling, and other important parts.

1. Initialization in run()

The run method in the MultiStepAgent class is a critical function that orchestrates the entire process of task execution by the agent. It serves as the entry point for running tasks and manages the flow of operations, including initialization, memory management, logging, and step-by-step execution. Below, I will explain the method in detail, breaking it into its key components and explaining how each part contributes to the overall functionality.

Purpose of the Method

The run method is responsible for executing a given task using the ReAct framework. It initializes the necessary components, manages the conversation state, and either streams intermediate results or returns the final output after completing all steps. This method ensures that the agent operates efficiently and adapts dynamically to new information during execution.

Key Components

1. Input Parameters

The method accepts several parameters that define how the task should be executed:

  • task (str) : The primary task the agent needs to solve. This is the main input provided by the user.
  • stream (bool, default False) : Determines whether the agent should return intermediate results as they are generated (streaming mode) or only return the final result at the end.
  • reset (bool, default True) : Specifies whether the agent should reset its memory and start fresh or continue from where it left off in a previous run.
  • images (list[str], optional ) : Paths to images that may be relevant to the task.
  • additional_args (dict, optional ) : Any additional variables (e.g., dataframes, images) that the agent can use during execution. These are added to the agent's state dictionary.
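
To make these parameters concrete, here is a hedged sketch of a run() call that passes an extra variable; the dataframe, task text, and agent construction are illustrative only:

import pandas as pd
from smolagents import CodeAgent, HfApiModel

agent = CodeAgent(tools=[], model=HfApiModel())
df = pd.DataFrame({"city": ["Paris", "Oslo"], "temp_c": [21.0, 14.0]})

# additional_args is merged into agent.state and appended to the task text,
# so the agent's generated code can refer to `df` directly.
answer = agent.run(
    "Which city in the provided dataframe `df` is warmer?",
    additional_args={"df": df},
    reset=True,    # start from a fresh memory
    stream=False,  # collect steps internally and return only the final answer
)
print(answer)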

2. Task Initialization

The method begins by assigning the task to the agent's self.task attribute. If additional_args are provided, they are added to the agent's state dictionary (self.state) and appended to the task description. This ensures that the agent has access to these variables during execution.

if additional_args is not None:
    self.state.update(additional_args)
    self.task += f"""
You have been provided with these additional arguments, that you can access using the keys as variables in your python code:
{str(additional_args)}."""

This step is particularly useful for tasks that require external inputs, such as data analysis or image processing.

3. System Prompt Initialization

The system prompt is a crucial component of the agent’s operation. It provides context and instructions to the language model (LLM) about how to approach the task. The initialize_system_prompt method generates the system prompt based on predefined templates and the agent's configuration.

self.system_prompt = self.initialize_system_prompt()
self.memory.system_prompt = SystemPromptStep(system_prompt=self.system_prompt)

The system prompt is stored in the agent’s memory as a SystemPromptStep, ensuring it is available for future reference.

4. Memory and Monitor Reset

If the reset parameter is set to True, the agent clears its memory and resets the monitoring metrics. This ensures that the agent starts with a clean slate for the new task.

if reset:
    self.memory.reset()
    self.monitor.reset()

Resetting the memory is essential when the task is unrelated to previous runs, while resetting the monitor ensures accurate tracking of performance metrics.

5. Logging the Task

The agent logs the task details for transparency and debugging purposes. This includes the task content, the model being used, and any optional title or subtitle.

self.logger.log_task(
    content=self.task.strip(),
    subtitle=f"{type(self.model).__name__} - {(self.model.model_id if hasattr(self.model, 'model_id') else '')}",
    level=LogLevel.INFO,
    title=self.name if hasattr(self, "name") else None,
)

6. Adding the Task to Memory

The task is added to the agent’s memory as a TaskStep. This step includes any associated images and serves as the starting point for the agent's execution.

self.memory.steps.append(TaskStep(task=self.task, task_images=images))

Storing the task in memory ensures that it is available for future reference during planning and execution.

7. Execution Mode

The method supports two execution modes: streaming and non-streaming.

a. Streaming Mode

If stream=True, the method calls the _run generator function, which yields intermediate results as they are generated. This allows users to observe the agent's progress in real-time.

if stream:
    return self._run(task=self.task, images=images)

Streaming mode is useful for tasks that require continuous feedback or monitoring.

b. Non-Streaming Mode

If stream=False, the method collects all intermediate results and returns only the final output. This is achieved using a deque to extract the last step's result.

return deque(self._run(task=self.task, images=images), maxlen=1)[0]

Non-streaming mode is suitable for scenarios where only the final result is needed.
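
As a rough illustration of streaming mode, the sketch below iterates over the generator returned by run(); the agent construction and task string are arbitrary examples:

from smolagents import CodeAgent, HfApiModel

agent = CodeAgent(tools=[], model=HfApiModel())

# Streaming mode: each completed step is yielded as it finishes,
# and the last yielded item is the final answer (after output-type handling).
for item in agent.run("Plan a three-step morning routine.", stream=True):
    step_number = getattr(item, "step_number", None)
    if step_number is not None:
        print(f"Finished step {step_number}")
    else:
        print("Final answer:", item)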

Interaction with _run

The _run method is a generator function that performs the actual step-by-step execution of the task. It handles the following:

  1. Planning Steps: Periodically updates the plan based on the planning_interval.
  2. Action Generation: Generates actions using the language model.
  3. Tool Execution: Executes tools based on the generated actions.
  4. Observation Logging: Logs observations and errors for future reference.
  5. Final Answer Provision: Provides a final answer if the maximum number of steps is reached without solving the task.

The run method delegates the core execution logic to _run, ensuring modularity and separation of concerns. We will dig deeper into the _run method later in the post.

2. Planning Step

Before diving into the main execution loop, the agent may perform an initial planning step if configured to do so (planning_interval is set). The planning_step method in the MultiStepAgent class is a crucial component of the agent's decision-making process. It is responsible for periodically generating or updating a plan to guide the agent toward solving the given task. This method ensures that the agent operates in a structured and goal-oriented manner, leveraging both its memory and the language model (LLM) to formulate plans and update them as needed.

Let us break down the method in detail:

Purpose of the Method

The planning_step method is invoked at regular intervals during the agent's execution, determined by the planning_interval parameter. Its primary purpose is to:

  1. Generate an Initial Plan: On the first step, the agent creates a high-level plan based on the task, available tools, and any initial facts it knows.
  2. Update the Plan: For subsequent steps, the agent revises its plan by incorporating new observations, facts, and progress made so far.
  3. Log the Plan and Facts: The method logs the generated plan and facts for transparency and debugging purposes.

This structured approach helps the agent stay focused on the task and adapt dynamically to new information.
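
For example, here is a short sketch of enabling periodic planning at construction time; the interval value is arbitrary:

from smolagents import CodeAgent, HfApiModel

# With planning_interval=3, the check in _run (step_number % planning_interval == 1)
# triggers planning_step on step 1 and again before steps 4, 7, 10, ...
agent = CodeAgent(tools=[], model=HfApiModel(), planning_interval=3)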

Key Components

1. Inputs

The method takes three arguments:

  • task : A string describing the task the agent needs to solve.
  • is_first_step : A boolean indicating whether this is the first planning step. If True, the method generates an initial plan; otherwise, it updates the existing plan.
  • step : The current step number, used to calculate remaining steps and provide context for planning.

2. Initial Planning

If is_first_step is True, the method performs the following steps:

a. Extracting Initial Facts

The agent starts by extracting initial facts about the task using the language model:

message_prompt_facts = {
    "role": MessageRole.SYSTEM,
    "content": [{"type": "text", "text": self.prompt_templates["planning"]["initial_facts"]}],
}
input_messages = [message_prompt_facts]
chat_message_facts: ChatMessage = self.model(input_messages)
answer_facts = chat_message_facts.content
  • The message_prompt_facts template prompts the LLM to generate relevant facts about the task.
  • The LLM processes the prompt and returns the extracted facts (answer_facts).

b. Generating the Initial Plan

Using the extracted facts, the agent formulates an initial plan:

message_prompt_plan = {
    "role": MessageRole.USER,
    "content": [
        {
            "type": "text",
            "text": populate_template(
                self.prompt_templates["planning"]["initial_plan"],
                variables={
                    "task": task,
                    "tools": self.tools,
                    "managed_agents": self.managed_agents,
                    "answer_facts": answer_facts,
                },
            ),
        }
    ],
}
chat_message_plan: ChatMessage = self.model([message_prompt_plan], stop_sequences=["<end_plan>"])
answer_plan = chat_message_plan.content
  • The message_prompt_plan template combines the task, tools, managed agents, and extracted facts to create a detailed plan.
  • The LLM processes the prompt and generates the plan (answer_plan).

c. Logging and Storing the Plan

The generated plan and facts are logged and stored in the agent’s memory:

final_plan_redaction = f"""Here is the plan of action that I will follow to solve the task:
```
{answer_plan}
```"""
final_facts_redaction = f"""Here are the facts that I know so far:
```
{answer_facts}
```""".strip()
self.memory.steps.append(
    PlanningStep(
        model_input_messages=input_messages,
        plan=final_plan_redaction,
        facts=final_facts_redaction,
        model_output_message_plan=chat_message_plan,
        model_output_message_facts=chat_message_facts,
    )
)
self.logger.log(
    Rule("[bold]Initial plan", style="orange"),
    Text(final_plan_redaction),
    level=LogLevel.INFO,
)
  • The plan and facts are formatted into readable strings and added to the agent’s memory as a PlanningStep.
  • The logger records the plan for visibility.

3. Updating the Plan

If is_first_step is False, the method updates the existing plan by incorporating new information:

a. Extracting Updated Facts

The agent retrieves updated facts from its memory and combines them with pre-defined prompts:

facts_update_pre_messages = {
    "role": MessageRole.SYSTEM,
    "content": [{"type": "text", "text": self.prompt_templates["planning"]["update_facts_pre_messages"]}],
}
facts_update_post_messages = {
    "role": MessageRole.SYSTEM,
    "content": [{"type": "text", "text": self.prompt_templates["planning"]["update_facts_post_messages"]}],
}
input_messages = [facts_update_pre_messages] + memory_messages + [facts_update_post_messages]
chat_message_facts: ChatMessage = self.model(input_messages)
facts_update = chat_message_facts.content
  • The memory_messages include the agent's past actions, observations, and errors.
  • The LLM processes the combined input to generate updated facts (facts_update).

b. Generating the Updated Plan

Using the updated facts, the agent revises its plan:

update_plan_pre_messages = {
    "role": MessageRole.SYSTEM,
    "content": [
        {
            "type": "text",
            "text": populate_template(
                self.prompt_templates["planning"]["update_plan_pre_messages"], variables={"task": task}
            ),
        }
    ],
}
update_plan_post_messages = {
    "role": MessageRole.SYSTEM,
    "content": [
        {
            "type": "text",
            "text": populate_template(
                self.prompt_templates["planning"]["update_plan_post_messages"],
                variables={
                    "task": task,
                    "tools": self.tools,
                    "managed_agents": self.managed_agents,
                    "facts_update": facts_update,
                    "remaining_steps": (self.max_steps - step),
                },
            ),
        }
    ],
}
chat_message_plan: ChatMessage = self.model(
    [update_plan_pre_messages] + memory_messages + [update_plan_post_messages],
    stop_sequences=["<end_plan>"],
)
answer_plan = chat_message_plan.content
  • The updated plan considers the remaining steps (self.max_steps - step) and incorporates the latest facts.
  • The LLM processes the prompt and generates the revised plan (answer_plan).

c. Logging and Storing the Updated Plan

The updated plan and facts are logged and stored in the agent’s memory:

final_plan_redaction = textwrap.dedent(
    f"""I still need to solve the task I was given:
```
{task}
```
Here is my new/updated plan of action to solve the task:
```
{answer_plan}
```"""
)
final_facts_redaction = textwrap.dedent(
    f"""Here is the updated list of the facts that I know:
```
{facts_update}
```"""
)
self.memory.steps.append(
    PlanningStep(
        model_input_messages=input_messages,
        plan=final_plan_redaction,
        facts=final_facts_redaction,
        model_output_message_plan=chat_message_plan,
        model_output_message_facts=chat_message_facts,
    )
)
self.logger.log(
    Rule("[bold]Updated plan", style="orange"),
    Text(final_plan_redaction),
    level=LogLevel.INFO,
)
  • The updated plan and facts are formatted and added to the agent’s memory.
  • The logger records the updated plan for visibility.

3. Core Execution Logic

The _run method in the MultiStepAgent class is a critical function that orchestrates the step-by-step execution of a task using the ReAct framework. It operates as a generator, yielding intermediate results at each step, and ensures that the agent progresses toward solving the task within the constraints of a maximum number of steps (max_steps). Below, I will explain the method in detail, breaking it into its key components and explaining how each part contributes to the overall functionality.

Purpose of the Method

The _run method is responsible for executing a given task in a structured and iterative manner. It manages the following:

  1. Planning: Periodically updates the plan based on the agent’s progress.
  2. Action Execution: Generates actions using the language model (LLM) and executes them.
  3. Observation Logging: Records observations, errors, and other relevant information for future reference.
  4. Final Answer Provision: Provides a final answer if the task is solved or if the maximum number of steps is reached without success.

By operating as a generator, the method allows users to observe the agent’s progress in real-time (streaming mode) or retrieve only the final result (non-streaming mode).

Key Components

1. Initialization

The method begins by initializing variables that will be used throughout the execution:

  • final_answer : Tracks whether the task has been solved. Initially set to None.
  • self.step_number : Tracks the current step number, starting from 1.
  • memory_step : Represents the current step in the agent's memory, including details such as the step number, start time, and associated images.
final_answer = None
self.step_number = 1
while final_answer is None and self.step_number <= self.max_steps:
    step_start_time = time.time()
    memory_step = ActionStep(
        step_number=self.step_number,
        start_time=step_start_time,
        observations_images=images,
    )

This initialization ensures that the agent starts with a clean slate for each step and tracks the progress over time.

2. Planning Step

If a planning_interval is defined, the agent periodically updates its plan. This occurs on the first step and then again each time the interval has elapsed (the check is self.step_number % self.planning_interval == 1).

if self.planning_interval is not None and self.step_number % self.planning_interval == 1:
    self.planning_step(
        task,
        is_first_step=(self.step_number == 1),
        step=self.step_number,
    )
  • Initial Planning: On the first step, the agent generates an initial plan based on the task, available tools, and any initial facts.
  • Updated Planning: For subsequent steps, the agent revises its plan by incorporating new observations, facts, and progress made so far.

The planning_step method (described before) ensures that the agent operates in a structured and goal-oriented manner, adapting dynamically to new information.

3. Logging the Step

The agent logs the current step number for transparency and debugging purposes:

self.logger.log_rule(f"Step {self.step_number}", level=LogLevel.INFO)

Logging provides visibility into the agent’s activities and helps developers understand its decision-making process.

4. Executing One Step

The core of the _run method is the execution of one step, which involves generating an action and observing the result:

final_answer = self.step(memory_step)

The step method, implemented in child classes (e.g., CodeAgent and ToolCallingAgent, which will be explained later), performs the following:

  1. Action Generation: Uses the LLM to generate an action based on the current state of the agent’s memory.
  2. Tool Execution: Executes the generated action using the appropriate tool.
  3. Observation Logging: Records the result of the action (observation) in the agent’s memory.

If the action produces a final answer, the agent terminates early and returns the result.

5. Final Answer Validation

If a final answer is generated, the agent validates it using a list of predefined checks (final_answer_checks):

if final_answer is not None and self.final_answer_checks is not None:
    for check_function in self.final_answer_checks:
        try:
            assert check_function(final_answer, self.memory)
        except Exception as e:
            final_answer = None
            raise AgentError(f"Check {check_function.__name__} failed with error: {e}", self.logger)

These checks ensure that the final answer meets specific criteria before being accepted. If any check fails, the agent continues searching for a solution.
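
As an illustration, a check is just a callable taking the candidate answer and the agent's memory. The sketch below assumes final_answer_checks can be supplied at construction time, which matches the self.final_answer_checks attribute used in the loop above; the check itself is made up:

from smolagents import CodeAgent, HfApiModel

def is_non_empty_text(final_answer, memory) -> bool:
    # Reject empty or whitespace-only answers; the loop above asserts this return value.
    return isinstance(final_answer, str) and final_answer.strip() != ""

agent = CodeAgent(
    tools=[],
    model=HfApiModel(),
    final_answer_checks=[is_non_empty_text],  # assumed constructor argument
)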

6. Error Handling

If an error occurs during execution, the agent records it in the memory_step object:

except AgentError as e:
    memory_step.error = e

This ensures that errors are logged and can be reviewed later for debugging purposes.

7. Finalizing the Step

After completing the step, the agent finalizes the memory_step object and appends it to its memory:

finally:
    memory_step.end_time = time.time()
    memory_step.duration = memory_step.end_time - step_start_time
    self.memory.steps.append(memory_step)

The agent also triggers any registered callbacks, allowing external systems to monitor or modify its behavior dynamically:

for callback in self.step_callbacks:
    if len(inspect.signature(callback).parameters) == 1:
        callback(memory_step)
    else:
        callback(memory_step, agent=self)
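
Here is a hedged example of such a callback, assuming callbacks are registered via a step_callbacks argument at construction (as the attribute above suggests). The two-parameter signature matches the dispatch above, which passes agent=self when the callback accepts more than one argument:

from smolagents import CodeAgent, HfApiModel

def log_step_duration(memory_step, agent=None):
    # Called after each step is finalized; duration is set in the `finally` block above.
    duration = getattr(memory_step, "duration", None)
    if duration is not None:
        print(f"Step {memory_step.step_number} took {duration:.2f}s")

agent = CodeAgent(tools=[], model=HfApiModel(), step_callbacks=[log_step_duration])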

Finally, the agent increments the step number and yields the completed memory_step:

self.step_number += 1
yield memory_step

8. Handling Maximum Steps

If the agent reaches the maximum number of steps without solving the task, it attempts to provide a final answer based on its memory:

if final_answer is None and self.step_number == self.max_steps + 1:
    error_message = "Reached max steps."
    final_answer = self.provide_final_answer(task, images)
    final_memory_step = ActionStep(
        step_number=self.step_number, error=AgentMaxStepsError(error_message, self.logger)
    )
    final_memory_step.action_output = final_answer
    final_memory_step.end_time = time.time()
    final_memory_step.duration = memory_step.end_time - step_start_time
    self.memory.steps.append(final_memory_step)
    for callback in self.step_callbacks:
        if len(inspect.signature(callback).parameters) == 1:
            callback(final_memory_step)
        else:
            callback(final_memory_step, agent=self)
    yield final_memory_step

The provide_final_answer method summarizes the agent's interactions and generates a response based on the available information.

9. Yielding the Final Answer

Finally, the method yields the final answer after handling all steps:

yield handle_agent_output_types(final_answer)

This ensures that the output is properly formatted and ready for use by the caller.

The _run method is the backbone of the MultiStepAgent's functionality. By managing task execution, memory, logging, and error handling, it ensures that the agent operates efficiently and adapts dynamically to new information. This method exemplifies how the ReAct framework combines reasoning, action, and observation to solve complex problems, making it a powerful tool for a wide range of applications.

Other Methods

There are other methods in the class that I don’t dig deeper into and only explain at a high level:

1. initialize_system_prompt

This method is designed to be implemented by child classes (e.g., CodeAgent, ToolCallingAgent). It generates the system prompt that provides context and instructions to the language model (LLM). The system prompt typically includes details about the task, available tools, and any constraints or guidelines for the agent.

2. write_memory_to_messages

This method converts the agent’s memory into a format suitable for input to the LLM. It includes past actions, observations, errors, and plans, ensuring the LLM has access to the agent’s history. This helps the agent maintain continuity and avoid redundant steps.

3. visualize

The visualize method creates a rich tree visualization of the agent's structure, including its tools, managed agents, and memory. This is particularly useful for debugging and understanding the agent's internal state.

4. extract_action

This method parses the LLM’s output to extract the action and rationale. It splits the output using a predefined token (e.g., “Action:”) and ensures that the action is correctly formatted for execution. If the output does not conform to the expected format, an error is raised.

5. provide_final_answer

When the agent reaches the maximum number of steps without solving the task, this method attempts to provide a final answer based on the agent’s memory. It constructs a prompt summarizing the agent’s interactions and asks the LLM to generate a response.

6. execute_tool_call

This method executes a tool call with the provided arguments. It replaces placeholders in the arguments with actual values from the agent’s state and invokes the appropriate tool or managed agent. If the tool call fails, detailed error messages are logged to help diagnose the issue.

7. replay

The replay method provides a step-by-step replay of the agent's actions, observations, and errors. If the detailed parameter is set to True, it also displays the agent's memory at each step. This is primarily used for debugging and analysis.

8. __call__

This method allows the agent to be called as a managed agent by another agent. It adds additional prompting for the managed agent, runs the task, and wraps the output in a standardized format. This is useful for hierarchical agent architectures where one agent manages others.

9. Helper Methods

  • get_variable_names : Extracts variable names from a Jinja2 template string.
  • populate_template : Renders a Jinja2 template with the provided variables, ensuring a dynamic generation of prompts.
  • handle_agent_output_types : Processes the agent's output to handle different types (e.g., text, images, audio) before returning it to the user.
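
To give a feel for the template helpers, here is an illustrative Jinja2 rendering in the spirit of populate_template; this is a sketch, not the library's actual implementation:

from jinja2 import StrictUndefined, Template

def render_prompt(template_str: str, variables: dict) -> str:
    # StrictUndefined makes missing variables raise an error instead of rendering as empty text.
    return Template(template_str, undefined=StrictUndefined).render(**variables)

print(render_prompt("Solve the following task: {{ task }}", {"task": "Add 2 and 3."}))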

ToolCallingAgent

The ToolCallingAgent is a specialized agent within the smolagents library that leverages JSON-like tool calls to interact with external tools and managed agents. It builds upon the foundational MultiStepAgent class and introduces specific mechanisms for generating, executing, and managing tool calls using the capabilities of the underlying language model (LLM). Below, I will explain the key components and functionality of the ToolCallingAgent in detail.

Purpose of the ToolCallingAgent

The ToolCallingAgent is designed to solve tasks by invoking predefined tools or managed agents in a structured manner. It uses the LLM's ability to generate JSON-like tool calls, ensuring clarity and precision in its interactions. This makes it particularly suitable for scenarios where the agent needs to interact with APIs, databases, or other external systems.

Key Features

1- JSON-Like Tool Calls:

  • The agent generates tool calls in a JSON-like format, which includes the tool name, arguments, and an identifier (tool_call_id).
  • This structured approach ensures that the agent can invoke tools and interpret their outputs consistently.

2- Dynamic Tool Execution:

  • The agent dynamically executes tools based on the generated tool calls.
  • If the tool call specifies a “final answer,” the agent terminates early and returns the result.

3- State Management:

  • The agent maintains a state dictionary (self.state) to store intermediate results, such as outputs from tools or observations.
  • This allows the agent to reference previous results when generating subsequent actions.

4- Logging and Transparency:

  • The agent logs each step, including tool calls, observations, and errors, providing visibility into its decision-making process.

5- Fallback Mechanism:

  • If the agent reaches the maximum number of steps without solving the task, it attempts to provide a final answer based on its memory.
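
Before walking through the methods, here is a hedged end-to-end sketch of these features: a custom tool defined with the @tool decorator (the weather lookup is mocked) driven by a ToolCallingAgent:

from smolagents import HfApiModel, ToolCallingAgent, tool

@tool
def get_weather(city: str) -> str:
    """Return a short (mocked) weather report for a city.

    Args:
        city: Name of the city to look up.
    """
    return f"The weather in {city} is sunny with a high of 22°C."

agent = ToolCallingAgent(tools=[get_weather], model=HfApiModel())
print(agent.run("What is the weather like in Oslo right now?"))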

Detailed Explanation of Methods

1. initialize_system_prompt

This method generates the system prompt that provides context and instructions to the LLM. It populates the prompt template with details about the available tools and managed agents:

def initialize_system_prompt(self) -> str:
    system_prompt = populate_template(
        self.prompt_templates["system_prompt"],
        variables={"tools": self.tools, "managed_agents": self.managed_agents},
    )
    return system_prompt
  • The populate_template function renders the Jinja2 template with the provided variables, ensuring dynamic generation of the system prompt.
  • The system prompt serves as a guide for the LLM, helping it understand the task, available tools, and constraints.

2. step

The step method performs one iteration of the ReAct framework, involving action generation, tool execution, and observation logging. Below is a detailed breakdown:

a. Generating Model Output

The agent prepares input messages by converting its memory into a format suitable for the LLM. It then generates a response, which includes the next tool call:

model_message: ChatMessage = self.model(
    memory_messages,
    tools_to_call_from=list(self.tools.values()),
    stop_sequences=["Observation:"],
)
  • The tools_to_call_from parameter specifies the list of available tools that the LLM can invoke.
  • The stop_sequences parameter ensures that the LLM stops generating output when it encounters the "Observation:" token.

b. Parsing Tool Call

The agent parses the tool call from the model’s output:

tool_call = model_message.tool_calls[0]
tool_name, tool_call_id = tool_call.function.name, tool_call.id
tool_arguments = tool_call.function.arguments
  • The tool call includes the tool name, arguments, and an identifier (tool_call_id).
  • If the model does not generate any tool calls, an exception is raised.

c. Logging the Tool Call

The agent logs the tool call for transparency:

self.logger.log(
    Panel(Text(f"Calling tool: '{tool_name}' with arguments: {tool_arguments}")),
    level=LogLevel.INFO,
)
  • The log provides visibility into the agent’s actions, making it easier to debug and analyze its behavior.

d. Handling Final Answer

If the tool call specifies a “final answer,” the agent extracts the answer and returns it:

if tool_name == "final_answer":
    if isinstance(tool_arguments, dict):
        if "answer" in tool_arguments:
            answer = tool_arguments["answer"]
        else:
            answer = tool_arguments
    else:
        answer = tool_arguments
    if (
        isinstance(answer, str) and answer in self.state.keys()
    ):  # if the answer is a state variable, return the value
        final_answer = self.state[answer]
        self.logger.log(
            f"[bold {YELLOW_HEX}]Final answer:[/bold {YELLOW_HEX}] Extracting key '{answer}' from state to return value '{final_answer}'.",
            level=LogLevel.INFO,
        )
    else:
        final_answer = answer
        self.logger.log(
            Text(f"Final answer: {final_answer}", style=f"bold {YELLOW_HEX}"),
            level=LogLevel.INFO,
        )
    memory_step.action_output = final_answer
    return final_answer
  • If the answer is a state variable, the agent retrieves its value from the state dictionary.
  • The final answer is logged and returned to the caller.

e. Executing Other Tools

For non-final tool calls, the agent executes the specified tool and processes the result:

observation = self.execute_tool_call(tool_name, tool_arguments)
observation_type = type(observation)
if observation_type in [AgentImage, AgentAudio]:
    if observation_type == AgentImage:
        observation_name = "image.png"
    elif observation_type == AgentAudio:
        observation_name = "audio.mp3"
    self.state[observation_name] = observation
    updated_information = f"Stored '{observation_name}' in memory."
else:
    updated_information = str(observation).strip()
self.logger.log(
    f"Observations: {updated_information.replace('[', '|')}",
    level=LogLevel.INFO,
)
memory_step.observations = updated_information
return None
  • The execute_tool_call method invokes the tool with the provided arguments.
  • If the observation is an image or audio, it is stored in the state dictionary for future reference.
  • The observation is logged and added to the agent’s memory.

Use Cases

— API Interaction:

  • The ToolCallingAgent is ideal for interacting with RESTful APIs or other web services. Its structured tool calls ensure precise and reliable communication.

— Task Automation:

  • The agent can automate workflows involving multiple tools or services, such as data processing, file management, or report generation.

— Complex Problem Solving:

  • For tasks requiring structured reasoning and interaction with external systems, the ToolCallingAgent provides a robust solution.

The ToolCallingAgent is a powerful tool for building intelligent agents capable of interacting with external systems in a structured and precise manner. By leveraging JSON-like tool calls, it ensures clarity and reliability in its operations, making it an excellent choice for automating workflows, interacting with APIs, and solving complex problems. Its integration with the ReAct framework and comprehensive logging further enhance its capabilities, enabling developers to build efficient and effective solutions.

CodeAgent

The CodeAgent is a specialized agent within the smolagents library, designed to solve tasks by generating and executing Python code snippets. Unlike the ToolCallingAgent, which relies on JSON-like tool calls, the CodeAgent focuses on parsing and executing code generated by the language model (LLM). This makes it particularly well-suited for computational tasks, such as mathematical calculations, data analysis, or simulations. This agent is based on the paper "Executable Code Actions Elicit Better LLM Agents".

Below, I will explain the key components and functionality of the CodeAgent in detail, focusing on its unique features and how it operates within the ReAct framework.

Purpose of the CodeAgent

The CodeAgent is tailored for tasks that require computational reasoning or automation, such as mathematical computations, data analysis, or simulations. It leverages the LLM's ability to generate Python code and executes it using either a local interpreter or a remote executor. This makes it particularly suitable for scenarios where precise and dynamic computation is required.

Key Features

— Code Generation:

  • The agent generates Python code snippets based on the task description and available tools.
  • The code is parsed and executed dynamically, enabling real-time problem-solving.

— Authorized Imports:

  • The agent supports a list of authorized imports (additional_authorized_imports) to ensure safe and controlled execution of code.
  • If all imports are authorized ("*"), the agent can import any package, but this may lead to issues if the package is not installed in the environment.

— Remote Execution:

  • The agent optionally uses the E2BExecutor for remote code execution, which is useful for scaling or running code in isolated environments.

— State Management:

  • The agent maintains a state dictionary (self.state) to store intermediate results, such as outputs from executed code snippets.
  • This allows the agent to reference previous results when generating subsequent actions.

— Logging and Transparency:

  • The agent logs each step, including generated code, execution logs, and errors, providing visibility into its decision-making process.

— Fallback Mechanism:

  • If the agent reaches the maximum number of steps without solving the task, it attempts to provide a final answer based on its memory.
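
Before the method-level walkthrough, here is a hedged sketch of configuring these features; the task and import list are illustrative, and the remote-execution flag is only shown commented out:

from smolagents import CodeAgent, HfApiModel

agent = CodeAgent(
    tools=[],
    model=HfApiModel(),
    additional_authorized_imports=["pandas", "numpy"],  # allowed on top of the base builtin modules
    # use_e2b_executor=True,  # would run generated code remotely via the E2BExecutor instead of locally
)
print(agent.run("Build a pandas DataFrame with the squares of 1..5 and return their sum."))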

Detailed Explanation of Methods

1. Initialization (__init__)

The constructor initializes the agent with the necessary components and configurations:

— Authorized Imports:

self.authorized_imports = list(set(BASE_BUILTIN_MODULES) | set(self.additional_authorized_imports))
  • Combines base built-in modules with additional authorized imports to define the scope of permissible imports.

— Prompt Templates:

prompt_templates = prompt_templates or yaml.safe_load(
    importlib.resources.files("smolagents.prompts").joinpath("code_agent.yaml").read_text()
)
  • Loads predefined prompt templates from a YAML file, ensuring consistency in how the agent interacts with the LLM.

— Python Executor:

  • If use_e2b_executor is True, the agent uses the E2BExecutor for remote execution.
  • Otherwise, it defaults to the LocalPythonInterpreter.

— Validation:

  • Ensures compatibility between remote execution and managed agents:
if use_e2b_executor and len(self.managed_agents) > 0:
    raise Exception(
        f"You passed both {use_e2b_executor=} and some managed agents. Managed agents is not yet supported with remote code execution."
    )

2. System Prompt Initialization (initialize_system_prompt)

This method generates the system prompt that provides context and instructions to the LLM:

system_prompt = populate_template(
    self.prompt_templates["system_prompt"],
    variables={
        "tools": self.tools,
        "managed_agents": self.managed_agents,
        "authorized_imports": (
            "You can import from any package you want."
            if "*" in self.authorized_imports
            else str(self.authorized_imports)
        ),
    },
)
  • The populate_template function renders the Jinja2 template with the provided variables, ensuring dynamic generation of the system prompt.
  • The system prompt includes details about the available tools, managed agents, and authorized imports.

3. Step Execution (step)

The step method performs one iteration of the ReAct framework, involving code generation, parsing, execution, and observation logging. Below is a detailed breakdown:

a. Generating Model Output

The agent prepares input messages by converting its memory into a format suitable for the LLM. It then generates a response, which includes the next code snippet:

chat_message: ChatMessage = self.model(
    self.input_messages,
    stop_sequences=["<end_code>", "Observation:"],
    **additional_args,
)
model_output = chat_message.content
  • The stop_sequences parameter ensures that the LLM stops generating output when it encounters specific tokens.
  • The generated code snippet is stored in model_output.

b. Parsing Code

The agent parses the generated code snippet using the parse_code_blobs function:

code_action = fix_final_answer_code(parse_code_blobs(model_output))
  • The fix_final_answer_code function ensures that the code snippet is properly formatted for execution.

c. Logging the Code

The agent logs the parsed code for transparency:

self.logger.log_code(title="Executing parsed code:", content=code_action, level=LogLevel.INFO)
  • The log provides visibility into the code being executed, making it easier to debug and analyze the agent’s behavior.

d. Executing Code

The agent executes the parsed code using the configured Python executor:

output, execution_logs, is_final_answer = self.python_executor(
    code_action,
    self.state,
)
  • The python_executor runs the code snippet and returns the output, execution logs, and a flag indicating whether the result is the final answer.

e. Handling Errors

If an error occurs during execution, the agent logs the error and raises an exception:

if "Import of " in error_msg and " is not allowed" in error_msg:
    self.logger.log(
        "[bold red]Warning to user: Code execution failed due to an unauthorized import - Consider passing said import under `additional_authorized_imports` when initializing your CodeAgent.",
        level=LogLevel.INFO,
    )
raise AgentExecutionError(error_msg, self.logger)
  • The error message provides guidance on resolving unauthorized import issues.

f. Logging Observations

The agent logs the execution logs and output for transparency:

execution_outputs_console += [
    Text(
        f"{('Out - Final answer' if is_final_answer else 'Out')}: {truncated_output}",
        style=(f"bold {YELLOW_HEX}" if is_final_answer else ""),
    ),
]
self.logger.log(Group(*execution_outputs_console), level=LogLevel.INFO)
  • The log highlights the final answer if applicable, making it easier to identify the solution.

g. Returning the Result

If the result is the final answer, the agent returns it; otherwise, it continues to the next step:

return output if is_final_answer else None

Use Cases

— Mathematical Computations:

  • The CodeAgent is ideal for tasks requiring complex calculations, such as solving equations or performing statistical analyses.

— Data Analysis:

  • The agent can process and analyze datasets, generate visualizations, and extract insights using Python libraries like Pandas and Matplotlib.

— Script Automation:

  • The agent can automate repetitive tasks, such as file processing or API interactions, by generating and executing Python scripts.

Unlike the ToolCallingAgent, which focuses on invoking predefined tools, the CodeAgent emphasizes executing dynamically generated code snippets. The CodeAgent is a powerful tool for building intelligent agents capable of solving computational problems through Python code generation and execution. By leveraging the capabilities of the underlying LLM and Python interpreters, it ensures clarity, precision, and reliability in its operations. Its integration with the ReAct framework and comprehensive logging further enhance its capabilities, enabling developers to build efficient and effective solutions for a wide range of applications. As shown in the image at the beginning of the post, CodeAgents can perform better with fewer interactions with the environment and fewer tool calls!

Conclusion

The smolagents library from HuggingFace provides a robust and flexible framework for building intelligent agents capable of solving complex tasks through structured reasoning, tool usage, and dynamic planning. The three primary agent types—MultiStepAgent, CodeAgent, and ToolCallingAgent—each serve distinct purposes, catering to different use cases while sharing a common foundation in the ReAct framework.

The MultiStepAgent acts as the backbone of the library, offering a general-purpose solution for tasks that require iterative cycles of action and observation. Its ability to periodically plan and adapt ensures that it remains focused on achieving its objectives, making it suitable for a wide range of applications, from debugging code to analyzing data.

The CodeAgent specializes in computational tasks, leveraging Python code generation and execution to solve problems dynamically. With features like authorized imports and remote execution, it strikes a balance between safety and flexibility, enabling developers to tackle mathematical computations, data analysis, and automation workflows with precision.

The ToolCallingAgent, on the other hand, excels in scenarios requiring structured interactions with external tools or APIs. By generating JSON-like tool calls, it ensures clarity and reliability in invoking predefined actions, making it ideal for API integrations, task automation, and multi-tool workflows.

Each agent type demonstrates the power of combining language models with structured reasoning and external tools. While they share similarities, their differences highlight the importance of choosing the right agent for the task at hand. For instance, the CodeAgent is indispensable for computational tasks, whereas the ToolCallingAgent shines in environments requiring precise tool invocations.

The smolagents library exemplifies how modular design and thoughtful implementation can empower developers to build versatile and efficient solutions. Whether you're automating repetitive tasks, performing complex computations, or interacting with external systems, these agents provide the tools and structure needed to achieve your goals effectively.

As artificial intelligence continues to evolve, frameworks like smolagents pave the way for more sophisticated and capable agents. By understanding the strengths and nuances of each agent type, developers can harness their full potential to solve real-world problems and drive innovation forward.


Written by Isaac Kargar

AI Researcher | Ph.D. candidate at the Intelligent Robotics Group at Aalto University | https://kargarisaac.github.io/
