Exploring the smolagents Library: A Deep Dive into MultiStepAgent, CodeAgent, and ToolCallingAgent
In the realm of artificial intelligence, agents are entities that interact with environments to achieve specific goals. The smolagents library from HuggingFace provides a robust framework for building such agents. It offers three primary agent types: MultiStepAgent, CodeAgent, and ToolCallingAgent. Each of these agents is designed to solve tasks in unique ways, leveraging language models and tools to execute actions and interpret results.
Note: This blog post is based on smolagents v1.8.0.
This blog post will explore the functionality, implementation, and use cases of each agent type. We will delve into the code to highlight key sections and explain how they contribute to the overall behaviour of the agents. Additionally, we will discuss when to use one agent over another and the differences between them.
Here is the repo for the smolagents library: https://github.com/huggingface/smolagents
And here is the documentation: https://huggingface.co/docs/smolagents
[GIF: how the agent loop works]
Understanding the MultiStepAgent
The MultiStepAgent is the foundational class for all agents in the smolagents library. It implements the ReAct framework, which involves cycles of action and observation until the objective is achieved or the maximum number of steps is reached. This agent serves as a blueprint for more specialized agents like CodeAgent and ToolCallingAgent.
Key Features
- ReAct Framework: The agent performs a sequence of actions based on input from a language model and observes the environment’s response.
- Customizable Tools: Developers can define and integrate custom tools that the agent can use during its execution.
- Planning Capability: The agent periodically plans its next steps using a planning interval, ensuring it stays on track toward achieving its goal.
Code Highlights
In order to understand what happens when a user asks the agent to solve a task, let's look at how the flow from input to output works. The MultiStepAgent class is designed to solve tasks step by step using the ReAct framework, which involves cycles of thinking (actions generated by a language model) and observing (results obtained from the environment or tools). Below, I will explain how the run() method works, breaking it down into its main steps, including planning, tool calling, and other important parts.
1. Initialization in run()
The run method in the MultiStepAgent class is a critical function that orchestrates the entire process of task execution by the agent. It serves as the entry point for running tasks and manages the flow of operations, including initialization, memory management, logging, and step-by-step execution. Below, I will explain the method in detail, breaking it into its key components and explaining how each part contributes to the overall functionality.
Purpose of the Method
The run method is responsible for executing a given task using the ReAct framework. It initializes the necessary components, manages the conversation state, and either streams intermediate results or returns the final output after completing all steps. This method ensures that the agent operates efficiently and adapts dynamically to new information during execution.
Key Components
1. Input Parameters
The method accepts several parameters that define how the task should be executed:
- task (str): The primary task the agent needs to solve. This is the main input provided by the user.
- stream (bool, default False): Determines whether the agent should return intermediate results as they are generated (streaming mode) or only return the final result at the end.
- reset (bool, default True): Specifies whether the agent should reset its memory and start fresh or continue from where it left off in a previous run.
- images (list[str], optional): Paths to images that may be relevant to the task.
- additional_args (dict, optional): Any additional variables (e.g., dataframes, images) that the agent can use during execution. These are added to the agent's state dictionary.
2. Task Initialization
The method begins by assigning the task to the agent's self.task attribute. If additional_args are provided, they are added to the agent's state dictionary (self.state) and appended to the task description. This ensures that the agent has access to these variables during execution.
if additional_args is not None:
    self.state.update(additional_args)
    self.task += f"""
You have been provided with these additional arguments, that you can access using the keys as variables in your python code:
{str(additional_args)}."""
This step is particularly useful for tasks that require external inputs, such as data analysis or image processing.
3. System Prompt Initialization
The system prompt is a crucial component of the agent's operation. It provides context and instructions to the language model (LLM) about how to approach the task. The initialize_system_prompt method generates the system prompt based on predefined templates and the agent's configuration.
self.system_prompt = self.initialize_system_prompt()
self.memory.system_prompt = SystemPromptStep(system_prompt=self.system_prompt)
The system prompt is stored in the agent's memory as a SystemPromptStep, ensuring it is available for future reference.
4. Memory and Monitor Reset
If the reset parameter is set to True, the agent clears its memory and resets the monitoring metrics. This ensures that the agent starts with a clean slate for the new task.
if reset:
    self.memory.reset()
    self.monitor.reset()
Resetting the memory is essential when the task is unrelated to previous runs, while resetting the monitor ensures accurate tracking of performance metrics.
5. Logging the Task
The agent logs the task details for transparency and debugging purposes. This includes the task content, the model being used, and any optional title or subtitle.
self.logger.log_task(
    content=self.task.strip(),
    subtitle=f"{type(self.model).__name__} - {(self.model.model_id if hasattr(self.model, 'model_id') else '')}",
    level=LogLevel.INFO,
    title=self.name if hasattr(self, "name") else None,
)
6. Adding the Task to Memory
The task is added to the agent's memory as a TaskStep. This step includes any associated images and serves as the starting point for the agent's execution.
self.memory.steps.append(TaskStep(task=self.task, task_images=images))
Storing the task in memory ensures that it is available for future reference during planning and execution.
7. Execution Mode
The method supports two execution modes: streaming and non-streaming.
a. Streaming Mode
If stream=True, the method calls the _run generator function, which yields intermediate results as they are generated. This allows users to observe the agent's progress in real-time.
if stream:
    return self._run(task=self.task, images=images)
Streaming mode is useful for tasks that require continuous feedback or monitoring.
b. Non-Streaming Mode
If stream=False, the method collects all intermediate results and returns only the final output. This is achieved using a deque to extract the last step's result.
return deque(self._run(task=self.task, images=images), maxlen=1)[0]
Non-streaming mode is suitable for scenarios where only the final result is needed.
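From the caller's perspective, the two modes look roughly like this. This is a sketch that reuses the hypothetical agent and dataframe from the earlier example; in streaming mode the last item yielded is the final answer itself.
# Non-streaming: block until the final answer is available
final = agent.run("Summarize the provided dataframe", additional_args={"df": df})

# Streaming: iterate over memory steps as they are produced
for item in agent.run("Summarize the provided dataframe", additional_args={"df": df}, stream=True):
    print(type(item).__name__)  # e.g. ActionStep, PlanningStep, then the final answer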
Interaction with _run
The _run method is a generator function that performs the actual step-by-step execution of the task. It handles the following:
- Planning Steps: Periodically updates the plan based on the planning_interval.
- Action Generation: Generates actions using the language model.
- Tool Execution: Executes tools based on the generated actions.
- Observation Logging: Logs observations and errors for future reference.
- Final Answer Provision: Provides a final answer if the maximum number of steps is reached without solving the task.
The run method delegates the core execution logic to _run, ensuring modularity and separation of concerns. We will dig deeper into the _run method later in the post.
2. Planning Step
Before diving into the main execution loop, the agent may perform an initial planning step if configured to do so (planning_interval is set). The planning_step method in the MultiStepAgent class is a crucial component of the agent's decision-making process. It is responsible for periodically generating or updating a plan to guide the agent toward solving the given task. This method ensures that the agent operates in a structured and goal-oriented manner, leveraging both its memory and the language model (LLM) to formulate plans and update them as needed.
Let us break down the method in detail:
Purpose of the Method
The planning_step method is invoked at regular intervals during the agent's execution, determined by the planning_interval parameter. Its primary purpose is to:
- Generate an Initial Plan: On the first step, the agent creates a high-level plan based on the task, available tools, and any initial facts it knows.
- Update the Plan: For subsequent steps, the agent revises its plan by incorporating new observations, facts, and progress made so far.
- Log the Plan and Facts: The method logs the generated plan and facts for transparency and debugging purposes.
This structured approach helps the agent stay focused on the task and adapt dynamically to new information.
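planning_interval is chosen when the agent is constructed. A minimal sketch follows; the model class is an illustrative choice.
from smolagents import CodeAgent, HfApiModel

# Plan on the first step, then refresh the plan every 3 steps
agent = CodeAgent(
    tools=[],
    model=HfApiModel(),
    planning_interval=3,
)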
Key Components
1. Inputs
The method takes three arguments:
- task: A string describing the task the agent needs to solve.
- is_first_step: A boolean indicating whether this is the first planning step. If True, the method generates an initial plan; otherwise, it updates the existing plan.
- step: The current step number, used to calculate remaining steps and provide context for planning.
2. Initial Planning
If is_first_step is True, the method performs the following steps:
a. Extracting Initial Facts
The agent starts by extracting initial facts about the task using the language model:
message_prompt_facts = {
    "role": MessageRole.SYSTEM,
    "content": [{"type": "text", "text": self.prompt_templates["planning"]["initial_facts"]}],
}
input_messages = [message_prompt_facts]
chat_message_facts: ChatMessage = self.model(input_messages)
answer_facts = chat_message_facts.content
- The message_prompt_facts template prompts the LLM to generate relevant facts about the task.
- The LLM processes the prompt and returns the extracted facts (answer_facts).
b. Generating the Initial Plan
Using the extracted facts, the agent formulates an initial plan:
message_prompt_plan = {
    "role": MessageRole.USER,
    "content": [
        {
            "type": "text",
            "text": populate_template(
                self.prompt_templates["planning"]["initial_plan"],
                variables={
                    "task": task,
                    "tools": self.tools,
                    "managed_agents": self.managed_agents,
                    "answer_facts": answer_facts,
                },
            ),
        }
    ],
}
chat_message_plan: ChatMessage = self.model([message_prompt_plan], stop_sequences=["<end_plan>"])
answer_plan = chat_message_plan.content
- The message_prompt_plan template combines the task, tools, managed agents, and extracted facts to create a detailed plan.
- The LLM processes the prompt and generates the plan (answer_plan).
c. Logging and Storing the Plan
The generated plan and facts are logged and stored in the agent’s memory:
final_plan_redaction = f"""Here is the plan of action that I will follow to solve the task:
```
{answer_plan}
```"""
final_facts_redaction = f"""Here are the facts that I know so far:
```
{answer_facts}
```""".strip()
self.memory.steps.append(
    PlanningStep(
        model_input_messages=input_messages,
        plan=final_plan_redaction,
        facts=final_facts_redaction,
        model_output_message_plan=chat_message_plan,
        model_output_message_facts=chat_message_facts,
    )
)
self.logger.log(
    Rule("[bold]Initial plan", style="orange"),
    Text(final_plan_redaction),
    level=LogLevel.INFO,
)
- The plan and facts are formatted into readable strings and added to the agent's memory as a PlanningStep.
- The logger records the plan for visibility.
3. Updating the Plan
If is_first_step is False, the method updates the existing plan by incorporating new information:
a. Extracting Updated Facts
The agent retrieves updated facts from its memory and combines them with pre-defined prompts:
facts_update_pre_messages = {
    "role": MessageRole.SYSTEM,
    "content": [{"type": "text", "text": self.prompt_templates["planning"]["update_facts_pre_messages"]}],
}
facts_update_post_messages = {
    "role": MessageRole.SYSTEM,
    "content": [{"type": "text", "text": self.prompt_templates["planning"]["update_facts_post_messages"]}],
}
input_messages = [facts_update_pre_messages] + memory_messages + [facts_update_post_messages]
chat_message_facts: ChatMessage = self.model(input_messages)
facts_update = chat_message_facts.content
- The memory_messages include the agent's past actions, observations, and errors.
- The LLM processes the combined input to generate updated facts (facts_update).
b. Generating the Updated Plan
Using the updated facts, the agent revises its plan:
update_plan_pre_messages = {
    "role": MessageRole.SYSTEM,
    "content": [
        {
            "type": "text",
            "text": populate_template(
                self.prompt_templates["planning"]["update_plan_pre_messages"], variables={"task": task}
            ),
        }
    ],
}
update_plan_post_messages = {
    "role": MessageRole.SYSTEM,
    "content": [
        {
            "type": "text",
            "text": populate_template(
                self.prompt_templates["planning"]["update_plan_post_messages"],
                variables={
                    "task": task,
                    "tools": self.tools,
                    "managed_agents": self.managed_agents,
                    "facts_update": facts_update,
                    "remaining_steps": (self.max_steps - step),
                },
            ),
        }
    ],
}
chat_message_plan: ChatMessage = self.model(
    [update_plan_pre_messages] + memory_messages + [update_plan_post_messages],
    stop_sequences=["<end_plan>"],
)
answer_plan = chat_message_plan.content
- The updated plan considers the remaining steps (self.max_steps - step) and incorporates the latest facts.
- The LLM processes the prompt and generates the revised plan (answer_plan).
c. Logging and Storing the Updated Plan
The updated plan and facts are logged and stored in the agent’s memory:
final_plan_redaction = textwrap.dedent(
    f"""I still need to solve the task I was given:
```
{task}
```
Here is my new/updated plan of action to solve the task:
```
{answer_plan}
```"""
)
final_facts_redaction = textwrap.dedent(
    f"""Here is the updated list of the facts that I know:
```
{facts_update}
```"""
)
self.memory.steps.append(
    PlanningStep(
        model_input_messages=input_messages,
        plan=final_plan_redaction,
        facts=final_facts_redaction,
        model_output_message_plan=chat_message_plan,
        model_output_message_facts=chat_message_facts,
    )
)
self.logger.log(
    Rule("[bold]Updated plan", style="orange"),
    Text(final_plan_redaction),
    level=LogLevel.INFO,
)
- The updated plan and facts are formatted and added to the agent’s memory.
- The logger records the updated plan for visibility.
3. Core Execution Logic
The _run method in the MultiStepAgent class is a critical function that orchestrates the step-by-step execution of a task using the ReAct framework. It operates as a generator, yielding intermediate results at each step, and ensures that the agent progresses toward solving the task within the constraints of a maximum number of steps (max_steps). Below, I will explain the method in detail, breaking it into its key components and explaining how each part contributes to the overall functionality.
Purpose of the Method
The _run method is responsible for executing a given task in a structured and iterative manner. It manages the following:
- Planning: Periodically updates the plan based on the agent’s progress.
- Action Execution: Generates actions using the language model (LLM) and executes them.
- Observation Logging: Records observations, errors, and other relevant information for future reference.
- Final Answer Provision: Provides a final answer if the task is solved or if the maximum number of steps is reached without success.
By operating as a generator, the method allows users to observe the agent’s progress in real-time (streaming mode) or retrieve only the final result (non-streaming mode).
Key Components
1. Initialization
The method begins by initializing variables that will be used throughout the execution:
- final_answer: Tracks whether the task has been solved. Initially set to None.
- self.step_number: Tracks the current step number, starting from 1.
- memory_step: Represents the current step in the agent's memory, including details such as the step number, start time, and associated images.
final_answer = None
self.step_number = 1
while final_answer is None and self.step_number <= self.max_steps:
    step_start_time = time.time()
    memory_step = ActionStep(
        step_number=self.step_number,
        start_time=step_start_time,
        observations_images=images,
    )
This initialization ensures that the agent starts with a clean slate for each step and tracks the progress over time.
2. Planning Step
If a planning_interval is defined, the agent periodically updates its plan. This occurs on the first step and then again at fixed intervals determined by the planning interval.
if self.planning_interval is not None and self.step_number % self.planning_interval == 1:
    self.planning_step(
        task,
        is_first_step=(self.step_number == 1),
        step=self.step_number,
    )
- Initial Planning: On the first step, the agent generates an initial plan based on the task, available tools, and any initial facts.
- Updated Planning: For subsequent steps, the agent revises its plan by incorporating new observations, facts, and progress made so far.
The planning_step method (described earlier) ensures that the agent operates in a structured and goal-oriented manner, adapting dynamically to new information.
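Because the condition compares the remainder to 1 rather than 0, planning runs on step 1 and then every planning_interval steps after that. A tiny standalone illustration:
planning_interval = 3
planning_steps = [n for n in range(1, 11) if n % planning_interval == 1]
print(planning_steps)  # [1, 4, 7, 10]: plan on step 1, then re-plan every 3 steps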
3. Logging the Step
The agent logs the current step number for transparency and debugging purposes:
self.logger.log_rule(f"Step {self.step_number}", level=LogLevel.INFO)
Logging provides visibility into the agent’s activities and helps developers understand its decision-making process.
4. Executing One Step
The core of the _run method is the execution of one step, which involves generating an action and observing the result:
final_answer = self.step(memory_step)
The step method, implemented in child classes (e.g., CodeAgent and ToolCallingAgent, which will be explained later), performs the following:
- Action Generation: Uses the LLM to generate an action based on the current state of the agent’s memory.
- Tool Execution: Executes the generated action using the appropriate tool.
- Observation Logging: Records the result of the action (observation) in the agent’s memory.
If the action produces a final answer, the agent terminates early and returns the result.
5. Final Answer Validation
If a final answer is generated, the agent validates it using a list of predefined checks (final_answer_checks):
if final_answer is not None and self.final_answer_checks is not None:
    for check_function in self.final_answer_checks:
        try:
            assert check_function(final_answer, self.memory)
        except Exception as e:
            final_answer = None
            raise AgentError(f"Check {check_function.__name__} failed with error: {e}", self.logger)
These checks ensure that the final answer meets specific criteria before being accepted. If any check fails, the agent continues searching for a solution.
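For illustration, a check is simply a callable that receives the candidate answer and the agent's memory and returns a truthy value. The sketch below assumes such checks can be passed to the agent's constructor, mirroring the self.final_answer_checks attribute used above; the model class is an illustrative choice.
from smolagents import CodeAgent, HfApiModel

def answer_is_not_empty(final_answer, memory):
    # Reject empty or whitespace-only answers
    return bool(str(final_answer).strip())

agent = CodeAgent(
    tools=[],
    model=HfApiModel(),
    final_answer_checks=[answer_is_not_empty],  # assumed constructor argument
)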
6. Error Handling
If an error occurs during execution, the agent records it in the memory_step object:
except AgentError as e:
    memory_step.error = e
This ensures that errors are logged and can be reviewed later for debugging purposes.
7. Finalizing the Step
After completing the step, the agent finalizes the memory_step object and appends it to its memory:
finally:
    memory_step.end_time = time.time()
    memory_step.duration = memory_step.end_time - step_start_time
    self.memory.steps.append(memory_step)
The agent also triggers any registered callbacks, allowing external systems to monitor or modify its behavior dynamically:
for callback in self.step_callbacks:
    if len(inspect.signature(callback).parameters) == 1:
        callback(memory_step)
    else:
        callback(memory_step, agent=self)
Finally, the agent increments the step number and yields the completed memory_step:
self.step_number += 1
yield memory_step
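A callback may accept either one argument (the memory step) or two (the step and the agent), which is exactly what the signature inspection above checks for. A small sketch of registering one at construction time; the model class is an illustrative choice.
from smolagents import CodeAgent, HfApiModel

def print_step_duration(memory_step):
    # Called after every step with the finalized ActionStep
    print(f"Step {memory_step.step_number} took {memory_step.duration:.2f}s")

agent = CodeAgent(
    tools=[],
    model=HfApiModel(),
    step_callbacks=[print_step_duration],
)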
8. Handling Maximum Steps
If the agent reaches the maximum number of steps without solving the task, it attempts to provide a final answer based on its memory:
if final_answer is None and self.step_number == self.max_steps + 1:
    error_message = "Reached max steps."
    final_answer = self.provide_final_answer(task, images)
    final_memory_step = ActionStep(
        step_number=self.step_number, error=AgentMaxStepsError(error_message, self.logger)
    )
    final_memory_step.action_output = final_answer
    final_memory_step.end_time = time.time()
    final_memory_step.duration = memory_step.end_time - step_start_time
    self.memory.steps.append(final_memory_step)
    for callback in self.step_callbacks:
        if len(inspect.signature(callback).parameters) == 1:
            callback(final_memory_step)
        else:
            callback(final_memory_step, agent=self)
    yield final_memory_step
The provide_final_answer method summarizes the agent's interactions and generates a response based on the available information.
9. Yielding the Final Answer
Finally, the method yields the final answer after handling all steps:
yield handle_agent_output_types(final_answer)
This ensures that the output is properly formatted and ready for use by the caller.
The _run method is the backbone of the MultiStepAgent's functionality. By managing task execution, memory, logging, and error handling, it ensures that the agent operates efficiently and adapts dynamically to new information. This method exemplifies how the ReAct framework combines reasoning, action, and observation to solve complex problems, making it a powerful tool for a wide range of applications.
Other Methods
There are other methods in the class that I won't dig into deeply; I'll only explain them at a high level:
1. initialize_system_prompt
This method is designed to be implemented by child classes (e.g., CodeAgent, ToolCallingAgent). It generates the system prompt that provides context and instructions to the language model (LLM). The system prompt typically includes details about the task, available tools, and any constraints or guidelines for the agent.
2. write_memory_to_messages
This method converts the agent’s memory into a format suitable for input to the LLM. It includes past actions, observations, errors, and plans, ensuring the LLM has access to the agent’s history. This helps the agent maintain continuity and avoid redundant steps.
3. visualize
The visualize method creates a rich tree visualization of the agent's structure, including its tools, managed agents, and memory. This is particularly useful for debugging and understanding the agent's internal state.
4. extract_action
This method parses the LLM’s output to extract the action and rationale. It splits the output using a predefined token (e.g., “Action:”) and ensures that the action is correctly formatted for execution. If the output does not conform to the expected format, an error is raised.
5. provide_final_answer
When the agent reaches the maximum number of steps without solving the task, this method attempts to provide a final answer based on the agent’s memory. It constructs a prompt summarizing the agent’s interactions and asks the LLM to generate a response.
6. execute_tool_call
This method executes a tool call with the provided arguments. It replaces placeholders in the arguments with actual values from the agent’s state and invokes the appropriate tool or managed agent. If the tool call fails, detailed error messages are logged to help diagnose the issue.
7. replay
The replay method provides a step-by-step replay of the agent's actions, observations, and errors. If the detailed parameter is set to True, it also displays the agent's memory at each step. This is primarily used for debugging and analysis.
8. __call__
This method allows the agent to be called as a managed agent by another agent. It adds additional prompting for the managed agent, runs the task, and wraps the output in a standardized format. This is useful for hierarchical agent architectures where one agent manages others.
9. Helper Methods
- get_variable_names: Extracts variable names from a Jinja2 template string.
- populate_template: Renders a Jinja2 template with the provided variables, ensuring dynamic generation of prompts.
- handle_agent_output_types: Processes the agent's output to handle different types (e.g., text, images, audio) before returning it to the user.
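To give a feel for what populate_template does, here is a rough equivalent using Jinja2 directly; the template string below is made up for illustration and is not one of the library's actual prompts.
from jinja2 import StrictUndefined, Template

template = "You can use these tools: {% for name in tools %}{{ name }} {% endfor %}"
rendered = Template(template, undefined=StrictUndefined).render(
    tools=["web_search", "final_answer"]
)
print(rendered)  # You can use these tools: web_search final_answer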
ToolCallingAgent
The ToolCallingAgent is a specialized agent within the smolagents library that leverages JSON-like tool calls to interact with external tools and managed agents. It builds upon the foundational MultiStepAgent class and introduces specific mechanisms for generating, executing, and managing tool calls using the capabilities of the underlying language model (LLM). Below, I will explain the key components and functionality of the ToolCallingAgent in detail.
Purpose of the ToolCallingAgent
The ToolCallingAgent is designed to solve tasks by invoking predefined tools or managed agents in a structured manner. It uses the LLM's ability to generate JSON-like tool calls, ensuring clarity and precision in its interactions. This makes it particularly suitable for scenarios where the agent needs to interact with APIs, databases, or other external systems.
Key Features
1- JSON-Like Tool Calls:
- The agent generates tool calls in a JSON-like format, which includes the tool name, arguments, and an identifier (tool_call_id).
- This structured approach ensures that the agent can invoke tools and interpret their outputs consistently.
2- Dynamic Tool Execution:
- The agent dynamically executes tools based on the generated tool calls.
- If the tool call specifies a “final answer,” the agent terminates early and returns the result.
3- State Management:
- The agent maintains a state dictionary (self.state) to store intermediate results, such as outputs from tools or observations.
- This allows the agent to reference previous results when generating subsequent actions.
4- Logging and Transparency:
- The agent logs each step, including tool calls, observations, and errors, providing visibility into its decision-making process.
5- Fallback Mechanism:
- If the agent reaches the maximum number of steps without solving the task, it attempts to provide a final answer based on its memory.
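Before looking at the internals, here is a minimal sketch of constructing and running a ToolCallingAgent; the model and tool choices are illustrative.
from smolagents import ToolCallingAgent, HfApiModel, DuckDuckGoSearchTool

model = HfApiModel()
agent = ToolCallingAgent(tools=[DuckDuckGoSearchTool()], model=model)

answer = agent.run("Who maintains the smolagents library?")
print(answer)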
Detailed Explanation of Methods
1. initialize_system_prompt
This method generates the system prompt that provides context and instructions to the LLM. It populates the prompt template with details about the available tools and managed agents:
def initialize_system_prompt(self) -> str:
    system_prompt = populate_template(
        self.prompt_templates["system_prompt"],
        variables={"tools": self.tools, "managed_agents": self.managed_agents},
    )
    return system_prompt
- The populate_template function renders the Jinja2 template with the provided variables, ensuring dynamic generation of the system prompt.
- The system prompt serves as a guide for the LLM, helping it understand the task, available tools, and constraints.
2. step
The step method performs one iteration of the ReAct framework, involving action generation, tool execution, and observation logging. Below is a detailed breakdown:
a. Generating Model Output
The agent prepares input messages by converting its memory into a format suitable for the LLM. It then generates a response, which includes the next tool call:
model_message: ChatMessage = self.model(
    memory_messages,
    tools_to_call_from=list(self.tools.values()),
    stop_sequences=["Observation:"],
)
- The tools_to_call_from parameter specifies the list of available tools that the LLM can invoke.
- The stop_sequences parameter ensures that the LLM stops generating output when it encounters the "Observation:" token.
b. Parsing Tool Call
The agent parses the tool call from the model’s output:
tool_call = model_message.tool_calls[0]
tool_name, tool_call_id = tool_call.function.name, tool_call.id
tool_arguments = tool_call.function.arguments
- The tool call includes the tool name, arguments, and an identifier (tool_call_id).
- If the model does not generate any tool calls, an exception is raised.
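For intuition, a parsed tool call carries roughly the following shape; this is an illustrative mock written as a plain dictionary, not the library's actual class layout.
# Rough shape of a single tool call produced by the model (mocked as a dict)
mock_tool_call = {
    "id": "call_1",
    "function": {
        "name": "web_search",
        "arguments": {"query": "smolagents documentation"},
    },
}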
c. Logging the Tool Call
The agent logs the tool call for transparency:
self.logger.log(
    Panel(Text(f"Calling tool: '{tool_name}' with arguments: {tool_arguments}")),
    level=LogLevel.INFO,
)
- The log provides visibility into the agent’s actions, making it easier to debug and analyze its behavior.
d. Handling Final Answer
If the tool call specifies a “final answer,” the agent extracts the answer and returns it:
if tool_name == "final_answer":
    if isinstance(tool_arguments, dict):
        if "answer" in tool_arguments:
            answer = tool_arguments["answer"]
        else:
            answer = tool_arguments
    else:
        answer = tool_arguments
    if (
        isinstance(answer, str) and answer in self.state.keys()
    ):  # if the answer is a state variable, return the value
        final_answer = self.state[answer]
        self.logger.log(
            f"[bold {YELLOW_HEX}]Final answer:[/bold {YELLOW_HEX}] Extracting key '{answer}' from state to return value '{final_answer}'.",
            level=LogLevel.INFO,
        )
    else:
        final_answer = answer
        self.logger.log(
            Text(f"Final answer: {final_answer}", style=f"bold {YELLOW_HEX}"),
            level=LogLevel.INFO,
        )
    memory_step.action_output = final_answer
    return final_answer
- If the answer is a state variable, the agent retrieves its value from the state dictionary.
- The final answer is logged and returned to the caller.
e. Executing Other Tools
For non-final tool calls, the agent executes the specified tool and processes the result:
observation = self.execute_tool_call(tool_name, tool_arguments)
observation_type = type(observation)
if observation_type in [AgentImage, AgentAudio]:
    if observation_type == AgentImage:
        observation_name = "image.png"
    elif observation_type == AgentAudio:
        observation_name = "audio.mp3"
    self.state[observation_name] = observation
    updated_information = f"Stored '{observation_name}' in memory."
else:
    updated_information = str(observation).strip()
self.logger.log(
    f"Observations: {updated_information.replace('[', '|')}",
    level=LogLevel.INFO,
)
memory_step.observations = updated_information
return None
- The execute_tool_call method invokes the tool with the provided arguments.
- If the observation is an image or audio, it is stored in the state dictionary for future reference.
- The observation is logged and added to the agent's memory.
Use Cases
— API Interaction:
- The ToolCallingAgent is ideal for interacting with RESTful APIs or other web services. Its structured tool calls ensure precise and reliable communication.
— Task Automation:
- The agent can automate workflows involving multiple tools or services, such as data processing, file management, or report generation.
— Complex Problem Solving:
- For tasks requiring structured reasoning and interaction with external systems, the ToolCallingAgent provides a robust solution.
The ToolCallingAgent is a powerful tool for building intelligent agents capable of interacting with external systems in a structured and precise manner. By leveraging JSON-like tool calls, it ensures clarity and reliability in its operations, making it an excellent choice for automating workflows, interacting with APIs, and solving complex problems. Its integration with the ReAct framework and comprehensive logging further enhance its capabilities, enabling developers to build efficient and effective solutions.
CodeAgent
The CodeAgent is a specialized agent within the smolagents library, designed to solve tasks by generating and executing Python code snippets. Unlike the ToolCallingAgent, which relies on JSON-like tool calls, the CodeAgent focuses on parsing and executing code generated by the language model (LLM). This makes it particularly well-suited for computational tasks, such as mathematical calculations, data analysis, or simulations. This agent is based on the paper "Executable Code Actions Elicit Better LLM Agents".
Below, I will explain the key components and functionality of the CodeAgent in detail, focusing on its unique features and how it operates within the ReAct framework.
Purpose of the CodeAgent
The CodeAgent is tailored for tasks that require computational reasoning or automation, such as mathematical computations, data analysis, or simulations. It leverages the LLM's ability to generate Python code and executes it using either a local interpreter or a remote executor. This makes it particularly suitable for scenarios where precise and dynamic computation is required.
Key Features
— Code Generation:
- The agent generates Python code snippets based on the task description and available tools.
- The code is parsed and executed dynamically, enabling real-time problem-solving.
— Authorized Imports:
- The agent supports a list of authorized imports (additional_authorized_imports) to ensure safe and controlled execution of code.
- If all imports are authorized ("*"), the agent can import any package, but this may lead to issues if the package is not installed in the environment.
— Remote Execution:
- The agent optionally uses the E2BExecutor for remote code execution, which is useful for scaling or running code in isolated environments.
— State Management:
- The agent maintains a state dictionary (self.state) to store intermediate results, such as outputs from executed code snippets.
- This allows the agent to reference previous results when generating subsequent actions.
— Logging and Transparency:
- The agent logs each step, including generated code, execution logs, and errors, providing visibility into its decision-making process.
— Fallback Mechanism:
- If the agent reaches the maximum number of steps without solving the task, it attempts to provide a final answer based on its memory.
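A minimal sketch of constructing a CodeAgent with extra authorized imports; the model choice and the task are illustrative.
from smolagents import CodeAgent, HfApiModel

agent = CodeAgent(
    tools=[],
    model=HfApiModel(),
    additional_authorized_imports=["pandas", "numpy"],
)

agent.run("Generate 100 normally distributed numbers and report their mean and standard deviation.")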
Detailed Explanation of Methods
1. Initialization (__init__)
The constructor initializes the agent with the necessary components and configurations:
— Authorized Imports:
self.authorized_imports = list(set(BASE_BUILTIN_MODULES) | set(self.additional_authorized_imports))
- Combines base built-in modules with additional authorized imports to define the scope of permissible imports.
— Prompt Templates:
prompt_templates = prompt_templates or yaml.safe_load(
    importlib.resources.files("smolagents.prompts").joinpath("code_agent.yaml").read_text()
)
- Loads predefined prompt templates from a YAML file, ensuring consistency in how the agent interacts with the LLM.
— Python Executor:
- If use_e2b_executor is True, the agent uses the E2BExecutor for remote execution.
- Otherwise, it defaults to the LocalPythonInterpreter.
— Validation:
- Ensures compatibility between remote execution and managed agents:
if use_e2b_executor and len(self.managed_agents) > 0:
    raise Exception(
        f"You passed both {use_e2b_executor=} and some managed agents. Managed agents is not yet supported with remote code execution."
    )
2. System Prompt Initialization (initialize_system_prompt)
This method generates the system prompt that provides context and instructions to the LLM:
system_prompt = populate_template(
    self.prompt_templates["system_prompt"],
    variables={
        "tools": self.tools,
        "managed_agents": self.managed_agents,
        "authorized_imports": (
            "You can import from any package you want."
            if "*" in self.authorized_imports
            else str(self.authorized_imports)
        ),
    },
)
- The populate_template function renders the Jinja2 template with the provided variables, ensuring dynamic generation of the system prompt.
- The system prompt includes details about the available tools, managed agents, and authorized imports.
3. Step Execution (step)
The step method performs one iteration of the ReAct framework, involving code generation, parsing, execution, and observation logging. Below is a detailed breakdown:
a. Generating Model Output
The agent prepares input messages by converting its memory into a format suitable for the LLM. It then generates a response, which includes the next code snippet:
chat_message: ChatMessage = self.model(
    self.input_messages,
    stop_sequences=["<end_code>", "Observation:"],
    **additional_args,
)
model_output = chat_message.content
- The stop_sequences parameter ensures that the LLM stops generating output when it encounters specific tokens.
- The generated code snippet is stored in model_output.
b. Parsing Code
The agent parses the generated code snippet using the parse_code_blobs function:
code_action = fix_final_answer_code(parse_code_blobs(model_output))
- The fix_final_answer_code function ensures that the code snippet is properly formatted for execution.
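Roughly speaking, the model's raw output interleaves reasoning text with a fenced code block, and parse_code_blobs extracts just the code. An illustrative input/output pair follows; the import path and the exact fence format are assumptions.
from smolagents.utils import parse_code_blobs  # import path assumed

model_output = """Thought: I will compute the answer with Python.
Code:
```py
result = 2 ** 10
final_answer(result)
```<end_code>"""

print(parse_code_blobs(model_output))
# result = 2 ** 10
# final_answer(result)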
c. Logging the Code
The agent logs the parsed code for transparency:
self.logger.log_code(title="Executing parsed code:", content=code_action, level=LogLevel.INFO)
- The log provides visibility into the code being executed, making it easier to debug and analyze the agent’s behavior.
d. Executing Code
The agent executes the parsed code using the configured Python executor:
output, execution_logs, is_final_answer = self.python_executor(
    code_action,
    self.state,
)
- The python_executor runs the code snippet and returns the output, execution logs, and a flag indicating whether the result is the final answer.
e. Handling Errors
If an error occurs during execution, the agent logs the error and raises an exception:
if "Import of " in error_msg and " is not allowed" in error_msg:
    self.logger.log(
        "[bold red]Warning to user: Code execution failed due to an unauthorized import - Consider passing said import under `additional_authorized_imports` when initializing your CodeAgent.",
        level=LogLevel.INFO,
    )
raise AgentExecutionError(error_msg, self.logger)
- The error message provides guidance on resolving unauthorized import issues.
f. Logging Observations
The agent logs the execution logs and output for transparency:
execution_outputs_console += [
    Text(
        f"{('Out - Final answer' if is_final_answer else 'Out')}: {truncated_output}",
        style=(f"bold {YELLOW_HEX}" if is_final_answer else ""),
    ),
]
self.logger.log(Group(*execution_outputs_console), level=LogLevel.INFO)
- The log highlights the final answer if applicable, making it easier to identify the solution.
g. Returning the Result
If the result is the final answer, the agent returns it; otherwise, it continues to the next step:
return output if is_final_answer else None
Use Cases
— Mathematical Computations:
- The CodeAgent is ideal for tasks requiring complex calculations, such as solving equations or performing statistical analyses.
— Data Analysis:
- The agent can process and analyze datasets, generate visualizations, and extract insights using Python libraries like Pandas and Matplotlib.
— Script Automation:
- The agent can automate repetitive tasks, such as file processing or API interactions, by generating and executing Python scripts.
Unlike the ToolCallingAgent, which focuses on invoking predefined tools, the CodeAgent emphasizes executing dynamically generated code snippets. The CodeAgent is a powerful tool for building intelligent agents capable of solving computational problems through Python code generation and execution. By leveraging the capabilities of the underlying LLM and Python interpreters, it ensures clarity, precision, and reliability in its operations. Its integration with the ReAct framework and comprehensive logging further enhance its capabilities, enabling developers to build efficient and effective solutions for a wide range of applications. As shown in the image at the beginning of the post, CodeAgents can perform better with fewer interactions with the environment and fewer tool calls!
Conclusion
The smolagents library from HuggingFace provides a robust and flexible framework for building intelligent agents capable of solving complex tasks through structured reasoning, tool usage, and dynamic planning. The three primary agent types—MultiStepAgent, CodeAgent, and ToolCallingAgent—each serve distinct purposes, catering to different use cases while sharing a common foundation in the ReAct framework.
The MultiStepAgent acts as the backbone of the library, offering a general-purpose solution for tasks that require iterative cycles of action and observation. Its ability to periodically plan and adapt ensures that it remains focused on achieving its objectives, making it suitable for a wide range of applications, from debugging code to analyzing data.
The CodeAgent specializes in computational tasks, leveraging Python code generation and execution to solve problems dynamically. With features like authorized imports and remote execution, it strikes a balance between safety and flexibility, enabling developers to tackle mathematical computations, data analysis, and automation workflows with precision.
The ToolCallingAgent, on the other hand, excels in scenarios requiring structured interactions with external tools or APIs. By generating JSON-like tool calls, it ensures clarity and reliability in invoking predefined actions, making it ideal for API integrations, task automation, and multi-tool workflows.
Each agent type demonstrates the power of combining language models with structured reasoning and external tools. While they share similarities, their differences highlight the importance of choosing the right agent for the task at hand. For instance, the CodeAgent is indispensable for computational tasks, whereas the ToolCallingAgent shines in environments requiring precise tool invocations.
The smolagents library exemplifies how modular design and thoughtful implementation can empower developers to build versatile and efficient solutions. Whether you're automating repetitive tasks, performing complex computations, or interacting with external systems, these agents provide the tools and structure needed to achieve your goals effectively.
As artificial intelligence continues to evolve, frameworks like smolagents pave the way for more sophisticated and capable agents. By understanding the strengths and nuances of each agent type, developers can harness their full potential to solve real-world problems and drive innovation forward.