How to build a multimodal agentic RAG AI assistant: A guide
Learn to build a multimodal agentic RAG system with retrieval, autonomous decision-making, and voice interaction—plus hands-on implementation.
Apr 1, 2025 • 10 Minute Read

Recent advances in natural language processing have given rise to Retrieval-Augmented Generation (RAG) systems, which blend the power of large language models (LLMs) with real-time information retrieval. While traditional LLMs excel at generating text, they struggle to provide up-to-date or specialized answers when the information isn’t in their training data. RAG systems solve this by fetching relevant documents from an external knowledge base and generating responses that combine both retrieved content and the model’s internal knowledge.
But retrieval alone isn’t enough. AI systems are evolving to become more agentic, meaning they don’t just respond; they take action. An agentic system can decide when to search the web, fetch data, or use external tools to enhance its response. When combined with multimodality—the ability to process text, voice, and other input types—this transforms a simple Q&A system into a dynamic, interactive digital assistant.
By the end of this tutorial, you’ll build an AI assistant that doesn’t just answer your questions—it actively searches Google and retrieves relevant content to improve its responses. We’ll break down the core concepts of RAG, agentic behavior, and multimodal interaction before diving into hands-on implementation.
Understanding multimodal agentic RAG
How retrieval-augmented generation (RAG) works
RAG combines two main components:
Retrieval: This component searches an external database (or knowledge base) to fetch documents relevant to the query. In our implementation, we use FAISS (Facebook AI Similarity Search) to index and retrieve sample documents.
Generation: The language model then uses the retrieved context along with the user query to generate an answer. This fusion helps the model provide responses that are both context-aware and informed by up-to-date or specialized external content.
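Conceptually, the flow is simply "retrieve, then generate." The sketch below illustrates that loop with placeholder retrieve and generate callables (hypothetical names, not the final implementation); later sections wire these up with FAISS and the Zephyr model.

# Conceptual RAG flow; `retrieve` and `generate` are placeholders for the
# FAISS search and Zephyr call implemented later in this guide.
def answer_with_rag(query, retrieve, generate):
    context_docs = retrieve(query)                      # 1. fetch relevant documents
    prompt = f"Context: {' '.join(context_docs)}\n{query}"
    return generate(prompt)                             # 2. answer with that context attached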
What makes a system agentic?
An agentic system goes beyond simple query-answering. Instead of merely responding, the system decides on an action based on the query. For example, if you ask, “Search for flights from Buenos Aires to Madrid,” an agentic assistant might determine that it should perform a Google search and then fetch details from the result. This autonomy is achieved through frameworks like LangChain that enable the integration of custom tools (e.g., a GoogleSearch tool or a WebFetcher).
Why multimodality matters
Multimodality refers to the system’s ability to handle multiple forms of input and output. In our case, we integrate:
- Voice Input: Using a speech recognition library to convert spoken language into text.
- Text Output and TTS: The system generates text responses, which are then converted into speech using a text-to-speech engine. This creates a more natural interaction—similar to modern digital assistants like Siri or Alexa.
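As a small, self-contained illustration of the voice side (assuming a working microphone and the speech_recognition and pyttsx3 packages are installed), the snippet below captures one utterance and speaks it back. The full assistant builds on the same two libraries.

import speech_recognition as sr
import pyttsx3

recognizer = sr.Recognizer()
with sr.Microphone() as source:
    print("Say something...")
    audio = recognizer.listen(source, phrase_time_limit=10)

text = recognizer.recognize_google(audio)   # speech -> text via Google's API
print("Heard:", text)

engine = pyttsx3.init()                     # text -> speech, works offline
engine.say(f"You said: {text}")
engine.runAndWait()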
Multimodal RAG system architecture: key components explained
Our digital assistant consists of several interconnected components:
Voice Input: The system uses the speech_recognition library to capture and transcribe user queries.
Context Retrieval with FAISS: A small knowledge base is indexed with FAISS. When a query is received, the system computes its embedding (using a Sentence Transformer) and retrieves the most relevant document(s) to provide additional context.
Response Generation using the Zephyr Model: We use Hugging Face’s InferenceClient to interface with the Zephyr model. This model generates responses based on a combined prompt containing both the query and the retrieved context.
Agentic Behavior with LangChain: If a query implies that additional actions are required (e.g., “search” or “fetch website”), a LangChain agent autonomously decides which tool to invoke. Two tools are available:
GoogleSearch: Uses the googlesearch package to perform a web search and then fetch content from the first result.
WebFetcher: Retrieves content directly from a provided URL.
Text-to-Speech: The system uses pyttsx3 to vocalize responses, providing a fully interactive experience.
To integrate these components seamlessly, we also include a Dummy Pipeline. This adapter wraps our response-generation function so that it meets the expected interface for LangChain’s HuggingFacePipeline. This dummy pipeline ensures that the agent only produces a concise “Final Answer” rather than chain-of-thought details.
Essential components of a multimodal agentic RAG system
Hugging Face InferenceClient with Zephyr Model
What it does: We use the Zephyr model (a lighter-weight alternative to larger LLMs) for generating responses. The InferenceClient from Hugging Face allows us to interact with the Zephyr model hosted on Hugging Face's Inference API.
Why it works: By using a hosted model via the InferenceClient, we offload the heavy computations to the cloud. This makes it practical to integrate advanced language capabilities into our assistant without needing enormous local resources.
Speech Recognition for Voice Input
What it does: Using the speech_recognition Python library, our system captures audio from the microphone and converts spoken language into text.
Why it works: The library uses services like Google’s Speech API to transcribe speech with reasonable accuracy. We increase the phrase_time_limit to capture longer queries, ensuring that our assistant listens long enough for natural conversation.
Text-to-Speech (TTS)
What it does: We use the pyttsx3 library to convert the generated text responses back into audible speech.
Why it works: pyttsx3 works offline and is cross-platform, enabling our assistant to “speak” its answers without relying on external services.
FAISS for Context Retrieval
What it does: FAISS is used to create an index of sample documents (a miniature knowledge base). When a query is received, we compute its embedding using a Sentence Transformer model and then retrieve the most relevant document(s) to enrich the query context.
Why it works: Retrieving additional context improves the quality of generated responses by providing the language model with focused background information. FAISS is highly efficient at similarity search, making it a suitable choice for this task.
Agentic Behavior with LangChain
What it does: LangChain is used to wrap custom “tools” that the assistant can invoke if the query implies a need for external actions. We define two tools:
- WebFetcher: Retrieves a webpage’s content from a URL.
- GoogleSearch: Uses the googlesearch package to perform a search and then fetch the first result’s content.
Why it works: By using LangChain’s agent framework, the assistant can autonomously decide whether to answer a query directly or to take an action (like searching the web) if the query contains keywords (e.g., "search" or "fetch website"). We also include a dummy pipeline to integrate with LangChain without having a full LLM locally.
Dummy Pipeline for LangChain Integration
What it does: The dummy pipeline wraps our generate_response function so that LangChain’s HuggingFacePipeline interface can be satisfied. It also ensures that prompts are formatted properly and instructs the model to output only the final answer.
Why it works: This layer serves as an adapter between our direct generation code and the LangChain agent’s expectations. It prevents chain-of-thought outputs and forces the LLM to return a concise final answer.
How to implement a multimodal agentic RAG system
Below is the complete code with detailed comments and explanations. (Remember to replace <YOUR_HF_API_TOKEN> with your actual token.)
import os
import warnings
import torch
import speech_recognition as sr
import pyttsx3
import requests
from huggingface_hub import InferenceClient
import faiss
import numpy as np
from transformers import AutoTokenizer as EmbeddingTokenizer, AutoModel as EmbeddingModel
from googlesearch import search # Ensure you have installed googlesearch-python
# Disable tokenizers parallelism warning and other warnings.
os.environ["TOKENIZERS_PARALLELISM"] = "false"
warnings.filterwarnings("ignore")
# Set your HF_API_TOKEN (replace with your token)
os.environ["HF_API_TOKEN"] = "<YOUR_HF_API_TOKEN>"
#############################
# 1. Initialize InferenceClient with Zephyr Model
#############################
print("Initializing InferenceClient with Zephyr model...")
model_name = "HuggingFaceH4/zephyr-7b-beta"
client = InferenceClient(model=model_name, token=os.environ.get("HF_API_TOKEN"))
#############################
# 2. Voice Input (STT)
#############################
def get_voice_input():
    print("Initializing speech recognition...")
    recognizer = sr.Recognizer()
    with sr.Microphone() as source:
        print("Please speak your query (say 'exit' to quit)...")
        # Increase phrase_time_limit to capture longer queries.
        audio = recognizer.listen(source, phrase_time_limit=20)
    try:
        query = recognizer.recognize_google(audio)
        print("You said:", query)
        return query
    except Exception as e:
        print("Error during recognition:", e)
        return None
#############################
# 3. Generation using InferenceClient and Zephyr Model
#############################
def generate_response(query, context):
    print("Generating response using Zephyr model...")
    # Combine context and query into one message.
    history = [{"role": "user", "content": f"Context: {context}\n{query}"}]
    max_new_tokens = 150
    temperature = 0.7
    top_p = 0.9
    result = client.chat_completion(
        history,
        max_tokens=max_new_tokens,
        stream=False,  # Disable streaming.
        temperature=temperature,
        top_p=top_p,
    )
    answer = result["choices"][0]["message"]["content"]
    return answer
#############################
# 4. Text-to-Speech (TTS)
#############################
def speak_text(text):
    engine = pyttsx3.init()
    engine.say(text)
    engine.runAndWait()
#############################
# 5. Agentic Behavior Tools
#############################
# Tool 1: WebFetcher (requires a URL)
def fetch_website(url: str) -> str:
    try:
        response = requests.get(url)
        if response.status_code == 200:
            return response.text[:500]  # Return first 500 characters.
        else:
            return "Failed to fetch website content."
    except Exception as e:
        return f"An error occurred: {e}"
# Tool 2: GoogleSearch - performs a Google search and fetches the first result.
def google_search(query: str) -> str:
    try:
        results = list(search(query, num_results=1))
        if results:
            url = results[0]
            if not url.startswith("http"):
                return "Invalid URL from Google search: " + str(url)
            print("GoogleSearch found URL:", url)
            content = fetch_website(url)
            return content
        else:
            return "No search results found."
    except Exception as e:
        return f"Search error: {e}"
#############################
# 6. LangChain Agent Initialization with Dummy Pipeline
#############################
from langchain.agents import Tool, initialize_agent
from langchain_huggingface import HuggingFacePipeline
# Define a dummy pipeline class that wraps our generate_response function.
class DummyPipeline:
    def __init__(self, generate_response_fn):
        self.generate_response_fn = generate_response_fn
        # Create a dummy model with a "name_or_path" attribute.
        self.model = type("DummyModel", (), {"name_or_path": "dummy-model"})()
        # Set the task attribute required by HuggingFacePipeline.
        self.task = "text-generation"

    def __call__(self, prompt, **kwargs):
        # Ensure prompt is a string.
        if isinstance(prompt, list):
            prompt = " ".join(prompt)
        else:
            prompt = str(prompt)
        # Prepend instruction to output only the final answer.
        modified_prompt = "Final Answer: " + prompt
        result = self.generate_response_fn(modified_prompt, "")
        # Mimic standard pipeline output: a list with a dict containing "generated_text".
        return [{"generated_text": result}]
# Instantiate the dummy pipeline.
dummy_pipeline = DummyPipeline(generate_response)
# Wrap it in HuggingFacePipeline for LangChain.
llm = HuggingFacePipeline(pipeline=dummy_pipeline)
# Define our tools.
tools = [
    Tool(
        name="WebFetcher",
        func=fetch_website,
        description="Fetches the content of a website given its URL. Input should be a URL."
    ),
    Tool(
        name="GoogleSearch",
        func=google_search,
        description="Searches Google for a query and returns the content of the first result. Input should be a search query."
    )
]
# Initialize the agent.
agent = initialize_agent(tools, llm, agent="zero-shot-react-description", verbose=True)
def agentic_assistant(query):
    try:
        # Use invoke (with keyword-only arguments) to allow for error handling.
        result = agent.invoke(input=query, handle_parsing_errors=True)
        if not isinstance(result, str):
            result = str(result)
        return result
    except Exception as e:
        print("Agentic assistant encountered an error:", e)
        # Fallback: generate a plain response without agentic reasoning.
        fallback = generate_response(query, "")
        return fallback
#############################
# 7. FAISS Context Retrieval: Sample Data and Indexing
#############################
embedding_model_name = "sentence-transformers/all-MiniLM-L6-v2"
embedding_tokenizer = EmbeddingTokenizer.from_pretrained(embedding_model_name)
embedding_model = EmbeddingModel.from_pretrained(embedding_model_name)
def compute_embedding(text):
    inputs = embedding_tokenizer(text, return_tensors="pt", truncation=True, padding=True)
    with torch.no_grad():
        outputs = embedding_model(**inputs)
    # Mean pooling of token embeddings.
    embeddings = outputs.last_hidden_state.mean(dim=1)
    return embeddings[0].cpu().numpy()
sample_docs = [
    "Retrieval-Augmented Generation (RAG) combines retrieval techniques with generative models for context-aware responses.",
    "FAISS is an efficient library for similarity search developed by Facebook.",
    "Multimodal systems integrate various data types such as text, voice, and images to enable richer AI interactions."
]
doc_embeddings = np.array([compute_embedding(doc) for doc in sample_docs])
dimension = doc_embeddings.shape[1]
index = faiss.IndexFlatL2(dimension)
index.add(doc_embeddings)
def retrieve_context(query, k=1):
    query_embedding = compute_embedding(query)
    query_embedding = np.expand_dims(query_embedding, axis=0)
    distances, indices = index.search(query_embedding, k)
    context_docs = [sample_docs[i] for i in indices[0]]
    return " ".join(context_docs)
#############################
# 8. Main Interactive Digital Assistant Function
#############################
def digital_assistant():
    print("Starting interactive digital assistant. Say 'exit' to quit.")
    while True:
        query = get_voice_input()
        if query is None:
            continue
        if query.strip().lower() in ["exit", "quit"]:
            speak_text("Goodbye!")
            break
        # If the query suggests an agentic task, use the agent.
        if any(keyword in query.lower() for keyword in ["fetch website", "search", "reservation"]):
            print("Agentic task detected. Processing with agent...")
            answer = agentic_assistant(query)
        else:
            context = retrieve_context(query)
            answer = generate_response(query, context)
        print("Assistant:", answer)
        speak_text(answer)
#############################
# 9. Script Execution
#############################
if __name__ == "__main__":
    digital_assistant()
Detailed Explanations
Voice Input and Recognition
The get_voice_input function uses the speech_recognition library to capture audio from the microphone. We set the phrase_time_limit to 20 seconds to allow for longer queries. The recognized speech is then transcribed using Google’s Speech Recognition API. This component is critical because it transforms your natural spoken language into text that the system can process.
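If recognition is unreliable in a noisy room, one optional tweak (a sketch, not part of the listing above) is to calibrate for ambient noise before listening and to bound how long the recognizer waits for speech to start:

def get_voice_input_calibrated():
    recognizer = sr.Recognizer()
    with sr.Microphone() as source:
        # Sample ~1 second of background noise to set the energy threshold.
        recognizer.adjust_for_ambient_noise(source, duration=1)
        print("Please speak your query...")
        try:
            # timeout: seconds to wait for speech to start;
            # phrase_time_limit: maximum seconds of speech to record.
            audio = recognizer.listen(source, timeout=5, phrase_time_limit=20)
        except sr.WaitTimeoutError:
            print("No speech detected.")
            return None
    try:
        return recognizer.recognize_google(audio)
    except sr.UnknownValueError:
        print("Could not understand the audio.")
        return None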
Language Generation with Zephyr via InferenceClient
The generate_response function constructs a conversation history by combining any retrieved context (from FAISS) with the user’s query. It then calls the Hugging Face InferenceClient's chat_completion method using the Zephyr model. The parameters like max_tokens, temperature, and top_p control the length and creativity of the generated response. This integration enables our assistant to provide coherent, context-aware answers.
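If you want the assistant to start printing (or speaking) before the whole answer is ready, chat_completion can also stream tokens. The variant below is a sketch that reuses the same client as the listing above and prints partial text as it arrives instead of waiting for the full response.

def generate_response_streaming(query, context):
    history = [{"role": "user", "content": f"Context: {context}\n{query}"}]
    answer = ""
    # stream=True yields incremental chunks instead of one final object.
    for chunk in client.chat_completion(history, max_tokens=150, stream=True,
                                        temperature=0.7, top_p=0.9):
        delta = chunk.choices[0].delta.content or ""
        answer += delta
        print(delta, end="", flush=True)
    print()
    return answer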
Text-to-Speech
The speak_text function uses pyttsx3 to convert the generated text back into speech. This allows the assistant to respond audibly, providing a more natural, interactive experience.
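pyttsx3 also exposes a few properties you can tune if the default voice is too fast or too quiet; the snippet below is an optional variation on speak_text, with illustrative default values.

def speak_text_tuned(text, rate=160, volume=0.9):
    engine = pyttsx3.init()
    engine.setProperty("rate", rate)      # speaking rate in words per minute
    engine.setProperty("volume", volume)  # 0.0 to 1.0
    engine.say(text)
    engine.runAndWait()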
FAISS-based Context Retrieval
The system includes a small knowledge base represented by sample_docs. Each document is converted into a vector embedding using a Sentence Transformer (via the Hugging Face sentence-transformers/all-MiniLM-L6-v2 model). FAISS then indexes these embeddings to allow fast similarity searches. When a query is received, the system retrieves the document(s) most relevant to the query. Adding this context can help the LLM generate more precise and informed answers.
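The index in the listing is intentionally tiny, but the same pattern scales: you can append new documents at any time and retrieve more than one neighbor per query. A small sketch, reusing compute_embedding, sample_docs, and index from the listing above (the added document and query are just examples):

def add_document(doc: str):
    # Embed the new document and append it to both the corpus and the index.
    embedding = np.expand_dims(compute_embedding(doc), axis=0)
    sample_docs.append(doc)
    index.add(embedding)

def retrieve_top_k(query: str, k: int = 2):
    query_embedding = np.expand_dims(compute_embedding(query), axis=0)
    distances, indices = index.search(query_embedding, k)
    # Return (document, distance) pairs, closest first.
    return [(sample_docs[i], float(d)) for i, d in zip(indices[0], distances[0])]

add_document("LangChain provides agents that can call external tools autonomously.")
print(retrieve_top_k("Which library helps build tool-using agents?"))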
Agentic Behavior with LangChain
Our assistant uses LangChain to add “agentic” capabilities—allowing it to autonomously decide whether to invoke external tools. Two tools are defined:
- WebFetcher: Fetches a webpage’s content when provided with a valid URL.
- GoogleSearch: Uses the googlesearch library to perform a Google search and then fetches content from the first search result.
LangChain’s agent is initialized with these tools and a dummy pipeline that wraps our generate_response function. The dummy pipeline ensures that the agent has a compatible interface with the LangChain system, and we instruct it to output a final answer by prepending "Final Answer:" to the prompt. If the agent’s output cannot be parsed (for example, due to chain-of-thought details), we have a try/except block that falls back to plain generation using the Zephyr model.
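To see the routing in action, a call like the one below (the query text is only an example) sends the request through the agent, which can pick the GoogleSearch tool, fetch the first result, and return a short answer string, falling back to plain generation if parsing fails:

# Illustrative query; any prompt containing "search" takes the agent path in our main loop.
print(agentic_assistant("Search for the latest news about retrieval-augmented generation"))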
This kind of autonomous decision-making is especially valuable in enterprise applications, where AI assistants need to handle complex workflows, retrieve relevant business data, and automate tasks without human intervention.
Dummy Pipeline for LangChain Integration
The DummyPipeline class acts as an adapter between our generate_response function and the LangChain HuggingFacePipeline interface. It makes sure that the prompt is converted into a string and returns a list of dictionaries with the key "generated_text", which is the expected format. This layer is crucial for ensuring that our agent can invoke our generation function correctly.
Interactive Loop
The digital_assistant function runs an infinite loop that continuously listens for voice queries. It checks if the query includes keywords like "search", "fetch website", or "reservation" to decide whether to use the agent (which can autonomously perform actions) or simply generate a response using retrieved context. The loop only terminates when the user says "exit" or "quit".
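When testing without a microphone (for example, in a notebook), a convenient variation is a text-only loop that keeps the same routing logic but reads from the keyboard and skips TTS. A sketch, reusing the functions defined above:

def text_assistant():
    print("Text-only assistant. Type 'exit' to quit.")
    while True:
        query = input("You: ").strip()
        if query.lower() in ["exit", "quit"]:
            break
        if any(keyword in query.lower() for keyword in ["fetch website", "search", "reservation"]):
            answer = agentic_assistant(query)
        else:
            answer = generate_response(query, retrieve_context(query))
        print("Assistant:", answer)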
Final thoughts on building a multimodal agentic RAG assistant
In this tutorial, we explored the foundations of RAG, agentic behavior, and multimodality, showing how combining retrieval with a generative LLM—enhanced by autonomous decision-making—creates a powerful, interactive AI assistant. Now, you have a system that takes voice commands, retrieves relevant context with FAISS, generates responses using the Zephyr model, and even performs web searches when needed.
Each component plays a vital role:
- Voice Input & TTS make interactions more natural.
- FAISS-based Context Retrieval enhances responses with relevant information.
- The Zephyr Model generates well-informed answers.
- LangChain’s Agentic Behavior enables autonomous decision-making.
With this guide and hands-on implementation, you're ready to build, refine, and expand your multimodal agentic RAG system.
Happy coding—your interactive AI journey starts now!
If you're interested in expanding your knowledge on RAG systems, check out my other related guides:
- Securing your RAG application: A comprehensive guide – Learn how to protect your RAG system against threats with best practices and example code.
- What is RAG: Definition, use cases, and how to implement it – Get a foundational understanding of RAG and how to build a working implementation.
- How to implement contextual retrieval for AI applications – Discover how contextual retrieval improves response accuracy and how to implement it effectively.