Is your RAG system struggling with the rich visual information locked in images and tables within your documents? Traditional text-only approaches often lose crucial context. This article dives into the world of multimodal RAG, showing you how to build systems that can directly understand and retrieve information from images. Whether you prefer leveraging powerful APIs like Cohere's Embed-v4 or running models locally with tools like ColPali, you'll learn practical workflows to unlock deeper insights from your data.
The Blind Spot of Traditional RAG
Most Retrieval Augmented Generation (RAG) systems today operate solely on text. They typically chunk documents, create text embeddings, and retrieve relevant text snippets to augment a large language model's (LLM) response. But what happens when crucial information is presented visually – in charts, diagrams, photos, or tables?
The common workaround involves extracting these visual elements, using a vision-language model (VLM) to generate text descriptions (captions), and then embedding these descriptions. While functional, this approach has significant drawbacks:
- Loss of Context: A text caption is often a pale imitation of the original image or table, losing subtle details, spatial relationships, and information density.
- Dependency on Caption Quality: The accuracy of the retrieved information hinges entirely on the quality of the VLM-generated caption and the prompt used to create it.
Problem: Converting images and tables to text for RAG can lead to information loss and reliance on potentially imperfect descriptions.
Embracing Direct Vision Processing
What if we could bypass the text conversion step and index the visual information directly? Approaches like ColPali have explored this by encoding image patches with vision encoders. However, these methods produce multi-vector (per-patch) embeddings that demand substantial memory and aren't compatible with most standard vector databases.
Enter a new generation of multimodal embedding models.
Cohere's Embed-v4: State-of-the-Art Multimodal Search
Recently, Cohere released Embed-v4, a powerful multimodal embedding model designed for enhanced search capabilities across text and images. Their benchmarks demonstrate state-of-the-art performance on vision-based retrieval tasks, surpassing previous methods.
A key advantage is that Embed-v4 generates fixed-size embeddings, making them directly compatible with popular vector stores.
Workflow: Vision RAG with Embed-v4
Here’s a typical workflow using Cohere's Embed-v4 for vision-based RAG:
- Image Embedding: Process each page/image in your documents with the Embed-v4 model (input_type='image') to generate image embeddings.
- Vector Storage: Store these image embeddings in your chosen vector database.
- Query Embedding: When a user submits a query, embed it with the same Embed-v4 model (input_type='search_query' for text queries; image queries are embedded the same way as image documents).
- Retrieval: Perform a similarity search in the vector store to find the image embeddings closest to the query embedding.
- Generation: Pass the original user query and the retrieved image(s) to a capable multimodal LLM (such as Google Gemini).
- Answer Synthesis: The multimodal LLM analyzes both the query and the visual context of the image(s) to generate the final answer.
(This implementation overview is inspired by a post from Nils Reimers, VP of AI Search at Cohere.)
Cost and Performance: Embedding Quantization
Multimodal embeddings can be large. Cohere's Embed-v4 offers flexibility with dimensions (e.g., 1024). To manage storage and compute costs, you can apply embedding quantization.
Similar to quantizing LLM weights, you can reduce the precision of embedding vectors (e.g., from 32-bit floats to 8-bit or 4-bit integers). Studies show that quantization can significantly reduce costs while preserving much of the retrieval performance.
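As an illustration of the idea (not Cohere's own quantization pipeline), here is a minimal int8 scalar-quantization sketch over a matrix of float embeddings; quantize_int8 is a hypothetical helper:

```python
import numpy as np

def quantize_int8(embeddings: np.ndarray):
    """Map each embedding dimension from its float range onto int8 values in [-128, 127]."""
    mins = embeddings.min(axis=0)
    scale = np.maximum((embeddings.max(axis=0) - mins) / 255.0, 1e-8)
    quantized = np.round((embeddings - mins) / scale - 128).astype(np.int8)
    # Keep mins/scale so query embeddings can be quantized with the same calibration
    return quantized, mins, scale
```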
Tip: Explore embedding quantization techniques to optimize cost and speed. See this related video on embedding quantization for more details.
Implementation Guide 1: API-Based (Cohere Embed-v4 & Gemini)
This approach uses Cohere's API for embedding/retrieval and Google's Gemini API for generation.
Steps:
- Install Libraries: `pip install cohere google-generativeai python-dotenv Pillow`
- Get API Keys:
- Cohere API Key
- Google AI Studio API Key (for Gemini)
- Initialize Clients: Set up the Cohere and Gemini clients with your API keys.
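A minimal sketch of this setup, assuming the API keys live in environment variables or a local .env file loaded with python-dotenv (included in the install list above):

```python
import os

import cohere
import google.generativeai as genai
from dotenv import load_dotenv

load_dotenv()  # pulls COHERE_API_KEY and GOOGLE_API_KEY from a local .env file, if present

co = cohere.Client(os.environ["COHERE_API_KEY"])
genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
```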
- Prepare Images:
- Convert documents (like PDFs) into images (one image per page/relevant section); see the pdf2image sketch below.
- Ensure images are in a suitable format (e.g., PNG, JPG) and consider resizing if necessary (though excessive downscaling can hurt quality).
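If your source documents are PDFs, a library such as pdf2image (which relies on poppler) can rasterize each page into an image; a minimal sketch with placeholder paths:

```python
from pdf2image import convert_from_path  # requires poppler (see the note in Guide 2)

# Convert each PDF page to a PIL image and save it alongside your other images.
# 'report.pdf' and the output folder are placeholder paths.
pages = convert_from_path("report.pdf", dpi=200)
for i, page in enumerate(pages):
    page.save(f"path/to/your/images/report_page_{i + 1}.png", "PNG")
```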
- Load Data: Gather your image files (e.g., infographics, scanned tables).
- Generate & Store Image Embeddings:

```python
import base64
import io
import os

import cohere
import numpy as np
from PIL import Image

# Initialize the Cohere client (ensure COHERE_API_KEY is set as an env var)
co = cohere.Client(os.environ.get("COHERE_API_KEY"))

image_folder = 'path/to/your/images'
image_files = [
    os.path.join(image_folder, f)
    for f in os.listdir(image_folder)
    if f.lower().endswith(('.png', '.jpg', '.jpeg'))
]

def to_data_uri(img_path):
    """Encode an image file as a base64 data URI, the format the Cohere embed API expects for images."""
    img = Image.open(img_path)
    buffer = io.BytesIO()
    img.save(buffer, format="PNG")
    encoded = base64.b64encode(buffer.getvalue()).decode("utf-8")
    return f"data:image/png;base64,{encoded}"

image_embeddings = []
for img_path in image_files:
    response = co.embed(
        images=[to_data_uri(img_path)],
        model='embed-multilingual-v3.0',  # or the specific Embed-v4 model name
        input_type='image',
    )
    image_embeddings.append(response.embeddings[0])

# Store embeddings (e.g., in a NumPy array or a vector DB)
image_embeddings_np = np.array(image_embeddings)
# In production, use a proper vector store (ChromaDB, Pinecone, etc.)
```
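As the comment above notes, a demo can get by with a NumPy array, but a real deployment usually wants a vector database. A minimal sketch using ChromaDB with the precomputed image_embeddings_np and image_files from the snippet above (the collection name and storage path are placeholders):

```python
import chromadb

# Persistent local ChromaDB instance; "vision_rag" is a placeholder collection name
client = chromadb.PersistentClient(path="path/to/chroma_db")
collection = client.get_or_create_collection(name="vision_rag")

# Store the precomputed image embeddings alongside their file paths
collection.add(
    ids=[str(i) for i in range(len(image_files))],
    embeddings=image_embeddings_np.tolist(),
    metadatas=[{"path": p} for p in image_files],
)

# Later, query with an embedding produced by the same model:
# results = collection.query(query_embeddings=[query_embedding.tolist()], n_results=3)
```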
- Implement Search/Retrieval:

```python
def search_images(query_text, embeddings_np, image_filenames):
    """Embed the text query and return the best-matching image by dot-product similarity."""
    response = co.embed(
        texts=[query_text],
        model='embed-multilingual-v3.0',  # use the same model as for the image embeddings
        input_type='search_query',
    )
    query_embedding = np.array(response.embeddings[0])

    # Dot product works as a similarity score for normalized embeddings
    similarities = np.dot(embeddings_np, query_embedding)

    # Return the single best match
    top_index = int(np.argmax(similarities))
    return image_filenames[top_index], similarities[top_index]
```
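The dot-product scoring above assumes the embeddings are normalized. If you are unsure whether they are, cosine similarity is a safer default; a minimal sketch (cosine_similarities is a hypothetical helper):

```python
def cosine_similarities(embeddings_np, query_embedding):
    # Cosine similarity without assuming pre-normalized vectors
    doc_norms = np.linalg.norm(embeddings_np, axis=1)
    query_norm = np.linalg.norm(query_embedding)
    return embeddings_np @ query_embedding / (doc_norms * query_norm)
```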
- Optional: Re-ranking: Retrieve the top N images and use a re-ranker (potentially vision-based) to improve the final selection; a top-N retrieval sketch follows.
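Here is a minimal sketch of that top-N step, reusing the co client and NumPy arrays from the snippets above; top_n_images and the choice of N are illustrative:

```python
def top_n_images(query_text, embeddings_np, image_filenames, n=5):
    # Embed the query and return the n best-scoring images for a downstream re-ranker
    response = co.embed(
        texts=[query_text],
        model='embed-multilingual-v3.0',  # use the same model as for the image embeddings
        input_type='search_query',
    )
    query_embedding = np.array(response.embeddings[0])
    similarities = np.dot(embeddings_np, query_embedding)
    top_indices = np.argsort(-similarities)[:n]
    return [(image_filenames[i], float(similarities[i])) for i in top_indices]
```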
- Generate Answer with Gemini:

```python
import os

import google.generativeai as genai
from PIL import Image

# Configure the Gemini client (ensure GOOGLE_API_KEY is set)
genai.configure(api_key=os.environ.get("GOOGLE_API_KEY"))
generation_model = genai.GenerativeModel('gemini-1.5-flash')  # or another multimodal model

user_query = "What is the net profit of Nike?"
retrieved_image_path, score = search_images(user_query, image_embeddings_np, image_files)
print(f"Retrieved: {retrieved_image_path} with score {score:.4f}")

# Pass the retrieved image and the original query to Gemini
retrieved_image = Image.open(retrieved_image_path)
response = generation_model.generate_content([user_query, retrieved_image])

print("\nAnswer:")
print(response.text)
```
Example Results (API Method):
- Query: "What is the net profit of Nike?"
- Retrieval: Correctly identifies the Nike infographic.
- Generation: Gemini analyzes the image and extracts the answer: "Based on the Nike Q3 Financial Year 2025 income statement visualization provided, the net profit for Nike ending in February 2025 is $1.8 billion."
- Query: "What would be the net profit of Tesla without interest?"
- Retrieval: Retrieves the Tesla infographic.
- Generation: Gemini identifies profit and interest figures from the image, performs the calculation, and provides the result, showcasing reasoning over visual data.
Implementation Guide 2: Local Model (ColPali & Gemini)
This approach keeps the embedding and retrieval local using a model like ColPali, reducing reliance on external APIs for that part. Generation still uses Gemini here, but could be swapped for a local VLM.
Steps:
- Install Libraries: `pip install byaldi transformers torch` (byaldi is used here as an example library that wraps ColPali-style indexing and search; colpali-engine is another option). You may also need poppler-utils for PDF-to-image conversion: `sudo apt-get install poppler-utils` on Debian/Ubuntu.
- Hugging Face Token (Optional): May be needed to download models from the Hugging Face Hub.
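If the checkpoint you choose is gated or private, you can authenticate with the Hugging Face Hub first; a minimal sketch assuming the token is stored in an HF_TOKEN environment variable:

```python
import os

from huggingface_hub import login

# Optional: authenticate so gated/private ColPali checkpoints can be downloaded
login(token=os.environ.get("HF_TOKEN"))
```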
- Load ColPali Model & Index:

```python
from byaldi import RAGMultiModalModel  # byaldi used as an example ColPali wrapper; adapt to your library

# Load a publicly available ColPali checkpoint from the Hugging Face Hub
retriever = RAGMultiModalModel.from_pretrained("vidore/colpali-v1.2")

# Index a folder of images (or PDFs) locally; the index is persisted so it can be reloaded later
retriever.index(
    input_path="path/to/your/images",
    index_name="vision_rag_index",
    store_collection_with_index=False,
    overwrite=True,
)

# In subsequent runs, load the existing index instead of re-indexing:
# retriever = RAGMultiModalModel.from_index("vision_rag_index")
```
- Perform Search:

```python
def search_local(query_text, retriever, k=1):
    # With byaldi, results are objects carrying fields such as doc_id, page_num, and score;
    # the exact format depends on the library and version you use.
    return retriever.search(query_text, k=k)
```
- Generation (Similar to API method):
- Retrieve the top image path(s) from the local search results.
- Load the image.
- Pass the image and original query to your chosen multimodal LLM (Gemini API or a local VLM).
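Below is a hedged end-to-end sketch tying local retrieval to generation. It assumes byaldi-style result objects (with doc_id and score) and a flat image folder indexed in sorted order; the doc_id-to-filename mapping is purely illustrative and depends on how you indexed your data:

```python
import os

import google.generativeai as genai
from PIL import Image

# Reuse the Gemini setup from Guide 1
genai.configure(api_key=os.environ.get("GOOGLE_API_KEY"))
generation_model = genai.GenerativeModel('gemini-1.5-flash')

user_query = "What is the net profit of Nike?"
top_result = search_local(user_query, retriever, k=1)[0]

# Map the result back to a file. byaldi assigns doc_ids during indexing, so this
# sorted-folder lookup is only a placeholder for your own bookkeeping.
image_folder = 'path/to/your/images'
image_files = sorted(os.listdir(image_folder))
retrieved_image = Image.open(os.path.join(image_folder, image_files[top_result.doc_id]))

response = generation_model.generate_content([user_query, retrieved_image])
print(response.text)
```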
Example Results (Local Method):
Using the same queries (Nike profit, Google acquisitions, Tesla profit without interest), the local ColPali retriever successfully identifies and ranks the correct infographic highest. This retrieved image is then passed to the generation model (Gemini in this example) for the final answer synthesis, yielding similar results to the API method.
Conclusion: Seeing is Believing for RAG
Vision-based RAG, using direct image indexing and retrieval, significantly enhances the ability to extract information from visually rich documents compared to text-captioning workarounds. Whether you opt for the power and simplicity of APIs like Cohere Embed-v4 or prioritize local control with models like ColPali, multimodal capabilities represent a major step forward.
As enterprise search and information retrieval continue to evolve, integrating visual understanding directly into RAG pipelines will be crucial for unlocking the full value of diverse data sources.
Resources & Further Reading
- Cohere Embed-v4: Blog Post
- Example Colab Notebook (API Method): Google Colab
- Nils Reimers' Post (Inspiration): X.com
- Related Video (Embedding Quantization): YouTube
- Related Video (Contextual Retrieval): YouTube