Is your RAG system struggling with the rich visual information locked in images and tables within your documents? Traditional text-only approaches often lose crucial context. This article dives into the world of multimodal RAG, showing you how to build systems that can directly understand and retrieve information from images. Whether you prefer leveraging powerful APIs like Cohere's Embed-v4 or running models locally with tools like ColPali, you'll learn practical workflows to unlock deeper insights from your data.
The Blind Spot of Traditional RAG
Most Retrieval Augmented Generation (RAG) systems today operate solely on text. They typically chunk documents, create text embeddings, and retrieve relevant text snippets to augment a large language model's (LLM) response. But what happens when crucial information is presented visually – in charts, diagrams, photos, or tables?
The common workaround involves extracting these visual elements, using a vision-language model (VLM) to generate text descriptions (captions), and then embedding these descriptions. While functional, this approach has significant drawbacks:
- Loss of Context: A text caption is often a pale imitation of the original image or table, losing subtle details, spatial relationships, and information density.
- Dependency on Caption Quality: The accuracy of the retrieved information hinges entirely on the quality of the VLM-generated caption and the prompt used to create it.
Problem: Converting images and tables to text for RAG can lead to information loss and reliance on potentially imperfect descriptions.
Embracing Direct Vision Processing
What if we could bypass the text conversion step and index the visual information directly? Approaches like ColPali have explored this by encoding image patches with vision encoders. However, these methods produce multi-vector (per-patch) embeddings that demand substantial memory and aren't compatible with most standard vector databases.
Enter a new generation of multimodal embedding models.
Cohere's Embed-v4: State-of-the-Art Multimodal Search
Recently, Cohere released Embed-v4, a powerful multimodal embedding model designed for enhanced search capabilities across text and images. Their benchmarks demonstrate state-of-the-art performance on vision-based retrieval tasks, surpassing previous methods.
A key advantage is that Embed-v4 generates fixed-size embeddings, making them directly compatible with popular vector stores.
Workflow: Vision RAG with Embed-v4
Here’s a typical workflow using Cohere's Embed-v4 for vision-based RAG:
- Image Embedding: Process each page/image in your documents with the Embed-v4 model (input_type='image') to generate image embeddings.
- Vector Storage: Store these image embeddings in your chosen vector database.
- Query Embedding: When a user submits a query, embed it with the same Embed-v4 model (input_type='search_query' for text queries; image queries are embedded the same way as image documents).
- Retrieval: Perform a similarity search in the vector store to find the image embeddings closest to the query embedding.
- Generation: Pass the original user query and the retrieved image(s) to a capable multimodal LLM (such as Google Gemini).
- Answer Synthesis: The multimodal LLM analyzes both the query and the visual context of the image(s) to generate the final answer.
(This implementation overview is inspired by a post from Nils Reimers, VP of AI Search at Cohere.)
Cost and Performance: Embedding Quantization
Multimodal embeddings can be large. Cohere's Embed-v4 offers flexibility with dimensions (e.g., 1024). To manage storage and compute costs, you can apply embedding quantization.
Similar to quantizing LLM weights, you can reduce the precision of embedding vectors (e.g., from 32-bit floats to 8-bit or 4-bit integers). Studies show that quantization can significantly reduce costs while preserving much of the retrieval performance.
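As an illustration of the idea (not Cohere's own quantization pipeline), here is a minimal int8 scalar-quantization sketch over a matrix of float embeddings; quantize_int8 is a hypothetical helper:

```python
import numpy as np

def quantize_int8(embeddings: np.ndarray):
    """Map each embedding dimension from its float range onto int8 values in [-128, 127]."""
    mins = embeddings.min(axis=0)
    scale = np.maximum((embeddings.max(axis=0) - mins) / 255.0, 1e-8)
    quantized = np.round((embeddings - mins) / scale - 128).astype(np.int8)
    # Keep mins/scale so query embeddings can be quantized with the same calibration
    return quantized, mins, scale
```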
Tip: Explore embedding quantization techniques to optimize cost and speed. See this related video on embedding quantization for more details.
Implementation Guide 1: API-Based (Cohere Embed-v4 & Gemini)
This approach uses Cohere's API for embedding/retrieval and Google's Gemini API for generation.
Steps:
- Install Libraries: `pip install cohere google-generativeai python-dotenv Pillow`
- Get API Keys:
- Cohere API Key
- Google AI Studio API Key (for Gemini)
- Initialize Clients: Set up the Cohere and Gemini clients with your API keys.
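A minimal sketch of this setup, assuming the API keys live in environment variables or a local .env file loaded with python-dotenv (included in the install list above):

```python
import os

import cohere
import google.generativeai as genai
from dotenv import load_dotenv

load_dotenv()  # pulls COHERE_API_KEY and GOOGLE_API_KEY from a local .env file, if present

co = cohere.Client(os.environ["COHERE_API_KEY"])
genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
```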
- Prepare Images:
- Convert documents (like PDFs) into images (one image per page/relevant section); see the pdf2image sketch below.
- Ensure images are in a suitable format (e.g., PNG, JPG) and consider resizing if necessary (though excessive downscaling can hurt quality).
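If your source documents are PDFs, a library such as pdf2image (which relies on poppler) can rasterize each page into an image; a minimal sketch with placeholder paths:

```python
from pdf2image import convert_from_path  # requires poppler (see the note in Guide 2)

# Convert each PDF page to a PIL image and save it alongside your other images.
# 'report.pdf' and the output folder are placeholder paths.
pages = convert_from_path("report.pdf", dpi=200)
for i, page in enumerate(pages):
    page.save(f"path/to/your/images/report_page_{i + 1}.png", "PNG")
```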
- Load Data: Gather your image files (e.g., infographics, scanned tables).
- Generate & Store Image Embeddings:

```python
import base64
import io
import os

import cohere
import numpy as np
from PIL import Image

# Initialize the Cohere client (ensure COHERE_API_KEY is set as an env var)
co = cohere.Client(os.environ.get("COHERE_API_KEY"))

image_folder = 'path/to/your/images'
image_files = [
    os.path.join(image_folder, f)
    for f in os.listdir(image_folder)
    if f.lower().endswith(('.png', '.jpg', '.jpeg'))
]

def to_data_uri(img_path):
    """Encode an image file as a base64 data URI, the format the Cohere embed API expects for images."""
    img = Image.open(img_path)
    buffer = io.BytesIO()
    img.save(buffer, format="PNG")
    encoded = base64.b64encode(buffer.getvalue()).decode("utf-8")
    return f"data:image/png;base64,{encoded}"

image_embeddings = []
for img_path in image_files:
    response = co.embed(
        images=[to_data_uri(img_path)],
        model='embed-multilingual-v3.0',  # or the specific Embed-v4 model name
        input_type='image',
    )
    image_embeddings.append(response.embeddings[0])

# Store embeddings (e.g., in a NumPy array or a vector DB)
image_embeddings_np = np.array(image_embeddings)
# In production, use a proper vector store (ChromaDB, Pinecone, etc.)
```
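As the comment above notes, a demo can get by with a NumPy array, but a real deployment usually wants a vector database. A minimal sketch using ChromaDB with the precomputed image_embeddings_np and image_files from the snippet above (the collection name and storage path are placeholders):

```python
import chromadb

# Persistent local ChromaDB instance; "vision_rag" is a placeholder collection name
client = chromadb.PersistentClient(path="path/to/chroma_db")
collection = client.get_or_create_collection(name="vision_rag")

# Store the precomputed image embeddings alongside their file paths
collection.add(
    ids=[str(i) for i in range(len(image_files))],
    embeddings=image_embeddings_np.tolist(),
    metadatas=[{"path": p} for p in image_files],
)

# Later, query with an embedding produced by the same model:
# results = collection.query(query_embeddings=[query_embedding.tolist()], n_results=3)
```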
- Implement Search/Retrieval:

```python
def search_images(query_text, embeddings_np, image_filenames):
    """Embed the text query and return the best-matching image by dot-product similarity."""
    response = co.embed(
        texts=[query_text],
        model='embed-multilingual-v3.0',  # use the same model as for the image embeddings
        input_type='search_query',
    )
    query_embedding = np.array(response.embeddings[0])

    # Dot product works as a similarity score for normalized embeddings
    similarities = np.dot(embeddings_np, query_embedding)

    # Return the single best match
    top_index = int(np.argmax(similarities))
    return image_filenames[top_index], similarities[top_index]
```
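The dot-product scoring above assumes the embeddings are normalized. If you are unsure whether they are, cosine similarity is a safer default; a minimal sketch (cosine_similarities is a hypothetical helper):

```python
def cosine_similarities(embeddings_np, query_embedding):
    # Cosine similarity without assuming pre-normalized vectors
    doc_norms = np.linalg.norm(embeddings_np, axis=1)
    query_norm = np.linalg.norm(query_embedding)
    return embeddings_np @ query_embedding / (doc_norms * query_norm)
```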
- Optional: Re-ranking: Retrieve the top N images and use a re-ranker (potentially vision-based) to improve the final selection; a top-N retrieval sketch follows.
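Here is a minimal sketch of that top-N step, reusing the co client and NumPy arrays from the snippets above; top_n_images and the choice of N are illustrative:

```python
def top_n_images(query_text, embeddings_np, image_filenames, n=5):
    # Embed the query and return the n best-scoring images for a downstream re-ranker
    response = co.embed(
        texts=[query_text],
        model='embed-multilingual-v3.0',  # use the same model as for the image embeddings
        input_type='search_query',
    )
    query_embedding = np.array(response.embeddings[0])
    similarities = np.dot(embeddings_np, query_embedding)
    top_indices = np.argsort(-similarities)[:n]
    return [(image_filenames[i], float(similarities[i])) for i in top_indices]
```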
- Generate Answer with Gemini:

```python
import os

import google.generativeai as genai
from PIL import Image

# Configure the Gemini client (ensure GOOGLE_API_KEY is set)
genai.configure(api_key=os.environ.get("GOOGLE_API_KEY"))
generation_model = genai.GenerativeModel('gemini-1.5-flash')  # or another multimodal model

user_query = "What is the net profit of Nike?"
retrieved_image_path, score = search_images(user_query, image_embeddings_np, image_files)
print(f"Retrieved: {retrieved_image_path} with score {score:.4f}")

# Pass the retrieved image and the original query to Gemini
retrieved_image = Image.open(retrieved_image_path)
response = generation_model.generate_content([user_query, retrieved_image])

print("\nAnswer:")
print(response.text)
```
Example Results (API Method):
- Query: "What is the net profit of Nike?"
- Retrieval: Correctly identifies the Nike infographic.
- Generation: Gemini analyzes the image and extracts the answer: "Based on the Nike Q3 Financial Year 2025 income statement visualization provided, the net profit for Nike ending in February 2025 is $1.8 billion."
- Query: "What would be the net profit of Tesla without interest?"
- Retrieval: Retrieves the Tesla infographic.
- Generation: Gemini identifies profit and interest figures from the image, performs the calculation, and provides the result, showcasing reasoning over visual data.
Implementation Guide 2: Local Model (ColPali & Gemini)
This approach keeps the embedding and retrieval local using a model like ColPali, reducing reliance on external APIs for that part. Generation still uses Gemini here, but could be swapped for a local VLM.
Steps:
- Install Libraries: `pip install byaldi transformers torch` (byaldi is used here as an example library that wraps ColPali-style indexing and search; colpali-engine is another option). You may also need poppler-utils for PDF-to-image conversion: `sudo apt-get install poppler-utils` on Debian/Ubuntu.
- Hugging Face Token (Optional): May be needed to download models from the Hugging Face Hub.
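If the checkpoint you choose is gated or private, you can authenticate with the Hugging Face Hub first; a minimal sketch assuming the token is stored in an HF_TOKEN environment variable:

```python
import os

from huggingface_hub import login

# Optional: authenticate so gated/private ColPali checkpoints can be downloaded
login(token=os.environ.get("HF_TOKEN"))
```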
- Load ColPali Model & Index:

```python
from byaldi import RAGMultiModalModel  # byaldi used as an example ColPali wrapper; adapt to your library

# Load a publicly available ColPali checkpoint from the Hugging Face Hub
retriever = RAGMultiModalModel.from_pretrained("vidore/colpali-v1.2")

# Index a folder of images (or PDFs) locally; the index is persisted so it can be reloaded later
retriever.index(
    input_path="path/to/your/images",
    index_name="vision_rag_index",
    store_collection_with_index=False,
    overwrite=True,
)

# In subsequent runs, load the existing index instead of re-indexing:
# retriever = RAGMultiModalModel.from_index("vision_rag_index")
```
- Perform Search:

```python
def search_local(query_text, retriever, k=1):
    # With byaldi, results are objects carrying fields such as doc_id, page_num, and score;
    # the exact format depends on the library and version you use.
    return retriever.search(query_text, k=k)
```
- Generation (Similar to API method):
- Retrieve the top image path(s) from the local search results.
- Load the image.
- Pass the image and original query to your chosen multimodal LLM (Gemini API or a local VLM).
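Below is a hedged end-to-end sketch tying local retrieval to generation. It assumes byaldi-style result objects (with doc_id and score) and a flat image folder indexed in sorted order; the doc_id-to-filename mapping is purely illustrative and depends on how you indexed your data:

```python
import os

import google.generativeai as genai
from PIL import Image

# Reuse the Gemini setup from Guide 1
genai.configure(api_key=os.environ.get("GOOGLE_API_KEY"))
generation_model = genai.GenerativeModel('gemini-1.5-flash')

user_query = "What is the net profit of Nike?"
top_result = search_local(user_query, retriever, k=1)[0]

# Map the result back to a file. byaldi assigns doc_ids during indexing, so this
# sorted-folder lookup is only a placeholder for your own bookkeeping.
image_folder = 'path/to/your/images'
image_files = sorted(os.listdir(image_folder))
retrieved_image = Image.open(os.path.join(image_folder, image_files[top_result.doc_id]))

response = generation_model.generate_content([user_query, retrieved_image])
print(response.text)
```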
Example Results (Local Method):
Using the same queries (Nike profit, Google acquisitions, Tesla profit without interest), the local ColPali retriever successfully identifies and ranks the correct infographic highest. This retrieved image is then passed to the generation model (Gemini in this example) for the final answer synthesis, yielding similar results to the API method.
Conclusion: Seeing is Believing for RAG
Vision-based RAG, using direct image indexing and retrieval, significantly enhances the ability to extract information from visually rich documents compared to text-captioning workarounds. Whether you opt for the power and simplicity of APIs like Cohere Embed-v4 or prioritize local control with models like ColPali, multimodal capabilities represent a major step forward.
As enterprise search and information retrieval continue to evolve, integrating visual understanding directly into RAG pipelines will be crucial for unlocking the full value of diverse data sources.
Resources & Further Reading
- Cohere Embed-v4: Blog Post
- Example Colab Notebook (API Method): Google Colab
- Nils Reimers' Post (Inspiration): X.com
- Related Video (Embedding Quantization): YouTube
- Related Video (Contextual Retrieval): YouTube