- Create and activate a Python virtual environment (recommended)
- Install requirements: pip install -r requirements.txt
- Place your PDF document in the project root directory
- Run the PDF processor:
  python pdf_processor.py
This will:
- Extract text and split into chunks (saved in /data)
- Extract images (saved in /pic_data)
- Run the embeddings processor:
  python embeddings_processor.py
This will:
- Create embeddings for text chunks
- Store embeddings in ChromaDB
- Run the terminal interface:
  python runner.py
You can:
- Type your questions about the document
- Type 'help' to see available commands
- Type 'exit' to quit
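The text-splitting step above (chunks saved in /data) can be sketched roughly as follows. The chunk size, overlap, and function name are illustrative assumptions, not the project's actual settings:

```python
def split_into_chunks(text, chunk_size=1000, overlap=200):
    """Split text into overlapping chunks (illustrative defaults only;
    the real pdf_processor.py may use different values)."""
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
    return chunks

# 2500 characters with size 1000 / overlap 200 yields 4 chunks.
chunks = split_into_chunks("A" * 2500, chunk_size=1000, overlap=200)
```

Overlap matters because a fact split across a chunk boundary would otherwise never appear whole in any single retrieved chunk.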
A Retrieval-Augmented Generation (RAG) system that can process and query both text and images from corporate documents.
This project implements a multimodal RAG system that:
- Extracts and processes text and images from PDF documents
- Embeds text and images into vector space using Google's embedding models
- Stores embeddings in ChromaDB vector database
- Retrieves relevant content based on user queries
- Generates coherent responses using Mistral AI
```
User Query → Multimodal Retriever → RAG Generator → User Response
                       ↑
                   ChromaDB ← Embeddings Processor
                                        ↑
                              Text Files & Images ← PDF Processor
                                                          ↑
                                                  Source Documents
```
- pdf_processor.py: Extracts text content from PDFs and splits into chunks
- pic_record.py: Extracts images from PDF files
- embeddings_processor.py: Creates embeddings for text using Google's embedding models
- multimodal_retriever.py: Handles multimodal queries across text and image collections
- rag_generator.py: Takes retrieval results and generates responses using Mistral AI
- retriever.py: Simple text-only retrieval functions (legacy)
- gradio_interface.py: Web interface using Gradio for interacting with the system
- evaluation.py: Tools to evaluate system performance
- optimization.py: Parameter optimization to fine-tune the system
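At its core, the retrieval step ranks stored embeddings by similarity to the query embedding; ChromaDB does this internally (with indexing) when multimodal_retriever.py queries it. A bare-bones sketch of the idea, with hypothetical toy vectors:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec, store, k=5):
    """Return the ids of the k stored vectors most similar to the query.
    ChromaDB performs the equivalent lookup, only with proper indexing."""
    scored = sorted(store, key=lambda item: cosine_similarity(query_vec, item[1]),
                    reverse=True)
    return [doc_id for doc_id, _ in scored[:k]]

# Toy example: three fake 3-d embeddings.
store = [("chunk_a", [1.0, 0.0, 0.0]),
         ("chunk_b", [0.0, 1.0, 0.0]),
         ("chunk_c", [0.9, 0.1, 0.0])]
print(top_k([1.0, 0.0, 0.0], store, k=2))  # → ['chunk_a', 'chunk_c']
```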
- Python 3.8+
- Required packages (see requirements.txt)
- API keys:
- Google AI API key (for embeddings)
- Mistral AI API key (for LLM)
- Clone the repository
- Install dependencies:
  pip install -r requirements.txt
- Create a .env file with your API keys:
  GOOGLE_API_KEY=your_google_ai_key
  MISTRAL_API_KEY=your_mistral_ai_key
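Projects like this typically load the .env file with python-dotenv; a dependency-free sketch of the same idea (the helper name is hypothetical):

```python
import os

def load_env(path=".env"):
    """Minimal .env reader: KEY=value lines become environment variables.
    A stand-in for python-dotenv's load_dotenv(); illustrative only."""
    if not os.path.exists(path):  # tolerate a missing file
        return
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line and not line.startswith("#") and "=" in line:
                key, _, value = line.partition("=")
                os.environ.setdefault(key.strip(), value.strip())

load_env()
google_key = os.environ.get("GOOGLE_API_KEY")
mistral_key = os.environ.get("MISTRAL_API_KEY")
```

Using setdefault means variables already set in the shell win over the file, which is the usual convention.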
Process a PDF document to extract text and images:
python pdf_processor.py --input yourfile.pdf
python pic_record.py --input yourfile.pdf

Generate embeddings and store in ChromaDB:
python embeddings_processor.py

Launch the web interface:
python gradio_interface.py

Or use the system programmatically:
from rag_generator import RAGGenerator
generator = RAGGenerator()
result = generator.generate_answer("What were the financial highlights from the last fiscal year?")
print(result['answer'])

Run evaluation:
python evaluation.py

Fine-tune parameters:
python optimization.py

For optimal performance, the system can be tuned with these parameters:
- Number of text chunks to retrieve (default: 5)
- Number of images to retrieve (default: 3)
- LLM temperature (default: 0.1)
Run the optimization script to find the best parameters for your specific data.
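A grid search over those three knobs can be sketched as below; score_config is a hypothetical stand-in for whatever quality metric evaluation.py computes over your data:

```python
import itertools

def score_config(n_chunks, n_images, temperature):
    """Hypothetical scoring stub. In practice this would run the RAG
    pipeline over an evaluation set and return an aggregate metric;
    this stub just peaks at the documented defaults."""
    return -abs(n_chunks - 5) - abs(n_images - 3) - temperature

def grid_search():
    """Try every combination and keep the best-scoring one."""
    grid = itertools.product([3, 5, 7],        # text chunks to retrieve
                             [1, 3, 5],        # images to retrieve
                             [0.1, 0.3, 0.7])  # LLM temperature
    return max(grid, key=lambda cfg: score_config(*cfg))

best = grid_search()
print(best)  # → (5, 3, 0.1) with this stub metric
```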
- Implement new document processors that output text chunks and images
- Process these through the embedding pipeline
- Modify the embedding functions in embeddings_processor.py
- Change the LLM in rag_generator.py
- The system currently only handles PDF documents
- Image understanding is limited by the quality of embeddings
- Performance depends on the quality and relevance of the source documents
- Implement CLIP or other specialized multimodal embedding models
- Add support for more document types (Word, PPT, etc.)
- Implement user feedback loop for continuous improvement
- Add caching mechanisms for frequently asked questions
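The caching idea in the last bullet could be prototyped with the standard library; the answering function here is a placeholder for a real call like RAGGenerator().generate_answer(...):

```python
from functools import lru_cache

@lru_cache(maxsize=256)
def cached_answer(question: str) -> str:
    """Answer a question, reusing prior results for repeated questions.
    The body is a hypothetical stand-in for the real RAG pipeline call."""
    return f"answer to: {question}"

cached_answer("What were the financial highlights?")  # computed
cached_answer("What were the financial highlights?")  # served from cache
```

Note that lru_cache only matches exact question strings; a production cache would likely normalize the question or match on embedding similarity instead.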
This project is licensed under the MIT License - see the LICENSE file for details.