- Create and activate a Python virtual environment (recommended)
- Install requirements: pip install -r requirements.txt
- Place your PDF document in the project root directory
- Run the PDF processor:
  python pdf_processor.py
This will:
- Extract text and split into chunks (saved in /data)
- Extract images (saved in /pic_data)
- Run the embeddings processor:
  python embeddings_processor.py
This will:
- Create embeddings for text chunks
- Store embeddings in ChromaDB
- Run the terminal interface:
  python runner.py
You can:
- Type your questions about the document
- Type 'help' to see available commands
- Type 'exit' to quit
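The text-splitting step above (chunks saved in /data) can be sketched roughly as follows. The chunk size, overlap, and function name are illustrative assumptions, not the project's actual settings:

```python
def split_into_chunks(text, chunk_size=1000, overlap=200):
    """Split text into overlapping chunks (illustrative defaults only;
    the real pdf_processor.py may use different values)."""
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
    return chunks

# 2500 characters with size 1000 / overlap 200 yields 4 chunks.
chunks = split_into_chunks("A" * 2500, chunk_size=1000, overlap=200)
```

Overlap matters because a fact split across a chunk boundary would otherwise never appear whole in any single retrieved chunk.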
A Retrieval-Augmented Generation (RAG) system that can process and query both text and images from corporate documents.
This project implements a multimodal RAG system that:
- Extracts and processes text and images from PDF documents
- Embeds text and images into vector space using Google's embedding models
- Stores embeddings in ChromaDB vector database
- Retrieves relevant content based on user queries
- Generates coherent responses using Mistral AI
```
User Query → Multimodal Retriever → RAG Generator → User Response
                       ↑
                   ChromaDB ← Embeddings Processor
                                        ↑
                              Text Files & Images ← PDF Processor
                                                          ↑
                                                  Source Documents
```
- pdf_processor.py: Extracts text content from PDFs and splits into chunks
- pic_record.py: Extracts images from PDF files
- embeddings_processor.py: Creates embeddings for text using Google's embedding models
- multimodal_retriever.py: Handles multimodal queries across text and image collections
- rag_generator.py: Takes retrieval results and generates responses using Mistral AI
- retriever.py: Simple text-only retrieval functions (legacy)
- gradio_interface.py: Web interface using Gradio for interacting with the system
- evaluation.py: Tools to evaluate system performance
- optimization.py: Parameter optimization to fine-tune the system
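At its core, the retrieval step ranks stored embeddings by similarity to the query embedding; ChromaDB does this internally (with indexing) when multimodal_retriever.py queries it. A bare-bones sketch of the idea, with hypothetical toy vectors:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec, store, k=5):
    """Return the ids of the k stored vectors most similar to the query.
    ChromaDB performs the equivalent lookup, only with proper indexing."""
    scored = sorted(store, key=lambda item: cosine_similarity(query_vec, item[1]),
                    reverse=True)
    return [doc_id for doc_id, _ in scored[:k]]

# Toy example: three fake 3-d embeddings.
store = [("chunk_a", [1.0, 0.0, 0.0]),
         ("chunk_b", [0.0, 1.0, 0.0]),
         ("chunk_c", [0.9, 0.1, 0.0])]
print(top_k([1.0, 0.0, 0.0], store, k=2))  # → ['chunk_a', 'chunk_c']
```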
- Python 3.8+
- Required packages (see requirements.txt)
- API keys:
- Google AI API key (for embeddings)
- Mistral AI API key (for LLM)
- Clone the repository
- Install dependencies:
  pip install -r requirements.txt
- Create a .env file with your API keys:
  GOOGLE_API_KEY=your_google_ai_key
  MISTRAL_API_KEY=your_mistral_ai_key
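Projects like this typically load the .env file with python-dotenv; a dependency-free sketch of the same idea (the helper name is hypothetical):

```python
import os

def load_env(path=".env"):
    """Minimal .env reader: KEY=value lines become environment variables.
    A stand-in for python-dotenv's load_dotenv(); illustrative only."""
    if not os.path.exists(path):  # tolerate a missing file
        return
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line and not line.startswith("#") and "=" in line:
                key, _, value = line.partition("=")
                os.environ.setdefault(key.strip(), value.strip())

load_env()
google_key = os.environ.get("GOOGLE_API_KEY")
mistral_key = os.environ.get("MISTRAL_API_KEY")
```

Using setdefault means variables already set in the shell win over the file, which is the usual convention.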
Process a PDF document to extract text and images:
python pdf_processor.py --input yourfile.pdf
python pic_record.py --input yourfile.pdf

Generate embeddings and store in ChromaDB:
python embeddings_processor.py

Launch the web interface:
python gradio_interface.py

Or use the system programmatically:
from rag_generator import RAGGenerator
generator = RAGGenerator()
result = generator.generate_answer("What were the financial highlights from the last fiscal year?")
print(result['answer'])

Run evaluation:
python evaluation.py

Fine-tune parameters:
python optimization.py

For optimal performance, the system can be tuned with these parameters:
- Number of text chunks to retrieve (default: 5)
- Number of images to retrieve (default: 3)
- LLM temperature (default: 0.1)
Run the optimization script to find the best parameters for your specific data.
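A grid search over those three knobs can be sketched as below; score_config is a hypothetical stand-in for whatever quality metric evaluation.py computes over your data:

```python
import itertools

def score_config(n_chunks, n_images, temperature):
    """Hypothetical scoring stub. In practice this would run the RAG
    pipeline over an evaluation set and return an aggregate metric;
    this stub just peaks at the documented defaults."""
    return -abs(n_chunks - 5) - abs(n_images - 3) - temperature

def grid_search():
    """Try every combination and keep the best-scoring one."""
    grid = itertools.product([3, 5, 7],        # text chunks to retrieve
                             [1, 3, 5],        # images to retrieve
                             [0.1, 0.3, 0.7])  # LLM temperature
    return max(grid, key=lambda cfg: score_config(*cfg))

best = grid_search()
print(best)  # → (5, 3, 0.1) with this stub metric
```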
- Implement new document processors that output text chunks and images
- Process these through the embedding pipeline
- Modify the embedding functions in embeddings_processor.py
- Change the LLM in rag_generator.py
- The system currently only handles PDF documents
- Image understanding is limited by the quality of embeddings
- Performance depends on the quality and relevance of the source documents
- Implement CLIP or other specialized multimodal embedding models
- Add support for more document types (Word, PPT, etc.)
- Implement user feedback loop for continuous improvement
- Add caching mechanisms for frequently asked questions
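The caching idea in the last bullet could be prototyped with the standard library; the answering function here is a placeholder for a real call like RAGGenerator().generate_answer(...):

```python
from functools import lru_cache

@lru_cache(maxsize=256)
def cached_answer(question: str) -> str:
    """Answer a question, reusing prior results for repeated questions.
    The body is a hypothetical stand-in for the real RAG pipeline call."""
    return f"answer to: {question}"

cached_answer("What were the financial highlights?")  # computed
cached_answer("What were the financial highlights?")  # served from cache
```

Note that lru_cache only matches exact question strings; a production cache would likely normalize the question or match on embedding similarity instead.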
This project is licensed under the MIT License - see the LICENSE file for details.