MineMEETS is a sophisticated multimodal RAG application that processes text transcripts, transcribes audio/video, and embeds screenshots/images for intelligent retrieval. It features LangChain-Pinecone integration for seamless vector operations, hybrid search capabilities, and enhanced multimodal understanding across text, audio, and visual content.
- 🔄 LangChain-Pinecone Integration: Seamless vector database operations with advanced retrieval chains
- 🎯 Hybrid Search: Combines semantic, keyword, and query expansion search strategies for better recall
- 🎨 True Multimodal RAG: Cross-modal analysis synthesizing text, audio, and visual information
- 📝 Smart Document Processing: LangChain document loaders with intelligent chunking and text splitting
- 🎙️ Audio Intelligence: Whisper transcription with speaker diarization and temporal context (see the sketch after this list)
- 🖼️ Visual Understanding: CLIP ViT-B/32 embeddings with multimodal context integration
- 💬 Enhanced Q&A: Context-aware responses with source attribution and modality indicators
- 📊 Meeting Analytics: Comprehensive summaries with modality breakdown and search strategy insights
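The audio side of this pipeline can be reproduced with the open-source `openai-whisper` package. The sketch below is illustrative only: the `transcribe_with_segments` helper and the `meeting.mp4` path are made up, it assumes ffmpeg is on your PATH, and it covers temporal segmentation rather than diarization.

```python
# Illustrative sketch: Whisper transcription with per-segment timestamps.
import os

import whisper  # pip install openai-whisper


def transcribe_with_segments(path: str) -> list[dict]:
    """Transcribe an audio/video file and keep segment-level timestamps."""
    model = whisper.load_model(
        os.getenv("WHISPER_MODEL", "base"),
        download_root=os.getenv("WHISPER_CACHE_DIR", ".cache/whisper"),
    )
    result = model.transcribe(path)
    # Segment start/end times supply the temporal context used for
    # "when was X discussed?"-style questions.
    return [
        {"text": seg["text"].strip(), "start": seg["start"], "end": seg["end"]}
        for seg in result["segments"]
    ]


if __name__ == "__main__":
    for seg in transcribe_with_segments("meeting.mp4"):
        print(f"[{seg['start']:7.1f}s - {seg['end']:7.1f}s] {seg['text']}")
```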
- Text + Audio/Video: Upload `.txt` transcripts or audio/video files for Whisper transcription
- Screenshots/Images: Upload `.png`/`.jpg`/`.jpeg`/`.webp`/`.bmp` files; embedded with CLIP ViT-B/32
- Intelligent Chunking: Recursive text splitting with semantic boundary detection
- Meeting-Scoped Namespaces: Each upload is isolated per `meeting_id` with enhanced metadata (see the sketch after this list)
- Multimodal Insights: Cross-references information across text, audio, and visual modalities
- Streamlit UI: Intuitive upload → process → ask workflow
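A minimal sketch of how a screenshot could be embedded with CLIP ViT-B/32 and stored in a meeting-scoped namespace, assuming the `sentence-transformers` CLIP checkpoint and the `pinecone` v3+ SDK. The `minemeets` index name and both helper functions are illustrative, not the project's actual code.

```python
# Illustrative: embed a screenshot with CLIP ViT-B/32 and store it under a
# meeting-scoped Pinecone namespace.
import os
import uuid

from PIL import Image
from pinecone import Pinecone  # pip install pinecone
from sentence_transformers import SentenceTransformer  # pip install sentence-transformers

clip = SentenceTransformer("clip-ViT-B-32")  # 512-dimensional image/text embeddings
pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])
index = pc.Index("minemeets")  # placeholder index name


def ingest_screenshot(meeting_id: str, image_path: str) -> None:
    vector = clip.encode(Image.open(image_path)).tolist()
    index.upsert(
        vectors=[{
            "id": f"img-{uuid.uuid4()}",
            "values": vector,
            "metadata": {"modality": "image", "source": image_path},
        }],
        namespace=meeting_id,  # each meeting stays isolated in its own namespace
    )


def search_meeting(meeting_id: str, question: str, k: int = 5):
    query_vec = clip.encode(question).tolist()  # CLIP also embeds text queries
    return index.query(vector=query_vec, top_k=k, namespace=meeting_id, include_metadata=True)
```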
- Install requirements: `pip install -r requirements.txt`
- Create a `.env` file with these variables:

```
PINECONE_API_KEY=your-pinecone-key
PINECONE_ENVIRONMENT=your-env-or-project-region

# Whisper
WHISPER_MODEL=base
WHISPER_CACHE_DIR=.cache/whisper

# Ollama (required)
OLLAMA_MODEL=llama3.1
OLLAMA_HOST=http://localhost:11434
```
- Launch Ollama locally (a quick sanity check is sketched after these steps):
  - Download and start it from https://ollama.com/download
  - `ollama pull llama3.1` (recommended for better multimodal understanding)
  - Ollama will serve HTTP at `http://localhost:11434` by default
- Run the Streamlit app: `streamlit run app.py`
- Use it:
  - Upload `.txt` transcripts, audio/video, and screenshots
  - Click "Process Meeting"; then ask questions in the Q&A tab
  - Explore enhanced multimodal responses with source attribution
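As an optional sanity check after setup (not part of the app itself), you can verify that the `.env` file loads and that Ollama is reachable. This assumes `python-dotenv` and `requests` are available.

```python
# Optional sanity check: confirm .env loads and Ollama is reachable.
import os

import requests
from dotenv import load_dotenv  # pip install python-dotenv

load_dotenv()
host = os.getenv("OLLAMA_HOST", "http://localhost:11434")

models = requests.get(f"{host}/api/tags", timeout=10).json()  # lists pulled models
print("Ollama models:", [m["name"] for m in models.get("models", [])])
assert os.getenv("PINECONE_API_KEY"), "PINECONE_API_KEY missing from .env"
```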
- `multimodal_rag.py`: LangChain-based RAG chain with hybrid retrieval and cross-modal analysis
- `document_processor.py`: LangChain document loaders and intelligent text splitters
- `pinecone_db.py`: Enhanced Pinecone integration with LangChain VectorStore support
- `qa_agent.py`: Multimodal Q&A with enhanced context understanding
- `coordinator.py`: Orchestrates multimodal ingestion with fallback mechanisms
- `audio_agent.py`: Whisper transcription with temporal segmentation
- `image_agent.py`: CLIP embeddings for visual content understanding
- Hybrid Retrieval: Combines semantic, keyword, and expanded query strategies
- Multimodal Context: Categorizes and processes content by modality (text, audio, visual)
- Smart Chunking: Uses LangChain's RecursiveCharacterTextSplitter for optimal text segmentation (sketched after this list)
- Enhanced Metadata: Rich metadata including search strategies, temporal information, and modality types
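For reference, a typical `RecursiveCharacterTextSplitter` invocation looks like the sketch below; the chunk size, overlap, and separator values are examples, not necessarily the ones `document_processor.py` uses.

```python
# Illustrative use of LangChain's RecursiveCharacterTextSplitter.
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,      # characters per chunk (example value)
    chunk_overlap=200,    # overlap preserves context across chunk boundaries
    separators=["\n\n", "\n", ". ", " ", ""],  # prefer semantic boundaries first
)

with open("transcript.txt", encoding="utf-8") as f:
    chunks = splitter.split_text(f.read())

print(f"{len(chunks)} chunks")
```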
- Document Ingestion: LangChain loaders process various file formats
- Intelligent Chunking: Semantic-aware text splitting with overlap management
- Hybrid Embedding: Multiple search strategies for comprehensive retrieval
- Cross-Modal Analysis: Synthesizes information across text, audio, and visual content
- Context-Aware Response: Generates responses with modality-specific insights
- Semantic Search: Vector similarity using CLIP embeddings
- Keyword Search: Targeted keyword matching for precision
- Query Expansion: Broadens search for general questions
- Multimodal Ranking: Combines relevance scores with modality-specific boosts (see the scoring sketch after this list)
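The exact ranking formula is internal to `multimodal_rag.py`; the sketch below only illustrates the general idea of blending semantic similarity, keyword matches, and a modality boost, with made-up weights.

```python
# Hypothetical scoring sketch for combining search strategies; the weights and
# boost values are illustrative, not the ones used by multimodal_rag.py.
def hybrid_score(semantic_score: float, keyword_hits: int, modality: str) -> float:
    """Blend vector similarity with keyword matching and a modality boost."""
    modality_boost = {"text": 0.00, "audio": 0.05, "image": 0.10}.get(modality, 0.0)
    keyword_score = min(keyword_hits, 5) / 5.0  # saturate at 5 hits
    return 0.7 * semantic_score + 0.3 * keyword_score + modality_boost


matches = [
    {"id": "chunk-1", "semantic": 0.82, "keyword_hits": 3, "modality": "text"},
    {"id": "img-7",   "semantic": 0.74, "keyword_hits": 0, "modality": "image"},
]
ranked = sorted(
    matches,
    key=lambda m: hybrid_score(m["semantic"], m["keyword_hits"], m["modality"]),
    reverse=True,
)
print([m["id"] for m in ranked])
```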
- PDF Support: Enhanced PDF text extraction with page metadata
- DOCX Support: Word document processing with formatting preservation (PDF and DOCX loading are sketched after this list)
- Text Splitting: Intelligent boundary detection (sentences, paragraphs, sections)
- Metadata Enhancement: Rich metadata including file stats, chunk positions, and content types
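A hedged sketch of how PDF, DOCX, and plain-text inputs can be routed through LangChain loaders; the `load_documents` helper and `minutes.pdf` filename are illustrative, and the loaders assume `pypdf` and `docx2txt` are installed alongside `langchain-community`.

```python
# Illustrative LangChain loaders for PDF, DOCX, and plain text.
from pathlib import Path

from langchain_community.document_loaders import Docx2txtLoader, PyPDFLoader, TextLoader


def load_documents(path: str):
    suffix = Path(path).suffix.lower()
    if suffix == ".pdf":
        loader = PyPDFLoader(path)   # one Document per page, with page metadata
    elif suffix == ".docx":
        loader = Docx2txtLoader(path)
    else:
        loader = TextLoader(path, encoding="utf-8")
    return loader.load()


docs = load_documents("minutes.pdf")
print(docs[0].metadata)  # e.g. {'source': 'minutes.pdf', 'page': 0, ...}
```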
- Cross-Modal Analysis: Identifies patterns and consistency across modalities
- Temporal Context: Audio timestamps and text positions for timeline understanding
- Visual Context: Image descriptions and their relation to discussion content
- Source Attribution: Clear indication of which modality provided each piece of information (see the sketch after this list)
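One way to surface source attribution is to tag each retrieved chunk with its modality (and timestamp for audio) before prompting the model. The sketch below is illustrative: the match structure, prompt wording, and direct call to Ollama's HTTP API are assumptions, not the project's actual implementation.

```python
# Hypothetical sketch: build an attributed context block and ask Ollama.
import requests  # pip install requests


def format_context(matches: list[dict]) -> str:
    """Label each retrieved chunk with its modality and, for audio, its timestamps."""
    lines = []
    for m in matches:
        if m["modality"] == "audio":
            tag = f"[AUDIO {m['start']:.0f}s-{m['end']:.0f}s]"
        elif m["modality"] == "image":
            tag = f"[IMAGE {m['source']}]"
        else:
            tag = "[TEXT]"
        lines.append(f"{tag} {m['text']}")
    return "\n".join(lines)


def answer(question: str, matches: list[dict]) -> str:
    prompt = (
        "Answer using only the meeting context below and cite the source tags.\n\n"
        f"{format_context(matches)}\n\nQuestion: {question}"
    )
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "llama3.1", "prompt": prompt, "stream": False},
        timeout=120,
    )
    return resp.json()["response"]
```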
- LangChain Installation: Ensure all LangChain packages are installed: `langchain-pinecone`, `langchain-core`, `langchain-community`
- Pinecone Dimension: The index dimension must match `EMBEDDING_DIM` (512 for CLIP ViT-B/32); see the check sketched after this list
- Memory Issues: Large files are automatically chunked; monitor memory usage for very large documents
- Ollama Models: Use `llama3.1` or similar for the best multimodal understanding capabilities
- Fallback Mechanisms: The system gracefully falls back to legacy methods if LangChain features fail
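If you hit the dimension mismatch above, a quick check like the following can confirm it (assumes the `pinecone` v3+ SDK; `minemeets` is a placeholder index name):

```python
# Quick check that the Pinecone index dimension matches the CLIP embedding size.
import os

from pinecone import Pinecone

EMBEDDING_DIM = 512  # CLIP ViT-B/32

pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])
desc = pc.describe_index("minemeets")  # placeholder index name
assert desc.dimension == EMBEDDING_DIM, (
    f"Index dimension {desc.dimension} != {EMBEDDING_DIM}; recreate the index."
)
```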
- 🔧 Production-Ready: Comprehensive error handling with graceful fallbacks
- 🎯 LangChain Integration: Leverages the power of LangChain for advanced RAG capabilities
- 🔄 Hybrid Search: Multiple retrieval strategies ensure comprehensive information discovery
- 🎨 True Multimodal: Understands and synthesizes across text, audio, and visual content
- 📊 Rich Analytics: Detailed insights into search strategies, modalities used, and response confidence
- 🛠️ Extensible Design: Clean agent architecture allows easy addition of new modalities or features
- 💻 Windows Compatible: Optimized for Windows development environments
- 📝 Well-Documented: Comprehensive documentation and clear code structure
MIT License