Arabic Hadith RAG Pipeline

Hadith RAG is an AI-powered retrieval and reasoning engine designed for Islamic text understanding — specifically Hadith corpora such as Sahih Bukhari and Sahih Muslim. It leverages semantic chunking, Arabic embeddings, and LLM reasoning to answer complex questions accurately while preserving the sacred context and meaning of the Hadith.

A comprehensive Retrieval-Augmented Generation (RAG) system specialized for Arabic Hadith texts, built with LlamaIndex, Ollama, and Qdrant vector store.

صلى الله عليه وسلم (peace and blessings of Allah be upon him)

🌟 Features

  • Arabic Hadith Processing: Specialized pipeline for Arabic religious texts
  • Modern RAG Architecture: LlamaIndex + Ollama + Qdrant integration
  • Semantic Chunking: Intelligent document segmentation for better context
  • Sentence Window Context: Enhanced retrieval with surrounding context
  • Multiple Data Formats: Support for .txt, .md, and .json files
  • Interactive CLI: Rich terminal interface for queries
  • Modular Design: Clean, extensible codebase
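The sentence-window idea above can be illustrated with a toy function (a sketch of the concept only, not the actual LlamaIndex implementation):

```python
def sentence_window(sentences, i, window=3):
    """Return sentence i joined with up to `window` neighbours on each
    side -- the extra context handed to the LLM along with the match."""
    lo = max(0, i - window)
    return " ".join(sentences[lo:i + window + 1])

# The retriever matches a single sentence, but the LLM sees its neighbourhood:
docs = ["First hadith.", "Second hadith.", "Third hadith.", "Fourth hadith."]
context = sentence_window(docs, 1, window=1)
# -> "First hadith. Second hadith. Third hadith."
```

In the real pipeline this windowing is applied at retrieval time, which is why the `--no-window` flag described below can toggle it off.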

๐Ÿ› ๏ธ Technology Stack

  • LLM: Ollama qwen2.5:7b (multilingual Qwen model with Arabic support)
  • Embeddings: Ollama qwen3-embedding:4b
  • Vector Store: Qdrant (in-memory or server)
  • Framework: LlamaIndex ≥0.11.0
  • Interface: Typer + Rich for beautiful CLI

🚀 Quick Start

Prerequisites

  1. Install Ollama (https://ollama.ai)
  2. Pull required models:
    ollama pull qwen2.5:7b
    ollama pull qwen3-embedding:4b

Installation

Option 1: Automated Setup

git clone <repository-url>
cd hadith-rag
./setup.sh

Option 2: Conda Environment Setup

git clone <repository-url>
cd hadith-rag

# Create conda environment
conda env create -f environment.yml
conda activate hadith-rag

# Or use make commands
make conda-setup
conda activate hadith-rag

Option 3: Manual Setup

git clone <repository-url>
cd hadith-rag

# With pip
pip install -r requirements.txt

# Or with conda
conda create -n hadith-rag python=3.11 -y
conda activate hadith-rag
pip install -r requirements.txt

Add Data and Run

  1. Add your Hadith data to the data/ directory:

    • .json files with structure: {"hadiths": [{"text": "...", ...}]}
    • .txt or .md files with Hadith texts
    • Sample files included for testing
  2. Run the pipeline:

    python main.py interactive

📖 Usage

Interactive Mode

Start an interactive query session:

python main.py interactive [OPTIONS]

Options:

  • --rebuild: Rebuild the index from scratch
  • --no-window: Disable sentence window context
  • --top-k N: Number of documents to retrieve (default: 5)
  • --data-dir PATH: Custom data directory
  • --storage-dir PATH: Custom storage directory

Single Query

Execute a single query and exit:

python main.py query-single "ما هو حديث النية؟"

Build Index Only

Build or rebuild the document index:

python main.py build-index [OPTIONS]

System Check

Verify setup and dependencies:

python main.py check-setup

๐Ÿ—‚๏ธ Project Structure

hadith-rag/
├── main.py                 # CLI entry point
├── requirements.txt        # Python dependencies
├── data/                   # Hadith documents
│   ├── sahih_bukhari_sample.json
│   └── hadith_collection.md
├── storage/                # Vector index storage
├── src/                    # Core modules
│   ├── __init__.py
│   ├── config.py           # Configuration management
│   ├── embeddings.py       # Custom Ollama embedding wrapper
│   ├── document_loader.py  # Multi-format document loading
│   ├── index_builder.py    # Vector index creation
│   └── query_engine.py     # Query processing & response generation
└── README.md

โš™๏ธ Configuration

Settings are managed in src/config.py:

# Ollama settings
OLLAMA_BASE_URL = "http://localhost:11434"
EMBEDDING_MODEL = "qwen3-embedding:4b"
LLM_MODEL = "qwen2.5:7b"

# Retrieval settings
SIMILARITY_TOP_K = 5
SENTENCE_WINDOW_SIZE = 3

# Chunking settings  
CHUNK_SIZE = 512
CHUNK_OVERLAP = 50

Environment variables can override defaults via .env file.
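The override mechanism can be sketched like this (illustrative only; the actual src/config.py may differ in detail):

```python
import os

# Defaults match the values shown above; each setting can be overridden
# by an environment variable of the same name (e.g. loaded from .env).
OLLAMA_BASE_URL = os.getenv("OLLAMA_BASE_URL", "http://localhost:11434")
EMBEDDING_MODEL = os.getenv("EMBEDDING_MODEL", "qwen3-embedding:4b")
LLM_MODEL = os.getenv("LLM_MODEL", "qwen2.5:7b")

# Numeric settings are cast from their string environment values.
SIMILARITY_TOP_K = int(os.getenv("SIMILARITY_TOP_K", "5"))
SENTENCE_WINDOW_SIZE = int(os.getenv("SENTENCE_WINDOW_SIZE", "3"))
CHUNK_SIZE = int(os.getenv("CHUNK_SIZE", "512"))
CHUNK_OVERLAP = int(os.getenv("CHUNK_OVERLAP", "50"))
```

With this pattern, `EMBEDDING_MODEL=my-model python main.py interactive` would switch embeddings for one run without editing the source.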

📊 Data Formats

JSON Format

{
  "collection": "sahih_bukhari",
  "hadiths": [
    {
      "text": "Arabic hadith text here...",
      "english": "English translation",
      "narrator": "Narrator name",
      "grade": "Sahih",
      "number": 1
    }
  ]
}
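Loading this format can be sketched as follows (the function names are illustrative, not the project's actual API; fields match the sample above):

```python
import json
from pathlib import Path

def parse_hadith_collection(data, default_name="unknown"):
    """Turn a {"hadiths": [...]} dict into (text, metadata) records,
    skipping entries that have no Arabic text."""
    collection = data.get("collection", default_name)
    records = []
    for h in data.get("hadiths", []):
        if not h.get("text"):
            continue  # nothing to index
        meta = {"collection": collection}
        for key in ("english", "narrator", "grade", "number"):
            if key in h:
                meta[key] = h[key]
        records.append({"text": h["text"], "metadata": meta})
    return records

def load_hadith_json(path):
    """Read a collection file, defaulting the collection name to the filename."""
    path = Path(path)
    data = json.loads(path.read_text(encoding="utf-8"))
    return parse_hadith_collection(data, default_name=path.stem)
```

Keeping parsing separate from file I/O makes the format logic easy to unit-test.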

Text/Markdown Format

## Hadith Title

Arabic hadith text here...

## Another Hadith

More Arabic text...

The system automatically extracts metadata from file paths and content structure.
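That extraction can be sketched along these lines (a simplification of what a loader like document_loader.py might do; the names are illustrative):

```python
from pathlib import Path

def metadata_from_path(path):
    """Derive basic metadata from a data file's path,
    e.g. data/hadith_collection.md -> collection 'hadith_collection'."""
    p = Path(path)
    return {"file_name": p.name,
            "file_type": p.suffix.lstrip("."),
            "collection": p.stem}

def split_markdown_hadiths(text):
    """Split a markdown file into (title, body) pairs at '## ' headings,
    matching the Text/Markdown format shown above."""
    sections, title, lines = [], None, []

    def flush():
        body = "\n".join(lines).strip()
        if title and body:
            sections.append((title, body))

    for line in text.splitlines():
        if line.startswith("## "):
            flush()
            title, lines = line[3:].strip(), []
        else:
            lines.append(line)
    flush()
    return sections
```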

🎯 Example Queries

  • "ما هو حديث النية؟" (What is the hadith about intention?)
  • "أحاديث عن بر الوالدين" (Hadiths about honoring parents)
  • "قال رسول الله عن الصدق" (What the Prophet said about truthfulness)
  • "أحاديث في صحيح البخاري عن الصلاة" (Hadiths in Sahih Bukhari about prayer)

🔧 Advanced Usage

Custom Embedding Model

from src.embeddings import create_embedding_model

# Use different embedding model
embed_model = create_embedding_model(
    model_name="custom-arabic-model",
    base_url="http://localhost:11434"
)

Programmatic Usage

from src import HadithQueryEngine, build_hadith_index

# Build index
index = build_hadith_index(
    data_dir="./data",
    use_sentence_window=True,
    rebuild=True
)

# Create query engine  
engine = HadithQueryEngine(index, similarity_top_k=10)

# Query
result = engine.query("ما هو الإسلام؟")  # "What is Islam?"
print(result["answer"])

๐Ÿ›ก๏ธ Best Practices

  1. Data Quality: Ensure Arabic text is properly encoded (UTF-8)
  2. Index Management: Use --rebuild when adding new documents
  3. Performance: Start with smaller document collections for testing
  4. Memory: Qdrant's in-memory mode is suitable for smaller datasets; use a Qdrant server for larger corpora
  5. Models: Verify Ollama models are running before starting
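The last point can be automated with a small pre-flight check against Ollama's /api/tags endpoint (a sketch; `missing_models` is not part of this project):

```python
import json
import urllib.request

REQUIRED_MODELS = ["qwen2.5:7b", "qwen3-embedding:4b"]

def missing_models(required, base_url="http://localhost:11434", installed=None):
    """Return the required model names not reported by the Ollama server.

    `installed` may be supplied directly (e.g. in tests); otherwise the
    list is fetched from Ollama's /api/tags endpoint.
    """
    if installed is None:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
            installed = [m["name"] for m in json.load(resp).get("models", [])]
    return [name for name in required if name not in installed]
```

Calling `missing_models(REQUIRED_MODELS)` before starting the pipeline reports anything you still need to `ollama pull`.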

🚨 Troubleshooting

Common Issues

Ollama Connection Failed:

# Check Ollama is running
ollama serve

# Verify models are available
ollama list

Import Errors:

# Install missing dependencies
pip install -r requirements.txt

Empty Index:

# Check data directory has valid files
python main.py check-setup

# Rebuild index
python main.py build-index --rebuild

Arabic Text Issues:

  • Ensure files are UTF-8 encoded
  • Check Arabic text renders correctly in terminal
  • Use --no-window if sentence-window processing causes errors
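A quick way to pin down encoding problems (`utf8_error` is a standalone helper, not part of the project):

```python
def utf8_error(raw):
    """Return None if `raw` bytes decode as UTF-8, otherwise a message
    locating the first invalid byte."""
    try:
        raw.decode("utf-8")
        return None
    except UnicodeDecodeError as exc:
        return f"invalid byte at offset {exc.start}: {exc.reason}"

def check_file_utf8(path):
    """Apply utf8_error to a file's raw contents."""
    with open(path, "rb") as f:
        return utf8_error(f.read())
```

Files saved in legacy Arabic encodings (e.g. Windows-1256) will fail this check and should be re-saved as UTF-8.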

๐Ÿ“ Development

Adding New Features

  1. New Data Sources: Extend HadithDocumentLoader
  2. Custom Retrievers: Modify query_engine.py
  3. UI Improvements: Enhance main.py CLI
  4. New Models: Update embeddings.py and config.py
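For example, a new data source could be added as its own loader class (hypothetical: the real HadithDocumentLoader interface is not shown in this README, so the method names here are illustrative):

```python
import csv

class CsvHadithLoader:
    """Hypothetical loader for CSV files with a 'text' column;
    every other column becomes metadata."""

    def load_rows(self, lines):
        # csv.DictReader accepts any iterable of lines, which keeps
        # the parsing logic testable without touching the filesystem.
        return [
            {"text": row["text"],
             "metadata": {k: v for k, v in row.items() if k != "text"}}
            for row in csv.DictReader(lines)
            if row.get("text")
        ]

    def load(self, path):
        with open(path, encoding="utf-8", newline="") as f:
            return self.load_rows(f)
```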

Running Tests

# Install test dependencies
pip install pytest

# Run tests (when available)
pytest tests/

๐Ÿค Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make changes with proper documentation
  4. Add tests for new functionality
  5. Submit a pull request

📜 License

This project is open source and available under the MIT License.

๐Ÿ™ Acknowledgments

  • LlamaIndex: Excellent RAG framework
  • Ollama: Local LLM serving made simple
  • Qdrant: High-performance vector database
  • Arabic NLP Community: Research and tools for Arabic text processing

Built with ❤️ for Arabic Hadith preservation and accessibility.
