gegnew/vectorserver


VectorServer - Document Embedding and Retrieval System

A FastAPI-based vector database system for document chunking, embedding, and semantic search using Cohere's embedding models.

Overview

VectorServer is a document processing and retrieval system that:

  • Stores documents in a hierarchical structure (Libraries > Documents > Chunks)
  • Automatically chunks long documents into semantically meaningful segments
  • Generates vector embeddings using Cohere's embed-v4.0 model
  • Provides semantic search capabilities across document collections
  • Offers a RESTful API for document management and search

Task Completion Status

  1. 🟢 Define the Chunk, Document and Library classes.
  2. 🟢 Implement two or three indexing algorithms without using external libraries:
    1. 🟢 Exact kNN
    • Time complexity: O(N × D)
    • Space complexity: O(N)
    • Simplest and fastest to implement; most precise and fast enough for small datasets.
    2. 🟢 IVF
    • Time complexity:
      • Build time: O(I × N × K × D)
        • I: number of k-means iterations
        • N × K × D: each iteration computes distances from N vectors to K centroids
      • Search time: O(K × D + |P|)
        1. Coarse search: O(K × D), computing the distance from the query to K centroids
        2. Fine search: O(|P|), returning the labels of the nearest centroid, where |P| ≈ N/K is the average partition size
    • Space complexity: O(N × D + K × D + N)
    • Where:
      • N = number of vectors
      • D = vector dimensionality
      • K = number of partitions/centroids
  3. 🟢 Implement the necessary data structures/algorithms to ensure that there are no data races between reads and writes to the database.
    • I've used aiosqlite to leverage FastAPI's async capabilities and prevent data races. This isn't a very "custom" solution; previously I had implemented the DB class as a context manager that handled transactions manually. That earlier approach is fine for SQLite, but it doesn't make the most of FastAPI's async capabilities.
  4. 🟢 Create the logic to do the CRUD operations on libraries and documents/chunks.
    • Most DB operations implemented
  5. 🟢 Implement an API layer on top of that logic to let users interact with the vector database.
    • All endpoints for Libraries implemented
  6. 🟢 Create a docker image for the project
    • Sufficient for development, but not for production
Extra Points:

  1. 🟢 Metadata filtering
  2. 🟢 Persistence to Disk (indexes are not yet persisted to disk and must be rebuilt on each app start)
  3. 🔴 Leader-Follower Architecture
  4. 🔴 Python SDK Client
Architecture

System Overview

graph TB
    C[HTTP Clients] --> MAIN[main.py]
    API_DOCS[Swagger UI] --> MAIN
    
    MAIN --> LIB_R[libraries.py]
    MAIN --> DOC_R[documents.py]
    MAIN --> CHUNK_R[chunks.py]
    MAIN --> SEARCH_R[search.py]
    MAIN --> INDEX_R[indexes.py]
    
    LIB_R --> LIB_S[LibraryService]
    DOC_R --> DOC_S[DocumentService]
    CHUNK_R --> CHUNK_S[ChunkService]
    SEARCH_R --> SEARCH_S[SearchService]
    INDEX_R --> SEARCH_S
    
    LIB_S --> LIB_REPO[LibraryRepository]
    DOC_S --> DOC_REPO[DocumentRepository]
    DOC_S --> CHUNK_REPO[ChunkRepository]
    CHUNK_S --> CHUNK_REPO
    SEARCH_S --> CHUNK_REPO
    SEARCH_S --> DOC_REPO
    SEARCH_S --> VECTOR_REPO[VectorIndexRepository]
    
    LIB_REPO --> DB[Database Manager]
    DOC_REPO --> DB
    CHUNK_REPO --> DB
    DB --> SQLITE[(SQLite)]
    
    DOC_S --> EMBEDDER[Embedder]
    DOC_S --> CHUNKER[SmartChunker]
    SEARCH_S --> EMBEDDER
    EMBEDDER --> COHERE[Cohere API]
    
    VECTOR_REPO --> FLAT[FlatIndex]
    VECTOR_REPO --> IVF[IVF Index]
    SEARCH_S --> PERSISTENT[PersistentIndex]
    PERSISTENT --> DISK[Disk Storage]

Data Model

The system uses a three-tier hierarchical structure:

Library (Collection of related documents)
  > Document (Individual files/texts)
    > Chunk (Text segments with embeddings)

Libraries: Top-level collections for organizing documents by topic, project, or source
Documents: Individual text files or content with metadata
Chunks: Text segments (~500 characters) with vector embeddings for semantic search
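
The three-tier hierarchy can be sketched with parent-ID references. The project defines its models with Pydantic; this example uses standard-library dataclasses to stay dependency-free, and every field name here is an assumption rather than the repository's actual schema.

```python
from dataclasses import dataclass, field
from typing import Any, Optional
from uuid import uuid4


def _new_id() -> str:
    """Generate a random UUID string for a new entity."""
    return str(uuid4())


@dataclass
class Library:
    name: str
    description: str = ""
    metadata: dict[str, Any] = field(default_factory=dict)
    id: str = field(default_factory=_new_id)


@dataclass
class Document:
    library_id: str  # parent Library
    title: str
    content: str
    metadata: dict[str, Any] = field(default_factory=dict)
    id: str = field(default_factory=_new_id)


@dataclass
class Chunk:
    document_id: str  # parent Document
    content: str
    embedding: Optional[list[float]] = None  # filled in once the text is embedded
    id: str = field(default_factory=_new_id)
```

Each child stores only its parent's ID, which maps directly onto foreign keys in a relational store.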

Component Details

API Layer

  • FastAPI Application: Async web framework with automatic OpenAPI documentation
  • Route Handlers: RESTful endpoints for CRUD operations and search
  • Dependency Injection: Service instances provided via FastAPI's dependency system

Service Layer

  • Business Logic: Document processing, search orchestration, and entity management
  • Transaction Management: Coordinates database operations across repositories
  • Integration Points: Connects external APIs (Cohere) with internal systems

Repository Layer

  • Data Access: Abstract database operations with consistent interfaces
  • Connection Management: Thread-safe SQLite connections with read/write separation
  • Vector Operations: Specialized repositories for embedding storage and retrieval

Vector Processing

  • Smart Chunking: Intelligent text segmentation with boundary detection
  • Embedding Generation: Cohere API integration for vector embeddings
  • Index Management: Multiple indexing strategies (Flat, IVF) with persistence
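
To make the IVF strategy concrete, here is a small pure-Python sketch: k-means builds the partitions, and a query returns the members of the partition whose centroid is nearest. The class name `IVFIndexSketch`, seeding centroids with the first K vectors, and the squared Euclidean distance are all assumptions for illustration, not the repository's implementation.

```python
def squared_distance(a: list[float], b: list[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(centroids: list[list[float]], vector: list[float]) -> int:
    """Index of the closest centroid: K distance computations, O(K x D)."""
    return min(range(len(centroids)), key=lambda i: squared_distance(centroids[i], vector))


class IVFIndexSketch:
    """Hypothetical inverted-file index with a single probed partition."""

    def __init__(self, n_partitions: int, n_iterations: int = 10) -> None:
        self.k = n_partitions
        self.iterations = n_iterations
        self.centroids: list[list[float]] = []
        self.partitions: list[list[str]] = []

    def _assign(self, vectors: dict[str, list[float]]) -> list[list[str]]:
        # Put every vector ID into the partition of its nearest centroid.
        parts: list[list[str]] = [[] for _ in self.centroids]
        for cid, vec in vectors.items():
            parts[nearest(self.centroids, vec)].append(cid)
        return parts

    def build(self, vectors: dict[str, list[float]]) -> None:
        ids = list(vectors)
        # Assumption: seed with the first K vectors (deterministic, not k-means++).
        self.centroids = [list(vectors[i]) for i in ids[: self.k]]
        for _ in range(self.iterations):
            for idx, members in enumerate(self._assign(vectors)):
                if members:  # an empty partition keeps its previous centroid
                    columns = zip(*(vectors[m] for m in members))
                    self.centroids[idx] = [sum(col) / len(members) for col in columns]
        self.partitions = self._assign(vectors)

    def search(self, query: list[float]) -> list[str]:
        """Coarse search over centroids, then return that partition's labels."""
        if not self.centroids:
            return []
        return list(self.partitions[nearest(self.centroids, query)])
```

Probing a single partition keeps search cheap but can miss true neighbours that fall just across a partition boundary, which is the usual accuracy trade-off of IVF against an exact scan.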

Data Layer

  • SQLite Database: Lightweight, serverless database with foreign key constraints
  • Persistent Storage: Disk-based index caching for improved startup performance

Technical Choices

Database: SQLite with foreign key constraints for data integrity

  • Lightweight, serverless, perfect for development and testing
  • BLOB storage for binary vector embeddings
  • Automatic cascade deletion maintains referential integrity
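
The BLOB storage and cascade deletion described above can be sketched with the standard-library `sqlite3` and `struct` modules. The table layout, column names, and float32 encoding are assumptions; the project itself uses aiosqlite and may store vectors differently.

```python
import sqlite3
import struct


def pack_vector(vector: list[float]) -> bytes:
    """Serialize floats as little-endian float32 values for BLOB storage."""
    return struct.pack(f"<{len(vector)}f", *vector)


def unpack_vector(blob: bytes) -> list[float]:
    """Inverse of pack_vector: 4 bytes per float32 value."""
    return list(struct.unpack(f"<{len(blob) // 4}f", blob))


conn = sqlite3.connect(":memory:")
# SQLite enforces foreign keys only when enabled on each connection.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript(
    """
    CREATE TABLE documents (id TEXT PRIMARY KEY, title TEXT NOT NULL);
    CREATE TABLE chunks (
        id TEXT PRIMARY KEY,
        document_id TEXT NOT NULL REFERENCES documents(id) ON DELETE CASCADE,
        content TEXT NOT NULL,
        embedding BLOB
    );
    """
)
conn.execute("INSERT INTO documents VALUES (?, ?)", ("doc-1", "Example"))
conn.execute(
    "INSERT INTO chunks VALUES (?, ?, ?, ?)",
    ("chunk-1", "doc-1", "hello", pack_vector([0.5, -1.0, 2.0])),
)

# Round-trip the embedding through the BLOB column.
stored = conn.execute("SELECT embedding FROM chunks WHERE id = ?", ("chunk-1",)).fetchone()[0]
restored = unpack_vector(stored)

# Deleting the parent document removes its chunks via ON DELETE CASCADE.
conn.execute("DELETE FROM documents WHERE id = ?", ("doc-1",))
remaining = conn.execute("SELECT COUNT(*) FROM chunks").fetchone()[0]
```

Note that float32 halves storage relative to Python's native doubles at the cost of some precision; whether that trade-off fits depends on the embedding model's output.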

Embedding Model: Cohere embed-v4.0 (1024 dimensions)

  • State-of-the-art multilingual embeddings
  • Optimized for search and retrieval tasks
  • Consistent 1024-dimensional vectors for all content

Chunking Strategy: Intelligent text segmentation

  • 500-character chunks
  • 50-character overlap (not yet implemented)
  • Smart boundary detection (sentences > words > characters) (not yet implemented)
  • Intended to preserve context across chunk boundaries once overlap is in place
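
Since overlap and boundary detection are marked as not implemented, the following is only a sketch of how the strategy above could work, not the repository's SmartChunker. The function name `chunk_text` and the choice of `". "`, `"! "`, and `"? "` as sentence boundaries are assumptions.

```python
def chunk_text(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into chunks of at most `size` characters.

    Each cut prefers a sentence boundary, then a word boundary, then a hard
    character cut; consecutive chunks share `overlap` characters.
    """
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    chunks: list[str] = []
    start = 0
    while start < len(text):
        end = min(start + size, len(text))
        if end < len(text):
            window = text[start:end]
            # +1 keeps the punctuation mark in the current chunk; 0 means "not found".
            sentence = max(window.rfind(". "), window.rfind("! "), window.rfind("? ")) + 1
            word = window.rfind(" ")
            for cut in (sentence, word):
                if cut > overlap:  # accept only cuts that still make forward progress
                    end = start + cut
                    break
        chunks.append(text[start:end])
        if end == len(text):
            break
        start = end - overlap  # step back so the next chunk repeats the tail
    return chunks
```

Requiring each cut to lie beyond the overlap region guarantees that every iteration advances, so the loop terminates even on text with no spaces or punctuation.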

Framework: FastAPI + Pydantic

  • Type safety with automatic validation
  • OpenAPI documentation generation
  • High performance async capabilities

Installation

Prerequisites

  • Python 3.11+
  • Cohere API key

Setup

  1. Clone the repository
git clone <repository-url>
cd vectorserver
  2. Environment configuration: create a .env file
COHERE_API_KEY=your_cohere_api_key_here
DB_PATH=data/dev.sqlite
  3. Install dependencies
pip install -r requirements.txt
# or with uv (recommended)
uv sync
  4. Initialize the database
# the dev.sqlite database is included in this repository

Usage

Running the API Server

Option 1: Run with Docker

 docker-compose up --build

Option 2: Run in local environment

# Development server with hot reload
uvicorn app.main:app --reload --host 0.0.0.0 --port 8000

# Production server
uvicorn app.main:app --host 0.0.0.0 --port 8000

The API will be available at http://localhost:8000 with interactive docs at /docs.

Running Tests

# Run all tests
pytest

# Run specific test modules
pytest tests/test_db.py -v
pytest tests/test_main.py -v

API Examples

Create a Library

curl -X POST "http://localhost:8000/libraries" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Research Papers",
    "description": "Collection of ML research papers",
    "metadata": {"topic": "machine_learning"}
  }'

Upload and Process Document

curl -X POST "http://localhost:8000/libraries/{library_id}/documents" \
  -H "Content-Type: application/json" \
  -d '{
    "title": "Attention Is All You Need",
    "content": "The dominant sequence transduction models...",
    "metadata": {"authors": ["Vaswani", "Shazeer"], "year": 2017}
  }'

Semantic Search

curl -X POST "http://localhost:8000/search" \
  -H "Content-Type: application/json" \
  -d '{
    "content": "Assiniboine",
    "library_id": "9f9b0b6d-3671-4f9b-a20c-d9e31cc61dba"
  }'
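
The same search call can be made from Python. This sketch uses only the standard library and mirrors the curl request above; the helper names are hypothetical, and because the README does not document the response shape, the decoded JSON is returned as-is.

```python
import json
import urllib.request


def build_search_request(
    content: str, library_id: str, base_url: str = "http://localhost:8000"
) -> urllib.request.Request:
    """Build the POST /search request without sending it."""
    body = json.dumps({"content": content, "library_id": library_id}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/search",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def search(content: str, library_id: str) -> object:
    """Send the request and decode the JSON response (requires a running server)."""
    with urllib.request.urlopen(build_search_request(content, library_id)) as response:
        return json.load(response)
```

Separating request construction from sending makes the payload easy to inspect or test without a live server.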

Project Structure

vectorserver/
├── app/
│   ├── models/           # Pydantic models
│   │   ├── library.py
│   │   ├── document.py
│   │   └── chunk.py
│   ├── routes/           # API endpoints
│   │   ├── libraries.py
│   │   ├── documents.py
│   │   └── search.py
│   ├── repositories/     # Database/indexing operations
│   │   ├── base.py
│   │   ├── library.py
│   │   ├── document.py
│   │   ├── chunk.py
│   │   ├── vector_index.py
│   │   └── db.py
│   ├── embeddings.py     # Cohere embedding integration
│   ├── settings.py       # Configuration
│   └── main.py           # FastAPI app
├── tests/
│   └── *.py
├── data/                 # SQLite database files
└── README.md

Key Features

Vector Search

  • Cosine similarity-based retrieval
  • Configurable result count
  • Cross-document search capabilities
  • Embedding caching for performance

Data Management

  • CRUD operations for libraries, documents, and chunks (not yet complete for every entity)
  • Cascade deletion maintains data integrity
  • JSON metadata storage for flexible schema
  • Timestamp tracking for audit trails

API Features

  • RESTful design with OpenAPI documentation
  • Type-safe request/response models
  • Error handling with detailed messages
  • Async support for high concurrency

License

This project is licensed under the MIT License - see the LICENSE file for details.

About

A toy implementation of a vector DB server with semantic search capabilities
