
[Enhancement]: metadata management #46153

@lyang24


Is there an existing issue for this?

  • I have searched the existing issues

What would you like to be added?

Introduce a new ScalarCollection type: a first-class collection type that stores only scalar fields and does not require a vector field. This enables normalized data modeling and application-level joins with vector collections.

Currently, Milvus requires every collection to have at least one vector field (enforced in internal/proxy/task.go:391-393). This constraint forces users into denormalized data models when dealing with relational patterns.

Proposed API Design

Introduce ScalarCollection as a distinct type alongside Collection:

from pymilvus import ScalarCollection, ScalarCollectionSchema, FieldSchema, DataType

# Create scalar collection - NO vector fields required
shop_schema = ScalarCollectionSchema([
    FieldSchema("shop_id", DataType.INT64, is_primary=True),
    FieldSchema("shop_name", DataType.VARCHAR, max_length=256),
    FieldSchema("shop_status", DataType.VARCHAR, max_length=50),
    FieldSchema("shop_rating", DataType.FLOAT),
    FieldSchema("shop_region", DataType.VARCHAR, max_length=100),
], description="Shop metadata")

# ScalarCollection - distinct from Collection
shop_metadata = ScalarCollection("shop_metadata", schema=shop_schema)

# Supported operations
shop_metadata.insert([[1, 2, 3],
                      ["Shop A", "Shop B", "Shop C"],
                      ["active", "active", "inactive"],
                      [4.5, 4.8, 3.2],
                      ["US-West", "US-East", "EU"]])

results = shop_metadata.query(
    expr="shop_status == 'active' AND shop_rating > 4.0",
    output_fields=["shop_id", "shop_name"]
)

# search() is intentionally not available on ScalarCollection
# shop_metadata.search(...)  # AttributeError: ScalarCollection has no attribute 'search'
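
To illustrate what "application-level join" means here, below is a minimal sketch in plain Python. It assumes the item hits come from a regular vector `search()` and the shop rows from a `ScalarCollection.query()` as proposed above. Both are replaced by hard-coded lists, and `join_hits_with_metadata` is a hypothetical helper, not part of pymilvus.

```python
def join_hits_with_metadata(hits, shops, allowed_status="active"):
    """Attach shop metadata to search hits; drop hits whose shop is missing or filtered out."""
    shops_by_id = {shop["shop_id"]: shop for shop in shops}
    joined = []
    for hit in hits:
        shop = shops_by_id.get(hit["shop_id"])
        if shop is None or shop["shop_status"] != allowed_status:
            continue
        joined.append({
            **hit,
            "shop_name": shop["shop_name"],
            "shop_rating": shop["shop_rating"],
        })
    return joined


# Hypothetical stand-in for vector search results on the items collection.
item_hits = [
    {"item_id": 101, "shop_id": 1, "distance": 0.12},
    {"item_id": 102, "shop_id": 3, "distance": 0.15},
    {"item_id": 103, "shop_id": 2, "distance": 0.21},
]

# Hypothetical stand-in for rows returned by shop_metadata.query(...).
shop_rows = [
    {"shop_id": 1, "shop_name": "Shop A", "shop_status": "active", "shop_rating": 4.5},
    {"shop_id": 2, "shop_name": "Shop B", "shop_status": "active", "shop_rating": 4.8},
    {"shop_id": 3, "shop_name": "Shop C", "shop_status": "inactive", "shop_rating": 3.2},
]

joined = join_hits_with_metadata(item_hits, shop_rows)
for row in joined:
    print(row["item_id"], row["shop_name"], row["shop_rating"])
```

The join preserves the ranking order of the search hits, so filtering by shop status happens after retrieval. In practice the caller would probably over-fetch from the vector search to compensate for hits that get dropped.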

Why is this needed?

No response

Anything else?

Real-World Problem: E-commerce Recommendation Platform

Entity Relationships:

  • Shop → Items (1:many)
  • Merchant → Items (1:many)
  • Consumer Profile → Items (1:many)
  • Items have embeddings for similarity search

Frequently Changing Metadata:

  • Shop: status (active/inactive), rating, region, promotion_tier
  • Merchant: verification_status, reputation_score, fulfillment_rate
  • Consumer: preferred_categories, price_range, dietary_restrictions, style_preferences

Current Forced Solution: Painful Denormalization

from pymilvus import CollectionSchema, FieldSchema, DataType

# CURRENT: Must denormalize everything into the items collection schema
items_schema = CollectionSchema([
    FieldSchema("item_id", DataType.INT64, is_primary=True),
    FieldSchema("item_embedding", DataType.FLOAT_VECTOR, dim=768),  # Required!

    # Shop metadata - DUPLICATED across all items from same shop
    FieldSchema("shop_id", DataType.INT64),
    FieldSchema("shop_status", DataType.VARCHAR, max_length=50),
    FieldSchema("shop_rating", DataType.FLOAT),

    # Merchant metadata - DUPLICATED across all items from same merchant
    FieldSchema("merchant_id", DataType.INT64),
    FieldSchema("merchant_verification", DataType.VARCHAR, max_length=50),

    # Consumer preferences? NO GOOD SOLUTION!
    # Option 1: Dummy embeddings (wasteful)
    # Option 2: External database (loses Milvus query capabilities)
    # Option 3: Denormalize into interactions (massive duplication)
])
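
To make the cost of this duplication concrete, here is a small in-memory sketch (plain Python, no Milvus involved) that counts how many rows a single shop status change touches in each layout. The helper names and the 3 shops × 1,000 items sizing are illustrative assumptions, not measurements.

```python
def build_denormalized_items(num_shops, items_per_shop):
    """Every item row carries its own copy of the shop status."""
    return [
        {"item_id": shop_id * items_per_shop + i, "shop_id": shop_id, "shop_status": "active"}
        for shop_id in range(num_shops)
        for i in range(items_per_shop)
    ]


def set_status_denormalized(items, shop_id, status):
    """Rewrite the status on every item of the shop; return the number of rows touched."""
    touched = 0
    for row in items:
        if row["shop_id"] == shop_id:
            row["shop_status"] = status
            touched += 1
    return touched


def set_status_normalized(shops, shop_id, status):
    """Rewrite the single shop row; items reference it by shop_id only."""
    shops[shop_id]["shop_status"] = status
    return 1


items = build_denormalized_items(num_shops=3, items_per_shop=1000)
shops = {shop_id: {"shop_status": "active"} for shop_id in range(3)}

denorm_writes = set_status_denormalized(items, shop_id=1, status="inactive")
norm_writes = set_status_normalized(shops, shop_id=1, status="inactive")
print(denorm_writes, norm_writes)  # 1000 1
```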

Pain Points with Denormalization

  1. Update Inefficiency:

    • Shop status change: 10,000 upsert operations (not 1)
    • Merchant reputation update: 50,000 upsert operations (not 1)
    • Consumer preference change: 500 upsert operations (not 1)
    • Batch update (20 shops in a region): 200,000 operations instead of 20
  2. Storage Overhead:

    • 100 shops × 10,000 items × 200 bytes = 200 MB redundant shop data
    • 50 merchants × 20,000 items × 150 bytes = 150 MB redundant merchant data
    • 10,000 users × 500 interactions × 100 bytes = 500 MB redundant consumer data
    • Total waste: ~850 MB of duplicated metadata
  3. Forced Vector Creation:

    • Consumer preferences don't naturally have embeddings
    • Shop/merchant metadata doesn't need vector similarity
    • Forced to create dummy vectors or lose Milvus benefits
  4. Data Consistency Issues:

    • Temporal inconsistency during bulk updates (half updated, half stale)
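
The storage and batch-update figures above can be reproduced with a few lines of arithmetic. This sketch uses decimal megabytes (1 MB = 1,000,000 bytes) and counts every duplicated copy, as the estimates in the list do. The per-row byte sizes are the rough assumptions stated there.

```python
MB = 1_000_000  # decimal megabytes, matching the estimates in this issue


def duplicated_bytes(owners, rows_per_owner, bytes_per_row):
    """Total size of metadata copied onto every dependent row."""
    return owners * rows_per_owner * bytes_per_row


shop_mb = duplicated_bytes(100, 10_000, 200) / MB
merchant_mb = duplicated_bytes(50, 20_000, 150) / MB
consumer_mb = duplicated_bytes(10_000, 500, 100) / MB
total_mb = shop_mb + merchant_mb + consumer_mb

# Batch update: 20 shops, each with 10,000 denormalized item rows.
batch_upserts = 20 * 10_000

print(shop_mb, merchant_mb, consumer_mb, total_mb)  # 200.0 150.0 500.0 850.0
print(batch_upserts)  # 200000
```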

    Labels

    kind/enhancement (issues or changes related to enhancement)