Open
Labels
kind/enhancement: Issues or changes related to enhancement
Description
Is there an existing issue for this?
- I have searched the existing issues
What would you like to be added?
Introduce a new ScalarCollection type - a first-class collection type that stores only scalar fields without requiring vector fields. This enables normalized data modeling patterns and application-level joins with vector collections.
Currently, Milvus requires every collection to have at least one vector field (enforced in internal/proxy/task.go:391-393). This constraint forces users into denormalized data models when dealing with relational patterns.
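To make "application-level join" concrete, here is a minimal sketch in plain Python. It uses in-memory lists as stand-ins for the rows a scalar-only query and a vector search might return. The names `shop_rows`, `item_hits`, and `join_hits_with_shops` are hypothetical and not part of pymilvus, and the data is made up for illustration.

```python
from typing import Any

# Stand-in for rows returned by a scalar-only query (hypothetical data).
shop_rows: list[dict[str, Any]] = [
    {"shop_id": 1, "shop_name": "Shop A", "shop_status": "active"},
    {"shop_id": 2, "shop_name": "Shop B", "shop_status": "active"},
    {"shop_id": 3, "shop_name": "Shop C", "shop_status": "inactive"},
]

# Stand-in for hits returned by a vector similarity search (hypothetical data).
item_hits: list[dict[str, Any]] = [
    {"item_id": 101, "shop_id": 1, "distance": 0.12},
    {"item_id": 102, "shop_id": 3, "distance": 0.15},
    {"item_id": 103, "shop_id": 2, "distance": 0.21},
]


def join_hits_with_shops(hits, shops):
    """Attach shop metadata to each hit and drop hits from inactive shops."""
    # Index the scalar rows by primary key so each lookup is O(1).
    shops_by_id = {shop["shop_id"]: shop for shop in shops}
    joined = []
    for hit in hits:
        shop = shops_by_id.get(hit["shop_id"])
        if shop is None or shop["shop_status"] != "active":
            continue
        joined.append({**hit, "shop_name": shop["shop_name"]})
    return joined


joined_hits = join_hits_with_shops(item_hits, shop_rows)
```

The join key (`shop_id`) is the only shop attribute the item rows need to carry; everything else stays in one place on the scalar side.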
Proposed API Design
Introduce ScalarCollection as a distinct type alongside Collection:
from pymilvus import ScalarCollection, ScalarCollectionSchema, FieldSchema, DataType
# Create scalar collection - NO vector fields required
shop_schema = ScalarCollectionSchema([
    FieldSchema("shop_id", DataType.INT64, is_primary=True),
    FieldSchema("shop_name", DataType.VARCHAR, max_length=256),
    FieldSchema("shop_status", DataType.VARCHAR, max_length=50),
    FieldSchema("shop_rating", DataType.FLOAT),
    FieldSchema("shop_region", DataType.VARCHAR, max_length=100),
], description="Shop metadata")
# ScalarCollection - distinct from Collection
shop_metadata = ScalarCollection("shop_metadata", schema=shop_schema)
# Supported operations
shop_metadata.insert([
    [1, 2, 3],                         # shop_id
    ["Shop A", "Shop B", "Shop C"],    # shop_name
    ["active", "active", "inactive"],  # shop_status
    [4.5, 4.8, 3.2],                   # shop_rating
    ["US-West", "US-East", "EU"],      # shop_region
])
results = shop_metadata.query(
    expr="shop_status == 'active' AND shop_rating > 4.0",
    output_fields=["shop_id", "shop_name"]
)
# search() is not available on ScalarCollection
# shop_metadata.search(...)  # AttributeError: ScalarCollection has no attribute 'search'
Why is this needed?
No response
Anything else?
Real-World Problem: E-commerce Recommendation Platform
Entity Relationships:
- Shop → Items (1:many)
- Merchant → Items (1:many)
- Consumer Profile → Items (1:many)
- Items have embeddings for similarity search
Frequently Changing Metadata:
- Shop: status (active/inactive), rating, region, promotion_tier
- Merchant: verification_status, reputation_score, fulfillment_rate
- Consumer: preferred_categories, price_range, dietary_restrictions, style_preferences
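As an illustration of the normalized shape these relationships suggest, the sketch below models a shop and its items as plain Python dataclasses. The class names, the `set_shop_status` helper, and the sample values (including the `"gold"` tier) are hypothetical; this is not a Milvus schema, only a picture of where each field would live.

```python
from dataclasses import dataclass


@dataclass
class Shop:
    shop_id: int
    status: str          # "active" or "inactive"
    rating: float
    region: str
    promotion_tier: str


@dataclass
class Item:
    item_id: int
    shop_id: int         # reference to Shop (Shop -> Items is 1:many)
    merchant_id: int     # reference to a merchant record (1:many)
    embedding: list[float]


def set_shop_status(shops: dict[int, Shop], shop_id: int, status: str) -> int:
    """Update one shop record and return the number of records written."""
    shops[shop_id].status = status
    return 1


shops = {1: Shop(1, "active", 4.5, "US-West", "gold")}
items = [Item(i, shop_id=1, merchant_id=7, embedding=[0.0, 0.0]) for i in range(3)]

# Changing the shop's status touches one record, however many items it has.
records_written = set_shop_status(shops, 1, "inactive")
```

Each item carries only the foreign keys, so frequently changing metadata is written once and read through the reference.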
Current Forced Solution: Painful Denormalization
# CURRENT: Must denormalize everything into items collection
from pymilvus import CollectionSchema, FieldSchema, DataType

items_collection = CollectionSchema([
    FieldSchema("item_id", DataType.INT64, is_primary=True),
    FieldSchema("item_embedding", DataType.FLOAT_VECTOR, dim=768),  # Required!
    # Shop metadata - DUPLICATED across all items from same shop
    FieldSchema("shop_id", DataType.INT64),
    FieldSchema("shop_status", DataType.VARCHAR, max_length=50),
    FieldSchema("shop_rating", DataType.FLOAT),
    # Merchant metadata - DUPLICATED across all items from same merchant
    FieldSchema("merchant_id", DataType.INT64),
    FieldSchema("merchant_verification", DataType.VARCHAR, max_length=50),
    # Consumer preferences? NO GOOD SOLUTION!
    # Option 1: Dummy embeddings (wasteful)
    # Option 2: External database (loses Milvus query capabilities)
    # Option 3: Denormalize into interactions (massive duplication)
])
Pain Points with Denormalization
- Update Inefficiency:
  - Shop status change: 10,000 upsert operations (not 1)
  - Merchant reputation update: 50,000 upsert operations (not 1)
  - Consumer preference change: 500 upsert operations (not 1)
  - Batch update (20 shops in a region): 200,000 operations instead of 20
- Storage Overhead:
  - 100 shops × 10,000 items × 200 bytes = 200 MB redundant shop data
  - 50 merchants × 20,000 items × 150 bytes = 150 MB redundant merchant data
  - 10,000 users × 500 interactions × 100 bytes = 500 MB redundant consumer data
  - Total waste: ~850 MB of duplicated metadata
- Forced Vector Creation:
  - Consumer preferences don't naturally have embeddings
  - Shop/merchant metadata doesn't need vector similarity
  - Forced to create dummy vectors or lose Milvus benefits
- Data Consistency Issues:
  - Temporal inconsistency during bulk updates (half updated, half stale)
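The update and storage figures above follow from simple multiplication. A short sketch that reproduces them, assuming decimal megabytes (1 MB = 1,000,000 bytes) and the per-row sizes quoted in the list; the helper names are hypothetical:

```python
MB = 1_000_000  # decimal megabytes, an assumption that matches the quoted totals


def upserts_needed(entities_changed: int, items_per_entity: int) -> int:
    """Rows rewritten when metadata is copied onto every item row."""
    return entities_changed * items_per_entity


def redundant_mb(entities: int, rows_per_entity: int, bytes_per_row: int) -> float:
    """Duplicated metadata, in MB, across all rows that carry a copy."""
    return entities * rows_per_entity * bytes_per_row / MB


# Update fan-out with 10,000 items per shop
single_shop_upserts = upserts_needed(1, 10_000)
regional_batch_upserts = upserts_needed(20, 10_000)

# Storage overhead, using the figures from the list
shop_mb = redundant_mb(100, 10_000, 200)
merchant_mb = redundant_mb(50, 20_000, 150)
consumer_mb = redundant_mb(10_000, 500, 100)
total_mb = shop_mb + merchant_mb + consumer_mb

print(single_shop_upserts, regional_batch_upserts, total_mb)
```

With a separate scalar collection, the same shop-level changes would be 1 and 20 writes respectively, since each shop's metadata would exist in a single row.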