
Conversation

@vaibhav45sktech commented Jan 3, 2026

This PR updates the CLI to correctly handle custom file extensions and multiple content variants. Previously, there was no CLI syntax for supplying this metadata, forcing users to rely on complex SPARQL queries to exclude unwanted variants. I have implemented a new pipe-separated syntax (e.g., URL|key=value|.ext) that lets users pass metadata directly alongside the artifact URL. The implementation maintains backward compatibility with the existing underscore-separated format (key1=value1_key2=value2) while adding support for the new, more flexible pipe syntax.
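For quick orientation, a minimal sketch of what the new syntax parses into, based on the parse_distribution_str implementation quoted in the review below (the URL and variant values are hypothetical):

# Sketch only: illustrates the pipe-separated syntax; URL and values are
# hypothetical, behavior follows parse_distribution_str as shown in this PR.
from databusclient.cli import parse_distribution_str

spec = "https://example.org/dump|lang=en|type=parsed|.ttl|.gz"
result = parse_distribution_str(spec)
# Expected shape:
# {"url": "https://example.org/dump",
#  "variants": {"lang": "en", "type": "parsed"},
#  "formatExtension": "ttl",   # .ttl is not a known compression, so it is the format
#  "compression": "gz"}        # .gz matches the compression heuristic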

Summary by CodeRabbit

  • New Features

    • Distributions can be supplied as pre-parsed structured data or legacy text; the parser extracts and validates file metadata, variants, format, and compression.
    • CLI classic deploy flow now parses distribution specs before submission.
    • Query improvements to better aggregate and expose content variants.
  • Tests

    • Added tests and constants covering distribution parsing, variant handling, and integration with dataset creation.


issue #32

coderabbitai bot commented Jan 3, 2026

📝 Walkthrough

Adds support for pre-parsed distribution dictionaries alongside legacy distribution strings, updates the CLI to parse distribution strings into dicts before deployment, and introduces a SPARQL query constant plus a helper to parse aggregated content-variant strings.

Changes

  • Core API - Distribution Format Support (databusclient/api/deploy.py): Added _get_file_info_from_dict(dist_dict) to extract file metadata from pre-parsed distribution dicts. Updated create_dataset(...) signatures to accept Union[List[str], List[Dict]] and added branching logic to handle dict vs. string distributions while preserving existing validation and error cases.
  • Queries Module (databusclient/api/queries.py): Added the ONTOLOGIES_QUERY SPARQL constant, which aggregates content variants via GROUP_CONCAT, and parse_content_variants_string(variants_str: str) -> dict to parse that output (supports key=value pairs and standalone values).
  • CLI Distribution Parser (databusclient/cli.py): Added parse_distribution_str(dist_str: str) to parse pipe-separated distribution specs into dicts (url, variants, formatExtension, compression) and wired the classic deploy flow to pass parsed dicts to api_deploy.create_dataset. Emits warnings for unrecognized modifiers.
  • Tests — CLI parsing & integration (tests/test_parse_distribution.py): New comprehensive tests for parse_distribution_str, covering variants parsing, format/compression detection, edge cases, and integration tests that mock _load_file_stats and verify create_dataset behavior with parsed distributions.
  • Tests — Misc (tests/test_deploy.py): Added the module-level constant EXAMPLE_URL for use in tests.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

  • Refactor/project structure #39: Extends and modifies the same deploy API surface (adds support for pre-parsed distribution dicts and changes create_dataset signatures).

Suggested reviewers

  • Integer-Ctrl

Pre-merge checks

❌ Failed checks (1 warning)
  • Description check (⚠️ Warning): The PR description is present and clearly explains the new feature, but it does not follow the required template structure with section headings. Resolution: restructure the description to follow the template by adding a 'Description' section header, a 'Related Issues' section, a 'Type of change' checklist, and a 'Checklist' section with all verification items.
✅ Passed checks (2 passed)
  • Title check (✅ Passed): The title clearly identifies the main change, adding CLI support for content variants and custom extensions, which matches the primary objective of the PR.
  • Docstring Coverage (✅ Passed): Docstring coverage is 94.44%, which meets the required threshold of 80.00%.

coderabbitai bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (2)
tests/test_deploy.py (1)

19-47: Test coverage is good; consider adding a formatExtension assertion in the last test.

The tests comprehensively cover multiple scenarios. However, the final test case (lines 44-46) validates only compression but doesn't verify that formatExtension is None as expected.

🔎 Suggested test completeness improvement
     # Test with compression only (no format extension)
     result = parse_distribution_str("http://example.com/data|.gz")
     assert result["url"] == "http://example.com/data"
     assert result["compression"] == "gz"
+    assert result["formatExtension"] is None
+    assert result["variants"] == {}
databusclient/cli.py (1)

15-57: Fix indentation and consider extracting the compression list.

  1. Line 50: The print statement appears to have inconsistent indentation (5 spaces instead of 4). Verify this matches the project's style.

  2. Line 37: The hardcoded compression extensions list is a pragmatic heuristic, but extracting it as a module-level constant would improve maintainability:

🔎 Suggested improvements
+# Common compression file extensions
+COMPRESSION_EXTENSIONS = {'.gz', '.zip', '.br', '.tar', '.zst'}
+
 def parse_distribution_str(dist_str: str):
     """
     Parses a distribution string with format:
     URL|key=value|...|.extension
     
     Returns a dictionary suitable for the deploy API.
     """
     parts = dist_str.split('|')
     url = parts[0].strip()
     
     variants = {}
     format_ext = None
     compression = None
     
     # Iterate over the modifiers (everything after the URL)
     for part in parts[1:]:
         part = part.strip()
         
         # Case 1: Extension (starts with .)
         if part.startswith('.'):
             # purely heuristic: if it looks like compression (gz, zip, br), treat as compression
             # otherwise treat as format extension
-            if part.lower() in ['.gz', '.zip', '.br', '.tar', '.zst']:
+            if part.lower() in COMPRESSION_EXTENSIONS:
                 compression = part.lstrip('.') # remove leading dot for API compatibility if needed
             else:
                 format_ext = part.lstrip('.')
         
         # Case 2: Content Variant (key=value)
         elif '=' in part:
             key, value = part.split('=', 1)
             variants[key.strip()] = value.strip()
             
         # Case 3: Standalone tag (treat as boolean variant or ignore? 
         # For now, we assume it's a value for a default key or warn)
         else:
-             print(f"WARNING: Unrecognized modifier '{part}' in distribution. Expected '.ext' or 'key=val'.")
+            print(f"WARNING: Unrecognized modifier '{part}' in distribution. Expected '.ext' or 'key=val'.")
📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b2c3f1c and 138a10c.

📒 Files selected for processing (4)
  • databusclient/api/deploy.py
  • databusclient/api/queries.py
  • databusclient/cli.py
  • tests/test_deploy.py
🧰 Additional context used
🧬 Code graph analysis (1)
tests/test_deploy.py (1)
databusclient/cli.py (1)
  • parse_distribution_str (15-57)
🔇 Additional comments (5)
databusclient/api/queries.py (1)

9-51: LGTM: Well-structured SPARQL query with proper variant handling.

The query correctly uses GROUP_CONCAT with OPTIONAL to handle distributions that may lack content variants, and appropriately filters development versions.
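For illustration, a generic sketch of the aggregation-plus-parsing pattern described here; the actual ONTOLOGIES_QUERY and parse_content_variants_string in queries.py are not reproduced and may differ in detail (the comma separator is an assumption):

# Hypothetical sketch of the pattern: GROUP_CONCAT aggregates per-distribution
# variants into one string, which is then split back into a dict client-side.
def parse_variants_sketch(aggregated: str) -> dict:
    """Parse output like 'lang=en,type=parsed,sorted' into a dict.

    key=value pairs become entries; a standalone value is stored as a flag
    here (the real parse_content_variants_string may handle it differently).
    """
    variants = {}
    for token in (t.strip() for t in aggregated.split(",")):
        if not token:
            continue
        if "=" in token:
            key, value = token.split("=", 1)
            variants[key.strip()] = value.strip()
        else:
            variants[token] = True  # standalone value treated as a boolean flag
    return variants

assert parse_variants_sketch("lang=en,type=parsed,sorted") == {
    "lang": "en", "type": "parsed", "sorted": True,
}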

databusclient/api/deploy.py (3)

176-208: LGTM: Correctly extracts metadata from pre-parsed dictionaries.

The function properly handles optional fields and falls back to loading file stats when necessary. The defaults for formatExtension ("file") and compression ("none") align with the legacy inference logic.

Note: The function expects byteSize in the input dict (line 202) but returns content_length, which matches the API entity field naming convention.
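For readers without the diff open, a hedged sketch of the extraction path described above; field names follow this review's notes, and the real _get_file_info_from_dict additionally falls back to loading file stats when hash or size are missing:

from typing import Dict, Optional, Tuple

def file_info_from_dict_sketch(dist: Dict) -> Tuple[str, Dict, str, str, Optional[str], Optional[int]]:
    """Sketch: extract file metadata from a pre-parsed distribution dict."""
    url = dist["url"]
    variants = dist.get("variants") or {}
    format_ext = dist.get("formatExtension") or "file"  # legacy default per review
    compression = dist.get("compression") or "none"     # legacy default per review
    sha256 = dist.get("sha256sum")         # key name is an assumption
    content_length = dist.get("byteSize")  # input key per the note above
    # The real helper would call _load_file_stats(url) here if sha256 or
    # content_length is still None.
    return url, variants, format_ext, compression, sha256, content_length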


304-348: LGTM: Clear documentation of dual-format distributions.

The updated signature and docstring clearly communicate that distributions can be either legacy strings or pre-parsed dictionaries, maintaining backward compatibility.


366-387: LGTM: Clean dual-format handling with proper type discrimination.

The implementation correctly routes dict and string distributions through their respective parsers while maintaining consistent downstream processing.
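As a sketch of that discrimination (helper name hypothetical; the real branching lives inside create_dataset):

from typing import Dict, List, Union

def route_distributions(distributions: Union[List[str], List[Dict]]):
    """Sketch of the dict-vs-string routing; yields which parser would run."""
    for dist in distributions:
        if isinstance(dist, dict):
            yield ("dict", dist)    # real code: _get_file_info_from_dict(dist)
        elif isinstance(dist, str):
            yield ("string", dist)  # real code: the legacy string parser
        else:
            raise TypeError(f"unsupported distribution type: {type(dist).__name__}")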

databusclient/cli.py (1)

130-138: LGTM: Clean integration of distribution parsing into the deploy workflow.

The list comprehension correctly transforms raw distribution strings into structured dictionaries before passing them to the API layer.
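In sketch form (import paths inferred from the file layout above; the keyword-argument names are assumptions, since create_dataset's full signature is not reproduced here):

from databusclient.cli import parse_distribution_str
from databusclient.api import deploy as api_deploy

def deploy_with_parsed_specs(raw_distributions, **dataset_kwargs):
    """Sketch: transform raw pipe-separated specs into dicts, then deploy."""
    parsed = [parse_distribution_str(d) for d in raw_distributions]
    # Assumes create_dataset accepts the parsed dicts via a 'distributions'
    # keyword; the real call site may pass them positionally.
    return api_deploy.create_dataset(distributions=parsed, **dataset_kwargs)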

coderabbitai bot left a comment

Actionable comments posted: 2

♻️ Duplicate comments (1)
databusclient/cli.py (1)

1-12: Past review comment about unused re import appears resolved.

The previous review mentioned an unused re import at line 4, but the current code shows from typing import List at that line. The issue appears to have been addressed.

🧹 Nitpick comments (2)
databusclient/cli.py (2)

14-14: Add return type annotation.

The function should include a return type annotation for better type safety and documentation.

🔎 Proposed fix
+from typing import List, Dict, Any
+
-def parse_distribution_str(dist_str: str):
+def parse_distribution_str(dist_str: str) -> Dict[str, Any]:

49-49: Use click.echo() instead of print() for consistency.

For CLI applications using Click, prefer click.echo() over print() for better cross-platform compatibility and consistency with the rest of the CLI output.

🔎 Proposed fix
-             print(f"WARNING: Unrecognized modifier '{part}' in distribution. Expected '.ext' or 'key=val'.")
+            click.echo(f"WARNING: Unrecognized modifier '{part}' in distribution. Expected '.ext' or 'key=val'.")
📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 138a10c and 3b3c28c.

📒 Files selected for processing (3)
  • databusclient/api/queries.py
  • databusclient/cli.py
  • tests/test_deploy.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • databusclient/api/queries.py
🧰 Additional context used
🧬 Code graph analysis (1)
databusclient/cli.py (1)
databusclient/api/deploy.py (1)
  • create_dataset (304-465)
🔇 Additional comments (1)
tests/test_deploy.py (1)

15-16: LGTM!

Good refactoring to define the example URL as a constant. This improves maintainability and follows DRY principles.

coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (4)
tests/test_parse_distribution.py (4)

92-121: Consider adding a test for case-insensitive compression detection.

The implementation uses .lower() to match compression extensions (as seen in the relevant code snippet), but there's no test verifying that .GZ, .Zip, etc. are correctly detected. This ensures the case-insensitive behavior is maintained.

🔎 Suggested test addition

Add this test to the compression detection section:

def test_compression_case_insensitive(self):
    """Test that compression detection is case-insensitive."""
    # Matching lowercases the extension, but the stored value keeps the
    # original case, so compare case-insensitively.
    result = parse_distribution_str("http://example.com/file|.GZ")
    assert result["compression"].lower() == "gz"

    result = parse_distribution_str("http://example.com/file|.Zip")
    assert result["compression"].lower() == "zip"

21-30: Consider adding a test for URLs containing pipe characters.

The current parsing logic splits on | to separate the URL from modifiers. If a URL contains a literal pipe character (rare, and strictly one that RFC 3986 requires to be percent-encoded), parsing would break. Consider adding a test to document this limitation or suggest URL encoding.

🔎 Suggested test addition

Add this test to the URL extraction section:

def test_url_with_pipe_character(self):
    """Test handling of URLs containing pipe characters (edge case)."""
    # A raw '|' in the URL would be split off as a modifier and break
    # parsing, so pipes should be URL-encoded as %7C in practice.
    url_with_pipe = "http://example.com/data?filter=a%7Cb"  # %7C is a URL-encoded pipe
    result = parse_distribution_str(f"{url_with_pipe}|lang=en")
    assert result["url"] == url_with_pipe

63-121: Consider using parametrized tests to reduce duplication.

The format extension tests (lines 63-87) and compression detection tests (lines 92-121) follow a repetitive pattern. Using pytest.mark.parametrize could make these more maintainable and concise.

🔎 Example parametrized test refactor

Replace the individual format extension tests with:

@pytest.mark.parametrize("extension,expected", [
    ("json", "json"),
    ("ttl", "ttl"),
    ("csv", "csv"),
    ("xml", "xml"),
])
def test_format_extension_detection(self, extension, expected):
    """Test format extension detection for various types."""
    result = parse_distribution_str(f"http://example.com/file|.{extension}")
    assert result["formatExtension"] == expected

Similarly for compression tests:

@pytest.mark.parametrize("compression,expected", [
    ("gz", "gz"),
    ("zip", "zip"),
    ("br", "br"),
    ("tar", "tar"),
    ("zst", "zst"),
])
def test_compression_detection(self, compression, expected):
    """Test compression detection for various types."""
    result = parse_distribution_str(f"http://example.com/file|.{compression}")
    assert result["compression"] == expected

126-146: Consider adding a test for multiple extensions of the same type.

The current implementation would overwrite if multiple format extensions (e.g., |.json|.xml) or multiple compression types (e.g., |.gz|.zip) are specified. A test documenting this behavior would help prevent confusion.

🔎 Suggested test addition

Add to the edge cases section:

def test_multiple_format_extensions_last_wins(self):
    """Test that when multiple format extensions are specified, the last one wins."""
    result = parse_distribution_str("http://example.com/file|.json|.xml")
    # Last extension specified should be used
    assert result["formatExtension"] == "xml"

def test_multiple_compression_types_last_wins(self):
    """Test that when multiple compression types are specified, the last one wins."""
    result = parse_distribution_str("http://example.com/file|.gz|.zip")
    # Last compression specified should be used
    assert result["compression"] == "zip"
📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3b3c28c and e8704ab.

📒 Files selected for processing (1)
  • tests/test_parse_distribution.py
🧰 Additional context used
🧬 Code graph analysis (1)
tests/test_parse_distribution.py (2)
databusclient/cli.py (1)
  • parse_distribution_str (14-56)
databusclient/api/deploy.py (2)
  • create_dataset (304-465)
  • _get_file_info_from_dict (176-208)
🔇 Additional comments (4)
tests/test_parse_distribution.py (4)

1-11: LGTM! Imports are appropriate for the test scope.

The imports correctly include the main function under test (parse_distribution_str) and the deployment API components needed for integration testing. Importing the private function _get_file_info_from_dict is acceptable for integration testing purposes.


151-176: LGTM! Edge cases are well-tested.

The tests properly verify whitespace handling, standalone tag warnings (including both stdout capture and result validation), and the minimal URL-only case. The test at lines 158-166 correctly ensures that unrecognized modifiers trigger a warning and are not added to the variants dict.
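A condensed sketch of that capture pattern (the spec string is hypothetical):

from databusclient.cli import parse_distribution_str

def test_unrecognized_modifier_warns(capsys):
    """Sketch of the stdout-capture check described above."""
    result = parse_distribution_str("http://example.com/data|not-a-modifier")
    captured = capsys.readouterr()
    assert "WARNING" in captured.out                   # warning printed to stdout
    assert "not-a-modifier" not in result["variants"]  # not added as a variant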


181-208: LGTM! Integration tests correctly verify parsed dictionaries work with deploy API.

The mocking strategy is appropriate, using _load_file_stats to avoid actual file operations. The test correctly verifies that:

  • Parsed variants, extensions, and compression are properly extracted
  • Default values ("file" for extension, "none" for compression) are applied when not specified
  • The SHA-256 hash length is correct (64 characters)
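A sketch of that mocking strategy follows; the patch target module and the (sha256, byte_size) return shape of _load_file_stats are assumptions, not confirmed by the diff:

from unittest.mock import patch

from databusclient.api import deploy
from databusclient.cli import parse_distribution_str

def test_parsed_dict_to_file_info_sketch():
    """Sketch of the mocking strategy described above."""
    dist = parse_distribution_str("http://example.com/data|lang=en|.ttl|.gz")
    fake_stats = ("0" * 64, 1234)  # 64-char SHA-256 hex digest, byte size
    with patch.object(deploy, "_load_file_stats", return_value=fake_stats):
        info = deploy._get_file_info_from_dict(dist)
    # Real tests would assert that variants, format and compression from the
    # parsed spec survive into the returned file info.
    assert info is not None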

209-275: LGTM! Comprehensive integration tests verify end-to-end functionality.

These tests effectively verify that:

  • Parsed distribution dictionaries are correctly consumed by create_dataset
  • The generated dataset structure includes proper @context and @graph
  • Distribution fields are correctly mapped, including content variants prefixed with dcv:
  • Multiple distributions with different language variants are properly represented

The approach of using generator expressions with next() (lines 234-237, 266-269) to locate specific graphs is clean and idiomatic.
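The lookup pattern itself, with simplified stand-in data (the real @graph node types, fields, and context URL differ):

# Simplified stand-in data; real Databus graphs use different node types.
dataset = {
    "@context": "https://example.org/context.jsonld",  # hypothetical
    "@graph": [
        {"@id": "#version", "title": "example dataset"},
        {"@id": "#part-en", "dcv:lang": "en"},
        {"@id": "#part-de", "dcv:lang": "de"},
    ],
}

# next() over a generator expression finds the first matching graph node.
en_graph = next(g for g in dataset["@graph"] if g.get("dcv:lang") == "en")
assert en_graph["@id"] == "#part-en"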

@vaibhav45sktech (Author) commented

@Integer-Ctrl Could you please review my PR?
