
Conversation

@vaibhav45sktech commented Jan 3, 2026

This PR updates the CLI to correctly handle custom file extensions and multiple content variants. Previously, there was no CLI syntax for supplying this metadata, forcing users to rely on complex SPARQL queries to exclude unwanted variants. I have implemented a new pipe-separated syntax (e.g., URL|key=value|.ext) that lets users pass metadata directly alongside the artifact URL. The implementation maintains backward compatibility with the existing underscore-separated format (key1=value1_key2=value2) while adding support for the new, more flexible pipe syntax.
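For quick orientation, a minimal sketch of what the new syntax parses into, based on the parse_distribution_str implementation quoted in the review below (the URL and variant values are hypothetical):

# Sketch only: illustrates the pipe-separated syntax; URL and values are
# hypothetical, behavior follows parse_distribution_str as shown in this PR.
from databusclient.cli import parse_distribution_str

spec = "https://example.org/dump|lang=en|type=parsed|.ttl|.gz"
result = parse_distribution_str(spec)
# Expected shape:
# {"url": "https://example.org/dump",
#  "variants": {"lang": "en", "type": "parsed"},
#  "formatExtension": "ttl",   # .ttl is not a known compression, so it is the format
#  "compression": "gz"}        # .gz matches the compression heuristic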

Summary by CodeRabbit

  • New Features

    • Distributions can be supplied as pre-parsed structured data or legacy text; the parser extracts and validates file metadata, variants, format, and compression.
    • CLI classic deploy flow now parses distribution specs before submission.
    • Query improvements to better aggregate and expose content variants.
  • Tests

    • Added tests and constants covering distribution parsing, variant handling, and integration with dataset creation.


issue #32

coderabbitai bot commented Jan 3, 2026

📝 Walkthrough

Adds support for pre-parsed distribution dictionaries alongside legacy distribution strings, updates the CLI to parse distribution strings into dicts before deployment, and introduces a SPARQL query constant plus a helper to parse aggregated content-variant strings.

Changes

  • Core API - Distribution Format Support (databusclient/api/deploy.py): Added _get_file_info_from_dict(dist_dict) to extract file metadata from pre-parsed distribution dicts. Updated create_dataset(...) signatures to accept Union[List[str], List[Dict]] and added branching logic to handle dict vs. string distributions while preserving existing validation and error cases.
  • Queries Module (databusclient/api/queries.py): Added the ONTOLOGIES_QUERY SPARQL constant, which aggregates content variants via GROUP_CONCAT, and parse_content_variants_string(variants_str: str) -> dict to parse that output (supports key=value pairs and standalone values).
  • CLI Distribution Parser (databusclient/cli.py): Added parse_distribution_str(dist_str: str) to parse pipe-separated distribution specs into dicts (url, variants, formatExtension, compression) and wired the classic deploy flow to pass parsed dicts to api_deploy.create_dataset. Emits warnings for unrecognized modifiers.
  • Tests — CLI parsing & integration (tests/test_parse_distribution.py): New comprehensive tests for parse_distribution_str, covering variants parsing, format/compression detection, edge cases, and integration tests that mock _load_file_stats and verify create_dataset behavior with parsed distributions.
  • Tests — Misc (tests/test_deploy.py): Added the module-level constant EXAMPLE_URL for use in tests.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

  • Refactor/project structure #39: Extends and modifies the same deploy API surface (adds support for pre-parsed distribution dicts and changes create_dataset signatures).

Suggested reviewers

  • Integer-Ctrl

Pre-merge checks

❌ Failed checks (1 warning)
  • Description check (⚠️ Warning): The PR description is present and clearly explains the new feature, but it does not follow the required template structure with section headings. Resolution: restructure the description to follow the template by adding a 'Description' section header, a 'Related Issues' section, a 'Type of change' checklist, and a 'Checklist' section with all verification items.
✅ Passed checks (2 passed)
  • Title check (✅ Passed): The title clearly identifies the main change, adding CLI support for content variants and custom extensions, which matches the primary objective of the PR.
  • Docstring Coverage (✅ Passed): Docstring coverage is 94.44%, which meets the required threshold of 80.00%.

coderabbitai bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (2)
tests/test_deploy.py (1)

19-47: Test coverage is good; consider adding a formatExtension assertion in the last test.

The tests comprehensively cover multiple scenarios. However, the final test case (lines 44-46) validates only compression but doesn't verify that formatExtension is None as expected.

🔎 Suggested test completeness improvement
     # Test with compression only (no format extension)
     result = parse_distribution_str("http://example.com/data|.gz")
     assert result["url"] == "http://example.com/data"
     assert result["compression"] == "gz"
+    assert result["formatExtension"] is None
+    assert result["variants"] == {}
databusclient/cli.py (1)

15-57: Fix indentation and consider extracting the compression list.

  1. Line 50: The print statement appears to have inconsistent indentation (5 spaces instead of 4). Verify this matches the project's style.

  2. Line 37: The hardcoded compression extensions list is a pragmatic heuristic, but extracting it as a module-level constant would improve maintainability:

🔎 Suggested improvements
+# Common compression file extensions
+COMPRESSION_EXTENSIONS = {'.gz', '.zip', '.br', '.tar', '.zst'}
+
 def parse_distribution_str(dist_str: str):
     """
     Parses a distribution string with format:
     URL|key=value|...|.extension
     
     Returns a dictionary suitable for the deploy API.
     """
     parts = dist_str.split('|')
     url = parts[0].strip()
     
     variants = {}
     format_ext = None
     compression = None
     
     # Iterate over the modifiers (everything after the URL)
     for part in parts[1:]:
         part = part.strip()
         
         # Case 1: Extension (starts with .)
         if part.startswith('.'):
             # purely heuristic: if it looks like compression (gz, zip, br), treat as compression
             # otherwise treat as format extension
-            if part.lower() in ['.gz', '.zip', '.br', '.tar', '.zst']:
+            if part.lower() in COMPRESSION_EXTENSIONS:
                 compression = part.lstrip('.') # remove leading dot for API compatibility if needed
             else:
                 format_ext = part.lstrip('.')
         
         # Case 2: Content Variant (key=value)
         elif '=' in part:
             key, value = part.split('=', 1)
             variants[key.strip()] = value.strip()
             
         # Case 3: Standalone tag (treat as boolean variant or ignore? 
         # For now, we assume it's a value for a default key or warn)
         else:
-             print(f"WARNING: Unrecognized modifier '{part}' in distribution. Expected '.ext' or 'key=val'.")
+            print(f"WARNING: Unrecognized modifier '{part}' in distribution. Expected '.ext' or 'key=val'.")
📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b2c3f1c and 138a10c.

📒 Files selected for processing (4)
  • databusclient/api/deploy.py
  • databusclient/api/queries.py
  • databusclient/cli.py
  • tests/test_deploy.py
🧰 Additional context used
🧬 Code graph analysis (1)
tests/test_deploy.py (1)
databusclient/cli.py (1)
  • parse_distribution_str (15-57)
🔇 Additional comments (5)
databusclient/api/queries.py (1)

9-51: LGTM: Well-structured SPARQL query with proper variant handling.

The query correctly uses GROUP_CONCAT with OPTIONAL to handle distributions that may lack content variants, and appropriately filters development versions.
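For illustration, a generic sketch of the aggregation-plus-parsing pattern described here; the actual ONTOLOGIES_QUERY and parse_content_variants_string in queries.py are not reproduced and may differ in detail (the comma separator is an assumption):

# Hypothetical sketch of the pattern: GROUP_CONCAT aggregates per-distribution
# variants into one string, which is then split back into a dict client-side.
def parse_variants_sketch(aggregated: str) -> dict:
    """Parse output like 'lang=en,type=parsed,sorted' into a dict.

    key=value pairs become entries; a standalone value is stored as a flag
    here (the real parse_content_variants_string may handle it differently).
    """
    variants = {}
    for token in (t.strip() for t in aggregated.split(",")):
        if not token:
            continue
        if "=" in token:
            key, value = token.split("=", 1)
            variants[key.strip()] = value.strip()
        else:
            variants[token] = True  # standalone value treated as a boolean flag
    return variants

assert parse_variants_sketch("lang=en,type=parsed,sorted") == {
    "lang": "en", "type": "parsed", "sorted": True,
}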

databusclient/api/deploy.py (3)

176-208: LGTM: Correctly extracts metadata from pre-parsed dictionaries.

The function properly handles optional fields and falls back to loading file stats when necessary. The defaults for formatExtension ("file") and compression ("none") align with the legacy inference logic.

Note: The function expects byteSize in the input dict (line 202) but returns content_length, which matches the API entity field naming convention.
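For readers without the diff open, a hedged sketch of the extraction path described above; field names follow this review's notes, and the real _get_file_info_from_dict additionally falls back to loading file stats when hash or size are missing:

from typing import Dict, Optional, Tuple

def file_info_from_dict_sketch(dist: Dict) -> Tuple[str, Dict, str, str, Optional[str], Optional[int]]:
    """Sketch: extract file metadata from a pre-parsed distribution dict."""
    url = dist["url"]
    variants = dist.get("variants") or {}
    format_ext = dist.get("formatExtension") or "file"  # legacy default per review
    compression = dist.get("compression") or "none"     # legacy default per review
    sha256 = dist.get("sha256sum")         # key name is an assumption
    content_length = dist.get("byteSize")  # input key per the note above
    # The real helper would call _load_file_stats(url) here if sha256 or
    # content_length is still None.
    return url, variants, format_ext, compression, sha256, content_length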


304-348: LGTM: Clear documentation of dual-format distributions.

The updated signature and docstring clearly communicate that distributions can be either legacy strings or pre-parsed dictionaries, maintaining backward compatibility.


366-387: LGTM: Clean dual-format handling with proper type discrimination.

The implementation correctly routes dict and string distributions through their respective parsers while maintaining consistent downstream processing.
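As a sketch of that discrimination (helper name hypothetical; the real branching lives inside create_dataset):

from typing import Dict, List, Union

def route_distributions(distributions: Union[List[str], List[Dict]]):
    """Sketch of the dict-vs-string routing; yields which parser would run."""
    for dist in distributions:
        if isinstance(dist, dict):
            yield ("dict", dist)    # real code: _get_file_info_from_dict(dist)
        elif isinstance(dist, str):
            yield ("string", dist)  # real code: the legacy string parser
        else:
            raise TypeError(f"unsupported distribution type: {type(dist).__name__}")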

databusclient/cli.py (1)

130-138: LGTM: Clean integration of distribution parsing into the deploy workflow.

The list comprehension correctly transforms raw distribution strings into structured dictionaries before passing them to the API layer.
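In sketch form (import paths inferred from the file layout above; the keyword-argument names are assumptions, since create_dataset's full signature is not reproduced here):

from databusclient.cli import parse_distribution_str
from databusclient.api import deploy as api_deploy

def deploy_with_parsed_specs(raw_distributions, **dataset_kwargs):
    """Sketch: transform raw pipe-separated specs into dicts, then deploy."""
    parsed = [parse_distribution_str(d) for d in raw_distributions]
    # Assumes create_dataset accepts the parsed dicts via a 'distributions'
    # keyword; the real call site may pass them positionally.
    return api_deploy.create_dataset(distributions=parsed, **dataset_kwargs)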

coderabbitai bot left a comment

Actionable comments posted: 2

♻️ Duplicate comments (1)
databusclient/cli.py (1)

1-12: Past review comment about unused re import appears resolved.

The previous review mentioned an unused re import at line 4, but the current code shows from typing import List at that line. The issue appears to have been addressed.

🧹 Nitpick comments (2)
databusclient/cli.py (2)

14-14: Add return type annotation.

The function should include a return type annotation for better type safety and documentation.

🔎 Proposed fix
+from typing import List, Dict, Any
+
-def parse_distribution_str(dist_str: str):
+def parse_distribution_str(dist_str: str) -> Dict[str, Any]:

49-49: Use click.echo() instead of print() for consistency.

For CLI applications using Click, prefer click.echo() over print() for better cross-platform compatibility and consistency with the rest of the CLI output.

🔎 Proposed fix
-             print(f"WARNING: Unrecognized modifier '{part}' in distribution. Expected '.ext' or 'key=val'.")
+            click.echo(f"WARNING: Unrecognized modifier '{part}' in distribution. Expected '.ext' or 'key=val'.")
📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 138a10c and 3b3c28c.

📒 Files selected for processing (3)
  • databusclient/api/queries.py
  • databusclient/cli.py
  • tests/test_deploy.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • databusclient/api/queries.py
🧰 Additional context used
🧬 Code graph analysis (1)
databusclient/cli.py (1)
databusclient/api/deploy.py (1)
  • create_dataset (304-465)
🔇 Additional comments (1)
tests/test_deploy.py (1)

15-16: LGTM!

Good refactoring to define the example URL as a constant. This improves maintainability and follows DRY principles.

coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (4)
tests/test_parse_distribution.py (4)

92-121: Consider adding a test for case-insensitive compression detection.

The implementation uses .lower() to match compression extensions (as seen in the relevant code snippet), but there's no test verifying that .GZ, .Zip, etc. are correctly detected. This ensures the case-insensitive behavior is maintained.

🔎 Suggested test addition

Add this test to the compression detection section:

def test_compression_case_insensitive(self):
    """Test that compression detection is case-insensitive."""
    # Matching lowercases the extension, but the stored value keeps the
    # original case, so compare case-insensitively.
    result = parse_distribution_str("http://example.com/file|.GZ")
    assert result["compression"].lower() == "gz"

    result = parse_distribution_str("http://example.com/file|.Zip")
    assert result["compression"].lower() == "zip"

21-30: Consider adding a test for URLs containing pipe characters.

The current parsing logic splits on | to separate the URL from modifiers. If a URL contains a literal pipe character (rare, and strictly one that RFC 3986 requires to be percent-encoded), parsing would break. Consider adding a test to document this limitation or suggest URL encoding.

🔎 Suggested test addition

Add this test to the URL extraction section:

def test_url_with_pipe_character(self):
    """Test handling of URLs containing pipe characters (edge case)."""
    # A raw '|' in the URL would be split off as a modifier and break
    # parsing, so pipes should be URL-encoded as %7C in practice.
    url_with_pipe = "http://example.com/data?filter=a%7Cb"  # %7C is a URL-encoded pipe
    result = parse_distribution_str(f"{url_with_pipe}|lang=en")
    assert result["url"] == url_with_pipe

63-121: Consider using parametrized tests to reduce duplication.

The format extension tests (lines 63-87) and compression detection tests (lines 92-121) follow a repetitive pattern. Using pytest.mark.parametrize could make these more maintainable and concise.

🔎 Example parametrized test refactor

Replace the individual format extension tests with:

@pytest.mark.parametrize("extension,expected", [
    ("json", "json"),
    ("ttl", "ttl"),
    ("csv", "csv"),
    ("xml", "xml"),
])
def test_format_extension_detection(self, extension, expected):
    """Test format extension detection for various types."""
    result = parse_distribution_str(f"http://example.com/file|.{extension}")
    assert result["formatExtension"] == expected

Similarly for compression tests:

@pytest.mark.parametrize("compression,expected", [
    ("gz", "gz"),
    ("zip", "zip"),
    ("br", "br"),
    ("tar", "tar"),
    ("zst", "zst"),
])
def test_compression_detection(self, compression, expected):
    """Test compression detection for various types."""
    result = parse_distribution_str(f"http://example.com/file|.{compression}")
    assert result["compression"] == expected

126-146: Consider adding a test for multiple extensions of the same type.

The current implementation would overwrite if multiple format extensions (e.g., |.json|.xml) or multiple compression types (e.g., |.gz|.zip) are specified. A test documenting this behavior would help prevent confusion.

🔎 Suggested test addition

Add to the edge cases section:

def test_multiple_format_extensions_last_wins(self):
    """Test that when multiple format extensions are specified, the last one wins."""
    result = parse_distribution_str("http://example.com/file|.json|.xml")
    # Last extension specified should be used
    assert result["formatExtension"] == "xml"

def test_multiple_compression_types_last_wins(self):
    """Test that when multiple compression types are specified, the last one wins."""
    result = parse_distribution_str("http://example.com/file|.gz|.zip")
    # Last compression specified should be used
    assert result["compression"] == "zip"
📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3b3c28c and e8704ab.

📒 Files selected for processing (1)
  • tests/test_parse_distribution.py
🧰 Additional context used
🧬 Code graph analysis (1)
tests/test_parse_distribution.py (2)
databusclient/cli.py (1)
  • parse_distribution_str (14-56)
databusclient/api/deploy.py (2)
  • create_dataset (304-465)
  • _get_file_info_from_dict (176-208)
🔇 Additional comments (4)
tests/test_parse_distribution.py (4)

1-11: LGTM! Imports are appropriate for the test scope.

The imports correctly include the main function under test (parse_distribution_str) and the deployment API components needed for integration testing. Importing the private function _get_file_info_from_dict is acceptable for integration testing purposes.


151-176: LGTM! Edge cases are well-tested.

The tests properly verify whitespace handling, standalone tag warnings (including both stdout capture and result validation), and the minimal URL-only case. The test at lines 158-166 correctly ensures that unrecognized modifiers trigger a warning and are not added to the variants dict.
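A condensed sketch of that capture pattern (the spec string is hypothetical):

from databusclient.cli import parse_distribution_str

def test_unrecognized_modifier_warns(capsys):
    """Sketch of the stdout-capture check described above."""
    result = parse_distribution_str("http://example.com/data|not-a-modifier")
    captured = capsys.readouterr()
    assert "WARNING" in captured.out                   # warning printed to stdout
    assert "not-a-modifier" not in result["variants"]  # not added as a variant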


181-208: LGTM! Integration tests correctly verify parsed dictionaries work with deploy API.

The mocking strategy is appropriate, using _load_file_stats to avoid actual file operations. The test correctly verifies that:

  • Parsed variants, extensions, and compression are properly extracted
  • Default values ("file" for extension, "none" for compression) are applied when not specified
  • The SHA-256 hash length is correct (64 characters)
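A sketch of that mocking strategy follows; the patch target module and the (sha256, byte_size) return shape of _load_file_stats are assumptions, not confirmed by the diff:

from unittest.mock import patch

from databusclient.api import deploy
from databusclient.cli import parse_distribution_str

def test_parsed_dict_to_file_info_sketch():
    """Sketch of the mocking strategy described above."""
    dist = parse_distribution_str("http://example.com/data|lang=en|.ttl|.gz")
    fake_stats = ("0" * 64, 1234)  # 64-char SHA-256 hex digest, byte size
    with patch.object(deploy, "_load_file_stats", return_value=fake_stats):
        info = deploy._get_file_info_from_dict(dist)
    # Real tests would assert that variants, format and compression from the
    # parsed spec survive into the returned file info.
    assert info is not None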

209-275: LGTM! Comprehensive integration tests verify end-to-end functionality.

These tests effectively verify that:

  • Parsed distribution dictionaries are correctly consumed by create_dataset
  • The generated dataset structure includes proper @context and @graph
  • Distribution fields are correctly mapped, including content variants prefixed with dcv:
  • Multiple distributions with different language variants are properly represented

The approach of using generator expressions with next() (lines 234-237, 266-269) to locate specific graphs is clean and idiomatic.
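The lookup pattern itself, with simplified stand-in data (the real @graph node types, fields, and context URL differ):

# Simplified stand-in data; real Databus graphs use different node types.
dataset = {
    "@context": "https://example.org/context.jsonld",  # hypothetical
    "@graph": [
        {"@id": "#version", "title": "example dataset"},
        {"@id": "#part-en", "dcv:lang": "en"},
        {"@id": "#part-de", "dcv:lang": "de"},
    ],
}

# next() over a generator expression finds the first matching graph node.
en_graph = next(g for g in dataset["@graph"] if g.get("dcv:lang") == "en")
assert en_graph["@id"] == "#part-en"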

@vaibhav45sktech (Author) commented

@Integer-Ctrl Could you please review my PR?
