feat: Enable CLI support for content variants and custom extensions #46
Conversation
📝 Walkthrough

Adds support for pre-parsed distribution dictionaries alongside legacy distribution strings, updates the CLI to parse distribution strings into dicts before deployment, and introduces a SPARQL query constant plus a helper to parse aggregated content-variant strings.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
Actionable comments posted: 2
🧹 Nitpick comments (2)
tests/test_deploy.py (1)
19-47: Test coverage is good; consider adding a `formatExtension` assertion in the last test.

The tests comprehensively cover multiple scenarios. However, the final test case (lines 44-46) validates only `compression` but doesn't verify that `formatExtension` is `None` as expected.

🔎 Suggested test completeness improvement

```diff
  # Test with compression only (no format extension)
  result = parse_distribution_str("http://example.com/data|.gz")
  assert result["url"] == "http://example.com/data"
  assert result["compression"] == "gz"
+ assert result["formatExtension"] is None
+ assert result["variants"] == {}
```

databusclient/cli.py (1)
15-57: Fix indentation and consider extracting the compression list.

Line 50: The warning `print` is mis-indented relative to its `else:` branch; align it with the enclosing block.

Line 37: The hardcoded compression extensions list is a pragmatic heuristic, but extracting it as a module-level constant would improve maintainability:

🔎 Suggested improvements

```diff
+# Common compression file extensions
+COMPRESSION_EXTENSIONS = {'.gz', '.zip', '.br', '.tar', '.zst'}
+
 def parse_distribution_str(dist_str: str):
     """
     Parses a distribution string with format: URL|key=value|...|.extension
     Returns a dictionary suitable for the deploy API.
     """
     parts = dist_str.split('|')
     url = parts[0].strip()
     variants = {}
     format_ext = None
     compression = None

     # Iterate over the modifiers (everything after the URL)
     for part in parts[1:]:
         part = part.strip()

         # Case 1: Extension (starts with .)
         if part.startswith('.'):
             # purely heuristic: if it looks like compression (gz, zip, br), treat as compression
             # otherwise treat as format extension
-            if part.lower() in ['.gz', '.zip', '.br', '.tar', '.zst']:
+            if part.lower() in COMPRESSION_EXTENSIONS:
                 compression = part.lstrip('.')  # remove leading dot for API compatibility if needed
             else:
                 format_ext = part.lstrip('.')

         # Case 2: Content Variant (key=value)
         elif '=' in part:
             key, value = part.split('=', 1)
             variants[key.strip()] = value.strip()

         # Case 3: Standalone tag (treat as boolean variant or ignore?
         # For now, we assume it's a value for a default key or warn)
         else:
-        print(f"WARNING: Unrecognized modifier '{part}' in distribution. Expected '.ext' or 'key=val'.")
+            print(f"WARNING: Unrecognized modifier '{part}' in distribution. Expected '.ext' or 'key=val'.")
```
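For reference, the parsing rules discussed above can be exercised with a standalone sketch. This is a simplified re-implementation (it omits the warning branch for standalone tags), written here only so the example runs without installing databusclient; the real function is `parse_distribution_str` in `databusclient/cli.py`:

```python
# Simplified sketch of the distribution-string parsing heuristic described
# in the review: URL first, then pipe-separated modifiers that are either
# key=value variants or dot-prefixed extensions.
COMPRESSION_EXTENSIONS = {'.gz', '.zip', '.br', '.tar', '.zst'}

def parse_distribution_str(dist_str: str) -> dict:
    parts = dist_str.split('|')
    result = {"url": parts[0].strip(), "variants": {},
              "formatExtension": None, "compression": None}
    for part in (p.strip() for p in parts[1:]):
        if part.startswith('.'):
            # heuristic: known compression suffixes vs. format extensions
            key = "compression" if part.lower() in COMPRESSION_EXTENSIONS else "formatExtension"
            result[key] = part.lstrip('.')
        elif '=' in part:
            k, v = part.split('=', 1)
            result["variants"][k.strip()] = v.strip()
    return result

parsed = parse_distribution_str("http://example.com/data.csv|lang=en|.csv|.gz")
print(parsed["url"])              # http://example.com/data.csv
print(parsed["variants"])         # {'lang': 'en'}
print(parsed["formatExtension"])  # csv
print(parsed["compression"])      # gz
```

Because the loop simply assigns on each match, a later extension overwrites an earlier one of the same kind, which is the "last wins" behavior the test suggestions below the fold rely on.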
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- databusclient/api/deploy.py
- databusclient/api/queries.py
- databusclient/cli.py
- tests/test_deploy.py
🧰 Additional context used
🧬 Code graph analysis (1)
tests/test_deploy.py (1)
databusclient/cli.py (1)
parse_distribution_str(15-57)
🔇 Additional comments (5)
databusclient/api/queries.py (1)
9-51: LGTM: Well-structured SPARQL query with proper variant handling.

The query correctly uses GROUP_CONCAT with OPTIONAL to handle distributions that may lack content variants, and appropriately filters development versions.
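Since GROUP_CONCAT collapses each distribution's variants into a single string, the companion helper mentioned in the walkthrough has to split that string back apart. A hedged sketch of that idea (the separator character and the `key=value` shape are assumptions here; the actual helper in `databusclient/api/queries.py` may differ):

```python
# Hypothetical parser for a GROUP_CONCAT-aggregated content-variant string,
# e.g. "lang=en,type=parsed" -> {"lang": "en", "type": "parsed"}.
def parse_aggregated_variants(aggregated: str, sep: str = ",") -> dict:
    variants = {}
    for pair in aggregated.split(sep):
        pair = pair.strip()
        if "=" in pair:  # skip empty results from OPTIONAL-less rows
            key, value = pair.split("=", 1)
            variants[key] = value
    return variants

print(parse_aggregated_variants("lang=en,type=parsed"))
# {'lang': 'en', 'type': 'parsed'}
```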
databusclient/api/deploy.py (3)
176-208: LGTM: Correctly extracts metadata from pre-parsed dictionaries.

The function properly handles optional fields and falls back to loading file stats when necessary. The defaults for `formatExtension` ("file") and `compression` ("none") align with the legacy inference logic.

Note: The function expects `byteSize` in the input dict (line 202) but returns `content_length`, which matches the API entity field naming convention.
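A minimal sketch of the extraction behavior described in this comment, assuming only the defaults and the `byteSize`-to-`content_length` rename mentioned above (the real `_get_file_info_from_dict` also falls back to loading file stats, which is omitted here):

```python
# Hedged sketch: defaults applied for missing fields, and the dict's
# "byteSize" returned under the API-side name content_length.
def get_file_info_from_dict(dist: dict):
    format_extension = dist.get("formatExtension") or "file"
    compression = dist.get("compression") or "none"
    content_length = dist.get("byteSize")  # API entity naming convention
    return format_extension, compression, content_length

print(get_file_info_from_dict({"url": "http://example.com/x", "byteSize": 1024}))
# ('file', 'none', 1024)
```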
304-348: LGTM: Clear documentation of dual-format distributions.

The updated signature and docstring clearly communicate that distributions can be either legacy strings or pre-parsed dictionaries, maintaining backward compatibility.
366-387: LGTM: Clean dual-format handling with proper type discrimination.

The implementation correctly routes dict and string distributions through their respective parsers while maintaining consistent downstream processing.
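The type discrimination described here boils down to an `isinstance` check. A sketch of the pattern (both function names are hypothetical stand-ins; the real routing lives in `databusclient/api/deploy.py`):

```python
# Sketch: dict distributions pass through as-is (already pre-parsed by the
# CLI); legacy strings are sent through a string parser first.
def parse_legacy_string(dist_str: str) -> dict:
    # stand-in for the legacy string parser; only the URL part is extracted
    return {"url": dist_str.split("|")[0]}

def normalize_distribution(dist) -> dict:
    if isinstance(dist, dict):
        return dist  # pre-parsed by the CLI
    return parse_legacy_string(dist)

print(normalize_distribution("http://example.com/b|.gz"))
# {'url': 'http://example.com/b'}
```

The benefit of routing here, rather than inside each downstream function, is that everything after this point can assume a uniform dict shape.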
databusclient/cli.py (1)
130-138: LGTM: Clean integration of distribution parsing into the deploy workflow.

The list comprehension correctly transforms raw distribution strings into structured dictionaries before passing them to the API layer.
Actionable comments posted: 2
♻️ Duplicate comments (1)
databusclient/cli.py (1)
1-12: Past review comment about unused `re` import appears resolved.

The previous review mentioned an unused `re` import at line 4, but the current code shows `from typing import List` at that line. The issue appears to have been addressed.
🧹 Nitpick comments (2)
databusclient/cli.py (2)
14-14: Add return type annotation.

The function should include a return type annotation for better type safety and documentation.

🔎 Proposed fix

```diff
+from typing import List, Dict, Any
+
-def parse_distribution_str(dist_str: str):
+def parse_distribution_str(dist_str: str) -> Dict[str, Any]:
```
49-49: Use `click.echo()` instead of `print()` for consistency.

For CLI applications using Click, prefer `click.echo()` over `print()` for better cross-platform compatibility and consistency with the rest of the CLI output.

🔎 Proposed fix

```diff
-    print(f"WARNING: Unrecognized modifier '{part}' in distribution. Expected '.ext' or 'key=val'.")
+    click.echo(f"WARNING: Unrecognized modifier '{part}' in distribution. Expected '.ext' or 'key=val'.")
```
📜 Review details
📒 Files selected for processing (3)
- databusclient/api/queries.py
- databusclient/cli.py
- tests/test_deploy.py
🚧 Files skipped from review as they are similar to previous changes (1)
- databusclient/api/queries.py
🧰 Additional context used
🧬 Code graph analysis (1)
databusclient/cli.py (1)
databusclient/api/deploy.py (1)
create_dataset(304-465)
🔇 Additional comments (1)
tests/test_deploy.py (1)
15-16: LGTM!

Good refactoring to define the example URL as a constant. This improves maintainability and follows DRY principles.
Actionable comments posted: 0
🧹 Nitpick comments (4)
tests/test_parse_distribution.py (4)
92-121: Consider adding a test for case-insensitive compression detection.

The implementation uses `.lower()` to match compression extensions (as seen in the relevant code snippet), but there's no test verifying that `.GZ`, `.Zip`, etc. are correctly detected. This ensures the case-insensitive behavior is maintained.

🔎 Suggested test addition

Add this test to the compression detection section:

```python
def test_compression_case_insensitive(self):
    """Test that compression detection is case-insensitive."""
    result = parse_distribution_str("http://example.com/file|.GZ")
    assert result["compression"] == "gz"

    result = parse_distribution_str("http://example.com/file|.Zip")
    assert result["compression"] == "zip"
```
21-30: Consider adding a test for URLs containing pipe characters.

The current parsing logic splits on `|` to separate the URL from modifiers. If a URL contains a pipe character (though rare, it's valid in URLs), the parsing would break. Consider adding a test to document this limitation or suggest URL encoding.

🔎 Suggested test addition

Add this test to the URL extraction section:

```python
def test_url_with_pipe_character(self):
    """Test handling of URLs containing pipe characters (edge case)."""
    # URLs can technically contain pipe characters
    # This test documents current behavior (would break parsing)
    # Consider URL encoding pipes as %7C in practice
    url_with_pipe = "http://example.com/data?filter=a%7Cb"  # %7C is URL-encoded pipe
    result = parse_distribution_str(f"{url_with_pipe}|lang=en")
    assert result["url"] == url_with_pipe
```
63-121: Consider using parametrized tests to reduce duplication.

The format extension tests (lines 63-87) and compression detection tests (lines 92-121) follow a repetitive pattern. Using `pytest.mark.parametrize` could make these more maintainable and concise.

🔎 Example parametrized test refactor

Replace the individual format extension tests with:

```python
@pytest.mark.parametrize("extension,expected", [
    ("json", "json"),
    ("ttl", "ttl"),
    ("csv", "csv"),
    ("xml", "xml"),
])
def test_format_extension_detection(self, extension, expected):
    """Test format extension detection for various types."""
    result = parse_distribution_str(f"http://example.com/file|.{extension}")
    assert result["formatExtension"] == expected
```

Similarly for compression tests:

```python
@pytest.mark.parametrize("compression,expected", [
    ("gz", "gz"),
    ("zip", "zip"),
    ("br", "br"),
    ("tar", "tar"),
    ("zst", "zst"),
])
def test_compression_detection(self, compression, expected):
    """Test compression detection for various types."""
    result = parse_distribution_str(f"http://example.com/file|.{compression}")
    assert result["compression"] == expected
```
126-146: Consider adding a test for multiple extensions of the same type.

The current implementation would overwrite if multiple format extensions (e.g., `|.json|.xml`) or multiple compression types (e.g., `|.gz|.zip`) are specified. A test documenting this behavior would help prevent confusion.

🔎 Suggested test addition

Add to the edge cases section:

```python
def test_multiple_format_extensions_last_wins(self):
    """Test that when multiple format extensions are specified, the last one wins."""
    result = parse_distribution_str("http://example.com/file|.json|.xml")
    # Last extension specified should be used
    assert result["formatExtension"] == "xml"

def test_multiple_compression_types_last_wins(self):
    """Test that when multiple compression types are specified, the last one wins."""
    result = parse_distribution_str("http://example.com/file|.gz|.zip")
    # Last compression specified should be used
    assert result["compression"] == "zip"
```
📜 Review details
📒 Files selected for processing (1)
tests/test_parse_distribution.py
🧰 Additional context used
🧬 Code graph analysis (1)
tests/test_parse_distribution.py (2)
databusclient/cli.py (1)
parse_distribution_str(14-56)

databusclient/api/deploy.py (2)

create_dataset(304-465)
_get_file_info_from_dict(176-208)
🔇 Additional comments (4)
tests/test_parse_distribution.py (4)
1-11: LGTM! Imports are appropriate for the test scope.

The imports correctly include the main function under test (`parse_distribution_str`) and the deployment API components needed for integration testing. Importing the private function `_get_file_info_from_dict` is acceptable for integration testing purposes.
151-176: LGTM! Edge cases are well-tested.

The tests properly verify whitespace handling, standalone tag warnings (including both stdout capture and result validation), and the minimal URL-only case. The test at lines 158-166 correctly ensures that unrecognized modifiers trigger a warning and are not added to the variants dict.
181-208: LGTM! Integration tests correctly verify parsed dictionaries work with deploy API.

The mocking strategy is appropriate, using `_load_file_stats` to avoid actual file operations. The test correctly verifies that:
- Parsed variants, extensions, and compression are properly extracted
- Default values ("file" for extension, "none" for compression) are applied when not specified
- The SHA-256 hash length is correct (64 characters)
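The 64-character expectation in the last point follows directly from SHA-256's digest size: 32 bytes, each rendered as two hex characters.

```python
import hashlib

# SHA-256 yields a 32-byte digest; hexdigest() renders each byte as two
# hex characters, so any input hashes to a 64-character string.
digest = hashlib.sha256(b"any file content").hexdigest()
print(len(digest))  # 64
```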
209-275: LGTM! Comprehensive integration tests verify end-to-end functionality.

These tests effectively verify that:

- Parsed distribution dictionaries are correctly consumed by `create_dataset`
- The generated dataset structure includes proper @context and @graph
- Distribution fields are correctly mapped, including content variants prefixed with `dcv:`
- Multiple distributions with different language variants are properly represented
The approach of using generator expressions with `next()` (lines 234-237, 266-269) to locate specific graphs is clean and idiomatic.
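The pattern praised here can be illustrated on a toy `@graph` list (the graph entries below are illustrative, not the real dataset structure):

```python
# next() over a generator expression finds the first matching graph without
# building an intermediate list; it raises StopIteration if nothing matches,
# which makes missing graphs fail loudly in tests.
graphs = [
    {"@id": "http://example.com/ds", "@type": "Dataset"},
    {"@id": "http://example.com/ds#dist", "@type": "Part"},
]
dataset_graph = next(g for g in graphs if g.get("@type") == "Dataset")
print(dataset_graph["@id"])  # http://example.com/ds
```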
@Integer-Ctrl Could you please review my PR?
This PR updates the CLI to correctly handle custom file extensions and multiple content variants. Previously, there was no syntax to submit these specifics via the CLI, forcing users to rely on complex SPARQL queries to exclude unwanted variants. I have implemented a new pipe-separated syntax (e.g., `URL|key=value|.ext`) that allows users to pass metadata directly with the artifact URL. This implementation maintains backward compatibility with the existing underscore-separated format (`key1=value1_key2=value2`) while adding support for the new, more flexible pipe syntax.

Summary by CodeRabbit
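To make the two encodings concrete, here is a small sketch contrasting them. The URLs are made up, and only the variant segment of the legacy format is shown, since the full legacy distribution layout is not reproduced in this PR description:

```python
# Pipe syntax: URL first, then key=value variants and dot-prefixed extensions.
def parse_pipe_variants(dist: str):
    url, *mods = dist.split("|")
    variants = {}
    for m in mods:
        if "=" in m and not m.startswith("."):
            key, value = m.split("=", 1)
            variants[key] = value
    return url, variants

# Legacy underscore syntax for the variant segment: key1=value1_key2=value2.
def parse_underscore_variants(segment: str):
    return dict(pair.split("=", 1) for pair in segment.split("_"))

print(parse_pipe_variants("https://example.org/data.ttl|lang=en|type=parsed|.ttl"))
print(parse_underscore_variants("lang=en_type=parsed"))
```

Both calls recover the same `{"lang": "en", "type": "parsed"}` variant mapping; the pipe form additionally carries the format extension alongside the URL.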
New Features
Tests
issue #32