-
Notifications
You must be signed in to change notification settings - Fork 10
Add patterns to partial restore in ch-backup restore command #277
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Reviewer's GuideRefactors the restore CLI and core logic to replace explicit include/exclude database/table lists with a new pattern-based partial restore filter (included/excluded patterns) and wires it through CLI, backup logic, and tests, including new unit tests for the matcher and an integration scenario. Sequence diagram for ch-backup restore with included_patterns and excluded_patternssequenceDiagram
actor User
participant CLI as RestoreCLI
participant Filter as PartialRestoreFilter
participant CHB as ClickhouseBackup
participant Meta as BackupMeta
User->>CLI: ch-backup restore --included-patterns db1.table* --databases db1
CLI->>CLI: parse included_patterns and databases
CLI->>Filter: new PartialRestoreFilter(inverted=false, patterns=included_patterns)
CLI->>CHB: restore(sources, backup_name, databases, partial_restore_filter=Filter, ...)
CHB->>Meta: get_databases()
Meta-->>CHB: all_databases
CHB->>CHB: choose databases to restore
alt partial_restore_filter provided
loop for each db in databases
CHB->>Filter: is_possibly_contains_database(db)
Filter-->>CHB: bool
CHB->>CHB: build filtered database list
end
end
CHB->>CHB: collect tables to restore
alt partial_restore_filter provided
loop for each db in selected databases
CHB->>Meta: get_tables(db)
Meta-->>CHB: tables_in_db
loop for each table in tables_in_db
CHB->>Filter: accept_table(db, table_name)
Filter-->>CHB: bool
CHB->>CHB: include or exclude table
end
end
else no filter
CHB->>Meta: get all tables for all databases
end
CHB->>CHB: _restore(sources, db_names, tables, ...)
CHB-->>CLI: restore completed
CLI-->>User: print restore summary
Class diagram for new partial restore filter and integration into ClickhouseBackupclassDiagram
class PartialRestorePattern {
+str database
+str pattern_str
+pattern
+table
+PartialRestorePattern(database, pattern_str)
+matches(db, table) bool
+related_to_db(db) bool
+from_pattern(pattern) PartialRestorePattern
}
class PartialRestoreFilter {
+bool inverted
+List~PartialRestorePattern~ patterns
+PartialRestoreFilter(inverted, patterns)
+accept_table(db, table) bool
+is_possibly_contains_database(db) bool
+is_empty() bool
}
class ClickhouseBackup {
+restore(
sources,
backup_name,
databases,
override_replica_name,
force_non_replicated,
replica_name,
cloud_storage_source_bucket,
cloud_storage_source_path,
use_embedded_backup_restore,
clean_zookeeper_mode,
keep_going,
restore_tables_in_replicated_database,
partial_restore_filter
) void
+_restore(
sources,
db_names,
tables,
replica_name,
cloud_storage_source_bucket,
cloud_storage_source_path,
use_embedded_backup_restore,
clean_zookeeper_mode,
keep_going,
restore_tables_in_replicated_database
) void
}
class BackupMeta {
+get_databases() List~str~
+get_tables(db_name) List~TableMetadata~
}
class TableMetadata {
+str database
+str name
}
class RestoreCLI {
+restore_command(
ch_backup,
name,
databases,
schema_only,
override_replica_name,
force_non_replicated,
replica_name,
use_embedded_backup_restore,
clean_zookeeper_mode,
keep_going,
access,
data,
schema,
udf,
included_patterns,
excluded_patterns,
nc
) void
}
PartialRestoreFilter "*" o-- "*" PartialRestorePattern : aggregates
ClickhouseBackup --> BackupMeta : uses
ClickhouseBackup --> TableMetadata : restores
RestoreCLI --> ClickhouseBackup : calls restore
RestoreCLI --> PartialRestoreFilter : constructs
ClickhouseBackup --> PartialRestoreFilter : applies filter
File-Level Changes
Tips and commandsInteracting with Sourcery
Customizing Your ExperienceAccess your dashboard to:
Getting Help
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey there - I've reviewed your changes - here's some feedback:
- The
PartialRestorePatternregex construction (re.compile(pattern_str.replace("*", ".*"))) will treat any other regex metacharacters (e.g..,+,[]) in table names as regex, which conflicts with the "simplified regex" intent and theспец.символыtest; considerre.escape(pattern_str)first and then replacing the escaped*to avoid unintended matches. - The CLI validation regex for
--included-patterns/--excluded-patterns(\w+\.[\w*]+) only allows ASCII word characters and*, while unit tests use non-ASCII and punctuation in names (e.g.спец.символы), so these patterns cannot be passed via CLI; relax the CLI regex or reuse the same parsing logic asPartialRestorePattern.from_pattern.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- The `PartialRestorePattern` regex construction (`re.compile(pattern_str.replace("*", ".*"))`) will treat any other regex metacharacters (e.g. `.`, `+`, `[]`) in table names as regex, which conflicts with the "simplified regex" intent and the `спец.символы` test; consider `re.escape(pattern_str)` first and then replacing the escaped `*` to avoid unintended matches.
- The CLI validation regex for `--included-patterns` / `--excluded-patterns` (`\w+\.[\w*]+`) only allows ASCII word characters and `*`, while unit tests use non-ASCII and punctuation in names (e.g. `спец.символы`), so these patterns cannot be passed via CLI; relax the CLI regex or reuse the same parsing logic as `PartialRestorePattern.from_pattern`.
## Individual Comments
### Comment 1
<location> `ch_backup/cli.py:426-435` </location>
<code_context>
f"Example for {style(List.name.upper(), bold=True)}: "
+ style('"db1, db2"', fg=Color.cyan),
- mutually_exclusive(
- option(
- "-d",
- "--databases",
</code_context>
<issue_to_address>
**suggestion (bug_risk):** Pattern list option regex may not match all valid table names and could mislead users.
`type=List(regexp=r"\w+\.[\w*]+")` restricts db/table names to word chars plus `*` in the table, so valid ClickHouse identifiers with other chars (e.g. dashes, quoted names with dots, etc.) can’t be expressed via CLI even though the backend matcher would accept them. Please either relax or remove this regex, or update it to accurately reflect the identifier rules supported by `PartialRestorePattern`.
Suggested implementation:
```python
type=List(),
```
If there are other options in this file using the same `List(regexp=r"\w+\.[\w*]+")` pattern (e.g. for other partial restore patterns), you should apply the same replacement there as well, so that all pattern-based CLI inputs accept the full range of identifiers supported by `PartialRestoreFilter`.
</issue_to_address>
### Comment 2
<location> `ch_backup/ch_backup.py:263-272` </location>
<code_context>
+ if not databases:
</code_context>
<issue_to_address>
**issue (bug_risk):** Dropping the explicit "missing databases" check can now produce confusing failures on invalid user input.
Without the prior validation that all requested databases exist in the backup metadata, invalid user input (e.g. a non-existent DB) will now flow into later calls like `get_tables(db_name)`, likely failing with less clear errors and changing the previous behavior. Please reintroduce a validation step that checks user-specified `databases` against `backup_meta.get_databases()` before applying the partial filter.
</issue_to_address>
### Comment 3
<location> `ch_backup/logic/partial_restore.py:19-22` </location>
<code_context>
+ self.pattern_str = pattern_str
+ self.pattern = None
+ self.table = None
+ if pattern_str.find("*") != -1:
+ self.pattern = re.compile(pattern_str.replace("*", ".*"))
+ else:
+ self.table = pattern_str
</code_context>
<issue_to_address>
**suggestion (bug_risk):** Wildcard-to-regex translation can misbehave if table names ever contain regex metacharacters.
This implementation assumes patterns never include other regex metacharacters. If identifiers later allow `+`, `?`, `(`, `)`, etc., they’ll be treated as regex tokens instead of literals. To harden this, you can use `re.escape(pattern_str).replace("\*", ".*")` so only `*` remains a wildcard and all other characters are escaped.
```suggestion
if pattern_str.find("*") != -1:
escaped = re.escape(pattern_str)
self.pattern = re.compile(escaped.replace(r"\*", ".*"))
else:
self.table = pattern_str
```
</issue_to_address>Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
ch_backup/ch_backup.py
Outdated
| filtered_databases = [] | ||
| excluded_databases = [] | ||
| for db_name in databases: | ||
| if partial_restore_filter.is_possibly_contains_database(db_name): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Kinda tricky part, because for me it is good that exclude-pattern: db1.* is totally excluding database from restoring, but for other people it could be not so obvious.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Decided to move all "partial restore filters" logic to separate file, but I'm not sure about classes/function and open to suggestions.
| }, | ||
| }, | ||
| { | ||
| "id": "Excluded filter and not enough info", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tricky part about exclude-database with patterns is that we don't have enough info with simple table pattern like db1.t1 and only have enough confidence with regex like db.*
| ), | ||
| option( | ||
| "--included-patterns", | ||
| type=List(regexp=r"\w+\.[\w*]+"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Based on Sourcery comment.
Do we want to support some "strange" table names with special symbols or quoted table name for patial restore? I think that such case will be exceptional and partially covered by "*" wildcard. But I can change regex, if needed.
p.s. googled and double-checked about non-ascii symbols and sourcery-ai was wrong
root@clickhouse02:/# ch-backup restore --included-patterns база.таблица1 11111
...
Error: No backups with name "11111" were found.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is about db1.*table* - we expect such patterns to work?
If yes, lets also add it to tests?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is about
db1.*table*- we expect such patterns to work?If yes, lets also add it to tests?
Valid point, added to unit and feature tests
| ), | ||
| option( | ||
| "--included-patterns", | ||
| type=List(regexp=r"\w+\.[\w*]+"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is about db1.*table* - we expect such patterns to work?
If yes, lets also add it to tests?
|
LGTM |
ch_backup/ch_backup.py
Outdated
| logging.debug( | ||
| f"Filtering table to restore: {table.database}.{table.name} result: {partial_restore_filter.accept_table(table.database, table.name)}." | ||
| ) | ||
| if partial_restore_filter.accept_table(table.database, table.name): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider is_restoring = partial_restore_filter.accept_table(table.database, table.name) and use it in both places
| option( | ||
| "--included-patterns", | ||
| type=List(regexp=r"\w+\.[\w*]+"), | ||
| help="Comma-separated list of db.tables to restore. Other tables will be skipped. " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Semicolon-separated ?
ch_backup/cli.py
Outdated
| "--included-patterns", | ||
| type=List(regexp=r"\w+\.[\w*]+"), | ||
| help="Comma-separated list of db.tables to restore. Other tables will be skipped. " | ||
| "Examples: db1.table1; db1.table2,db2.*; db1.prefix*,db2.*suffix", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Semicolons is mixed with commas ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, semicolons separate different cases, commas separate different patterns in one case.
I guess I will rewrite it to Examples: db1.table1 | db1.table2,db2.* | db1.prefix*,db2.*suffix to make it more clear.
| ), | ||
| option("--schema-only", is_flag=True, help="Restore only database schemas."), | ||
| ) | ||
| @option_group( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we preserve this parameters, and just treat them as patterns internally ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I really don't want to, because, imho, it will complicate the logic unnecessarily.
You want preserve this parameters for backward compability or smth else? Because, his parameters are potentially used only manually once in a month and their replacement seems pretty easy for users
| ", ".join(missed_databases), | ||
| self._context.backup_meta.path, | ||
| ) | ||
| raise ClickhouseBackupError( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where is this logic now ?
ch_backup/logic/partial_restore.py
Outdated
| self.pattern_str = pattern_str | ||
| self.pattern = None | ||
| self.table = None | ||
| if pattern_str.find("*") != -1: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not use fnmatch.fnmatch (wildcards) for this without converting to regex ?
And declare this in the helper string in cli for clarity
ch_backup/logic/partial_restore.py
Outdated
| self.pattern_str = pattern_str | ||
| self.pattern = None | ||
| self.table = None | ||
| if pattern_str.find("*") != -1: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we allow pattern format for databases also, like db*.table*?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not really, I didn't see such request from clients and hope we will never need to use such pattern.
Also tried to introduce patterns for database, but code becomes kinda complicated if we support it patterns for database on exclude
#277 (comment)
904c34b to
3a46ad9
Compare
Co-authored-by: Aleksei Filatov <68555560+aalexfvk@users.noreply.github.com>
For command
ch-backup restore:exclude-database,include-tables,exclude-tablesincluded-patternsandexcluded-patternswith same functional and with ability to specify subset of tables via simplified regex.Summary by Sourcery
Add pattern-based partial restore filtering to the ch-backup restore command and update restore logic to use a new matcher instead of explicit include/exclude table lists.
New Features:
Enhancements:
Tests: