@yahor-s yahor-s commented Dec 12, 2025

For the ch-backup restore command:

  • Removed the options exclude-database, include-tables, and exclude-tables.
  • Added the options included-patterns and excluded-patterns instead, with the same functionality plus the ability to specify a subset of tables via a simplified regex.

Summary by Sourcery

Add pattern-based partial restore filtering to the ch-backup restore command and update restore logic to use a new matcher instead of explicit include/exclude table lists.

New Features:

  • Introduce --included-patterns and --excluded-patterns options on ch-backup restore to select tables by db.table or wildcard patterns.
  • Add PartialRestoreFilter and PartialRestorePattern helpers to encapsulate table/database matching for partial restores.

Enhancements:

  • Refactor restore flow to derive databases and tables from backup metadata using the new partial restore filter rather than explicit include/exclude table lists.
  • Tighten CLI constraints so schema-only restores cannot be combined with pattern-based partial restores.
  • Improve debug logging around database and table selection during restore.

Tests:

  • Add integration coverage for partial restore behavior with included and excluded patterns.
  • Add unit tests for the partial restore matcher to verify database and table pattern handling.

sourcery-ai bot commented Dec 12, 2025

Reviewer's Guide

Refactors the restore CLI and core logic to replace explicit include/exclude database/table lists with a new pattern-based partial restore filter (included/excluded patterns) and wires it through CLI, backup logic, and tests, including new unit tests for the matcher and an integration scenario.

Sequence diagram for ch-backup restore with included_patterns and excluded_patterns

sequenceDiagram
    actor User
    participant CLI as RestoreCLI
    participant Filter as PartialRestoreFilter
    participant CHB as ClickhouseBackup
    participant Meta as BackupMeta

    User->>CLI: ch-backup restore --included-patterns db1.table* --databases db1

    CLI->>CLI: parse included_patterns and databases
    CLI->>Filter: new PartialRestoreFilter(inverted=false, patterns=included_patterns)
    CLI->>CHB: restore(sources, backup_name, databases, partial_restore_filter=Filter, ...)

    CHB->>Meta: get_databases()
    Meta-->>CHB: all_databases
    CHB->>CHB: choose databases to restore
    alt partial_restore_filter provided
        loop for each db in databases
            CHB->>Filter: is_possibly_contains_database(db)
            Filter-->>CHB: bool
            CHB->>CHB: build filtered database list
        end
    end

    CHB->>CHB: collect tables to restore
    alt partial_restore_filter provided
        loop for each db in selected databases
            CHB->>Meta: get_tables(db)
            Meta-->>CHB: tables_in_db
            loop for each table in tables_in_db
                CHB->>Filter: accept_table(db, table_name)
                Filter-->>CHB: bool
                CHB->>CHB: include or exclude table
            end
        end
    else no filter
        CHB->>Meta: get all tables for all databases
    end

    CHB->>CHB: _restore(sources, db_names, tables, ...)
    CHB-->>CLI: restore completed
    CLI-->>User: print restore summary
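The selection flow in the sequence diagram can be sketched in plain Python. This is a minimal illustration, not the project's code: `select_tables`, `tables_by_db`, `db_may_match`, and `accept_table` are hypothetical names, a dict stands in for `BackupMeta`, and falling back to every database in the backup when none is requested is an assumption.

```python
from typing import Callable


def select_tables(
    tables_by_db: dict[str, list[str]],
    requested_dbs: list[str],
    db_may_match: Callable[[str], bool],
    accept_table: Callable[[str, str], bool],
) -> list[tuple[str, str]]:
    """Return (database, table) pairs that pass both filtering stages."""
    # Assumption: an empty request means "consider every database in the backup".
    databases = requested_dbs or list(tables_by_db)
    selected = []
    for db in databases:
        # Stage 1: cheap database-level pre-filter.
        if not db_may_match(db):
            continue
        # Stage 2: per-table decision.
        for table in tables_by_db.get(db, []):
            if accept_table(db, table):
                selected.append((db, table))
    return selected


backup = {"db1": ["table1", "table2", "other"], "db2": ["t"]}
picked = select_tables(
    backup,
    ["db1"],
    lambda db: db == "db1",
    lambda db, table: table.startswith("table"),
)
print(picked)  # [('db1', 'table1'), ('db1', 'table2')]
```

Passing the two checks as callables keeps the loop independent of how patterns are parsed, which is roughly the role the filter object plays in the diagram.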

Class diagram for new partial restore filter and integration into ClickhouseBackup

classDiagram
    class PartialRestorePattern {
        +str database
        +str pattern_str
        +pattern
        +table
        +PartialRestorePattern(database, pattern_str)
        +matches(db, table) bool
        +related_to_db(db) bool
        +from_pattern(pattern) PartialRestorePattern
    }

    class PartialRestoreFilter {
        +bool inverted
        +List~PartialRestorePattern~ patterns
        +PartialRestoreFilter(inverted, patterns)
        +accept_table(db, table) bool
        +is_possibly_contains_database(db) bool
        +is_empty() bool
    }

    class ClickhouseBackup {
        +restore(
            sources,
            backup_name,
            databases,
            override_replica_name,
            force_non_replicated,
            replica_name,
            cloud_storage_source_bucket,
            cloud_storage_source_path,
            use_embedded_backup_restore,
            clean_zookeeper_mode,
            keep_going,
            restore_tables_in_replicated_database,
            partial_restore_filter
        ) void
        +_restore(
            sources,
            db_names,
            tables,
            replica_name,
            cloud_storage_source_bucket,
            cloud_storage_source_path,
            use_embedded_backup_restore,
            clean_zookeeper_mode,
            keep_going,
            restore_tables_in_replicated_database
        ) void
    }

    class BackupMeta {
        +get_databases() List~str~
        +get_tables(db_name) List~TableMetadata~
    }

    class TableMetadata {
        +str database
        +str name
    }

    class RestoreCLI {
        +restore_command(
            ch_backup,
            name,
            databases,
            schema_only,
            override_replica_name,
            force_non_replicated,
            replica_name,
            use_embedded_backup_restore,
            clean_zookeeper_mode,
            keep_going,
            access,
            data,
            schema,
            udf,
            included_patterns,
            excluded_patterns,
            nc
        ) void
    }

    PartialRestoreFilter "*" o-- "*" PartialRestorePattern : aggregates
    ClickhouseBackup --> BackupMeta : uses
    ClickhouseBackup --> TableMetadata : restores
    RestoreCLI --> ClickhouseBackup : calls restore
    RestoreCLI --> PartialRestoreFilter : constructs
    ClickhouseBackup --> PartialRestoreFilter : applies filter

File-Level Changes

Change: Replace database/table include-exclude CLI options with pattern-based partial restore flags and adjust constraints and restore wiring.
Details:
  • Removed --exclude-databases, --tables, and --exclude-tables options from the restore CLI and simplified the database option group.
  • Added --included-patterns and --excluded-patterns options (db.table patterns with simple wildcard support) and wired them into restore_command parameters.
  • Reworked Click CLI @constraint annotations to enforce mutual exclusivity between included/excluded patterns and schema-only options.
Files:
  • ch_backup/cli.py

Change: Introduce PartialRestoreFilter/PartialRestorePattern to implement pattern-based database/table filtering for restore.
Details:
  • Added PartialRestorePattern to parse db.table patterns, support '*' wildcards on table names, and match/relate patterns to databases and tables.
  • Added PartialRestoreFilter to encapsulate include/exclude semantics, table acceptance, and database-level pre-filtering (is_possibly_contains_database).
  • Integrated PartialRestoreFilter into ClickhouseBackup.restore to select databases and tables based on patterns and log excluded entities instead of using explicit include/exclude lists.
Files:
  • ch_backup/logic/partial_restore.py
  • ch_backup/ch_backup.py

Change: Simplify restore pipeline table selection by removing explicit exclude_tables handling and relying on pre-filtered tables.
Details:
  • Updated internal _restore and table.restore signatures to drop exclude_tables arguments and adjusted callers accordingly.
  • Removed exclude_tables-based filtering in table.restore, assuming tables list is already filtered by PartialRestoreFilter.
  • Added logging around chosen databases and tables to reflect the new filtering behavior.
Files:
  • ch_backup/ch_backup.py
  • ch_backup/logic/table.py

Change: Extend integration restore helper and scenarios to cover the new partial restore behavior and add unit tests for the matcher.
Details:
  • Extended tests.integration.modules.ch_backup.Backup.restore helper to accept included_patterns/excluded_patterns and translate them to CLI flags.
  • Added a new integration scenario validating partial restore with included_patterns and excluded_patterns, including table/part counts.
  • Added unit tests for PartialRestoreFilter and PartialRestorePattern covering database and table matching semantics and inverted behavior.
  • Cleaned up minor whitespace in existing integration feature steps.
Files:
  • tests/integration/modules/ch_backup.py
  • tests/integration/features/backup_restore.feature
  • tests/unit/test_partial_restore_matcher.py

@sourcery-ai sourcery-ai bot left a comment

Hey there - I've reviewed your changes - here's some feedback:

  • The PartialRestorePattern regex construction (re.compile(pattern_str.replace("*", ".*"))) will treat any other regex metacharacters (e.g. ., +, []) in table names as regex, which conflicts with the "simplified regex" intent and the спец.символы test; consider re.escape(pattern_str) first and then replacing the escaped * to avoid unintended matches.
  • The CLI validation regex for --included-patterns / --excluded-patterns (\w+\.[\w*]+) only allows ASCII word characters and *, while unit tests use non-ASCII and punctuation in names (e.g. спец.символы), so these patterns cannot be passed via CLI; relax the CLI regex or reuse the same parsing logic as PartialRestorePattern.from_pattern.
Prompt for AI Agents
Please address the comments from this code review:

## Overall Comments
- The `PartialRestorePattern` regex construction (`re.compile(pattern_str.replace("*", ".*"))`) will treat any other regex metacharacters (e.g. `.`, `+`, `[]`) in table names as regex, which conflicts with the "simplified regex" intent and the `спец.символы` test; consider `re.escape(pattern_str)` first and then replacing the escaped `*` to avoid unintended matches.
- The CLI validation regex for `--included-patterns` / `--excluded-patterns` (`\w+\.[\w*]+`) only allows ASCII word characters and `*`, while unit tests use non-ASCII and punctuation in names (e.g. `спец.символы`), so these patterns cannot be passed via CLI; relax the CLI regex or reuse the same parsing logic as `PartialRestorePattern.from_pattern`.

## Individual Comments

### Comment 1
<location> `ch_backup/cli.py:426-435` </location>
<code_context>
     f"Example for {style(List.name.upper(), bold=True)}: "
     + style('"db1, db2"', fg=Color.cyan),
-    mutually_exclusive(
-        option(
-            "-d",
-            "--databases",
</code_context>

<issue_to_address>
**suggestion (bug_risk):** Pattern list option regex may not match all valid table names and could mislead users.

`type=List(regexp=r"\w+\.[\w*]+")` restricts db/table names to word chars plus `*` in the table, so valid ClickHouse identifiers with other chars (e.g. dashes, quoted names with dots, etc.) can’t be expressed via CLI even though the backend matcher would accept them. Please either relax or remove this regex, or update it to accurately reflect the identifier rules supported by `PartialRestorePattern`.

Suggested implementation:

```python
        type=List(),

```

If there are other options in this file using the same `List(regexp=r"\w+\.[\w*]+")` pattern (e.g. for other partial restore patterns), you should apply the same replacement there as well, so that all pattern-based CLI inputs accept the full range of identifiers supported by `PartialRestoreFilter`.
</issue_to_address>

### Comment 2
<location> `ch_backup/ch_backup.py:263-272` </location>
<code_context>
+        if not databases:
</code_context>

<issue_to_address>
**issue (bug_risk):** Dropping the explicit "missing databases" check can now produce confusing failures on invalid user input.

Without the prior validation that all requested databases exist in the backup metadata, invalid user input (e.g. a non-existent DB) will now flow into later calls like `get_tables(db_name)`, likely failing with less clear errors and changing the previous behavior. Please reintroduce a validation step that checks user-specified `databases` against `backup_meta.get_databases()` before applying the partial filter.
</issue_to_address>
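
The validation step requested in Comment 2 could look roughly like the sketch below. It is an illustration under assumptions: `check_requested_databases` and `MissingDatabasesError` are hypothetical names (the project raises `ClickhouseBackupError`), and the two lists stand in for the user-supplied `databases` and the result of `backup_meta.get_databases()`.

```python
class MissingDatabasesError(Exception):
    """Hypothetical stand-in for the project's ClickhouseBackupError."""


def check_requested_databases(requested: list[str], in_backup: list[str]) -> None:
    """Fail early if the user asked for databases the backup does not contain."""
    known = set(in_backup)
    # Keep the user's order so the error message is easy to compare with the command line.
    missed = [db for db in requested if db not in known]
    if missed:
        raise MissingDatabasesError(
            f"Databases not found in backup metadata: {', '.join(missed)}"
        )
```

Running this check before the partial restore filter is applied keeps a mistyped database name from surfacing later as a less specific error.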

### Comment 3
<location> `ch_backup/logic/partial_restore.py:19-22` </location>
<code_context>
+        self.pattern_str = pattern_str
+        self.pattern = None
+        self.table = None
+        if pattern_str.find("*") != -1:
+            self.pattern = re.compile(pattern_str.replace("*", ".*"))
+        else:
+            self.table = pattern_str
</code_context>

<issue_to_address>
**suggestion (bug_risk):** Wildcard-to-regex translation can misbehave if table names ever contain regex metacharacters.

This implementation assumes patterns never include other regex metacharacters. If identifiers later allow `+`, `?`, `(`, `)`, etc., they’ll be treated as regex tokens instead of literals. To harden this, you can use `re.escape(pattern_str).replace(r"\*", ".*")` (note the raw string) so only `*` remains a wildcard and all other characters are escaped.

```suggestion
        if pattern_str.find("*") != -1:
            escaped = re.escape(pattern_str)
            self.pattern = re.compile(escaped.replace(r"\*", ".*"))
        else:
            self.table = pattern_str
```
</issue_to_address>
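
To make the difference in Comment 3 concrete, the sketch below compares the two translations on a table pattern that contains a dot. The helper names `naive_wildcard` and `escaped_wildcard` are hypothetical, and the use of `fullmatch` is an assumption, since the thread does not show how the compiled pattern is applied.

```python
import re


def naive_wildcard(pattern_str: str) -> re.Pattern:
    """Translate '*' to '.*' and leave every other character as regex syntax."""
    return re.compile(pattern_str.replace("*", ".*"))


def escaped_wildcard(pattern_str: str) -> re.Pattern:
    """Escape everything first, then turn only the escaped '*' back into '.*'."""
    return re.compile(re.escape(pattern_str).replace(r"\*", ".*"))


# The '.' in "a.b*" is a regex wildcard in the naive version, so "aXb1" slips through.
print(naive_wildcard("a.b*").fullmatch("aXb1") is not None)    # True
print(escaped_wildcard("a.b*").fullmatch("aXb1") is not None)  # False
print(escaped_wildcard("a.b*").fullmatch("a.b1") is not None)  # True
```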


filtered_databases = []
excluded_databases = []
for db_name in databases:
    if partial_restore_filter.is_possibly_contains_database(db_name):
Author

This part is kind of tricky: for me it is good that an excluded pattern of db1.* excludes the whole database from the restore, but for other people that might not be so obvious.

Author

I decided to move all the "partial restore filter" logic to a separate file, but I'm not sure about the classes/functions and am open to suggestions.

},
},
{
"id": "Excluded filter and not enough info",
Author

The tricky part about excluding a database via patterns is that a simple table pattern like db1.t1 doesn't give us enough information; we can only be confident with a pattern like db.*
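
One way to read the rule described above is sketched below. It is a guess at the semantics, not the code from the PR: `may_contain_database` is a hypothetical free function standing in for `is_possibly_contains_database`, patterns are plain `db.table` strings, and only a bare `*` table pattern is treated as "covers the whole database".

```python
def may_contain_database(patterns: list[str], inverted: bool, db: str) -> bool:
    """Decide whether a database might still hold tables to restore.

    inverted=False models included patterns, inverted=True models excluded ones.
    """
    # Table parts of every pattern whose database part equals db exactly.
    related = []
    for pattern in patterns:
        pattern_db, _, table_pattern = pattern.partition(".")
        if pattern_db == db:
            related.append(table_pattern)

    if not inverted:
        # Include mode: visit the database only if some pattern names it.
        return bool(related)
    # Exclude mode: drop the database only when a pattern certainly covers
    # every table. "db1.t1" is not enough, because other tables may remain.
    return "*" not in related


print(may_contain_database(["db1.t1"], True, "db1"))  # True
print(may_contain_database(["db1.*"], True, "db1"))   # False
```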

),
option(
    "--included-patterns",
    type=List(regexp=r"\w+\.[\w*]+"),
Author

Based on Sourcery comment.

Do we want to support "strange" table names with special symbols, or quoted table names, for partial restore? I think such cases will be exceptional and are partially covered by the "*" wildcard. But I can change the regex if needed.

P.S. I googled and double-checked non-ASCII symbols, and sourcery-ai was wrong:

root@clickhouse02:/# ch-backup restore --included-patterns база.таблица1 11111
...

Error: No backups with name "11111" were found.
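
The observation above agrees with Python's `re` module, where `\w` in a `str` pattern matches Unicode word characters unless `re.ASCII` is set. A small check of the CLI regex follows; `is_accepted` is a hypothetical helper, and matching the whole value with `fullmatch` is an assumption about how the `List` type applies `regexp`.

```python
import re

# The validation regex from the --included-patterns option.
CLI_PATTERN = re.compile(r"\w+\.[\w*]+")


def is_accepted(value: str) -> bool:
    """Return True if the whole value matches the CLI regex."""
    return CLI_PATTERN.fullmatch(value) is not None


for candidate in ["db1.table1", "db1.*table*", "база.таблица1", "db1.my-table"]:
    print(candidate, is_accepted(candidate))
```

Under that assumption the Cyrillic name is accepted, while a name containing a dash is rejected, which is the remaining part of Sourcery's concern.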

Contributor

What about db1.*table*: do we expect such patterns to work?

If yes, let's also add it to the tests?

Author

What about db1.*table*: do we expect such patterns to work?

If yes, let's also add it to the tests?

Valid point, added to the unit and feature tests.

@MedvedewEM
Contributor

LGTM

@MedvedewEM MedvedewEM requested a review from aalexfvk December 16, 2025 12:55
logging.debug(
    f"Filtering table to restore: {table.database}.{table.name} result: {partial_restore_filter.accept_table(table.database, table.name)}."
)
if partial_restore_filter.accept_table(table.database, table.name):
Contributor

Consider assigning is_restoring = partial_restore_filter.accept_table(table.database, table.name) once and using it in both places.
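
A minimal sketch of that suggestion, with the filter reduced to a callable so the example is self-contained. `filter_tables` and its parameters are hypothetical names; only the `is_restoring` variable comes from the comment above.

```python
import logging
from typing import Callable


def filter_tables(
    tables: list[tuple[str, str]],
    accept_table: Callable[[str, str], bool],
) -> list[tuple[str, str]]:
    """Keep the (database, name) pairs accepted by the filter."""
    selected = []
    for database, name in tables:
        # Evaluate the filter once and reuse the result for logging and branching.
        is_restoring = accept_table(database, name)
        logging.debug(
            "Filtering table to restore: %s.%s result: %s.",
            database,
            name,
            is_restoring,
        )
        if is_restoring:
            selected.append((database, name))
    return selected
```

Besides avoiding the duplicate call, this guarantees that the logged result and the decision actually taken cannot diverge.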

option(
    "--included-patterns",
    type=List(regexp=r"\w+\.[\w*]+"),
    help="Comma-separated list of db.tables to restore. Other tables will be skipped. "
Contributor

Semicolon-separated?

ch_backup/cli.py Outdated
"--included-patterns",
type=List(regexp=r"\w+\.[\w*]+"),
help="Comma-separated list of db.tables to restore. Other tables will be skipped. "
"Examples: db1.table1; db1.table2,db2.*; db1.prefix*,db2.*suffix",
Contributor

Are semicolons mixed with commas?

Author

No, semicolons separate different cases; commas separate different patterns within one case.

I guess I will rewrite it as Examples: db1.table1 | db1.table2,db2.* | db1.prefix*,db2.*suffix to make it clearer.

),
option("--schema-only", is_flag=True, help="Restore only database schemas."),
)
@option_group(
Contributor

Can we preserve these parameters and just treat them as patterns internally?

Author

I really don't want to because, IMHO, it would complicate the logic unnecessarily.

Do you want to preserve these parameters for backward compatibility, or for some other reason? These parameters are probably only used manually, roughly once a month, and replacing them seems pretty easy for users.

", ".join(missed_databases),
self._context.backup_meta.path,
)
raise ClickhouseBackupError(
Contributor

Where is this logic now?

self.pattern_str = pattern_str
self.pattern = None
self.table = None
if pattern_str.find("*") != -1:
Contributor

@aalexfvk aalexfvk Jan 28, 2026

Why not use fnmatch.fnmatch (wildcards) for this, without converting to a regex?
And state this in the help string in the CLI for clarity.
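
A sketch of what the `fnmatch`-based variant might look like. `table_matches` is a hypothetical helper, not code from the PR. It uses `fnmatch.fnmatchcase` rather than `fnmatch.fnmatch` because the latter normalizes case with `os.path.normcase`, which differs between platforms. Note that fnmatch-style patterns also give meaning to `?` and `[seq]`, so this is slightly broader than a `*`-only wildcard.

```python
from fnmatch import fnmatchcase


def table_matches(pattern: str, db: str, table: str) -> bool:
    """Match 'db.table_pattern' against a table: exact database, wildcard table."""
    # Split on the first dot only, so the table part may itself contain dots.
    pattern_db, _, table_pattern = pattern.partition(".")
    return db == pattern_db and fnmatchcase(table, table_pattern)


print(table_matches("db1.*table*", "db1", "my_table_1"))  # True
print(table_matches("db1.prefix*", "db2", "prefix_a"))    # False
print(table_matches("db1.t1", "db1", "t10"))              # False
```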

self.pattern_str = pattern_str
self.pattern = None
self.table = None
if pattern_str.find("*") != -1:
Contributor

Can we also allow the pattern format for databases, like db*.table*?

Author

Not really: I haven't seen such a request from clients, and I hope we will never need such a pattern.
I also tried to introduce patterns for databases, but the code becomes rather complicated if we support database patterns on exclude:
#277 (comment)

@yahor-s yahor-s force-pushed the restore-add-patterns branch from 904c34b to 3a46ad9 Compare January 29, 2026 02:26
yahor-s and others added 2 commits January 29, 2026 11:39
Co-authored-by: Aleksei Filatov <68555560+aalexfvk@users.noreply.github.com>

5 participants