
MultiCloud UC Replication System for Databricks

A Python plug-in solution that replicates Unity Catalog (UC) data and metadata between Databricks environments. It supports and accelerates workloads in multi-cloud migration, single-cloud migration, workspace migration, disaster recovery (DR), backup and recovery, and multi-cloud data mesh.

Cloud agnostic: supports cross-metastore and same-metastore replication.

Overview

This system provides incremental data and UC metadata replication between Databricks environments, or within the same environment, using Databricks-to-Databricks (D2D) Delta Sharing, Deep Clone, and Auto Loader, with specialized handling for Streaming Tables. It supports multiple operation types that can be run independently or together.

UC Object Types

UC Metadata

| Object Type | Status | Comments |
|---|---|---|
| Storage Credentials | Supported | |
| External Locations | Supported | |
| Catalogs | Supported | |
| Schemas | Supported | |
| Volumes | Supported | |
| Tables | Supported | |
| Views | Supported | |
| Table/View Comments | Supported | |
| SQL-based Materialized Views | Supported | DLT-based MVs should be recreated by DLT in the target |
| SQL-based Streaming Tables | Supported | DLT-based STs should be recreated by DLT in the target |
| Tags (catalog, schema, table, column, view, volume) | Supported | |
| Column Comments | Supported | |
| Permissions | In Development | |
| Functions | In Development | |
| Models | In Development | |
| Governed Tags | In Development | |

Data Replication

| Object Type | Status | Comments |
|---|---|---|
| Managed Tables | Supported | |
| External Tables | Supported | Can be replicated as External or Managed |
| Volume Files | Supported | |
| DLT Streaming Tables (data only, no checkpoints) | Supported | Checkpoints are not replicated; state handling should be managed separately |
| DLT Materialized Views | Not Supported | DLT MVs should be recomputed by DLT in the target |

Other Unsupported Objects

  • Databricks workspace assets are not yet supported, but may be considered in the future roadmap
  • Hive metastore

How It Works

The tool performs a set of operations: backing up/publishing source tables to a Delta Share, replicating UC metadata, replicating tables, and reconciling data. How each operation behaves is fully configurable in a YAML-based file. Below is a high-level summary of what each operation does, followed by an illustrative configuration sketch:

Backup/Publish Operations

  • (Optional if managed by Terraform) Creates the Delta Sharing recipient, creates shares, and adds schemas to the shares
  • For legacy DLT streaming tables, deep-clones the backing tables from the source to backup catalogs and adds the backup schemas to the backup share
  • For Default Publishing Mode (DPM) streaming tables, adds the backing tables to the dpm_backing_tables share directly
  • Not required for UC metadata replication

Data Replication Operations

  • (Optional if managed by Terraform) Creates shared catalogs from the shares
  • Deep-clones tables across workspaces from the shared catalog with schema enforcement
  • For DLT streaming tables, defaults to replicating from the dpm_backing_tables share and falls back to the backup share
  • Incrementally copies volume files across workspaces from the shared catalog using Auto Loader + file copy

Metadata Replication Operations

  • Replicates UC metadata from the source UC to the target UC using Databricks Connect and the Databricks SDK

Reconciliation Operations (Table only)

  • Row count validation
  • Schema structure comparison
  • Missing data detection
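Purely as an illustration of how these operations might be selected in a config file, a minimal sketch is shown below; every key is a hypothetical placeholder rather than the tool's actual schema (see the sample configs and README.yaml for the real options):

# Hypothetical sketch -- placeholder keys, not the tool's actual schema
operations:
  backup_publish: true          # publish source tables to a Delta Share
  replicate_uc_metadata: true   # replicate UC metadata to the target metastore
  replicate_tables: true        # deep-clone tables from the shared catalog
  reconcile: true               # row count / schema / missing-data checks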

Key Features

Delta Sharing

The tool can set up the Delta Sharing infrastructure (recipient, shares, and shared catalogs) automatically for you with default names. Alternatively, it can use existing Delta Sharing infrastructure.
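For illustration only, the choice between tool-managed and pre-existing sharing infrastructure could be expressed roughly as below; the keys and names are assumptions made for this sketch, not the actual config schema:

# Hypothetical sketch -- illustrative keys and names only
delta_sharing:
  manage_infra: true            # let the tool create the recipient, shares and shared catalogs with default names
  # or reuse infrastructure created elsewhere (e.g. by Terraform):
  # manage_infra: false
  # share_name: my_existing_share
  # shared_catalog_name: my_existing_shared_catalog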

Run Anywhere

  • Runs on both Serverless (recommended) and non-Serverless compute (DBR 16.4+)
  • Runs in the source workspace, the target workspace, or outside Databricks
  • Runs from the CLI, or can be deployed via DAB as a workflow job

Flexible Configuration

  • YAML-based configuration with Pydantic validation
  • Hierarchical configuration with inheritance and CLI overrides
  • Environments to manage environment-specific connections and configurations
  • Substitution support to allow dynamic config strings
  • Flexible selective replication by object type
  • Flexible selective replication by catalog, schema, and table (see the sketch below)
  • Check this for complete config features.
  • Check this for available configs with detailed descriptions.
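As a purely illustrative sketch of these features (placeholder keys; the linked config documentation is authoritative), hierarchical configuration with environments, substitutions, and selective scope might look like:

# Hypothetical sketch -- placeholder keys, not the tool's actual schema
environment: prod                       # pick an environment-specific connection block
substitutions:
  region: eastus                        # could be referenced elsewhere, e.g. as ${region}
scope:
  uc_object_types: [catalog, schema, table, view, volume]
  target_catalogs: [catalog1, catalog2]
  target_schemas: [bronze_1, silver_1]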

Incremental Data Replication

The system leverages native Deep Clone and Auto Loader for incrementality and replication performance:

  • Deep Clone for Delta tables
  • Auto Loader + file copy for volume file replication, with an option to specify a starting timestamp
  • Option to replicate external tables as managed tables to support managed table migration (see the sketch below)
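The options above could be expressed roughly as follows; the keys are hypothetical and are shown only to make the behaviour concrete:

# Hypothetical sketch -- illustrative keys only
tables:
  replicate_external_as_managed: true            # land external tables as managed tables in the target
volumes:
  starting_timestamp: "2024-01-01T00:00:00Z"     # only copy volume files added/modified after this point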

Streaming Table Handling

The system automatically handles Streaming Table complexities:

  • Exports legacy ST backing tables to a share, or adds DPM ST backing tables to a share
  • Constructs the backing table path using the pipeline ID
  • Deep-clones the ST backing tables rather than the ST tables directly (see the sketch below)
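As a rough, hypothetical illustration of the pieces involved (only the dpm_backing_tables share name comes from the operations described above; the other names and keys are invented for this sketch):

# Hypothetical sketch -- names and keys are illustrative only
streaming_tables:
  dpm_share: dpm_backing_tables          # DPM ST backing tables are added to this share directly
  backup_catalog: st_backup              # legacy ST backing tables are deep-cloned here first
  backup_share: st_backup_share          # the backup schemas are then added to this share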

UC Metadata Replication

Replicates UC metadata between the source and target metastores:

  • Creates or updates target UC objects via the Databricks SDK or SQL
  • Incremental tag replication

Parallel Replication

  • Configurable concurrency with multithreading
  • Parallel table/volume replication
  • Concurrent volume file copy (see the sketch below)
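A hypothetical concurrency block, with placeholder keys, might look like:

# Hypothetical sketch -- illustrative keys only
concurrency:
  max_parallel_tables: 8          # tables/volumes replicated concurrently
  max_parallel_file_copies: 16    # concurrent file copies within a volume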

Robust Logging & Error Handling

  • Configurable retry logic with exponential backoff (see the sketch below)
  • Graceful degradation: operations continue if individual objects fail
  • Comprehensive error logging with run IDs and full stack traces
  • All operations tracked in audit tables for monitoring and alerting
  • All executed SQL printed in DEBUG mode for easy troubleshooting
  • Check here to learn more about logging
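Purely as an illustration (placeholder keys, not the actual schema), retry and logging settings could be sketched as:

# Hypothetical sketch -- illustrative keys only
logging:
  level: DEBUG            # DEBUG also prints all executed SQL
retry:
  max_attempts: 5
  backoff_seconds: 2      # doubled on each retry (exponential backoff)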

Prerequisites

  • A user or service principal created in both the source and target workspaces with metastore admin rights. If metastore admin permission is not available, check here to apply more granular UC access control

  • For cross-metastore replication, enable Delta Sharing (DS), including network connectivity: https://docs.databricks.com/aws/en/delta-sharing/set-up#gsc.tab=0

  • Table replication requires Delta Sharing with the Cloud Token method instead of Presigned URLs

  • A PAT or OAuth token for the user or SP, created and stored as Databricks secrets. Note: if this tool runs in the source workspace, only the target workspace token secret needs to be created in the source. Conversely, if it runs in the target workspace, the source token needs to be created in the target.

  • Network connectivity to the source or target workspace. For example, if the tool runs in the source workspace, the source data plane (outbound) should be able to establish a connection to the target workspace control plane (inbound), and vice versa. Note: UC replication requires connectivity to both the source and target workspaces using Databricks Connect.

  • If the tool runs outside of a Databricks workspace and Serverless is unavailable in the source and/or target workspace, the cluster ID of an all-purpose cluster in the source and/or target workspace needs to be provided

Getting Started

  1. Install the Databricks CLI from https://docs.databricks.com/dev-tools/cli/databricks-cli.html

  2. Set up the dev environment:

git clone <repository-url>
cd <repository folder>
make setup
source .venv/bin/activate
  3. Configure the environments.yaml file with connection details, for example:
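A minimal, hypothetical environments.yaml sketch is shown below; the field names are assumptions made for illustration, so refer to the sample configs for the actual structure:

# Hypothetical environments.yaml sketch -- field names are illustrative only
environments:
  source:
    host: https://adb-1111111111111111.11.azuredatabricks.net
    token_secret_scope: replication    # secret scope holding the PAT/OAuth token
    token_secret_key: source-token
    cluster_id: 0101-123456-abcdefgh   # only needed when Serverless is unavailable
  target:
    host: https://dbc-22222222-2222.cloud.databricks.com
    token_secret_scope: replication
    token_secret_key: target-token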

  4. Start replication

  • Clone and modify the sample configs in the configs folder. Configs with the _defaults suffix allow you to set up replication using system-generated default names and settings with minimal configuration.
  • High-level replication steps are also described in the sample configs
  • For a more comprehensive understanding of the available configs, check README.yaml
# Check all available args
data-replicator --help

# Validate configuration without running
data-replicator configs/cross_metastore/uc_metadata_defaults.yaml --validate-only
data-replicator configs/cross_metastore/all_tables_defaults.yaml --validate-only
data-replicator configs/cross_metastore/volume_defaults.yaml --validate-only

# Replicate all UC metadata
# Set storage_credential_config if storage credentials need to be replicated
# Set cloud_url_mapping if external location or external table need to be replicated
# Objects will be replicated in the following logical order: Storage credentials -> External locations -> Catalogs -> Schemas -> Tables -> Views -> Volumes -> Column Tags -> Column Comments -> Permissions
data-replicator configs/cross_metastore/uc_metadata_defaults.yaml --uc-object-types all --target-catalogs catalog1,catalog2,catalog3

# Replicate delta tables for specific catalogs
# DLT streaming tables must already exist in the target before replicating
data-replicator configs/cross_metastore/all_tables_defaults.yaml --target-catalogs catalog1,catalog2,catalog3

# Replicate volume files for specific catalogs
data-replicator configs/cross_metastore/volume_defaults.yaml --target-catalogs catalog1,catalog2,catalog3

The solution can be flexibly configured to replicate all or selected objects and data. Some objects, such as storage credentials, might instead be created centrally with Terraform.

# Replicate Storage credentials and External locations
# Prerequisite: cloud identities (AWS role or Azure Managed Identity) set up with the required access to cloud storage
# Configure uc_metadata_defaults.yaml
data-replicator configs/cross_metastore/uc_metadata_defaults.yaml --uc-object-types storage_credential,external_location

# Replicate catalogs and schemas
data-replicator configs/cross_metastore/uc_metadata_defaults.yaml --uc-object-types catalog,schema --target-catalogs catalog1,catalog2,catalog3 

# Replicate tables for specific catalogs
# If streaming tables are in the catalog, they must first be created in the target env using DLT
data-replicator configs/cross_metastore/all_tables_defaults.yaml --target-catalogs catalog1,catalog2,catalog3

# Alternatively replicate tables for specific schemas under a catalog
data-replicator configs/cross_metastore/all_tables_defaults.yaml --target-catalogs catalog1 --target-schemas bronze_1,silver_1

# Alternatively replicate streaming tables only - streaming tables must already exist in target
data-replicator configs/cross_metastore/streaming_tables_defaults.yaml --target-catalogs catalog1 --target-schemas bronze_1,silver_1

# Alternatively replicate delta tables only - streaming tables must already exist in target
data-replicator configs/cross_metastore/delta_tables_defaults.yaml --target-catalogs catalog1 --target-schemas bronze_1,silver_1

# Replicate volume for specific catalogs
data-replicator configs/cross_metastore/uc_metadata_defaults.yaml --uc-object-types volume --target-catalogs catalog1,catalog2,catalog3 

# Replicate SQL streaming tables for specific catalogs
data-replicator configs/cross_metastore/uc_metadata_defaults.yaml --uc-object-types streaming_table --target-catalogs catalog1,catalog2,catalog3

# Replicate SQL materialized views for specific catalogs
data-replicator configs/cross_metastore/uc_metadata_defaults.yaml --uc-object-types materialized_view --target-catalogs catalog1,catalog2,catalog3

# Replicate views for specific catalogs
data-replicator configs/cross_metastore/uc_metadata_defaults.yaml --uc-object-types view --target-catalogs catalog1,catalog2,catalog3

# Replicate tags & comments
data-replicator configs/cross_metastore/uc_metadata_defaults.yaml --uc-object-types table_tag,column_tag,catalog_tag,schema_tag,volume_tag,column_comment --target-catalogs catalog1,catalog2,catalog3 

# Replicate permissions for specific catalogs (Not yet supported - WIP)
# Prerequisite: All user principals should be provisioned in the migrated workspace
data-replicator configs/cross_metastore/uc_metadata_defaults.yaml --uc-object-types permission --target-catalogs catalog1,catalog2,catalog3

# Replicate volume files for specific schemas
data-replicator configs/cross_metastore/volume_defaults.yaml --target-catalogs aaron_replication --target-schemas bronze_1,silver_1

# Alternatively replicate volume files for specific volume
data-replicator configs/cross_metastore/volume_defaults.yaml --target-catalogs aaron_replication --target-schemas bronze_1,silver_1 --target-volumes raw
  5. Deploy: the tool can be deployed as a Workflow Job using DAB. Check the example job in the resources folder:
databricks bundle validate
databricks bundle deploy

How to get help

Databricks support doesn't cover this content. For questions or bugs, please open a GitHub issue and the team will help on a best-effort basis.

License

© 2025 Databricks, Inc. All rights reserved. The source in this notebook is provided subject to the Databricks License [https://databricks.com/db-license-source]. All included or referenced third party libraries are subject to the licenses set forth below.
