
MultiCloud UC Replication System for Databricks

A Python plug-in solution that replicates Unity Catalog (UC) data and metadata between Databricks environments. It supports and accelerates workloads in multi-cloud migration, single-cloud migration, workspace migration, disaster recovery (DR), backup and recovery, and multi-cloud data mesh.

Cloud agnostic: supports cross-metastore and same-metastore replication.

Overview

This system provides incremental data and UC metadata replication between Databricks environments, or within the same environment, using Databricks-to-Databricks (D2D) Delta Sharing, Deep Clone, and Auto Loader, with specialized handling for Streaming Tables. It supports multiple operation types that can be run independently or together.

UC Object Types

UC Metadata

| Object Type | Status | Comments |
|---|---|---|
| Storage Credentials | Supported | |
| External Locations | Supported | |
| Catalogs | Supported | |
| Schemas | Supported | |
| Volumes | Supported | |
| Tables | Supported | |
| Views | Supported | |
| Table/View Comments | Supported | |
| SQL-based Materialized Views | Supported | DLT-based MVs should be recreated by DLT in the target |
| SQL-based Streaming Tables | Supported | DLT-based STs should be recreated by DLT in the target |
| Tags (catalog, schema, table, column, view, volume) | Supported | |
| Column Comments | Supported | |
| Permissions | In Development | |
| Functions | In Development | |
| Models | In Development | |
| Governed Tags | In Development | |

Data Replication

| Object Type | Status | Comments |
|---|---|---|
| Managed Tables | Supported | |
| External Tables | Supported | Can be replicated as External or Managed |
| Volume Files | Supported | |
| DLT Streaming Tables (data only, no checkpoints) | Supported | Checkpoints are not replicated; state handling should be managed separately |
| DLT Materialized Views | Not Supported | DLT MVs should be recomputed by DLT in the target |

Other Unsupported Objects

  • Databricks workspace assets are not yet supported, but may be considered in the future roadmap
  • Hive metastore

How It Works

The tool performs a set of operations: backing up/publishing source tables to a Delta Share, replicating UC metadata, replicating tables, and reconciling data. How each operation behaves is fully configurable in a YAML-based file. Below is a high-level summary of what each operation does, followed by an illustrative configuration sketch:

Backup/Publish Operations

  • (Optional if managed by Terraform) Creates the Delta Sharing recipient, creates shares, and adds schemas to the shares
  • For legacy DLT streaming tables, deep-clones the backing tables from the source to backup catalogs and adds the backup schemas to the backup share
  • For Default Publishing Mode (DPM) streaming tables, adds the backing tables to the dpm_backing_tables share directly
  • Not required for UC metadata replication

Data Replication Operations

  • (Optional if managed by Terraform) Creates shared catalogs from the shares
  • Deep-clones tables across workspaces from the shared catalog with schema enforcement
  • For DLT streaming tables, defaults to replicating from the dpm_backing_tables share and falls back to the backup share
  • Incrementally copies volume files across workspaces from the shared catalog using Auto Loader + file copy

Metadata Replication Operations

  • Replicates UC metadata from the source UC to the target UC using Databricks Connect and the Databricks SDK

Reconciliation Operations (Table only)

  • Row count validation
  • Schema structure comparison
  • Missing data detection
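Purely as an illustration of how these operations might be selected in a config file, a minimal sketch is shown below; every key is a hypothetical placeholder rather than the tool's actual schema (see the sample configs and README.yaml for the real options):

# Hypothetical sketch -- placeholder keys, not the tool's actual schema
operations:
  backup_publish: true          # publish source tables to a Delta Share
  replicate_uc_metadata: true   # replicate UC metadata to the target metastore
  replicate_tables: true        # deep-clone tables from the shared catalog
  reconcile: true               # row count / schema / missing-data checks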

Key Features

Delta Sharing

The tool can set up the Delta Sharing infrastructure (recipient, shares, and shared catalogs) automatically for you with default names. Alternatively, it can use existing Delta Sharing infrastructure.
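For illustration only, the choice between tool-managed and pre-existing sharing infrastructure could be expressed roughly as below; the keys and names are assumptions made for this sketch, not the actual config schema:

# Hypothetical sketch -- illustrative keys and names only
delta_sharing:
  manage_infra: true            # let the tool create the recipient, shares and shared catalogs with default names
  # or reuse infrastructure created elsewhere (e.g. by Terraform):
  # manage_infra: false
  # share_name: my_existing_share
  # shared_catalog_name: my_existing_shared_catalog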

Run Anywhere

  • Runs on both Serverless (recommended) and non-Serverless compute (DBR 16.4+)
  • Runs in the source workspace, the target workspace, or outside Databricks
  • Runs from the CLI, or can be deployed via DAB as a workflow job

Flexible Configuration

  • YAML-based configuration with Pydantic validation
  • Hierarchical configuration with inheritance and CLI overrides
  • Environments to manage environment-specific connections and configurations
  • Substitution support to allow dynamic config strings
  • Flexible selective replication by object type
  • Flexible selective replication by catalog, schema, and table (see the sketch below)
  • Check this for complete config features.
  • Check this for available configs with detailed descriptions.
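As a purely illustrative sketch of these features (placeholder keys; the linked config documentation is authoritative), hierarchical configuration with environments, substitutions, and selective scope might look like:

# Hypothetical sketch -- placeholder keys, not the tool's actual schema
environment: prod                       # pick an environment-specific connection block
substitutions:
  region: eastus                        # could be referenced elsewhere, e.g. as ${region}
scope:
  uc_object_types: [catalog, schema, table, view, volume]
  target_catalogs: [catalog1, catalog2]
  target_schemas: [bronze_1, silver_1]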

Incremental Data Replication

The system leverages native Deep Clone and Auto Loader for incrementality and replication performance:

  • Deep Clone for Delta tables
  • Auto Loader + file copy for volume file replication, with an option to specify a starting timestamp
  • Option to replicate external tables as managed tables to support managed table migration (see the sketch below)
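The options above could be expressed roughly as follows; the keys are hypothetical and are shown only to make the behaviour concrete:

# Hypothetical sketch -- illustrative keys only
tables:
  replicate_external_as_managed: true            # land external tables as managed tables in the target
volumes:
  starting_timestamp: "2024-01-01T00:00:00Z"     # only copy volume files added/modified after this point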

Streaming Table Handling

The system automatically handles Streaming Table complexities:

  • Exports legacy ST backing tables to a share, or adds DPM ST backing tables to a share
  • Constructs the backing table path using the pipeline ID
  • Deep-clones the ST backing tables rather than the ST tables directly (see the sketch below)
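As a rough, hypothetical illustration of the pieces involved (only the dpm_backing_tables share name comes from the operations described above; the other names and keys are invented for this sketch):

# Hypothetical sketch -- names and keys are illustrative only
streaming_tables:
  dpm_share: dpm_backing_tables          # DPM ST backing tables are added to this share directly
  backup_catalog: st_backup              # legacy ST backing tables are deep-cloned here first
  backup_share: st_backup_share          # the backup schemas are then added to this share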

UC Metadata Replication

Replicates UC metadata between the source and target metastores:

  • Creates or updates target UC objects via the Databricks SDK or SQL
  • Incremental tag replication

Parallel Replication

  • Configurable concurrency with multithreading
  • Parallel table/volume replication
  • Concurrent volume file copy (see the sketch below)
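A hypothetical concurrency block, with placeholder keys, might look like:

# Hypothetical sketch -- illustrative keys only
concurrency:
  max_parallel_tables: 8          # tables/volumes replicated concurrently
  max_parallel_file_copies: 16    # concurrent file copies within a volume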

Robust Logging & Error Handling

  • Configurable retry logic with exponential backoff (see the sketch below)
  • Graceful degradation: operations continue if individual objects fail
  • Comprehensive error logging with run IDs and full stack traces
  • All operations tracked in audit tables for monitoring and alerting
  • All executed SQL printed in DEBUG mode for easy troubleshooting
  • Check here to learn more about logging
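Purely as an illustration (placeholder keys, not the actual schema), retry and logging settings could be sketched as:

# Hypothetical sketch -- illustrative keys only
logging:
  level: DEBUG            # DEBUG also prints all executed SQL
retry:
  max_attempts: 5
  backoff_seconds: 2      # doubled on each retry (exponential backoff)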

Prerequisites

  • A user or service principal created in both the source and target workspaces with metastore admin rights. If metastore admin permission is not available, check here to apply more granular UC access control

  • For cross-metastore replication, enable Delta Sharing (DS), including network connectivity: https://docs.databricks.com/aws/en/delta-sharing/set-up#gsc.tab=0

  • Table replication requires Delta Sharing with the Cloud Token method instead of Presigned URLs

  • A PAT or OAuth token for the user or SP, created and stored as Databricks secrets. Note: if this tool runs in the source workspace, only the target workspace token secret needs to be created in the source. Conversely, if it runs in the target workspace, the source token needs to be created in the target.

  • Network connectivity to the source or target workspace. For example, if the tool runs in the source workspace, the source data plane (outbound) should be able to establish a connection to the target workspace control plane (inbound), and vice versa. Note: UC replication requires connectivity to both the source and target workspaces using Databricks Connect.

  • If the tool runs outside of a Databricks workspace and Serverless is unavailable in the source and/or target workspace, the cluster ID of an all-purpose cluster in the source and/or target workspace needs to be provided

Getting Started

  1. Install the Databricks CLI from https://docs.databricks.com/dev-tools/cli/databricks-cli.html

  2. Set up the dev environment:

git clone <repository-url>
cd <repository folder>
make setup
source .venv/bin/activate
  3. Configure the environments.yaml file with connection details, for example:
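A minimal, hypothetical environments.yaml sketch is shown below; the field names are assumptions made for illustration, so refer to the sample configs for the actual structure:

# Hypothetical environments.yaml sketch -- field names are illustrative only
environments:
  source:
    host: https://adb-1111111111111111.11.azuredatabricks.net
    token_secret_scope: replication    # secret scope holding the PAT/OAuth token
    token_secret_key: source-token
    cluster_id: 0101-123456-abcdefgh   # only needed when Serverless is unavailable
  target:
    host: https://dbc-22222222-2222.cloud.databricks.com
    token_secret_scope: replication
    token_secret_key: target-token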

  4. Start replication

  • Clone and modify the sample configs in the configs folder. Configs with the _defaults suffix allow you to set up replication using system-generated default names and settings with minimal configuration.
  • High-level replication steps are also described in the sample configs
  • For a more comprehensive understanding of the available configs, check README.yaml
# Check all available args
data-replicator --help

# Validate configuration without running
data-replicator configs/cross_metastore/uc_metadata_defaults.yaml --validate-only
data-replicator configs/cross_metastore/all_tables_defaults.yaml --validate-only
data-replicator configs/cross_metastore/volume_defaults.yaml --validate-only

# Replicate all UC metadata
# Set storage_credential_config if storage credentials need to be replicated
# Set cloud_url_mapping if external location or external table need to be replicated
# Objects will be replicated in the following logical order: Storage credentials -> External locations -> Catalogs -> Schemas -> Tables -> Views -> Volumes -> Column Tags -> Column Comments -> Permissions
data-replicator configs/cross_metastore/uc_metadata_defaults.yaml --uc-object-types all --target-catalogs catalog1,catalog2,catalog3

# Replicate delta tables for specific catalogs
# DLT streaming tables must already exist in the target before replicating
data-replicator configs/cross_metastore/all_tables_defaults.yaml --target-catalogs catalog1,catalog2,catalog3

# Replicate volume files for specific catalogs
data-replicator configs/cross_metastore/volume_defaults.yaml --target-catalogs catalog1,catalog2,catalog3

The solution can be flexibly configured to replicate all or selected objects and data. Some objects, such as storage credentials, might instead be created centrally with Terraform.

# Replicate Storage credentials and External locations
# Prerequisite: cloud identities (AWS role or Azure Managed Identity) set up with the required access to cloud storage
# Configure uc_metadata_defaults.yaml
data-replicator configs/cross_metastore/uc_metadata_defaults.yaml --uc-object-types storage_credential,external_location

# Replicate catalogs and schemas
data-replicator configs/cross_metastore/uc_metadata_defaults.yaml --uc-object-types catalog,schema --target-catalogs catalog1,catalog2,catalog3 

# Replicate tables for specific catalogs
# If streaming tables are in the catalog, they must first be created in the target env using DLT
data-replicator configs/cross_metastore/all_tables_defaults.yaml --target-catalogs catalog1,catalog2,catalog3

# Alternatively replicate tables for specific schemas under a catalog
data-replicator configs/cross_metastore/all_tables_defaults.yaml --target-catalogs catalog1 --target-schemas bronze_1,silver_1

# Alternatively replicate streaming tables only - streaming tables must already exist in target
data-replicator configs/cross_metastore/streaming_tables_defaults.yaml --target-catalogs catalog1 --target-schemas bronze_1,silver_1

# Alternatively replicate delta tables only - streaming tables must already exist in target
data-replicator configs/cross_metastore/delta_tables_defaults.yaml --target-catalogs catalog1 --target-schemas bronze_1,silver_1

# Replicate volume for specific catalogs
data-replicator configs/cross_metastore/uc_metadata_defaults.yaml --uc-object-types volume --target-catalogs catalog1,catalog2,catalog3 

# Replicate SQL streaming tables for specific catalogs
data-replicator configs/cross_metastore/uc_metadata_defaults.yaml --uc-object-types streaming_table --target-catalogs catalog1,catalog2,catalog3

# Replicate SQL materialized views for specific catalogs
data-replicator configs/cross_metastore/uc_metadata_defaults.yaml --uc-object-types materialized_view --target-catalogs catalog1,catalog2,catalog3

# Replicate views for specific catalogs
data-replicator configs/cross_metastore/uc_metadata_defaults.yaml --uc-object-types view --target-catalogs catalog1,catalog2,catalog3

# Replicate tags & comments
data-replicator configs/cross_metastore/uc_metadata_defaults.yaml --uc-object-types table_tag,column_tag,catalog_tag,schema_tag,volume_tag,column_comment --target-catalogs catalog1,catalog2,catalog3 

# Replicate permissions for specific catalogs (Not yet supported - WIP)
# Prerequisite: All user principals should be provisioned in the migrated workspace
data-replicator configs/cross_metastore/uc_metadata_defaults.yaml --uc-object-types permission --target-catalogs catalog1,catalog2,catalog3

# Replicate volume files for specific schemas
data-replicator configs/cross_metastore/volume_defaults.yaml --target-catalogs aaron_replication --target-schemas bronze_1,silver_1

# Alternatively replicate volume files for specific volume
data-replicator configs/cross_metastore/volume_defaults.yaml --target-catalogs aaron_replication --target-schemas bronze_1,silver_1 --target-volumes raw
  5. Deploy: the tool can be deployed as a Workflow Job using DAB. Check the example job in the resources folder:
databricks bundle validate
databricks bundle deploy

How to get help

Databricks support doesn't cover this content. For questions or bugs, please open a GitHub issue and the team will help on a best-effort basis.

License

© 2025 Databricks, Inc. All rights reserved. The source in this notebook is provided subject to the Databricks License [https://databricks.com/db-license-source]. All included or referenced third party libraries are subject to the licenses set forth below.
