Conversation

@bill-ph bill-ph commented Dec 5, 2025

Problem

Part of the PostHog Data Warehouse work stream.

Changes

Follows the same pattern as the data modeling prototype: #42005, #42541, #42231.

What's included:

  • feature flag gating (see the sketch after this list)
  • copy sub-workflow
  • verification
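
For illustration, the flag gate could look roughly like the sketch below; a minimal version using the posthoganalytics client, where the helper name and call site are assumptions rather than the actual code in this PR:

import posthoganalytics

DUCKLAKE_COPY_FLAG = "ducklake-data-imports-copy-workflow"

def should_copy_to_ducklake(team_id: int) -> bool:
    # Evaluate the flag per team; treat a missing or failed evaluation as disabled.
    return bool(posthoganalytics.feature_enabled(DUCKLAKE_COPY_FLAG, str(team_id)))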

What's yet to be done:

How did you test this code?

Unit tests

pytest posthog/temporal/tests/data_imports/test_ducklake_copy_data_imports_workflow.py

Manual test

1. Create and enable the feature flag ducklake-data-imports-copy-workflow.
2. Trigger a data import (e.g. Stripe); if you already have one set up, you can simply go to Temporal and trigger an import workflow.
3. Upon completion of the import workflow, it should trigger a copy sub-workflow of type ducklake-copy.data-imports.
4. This should create a new table in DuckLake, which can be found and queried like this:

duckdb -c "
  INSTALL ducklake;
  LOAD ducklake;

  -- Configure S3 access for the local dev object store
  SET s3_endpoint='localhost:19000';
  SET s3_use_ssl=false;
  SET s3_access_key_id='object_storage_root_user';
  SET s3_secret_access_key='object_storage_root_password';
  SET s3_url_style='path';

  -- Attach the DuckLake catalog: metadata in Postgres, data files on S3
  ATTACH 'ducklake:postgres:dbname=ducklake_catalog host=localhost user=posthog password=posthog'
    AS ducklake (DATA_PATH 's3://ducklake-dev/');
  
  -- List all schemas
  SELECT * FROM information_schema.schemata;
  SELECT * FROM information_schema.schemata WHERE catalog_name = 'ducklake';
  
  -- List tables in your team's schema
  SELECT table_schema, table_name 
  FROM information_schema.tables 
  WHERE table_catalog = 'ducklake';
  
  -- Query a specific table
  SELECT * FROM ducklake.data_imports_team_2.stripe_price_019aebef LIMIT 10;
"

Documentation

posthog/ducklake/README.md

Changelog: (features only) Is this feature complete?

No

github-actions bot commented Dec 5, 2025

Size Change: -9 B (0%)

Total Size: 3.51 MB

Filename                    Size      Change
frontend/dist/toolbar.js    3.51 MB   -9 B (0%)

compressed-size-action

@bill-ph bill-ph marked this pull request as ready for review December 5, 2025 21:25
@greptile-apps greptile-apps bot left a comment

10 files reviewed, no comments


@bill-ph bill-ph requested a review from a team December 5, 2025 21:47
@fuziontech fuziontech left a comment

PR Review: DuckLake Copy for Data Imports

Summary

This PR adds a Temporal workflow to copy data imports (Stripe, Hubspot, etc.) into DuckLake after successful syncs. It follows the existing pattern from the data modeling workflow and includes feature flag gating, verification, and metrics.

Positive Aspects

  1. Clean separation of concerns - Each activity has a single responsibility
  2. Feature flag gating - Safe rollout with ducklake-data-imports-copy-workflow
  3. Fire-and-forget pattern - Uses ParentClosePolicy.ABANDON so the parent workflow succeeds even if the copy fails (see the sketch after this list)
  4. Comprehensive verification - Row counts, schema hashes, and partition count checks
  5. Good test coverage - Tests covering serialization, feature flags, metadata prep, copy execution, and workflow integration
  6. Observability - Metrics for workflow completion and verification results
  7. Documentation - Thorough README with local testing instructions
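
On point 3, the fire-and-forget pattern is worth spelling out. A minimal sketch with the Temporal Python SDK, where the parent workflow class, its inputs, and the child workflow id are assumptions:

from temporalio import workflow
from temporalio.workflow import ParentClosePolicy

@workflow.defn
class ExternalDataJobWorkflow:  # hypothetical parent import workflow
    @workflow.run
    async def run(self, inputs) -> None:
        # ... import activities run here ...

        # Start the DuckLake copy as a detached child. With ABANDON the
        # child keeps running after the parent closes, so a copy failure
        # can never fail the data import itself.
        await workflow.start_child_workflow(
            "ducklake-copy.data-imports",
            inputs,
            id=f"ducklake-copy-{workflow.info().workflow_id}",
            parent_close_policy=ParentClosePolicy.ABANDON,
        )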

Suggestions (Non-blocking)

1. Potential SQL Injection Defense in Depth (Medium)

ducklake_copy_data_imports_workflow.py:233-236:

conn.execute(f"CREATE SCHEMA IF NOT EXISTS {qualified_schema}")
conn.execute(
    f"CREATE OR REPLACE TABLE {qualified_table} AS SELECT * FROM delta_scan(?)",
    [inputs.model.source_table_uri],
)

While _sanitize_ducklake_identifier sanitizes the names, consider using DuckDB's identifier quoting for the schema/table names for defense in depth.
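
For example, a small helper along these lines (a sketch of standard SQL identifier quoting, which DuckDB follows):

def quote_identifier(name: str) -> str:
    # Wrap in double quotes and double any embedded double quote.
    return '"' + name.replace('"', '""') + '"'

# e.g. with the sanitized names from the manual test above:
qualified_table = f'{quote_identifier("data_imports_team_2")}.{quote_identifier("stripe_price_019aebef")}'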

2. Missing Error Handling for Schema Not Found

ducklake_copy_data_imports_workflow.py:176-178 - If a schema is deleted between the parent workflow starting and this activity running, ExternalDataSchema.DoesNotExist will be raised. Could catch and log gracefully rather than failing the entire workflow.
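
A graceful-skip version might look like this (a sketch; the import path, helper name, and return contract are assumptions):

import logging

from posthog.warehouse.models import ExternalDataSchema  # assumed import path

logger = logging.getLogger(__name__)

def fetch_schema_or_skip(schema_id):
    try:
        return ExternalDataSchema.objects.get(id=schema_id)
    except ExternalDataSchema.DoesNotExist:
        # Deleted between the parent workflow starting and this activity
        # running; log and skip instead of failing the whole copy.
        logger.warning("Schema %s no longer exists, skipping DuckLake copy", schema_id)
        return None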

3. Hardcoded Identifier Length Limit

ducklake_copy_data_imports_workflow.py:267 - The 63-character limit appears PostgreSQL-inspired (NAMEDATALEN - 1 = 63 bytes), but DuckDB identifiers can be longer. Consider documenting why 63.

4. Connection Could Use Context Manager

ducklake_copy_data_imports_workflow.py:217-240 - Using with duckdb.connect() as conn: would be slightly cleaner than try/finally.
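
With the context manager, the body from point 1 shrinks to roughly this (reusing the names from that snippet):

import duckdb

with duckdb.connect() as conn:
    conn.execute(f"CREATE SCHEMA IF NOT EXISTS {qualified_schema}")
    conn.execute(
        f"CREATE OR REPLACE TABLE {qualified_table} AS SELECT * FROM delta_scan(?)",
        [inputs.model.source_table_uri],
    )
# The connection is closed on exit, even if execute() raises.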

5. Duplicate Schema Fetch

_run_data_imports_partition_verification and _run_data_imports_schema_verification both call _fetch_delta_schema. These could share the fetched schema to avoid redundant network calls.
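
For instance, fetch once and pass it down (function names from this review; the extra parameter is an assumed signature change):

delta_schema = _fetch_delta_schema(inputs.model.source_table_uri)
partition_result = _run_data_imports_partition_verification(inputs, delta_schema)
schema_result = _run_data_imports_schema_verification(inputs, delta_schema)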

6. Missing Type Annotation

ducklake_copy_data_imports_workflow.py:270 - The logger parameter is missing a type annotation.


Questions

  1. Should the DuckLake copy workflow run on a dedicated task queue (mentioned as TODO)?
  2. What's the expected table size for data imports? The 30-minute timeout seems generous.
  3. Is there a plan to handle partial failures (e.g., 3 of 5 schemas succeed)?

Verdict: Approve

The implementation is solid, follows existing patterns, and has good test coverage. The concerns raised are mostly defensive improvements rather than blocking issues. The fire-and-forget pattern correctly ensures data import success doesn't depend on DuckLake availability.


bill-ph commented Dec 8, 2025

On the review points:

1. SQL injection: sanitization keeps only alphanumeric characters and underscores (_); the dot (.) separating schema and table is added manually, so this should cover all bases.
2. We should fail in that scenario and let an engineer investigate and fix the problem.
3. We do have a plan to use Postgres as the metadata server.
4. Good idea, changed per suggestion.
5. Minor duplication; will evaluate it in the planned de-duplication refactor.
6. Just realized logger is not used; it's likely a remnant of debugging, so I removed the logger param.
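
For context, a sanitizer with the behavior described in point 1 could look like this; a reconstruction from the description, not the actual code, with the 63-byte cap taken from review point 3:

import re

def _sanitize_ducklake_identifier(name: str) -> str:
    # Keep alphanumerics and underscores, replacing everything else; the dot
    # between schema and table is added manually afterwards, so it can never
    # come from user input. 63 mirrors PostgreSQL's identifier limit.
    return re.sub(r"[^A-Za-z0-9_]", "_", name)[:63]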

@bill-ph bill-ph enabled auto-merge (squash) December 8, 2025 15:12
@bill-ph bill-ph merged commit 46ab7a0 into master Dec 8, 2025
183 checks passed
@bill-ph bill-ph deleted the data_import_ducklake_copy branch December 8, 2025 21:59