Conversation

@bill-ph bill-ph commented Dec 5, 2025

Problem

Part of the PostHog Data Warehouse work stream.

Changes

Follows the same pattern as the data modeling prototype: #42005, #42541, #42231.

What's included:

  • feature flag gating (see the sketch after this list)
  • copy sub-workflow
  • verification
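
For illustration, the flag gate could look roughly like the sketch below; a minimal version using the posthoganalytics client, where the helper name and call site are assumptions rather than the actual code in this PR:

import posthoganalytics

DUCKLAKE_COPY_FLAG = "ducklake-data-imports-copy-workflow"

def should_copy_to_ducklake(team_id: int) -> bool:
    # Evaluate the flag per team; treat a missing or failed evaluation as disabled.
    return bool(posthoganalytics.feature_enabled(DUCKLAKE_COPY_FLAG, str(team_id)))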

What's yet to be done:

How did you test this code?

Unit tests

pytest posthog/temporal/tests/data_imports/test_ducklake_copy_data_imports_workflow.py

Manual test

1. Create and enable the feature flag ducklake-data-imports-copy-workflow.
2. Trigger a data import (e.g. Stripe); if you already have one set up, you can simply go to Temporal and trigger an import workflow.
3. Upon completion of the import workflow, it should trigger a copy sub-workflow of type ducklake-copy.data-imports.
4. This should create a new table in DuckLake, which can be found and queried like this:

duckdb -c "
  INSTALL ducklake;
  LOAD ducklake;

  -- Configure S3 access for the local dev object store
  SET s3_endpoint='localhost:19000';
  SET s3_use_ssl=false;
  SET s3_access_key_id='object_storage_root_user';
  SET s3_secret_access_key='object_storage_root_password';
  SET s3_url_style='path';

  -- Attach the DuckLake catalog: metadata in Postgres, data files on S3
  ATTACH 'ducklake:postgres:dbname=ducklake_catalog host=localhost user=posthog password=posthog'
    AS ducklake (DATA_PATH 's3://ducklake-dev/');
  
  -- List all schemas
  SELECT * FROM information_schema.schemata;
  SELECT * FROM information_schema.schemata WHERE catalog_name = 'ducklake';
  
  -- List tables in your team's schema
  SELECT table_schema, table_name 
  FROM information_schema.tables 
  WHERE table_catalog = 'ducklake';
  
  -- Query a specific table
  SELECT * FROM ducklake.data_imports_team_2.stripe_price_019aebef LIMIT 10;
"

Documentation

posthog/ducklake/README.md

Changelog: (features only) Is this feature complete?

No

github-actions bot commented Dec 5, 2025

Size Change: -9 B (0%)

Total Size: 3.51 MB

Filename                    Size      Change
frontend/dist/toolbar.js    3.51 MB   -9 B (0%)

compressed-size-action

@bill-ph bill-ph marked this pull request as ready for review December 5, 2025 21:25
@greptile-apps greptile-apps bot left a comment

10 files reviewed, no comments


@bill-ph bill-ph requested a review from a team December 5, 2025 21:47
@fuziontech fuziontech left a comment

PR Review: DuckLake Copy for Data Imports

Summary

This PR adds a Temporal workflow to copy data imports (Stripe, Hubspot, etc.) into DuckLake after successful syncs. It follows the existing pattern from the data modeling workflow and includes feature flag gating, verification, and metrics.

Positive Aspects

  1. Clean separation of concerns - Each activity has a single responsibility
  2. Feature flag gating - Safe rollout with ducklake-data-imports-copy-workflow
  3. Fire-and-forget pattern - Uses ParentClosePolicy.ABANDON so the parent workflow succeeds even if the copy fails (see the sketch after this list)
  4. Comprehensive verification - Row counts, schema hashes, and partition count checks
  5. Good test coverage - Tests covering serialization, feature flags, metadata prep, copy execution, and workflow integration
  6. Observability - Metrics for workflow completion and verification results
  7. Documentation - Thorough README with local testing instructions
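
On point 3, the fire-and-forget pattern is worth spelling out. A minimal sketch with the Temporal Python SDK, where the parent workflow class, its inputs, and the child workflow id are assumptions:

from temporalio import workflow
from temporalio.workflow import ParentClosePolicy

@workflow.defn
class ExternalDataJobWorkflow:  # hypothetical parent import workflow
    @workflow.run
    async def run(self, inputs) -> None:
        # ... import activities run here ...

        # Start the DuckLake copy as a detached child. With ABANDON the
        # child keeps running after the parent closes, so a copy failure
        # can never fail the data import itself.
        await workflow.start_child_workflow(
            "ducklake-copy.data-imports",
            inputs,
            id=f"ducklake-copy-{workflow.info().workflow_id}",
            parent_close_policy=ParentClosePolicy.ABANDON,
        )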

Suggestions (Non-blocking)

1. Potential SQL Injection Defense in Depth (Medium)

ducklake_copy_data_imports_workflow.py:233-236:

conn.execute(f"CREATE SCHEMA IF NOT EXISTS {qualified_schema}")
conn.execute(
    f"CREATE OR REPLACE TABLE {qualified_table} AS SELECT * FROM delta_scan(?)",
    [inputs.model.source_table_uri],
)

While _sanitize_ducklake_identifier sanitizes the names, consider using DuckDB's identifier quoting for the schema/table names for defense in depth.
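
For example, a small helper along these lines (a sketch of standard SQL identifier quoting, which DuckDB follows):

def quote_identifier(name: str) -> str:
    # Wrap in double quotes and double any embedded double quote.
    return '"' + name.replace('"', '""') + '"'

# e.g. with the sanitized names from the manual test above:
qualified_table = f'{quote_identifier("data_imports_team_2")}.{quote_identifier("stripe_price_019aebef")}'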

2. Missing Error Handling for Schema Not Found

ducklake_copy_data_imports_workflow.py:176-178 - If a schema is deleted between the parent workflow starting and this activity running, ExternalDataSchema.DoesNotExist will be raised. Could catch and log gracefully rather than failing the entire workflow.
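
A graceful-skip version might look like this (a sketch; the import path, helper name, and return contract are assumptions):

import logging

from posthog.warehouse.models import ExternalDataSchema  # assumed import path

logger = logging.getLogger(__name__)

def fetch_schema_or_skip(schema_id):
    try:
        return ExternalDataSchema.objects.get(id=schema_id)
    except ExternalDataSchema.DoesNotExist:
        # Deleted between the parent workflow starting and this activity
        # running; log and skip instead of failing the whole copy.
        logger.warning("Schema %s no longer exists, skipping DuckLake copy", schema_id)
        return None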

3. Hardcoded Identifier Length Limit

ducklake_copy_data_imports_workflow.py:267 - The 63-character limit appears PostgreSQL-inspired (NAMEDATALEN - 1 = 63 bytes), but DuckDB identifiers can be longer. Consider documenting why 63.

4. Connection Could Use Context Manager

ducklake_copy_data_imports_workflow.py:217-240 - Using with duckdb.connect() as conn: would be slightly cleaner than try/finally.
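
With the context manager, the body from point 1 shrinks to roughly this (reusing the names from that snippet):

import duckdb

with duckdb.connect() as conn:
    conn.execute(f"CREATE SCHEMA IF NOT EXISTS {qualified_schema}")
    conn.execute(
        f"CREATE OR REPLACE TABLE {qualified_table} AS SELECT * FROM delta_scan(?)",
        [inputs.model.source_table_uri],
    )
# The connection is closed on exit, even if execute() raises.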

5. Duplicate Schema Fetch

_run_data_imports_partition_verification and _run_data_imports_schema_verification both call _fetch_delta_schema. These could share the fetched schema to avoid redundant network calls.
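
For instance, fetch once and pass it down (function names from this review; the extra parameter is an assumed signature change):

delta_schema = _fetch_delta_schema(inputs.model.source_table_uri)
partition_result = _run_data_imports_partition_verification(inputs, delta_schema)
schema_result = _run_data_imports_schema_verification(inputs, delta_schema)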

6. Missing Type Annotation

ducklake_copy_data_imports_workflow.py:270 - The logger parameter is missing a type annotation.


Questions

  1. Should the DuckLake copy workflow run on a dedicated task queue (mentioned as TODO)?
  2. What's the expected table size for data imports? The 30-minute timeout seems generous.
  3. Is there a plan to handle partial failures (e.g., 3 of 5 schemas succeed)?

Verdict: Approve

The implementation is solid, follows existing patterns, and has good test coverage. The concerns raised are mostly defensive improvements rather than blocking issues. The fire-and-forget pattern correctly ensures data import success doesn't depend on DuckLake availability.


bill-ph commented Dec 8, 2025

On the review points:

1. SQL injection: sanitization keeps only alphanumeric characters and underscores (_); the dot (.) separating schema and table is added manually, so this should cover all bases.
2. We should fail in that scenario and let an engineer investigate and fix the problem.
3. We do have a plan to use Postgres as the metadata server.
4. Good idea, changed per suggestion.
5. Minor duplication; will evaluate it in the planned de-duplication refactor.
6. Just realized logger is not used; it's likely a remnant of debugging, so I removed the logger param.
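
For context, a sanitizer with the behavior described in point 1 could look like this; a reconstruction from the description, not the actual code, with the 63-byte cap taken from review point 3:

import re

def _sanitize_ducklake_identifier(name: str) -> str:
    # Keep alphanumerics and underscores, replacing everything else; the dot
    # between schema and table is added manually afterwards, so it can never
    # come from user input. 63 mirrors PostgreSQL's identifier limit.
    return re.sub(r"[^A-Za-z0-9_]", "_", name)[:63]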

@bill-ph bill-ph enabled auto-merge (squash) December 8, 2025 15:12
@bill-ph bill-ph merged commit 46ab7a0 into master Dec 8, 2025
183 checks passed
@bill-ph bill-ph deleted the data_import_ducklake_copy branch December 8, 2025 21:59