
Commit 028d16f

parallel streaming: various improvements
- Configurable reorg buffer
- Create table ahead of spinning up parallel workers to ensure it's ready for all of them and avoid complexity of thread locking
- SQL variables for string replacement
- Better docs, including limitations
1 parent 0297ae8 commit 028d16f

5 files changed: +141 -49 lines changed

docs/parallel_streaming_usage.md

Lines changed: 27 additions & 4 deletions
@@ -233,7 +233,23 @@ except KeyboardInterrupt:
     print("\nStopped by user")
 ```
 
-**Note on Reorg Buffer**: When transitioning from parallel catchup to continuous streaming, the system automatically starts continuous streaming from `detected_max_block - 200`. This 200-block overlap ensures that any reorgs that occurred during the parallel catchup phase are detected and handled properly. With reorg detection enabled, duplicate blocks are automatically handled correctly.
+**Note on Reorg Buffer**: When transitioning from parallel catchup to continuous streaming, the system automatically starts continuous streaming from `detected_max_block - reorg_buffer` (default: 200 blocks). This overlap ensures that any reorgs that occurred during the parallel catchup phase are detected and handled properly. With reorg detection enabled, duplicate blocks are automatically handled correctly. The `reorg_buffer` can be customized via `ParallelConfig(reorg_buffer=N)`.
+
+## Limitations
+
+Currently, parallel streaming has the following limitations:
+
+1. **Block-based partitioning only**: Only supports partitioning by block number columns (`block_num` or `_block_num`). Tables without block numbers cannot use parallel execution.
+
+2. **Schema detection requires data**: Pre-flight schema detection requires at least 1 row in the source table. Empty tables will skip pre-flight creation and let workers handle it.
+
+3. **Static partitioning**: Partitions are created upfront based on the block range. The system does not support dynamic repartitioning during execution.
+
+4. **Thread-level parallelism**: Uses Python threads (ThreadPoolExecutor), not processes. For CPU-bound transformations, performance may be limited by the GIL.
+
+5. **Single table queries**: The partitioning strategy works best with queries against a single table. Complex joins or unions may require careful query structuring.
+
+6. **Reorg buffer configuration**: The `reorg_buffer` parameter (default: 200 blocks) is configurable but applies uniformly. Per-chain customization requires separate `ParallelConfig` instances.
 
 ## Performance Characteristics
 
@@ -301,16 +317,23 @@ Result: Zero data gaps, all reorgs caught ✓
 ─────────────────────────────────────────────────────────────────────
 ```
 
-**Why 200 blocks?**
+**Why 200 blocks (default)?**
 - Ethereum average reorg depth: 1-5 blocks
 - 200 blocks = ~40 minutes of history
 - Provides safety margin for deep reorgs that occurred during catchup
 - Small performance cost (200 blocks re-loaded) vs high data integrity value
 
 **Customizing the Buffer:**
-Currently hardcoded to 200 blocks. To modify, edit `parallel.py`:
+The reorg buffer is fully configurable via `ParallelConfig`:
 ```python
-reorg_buffer = 200  # Increase for networks with deeper reorgs
+parallel_config = ParallelConfig(
+    num_workers=4,
+    table_name='eth_firehose.blocks',
+    min_block=0,
+    max_block=None,  # Hybrid mode
+    reorg_buffer=500,  # Increase for networks with deeper reorgs (e.g., testnets)
+    block_column='block_num'
+)
 ```
 
 ### Custom Partition Filters
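To make the catch-up-to-continuous hand-off concrete, here is a minimal sketch of the arithmetic described above, using illustrative block numbers; the `max(...)` guard mirrors what `execute_parallel_stream` does with `ParallelConfig.min_block` and `reorg_buffer`:

```python
# Illustrative values: detected_max_block comes from the MAX(block_num) probe,
# min_block and reorg_buffer from ParallelConfig.
min_block = 0
detected_max_block = 21_000_000
reorg_buffer = 200

# Continuous streaming resumes this far back so any reorg that landed during
# parallel catch-up is re-observed and handled by reorg detection.
continuous_start_block = max(min_block, detected_max_block - reorg_buffer)
assert continuous_start_block == 20_999_800
```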

src/amp/client.py

Lines changed: 10 additions & 5 deletions
@@ -9,7 +9,13 @@
 from .config.connection_manager import ConnectionManager
 from .loaders.registry import create_loader, get_available_loaders
 from .loaders.types import LoadConfig, LoadMode, LoadResult
-from .streaming import ParallelConfig, ParallelStreamExecutor, ReorgAwareStream, ResumeWatermark, StreamingResultIterator
+from .streaming import (
+    ParallelConfig,
+    ParallelStreamExecutor,
+    ReorgAwareStream,
+    ResumeWatermark,
+    StreamingResultIterator,
+)
 
 
 class QueryBuilder:
@@ -57,7 +63,7 @@ def load(
 
         # Validate that parallel_config is only used with stream=True
         if kwargs.get('parallel_config'):
-            raise ValueError("parallel_config requires stream=True")
+            raise ValueError('parallel_config requires stream=True')
 
         # Default to batch streaming (read_all=False) for memory efficiency
         kwargs.setdefault('read_all', False)
@@ -238,7 +244,7 @@ def _load_table(
                 table_name=table_name,
                 loader_type=loader,
                 success=False,
-                error=str(e)
+                error=str(e),
             )
 
     def _load_stream(
@@ -264,7 +270,7 @@ def _load_stream(
                 table_name=table_name,
                 loader_type=loader,
                 success=False,
-                error=str(e)
+                error=str(e),
             )
 
     def query_and_load_streaming(
@@ -389,4 +395,3 @@ def query_and_load_streaming(
                 error=str(e),
                 metadata={'streaming_error': True},
             )
-
src/amp/loaders/base.py

Lines changed: 0 additions & 1 deletion
@@ -14,7 +14,6 @@
 from ..streaming.types import BlockRange, ResponseBatchWithReorg
 from .types import LoadMode, LoadResult
 
-
 # Type variable for configuration classes
 TConfig = TypeVar('TConfig')
 
src/amp/loaders/implementations/snowflake_loader.py

Lines changed: 5 additions & 1 deletion
@@ -163,8 +163,12 @@ def _load_batch_impl(self, batch: pa.RecordBatch, table_name: str, **kwargs) ->
                 'Please use APPEND mode or manually truncate/drop the table before loading.'
             )
 
+        # Table creation is now handled by base class or pre-flight creation in parallel mode
+        # For pandas loading, we skip manual table creation and let write_pandas handle it
         if create_table and table_name.upper() not in self._created_tables:
-            self._create_table_from_schema(batch.schema, table_name)
+            # For pandas, skip table creation - write_pandas will handle it
+            if self.loading_method != 'pandas':
+                self._create_table_from_schema(batch.schema, table_name)
             self._created_tables.add(table_name.upper())
 
         if self.use_stage:
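For background on the pandas branch above: `write_pandas` in snowflake-connector-python can create the destination table itself, which is why the manual `_create_table_from_schema` call is skipped for that loading method. A hedged sketch, with the connection and DataFrame assumed:

```python
from snowflake.connector.pandas_tools import write_pandas

# When auto_create_table=True, write_pandas derives the table from the
# DataFrame's schema, so no separate CREATE TABLE statement is needed.
success, n_chunks, n_rows, _ = write_pandas(
    conn,  # an open snowflake.connector connection (assumed)
    df,  # pandas DataFrame holding the batch (assumed)
    table_name='MY_TABLE',
    auto_create_table=True,
)
```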

src/amp/streaming/parallel.py

Lines changed: 99 additions & 38 deletions
@@ -1,12 +1,10 @@
 """
 Parallel streaming implementation for high-throughput data loading.
 
-This module implements parallel query execution using ThreadPoolExecutor.
-It partitions streaming queries by block_num ranges using CTEs (Common Table Expressions)
-that DataFusion inlines efficiently.
+This module implements parallel query execution using ThreadPoolExecutor.
+It partitions streaming queries by block_num ranges.
 
 Key design decisions:
-- Uses CTEs to shadow table names with filtered versions for clean partitioning
 - Only supports streaming queries (not regular load operations)
 - Block range partitioning only (block_num or _block_num columns)
 """
2422
if TYPE_CHECKING:
2523
from ..client import Client
2624

25+
# SQL keyword constants for query parsing
26+
_WHERE = ' WHERE '
27+
_ORDER_BY = ' ORDER BY '
28+
_LIMIT = ' LIMIT '
29+
_GROUP_BY = ' GROUP BY '
30+
_SETTINGS = ' SETTINGS '
31+
_STREAM_TRUE = 'STREAM = TRUE'
32+
2733

2834
@dataclass
2935
class QueryPartition:
@@ -56,6 +62,7 @@ class ParallelConfig:
5662
partition_size: Optional[int] = None # Blocks per partition (auto-calculated if not set)
5763
block_column: str = 'block_num' # Column name to partition on
5864
stop_on_error: bool = False # Stop all workers on first error
65+
reorg_buffer: int = 200 # Block overlap when transitioning to continuous streaming (for reorg detection)
5966

6067
def __post_init__(self):
6168
if self.num_workers < 1:
@@ -175,24 +182,23 @@ def wrap_query_with_partition(self, user_query: str, partition: QueryPartition)
175182

176183
# Create partition filter
177184
partition_filter = (
178-
f"{partition.block_column} >= {partition.start_block} "
179-
f"AND {partition.block_column} < {partition.end_block}"
185+
f'{partition.block_column} >= {partition.start_block} AND {partition.block_column} < {partition.end_block}'
180186
)
181187

182188
# Check if query already has a WHERE clause (case-insensitive)
183189
# Look for WHERE before any ORDER BY, LIMIT, or SETTINGS clauses
184190
query_upper = user_query.upper()
185191

186192
# Find WHERE position
187-
where_pos = query_upper.find(' WHERE ')
193+
where_pos = query_upper.find(_WHERE)
188194

189195
if where_pos != -1:
190196
# Query has WHERE clause - append with AND
191197
# Need to insert before ORDER BY, LIMIT, GROUP BY, or SETTINGS if they exist
192-
insert_pos = where_pos + len(' WHERE ')
198+
insert_pos = where_pos + len(_WHERE)
193199

194200
# Find the end of the WHERE clause (before ORDER BY, LIMIT, GROUP BY, SETTINGS)
195-
end_keywords = [' ORDER BY ', ' LIMIT ', ' GROUP BY ', ' SETTINGS ']
201+
end_keywords = [_ORDER_BY, _LIMIT, _GROUP_BY, _SETTINGS]
196202
end_pos = len(user_query)
197203

198204
for keyword in end_keywords:
@@ -201,14 +207,10 @@ def wrap_query_with_partition(self, user_query: str, partition: QueryPartition)
201207
end_pos = keyword_pos
202208

203209
# Insert partition filter with AND
204-
partitioned_query = (
205-
user_query[:end_pos] +
206-
f" AND ({partition_filter})" +
207-
user_query[end_pos:]
208-
)
210+
partitioned_query = user_query[:end_pos] + f' AND ({partition_filter})' + user_query[end_pos:]
209211
else:
210212
# No WHERE clause - add one before ORDER BY, LIMIT, GROUP BY, or SETTINGS
211-
end_keywords = [' ORDER BY ', ' LIMIT ', ' GROUP BY ', ' SETTINGS ']
213+
end_keywords = [_ORDER_BY, _LIMIT, _GROUP_BY, _SETTINGS]
212214
insert_pos = len(user_query)
213215

214216
for keyword in end_keywords:
@@ -217,11 +219,7 @@ def wrap_query_with_partition(self, user_query: str, partition: QueryPartition)
                     insert_pos = keyword_pos
 
             # Insert WHERE clause with partition filter
-            partitioned_query = (
-                user_query[:insert_pos] +
-                f" WHERE {partition_filter}" +
-                user_query[insert_pos:]
-            )
+            partitioned_query = user_query[:insert_pos] + f' WHERE {partition_filter}' + user_query[insert_pos:]
 
         return partitioned_query
 
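An illustrative input/output pair for `wrap_query_with_partition`, assuming a partition covering blocks `[0, 1_000_000)` on `block_num` (the query text is hypothetical):

```python
user_query = 'SELECT * FROM eth_firehose.blocks ORDER BY block_num'

# No WHERE clause exists, so the filter is inserted just before ORDER BY:
# 'SELECT * FROM eth_firehose.blocks WHERE block_num >= 0 AND block_num < 1000000
#  ORDER BY block_num'
```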
@@ -270,7 +268,7 @@ def _detect_current_max_block(self) -> int:
         Raises:
             RuntimeError: If query fails or returns no results
         """
-        query = f"SELECT MAX({self.config.block_column}) as max_block FROM {self.config.table_name}"
+        query = f'SELECT MAX({self.config.block_column}) as max_block FROM {self.config.table_name}'
         self.logger.info(f'Detecting current max block with query: {query}')
 
         try:
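Rendered with the configuration used in the docs examples (`block_column='block_num'`, `table_name='eth_firehose.blocks'`), the probe would read:

```python
# Illustrative rendering of the f-string above:
query = 'SELECT MAX(block_num) as max_block FROM eth_firehose.blocks'
```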
@@ -290,7 +288,7 @@ def _detect_current_max_block(self) -> int:
 
         except Exception as e:
             self.logger.error(f'Failed to detect max block: {e}')
-            raise RuntimeError(f'Failed to detect current max block from {self.config.table_name}: {e}')
+            raise RuntimeError(f'Failed to detect current max block from {self.config.table_name}: {e}') from e
 
     def execute_parallel_stream(
         self, user_query: str, destination: str, connection_name: str, load_config: Optional[Dict[str, Any]] = None
@@ -377,15 +375,83 @@ def execute_parallel_stream(
             f'Starting parallel streaming with {len(partitions)} partitions across {self.config.num_workers} workers'
         )
 
-        # 2. Submit worker tasks
+        # 2. Pre-flight table creation (before workers start)
+        # Create table once to avoid locking complexity in parallel workers
+        try:
+            # Get connection info
+            connection_info = self.client.connection_manager.get_connection_info(connection_name)
+            loader_config = connection_info['config']
+            loader_type = connection_info['loader']
+
+            # Get sample schema by executing LIMIT 1 on original query
+            # We don't need partition filtering for schema detection, just need any row
+            sample_query = user_query.strip().rstrip(';')
+
+            # Remove SETTINGS clause (especially stream = true) to avoid streaming mode
+            sample_query_upper = sample_query.upper()
+            settings_pos = sample_query_upper.find(_SETTINGS)
+            if settings_pos != -1:
+                sample_query = sample_query[:settings_pos].rstrip()
+                sample_query_upper = sample_query.upper()
+
+            # Insert LIMIT 1 before ORDER BY, GROUP BY if present
+            end_keywords = [_ORDER_BY, _GROUP_BY]
+            insert_pos = len(sample_query)
+
+            for keyword in end_keywords:
+                keyword_pos = sample_query_upper.find(keyword)
+                if keyword_pos != -1 and keyword_pos < insert_pos:
+                    insert_pos = keyword_pos
+
+            # Insert LIMIT 1 at the correct position
+            sample_query = sample_query[:insert_pos].rstrip() + ' LIMIT 1' + sample_query[insert_pos:]
+
+            self.logger.debug(f'Fetching schema with sample query: {sample_query[:100]}...')
+            sample_table = self.client.get_sql(sample_query, read_all=True)
+
+            if sample_table.num_rows > 0:
+                # Create loader instance to get effective schema and create table
+                from ..loaders.registry import create_loader
+
+                loader_instance = create_loader(loader_type, loader_config)
+
+                try:
+                    loader_instance.connect()
+
+                    # Get schema from sample batch
+                    sample_batch = sample_table.to_batches()[0]
+                    effective_schema = sample_batch.schema
+
+                    # Create table once with schema
+                    if hasattr(loader_instance, '_create_table_from_schema'):
+                        loader_instance._create_table_from_schema(effective_schema, destination)
+                        loader_instance._created_tables.add(destination)
+                        self.logger.info(f"Pre-created table '{destination}' with {len(effective_schema)} columns")
+                    else:
+                        self.logger.warning('Loader does not support table creation, workers will handle it')
+                finally:
+                    loader_instance.disconnect()
+            else:
+                self.logger.warning('Sample query returned no rows, skipping pre-flight table creation')
+
+            # Update load_config to skip table creation in workers
+            load_config['create_table'] = False
+
+        except Exception as e:
+            self.logger.warning(
+                f'Pre-flight table creation failed: {e}. Workers will attempt table creation with locking.'
+            )
+            # Don't fail the entire job - let workers try to create the table
+
+        # 3. Submit worker tasks
         futures = {}
         for partition in partitions:
             future = self.executor.submit(
                 self._execute_partition, user_query, partition, destination, connection_name, load_config
             )
             futures[future] = partition
 
-        # 3. Stream results as they complete
+        # 4. Stream results as they complete
         try:
             for future in as_completed(futures):
                 partition = futures[future]
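A worked example of the schema-probe rewriting above, for a hypothetical streaming query; each step mirrors the code (strip `SETTINGS`, then add `LIMIT 1`):

```python
user_query = 'SELECT * FROM eth_firehose.blocks SETTINGS stream = true'

# 1. SETTINGS stripped so the probe runs as a one-shot query:
#    'SELECT * FROM eth_firehose.blocks'
# 2. LIMIT 1 appended (no ORDER BY / GROUP BY present):
#    'SELECT * FROM eth_firehose.blocks LIMIT 1'
```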
@@ -417,17 +483,16 @@ def execute_parallel_stream(
         self.executor.shutdown(wait=True)
         self._log_final_stats()
 
-        # 4. If in hybrid mode, transition to continuous streaming for live blocks
+        # 5. If in hybrid mode, transition to continuous streaming for live blocks
         if continue_streaming:
             # Start continuous streaming with a buffer for reorg overlap
             # This ensures we catch any reorgs that occurred during parallel catchup
-            reorg_buffer = 200
-            continuous_start_block = max(self.config.min_block, detected_max_block - reorg_buffer)
+            continuous_start_block = max(self.config.min_block, detected_max_block - self.config.reorg_buffer)
 
             self.logger.info(
                 f'Parallel catch-up complete (loaded up to block {detected_max_block:,}). '
                 f'Transitioning to continuous streaming from block {continuous_start_block:,} '
-                f'(with {reorg_buffer}-block reorg buffer)...'
+                f'(with {self.config.reorg_buffer}-block reorg buffer)...'
             )
 
             # Ensure query has streaming settings
@@ -443,20 +508,16 @@ def execute_parallel_stream(
             # Add block filter to start from (detected_max - buffer) to catch potential reorgs
             # Check if query already has WHERE clause
             where_pos = streaming_query_upper.find(' WHERE ')
-            block_filter = f"{self.config.block_column} >= {continuous_start_block}"
+            block_filter = f'{self.config.block_column} >= {continuous_start_block}'
 
             if where_pos != -1:
                 # Has WHERE clause - append with AND
                 # Find position after WHERE keyword
                 insert_pos = where_pos + len(' WHERE ')
-                streaming_query = (
-                    streaming_query[:insert_pos] +
-                    f"({block_filter}) AND " +
-                    streaming_query[insert_pos:]
-                )
+                streaming_query = streaming_query[:insert_pos] + f'({block_filter}) AND ' + streaming_query[insert_pos:]
             else:
                 # No WHERE clause - add one before SETTINGS if present
-                streaming_query += f" WHERE {block_filter}"
+                streaming_query += f' WHERE {block_filter}'
 
             # Now add streaming settings for continuous mode
             streaming_query += ' SETTINGS stream = true'
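Putting the hand-off together, an illustrative final continuous-mode query, assuming the hypothetical user query from earlier and `continuous_start_block = 20_999_800`:

```python
# Result of the rewrite above for 'SELECT * FROM eth_firehose.blocks':
streaming_query = (
    'SELECT * FROM eth_firehose.blocks '
    'WHERE block_num >= 20999800 SETTINGS stream = true'
)
```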
@@ -521,7 +582,7 @@ def _execute_partition(
             destination=destination,
             connection_name=connection_name,
             read_all=False,  # Stream batches for memory efficiency
-            **load_config
+            **load_config,
         )
 
         # Aggregate results from streaming iterator
@@ -543,7 +604,7 @@ def _execute_partition(
         self.logger.info(
             f'Worker {partition.partition_id} completed: '
             f'{total_rows:,} rows in {duration:.2f}s '
-            f'({batch_count} batches, {total_rows/duration:.0f} rows/sec)'
+            f'({batch_count} batches, {total_rows / duration:.0f} rows/sec)'
         )
 
         # Return aggregated result
@@ -603,4 +664,4 @@ def _log_final_stats(self):
                 f'avg throughput: {avg_throughput:,.0f} rows/sec per worker'
             )
         else:
-            self.logger.error(f'Parallel execution failed: all {self._stats.workers_failed} workers failed')
\ No newline at end of file
+            self.logger.error(f'Parallel execution failed: all {self._stats.workers_failed} workers failed')
