@shahab96 shahab96 commented Dec 7, 2025

Summary

Improves StatefulSet reconciliation with semantic diff detection, immutable field validation, and rollout status monitoring. This PR addresses issue #43 by adding the infrastructure needed to track and manage StatefulSet rollouts.

Changes

Phase 1: StatefulSet Diff Detection and Validation

  • ✅ Added statefulset_needs_update() method to detect semantic changes across 15+ fields
    • Replicas, pod management policy, image, environment variables
    • Labels, annotations, scheduling (affinity, tolerations, node selectors)
    • Resources, topology spread constraints, priority class, lifecycle hooks
  • ✅ Added validate_statefulset_update() to prevent invalid updates to immutable fields
    • Validates selector, serviceName, and volumeClaimTemplates
    • Returns clear error messages for violations
  • ✅ Refactored reconciliation loop to check for updates before applying
  • ✅ Added 9 comprehensive unit tests
  • ✅ Updated error_policy() with intelligent retry intervals based on error type
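To illustrate the check-then-validate flow from Phase 1, here is a minimal sketch. `StsSpec` is a hypothetical, heavily reduced stand-in for the real StatefulSet spec (the operator itself compares the Kubernetes types and many more fields), and the function signatures are assumptions rather than the ones in this PR.

```rust
use std::collections::BTreeMap;

/// Hypothetical, reduced view of a StatefulSet spec (not the real k8s type).
#[derive(Clone, Debug, PartialEq)]
pub struct StsSpec {
    // Mutable fields (only a few of the 15+ that the PR compares).
    pub replicas: i32,
    pub image: String,
    pub env: BTreeMap<String, String>,
    // Fields Kubernetes treats as immutable after creation.
    pub selector: BTreeMap<String, String>,
    pub service_name: String,
    pub volume_claim_templates: Vec<String>,
}

/// Rejects a desired spec that changes an immutable field.
pub fn validate_statefulset_update(existing: &StsSpec, desired: &StsSpec) -> Result<(), String> {
    if existing.selector != desired.selector {
        return Err("spec.selector is immutable".to_string());
    }
    if existing.service_name != desired.service_name {
        return Err("spec.serviceName is immutable".to_string());
    }
    if existing.volume_claim_templates != desired.volume_claim_templates {
        return Err("spec.volumeClaimTemplates is immutable".to_string());
    }
    Ok(())
}

/// Returns true when any of the compared mutable fields differs.
pub fn statefulset_needs_update(existing: &StsSpec, desired: &StsSpec) -> bool {
    existing.replicas != desired.replicas
        || existing.image != desired.image
        || existing.env != desired.env
}
```

In this sketch the reconciler would call `validate_statefulset_update` first and only apply the desired spec when `statefulset_needs_update` returns true, which is what avoids the no-op API calls.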

Phase 2: Status Types and Helpers

  • ✅ Extended Status struct with:
    • observed_generation - tracks which spec generation was observed
    • conditions - Kubernetes standard conditions (Ready, Progressing, Degraded)
  • ✅ Extended Pool status with replica tracking:
    • replicas, ready_replicas, current_replicas, updated_replicas
    • current_revision, update_revision - StatefulSet controller revisions
    • last_update_time - timestamp of last status update
  • ✅ Added new PoolState variants: Updating, RolloutComplete, RolloutFailed, Degraded
  • ✅ Added StatefulSet status helper methods to Context

Phase 3: Status Updates and Rollout Monitoring

  • ✅ Added build_pool_status() helper to extract status from StatefulSets
  • ✅ Collect pool statuses during reconciliation loop
  • ✅ Aggregate pool statuses into Tenant-level conditions
  • ✅ Update Tenant status with comprehensive rollout information
  • ✅ Smart requeuing: 10-second interval during active rollouts for responsive updates
  • ✅ Refactored Context.update_status() to use JSON merge patch (proper kube-rs API)
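The aggregation and requeue steps in Phase 3 can be sketched roughly as follows. The `PoolState` variants match the ones listed in this PR, but `TenantConditions`, `aggregate`, `requeue_after`, the exact state-to-condition mapping, and the 300-second steady-state interval are assumptions made for illustration; only the 10-second rollout interval comes from the PR.

```rust
use std::time::Duration;

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum PoolState {
    NotCreated,
    Initialized,
    Updating,
    RolloutComplete,
    RolloutFailed,
    Degraded,
}

/// Hypothetical boolean summary of the Ready / Progressing / Degraded conditions.
#[derive(Debug, PartialEq, Eq)]
pub struct TenantConditions {
    pub ready: bool,
    pub progressing: bool,
    pub degraded: bool,
}

/// Folds per-pool states into Tenant-level conditions (assumed mapping).
pub fn aggregate(pools: &[PoolState]) -> TenantConditions {
    let progressing = pools.iter().any(|s| *s == PoolState::Updating);
    let degraded = pools
        .iter()
        .any(|s| matches!(s, PoolState::Degraded | PoolState::RolloutFailed));
    // An empty pool list is never considered ready.
    let ready = !pools.is_empty() && pools.iter().all(|s| *s == PoolState::RolloutComplete);
    TenantConditions { ready, progressing, degraded }
}

/// Requeue quickly while a rollout is in progress, slowly otherwise.
pub fn requeue_after(conditions: &TenantConditions) -> Duration {
    if conditions.progressing {
        Duration::from_secs(10) // interval stated in this PR
    } else {
        Duration::from_secs(300) // assumed steady-state interval
    }
}
```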

Bug Fixes

  • ✅ Fixed RUST_LOG environment variable support (added env-filter feature)
  • ✅ Added validation to block pool name changes (prevents orphaned StatefulSets)
  • ✅ Fixed status update API to use patch_status with JSON merge patch
  • ✅ Removed non-existent Health column from kubectl output
  • ✅ Updated CRD with correct schema for all status fields

Dependencies

  • Added chrono for RFC3339 timestamp generation

Status Output

The operator now provides comprehensive status information:

status:
  currentState: Ready
  availableReplicas: 2
  observedGeneration: 1
  conditions:
  - type: Ready
    status: "True"
    reason: AllPodsReady
    message: "2/2 pods ready"
    lastTransitionTime: "2025-12-04T13:55:38Z"
    observedGeneration: 1
  pools:
  - ssName: example-tenant-primary
    state: RolloutComplete
    replicas: 2
    readyReplicas: 2
    currentReplicas: 2
    updatedReplicas: 2
    currentRevision: example-tenant-primary-65c455969b
    updateRevision: example-tenant-primary-65c455969b
    lastUpdateTime: "2025-12-04T13:55:38Z"

kubectl output:

NAME             STATE   AGE
example-tenant   Ready   15m

Test Results

  • ✅ All 35 unit tests passing
  • ✅ Locally tested with live Kubernetes cluster
  • ✅ Verified status updates work correctly
  • ✅ Verified conditions are populated properly
  • ✅ Verified rollout monitoring with real StatefulSets

Test Plan

  1. Apply the updated CRD: kubectl apply -f deploy/rustfs-operator/crds/tenant.yaml
  2. Deploy a test Tenant resource
  3. Verify status is populated: kubectl get tenant <name> -o yaml
  4. Update the Tenant (e.g., change image or replicas)
  5. Watch the status transition through states: kubectl get tenant -w
  6. Verify rollout monitoring with 10-second requeue during updates
  7. Check conditions reflect proper state: kubectl describe tenant <name>

Related Issues

Closes #43

Breaking Changes

None - all changes are additive and backward compatible.

Rollout Strategy

  1. Apply updated CRD (includes new status fields)
  2. Deploy updated operator
  3. Existing Tenants will have status populated on next reconciliation

Implement intelligent StatefulSet update detection and immutable field
validation to improve reconciliation efficiency and safety.

Changes:
- Add statefulset_needs_update() method for semantic diff detection
- Add validate_statefulset_update() method for immutable field checks
- Refactor reconciliation loop to check/validate before updating
- Add new error types: InternalError, ImmutableFieldModified, SerdeJson
- Extend error policy with 60s requeue for immutable field errors
- Add 9 comprehensive unit tests (35 tests total, all passing)
- Update CHANGELOG.md with detailed changes
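As a rough sketch of the error policy change: the variant names below come from this commit, but their payloads, the `retry_interval` helper, and the 5-second interval for the other errors are assumptions. Only the 60-second requeue for immutable field errors is stated here.

```rust
use std::time::Duration;

/// Error variants named in this commit; the payload shapes are assumed.
#[derive(Debug)]
pub enum Error {
    InternalError(String),
    ImmutableFieldModified { field: String },
    SerdeJson(String),
}

/// Picks a requeue delay based on the error type.
pub fn retry_interval(err: &Error) -> Duration {
    match err {
        // Needs a user to fix the spec, so retrying quickly is pointless.
        Error::ImmutableFieldModified { .. } => Duration::from_secs(60),
        // Assumed value for errors that may clear up on their own.
        Error::InternalError(_) | Error::SerdeJson(_) => Duration::from_secs(5),
    }
}
```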

Benefits:
- Reduces unnecessary API calls and reconciliation overhead
- Prevents invalid updates that would cause API rejections
- Provides clear error messages for users
- Foundation for rollout monitoring in Phase 2

Related: rustfs#43

Extend status structures and add StatefulSet status helper methods
for rollout monitoring support.

Changes:
- Add Condition struct for Kubernetes standard conditions (Ready, Progressing, Degraded)
- Extend Status struct with observed_generation and conditions fields
- Extend Pool status with replica tracking and revision fields
- Add new PoolState variants: Updating, RolloutComplete, RolloutFailed, Degraded
- Add context methods:
  - get_statefulset_status() - Fetch StatefulSet status
  - is_rollout_complete() - Check if rollout is complete
  - get_statefulset_revisions() - Get current and update revisions

Benefits:
- Foundation for comprehensive rollout monitoring
- Kubernetes-standard status conditions
- Per-pool rollout status tracking
- All existing tests continue to pass
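A possible shape for the rollout-completion check, as a sketch only: `StsStatus` is a hypothetical reduced copy of the StatefulSet status fields, and the criteria (matching revisions plus all replicas ready and updated) are an assumption about what `is_rollout_complete()` checks, not a copy of it.

```rust
/// Hypothetical subset of the StatefulSet status used for rollout checks.
#[derive(Clone, Debug, Default)]
pub struct StsStatus {
    pub replicas: i32,
    pub ready_replicas: i32,
    pub updated_replicas: i32,
    pub current_revision: Option<String>,
    pub update_revision: Option<String>,
}

/// True when every desired replica is ready and on the update revision.
pub fn is_rollout_complete(desired_replicas: i32, status: &StsStatus) -> bool {
    // Missing revisions mean the controller has not reported yet.
    let revisions_match = match (&status.current_revision, &status.update_revision) {
        (Some(current), Some(update)) => current == update,
        _ => false,
    };
    revisions_match
        && status.replicas == desired_replicas
        && status.ready_replicas == desired_replicas
        && status.updated_replicas == desired_replicas
}
```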

Related: rustfs#43

The tracing subscriber was not respecting the RUST_LOG environment
variable, making it impossible to see debug logs from the operator.

Changes:
- Enable 'env-filter' feature for tracing-subscriber in Cargo.toml
- Add .with_env_filter() to tracing subscriber initialization
- Allows operators to see debug logs with: RUST_LOG=operator=debug

This fixes the "no logs" issue reported when running the operator.

Testing:
- Verified operator runs successfully with RUST_LOG=info
- Verified debug logs appear with RUST_LOG=operator=debug
- Confirmed diff detection and validation logs are visible

Related: rustfs#43

Changing a pool name creates a new StatefulSet but leaves the old one
orphaned. This is invalid because pool names are part of the StatefulSet
selector (immutable field).

Changes:
- Add validation before StatefulSet reconciliation loop
- List all StatefulSets owned by the Tenant
- Check for StatefulSets whose pool names don't match current spec
- Return ImmutableFieldModified error with clear guidance

Error message guides users to delete and recreate Tenant if rename needed.
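The rename check can be sketched as a set difference between the StatefulSets the Tenant owns and the names its current pools would produce. The `orphaned_statefulsets` helper and the `<tenant>-<pool>` naming scheme are assumptions (the latter inferred from the `example-tenant-primary` name in the status output); the real code lists StatefulSets through the Kubernetes API.

```rust
use std::collections::BTreeSet;

/// Returns owned StatefulSet names that no pool in the current spec maps to.
/// Assumes StatefulSets are named "<tenant>-<pool>".
pub fn orphaned_statefulsets(tenant: &str, spec_pools: &[&str], owned_sts: &[&str]) -> Vec<String> {
    let expected: BTreeSet<String> = spec_pools
        .iter()
        .map(|pool| format!("{tenant}-{pool}"))
        .collect();
    owned_sts
        .iter()
        .copied()
        .filter(|name| !expected.contains(*name))
        .map(str::to_string)
        .collect()
}
```

A non-empty result would then be turned into an `ImmutableFieldModified` error before the reconciliation loop runs.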

Testing:
- All 35 tests passing
- Operator detects orphaned StatefulSets and returns error
- 60-second requeue for user to fix

Related: rustfs#43

Implement comprehensive status updates and rollout monitoring to track
StatefulSet reconciliation progress in real-time.

Changes:
- Add build_pool_status() helper method to Tenant to extract status from StatefulSets
- Collect pool statuses during StatefulSet reconciliation loop
- Aggregate pool statuses into overall Tenant conditions (Ready, Progressing, Degraded)
- Update Tenant status with replica counts, pool states, and conditions
- Requeue faster (10s) when pools are updating for responsive monitoring
- Refactor Context.update_status() to accept full Status struct
- Add chrono dependency for timestamp generation

Status Updates:
- Pool status includes: replicas, ready_replicas, current_replicas, updated_replicas,
  current_revision, update_revision, last_update_time, and state
- Pool states: NotCreated, Initialized, Updating, RolloutComplete, RolloutFailed, Degraded
- Tenant conditions: Ready (True/False), Progressing (True during rollout), Degraded (True when degraded)
- Tenant overall state: Ready, NotReady, Updating, Degraded

Rollout Monitoring:
- Tracks individual pool rollout status based on StatefulSet replicas and revisions
- Sets Progressing condition during rollouts
- Sets Ready condition when all replicas are ready and updated
- Sets Degraded condition when pools are unhealthy
- Requeues every 10 seconds during active rollouts for responsive updates

This completes Phase 3 of issue rustfs#43. Status information is now properly populated
and visible in kubectl/k9s.

The replace_status API requires a complete Kubernetes object with
apiVersion, kind, metadata, spec, and status - not just the status field.

Changes:
- Clone the Tenant resource and set its status field
- Serialize the complete Tenant object
- Fixes 'Object Kind is missing' error during status updates

This fixes the status update errors seen in the operator logs.

The Health column was referencing a non-existent .status.healthStatus field.
Removed it since STATE column already shows the current state.
…s#43)

Changed from replace_status with raw bytes to patch_status with JSON
merge patch, which is the proper kube-rs API for updating status
subresources.

This fixes the 'Object Kind is missing' error.
@shahab96 shahab96 changed the title Statefulset reconciliation feat: StatefulSet reconciliation improvements and status monitoring (#43) Dec 7, 2025
- Use chained if-let conditions to collapse nested if statements
- Use as_deref() instead of as_ref().map(|s| s.as_str())
- Allow unwrap/expect in test modules (acceptable in tests)

All clippy checks now pass with -D warnings.
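For reference, the `as_deref()` cleanup looks like this in isolation. `PoolSpec` and its field are hypothetical; the point is only that both forms return the same `Option<&str>`.

```rust
/// Hypothetical struct with an optional owned string field.
pub struct PoolSpec {
    pub priority_class_name: Option<String>,
}

/// Form that clippy flags: manual Option<String> -> Option<&str> conversion.
pub fn name_verbose(pool: &PoolSpec) -> Option<&str> {
    pool.priority_class_name.as_ref().map(|s| s.as_str())
}

/// Equivalent and shorter: as_deref() borrows through the Option.
pub fn name_concise(pool: &PoolSpec) -> Option<&str> {
    pool.priority_class_name.as_deref()
}
```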
@shahab96 shahab96 marked this pull request as ready for review December 7, 2025 07:49
@shahab96 shahab96 requested a review from bestgopher as a code owner December 7, 2025 07:49
Replace unwrap() and expect() calls with explicit pattern matching
in RBAC test code to satisfy clippy lints without using allow
annotations. All test behavior is preserved with explicit panic
messages on None values.
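One way to write such a replacement is with `let ... else`, shown here as a sketch. The `first_rule_verb` helper and the rule representation are hypothetical and not taken from the RBAC tests; the commit may equally use plain `match` expressions.

```rust
/// Returns the first verb of the first rule, panicking with an explicit
/// message instead of calling unwrap() or expect().
pub fn first_rule_verb(rules: &[Vec<String>]) -> &str {
    let Some(rule) = rules.first() else {
        panic!("expected at least one RBAC rule");
    };
    let Some(verb) = rule.first() else {
        panic!("expected the rule to list at least one verb");
    };
    verb
}
```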
@bestgopher bestgopher added this pull request to the merge queue Dec 9, 2025
Merged via the queue into rustfs:main with commit a22fcc0 Dec 9, 2025
2 checks passed
@shahab96 shahab96 deleted the feat/statefulset-reconciliation-#43 branch December 12, 2025 06:33

Development

Successfully merging this pull request may close these issues.

Improve StatefulSet reconciliation and update handling