
Conversation

@kaxil (Member) commented Jan 20, 2026

Fixes memory growth in long-running API servers by adding bounded LRU+TTL caching to DBDagBag. Previously, the cache was an unbounded dict that never expired, causing memory to grow indefinitely as DAG versions accumulated.

Problem

The API server's DBDagBag uses an internal dict to cache SerializedDAG objects (5-50 MB each). This cache:

  • Never expires - entries stay forever
  • Never evicts - grows with each new DAG version
  • Shared singleton - one instance for the entire API server lifetime

With 100+ DAGs updating daily, memory grows ~500 MB/day, eventually causing OOM.

Solution

Add optional LRU+TTL caching controlled by new [api] configuration:

Config          Default  Description
dag_cache_size  64       Max cached DAG versions (0 = disabled)
dag_cache_ttl   3600     TTL in seconds (0 = LRU only)

Key Design Decisions

  1. API server only - Scheduler continues using simple dict (no caching overhead)
  2. Cache thrashing prevention - iter_all_latest_version_dags() bypasses cache
  3. Thread-safe - an RLock protects cachetools operations in the multi-threaded API server (a sketch follows this list)
  4. Observability - Metrics for cache hits, misses, and clears
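
As a rough illustration of how the size/TTL settings could map onto cachetools containers, here is a minimal sketch; the helper name is hypothetical and the PR's actual wiring inside DBDagBag may differ.

```python
import threading

from cachetools import LRUCache, TTLCache


def _build_dag_cache(cache_size: int, cache_ttl: int):
    """Return (cache, lock) chosen from the [api] dag_cache_* settings (hypothetical helper)."""
    if cache_size <= 0:
        # dag_cache_size = 0 keeps the previous behaviour: a plain dict, no eviction.
        return {}, None
    if cache_ttl > 0:
        # Bounded LRU cache with time-based expiry on top.
        cache = TTLCache(maxsize=cache_size, ttl=cache_ttl)
    else:
        # dag_cache_ttl = 0: LRU eviction only, entries never expire by age.
        cache = LRUCache(maxsize=cache_size)
    # cachetools containers are not thread-safe on their own, hence the RLock.
    return cache, threading.RLock()
```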

Configuration

[api]
# Size of LRU cache (0 to disable)
dag_cache_size = 64

# TTL in seconds (0 for LRU-only, no time expiry)
dag_cache_ttl = 3600
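
As a rough illustration of how these options would be consumed, a minimal sketch using Airflow's conf accessor; the variable names are illustrative, not the PR's actual factory code.

```python
from airflow.configuration import conf

# Read the new [api] options; the fallbacks mirror the documented defaults.
cache_size = conf.getint("api", "dag_cache_size", fallback=64)
cache_ttl = conf.getint("api", "dag_cache_ttl", fallback=3600)

# dag_cache_size = 0 disables caching entirely; dag_cache_ttl = 0 means LRU-only.
use_cache = cache_size > 0
use_ttl = use_cache and cache_ttl > 0
```

These values feed DBDagBag's optional cache_size and cache_ttl parameters on the API server path; the scheduler does not pass them and keeps the plain dict.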

Metrics

Metric                          Type     Description
api_server.dag_bag.cache_hit    Counter  Cache hits
api_server.dag_bag.cache_miss   Counter  Cache misses
api_server.dag_bag.cache_clear  Counter  Cache clears
api_server.dag_bag.cache_size   Gauge    Cache size
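
A hedged sketch of where the counters and the gauge could be emitted, assuming Airflow's Stats facade; the helper name and call sites are illustrative, not the PR's exact code.

```python
from airflow.stats import Stats


def _cached_lookup(cache, lock, version_id):
    """Hypothetical lookup helper showing hit/miss/size metric emission."""
    with lock:
        dag = cache.get(version_id)
        if dag is not None:
            Stats.incr("api_server.dag_bag.cache_hit")
        else:
            Stats.incr("api_server.dag_bag.cache_miss")
        Stats.gauge("api_server.dag_bag.cache_size", len(cache))
        return dag
```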

Backward Compatibility

  • Default behavior unchanged for scheduler
  • API server gets caching by default (can disable with dag_cache_size = 0)
  • No breaking changes to public APIs

Was generative AI tooling used to co-author this PR?
  • Yes (please specify the tool below)


@kaxil kaxil added the full tests needed We need to run full set of tests for this PR to merge label Jan 20, 2026
@kaxil kaxil requested a review from vatsrahul1001 January 20, 2026 00:56
The API server's DBDagBag previously used an unbounded dict cache that
never expired, causing memory growth in long-running processes. This adds
configurable LRU+TTL caching controlled by [api] dag_cache_size and
dag_cache_ttl settings.

- Add cachetools dependency for LRU/TTL cache implementations
- DBDagBag accepts optional cache_size and cache_ttl parameters
- API server uses cached DBDagBag; scheduler unchanged (no caching)
- Prevent cache thrashing in iter_all_latest_version_dags
- Add metrics: dag_bag.cache_hit, dag_bag.cache_miss, dag_bag.cache_clear
- Thread-safe cache access with RLock
@kaxil kaxil force-pushed the add-api-server-dag-cache branch from 795bfb1 to 8a1f7fd on January 20, 2026 18:05
@kaxil kaxil marked this pull request as ready for review January 20, 2026 18:05
@kaxil kaxil added this to the Airflow 3.2.0 milestone Jan 20, 2026
@jason810496 (Member) left a comment:

Nice improvement! LGTM overall.

> With 100+ DAGs updating daily, memory grows ~500 MB/day, eventually causing OOM.

It’s surprising that the API server without cache eviction can cause system instability.

self._dags: MutableMapping[str, SerializedDAG] = {}

# Initialize cache if cache_size is provided
if cache_size and cache_size > 0:

Not sure - would it be better to use the existing _disable_cache as the condition?

Suggested change:
- if cache_size and cache_size > 0:
+ if not self._disable_cache:

if self._use_cache:
    Stats.incr("api_server.dag_bag.cache_hit")
return dag
if self._use_cache:

It seems we could consolidate _disable_cache and _use_cache into a single variable.

Comment on lines +128 to +131
if self._lock and not self._disable_cache:
    with self._lock:
        if dag := self._dags.get(version_id):
            return dag

Suggested change (remove these lines):
- if self._lock and not self._disable_cache:
-     with self._lock:
-         if dag := self._dags.get(version_id):
-             return dag

If I understand correctly, we have already handled retrieving from the cache before fetching dag_version.serialized_dag, so this lookup is redundant.


class TestCreateDagBag:
"""Tests for create_dag_bag() function."""

Although not necessary, we could consolidate these test methods using pytest.mark.parametrize over dag_cache_size, dag_cache_ttl, and expected_class.
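
For illustration, a hedged sketch of the consolidation suggested above; the expected backend classes are assumptions about how the settings map onto cache types, and the test body is elided.

```python
import pytest
from cachetools import LRUCache, TTLCache


@pytest.mark.parametrize(
    "dag_cache_size, dag_cache_ttl, expected_class",
    [
        (64, 3600, TTLCache),  # size + TTL -> bounded cache with expiry
        (64, 0, LRUCache),     # TTL disabled -> LRU-only
        (0, 3600, dict),       # caching disabled -> plain dict
    ],
)
def test_dag_bag_cache_backend(dag_cache_size, dag_cache_ttl, expected_class):
    # Body elided: it would build the bag with these settings and
    # assert isinstance(dag_bag._dags, expected_class).
    ...
```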

        assert result == mock_sdm.dag
        assert "test_version" in dag_bag._dags

    def test_read_dag_without_caching(self):

The behavior doesn't seem to match the naming of the test_read_dag_caches_with_lock and test_read_dag_without_caching test cases: regardless of whether the cache is enabled, self._dags[serdag.dag_version_id] = dag is executed in the _read_dag method.

@potiuk (Member) commented Jan 21, 2026

> It’s surprising that the API server without cache eviction can cause system instability.

One question. In Airflow 2 with gunicorn we had a far simpler solution: the webserver workers were simply restarted every few (tens of?) minutes, or after every N requests - effectively clearing the cache and also getting rid of some other side effects (for example, reloading UI plugins). Since the api-server (except for the cache) is essentially stateless, that had almost no negative side effects - apart from some extra load at startup and the database refreshing that happens then, but that is not much different from what the caching implemented here provides.

Additionally, that approach was far more "resilient" to any kind of accumulation-type bug. Yes, it was hiding such bugs as well, but the overall stability and resilience to mistakes in memory usage, side effects of imports, or global state sharing ended up much higher.

This approach is called "software rejuvenation" (https://ieeexplore.ieee.org/document/466961) - there are studies and recommendations to use it, as it is effectively far more resilient, and in complex systems it allows handling a much wider range of issues.

Maybe we should explore that as well (or instead) - I am not sure whether FastAPI/Starlette has a similar concept, but for all kinds of stateless webservers, the technique of restarting them gracefully while load-balancing requests has a long, proven history.

Should we possibly do that instead of LRU/TTL caching? It seems far more robust, if it is easy and supported by FastAPI.
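
For reference, the Airflow 2-style recycling boils down to two standard gunicorn settings. Below is a minimal sketch of a gunicorn config module, assuming gunicorn sits in front of the API server.

```python
# gunicorn.conf.py - a minimal sketch of the "software rejuvenation" approach,
# assuming gunicorn is used in front of the API server. max_requests and
# max_requests_jitter are standard gunicorn settings that recycle each worker
# after roughly that many requests, clearing any accumulated per-process state.
workers = 4
max_requests = 1000        # restart a worker after ~1000 handled requests
max_requests_jitter = 100  # randomize per worker so they don't all restart at once
```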

@kaxil (Member, Author) commented Jan 21, 2026

> One question. In Airflow 2 with gunicorn we had a far simpler solution: the webserver workers were simply restarted every few (tens of?) minutes, or after every N requests [...]

Good idea, worth trying that out too. Marking this as a draft to play around with it.

@kaxil (Member, Author) commented Jan 22, 2026

@potiuk An alternate approach is in #60919, which uses pure uvicorn signals to increment/decrement workers.

Limitations:

  • workers=1 needs a workaround (briefly scales to 2 during the refresh)
  • uvicorn's SIGTTOU kills the newest worker (LIFO), unlike gunicorn which kills the oldest (FIFO), so we send SIGTERM directly to the old PIDs instead

The LIFO thing is worth noting since it's a non-obvious difference between uvicorn and gunicorn that anyone else looking at this would run into.
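
A rough sketch of that refresh sequence, based only on the description above; the supervisor/worker PID bookkeeping and the function name are hypothetical, not the actual code in #60919.

```python
import os
import signal
import time


def roll_workers(supervisor_pid: int, old_worker_pids: list[int], grace: float = 5.0) -> None:
    """Replace old workers one at a time: scale up first, then retire the old PID."""
    for old_pid in old_worker_pids:
        os.kill(supervisor_pid, signal.SIGTTIN)  # ask the supervisor for one extra worker
        time.sleep(grace)                        # give the new worker time to start serving
        # SIGTTOU would remove the *newest* worker (LIFO), so terminate the old one directly.
        os.kill(old_pid, signal.SIGTERM)
```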

@kaxil (Member, Author) commented Jan 22, 2026

Another alternative using gunicorn is in #60940

@kaxil (Member, Author) commented Jan 22, 2026

Worth comparing them side-by-side now -- I will let others review all 3 of them. Will check back next week.

Labels: area:API, area:ConfigTemplates, area:task-sdk, full tests needed