
Conversation

@kaxil (Member) commented Jan 20, 2026

Fixes memory growth in long-running API servers by adding bounded LRU+TTL caching to DBDagBag. Previously, the cache was an unbounded dict that never expired, causing memory to grow indefinitely as DAG versions accumulated.

Problem

The API server's DBDagBag uses an internal dict to cache SerializedDAG objects (5-50 MB each). This cache:

  • Never expires - entries stay forever
  • Never evicts - grows with each new DAG version
  • Shared singleton - one instance for the entire API server lifetime

With 100+ DAGs updating daily, memory grows ~500 MB/day, eventually causing OOM.

Solution

Add optional LRU+TTL caching controlled by new [api] configuration:

Config          Default  Description
dag_cache_size  64       Max cached DAG versions (0 = disabled)
dag_cache_ttl   3600     TTL in seconds (0 = LRU only)

Key Design Decisions

  1. API server only - Scheduler continues using simple dict (no caching overhead)
  2. Cache thrashing prevention - iter_all_latest_version_dags() bypasses cache
  3. Thread-safe - an RLock protects cachetools operations in the multi-threaded API server (a sketch follows this list)
  4. Observability - Metrics for cache hits, misses, and clears
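
As a rough illustration of how the size/TTL settings could map onto cachetools containers, here is a minimal sketch; the helper name is hypothetical and the PR's actual wiring inside DBDagBag may differ.

```python
import threading

from cachetools import LRUCache, TTLCache


def _build_dag_cache(cache_size: int, cache_ttl: int):
    """Return (cache, lock) chosen from the [api] dag_cache_* settings (hypothetical helper)."""
    if cache_size <= 0:
        # dag_cache_size = 0 keeps the previous behaviour: a plain dict, no eviction.
        return {}, None
    if cache_ttl > 0:
        # Bounded LRU cache with time-based expiry on top.
        cache = TTLCache(maxsize=cache_size, ttl=cache_ttl)
    else:
        # dag_cache_ttl = 0: LRU eviction only, entries never expire by age.
        cache = LRUCache(maxsize=cache_size)
    # cachetools containers are not thread-safe on their own, hence the RLock.
    return cache, threading.RLock()
```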

Configuration

[api]
# Size of LRU cache (0 to disable)
dag_cache_size = 64

# TTL in seconds (0 for LRU-only, no time expiry)
dag_cache_ttl = 3600
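
As a rough illustration of how these options would be consumed, a minimal sketch using Airflow's conf accessor; the variable names are illustrative, not the PR's actual factory code.

```python
from airflow.configuration import conf

# Read the new [api] options; the fallbacks mirror the documented defaults.
cache_size = conf.getint("api", "dag_cache_size", fallback=64)
cache_ttl = conf.getint("api", "dag_cache_ttl", fallback=3600)

# dag_cache_size = 0 disables caching entirely; dag_cache_ttl = 0 means LRU-only.
use_cache = cache_size > 0
use_ttl = use_cache and cache_ttl > 0
```

These values feed DBDagBag's optional cache_size and cache_ttl parameters on the API server path; the scheduler does not pass them and keeps the plain dict.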

Metrics

Metric                          Type     Description
api_server.dag_bag.cache_hit    Counter  Cache hits
api_server.dag_bag.cache_miss   Counter  Cache misses
api_server.dag_bag.cache_clear  Counter  Cache clears
api_server.dag_bag.cache_size   Gauge    Cache size
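
A hedged sketch of where the counters and the gauge could be emitted, assuming Airflow's Stats facade; the helper name and call sites are illustrative, not the PR's exact code.

```python
from airflow.stats import Stats


def _cached_lookup(cache, lock, version_id):
    """Hypothetical lookup helper showing hit/miss/size metric emission."""
    with lock:
        dag = cache.get(version_id)
        if dag is not None:
            Stats.incr("api_server.dag_bag.cache_hit")
        else:
            Stats.incr("api_server.dag_bag.cache_miss")
        Stats.gauge("api_server.dag_bag.cache_size", len(cache))
        return dag
```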

Backward Compatibility

  • Default behavior unchanged for scheduler
  • API server gets caching by default (can disable with dag_cache_size = 0)
  • No breaking changes to public APIs

Was generative AI tooling used to co-author this PR?
  • Yes (please specify the tool below)


@kaxil kaxil added the full tests needed We need to run full set of tests for this PR to merge label Jan 20, 2026
@kaxil kaxil requested a review from vatsrahul1001 January 20, 2026 00:56
The API server's DBDagBag previously used an unbounded dict cache that
never expired, causing memory growth in long-running processes. This adds
configurable LRU+TTL caching controlled by [api] dag_cache_size and
dag_cache_ttl settings.

- Add cachetools dependency for LRU/TTL cache implementations
- DBDagBag accepts optional cache_size and cache_ttl parameters
- API server uses cached DBDagBag; scheduler unchanged (no caching)
- Prevent cache thrashing in iter_all_latest_version_dags
- Add metrics: dag_bag.cache_hit, dag_bag.cache_miss, dag_bag.cache_clear
- Thread-safe cache access with RLock
@kaxil kaxil force-pushed the add-api-server-dag-cache branch from 795bfb1 to 8a1f7fd on January 20, 2026 18:05
@kaxil kaxil marked this pull request as ready for review January 20, 2026 18:05
@kaxil kaxil added this to the Airflow 3.2.0 milestone Jan 20, 2026
@jason810496 (Member) left a comment:

Nice improvement! LGTM overall.

> With 100+ DAGs updating daily, memory grows ~500 MB/day, eventually causing OOM.

It’s surprising that the API server without cache eviction can cause system instability.

self._dags: MutableMapping[str, SerializedDAG] = {}

# Initialize cache if cache_size is provided
if cache_size and cache_size > 0:

Not sure - would it be better to use the existing _disable_cache as the condition?

Suggested change:
- if cache_size and cache_size > 0:
+ if not self._disable_cache:

if self._use_cache:
    Stats.incr("api_server.dag_bag.cache_hit")
return dag
if self._use_cache:

It seems we could consolidate _disable_cache and _use_cache into a single variable.

Comment on lines +128 to +131
if self._lock and not self._disable_cache:
    with self._lock:
        if dag := self._dags.get(version_id):
            return dag

Suggested change (remove these lines):
- if self._lock and not self._disable_cache:
-     with self._lock:
-         if dag := self._dags.get(version_id):
-             return dag

If I understand correctly, we have already handled retrieving from the cache before fetching dag_version.serialized_dag, so this lookup is redundant.


class TestCreateDagBag:
"""Tests for create_dag_bag() function."""

Although not necessary, we could consolidate these test methods using pytest.mark.parametrize over dag_cache_size, dag_cache_ttl, and expected_class.
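
For illustration, a hedged sketch of the consolidation suggested above; the expected backend classes are assumptions about how the settings map onto cache types, and the test body is elided.

```python
import pytest
from cachetools import LRUCache, TTLCache


@pytest.mark.parametrize(
    "dag_cache_size, dag_cache_ttl, expected_class",
    [
        (64, 3600, TTLCache),  # size + TTL -> bounded cache with expiry
        (64, 0, LRUCache),     # TTL disabled -> LRU-only
        (0, 3600, dict),       # caching disabled -> plain dict
    ],
)
def test_dag_bag_cache_backend(dag_cache_size, dag_cache_ttl, expected_class):
    # Body elided: it would build the bag with these settings and
    # assert isinstance(dag_bag._dags, expected_class).
    ...
```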

        assert result == mock_sdm.dag
        assert "test_version" in dag_bag._dags

    def test_read_dag_without_caching(self):

The behavior doesn't seem to match the naming of the test_read_dag_caches_with_lock and test_read_dag_without_caching test cases: regardless of whether the cache is enabled, self._dags[serdag.dag_version_id] = dag is executed in the _read_dag method.

@potiuk (Member) commented Jan 21, 2026

> It’s surprising that the API server without cache eviction can cause system instability.

One question. In Airflow 2 with gunicorn we had a far simpler solution: the webserver workers were simply restarted every few (tens of?) minutes, or after every N requests - effectively clearing the cache and also getting rid of some other side effects (for example, reloading UI plugins). Since the api-server (except for the cache) is essentially stateless, that had almost no negative side effects - apart from some extra load at startup and the database refreshing that happens then, but that is not much different from what the caching implemented here provides.

Additionally, that approach was far more "resilient" to any kind of accumulation-type bug. Yes, it was hiding such bugs as well, but the overall stability and resilience to mistakes in memory usage, side effects of imports, or global state sharing ended up much higher.

This approach is called "software rejuvenation" (https://ieeexplore.ieee.org/document/466961) - there are studies and recommendations to use it, as it is effectively far more resilient, and in complex systems it allows handling a much wider range of issues.

Maybe we should explore that as well (or instead) - I am not sure whether FastAPI/Starlette has a similar concept, but for all kinds of stateless webservers, the technique of restarting them gracefully while load-balancing requests has a long, proven history.

Should we possibly do that instead of LRU/TTL caching? It seems far more robust, if it is easy and supported by FastAPI.
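
For reference, the Airflow 2-style recycling boils down to two standard gunicorn settings. Below is a minimal sketch of a gunicorn config module, assuming gunicorn sits in front of the API server.

```python
# gunicorn.conf.py - a minimal sketch of the "software rejuvenation" approach,
# assuming gunicorn is used in front of the API server. max_requests and
# max_requests_jitter are standard gunicorn settings that recycle each worker
# after roughly that many requests, clearing any accumulated per-process state.
workers = 4
max_requests = 1000        # restart a worker after ~1000 handled requests
max_requests_jitter = 100  # randomize per worker so they don't all restart at once
```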

@kaxil (Member, Author) commented Jan 21, 2026

> One question. In Airflow 2 with gunicorn we had a far simpler solution: the webserver workers were simply restarted every few (tens of?) minutes, or after every N requests [...]

Good idea, worth trying that out too. Marking this as a draft to play around with it.

@kaxil (Member, Author) commented Jan 22, 2026

@potiuk An alternate approach is in #60919, which uses pure uvicorn signals to increment/decrement workers.

Limitations:

  • workers=1 needs a workaround (briefly scales to 2 during the refresh)
  • uvicorn's SIGTTOU kills the newest worker (LIFO), unlike gunicorn which kills the oldest (FIFO), so we send SIGTERM directly to the old PIDs instead

The LIFO thing is worth noting since it's a non-obvious difference between uvicorn and gunicorn that anyone else looking at this would run into.
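
A rough sketch of that refresh sequence, based only on the description above; the supervisor/worker PID bookkeeping and the function name are hypothetical, not the actual code in #60919.

```python
import os
import signal
import time


def roll_workers(supervisor_pid: int, old_worker_pids: list[int], grace: float = 5.0) -> None:
    """Replace old workers one at a time: scale up first, then retire the old PID."""
    for old_pid in old_worker_pids:
        os.kill(supervisor_pid, signal.SIGTTIN)  # ask the supervisor for one extra worker
        time.sleep(grace)                        # give the new worker time to start serving
        # SIGTTOU would remove the *newest* worker (LIFO), so terminate the old one directly.
        os.kill(old_pid, signal.SIGTERM)
```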

@kaxil (Member, Author) commented Jan 22, 2026

Another alternative using gunicorn is in #60940

@kaxil (Member, Author) commented Jan 22, 2026

Worth comparing them side-by-side now -- I will let others review all 3 of them. Will check back next week.

Labels: area:API, area:ConfigTemplates, area:task-sdk, full tests needed