Conversation

Contributor

@mlim19 mlim19 commented Oct 3, 2025

It seems perf has a known issue with memory utilization when it runs continuously. As observed on an idle system, where few processes are running and the output files are not large, perf has to be restarted once its internal memory usage grows beyond a threshold.
The solution I implement here is to restart the perf collection when the memory (RSS) growth exceeds 100MB compared to the initial RSS size. The threshold can be changed depending on the use case, and we could also expose it as a command line argument.

This issue is different from the one reported in #990, which is related to perf output handling.
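
As a rough illustration of the approach described above, here is a minimal sketch of a growth-based check, assuming psutil is available; the function name and baseline handling are hypothetical and not the PR's exact code:

import psutil

_RSS_GROWTH_THRESHOLD = 100 * 1024 * 1024  # 100MB in bytes; could be exposed as a command line argument

def should_restart_for_growth(perf_pid: int, baseline_rss: int) -> bool:
    # Read perf's current resident set size and compare it against the baseline
    # recorded shortly after perf was started.
    current_rss = psutil.Process(perf_pid).memory_info().rss
    return current_rss - baseline_rss > _RSS_GROWTH_THRESHOLD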

Description

Related Issue

Motivation and Context

How Has This Been Tested?

Screenshots

Checklist:

  • I have read the CONTRIBUTING document.
  • I have updated the relevant documentation.
  • I have added tests for new logic.

@dkorlovs dkorlovs force-pushed the fix_perf_slow_memory_util_growth branch 3 times, most recently from 1c26438 to 7f7610d on October 9, 2025 18:01
should_restart_time_based = (
time_elapsed >= self._RESTART_AFTER_S and current_rss >= self._PERF_MEMORY_USAGE_THRESHOLD
)
should_restart_growth_based = memory_growth > self._RSS_GROWTH_THRESHOLD
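
For readers outside the diff, here is a minimal sketch of how these two conditions might feed the restart decision; the OR combination and the _restart_perf helper are assumptions, not necessarily the PR's exact code:

if should_restart_time_based or should_restart_growth_based:
    # Hypothetical helper: stop the running perf process, start a fresh one,
    # and clear the recorded baseline so it is re-collected after the restart.
    self._restart_perf()
    self._baseline_rss = None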
Contributor

thanks, looks good.
Can we get some metrics behind this: what is the average growth rate of perf's memory from initialization to post-profiling, for a busy vs. non-busy system?

Contributor Author

The system I tested on is an AWS instance with a small number of CPUs, so our tests may not represent your use case. On an idle system I see around 50MB as the baseline RSS and 1~2MB of growth every profiling duration. I tried running more CPU-intensive workloads, but that did not increase the RSS much because the system has limited memory and CPUs. So it would be good if you could evaluate this change on your side with real use cases. Can you try that?

Contributor

@prashantbytesyntax prashantbytesyntax Oct 16, 2025

Yes, this would need testing in environments where there are a lot of processes (over 1k-1.5k) running. I would suggest we first test the root cause fix in #1002 and then test this fix.
@ashokbytebytego ^ ^

# we use double for dwarf.
_MMAP_SIZES = {"fp": 129, "dwarf": 257}
_RSS_GROWTH_THRESHOLD = 100 * 1024 * 1024 # 100MB in bytes
_BASELINE_COLLECTION_COUNT = 3 # Number of function calls to collect RSS before setting baseline
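
A minimal sketch of how a baseline could be derived from the first few RSS samples; the attribute names _baseline_rss and _rss_samples are hypothetical and the PR's actual implementation may differ:

def _update_baseline_rss(self, current_rss: int) -> None:
    # Collect the first few RSS readings, then freeze the baseline once enough
    # samples have been seen, so normal startup growth is not counted as leakage.
    if self._baseline_rss is not None:
        return
    self._rss_samples.append(current_rss)
    if len(self._rss_samples) >= self._BASELINE_COLLECTION_COUNT:
        self._baseline_rss = max(self._rss_samples)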
Contributor

thanks, I would suggest having some data to back the choice of this baseline collection count.

Contributor Author

mlim19 commented Oct 9, 2025

Here is a screenshot showing how memory utilization drops when perf gets restarted:
[screenshot: memory utilization dropping after a perf restart]

@mlim19 mlim19 force-pushed the fix_perf_slow_memory_util_growth branch from 7f7610d to 495eaae on October 23, 2025 17:45
Contributor Author

mlim19 commented Nov 6, 2025

This PR is on hold because Prashant mentioned that #1002 resolves most of the issues related to memory usage. We will decide whether or not to merge it sometime after #1002 is merged.

@mlim19 mlim19 marked this pull request as draft November 8, 2025 22:08