Last week I had to fix a memory leak in a Python program for the first time. A long-running process started eating too much RAM (only ~20GB too much) and the friendly OOM Killer had to step in and terminate it. Since this kept happening, I had to go ahead and fix the issue.
Step 1 - Reproduction
As with every bug, before you can reliably fix it, you must reproduce it.
Now, while I had a reliable reproduction (after all, the process had regular dates with the OOM Killer), 3 days isn’t the best cycle time when you wanna solve a bug. So into the code we go.
The main idea is to start with the main loop, and try to narrow down the code that must run for the leak to manifest. The process involves some educated guesses (where are the likely memory and allocation hogs in your process? What parts are likely to leak? Do you have any code that requires cleanup?), waiting, frustration, and tools.
tracemalloc
While each developer and codebase have their own unique guesses and frustrations, good tooling applies more widely. For this part, I used Python’s tracemalloc module.
Among other things, tracemalloc allows tracking memory usage between 2 points in your code in a very low-overhead manner.
After running this code, peak will hold the peak memory usage during the trace period, and current will hold the difference from the start of the trace to the current state. You should expect current to be non-zero. But if it goes too high - your code is probably leaking.
By placing such traces around suspect pieces of our code, we can find which parts are leaking. Just remember - only do this with functions that are expected to retain no state. If a function mutates an external object, or is a member function, it is very likely to exhibit changes in memory usage.
Step 2 - Triage
Once we have a reproduction (that hopefully takes a relatively short amount of time), we want to find the leaking code. We can try and keep narrowing our measured code down until we find the relevant line, but the deeper we go, the harder it is to separate the leak from normal execution.
So at this point, we’d like to look into the allocated memory, and see which objects are there when they shouldn’t be.
pympler
For inspecting the objects in a Python process, I recommend using pympler.
Pympler is a development tool to measure, monitor and analyze the memory behavior of Python objects in a running Python application.
We’re going to use it to do 2 things.
Inspecting Allocated Objects
First, we’re going to use pympler to show us which objects were allocated during our repro and are still allocated.
Once we run this, we get a nice table showing us a summary of objects created and destroyed.
In the table, there are quite a few primitive objects generated, and also some __main__.Value objects. In my experience, primitives are harder to track, as they lack meaning in the code. Your own types, however, are usually only used in certain parts of the codebase, making them easier to make sense of.
Now that we see that we have 10 new Value objects, it is time to figure out who’s holding them in memory.
The output gives away the issue - the lru_cache is keeping our Value objects. Just as designed…
I know this looks like a bit of a contrived example, but the lru_cache keeping objects in memory was exactly the issue I had. It was just buried under far more code.
Step 3 - Solution
Currently, I use the ugliest solution I can imagine - functions decorated with lru_cache have a cache_clear() method, and I’m calling that at specific places in my code. It’s ugly, but it works.
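A minimal sketch of that workaround, with a hypothetical expensive_lookup function standing in for the real cached code:

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def expensive_lookup(key):
    """Hypothetical cached function; every distinct key pins its result."""
    return [key] * 1000


for key in range(10):
    expensive_lookup(key)
size_before = expensive_lookup.cache_info().currsize

# Drop every cached result so the memory can be reclaimed.
expensive_lookup.cache_clear()
size_after = expensive_lookup.cache_info().currsize

print(size_before, size_after)
```

Where exactly to call cache_clear() depends on when the cached results stop being useful, for example at the end of each iteration of the main loop.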
A cleaner solution would require dedicated caches & better cleanup mechanisms. You can read a relevant discussion here.