Last week I had to fix a memory leak in a Python program for the first time. A long running process started eating too much RAM (only ~20GB to much) and the friendly OOM Killer had to step in and terminate this. Since this kept happening, I had to go ahead and fix the issue.
As with every bug, before you can reliably fix it, you must reproduce it.
Now, while I had a reliable reproduction (after all, the process had regular dates with the OOM Killer), 3 days isn’t the best cycle time when you wanna solve a bug. So into the code we go.
The main idea is to start with the main loop, and try to narrow down the code that is must run for the leak to manifest. The process involves some educated guesses (where are the likely memory and allocation hogs in your process? What parts are likely to leak? Do you have any code that requires cleanup?), waiting, frustration, and tools.
While each developer and codebase have their own unique guesses and frustrations, good tooling applies more widely. For this part, I used Python’s tracemalloc module.
Among other things,
tracemalloc allows tracking memory usage between 2 points in your code in a very low-overhead manner.
After running this code,
peak will hold the peak-memory-usage during the trace period, and
current will hold the difference from the start of the trace to the current state. You should expect
current to be non-zero. But if it goes too high - your code is probably leaking.
By placing such traces around suspect pieces of our code, we can find which parts are leaking. Just remember - only do this with functions that are expected to retain no state. If a function mutates an external object, or is a member function, it is very to exhibit changes in memory usage.
Once we have a reproduction (that hopefully takes a relatively short amount of time), we want to find the leaking code. We can try and keep narrowing our measured code down until we find the relevant line, but the deeper we go, the harder it is to separate the leak from normal execution.
So at this point, we’d like to look into the allocated memory, and see which objects are there when they shouldn’t be.
For inspecting the objects in a Python process, I recommend using
Pympler is a development tool to measure, monitor and analyze the memory behavior of Python objects in a running Python application.
We’re going to use it to do 2 things.
First, we’re going to use
pympler to show us which objects were allocated during our repro & are still allocated.
Once we run this, we get a nice table showing us a summary of objects created and destroyed:
As you can see - there are quite a few primitive objects generated, and also some
__main__.Value objects. In my experience, primitives are harder to track, as they lack meaning in the code. Your own types, however, are usually only used in certain parts of the codebase, making them easier to make sense of.
Now that we see that we have 10 new
Value objects, it is time to figure out who’s holding them in memory.
This’ll print the following:
Giving away the issue - the
lru_cache is keeping our
Value objects. Just as designed…
I know this looks like a bit of a contrived example, but the
lru_cache keeping objects in memory was exactly the issue I had. It was just buried under far more code.
Currently, I use the ugliest solution I can imagine - functions decorated with
lru_cache have a
cache_clear() method, and I’m calling that at specific places in my code. It’s ugly, but it works.
A cleaner solution would require dedicated caches & better cleanup mechanisms. You can read a relevant discussion here.