Here is a minimal example. Running this code results in unbounded memory use, looking like a memory leak:
```python
from scipy.stats import hypergeom

while True:
    x = hypergeom(100, 30, 40).cdf(3)
```
It turns out that this isn’t really a memory leak but rather a problem with NumPy’s `vectorize`, which creates a circular reference in some situations. Here’s the GitHub issue that I opened: numpy/issues/11867.
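To make the term concrete, here is a generic sketch of a circular reference in plain Python. It is not NumPy's actual code: the `Holder` class and `cycle_demo` function are hypothetical names made up for this illustration. The object stores one of its own bound methods, so its reference count never drops to zero when the last outside name is deleted; only the cyclic garbage collector can reclaim it.

```python
import gc
import weakref


class Holder:
    """Hypothetical object that keeps a bound method of itself (a cycle)."""

    def __init__(self):
        # The bound method holds a reference back to `self`.
        self.callback = self.method

    def method(self):
        pass


def cycle_demo():
    """Return (alive after del, alive after gc.collect()) for a cyclic object."""
    was_enabled = gc.isenabled()
    gc.disable()  # keep the automatic collector out of the way for the demo
    try:
        obj = Holder()
        probe = weakref.ref(obj)  # lets us check liveness without keeping obj alive
        del obj
        alive_after_del = probe() is not None
        gc.collect()  # explicit cycle collection
        alive_after_collect = probe() is not None
    finally:
        if was_enabled:
            gc.enable()
    return alive_after_del, alive_after_collect


print(cycle_demo())
```

In CPython the object survives the `del` and disappears only after `gc.collect()`. Breaking the cycle by hand, for example by deleting the attribute that closes the loop, lets plain reference counting free the object immediately.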
In the meantime, a workaround is to manually delete the `_ufunc` attribute after each call to `cdf`:
```python
from scipy.stats import hypergeom

while True:
    h = hypergeom(100, 30, 40)
    x = h.cdf(3)
    # Break the circular reference so the frozen distribution can be freed.
    del h.dist._cdfvec._ufunc
```
Alternatively, avoid the frozen distribution and call `hypergeom.cdf` directly:
```python
from scipy.stats import hypergeom

while True:
    x = hypergeom.cdf(3, 100, 30, 40)
```
It’s worth mentioning that memory_profiler is a great tool for finding memory leaks:
```python
from scipy.stats import hypergeom
from memory_profiler import profile

@profile
def main1():
    for _ in range(1000):
        x = hypergeom(100, 30, 40).cdf(3)

main1()
```
```
$ python3 geomprofile.py
Filename: geomprofile.py

Line #    Mem usage    Increment   Line Contents
================================================
     4     69.2 MiB     69.2 MiB   @profile
     5                             def main1():
     6     79.2 MiB      0.0 MiB       for _ in range(1000):
     7     79.2 MiB     10.0 MiB           x = hypergeom(100, 30, 40).cdf(3)
```
We see that the `hypergeom` line increased memory use by 10 MiB over the 1000 iterations.
Drilling down into NumPy’s `vectorize` took a bit of manual debugging; I didn’t have as much luck with memory_profiler there.
In a production situation one might not have the luxury of finding the real cause of the memory leak immediately. In that case it might be enough to wrap the offending code in a call to multiprocessing so that the leaked memory is reclaimed frequently. A lightweight option is to use processify. See Liau Yung Siang’s blog post for more details.
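The subprocess idea can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not the processify implementation: `run_in_subprocess` and `leaky_work` are hypothetical names, `leaky_work` is a stand-in for the real leaking call, and the `fork` start method is assumed, which limits the sketch to POSIX systems.

```python
import multiprocessing as mp
import os


def leaky_work(n):
    """Hypothetical stand-in for the leaking call, e.g. hypergeom(...).cdf(3)."""
    junk = [bytearray(1024) for _ in range(n)]  # memory we never clean up
    return len(junk), os.getpid()


def _child(conn, func, args):
    # Runs in the worker process: compute, send the outcome back, then exit.
    try:
        conn.send((True, func(*args)))
    except Exception as exc:
        conn.send((False, repr(exc)))
    finally:
        conn.close()


def run_in_subprocess(func, *args):
    """Run func(*args) in a short-lived child process and return its result."""
    ctx = mp.get_context("fork")  # assumption: POSIX; not available on Windows
    parent_conn, child_conn = ctx.Pipe(duplex=False)  # (receive end, send end)
    proc = ctx.Process(target=_child, args=(child_conn, func, args))
    proc.start()
    child_conn.close()  # the parent only reads
    ok, payload = parent_conn.recv()  # read before join to avoid blocking on large results
    proc.join()
    if not ok:
        raise RuntimeError(payload)
    return payload


count, child_pid = run_in_subprocess(leaky_work, 10)
print(count, child_pid != os.getpid())
```

Whatever the child allocated is returned to the operating system when the child exits, so the parent's footprint stays flat. The cost is one process start per call, so in practice it makes sense to batch many iterations of the leaking code into each subprocess call.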