Edit 2019-01-12: this has since been fixed in numpy/pull/11977.
Nadiah ran into an apparent memory leak in SciPy’s hypergeom distribution.
Here is a minimal example. Running this code results in unbounded memory use, which looks like a memory leak:
from scipy.stats import hypergeom
while True:
    x = hypergeom(100, 30, 40).cdf(3)
It turns out that this isn’t really a memory leak but rather a problem with NumPy’s vectorize, which creates a circular reference in some situations. Here’s the GitHub issue that I opened: numpy/issues/11867.
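The circular reference can be observed directly with the standard library’s weakref and gc modules. Here’s a small sketch (assuming an affected NumPy version; with the fix in place I’d expect the final line to print True):

import gc
import weakref
from scipy.stats import hypergeom

h = hypergeom(100, 30, 40)   # the frozen distribution holds its own copy of the dist
x = h.cdf(3)                 # the first cdf call builds the vectorized ufunc
ref = weakref.ref(h.dist)    # watch that copy without keeping it alive
del h
gc.collect()
# On affected NumPy versions I'd expect this to print False: the cycle
# keeps the object alive even after a full garbage collection.
print(ref() is None)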
In the meantime, a workaround is to manually delete the _ufunc attribute after using cdf:
from scipy.stats import hypergeom
while True:
    h = hypergeom(100, 30, 40)
    x = h.cdf(3)
    del h.dist._cdfvec._ufunc
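If the workaround is needed in more than one place, it can be packaged up. Here’s a hypothetical helper (frozen_hypergeom is my own name, not part of SciPy) that breaks the cycle when the block exits:

from contextlib import contextmanager
from scipy.stats import hypergeom

@contextmanager
def frozen_hypergeom(M, n, N):
    # Hypothetical helper: yields a frozen distribution and breaks
    # the np.vectorize reference cycle on exit.
    h = hypergeom(M, n, N)
    try:
        yield h
    finally:
        # pop() with a default avoids an AttributeError if cdf was never called
        h.dist._cdfvec.__dict__.pop('_ufunc', None)

while True:
    with frozen_hypergeom(100, 30, 40) as h:
        x = h.cdf(3)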
Alternatively, avoid the frozen distribution and call cdf directly:
from scipy.stats import hypergeom
while True:
    x = hypergeom.cdf(3, 100, 30, 40)
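As far as I can tell, this avoids the problem because each frozen distribution creates its own copy of the underlying distribution object, with its own vectorize instance, while the module-level hypergeom object is created once and reused across calls.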
It’s worth mentioning that memory_profiler is a great tool for finding memory leaks:
from scipy.stats import hypergeom
from memory_profiler import profile
@profile
def main1():
    for _ in range(1000):
        x = hypergeom(100, 30, 40).cdf(3)
main1()
Output:
$ python3 geomprofile.py
Filename: geomprofile.py
Line #    Mem usage    Increment   Line Contents
================================================
     4     69.2 MiB     69.2 MiB   @profile
     5                             def main1():
     6     79.2 MiB      0.0 MiB       for _ in range(1000):
     7     79.2 MiB     10.0 MiB           x = hypergeom(100, 30, 40).cdf(3)
We see that the hypergeom line increased memory use by 10 MiB over the 1000 iterations.
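For comparison, profiling the direct-call variant the same way should show the Increment column staying at (or near) zero for the loop body:

from scipy.stats import hypergeom
from memory_profiler import profile

@profile
def main2():
    for _ in range(1000):
        x = hypergeom.cdf(3, 100, 30, 40)  # no frozen distribution created

main2()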
Drilling down into NumPy’s vectorize took a bit of manual debugging; I didn’t have as much luck with memory_profiler there.
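The standard library’s tracemalloc is another option for that kind of digging, since it reports where retained memory was allocated. A minimal sketch:

import tracemalloc
from scipy.stats import hypergeom

tracemalloc.start(25)  # record up to 25 stack frames per allocation
for _ in range(1000):
    x = hypergeom(100, 30, 40).cdf(3)
snapshot = tracemalloc.take_snapshot()
# the largest entries point at the allocation sites of the retained memory
for stat in snapshot.statistics('traceback')[:3]:
    print(stat)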
In a production situation one might not have the luxury of tracking down the real cause of the memory leak immediately. In that case it may be enough to run the offending code in a short-lived subprocess via multiprocessing, so that the leaked memory is reclaimed every time the subprocess exits. A lightweight option is to use processify; see Liau Yung Siang’s blog post for more details.
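Here’s a sketch of that pattern with multiprocessing.Pool; maxtasksperchild=1 recycles the worker after every task, so anything it leaked is returned to the operating system:

from multiprocessing import Pool
from scipy.stats import hypergeom

def leaky_batch(n):
    # run a batch of leaky calls inside a short-lived worker process
    return [hypergeom(100, 30, 40).cdf(3) for _ in range(n)]

if __name__ == '__main__':
    with Pool(processes=1, maxtasksperchild=1) as pool:
        for _ in range(10):
            results = pool.apply(leaky_batch, (1000,))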