Unbounded memory use with SciPy’s hypergeom

Nadiah ran into an apparent memory leak in SciPy’s hypergeom distribution.

Here is a minimal example. Running this code results in unbounded memory use, looking like a memory leak:

from scipy.stats import hypergeom

while True:
    x = hypergeom(100, 30, 40).cdf(3)

It turns out that this isn’t really a memory leak but rather a problem with NumPy’s vectorize, which creates a circular reference in some situations. Here’s the GitHub issue that I opened: numpy/issues/11867.

In the meantime, a workaround is to manually delete the _ufunc attribute after using cdf:

from scipy.stats import hypergeom

while True:
    h = hypergeom(100, 30, 40)
    x = h.cdf(3)
    del h.dist._cdfvec._ufunc

Alternatively, avoid the frozen distribution and call cdf directly:

from scipy.stats import hypergeom

while True:
    x = hypergeom.cdf(3, 100, 30, 40)
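
Another blunt mitigation (a sketch of my own, not taken from the original workarounds) is to leave the frozen-distribution code alone and periodically force the cyclic garbage collector to run, since reference cycles are exactly what it is there to clean up:

import gc

from scipy.stats import hypergeom

i = 0
while True:
    x = hypergeom(100, 30, 40).cdf(3)
    i += 1
    if i % 1000 == 0:
        gc.collect()  # reclaim the reference cycles left behind by np.vectorize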

It’s worth mentioning that memory_profiler is a great tool for finding memory leaks:

from scipy.stats import hypergeom
from memory_profiler import profile

@profile
def main1():
    for _ in range(1000):
        x = hypergeom(100, 30, 40).cdf(3)

main1()

Output:

$ python3 geomprofile.py 
Filename: geomprofile.py

Line #    Mem usage    Increment   Line Contents
================================================
     4     69.2 MiB     69.2 MiB   @profile
     5                             def main1():
     6     79.2 MiB      0.0 MiB       for _ in range(1000):
     7     79.2 MiB     10.0 MiB           x = hypergeom(100, 30, 40).cdf(3)

We see that the hypergeom line led to an increase in memory use of 10 MiB.

Drilling down into NumPy’s vectorize took a bit of manual debugging; I didn’t have as much luck with memory_profiler there.

In a production situation one might not have the luxury of finding the real cause of the memory leak immediately. In that case it might be enough to wrap the offending code in a call to multiprocessing so that the leaked memory is reclaimed frequently. A lightweight option is to use processify. See Liau Yung Siang’s blog post for more details.
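
As a rough sketch of that idea (using a multiprocessing pool rather than processify), the leaky loop body can be confined to a short-lived worker process so that its memory is returned to the operating system when the worker exits:

import multiprocessing

from scipy.stats import hypergeom

def batch(n):
    # The leaky code, confined to a worker process.
    return [hypergeom(100, 30, 40).cdf(3) for _ in range(n)]

if __name__ == '__main__':
    # maxtasksperchild=1 makes the pool replace the worker process after every
    # task, so whatever memory the task leaked is reclaimed.
    with multiprocessing.Pool(processes=1, maxtasksperchild=1) as pool:
        while True:
            results = pool.map(batch, [1000])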

readFile :: FilePath -> IO [Char]

Using a modern GHC compiler, how much memory would this program use?

x <- readFile "foo"
x `deepseq` print ()

Linear in the size of foo? Or something else?

Turns out, for the default readFile from the Prelude, the answer is about 40 times the size of the input file.

The default Haskell strings take 5 words per character, so on a 64-bit machine this is 5 × 8 = 40 bytes per character. The list of characters is stored as a linked list, roughly like this diagram (taken from Johan Tibell’s ZuriHac 2015 talk, slides are here):

We can check the actual memory usage of a Haskell program (compiled with GHC) by using the RTS options:

ghc readfile.hs -Wall -O2 -rtsopts

./readfile +RTS -toutput <other options>

See this repository for some scripts to benchmark a few variants of readFile:

  • Prelude
  • Data.ByteString
  • Data.ByteString.Char8
  • Data.ByteString.Lazy
  • Data.ByteString.Lazy.Char8

Basically, anything’s better than the default readFile:

[Figure: memory usage of the readFile variants]

QuickCheck’s CoArbitrary – generate random functions!

Property based testing is a great way to improve code quality since it captures logical properties of your system. Instead of writing test cases by hand, you capture logical relationships and then let the test framework generate hundreds or thousands of examples for you.

For a general introduction to property based testing (language-independent), try Charles O’Farrell’s YOW! Night talk Property Based Testing: Finding More Bugs with Less Effort.

QuickCheck provides a typeclass CoArbitrary for generating random functions. In this blog post we derive CoArbitrary in a standalone setting. For the real definition, see Test.QuickCheck. Source code for this post is here.

Some imports:


> {-# LANGUAGE InstanceSigs #-}
> {-# LANGUAGE RankNTypes   #-}
> 
> module Sample where
> 
> import Control.Applicative
> import Control.Monad
> import System.Random

First, a generator of type Gen a lets us use a random number generator to get a value of type a:


> newtype Gen a = Gen
>     { unGen :: StdGen -> a }

(The real Gen in QuickCheck is parameterised over the generator, but we don’t need that here. Also, the real Gen includes a size, which you can control using resize or scale).

The Arbitrary typeclass:


> class Arbitrary a where
>     arbitrary :: Gen a

Since Bool has an instance of Random (and False < True), we can use randomR to define arbitrary for Bool:


> instance Arbitrary Bool where
>     arbitrary = Gen $ \stdgen ->
>       let (r, _) = randomR (False, True) stdgen in r

Here’s a first attempt at implementing Gen (Bool -> Bool), a generator for boolean functions:


> genBoolFn0 :: Gen (Bool -> Bool)
> genBoolFn0 = Gen $ \stdgen ->
>   \a -> let (r, _) = randomR (False, True) stdgen in r

The type is right but it’s going to generate pretty boring functions since it doesn’t even use the a:


> runBoolFnGen :: Gen (Bool -> Bool) -> IO ()
> runBoolFnGen g = do
>   fns  <- samples g
> 
>   forM_ fns  $ \f -> do
>     putStrLn $ "True  => " ++ show (f True)
>     putStrLn $ "False => " ++ show (f False)
>     putStrLn ""

The functions are either const True or const False. Not useful.


ghci> runBoolFnGen genBoolFn0 
True  => True
False => True

True  => False
False => False

True  => False
False => False

True  => True
False => True

True  => False
False => False

True  => True
False => True

True  => False
False => False

True  => False
False => False

True  => False
False => False

We need to split on the a somehow:


> genBoolFn1 :: Gen (Bool -> Bool)
> genBoolFn1 = Gen $ \stdgen -> \a -> case a of
>   True  -> let (r, _) = randomR (False, True) stdgen in r
>   False -> let (r, _) = randomR (False, True) stdgen in r

This isn’t any better. The other thing we can change is the generator. Fortunately, StdGen is an instance of RandomGen, so we have the split function:


> -- The split operation allows one to obtain two
> -- distinct random number generators. This is
> -- very useful in functional programs (for example,
> -- when passing a random number generator down to
> -- recursive calls), but very little work has been done
> -- on statistically robust implementations of split
> -- ([System.Random, System.Random] are the only
> -- examples we know of).
> 
> split :: StdGen -> (StdGen, StdGen)
> split = ...

Taking advantage of laziness, we can use split to write a pure function that gives us an infinite sequence of statistically distinct generators:


> splitGenerator :: RandomGen a => a -> [a]
> splitGenerator r = r0 : splitGenerator r1
>   where
>     (r0, r1) = split r

This is exactly what we need to permute the generator in genBoolFn1. Let’s map True to the generator at index 0 and False to the generator at index 1:


> genBoolFn2 :: Gen (Bool -> Bool)
> genBoolFn2 = Gen $ \stdgen -> \a -> case a of
>   True  -> let (r, _) = randomR
>                           (False, True)
>                           (splitGenerator stdgen !! 0)
>             in r
>   False -> let (r, _) = randomR
>                           (False, True)
>                           (splitGenerator stdgen !! 1)
>             in r

Now the random functions look more random:


ghci> runBoolFnGen genBoolFn2
True  => False
False => True

True  => True
False => False

True  => False
False => True

True  => True
False => False

True  => True
False => True

True  => True
False => False

True  => True
False => True

True  => False
False => False

True  => False
False => True

So, what about random integer functions Int -> Int? We need to map any integer to one of the split generators; in other words, we need a mapping $\mathbb{Z} \rightarrow \mathbb{Z}_{\ge 0}$. Send the non-negative integers to the non-negative even integers, and the negative integers to the positive odd integers:

\[
n \rightarrow
\begin{cases}
2n & \mbox{if } n \geq 0 \\
2(-n) + 1 & \mbox{if } n < 0
\end{cases}
\]

In Haskell this looks like:


> \n -> variant $ if n >= 0 then 2*n else 2*(-n) + 1

where


> variant :: Int -> Gen b -> Gen b
> variant v (Gen g) = Gen $ \r -> g $ splitGenerator r !! (v+1)

Capture this with a typeclass:


> class CoArbitrary a where
>     coarbitrary :: a -> Gen b -> Gen b
> 
> instance CoArbitrary Bool where
>   coarbitrary False = variant 0
>   coarbitrary True  = variant 1
> 
> instance CoArbitrary Int where
>     coarbitrary n = variant $ if n >= 0 then 2*n else 2*(-n) + 1

With some experimentation we can extend the CoArbitrary definitions to lists and tuples:


> instance CoArbitrary a => CoArbitrary [a] where
>   coarbitrary []     = variant 0
>   coarbitrary (x:xs) = variant 1 . coarbitrary (x, xs)
> 
> instance (CoArbitrary a, CoArbitrary b) => CoArbitrary (a, b) where
>   coarbitrary (x, y) = coarbitrary x . coarbitrary y

In general, if we have CoArbitrary a and Arbitrary b then we can derive Arbitrary (a -> b):


> instance (CoArbitrary a, Arbitrary b) => Arbitrary (a -> b) where
>     arbitrary = promote (\a -> coarbitrary a arbitrary)
> 
> promote :: (a -> Gen b) -> (Gen (a->b))
> promote f = Gen $ \r ->
>   \a -> let Gen h = f a
>          in h r

Here are more examples of random functions:


> runGenFn :: (Arbitrary a, Arbitrary b, Show a, Show b)
>          => Gen (a -> b)
>          -> [a]
>          -> IO ()
> runGenFn g as = do
>   fns  <- samples g
> 
>   forM_ fns  $ \f -> do
>     forM_ as $ \a -> putStrLn $ show a ++ " => "
>                                        ++ show (f a)
>     putStrLn ""

Running in ghci:


ghci> runGenFn (arbitrary :: Gen (Int -> Int)) [0, 1, 2]
0 => 198
1 => 940
2 => -200

0 => 734
1 => -627
2 => 6

0 => 965
1 => 581
2 => -585

0 => -306
1 => -918
2 => 678

0 => -929
1 => 336
2 => -696

0 => -66
1 => 123
2 => 875

0 => -234
1 => -673
2 => 216

0 => 355
1 => 56
2 => -615

0 => 278
1 => 116
2 => 967

ghci> runGenFn (arbitrary :: Gen (Int -> Bool)) [0, 1, 2]
0 => False
1 => True
2 => False

0 => True
1 => False
2 => True

0 => True
1 => False
2 => False

0 => True
1 => False
2 => False

0 => True
1 => True
2 => True

0 => True
1 => True
2 => False

0 => False
1 => True
2 => False

0 => True
1 => True
2 => False

0 => True
1 => True
2 => True

ghci> runGenFn (arbitrary :: Gen ([Int] -> [Int])) [[42], [0, 1, 2]]
[42] => [-93,-540,-715,-557,-249]
[0,1,2] => [433,97,665,554,-690,635]

[42] => [-785,174,-676,-662,-549]
[0,1,2] => [-735,-328,226,-524,423]

[42] => [157,976,-774]
[0,1,2] => [-197,608,-520,-37,-689]

[42] => [-684]
[0,1,2] => [902,-138,-33,689,-774,-713,474,-638]

[42] => [-782,540,649,320,-326,92,896,-76]
[0,1,2] => [156]

[42] => [524,137]
[0,1,2] => [642,-763,771,-400,825,71,895,586,252,37]

[42] => [641,-304,-890,-375,449,-608,662,546,-740,-406]
[0,1,2] => [-632,-685,-232,202,-994,666,-121]

[42] => [200,599,-844]
[0,1,2] => [-554,149,370,547,-755,-706,131]

[42] => [-898,645,-472]
[0,1,2] => [-77]

Appendix

To get the code above to work we need instances for Monad, Functor, and Applicative. These are all lifted from Test.QuickCheck.


> instance Monad Gen where
>     return :: a -> Gen a
>     return a = Gen $ \_ -> a
> 
>     (>>=) :: Gen a -> (a -> Gen b) -> Gen b
> 
>     (Gen g) >>= f = Gen $ \r ->
>        let (r1, r2) = split r
>            Gen k = f $ g r1
>         in k r2
> 
> instance Functor Gen where
>     fmap f (Gen h) = Gen $ \r -> f (h r)
> 
> instance Applicative Gen where
>     pure x = Gen $ \_ -> x
> 
>     f <*> x = do
>         f' <- f
>         x' <- x
>         return $ f' x'
> 
> generate :: Gen a -> IO a
> generate (Gen g) = do
>     stdgen <- getStdGen
>     return $ g stdgen

Use sequence to get a few samples. Note that this relies on the Applicative instance’s definition of <*> to obtain a fresh random number generator each time it is used, which in turn goes through the Monad instance’s definition of >>=, which uses split.


> samples :: Gen a -> IO [a]
> samples g = generate $ sequence [g, g, g, g, g, g, g, g, g]

> instance Arbitrary Int where
>   arbitrary = Gen $ \stdgen ->
>     let (r, _) = randomR (-1000, 1000) stdgen in r
> 
> choose :: Random a => (a,a) -> Gen a
> choose range = Gen $ \stdgen ->
>   let (r, _) = randomR range stdgen in r
> 
> vectorOf :: Int -> Gen a -> Gen [a]
> vectorOf = replicateM
> 
> listOf :: Gen a -> Gen [a]
> listOf g = do
>   k <- choose (1, 10)
>   vectorOf k g
> 
> instance Arbitrary a => Arbitrary [a] where
>   arbitrary = listOf arbitrary

Pricing an option on a cumulative sum.

Consider an option with a payout on an underlying which is a cumulative sum. For example, a construction company might want to mitigate the risk of rainfall causing delays in a project. The construction company could buy a call option that pays out when the cumulative rainfall over any 3-day period exceeds 180mm. Further, we restrict this so that any one day can only be used in one payoff. If the cumulative rainfall $r$ over a 3-day period exceeded 180mm, the entity selling the call option would have the obligation to pay the construction company based on the difference $r - 180$.

As the entity selling one of these options, we need to work out the expected loss and worst possible scenario for payouts.

Let $r_0, r_1, \ldots$ be the sequence of daily rainfall. Form the 3-sums:

\[
s_0 = r_0+r_1+r_2, s_1 = r_1+r_2+r_3, s_2 = r_2+r_3+r_4, \ldots
\]

Let $C > 0$ be the cutoff (e.g. 180mm). Then form the sequence $w_0, w_1, \ldots$ where

\[
w_i = f_C(s_i)
\]

where $f_C(x)$ is $0$ if $x \leq C$ and $x$ otherwise.

As the seller of the option, the worst outcome for us (i.e. largest payout) corresponds to the subsequence

\[
w_{k_1}, w_{k_2}, \ldots w_{k_n}
\]

that maximises the sum $\sum_{l=1}^n w_{k_l}$ for some $n$, where the terms $w_{k_l}$ are mutually disjoint (i.e. we don’t take the 3-sum of any days that overlap). We consider this sum over one year, but the same approach applies to monthly, quarterly, etc.

Formulation as a graph problem

Label vertices with the 3-sums $w_i$ and create an edge $(w_i, w_j)$ if the sums $w_i$ and $w_j$ refer to overlapping time periods. Then our problem is equivalent to finding a Maximum Weighted Independent Set in the graph. This is a classic NP-hard problem.

Formulation as an integer linear programming problem

Create binary decision variables $x_0, x_1, \ldots$ and optimise

\[ \sum_i w_i x_i \]

subject to

\[
x_i + x_j \leq 1
\]

when $w_i$ and $w_j$ refer to non-disjoint cumulative 3-sums.

Python solver

Using PuLP we can very quickly write a solver in Python.

First set up a solver:

prob = pulp.LpProblem('maxcsums', pulp.LpMaximize)

Create binary variables corresponding to each vertex:

x_vars = [ pulp.LpVariable('x' + str(i), cat='Binary')
             for (i, _) in enumerate(weights) ]

Use reduce (functools.reduce in Python 3) to form the sum \( \sum_i w_i x_i \):

prob += reduce(lambda a,b: a+b,
                   [ weights[i]*x_vars[i] for i in range(len(weights))])

Finally, given a list of indices (3-tuples), add the constraints to avoid solutions with adjacent vertices:

for (i, x) in enumerate(indices):
  for (j, y) in enumerate(indices):
    if x < y:
      if is_adjacent(x, y):
        prob += x_vars[i] + x_vars[j] <= 1
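
Putting the pieces together, here is a self-contained sketch. The weights and the is_adjacent test are illustrative assumptions (window i is taken to cover days i, i+1, i+2); in the real problem they come from the rainfall data:

from functools import reduce

import pulp

# Illustrative w_i values: most 3-day windows are below the cutoff, so w_i = 0.
weights = [0.0, 190.0, 210.0, 0.0, 0.0, 185.0, 0.0]

# Window i covers days (i, i+1, i+2).
indices = [(i, i + 1, i + 2) for i in range(len(weights))]

def is_adjacent(x, y):
    # Two windows conflict if they share at least one day.
    return len(set(x) & set(y)) > 0

prob = pulp.LpProblem('maxcsums', pulp.LpMaximize)

x_vars = [ pulp.LpVariable('x' + str(i), cat='Binary')
             for (i, _) in enumerate(weights) ]

# Objective: sum_i w_i x_i
prob += reduce(lambda a, b: a + b,
               [ weights[i]*x_vars[i] for i in range(len(weights)) ])

# Constraint: no two overlapping windows may both be selected.
for (i, x) in enumerate(indices):
  for (j, y) in enumerate(indices):
    if x < y:
      if is_adjacent(x, y):
        prob += x_vars[i] + x_vars[j] <= 1

prob.solve()
print([ int(pulp.value(v)) for v in x_vars ])

With these weights the optimal objective value is 395: windows 2 and 5 are selected, since windows 1 and 2 overlap and cannot both be chosen.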

Changi daily rainfall

For a concrete example, let’s use the daily rainfall data for Changi in Singapore.

Here are the 3-sums:



For cutoffs from 100 to 300, here are the number of extreme events per year:

     100 120 140 160 180 200 220 240 260 280 300
1980   3   3   2   1   1   1   1   1   1   0   0
1981   1   1   1   1   0   0   0   0   0   0   0
1982   2   1   1   1   1   1   1   1   0   0   0
1983   2   2   2   2   2   1   0   0   0   0   0
1984   6   4   2   1   1   1   0   0   0   0   0
1985   1   1   0   0   0   0   0   0   0   0   0
1986   4   4   4   3   1   0   0   0   0   0   0
1987   4   3   2   2   1   1   1   1   1   0   0
1988   9   4   2   1   1   0   0   0   0   0   0
1989   3   1   1   1   1   1   1   1   0   0   0
1990   2   1   1   1   0   0   0   0   0   0   0
1991   4   3   1   0   0   0   0   0   0   0   0
1992   6   3   2   2   1   1   1   0   0   0   0
1993   1   0   0   0   0   0   0   0   0   0   0
1994   4   3   2   2   2   1   0   0   0   0   0
1995   3   3   3   2   2   2   1   0   0   0   0
1996   1   0   0   0   0   0   0   0   0   0   0
1997   0   0   0   0   0   0   0   0   0   0   0
1998   3   3   1   1   0   0   0   0   0   0   0
1999   2   0   0   0   0   0   0   0   0   0   0
2000   2   2   0   0   0   0   0   0   0   0   0
2001   5   3   2   2   2   2   2   2   2   1   0
2002   4   1   1   1   1   0   0   0   0   0   0
2003   5   1   1   1   1   1   0   0   0   0   0
2004   6   4   2   2   1   1   1   0   0   0   0
2005   2   1   1   1   0   0   0   0   0   0   0
2006   5   5   5   5   3   2   2   1   1   1   1
2007   7   5   4   3   1   1   1   1   1   0   0
2008   4   3   1   0   0   0   0   0   0   0   0
2009   1   1   0   0   0   0   0   0   0   0   0
2010   4   1   0   0   0   0   0   0   0   0   0
2011   4   2   1   1   1   1   1   1   1   1   0
2012   3   0   0   0   0   0   0   0   0   0   0
2013   5   3   3   1   0   0   0   0   0   0   0
2014   0   0   0   0   0   0   0   0   0   0   0
2015   0   0   0   0   0   0   0   0   0   0   0
2016   1   0   0   0   0   0   0   0   0   0   0

We will go with 180mm as the cutoff. Ultimately, choosing the strike is a business decision.

The burn is $\texttt{max}(w_i - 180, 0)$, i.e. the payout that the seller of the option would be liable for. Here is the daily burn for Changi daily rainfall (zero rows are not shown):

            c3sum   burn
DATE                    
1980-01-20  275.5   95.5
1982-12-22  245.4   65.4
1983-08-22  190.6   10.6
1983-12-26  218.8   38.8
1984-02-02  219.3   39.3
1986-12-11  186.7    6.7
1987-01-11  260.5   80.5
1988-09-21  188.9    8.9
1989-12-01  252.5   72.5
1992-11-11  223.2   43.2
1994-11-11    209   29.0
1994-12-19  180.7    0.7
1995-11-02  203.4   23.4
1995-02-06  236.9   56.9
2001-01-16  268.7   88.7
2001-12-27  281.4  101.4
2002-01-25  182.1    2.1
2003-01-31  210.5   30.5
2004-01-25  237.9   57.9
2006-12-19  356.4  176.4
2006-12-26  222.7   42.7
2006-01-09  190.7   10.7
2007-01-12  260.4   80.4
2011-01-31  298.2  118.2

The yearly burn is $\sum_i{\texttt{max}(w_i-180,0)}$ where the $w_i$ are in a particular calendar year. Plotting the yearly burn:



The histogram of burn values shows that a value around $80$ is the most frequent total yearly burn:



Cumulative histogram of the yearly burn:



From the yearly burn we can compute the expected loss over various periods:


Expected loss (all years):     71.13
Expected loss (last 10 years): 86.22
Expected loss (last 20 years): 71.13

We see that the expected loss over the last 10, 20 years differs by about 15.0.
The 99th percentile is p99 = 223.051.
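
These summary statistics fall straight out of the yearly burn series. A small sketch, assuming yearly_burn is a pandas Series with one entry per calendar year (zero for years with no payout):

import numpy as np

def summarise(yearly_burn):
    # Expected loss = average yearly burn over the chosen window of years.
    print('Expected loss (all years):    ', yearly_burn.mean())
    print('Expected loss (last 10 years):', yearly_burn.tail(10).mean())
    print('Expected loss (last 20 years):', yearly_burn.tail(20).mean())

    # A high value on the tail, e.g. the 99th percentile.
    print('p99:', np.percentile(yearly_burn, 99))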

Unlike options on a liquid instrument such as an interest rate swap, a stock, or a currency, this option is priced in an incomplete market. There is no way to hedge, so the Black-Scholes formula cannot be used. Typically the price will be set as

\[
f(\texttt{expected loss}, \texttt{high value on the tail})
\]

where the high value on the tail is p99 or p99.9 (or similar); the function $f$ incorporates the price maker’s risk appetite, but it will be approximately

\[
\texttt{option price} \approx \texttt{expected loss} + \Delta
\]

Notes

Source code is on Github: https://github.com/carlohamalainen/maxc3sum

Most of the data manipulation is done using pandas, which makes lots of things one-liners. For example, to aggregate the daily burn into the yearly burn, given an index of datetime.date values, we can group by year and then sum the result:

    yearly_burn = daily_burn.groupby(lambda x: x.year).sum()
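
As a toy example of the same pattern (made-up dates and c3sum values, not the Changi data):

import datetime

import pandas as pd

c3sum = pd.Series([275.5, 190.6, 150.0],
                  index=[datetime.date(1980, 1, 20),
                         datetime.date(1983, 8, 22),
                         datetime.date(1983, 9, 1)])

daily_burn = (c3sum - 180.0).clip(lower=0.0)  # max(w_i - 180, 0)

yearly_burn = daily_burn.groupby(lambda x: x.year).sum()
print(yearly_burn)
# 1980    95.5
# 1983    10.6
# dtype: float64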

To set the gap for GLPK, pass the command line option when setting up the initial GLPK object:

    pulp.GLPK(options=['--mipgap', '0.1']).solve(prob)

Naturally, finding a good gap is problem-specific and requires some experimentation.

Push to Amazon Glacier using amazonka

Here is a small Haskell package for pushing files to Amazon Glacier: https://github.com/carlohamalainen/glacier-push. It uses Brendan Hay’s amazonka API, in particular amazonka-glacier.

One thing that I couldn’t find in amazonka was a way to calculate the tree hash of a file. The Glacier API needs this for each part that is uploaded as well as the whole file. Amazon explains how to calculate the tree hash in their Glacier docs and provides sample code in Java and C++. Since the algorithm is recursive, it is quite short in Haskell:


oneMb :: Int64
oneMb = 1024*1024

treeHash :: BS.ByteString -> Hash
treeHash s = treeHash' $ map sha256 $ oneMbChunks s
  where
    treeHash' []  = error "Internal error in treeHash'."
    treeHash' [x] = B16.encode x
    treeHash' xs  = treeHash' $ next xs

    next :: [BS.ByteString] -> [BS.ByteString]
    next []       = []
    next [a]      = [a]
    next (a:b:xs) = sha256 (BS.append a b) : next xs

    oneMbChunks :: BS.ByteString -> [BS.ByteString]
    oneMbChunks x
      | BS.length x <= oneMb = [x]
      | otherwise            = BS.take oneMb x : oneMbChunks (BS.drop oneMb x)

    sha256 :: BS.ByteString -> BS.ByteString
    sha256 = cs . SHA256.hashlazy

To push a large file to Glacier we do three things: initiate the multipart upload, upload each part (say, 100Mb chunks), and then finalize the upload.

Initiate the multipart upload

We do this to get an uploadId which is then used for each of the multipart uploads. We use initiateMultipartUpload, and need to set the part size.


initiateMulti env vault _partSize = send' env mpu
  where
    mpu = initiateMultipartUpload accountId vault
            & imuPartSize .~ (Just $ cs $ show _partSize)

Upload the parts

With the response from initiating the multipart upload (the mu parameter in uploadOnePart) we can push a single part using uploadMultipartPart. Here we have to also set the checksum and content range:


uploadOnePart env vault mu p = do
    let Part{..} = p

    body <- toHashed <$> getPart _path (_partStart, _partEnd)

    uploadId <- case mu ^. imursUploadId of
                    Nothing     -> throw MissingUploadID
                    Just uid    -> return uid

    let cr  = contentRange _partStart _partEnd
        ump = uploadMultipartPart accountId vault uploadId body
                & umpChecksum .~ (Just $ cs $ p ^. partHash)
                & umpRange    .~ Just cr

    send' env ump

  where

    contentRange :: Int64 -> Int64 -> Text
    contentRange x y = "bytes " `append` cs (show x) `append` "-" `append` cs (show y) `append` "/*"

Complete the multipart upload

Completing the multipart upload lets Glacier know that it should start its job of assembling all the parts into a full archive. We have to set the archive size and the tree hash of the entire file.


completeMulti env vault mp mu = do
    uploadId <- case mu ^. imursUploadId of
                    Nothing     -> throw MissingUploadID
                    Just uid    -> return uid

    let cmu = completeMultipartUpload accountId vault uploadId
                & cmuArchiveSize .~ (Just $ cs $ show $ mp ^. multipartArchiveSize)
                & cmuChecksum    .~ (Just $ cs $ mp ^. multipartFullHash)

    send' env cmu

Notes

In each of these functions I used a helper for sending the request:


send' env x = liftIO $ runResourceT . runAWST env $ send x

I run the main block of work in a KatipContextT since I am using katip for structured logging. Adding new key-value info to the log context is accomplished using katipAddContext.


go vault' path = do
    $(logTM) InfoS "Startup."

    let vault = cs vault'

    let myPartSize = 128*oneMb

    mp  <- liftIO $ mkMultiPart path myPartSize

    env <- liftIO $ newEnv'
    mu  <- liftIO $ initiateMulti env vault myPartSize

    let uploadId = fromMaybe (error "No UploadId in response, can't continue multipart upload.")
                 $ mu ^. imursUploadId

    partResponses <- forM (mp ^. multiParts) $ \p ->
        katipAddContext (sl "uploadId" uploadId) $
        katipAddContext (sl "location" $ fromMaybe "(nothing)" $ mu ^. imursLocation) $ do
            doWithRetries 10 (uploadOnePart env vault mu p)

    case lefts partResponses of
        []   -> do $(logTM) InfoS "All parts uploaded successfully, now completing the multipart upload."
                   catch (do completeResponse <- completeMulti env vault mp mu
                             katipAddContext (sl "uploadId" uploadId)                             $
                              katipAddContext (sl "archiveId" $ completeResponse ^. acoArchiveId) $
                               katipAddContext (sl "checksum" $ completeResponse ^. acoChecksum ) $
                                katipAddContext (sl "location" $ completeResponse ^. acoLocation) $ do
                                  $(logTM) InfoS "Done"
                                  liftIO exitSuccess)
                         (\e -> do logServiceError "Failed to complete multipart upload." e
                                   $(logTM) ErrorS "Failed."
                                   liftIO exitFailure)

        errs -> do forM_ errs (logServiceError "Failed part upload.")
                   $(logTM) ErrorS "Too many part errors."
                   liftIO exitFailure

logServiceError msg (ServiceError e)
    = let smsg :: Text
          smsg = toText $ fromMaybe "" $ e ^. serviceMessage

          scode :: Text
          scode = toText $ e ^. serviceCode

        in katipAddContext (sl "serviceMessage" smsg) $
            katipAddContext (sl "serviceCode" scode)  $
             (headersAsContext $ e ^. serviceHeaders) $
               $(logTM) ErrorS msg

logServiceError msg (TransportError e)
    = let txt :: Text
          txt = toText $ show e
        in katipAddContext (sl "TransportError" txt) $
            $(logTM) ErrorS msg

logServiceError msg (SerializeError e)
    = let txt :: Text
          txt = toText $ show e
        in katipAddContext (sl "SerializeError" txt) $
            $(logTM) ErrorS msg

I found it handy to write this little helper function to turn each header from an amazonka ServiceError into a Katip context key/value pair:


headersAsContext :: KatipContext m => [Header] -> m a -> m a
headersAsContext hs = foldl (.) id $ map headerToContext hs
  where
    headerToContext :: KatipContext m => Header -> m a -> m a
    headerToContext (h, x) = let h' = cs $ CI.original h :: Text
                                 x' = cs x               :: Text
                               in katipAddContext (sl h' x')

Katip can write to ElasticSearch using katip-elasticsearch. Then you’d be able to search for errors on specific header fields, etc.

Sample run

glacier-push-exe basement myfile
[2017-08-30 12:44:47][glacier-push.main][Info][x4][1724][ThreadId 7][main:Main app/Main.hs:300:7] Startup.
[2017-08-30 12:44:50][glacier-push.main][Info][x4][1724][ThreadId 7][location:/998720554704/vaults/basement/multipart-uploads/vZMCGNsLGhfTJ_hJ-CJ_OF_juCAY1IaZDl_A3YqOZXnuQRH_AtPiMaUE-K1JDew-ZwiIuDZgR3QbjsJIEfWtGeMNeKDs][partEnd:134217727][partStart:0][uploadId:vZMCGNsLGhfTJ_hJ-CJ_OF_juCAY1IaZDl_A3YqOZXnuQRH_AtPiMaUE-K1JDew-ZwiIuDZgR3QbjsJIEfWtGeMNeKDs][main:Main app/Main.hs:213:15] Uploading part.
[2017-08-30 12:45:45][glacier-push.main][Info][x4][1724][ThreadId 7][location:/998720554704/vaults/basement/multipart-uploads/vZMCGNsLGhfTJ_hJ-CJ_OF_juCAY1IaZDl_A3YqOZXnuQRH_AtPiMaUE-K1JDew-ZwiIuDZgR3QbjsJIEfWtGeMNeKDs][partEnd:268435455][partStart:134217728][uploadId:vZMCGNsLGhfTJ_hJ-CJ_OF_juCAY1IaZDl_A3YqOZXnuQRH_AtPiMaUE-K1JDew-ZwiIuDZgR3QbjsJIEfWtGeMNeKDs][main:Main app/Main.hs:213:15] Uploading part.
[2017-08-30 12:46:37][glacier-push.main][Info][x4][1724][ThreadId 7][location:/998720554704/vaults/basement/multipart-uploads/vZMCGNsLGhfTJ_hJ-CJ_OF_juCAY1IaZDl_A3YqOZXnuQRH_AtPiMaUE-K1JDew-ZwiIuDZgR3QbjsJIEfWtGeMNeKDs][partEnd:293601279][partStart:268435456][uploadId:vZMCGNsLGhfTJ_hJ-CJ_OF_juCAY1IaZDl_A3YqOZXnuQRH_AtPiMaUE-K1JDew-ZwiIuDZgR3QbjsJIEfWtGeMNeKDs][main:Main app/Main.hs:213:15] Uploading part.
[2017-08-30 12:46:55][glacier-push.main][Info][x4][1724][ThreadId 7][main:Main app/Main.hs:320:22] All parts uploaded successfully, now completing the multipart upload.
[2017-08-30 12:46:57][glacier-push.main][Info][x4][1724][ThreadId 7][archiveId:bImG6jM0eQGNC7kIJTsC_wtcAXdPDUtJ-NyfstrkxeyTtXC_iEgkvenH-h_eQH-LYbhVKWJM7WuZlb7OHKtgKJNEpOtVaqxEhlNRTHphUtLCurcHAQDHKkiTnIXTpFxgPgvP9Q0axA][checksum:4f08645d8f3705dc222eef7547591c400362806abb7a6298b9267ebf2be7d901][location:/998720554704/vaults/basement/archives/bImG6jM0eQGNC7kIJTsC_wtcAXdPDUtJ-NyfstrkxeyTtXC_iEgkvenH-h_eQH-LYbhVKWJM7WuZlb7OHKtgKJNEpOtVaqxEhlNRTHphUtLCurcHAQDHKkiTnIXTpFxgPgvP9Q0axA][uploadId:vZMCGNsLGhfTJ_hJ-CJ_OF_juCAY1IaZDl_A3YqOZXnuQRH_AtPiMaUE-K1JDew-ZwiIuDZgR3QbjsJIEfWtGeMNeKDs][main:Main app/Main.hs:326:37] Done

The lines are pretty long (as they are intended for consumption into ElasticSearch, not human parsing) so here is one with line breaks:

[2017-08-30 12:45:45]
 [glacier-push.main]
 [Info]
 [x4]
 [1724]
 [ThreadId 7]
 [location:/998720554704/vaults/basement/multipart-uploads/vZMCGNsLGhfTJ_hJ-CJ_OF_juCAY1IaZDl_A3YqOZXnuQRH_AtPiMaUE-K1JDew-ZwiIuDZgR3QbjsJIEfWtGeMNeKDs]
 [partEnd:268435455]
 [partStart:134217728]
 [uploadId:vZMCGNsLGhfTJ_hJ-CJ_OF_juCAY1IaZDl_A3YqOZXnuQRH_AtPiMaUE-K1JDew-ZwiIuDZgR3QbjsJIEfWtGeMNeKDs]
 [main:Main app/Main.hs:213:15] Uploading part.

Checking out and compiling

Use Stack. Then:

$ git clone https://github.com/carlohamalainen/glacier-push.git
$ cd glacier-push
$ stack build

To browse the source on github, have a look at https://github.com/carlohamalainen/glacier-push.

FSA B3164 bottom bracket replacement

My Boardman Team TXC 650b hardtail mountain bike has an FSA crankset and bottom bracket. After a year and a half the bottom bracket got quite rough so I decided to swap it out with a Hope Hollowtech II bottom bracket.

I couldn’t find much online about this bottom bracket. Markings include: “FSA B3164 MegaExo 24mm MS185” and “BC1.37″ x 24T”.

I replaced it with this Hope bottom bracket, the 68/73mm model.

[Photo: Hope bottom bracket]

It comes with three spacers but not all are needed.

The drive side cup of the new bottom bracket screwed in easily but I stripped the thread of the bottom bracket on the non-drive side, probably due to gunk in the threads.

I managed to get the cup out and then used a dremel (on low speed) to clean up the threads.

To chase the threads, I made a few cuts into the new bottom bracket cup (following this video), which let it bite into the thread as I pushed it into the bottom bracket shell. Be careful: if you get this wrong you’ll probably destroy the thread completely, rendering the frame junk.

 

ghc-imported-from => ghc-mod (August 2017)

I have a pull request to merge ghc-imported-from into ghc-mod. The main benefit of being part of ghc-mod is that I don’t have to duplicate ghc-mod’s infrastructure for handling sandboxes, GHC options, interfaces to other build tools like Stack, and compatibility with more versions of GHC.

The pull request is still under review, so until then you can try it out by cloning the development branches:

git clone -b imported-from https://github.com/DanielG/ghc-mod.git ghc-mod-imported-from
cd ghc-mod-imported-from
cabal update && cabal sandbox init && cabal install
export PATH=`pwd`/.cabal-sandbox/bin:$PATH

Assuming that you use vim-plug for managing Vim/Neovim plugins, use my branch of ghcmod-vim by adding this to your vimrc:

call plug#begin('~/.vim/plugged')

Plug 'carlohamalainen/ghcmod-vim', { 'branch': 'ghcmod-imported-from-cmd', 'for' : 'haskell' }

Install the plugin with :PlugInstall in vim.

Recently, xdg-open stopped working for me (others have had the same issue) so I recommend setting ghcmod_browser in your vimrc:

let g:ghcmod_browser = '/usr/bin/firefox'

Here are some handy key mappings:

au FileType  haskell nnoremap  :GhcModType
au FileType  haskell nnoremap  :GhcModInfo
au FileType  haskell nnoremap  :GhcModTypeClear

au FileType lhaskell nnoremap  :GhcModType
au FileType lhaskell nnoremap  :GhcModInfo
au FileType lhaskell nnoremap  :GhcModTypeClear

au FileType haskell  nnoremap  :GhcModOpenDoc
au FileType lhaskell nnoremap  :GhcModOpenDoc

au FileType haskell  nnoremap  :GhcModDocUrl
au FileType lhaskell nnoremap  :GhcModDocUrl

au FileType haskell  vnoremap  :GhcModOpenHaddockVismode
au FileType lhaskell vnoremap  :GhcModOpenHaddockVismode

au FileType haskell  vnoremap  :GhcModEchoUrlVismode
au FileType lhaskell vnoremap  :GhcModEchoUrlVismode

On the command line, use the imported-from command. It tells you the defining module, the exporting module, and the Haddock URL:

$ ghc-mod imported-from Foo.hs 9 34
base-4.8.2.0:System.IO.print Prelude /opt/ghc/7.10.3/share/doc/ghc/html/libraries/base-4.8.2.0/Prelude.html

From Vim/Neovim, navigate to a symbol and hit F4 which will open the Haddock URL in your browser, or F5 to echo the command-line output.

Ubuntu 17 – device not managed

I plugged a D-Link DUB-1312 USB ethernet adapter into my laptop running Ubuntu Zesty, but Network Manager said that the interface was “not managed”.

The fix, found here, is to remove the contents of one file. Better to save the original file and touch an empty one:

$ sudo mv    /usr/lib/NetworkManager/conf.d/10-globally-managed-devices.conf{,_ORIGINAL}
$ sudo touch /usr/lib/NetworkManager/conf.d/10-globally-managed-devices.conf

For reference, here’s the info about the DUB-1312 USB ethernet adapter:

$ sudo apt update
$ sudo apt install hwinfo
$ sudo hwinfo --netcard

(other output snipped)

40: USB 00.0: 0200 Ethernet controller
  [Created at usb.122]
  Unique ID: VQs5.d0KcpDt5qE6
  Parent ID: 75L1.MLPSY0FvjsF
  SysFS ID: /devices/pci0000:00/0000:00:14.0/usb2/2-6/2-6.4/2-6.4.3/2-6.4.3:1.0
  SysFS BusID: 2-6.4.3:1.0
  Hardware Class: network
  Model: "D-Link DUB-1312"
  Hotplug: USB
  Vendor: usb 0x2001 "D-Link"
  Device: usb 0x4a00 "D-Link DUB-1312"
  Revision: "1.00"
  Serial ID: "000000000005FA"
  Driver: "ax88179_178a"
  Driver Modules: "ax88179_178a"
  Device File: enxe46f13f4be18
  HW Address: e4:6f:13:f4:be:18
  Permanent HW Address: e4:6f:13:f4:be:18
  Link detected: yes
  Module Alias: "usb:v2001p4A00d0100dcFFdscFFdp00icFFiscFFip00in00"
  Driver Info #0:
    Driver Status: ax88179_178a is active
    Driver Activation Cmd: "modprobe ax88179_178a"
  Config Status: cfg=new, avail=yes, need=no, active=unknown
  Attached to: #33 (Hub)

Structured logging to AWS ElasticSearch

A while ago I wrote about how to set up a structured logging service using PostgreSQL. AWS now makes it possible to have the same functionality (plus more) in the “serverless” style. For background on the idea of serverless architecture, watch this talk: GOTO 2017 • Serverless: the Future of Software Architecture • Peter Sbarski. Parts of this post are based on this guide on serverless AWS lambda elasticsearch and kibana.

First, create an Amazon Elasticsearch Service Domain. I used the smallest instance size since it’s just for personal use. Full docs are here.

For programmatic access control, create an AWS IAM user and make a note of its "arn" identifier, e.g. arn:aws:iam::000000000000:user/myiamuser. Then add an access policy as follows. We also allow access from our own IP address so that we can use the Kibana interface. I made an ElasticSearch domain called "logs"; see the Resource field below:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::000000000000:user/myiamuser"
      },
      "Action": "es:*",
      "Resource": "arn:aws:es:ap-southeast-1:000000000000:domain/logs/*"
    },
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "*"
      },
      "Action": "es:*",
      "Resource": "arn:aws:es:ap-southeast-1:000000000000:domain/logs/*",
      "Condition": {
        "IpAddress": {
          "aws:SourceIp": "xxx.xxx.xxx.xxx"
        }
      }
    }
  ]
}

To post to the ElasticSearch instance we use requests-aws4auth:

sudo pip install requests-aws4auth

Then we can post a document, a json blob, using the following script. Set the host, region, AWS key, and AWS secret key. This script saves the system temperature under an index system-stats with the ISO date attached.

import datetime 

from elasticsearch import Elasticsearch, RequestsHttpConnection
from requests_aws4auth import AWS4Auth

HOST        = 'search-logs-xxxxxxxxxxxxxxxxxxxxxxxxxx.ap-southeast-1.es.amazonaws.com' # see 'Endpoint' in ES status page
REGION      = 'ap-southeast-1' # choose the correct region
AWS_KEY     = 'XXXXXXXXXXXXXXXXXXXX'
AWS_SECRET  = 'YYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYY'
 
def get_temp():
    return 42.0 # actually read from 'sensors' or similar

if __name__ == '__main__':
    now  = datetime.datetime.now()
    date = now.date().isoformat()

    doc = {'host': 'blah', 'temperature': get_temp(), 'datetime': now.isoformat()}

    awsauth = AWS4Auth(AWS_KEY, AWS_SECRET, REGION, 'es')

    es = Elasticsearch(
            hosts=[{'host': HOST, 'port': 443}],
            http_auth=awsauth, use_ssl=True, verify_certs=True,
            connection_class=RequestsHttpConnection)

    _index = 'system-stats-' + date
    _type  = 'temperature'
    print(doc)
    print(es.index(index=_index, doc_type=_type, body=doc))

To query data we use elasticsearch-dsl.

sudo pip install elasticsearch-dsl

from elasticsearch import Elasticsearch
from elasticsearch import RequestsHttpConnection
from requests_aws4auth import AWS4Auth

from elasticsearch_dsl import Search

from datetime import datetime

HOST        = 'search-logs-xxxxxxxxxxxxxxxxxxxxxxxxxx.ap-southeast-1.es.amazonaws.com' # see 'Endpoint' in ES status page
REGION      = 'ap-southeast-1' # choose the correct region
AWS_KEY     = 'XXXXXXXXXXXXXXXXXXXX'
AWS_SECRET  = 'YYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYY'

awsauth = AWS4Auth(AWS_KEY, AWS_SECRET, REGION, 'es')

client = Elasticsearch(
            hosts=[{'host': HOST, 'port': 443}],
            http_auth=awsauth, use_ssl=True, verify_certs=True,
            connection_class=RequestsHttpConnection)

plot_date      = '2017-08-06'
monitored_host = 'blah'

s = Search(using=client, index='system-stats-' + plot_date) \
       .query('match', host=monitored_host)

response = s.execute()

xy = [(datetime.strptime(hit.datetime, '%Y-%m-%dT%H:%M:%S.%f'), hit.temperature) for hit in response]
xy = sorted(xy, key=lambda z: z[0])

for (x, y) in xy:
    print(x,y)

Sample output:

$ python3 dump.py 
2017-08-06 04:00:02.337370 32.0
2017-08-06 05:00:01.779796 37.0
2017-08-06 07:00:01.789370 37.0
2017-08-06 11:00:01.711586 40.0
2017-08-06 12:00:02.054906 42.0
2017-08-06 16:00:02.075869 44.0
2017-08-06 18:00:01.619764 43.0
2017-08-06 19:00:02.319470 38.0
2017-08-06 20:00:03.098032 43.0
2017-08-06 22:00:03.629017 43.0

For exploring the data you can also use kibana, which is included with the ElasticSearch service from AWS.

Another nifty thing about the AWS infrastructure is that you can use Lambda to create ElasticSearch entries when objects drop in an S3 bucket. More details in this post.
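
That post has the details; as a rough sketch (not from that post), a Python Lambda handler for S3 "object created" events could index some object metadata into the same domain. The index name and document fields here are assumptions; HOST, REGION, and the AWS4Auth credentials are as in the scripts above:

from elasticsearch import Elasticsearch, RequestsHttpConnection
from requests_aws4auth import AWS4Auth

HOST       = 'search-logs-xxxxxxxxxxxxxxxxxxxxxxxxxx.ap-southeast-1.es.amazonaws.com'
REGION     = 'ap-southeast-1'
AWS_KEY    = 'XXXXXXXXXXXXXXXXXXXX'
AWS_SECRET = 'YYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYY'

awsauth = AWS4Auth(AWS_KEY, AWS_SECRET, REGION, 'es')

es = Elasticsearch(
        hosts=[{'host': HOST, 'port': 443}],
        http_auth=awsauth, use_ssl=True, verify_certs=True,
        connection_class=RequestsHttpConnection)

def handler(event, context):
    # Triggered by the S3 bucket notification; index metadata about each
    # new object so it can be searched in ElasticSearch/Kibana.
    for record in event['Records']:
        doc = {'bucket': record['s3']['bucket']['name'],
               'key':    record['s3']['object']['key'],
               'size':   record['s3']['object'].get('size'),
               'time':   record['eventTime']}
        es.index(index='s3-objects', doc_type='s3-object', body=doc)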

Stripping html tags using TagSoup

I had a situation, when converting old blog posts to WordPress, where I wanted to strip all the extra info on the pre tags. For example this:

<pre><code><span style="">&gt;</span> <span style="color: blue; font-weight: bold;">import</span> <span style="">Data</span><span style="">.</span><span style="">Char</span>

would turn into:

>import Data.Char

It turns out that this is really easy using TagSoup.

module Detag where

import Control.Monad
import Text.HTML.TagSoup

The function to strip tags works on a list of tags of strings:

strip :: [Tag String] -> [Tag String]

strip [] = []

If we hit a pre tag, ignore its attributes (the underscore) and continue on recursively:

strip (TagOpen "pre" _ : rest) = TagOpen "pre" [] : strip rest

Drop the code tags (both opening and closing) entirely:

strip (TagOpen  "code" _ : rest) = strip rest
strip (TagClose "code"   : rest) = strip rest

If we hit a span, followed by some text, and a closing span, then keep the text tag and continue:

strip (TagOpen "span" _ : TagText t : TagClose "span" : rest)
  = TagText t : strip rest

Don’t change other tags:

strip (t:ts) = t : strip ts

Parsing input from stdin is straightforward. We use optEscape and optRawTag to avoid mangling other html in the input.

main :: IO ()
main = do
    s <- getContents
    let tags = parseTags s
        ropts = renderOptions{optEscape = id, optRawTag = const True}
    putStrLn $ renderTagsOptions ropts $ strip tags

Example output:

$ runhaskell Detag.hs 
<pre class="sourceCode haskell"><code class="sourceCode haskell"><span style="">&gt;</span> <span style="color: green;">{-# LANGUAGE RankNTypes          #-}</span>
<pre>> {-# LANGUAGE RankNTypes          #-}