Knight on a chess board

Here is a well-known interview/code golf question: a knight is placed on a chess board. The knight chooses from its 8 possible moves uniformly at random. When it steps off the board it doesn’t move anymore. What is the probability that the knight is still on the board after \( n \) steps?

We could calculate this directly but it’s more interesting to frame it as a Markov chain.

Calculation using the transition matrix

Model the chess board as the tuples \( \{ (r, c) \mid 0 \leq i, j \leq 7 \} \).

Here are the valid moves and a helper function to check if a move
\( (r,c) \rightarrow (u,v) \) is valid and if a cell is on the usual \( 8 \times 8 \) chessboard:

moves = [(-2, 1), (-1, 2), (1, 2), (2, 1), 
         (2,-1), (1,-2), (-1,-2), (-2,-1)]

def is_move(r, c, u, v):
    for m in moves:
        if (u, v) == (r + m[0], c + m[1]):
            return True
    return False

def on_board(x):
    return 0 <= x[0] < 8 and 0 <= x[1] < 8

The valid states are all the on-board positions plus the immediate off-board positions:

states = [(r, c) for r in range(-2, 8+2) for c in range(-2, 8+2)]

Now we can set up the transition matrix.

def make_matrix(states):
    """
    Create the transition matrix for a knight on a chess board
    with all moves chosen uniformly at random. When the knight
    moves off-board, no more moves are made.
    """

    # Handy mapping from (row, col) -> index into 'states'
    to_idx = dict([(s, i) for (i, s) in enumerate(states)])

    P = np.array([[0.0 for _ in range(len(states))] for _ in range(len(states))], dtype='float64')
    assert P.shape == (len(states), len(states))

    for (i, (r, c)) in enumerate(states):
        for (j, (u, v)) in enumerate(states):
            # On board, equal probability to each destination, even if goes off board.
            if on_board((r, c)):
                if is_move(r, c, u, v):
                    P[i][j] = 1.0/len(moves)

            # Off board, no more moves.
            else:
                if (r, c) == (u, v): # terminal state
                    P[i][j] = 1.0
                else:
                    P[i][j] = 0.0

    return to_idx, P

We can visualise the transition graph using graphviz (full code here):

with_corners

Oops! The corners aren’t connected to anything so we have 5 communicating classes (the 4 corners plus the rest). We never reach these nodes from any of the starting positions so we can get rid of them:

corners = [(-2,9), (9,9), (-2,-2), (9,-2)]
states = [(r, c) for r in range(-2, 8+2) for c in range(-2, 8+2) if (r,c) not in corners]

Here’s the new transition graph:

no_corners

Intuitively, the knights problem is symmetric, and this graph is symmetric, so it’s likely that we’ve set things up correctly.

Let \( X_0 \), \( X_1 \), \( \ldots \), \( X_n \) be the positions of the knight. Then then probability of the knight moving from state \( i \) to \( j \) in \( n \) steps is

\[
P(X_n = j \mid X_0 = i) = (P^n)_{i,j}
\]

So the probability of being on the board after \( n \) steps, starting from \(i\), will be

\[
\sum_{k \in \mathcal{B}} (P^n)_{i,k}
\]

where \( \mathcal{B} \) is the set of on-board states. This is easy to calculate using Numpy:

start = (3, 3)
n     = 5

idx = to_idx[start]
Pn = matrix_power(P, n)
pr = sum([Pn[idx][r] for (r, s) in enumerate(states) if on_board(s)])

For this case we get probability \( 0.35565185546875 \).

Here are a few more calculations:

start: (0, 0)	n: 0	Pr(on board): 1.0
start: (3, 3)	n: 1	Pr(on board): 1.0
start: (0, 0)	n: 1	Pr(on board): 0.25
start: (3, 3)	n: 4	Pr(on board): 0.48291015625
start: (3, 3)	n: 5	Pr(on board): 0.35565185546875
start: (3, 3)	n: 100	Pr(on board): 5.730392258771815e-13

It’s always good to do a quick Monte Carlo simulation to sanity check our results:

def do_n_steps(start, n):
    current = start

    for _ in range(n):
        move = random.choice(moves)
        new = (current[0] + move[0], current[1] + move[1])
        if not on_board(new): break
        current = new

    return on_board(new)

N_sims = 10000000
n = 5

nr_on_board = 0

for _ in range(N_sims):
    if do_n_steps((3,3), n): nr_on_board += 1

print('pr on board from (3,3) after 5 steps:', nr_on_board/N_sims)

The estimate is fairly close to the value we got from taking power of the transition matrix:

pr on board from (3,3) after 5 steps: 0.3554605

Absorbing states

An absorbing state of a Markov chain is a state that, once entered, cannot be left. In our problem the absorbing states are precisely the off-board states.

A natural question is: given a starting location, how many steps (on average) will it take the knight to step off the board?

With a bit of matrix algebra we can get this from the transition matrix \( \boldsymbol{P} \). Partition \( \boldsymbol{P} \) by the state type: let \( \boldsymbol{Q} \) be the transitions of transient states (here, these are the on-board states to other on-board states); let \( \boldsymbol{R} \) be transitions from transient states to absorbing states (on-board to off-board); and let \( \boldsymbol{I} \) be the identity matrix (transitions of the absorbing states). Then \( \boldsymbol{P} \) can be written in block-matrix form:

\[
\boldsymbol{P}=
\left(
\begin{array}{c|c}
\boldsymbol{Q} & \boldsymbol{R} \\
\hline
\boldsymbol{0} & \boldsymbol{I}
\end{array}
\right)
\]

We can calculate powers of \( \boldsymbol{P} \):

\[
\boldsymbol{P}^2=
\left(
\begin{array}{c|c}
\boldsymbol{Q} & \boldsymbol{R} \\
\hline
\boldsymbol{0} & \boldsymbol{I}
\end{array}
\right)
\left(
\begin{array}{c|c}
\boldsymbol{Q} & \boldsymbol{R} \\
\hline
\boldsymbol{0} & \boldsymbol{I}
\end{array}
\right)
=
\left(
\begin{array}{c|c}
\boldsymbol{Q}^2 & (\boldsymbol{I} + \boldsymbol{Q})\boldsymbol{R} \\
\hline
\boldsymbol{0} & \boldsymbol{I}
\end{array}
\right)
\]

\[
\boldsymbol{P}^3=
\left(
\begin{array}{c|c}
\boldsymbol{Q}^3 & (\boldsymbol{I} + \boldsymbol{Q} + \boldsymbol{Q}^2)\boldsymbol{R} \\
\hline
\boldsymbol{0} & \boldsymbol{I}
\end{array}
\right)
\]

In general:

\[
\boldsymbol{P}^n=
\left(
\begin{array}{c|c}
\boldsymbol{Q}^n & (\boldsymbol{I} + \boldsymbol{Q} + \cdots + \boldsymbol{Q}^{n-1})\boldsymbol{R} \\
\hline
\boldsymbol{0} & \boldsymbol{I}
\end{array}
\right)
\]

We want to calculate \( \lim_{n \rightarrow \infty} \boldsymbol{P}^n \) since this will tell us the long-term probability of moving from one state to another. In particular, the top-right block will tell us the long-term probability of moving from a transient state to an absorbing state.

Here is a handy result from matrix algebra:

Lemma. Let \( \boldsymbol{A} \) be a square matrix with the property that \( \boldsymbol{A}^n \rightarrow \mathbf{0} \) as \( n \rightarrow \infty \). Then
\[
\sum_{n=0}^\infty = (\boldsymbol{I} – \boldsymbol{A})^{-1}.
\]

Applying this to the block form gives:

\[
\begin{align*}
\lim_{n \rightarrow \infty} \boldsymbol{P}^n
&=
\left(
\begin{array}{c|c}
\boldsymbol{Q}^n & (\boldsymbol{I} + \boldsymbol{Q} + \cdots + \boldsymbol{Q}^{n-1})\boldsymbol{R} \\
\hline
\boldsymbol{0} & \boldsymbol{I}
\end{array}
\right) \\
&= \left(
\begin{array}{c|c}
\lim_{n \rightarrow \infty} \boldsymbol{Q}^n & \lim_{n \rightarrow \infty} (\boldsymbol{I} + \boldsymbol{Q} + \cdots + \boldsymbol{Q}^{n-1})\boldsymbol{R} \\
\hline
\boldsymbol{0} & \boldsymbol{I}
\end{array}
\right) \\
&= \left(
\begin{array}{c|c}
\mathbf{0} & (\boldsymbol{I} – \boldsymbol{Q})^{-1}\boldsymbol{R} \\
\hline
\boldsymbol{0} & \boldsymbol{I}
\end{array}
\right)
\end{align*}
\]

where \( \lim_{n \rightarrow \infty} \boldsymbol{Q}^n = 0\) since all of the entries in \( \boldsymbol{Q} \) are transient.

The top-right corner also contains the fundamental matrix as defined in the following theorem:

Theorem Consider an absorbing Markov chain with \( t \) transient states. Let \( \boldsymbol{F} \) be a \( t \times t \) matrix indexed by the transient states, where \( \boldsymbol{F}_{i,j} \) is the expected number of visits to \( j \) given that the chain starts in \( i \). Then

\[
\boldsymbol{F} = (\boldsymbol{I} – \boldsymbol{Q})^{-1}.
\]

Taking the row sums of \( \boldsymbol{F} \) gives the expected number of steps \( a_i \) starting from state \( i \) until absorption (i.e. we count the number of visits to each transient state before eventual absorption):

\[
a_i = \sum_{k} \boldsymbol{F}_{i,k}
\]

Back in our Python code, we can rearrange the states vector so that the transition matrix is appropriately partitioned. Taking the \( \boldsymbol{Q} \) matrix is very quick using Numpy’s slicing notation:

states = [s for s in states if on_board(s)]
       + [s for s in states if not on_board(s)]
(to_idx, P) = make_matrix(states)

# k states
k = len(states)

# t transient states
t = len([s for s in states if on_board(s)])

Q = P[:t, :t]
assert Q.shape == (t, t)
assert Q.shape == (64, 64)

F = linalg.inv(np.eye(*Q.shape) - Q)

# example calculation for a_(3,3):

state = (3, 3)
print(F[to_idx[state], :].sum())

Again, compare to a Monte Carlo simulation to verify that the numbers are correct:

start: (0, 0)	Avg nr steps to absorb (MC): 1.9527606
start: (0, 0)	Avg nr steps (F matrix):     1.9525249995183136

start: (3, 3)	Avg nr steps to absorb (MC): 5.4187947
start: (3, 3)	Avg nr steps (F matrix):     5.417750460813215

So, on average, if we start in the corner \( (0,0) \) we will step off the board after \( 1.95 \) steps; if we start in the centre at \( (3,3) \) we will step off the board after \( 5.41 \) steps.

Further reading

The theoretical parts of this blog post follow the presentation in chapter 3 of Introduction to Stochastic Processes with R (Dobrow).

Source code: github.com/carlohamalainen/playground/tree/master/python/knight-random-walk.

Using Hypothesis to solve the farmer-chicken-fox puzzle

The farmer-chicken-fox puzzle goes something like this: a farmer is at a shop, having bought a chicken, fox, and a bag of corn. The farmer would like to get to her house on the other side of the river using a small boat. For some reason she can take at most one item at a time. If the chicken is left alone with the corn, it will eat the corn. If the fox is left alone with the chicken, it will eat the chicken. How does the farmer get across the river without losing any item?

A straightforward way to solve this is to use a breadth-first search, enumerating all valid moves from a set of states. The initial state would be

shop: {farmer, chicken, fox, corn}
house: {}

One valid first move is to go to the other side with the chicken, giving the new state:

shop: {fox, corn}
house: {farmer, chicken}

Another way is to use the Hypothesis library to simulate a finite state machine, and assert that no sequence of rules leads to the state where everyone is on the house side of the river. This is an adaptation of the Die Hard water jugs problem.

We model the state using two sets of strings:

class FarmerChickenFox(RuleBasedStateMachine):

    def __init__(self):
        self.shop  = set(['farmer', 'chicken', 'fox', 'corn'])
        self.house = set([])
        super().__init__()

Here’s the rule to take the chicken. We work out which side the farmer is in, and then move the farmer and the chicken to the other side. If this transition results in an invalid state (e.g. the fox left alone with the chicken) we undo the state-change.

    @rule()
    def take_chicken(self):
        self.save_state()

        if 'farmer' in self.shop and 'chicken' in self.shop:
            self.shop.remove('farmer')
            self.shop.remove('chicken')
            self.house.add('farmer')
            self.house.add('chicken')
            if not self.state_ok(): self.undo_state()
        elif 'farmer' in self.house and 'chicken' in self.house:
            self.house.remove('farmer')
            self.house.remove('chicken')
            self.shop.add('farmer')
            self.shop.add('chicken')
            if not self.state_ok(): self.undo_state()

We have two invariants to ensure that the state of the system is consistent:

    @invariant()
    def fox_not_with_chicken(self):
        return not self.is_fox_alone_with_chicken()

    @invariant()
    def chicken_not_with_corn(self):
        return not self.is_chicken_alone_with_corn()

The trick is to create an invariant to check if everyone is on the house side:

    @invariant()
    def not_solved(self):
        note("::: shop: {s}, house: {h}".format(s=self.shop, h=self.house))
        assert len(self.house) != 4

Running using pytest finds the solution, including the steps used:

$ pytest farmerchickenfox.py

(snipped)

_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = FarmerChickenFox({})

    @invariant()
    def not_solved(self):
        note("::: shop: {s}, house: {h}".format(s=self.shop, h=self.house))
>       assert len(self.house) != 4
E       AssertionError: assert 4 != 4
E        +  where 4 = len({'chicken', 'corn', 'farmer', 'fox'})
E        +    where {'chicken', 'corn', 'farmer', 'fox'} = FarmerChickenFox({}).house

farmerchickenfox.py:117: AssertionError
-------------------------------------------------------------------------------------------- Hypothesis --------------------------------------------------------------------------------------------
Falsifying example: run_state_machine(factory=FarmerChickenFox, data=data(...))
state = FarmerChickenFox()
::: shop: {'fox', 'chicken', 'farmer', 'corn'}, house: set()
state.take_chicken()
::: shop: {'fox', 'corn'}, house: {'chicken', 'farmer'}
state.go_alone()
::: shop: {'fox', 'farmer', 'corn'}, house: {'chicken'}
state.take_corn()
::: shop: {'fox'}, house: {'corn', 'chicken', 'farmer'}
state.take_chicken()
::: shop: {'fox', 'chicken', 'farmer'}, house: {'corn'}
state.take_fox()
::: shop: {'chicken'}, house: {'corn', 'fox', 'farmer'}
state.go_alone()
::: shop: {'chicken', 'farmer'}, house: {'corn', 'fox'}
state.take_chicken()
::: shop: set(), house: {'corn', 'fox', 'farmer', 'chicken'}
state.teardown()

You can reproduce this example by temporarily adding @reproduce_failure('4.1.0', b'AAEBAQABAgEBAQMBAAEB') as a decorator on your test case
===================================================================================== 1 failed in 0.46 seconds =====================================================================================

Full source: farmerchickenfox.py.

An expression evaluator in CSV

Many business problems boil down to reading data from somewhere, transforming it, and writing it somewhere else.

We could implement the transformations in code, but non-technical users might like to see and edit the rules without having to deploy a new build.

Here’s an example rule:

  1. Read the value at http://example.com/foo/bar/x, refer to it as val.
  2. Return 10*(1.0/val).

Or, a more complicated rule:

  1. Read the value at http://example.com/foo/bar/x, refer to it as val_x.
  2. Read the value at http://example.com/foo/bar/y, refer to it as val_y.
  3. Return the average of 1.0/val_x and 1.0/val_y.

The question is how to flatten this out into a format suitable for a CSV file, to allow users to easily update the rules using Excel.

One approach would be to implement an expression DSL, but this gets a bit painful when the input space is cells in a spreadsheet or CSV file. There are also questions about encoding the order of evaluation.

Reverse Polish notation is a simple way to encode arbitrary mathematical formulas in a flat sequence. Here’s how to write the first example:

CONSTANT 1
GET_DATA /foo/bar/x
DIV
CONSTANT 10
MUL

This sequence of operations will do the following:

  1. Push 1 onto the stack.
  2. Get data from the URL, push onto the stack.
  3. Perform the divide operation (1.0 divided by the value we got from the URL), and push that onto the stack.
  4. Push 10 onto the stack.
  5. Multiply the result from step 3 by 10.

We might use the following format in a CSV file. The ORDER column ensures the correct sequencing of operations.

IDOUTPUT_IDORDEROPCONSTDATA_SOURCE
KEY0001RULE0010CONSTANT1 
KEY0002RULE0011GET_DATA /foo/bar/x
KEY0003RULE0012DIV  
KEY0004RULE0013CONSTANT10 
KEY0005RULE0014MUL  

And here is how to encode the second example:

IDOUTPUT_IDORDEROPCONSTDATA_SOURCE
KEY0200RULE0020CONSTANT1 
KEY0201RULE0021GET_DATA /foo/bar/x
KEY0202RULE0022DIV  
KEY0203RULE0023CONSTANT1 
KEY0204RULE0024GET_DATA /foo/bar/y
KEY0205RULE0025DIV  
KEY0206RULE0026PLUS  
KEY0207RULE0027CONSTANT2 
KEY0208RULE0028DIV  

Evaluating a sequence of operations is straightforward. Start with an empty stack (in Python, this is just a normal list). If the next operation is CONSTANT or GET_DATA, push the value onto the stack. Otherwise, an operation like PLUS will need two operands, so pop two things off the stack and then do that actual operation. As a bonus, we can render a normal mathematical expression as we go: instead of putting a floating point number onto the stack, put a string onto the stack.

Here is the entire evaluator:

def eval_rule(rule):
    s = []

    expr = []

    for (_, x) in rule.sort_values('ORDER').iterrows():
        op = x['OP']

        if op == 'GET_DATA':
            s.append(x['GET_DATA'])
            expr.append('(GET_DATA: ' + str(x['DATA_SOURCE']) + ')')

        elif op == 'CONSTANT':
            s.append(x['CONST'])
            expr.append(str(x['CONST']))

        elif op == 'MUL':
            b = s.pop()
            a = s.pop()
            s.append(a*b)

            b2 = expr.pop()
            a2 = expr.pop()
            expr.append('(' + a2 + '*' + b2 + ')')

        elif op == 'PLUS':
            b = s.pop()
            a = s.pop()
            s.append(a+b)

            b2 = expr.pop()
            a2 = expr.pop()
            expr.append('(' + a2 + '+' + b2 + ')')

        elif op == 'MINUS':
            b = s.pop()
            a = s.pop()
            s.append(a-b)

            b2 = expr.pop()
            a2 = expr.pop()
            expr.append('(' + a2 + '-' + b2 + ')')

        elif op == 'DIV':
            denominator = s.pop()
            numerator   = s.pop()
            s.append(numerator/denominator)

            denominator2 = expr.pop()
            numerator2   = expr.pop()
            expr.append('(' + numerator2 + '/' + denominator2 + ')')
        else:
            raise ValueError('Unknown operator: ' + op)

    if len(s) != 1:
        raise ValueError('Expected one item on the evaluation stack, but found: ' + str(s))

    if len(expr) != 1:
        raise ValueError('Expected one item on the expression stack, but found: ' + str(expr))

    return s[0], expr[0]

Just for fun, we will also evaluate the example from the Wikipedia page on Reverse Polish notation:

IDOUTPUT_IDORDEROPCONSTDATA_SOURCE
KEY0101RULE0030CONSTANT15 
KEY0102RULE0031CONSTANT7 
KEY0103RULE0032CONSTANT1 
KEY0104RULE0033CONSTANT1 
KEY0105RULE0034PLUS  
KEY0106RULE0035MINUS  
KEY0107RULE0036DIV  
KEY0108RULE0037CONSTANT3 
KEY0109RULE0038MUL  
KEY0110RULE0039CONSTANT2 
KEY0111RULE00310CONSTANT1 
KEY0112RULE00311CONSTANT1 
KEY0113RULE00312PLUS  
KEY0114RULE00313PLUS  
KEY0153RULE00314MINUS  

Here’s the output:

$ python3 rpn.py
RULE001
((1.0/(GET_DATA: /foo/bar/x))*10.0)
0.23809523809523808

RULE002
(((1.0/(GET_DATA: /foo/bar/x))+(1.0/(GET_DATA: /foo/bar/y)))/2.0)
0.02857142857142857

RULE003
(((15.0/(7.0-(1.0+1.0)))*3.0)-(2.0+(1.0+1.0)))
5.0

 

Reverse Polish Notation gives us a compact way to represent a sequence of operators with no ambiguity about the order of evaluation.

 

 

Optimising treehash calculations

I’ve been using my Glacier Push utility for a while now and it has been working well. But recently I noticed that on really large files the initial memory usage spiked, and then fell off. I suspected that the treehash calculation (implemented in Haskell) was not efficient so I re-implemented it in plain C.

The first thing was to write a utility to calculate the SHA256 of a buffer. Fortunately the OpenSSL docs have lots of examples to work from. Ignoring error handling, this is fairly straightforward:

void sha256(unsigned char *buffer,
            unsigned int buffer_size,
            unsigned char *output)
{
    EVP_MD_CTX *mdctx;
    const EVP_MD *md;
    unsigned int md_len;

    md = EVP_get_digestbyname("SHA256");
    assert(md != NULL);

    mdctx = EVP_MD_CTX_new();
    EVP_DigestInit_ex(mdctx, md, NULL);
    EVP_DigestUpdate(mdctx, buffer, buffer_size);
    EVP_DigestFinal_ex(mdctx, output, &md_len);
    EVP_MD_CTX_free(mdctx);

    assert(md_len == DIGEST_SIZE);
}

The next thing is the tree hash algorithm. This is basically a Merkle tree on SHA256.

In short, we divide the file into 1Mb blocks and hash each of them. Then we hash pairs, pairs of pairs, and so on, until we have one block left. For example a 4Mb file would take two steps of hashing:

img_3012-1

We don’t need a full tree ADT because we have an upper bound on the number of blocks (the file on disk is fixed) so we can make do with a buffer for the hashes and a temp buffer for writing intermediate results. The algorithm boils down to this loop (full source is here):

while (nr_blocks_left > 1) {
    int pairs_to_process = nr_blocks_left/2;

    // sha256 the pairs
    for(unsigned int k = 0; k < pairs_to_process; k++)
        sha256(&digests[2*k*DIGEST_SIZE],
               2*DIGEST_SIZE,
               &next_tmp[k*DIGEST_SIZE]);

    // Copy back into digests.
    for(int i = 0; i < pairs_to_process*DIGEST_SIZE; i++)
        digests[i] = next_tmp[i];

    // If there is a block left over, copy it.
    if (nr_blocks_left % 2 == 1) {
        for(int i = 0; i < DIGEST_SIZE; i++)
            digests[pairs_to_process*DIGEST_SIZE + i]
                = digests[(nr_blocks_left-1)*DIGEST_SIZE + i];
    }

    if (nr_blocks_left % 2 == 0)
        nr_blocks_left = pairs_to_process;
    else
        nr_blocks_left = pairs_to_process + 1;
}

Our C interface to treehash has this declaration:

char * treehash(char *fname, 
                unsigned long long start, 
                unsigned long long end);

To call this from Haskell we do some FFI:

{-# LANGUAGE ForeignFunctionInterface #-}

module TreehashFFI (treehash_FFI, treehash_FFI') where

import System.Posix.Files

import Foreign
import Foreign.C.Types
import Foreign.C.String

foreign import ccall "treehash.h treehash"
  c_treehash :: CString -> CULong -> CULong -> IO CString

treehash_FFI :: String -> Int64 -> Int64 -> IO String
treehash_FFI filename start end = withCString filename $ \c_filename -> do
  ptrHash <- c_treehash c_filename (fromIntegral start) (fromIntegral end)
  h <- peekCString ptrHash
  free ptrHash
  return h

And we can write a nice wrapper in Haskell by providing the file size:


treehash_FFI' :: String -> IO String
treehash_FFI' fp = do
  stat <- getFileStatus fp
  let lastByte = fromIntegral $ toInteger (fileSize stat) - 1
  treehash_FFI fp 0 lastByte

To monitor the memory performance I logged the output of ps and plotted the results using gnuplot (scripts borrowed from Bruno Girin).

Memory usage (as percentage of 16Gb total memory) when pushing a 7.5Gb file to Glacier:

original_vs_C

The C version uses so little memory that it barely shows up on the plot. Both implementations use about the same amount of memory after the treehash calculation.

Unbounded memory use with SciPy’s hypergeom

Edit 2019-01-12: fixed here: numpy/pull/11977

Nadiah ran into an apparent memory leak in SciPy’s hypergeom distribution.

Here is a minimal example. Running this code results in unbounded memory use, looking like a memory leak:

from scipy.stats import hypergeom

while True:
    x = hypergeom(100, 30, 40).cdf(3)

It turns out that this isn’t really a memory leak but rather a problem with NumPy’s vectorize method which creates a circular reference in some situations. Here’s the GitHub issue that I opened: numpy/issues/11867.

In the mean time, a workaround is to manually delete the _ufunc attribute after using cdf:

from scipy.stats import hypergeom

while True:
    h = hypergeom(100, 30, 40)
    x = h.cdf(3)
    del h.dist._cdfvec._ufunc

Alternatively, avoid the frozen distribution and call cdf directly:

from scipy.stats import hypergeom

while True:
    x = hypergeom.cdf(3, 100, 30, 40)

It’s worth mentioning that memory_profiler is a great tool for finding memory leaks:

from scipy.stats import hypergeom
from memory_profiler import profile

@profile
def main1():
    for _ in range(1000):
        x = hypergeom(100, 30, 40).cdf(3)

main1()

Output:

$ python3 geomprofile.py 
Filename: geomprofile.py

Line #    Mem usage    Increment   Line Contents
================================================
     4     69.2 MiB     69.2 MiB   @profile
     5                             def main1():
     6     79.2 MiB      0.0 MiB       for _ in range(1000):
     7     79.2 MiB     10.0 MiB           x = hypergeom(100, 30, 40).cdf(3)

We see that the hypergeom line contributed to an increase in memory use of 10Mb.

Drilling down into NumPy’s vectorize took a bit of manual debugging; I didn’t have as much luck with memory_profiler there.

In a production situation one might not have the luxury of finding the real cause of the memory leak immediately. In that case it might be enough to wrap the offending code in a call to multiprocessing so that the leaked memory is reclaimed frequently. A lightweight option is to use processify. See Liau Yung Siang’s blog post for more details.

readFile :: String -> IO [Char]

Using a modern GHC compiler, how much memory would this program use?

x &lt;- readFile "foo"
x `deepseq` print ()

Linear in the size of foo? Or something else?

Turns out, for the default readFile from the Prelude, the answer is about 40 times the size of the input file.

The default Haskell strings take 5 words per character, so on a 64bit machine this is 40 5*8 = 40 bytes per character. The list of characters is stored as a linked list, roughly like this diagram (taken from Johan Tibbel’s ZuriHac 2015 talk, slides are here):

We can check the actual memory usage of a Haskell program (compiled with GHC) by using the RTS options:

ghc readfile.hs -Wall -O2 -rtsopts

./readfile +RTS -toutput &lt;other options&gt;

See this repository for some scripts to benchmark a few variants of readFile:

  • Prelude
  • Data.ByteString
  • Data.ByteString.Char8
  • Data.ByteString.Lazy
  • Data.ByteString.Lazy.Char8

Basically, anything’s better than the default readFile:

readfile_memory_usage

QuickCheck’s CoArbitrary – generate random functions!

Property based testing is a great way to improve code quality since it captures logical properties of your system. Instead of writing test cases by hand, you capture logical relationships and then let the test framework generate hundreds or thousands of examples for you.

For a general introduction to property based testing (language-independent), try this YOW! Night talk Property Based Testing Finding More Bugs with Less Effort by Charles O’Farrell.

QuickCheck provides a typeclass CoArbitrary for generating random functions. In this blog post we derive CoArbitrary in a standalone setting. For the real definition, see Test.QuickCheck. Source code for this post is here.

Some imports:


> {-# LANGUAGE InstanceSigs #-}
> {-# LANGUAGE RankNTypes   #-}
> 
> module Sample where
> 
> import Control.Applicative
> import Control.Monad
> import System.Random

First, a generator of type Gen a lets us use a random number generator to get a value of type a:


> newtype Gen a = Gen
>     { unGen :: StdGen -> a }

(The real Gen in QuickCheck is parameterised over the generator, but we don’t need that here. Also, the real Gen includes a size, which you can control using resize or scale).

The Arbitrary typeclass:


> class Arbitrary a where
>     arbitrary :: Gen a

Since False < True (there is an instance of Ord for Bool) we can use randomR to define arbitrary for Bool:


> instance Arbitrary Bool where
>     arbitrary = Gen $ \stdgen ->
>       let (r, _) = randomR (False, True) stdgen in r

Here’s a first attempt at implementing Gen (Bool -> Bool), a generator for boolean functions:


> genBoolFn0 :: Gen (Bool -> Bool)
> genBoolFn0 = Gen $ \stdgen ->
>   \a -> let (r, _) = randomR (False, True) stdgen in r

The type is right but it’s going to generate pretty boring functions since it doesn’t even use the a:


> runBoolFnGen :: Gen (Bool -> Bool) -> IO ()
> runBoolFnGen g = do
>   fns  <- samples g
> 
>   forM_ fns  $ \f -> do
>     putStrLn $ "True  => " ++ show (f True)
>     putStrLn $ "False => " ++ show (f False)
>     putStrLn ""

The functions are either const True or const False. Not useful.


ghci> runBoolFnGen genBoolFn0 
True  => True
False => True

True  => False
False => False

True  => False
False => False

True  => True
False => True

True  => False
False => False

True  => True
False => True

True  => False
False => False

True  => False
False => False

True  => False
False => False

We need to split on the a somehow:


> genBoolFn1 :: Gen (Bool -> Bool)
> genBoolFn1 = Gen $ \stdgen -> \a -> case a of
>   True  -> let (r, _) = randomR (False, True) stdgen in r
>   False -> let (r, _) = randomR (False, True) stdgen in r

This isn’t any better. The other thing we can change is the generator. Fortunately, StdGen is an instance of RandomGen, so we have the split function:


> -- The split operation allows one to obtain two
> -- distinct random number generators. This is
> -- very useful in functional programs (for example,
> -- when passing a random number generator down to
> -- recursive calls), but very little work has been done
> -- on statistically robust implementations of split
> -- ([System.Random, System.Random] are the only
> -- examples we know of).
> 
> split :: StdGen -> (StdGen, StdGen)
> split = ...

Taking advantage of laziness, we can use split to write a pure function that gives us an infinite sequence of statistically distinct generators:


> splitGenerator :: RandomGen a => a -> [a]
> splitGenerator r = r0 : splitGenerator r1
>   where
>     (r0, r1) = split r

This is exactly what we need to permute the generator in genBoolFn1. Let’s map True to the generator at index 0 and False to to the generator at index 1:


> genBoolFn2 :: Gen (Bool -> Bool)
> genBoolFn2 = Gen $ \stdgen -> \a -> case a of
>   True  -> let (r, _) = randomR
>                           (False, True)
>                           (splitGenerator stdgen !! 0)
>             in r
>   False -> let (r, _) = randomR
>                           (False, True)
>                           (splitGenerator stdgen !! 1)
>             in r

Now the random functions look more random:


ghci> runBoolFnGen genBoolFn2
True  => False
False => True

True  => True
False => False

True  => False
False => True

True  => True
False => False

True  => True
False => True

True  => True
False => False

True  => True
False => True

True  => False
False => False

True  => False
False => True

So, what about random integer functions Int -> Int? We need to map any integer to one of the split generators, in other words we need a mapping $\mathbb{N} \rightarrow \mathbb Z_{\ge 0}$. Send the non-negative integers to the non-negative even integers, and the negative integers to the positive odd integers:

\[
n \rightarrow
\begin{cases}
2n & \mbox{if } n \geq 0 \\
2(-n) + 1 & \mbox{if } n < 0 \end{cases} \]

In Haskell this looks like:


> \n -> variant $ if n >= 0 then 2*n else 2*(-n) + 1

where


> variant :: Int -> Gen b -> Gen b
> variant v (Gen g) = Gen $ \r -> g $ splitGenerator r !! (v+1)

Capture this with a typeclass:


> class CoArbitrary a where
>     coarbitrary :: a -> Gen b -> Gen b
> 
> instance CoArbitrary Bool where
>   coarbitrary False = variant 0
>   coarbitrary True  = variant 1
> 
> instance CoArbitrary Int where
>     coarbitrary n = variant $ if n >= 0 then 2*n else 2*(-n) + 1

With some experimentation we can extend the CoArbitrary definitions to lists and tuples:


> instance CoArbitrary a => CoArbitrary [a] where
>   coarbitrary []     = variant 0
>   coarbitrary (x:xs) = variant 1 . coarbitrary (x, xs)
> 
> instance (CoArbitrary a, CoArbitrary b) => CoArbitrary (a, b) where
>   coarbitrary (x, y) = coarbitrary x . coarbitrary y

In general, if we have CoArbitrary a and Arbitrary b then we can derive Arbitrary (a -> b):


> instance (CoArbitrary a, Arbitrary b) => Arbitrary (a -> b) where
>     arbitrary = promote (\a -> coarbitrary a arbitrary)
> 
> promote :: (a -> Gen b) -> (Gen (a->b))
> promote f = Gen $ \r ->
>   \a -> let Gen h = f a
>          in h r

Here are more examples of random functions:


> runGenFn :: (Arbitrary a, Arbitrary b, Show a, Show b)
>          => Gen (a -> b)
>          -> [a]
>          -> IO ()
> runGenFn g as = do
>   fns  <- samples g
> 
>   forM_ fns  $ \f -> do
>     forM_ as $ \a -> putStrLn $ show a ++ " => "
>                                        ++ show (f a)
>     putStrLn ""

Running in ghci:


ghci> runGenFn (arbitrary :: Gen (Int -> Int)) [0, 1, 2]
0 => 198
1 => 940
2 => -200

0 => 734
1 => -627
2 => 6

0 => 965
1 => 581
2 => -585

0 => -306
1 => -918
2 => 678

0 => -929
1 => 336
2 => -696

0 => -66
1 => 123
2 => 875

0 => -234
1 => -673
2 => 216

0 => 355
1 => 56
2 => -615

0 => 278
1 => 116
2 => 967

ghci> runGenFn (arbitrary :: Gen (Int -> Bool)) [0, 1, 2]
0 => False
1 => True
2 => False

0 => True
1 => False
2 => True

0 => True
1 => False
2 => False

0 => True
1 => False
2 => False

0 => True
1 => True
2 => True

0 => True
1 => True
2 => False

0 => False
1 => True
2 => False

0 => True
1 => True
2 => False

0 => True
1 => True
2 => True

ghci> runGenFn (arbitrary :: Gen ([Int] -> [Int])) [[42], [0, 1, 2]]
[42] => [-93,-540,-715,-557,-249]
[0,1,2] => [433,97,665,554,-690,635]

[42] => [-785,174,-676,-662,-549]
[0,1,2] => [-735,-328,226,-524,423]

[42] => [157,976,-774]
[0,1,2] => [-197,608,-520,-37,-689]

[42] => [-684]
[0,1,2] => [902,-138,-33,689,-774,-713,474,-638]

[42] => [-782,540,649,320,-326,92,896,-76]
[0,1,2] => [156]

[42] => [524,137]
[0,1,2] => [642,-763,771,-400,825,71,895,586,252,37]

[42] => [641,-304,-890,-375,449,-608,662,546,-740,-406]
[0,1,2] => [-632,-685,-232,202,-994,666,-121]

[42] => [200,599,-844]
[0,1,2] => [-554,149,370,547,-755,-706,131]

[42] => [-898,645,-472]
[0,1,2] => [-77]

Appendix

To get the code above to work we need instances for Monad, Functor, and Applicative. These are all lifted from Test.QuickCheck.


> instance Monad Gen where
>     return :: a -> Gen a
>     return a = Gen $ \_ -> a
> 
>     (>>=) :: Gen a -> (a -> Gen b) -> Gen b
> 
>     (Gen g) >>= f = Gen $ \r ->
>        let (r1, r2) = split r
>            Gen k = f $ g r1
>         in k r2
> 
> instance Functor Gen where
>     fmap f (Gen h) = Gen $ \r -> f (h r)
> 
> instance Applicative Gen where
>     pure x = Gen $ \_ -> x
> 
>     f <*> x = do
>         f' <- f
>         x' <- x
>         return $ f' x'
> 
> generate :: Gen a -> IO a
> generate (Gen g) = do
>     stdgen <- getStdGen
>     return $ g stdgen

Use sequence to get a few samples. Note that this relies on the Applicative instance’s definition of <*> to get a new standard number generator each time it is used, which in turn uses the Monad instance’s definition which uses split.


> samples :: Gen a -> IO [a]
> samples g = generate $ sequence [g, g, g, g, g, g, g, g, g]

> instance Arbitrary Int where
>   arbitrary = Gen $ \stdgen ->
>     let (r, _) = randomR (-1000, 1000) stdgen in r
> 
> choose :: Random a => (a,a) -> Gen a
> choose range = Gen $ \stdgen ->
>   let (r, _) = randomR range stdgen in r
> 
> vectorOf :: Int -> Gen a -> Gen [a]
> vectorOf = replicateM
> 
> listOf :: Gen a -> Gen [a]
> listOf g = do
>   k <- choose (1, 10)
>   vectorOf k g
> 
> instance Arbitrary a => Arbitrary [a] where
>   arbitrary = listOf arbitrary

Pricing an option on a cumulative sum.

Consider an option with payout on an underlying which is a cumulative sum. For example, a construction company might want to mitigate the risk of rainfall causing delays in a project. The construction company could buy a call option on the cumulative sum of rainfall not exceeding 180mm on any 3-day period. Further, we restrict this so that one day can only be used in one payoff. If the cumulative rainfall $r$ a 3-day period exceeded 180mm, the entity selling the call option would have the obligation to pay the construction company based on the difference $r – 180$.

As the entity selling one of these options, we need to work out the expected loss and worst possible scenario for payouts.

Let $r_0, r_1, \ldots$ be the sequence of daily rainfall. Form the 3-sums:

\[
s_0 = r_0+r_1+r_2, s_1 = r_1+r_2+r_3, s_2 = r_2+r_3+r_4, \ldots
\]

Let $C > 0$ be the cutoff (e.g. 180mm/day). Then form the sequence $w_0, w_1, \ldots$ where

\[
w_i = f_C(s_i)
\]

where $f_C(x)$ is $0$ if $x \leq C$ and $x$ otherwise.

As the seller of the option, the worst outcome for us (i.e. largest payout) corresponds to the subsequence

\[
w_{k_1}, w_{k_2}, \ldots w_{k_n}
\]

that maximises the sum $\sum_{l=1}^n w_{k_l}$ for some $n$, where the terms $w_{k_l}$ are mutually disjoint (i.e. we don’t take the 3-sum of any days that overlap). We consider this sum over one year, but the same approach applies to monthly, quarterly, etc.

Formulation as a graph problem

Label vertices with the 3-sums $w_i$ and create an edge $(w_i, w_j)$ if the sums $w_i$ and $w_j$ refer to overlapping time periods. Then our problem is equivalent to finding a Maximum Weighted Independent Set in the graph. This is a classic NP-hard problem.

Formulation as a linear programming problem

Create binary decision variables $x_0, x_1, \ldots$ and optimise

\[ \sum{w_i x_i} \]

subject to

\[
x_i + x_j \leq 1
\]

when $w_i$ and $w_j$ refer to non-disjoint cumulative 3-sums.

Python solver

Using PuLP we can very quickly write a solver in Python.

First set up a solver:

prob = pulp.LpProblem('maxcsums', pulp.LpMaximize)

Create binary variables corresponding to each vertex:

x_vars = [ pulp.LpVariable('x' + str(i), cat='Binary')
             for (i, _) in enumerate(weights) ]

Use reduce to form the sum \( \sum w_i x_i \):

prob += reduce(lambda a,b: a+b,
                   [ weights[i]*x_vars[i] for i in range(len(weights))])

Finally, given a list of indices (3-tuples), add the constraints to avoid solutions with adjacent vertices:

for (i, x) in enumerate(indices):
  for (j, y) in enumerate(indices):
    if x < y:
      if is_adjacent(x, y):
        prob += x_vars[i] + x_vars[j] <= 1

Changi daily rainfall

For a concrete example, let’s use the daily rainfall data for Changi in Singapore.

Here are the 3-sums:



For cutoffs from 100 to 300, here are the number of extreme events per year:

     100 120 140 160 180 200 220 240 260 280 300
1980   3   3   2   1   1   1   1   1   1   0   0
1981   1   1   1   1   0   0   0   0   0   0   0
1982   2   1   1   1   1   1   1   1   0   0   0
1983   2   2   2   2   2   1   0   0   0   0   0
1984   6   4   2   1   1   1   0   0   0   0   0
1985   1   1   0   0   0   0   0   0   0   0   0
1986   4   4   4   3   1   0   0   0   0   0   0
1987   4   3   2   2   1   1   1   1   1   0   0
1988   9   4   2   1   1   0   0   0   0   0   0
1989   3   1   1   1   1   1   1   1   0   0   0
1990   2   1   1   1   0   0   0   0   0   0   0
1991   4   3   1   0   0   0   0   0   0   0   0
1992   6   3   2   2   1   1   1   0   0   0   0
1993   1   0   0   0   0   0   0   0   0   0   0
1994   4   3   2   2   2   1   0   0   0   0   0
1995   3   3   3   2   2   2   1   0   0   0   0
1996   1   0   0   0   0   0   0   0   0   0   0
1997   0   0   0   0   0   0   0   0   0   0   0
1998   3   3   1   1   0   0   0   0   0   0   0
1999   2   0   0   0   0   0   0   0   0   0   0
2000   2   2   0   0   0   0   0   0   0   0   0
2001   5   3   2   2   2   2   2   2   2   1   0
2002   4   1   1   1   1   0   0   0   0   0   0
2003   5   1   1   1   1   1   0   0   0   0   0
2004   6   4   2   2   1   1   1   0   0   0   0
2005   2   1   1   1   0   0   0   0   0   0   0
2006   5   5   5   5   3   2   2   1   1   1   1
2007   7   5   4   3   1   1   1   1   1   0   0
2008   4   3   1   0   0   0   0   0   0   0   0
2009   1   1   0   0   0   0   0   0   0   0   0
2010   4   1   0   0   0   0   0   0   0   0   0
2011   4   2   1   1   1   1   1   1   1   1   0
2012   3   0   0   0   0   0   0   0   0   0   0
2013   5   3   3   1   0   0   0   0   0   0   0
2014   0   0   0   0   0   0   0   0   0   0   0
2015   0   0   0   0   0   0   0   0   0   0   0
2016   1   0   0   0   0   0   0   0   0   0   0

We will go with 180mm as the cutoff. Ultimately, choosing the strike is a business decision.

The burn is $\texttt{max}(w_i – 180, 0)$, indicating the payout that the seller of the option would be liable for. Here is the daily burn for Changi daily rainfall (zero rows are not shown):

            c3sum   burn
DATE                    
1980-01-20  275.5   95.5
1982-12-22  245.4   65.4
1983-08-22  190.6   10.6
1983-12-26  218.8   38.8
1984-02-02  219.3   39.3
1986-12-11  186.7    6.7
1987-01-11  260.5   80.5
1988-09-21  188.9    8.9
1989-12-01  252.5   72.5
1992-11-11  223.2   43.2
1994-11-11    209   29.0
1994-12-19  180.7    0.7
1995-11-02  203.4   23.4
1995-02-06  236.9   56.9
2001-01-16  268.7   88.7
2001-12-27  281.4  101.4
2002-01-25  182.1    2.1
2003-01-31  210.5   30.5
2004-01-25  237.9   57.9
2006-12-19  356.4  176.4
2006-12-26  222.7   42.7
2006-01-09  190.7   10.7
2007-01-12  260.4   80.4
2011-01-31  298.2  118.2

The yearly burn is $\sum_i{\texttt{max}(w_i-180,0)}$ where the $w_i$ are in a particular calendar year. Plotting the yearly burn:



The histogram of burn values shows that a value around $80$ is the most frequent total yearly burn:



Cumulative histogram of the yearly burn:



From the yearly burn we can compute the expected loss over various periods:


Expected loss (all years):     71.13
Expected loss (last 10 years): 86.22
Expected loss (last 20 years): 71.13

We see that the expected loss over the last 10, 20 years differs by about 15.0.
The 99th percentile is p99 = 223.051.

Unlike pricing options on a liquid instrument such as an interest rate swap, stock, currencies, etc, the pricing of this option is set in an incomplete market. There is no way to hedge, so the Black Scholes formula cannot be used. Typically the price will be set as

\[
f(\texttt{expected loss}, \texttt{high value on the tail})
\]

where the high value on the tail is p99 or p99.9 (or similar); the function $f$ will integrate the price maker’s risk appetite, but it will be approximately

\[
\texttt{option price} \approx \texttt{expected loss} + \Delta
\]

Notes

Source code is on Github: https://github.com/carlohamalainen/maxc3sum

Most of the data manipulation is done using Python Pandas which makes lots of things one-liners, e.g. to sum the daily burn to yearly burn, given an index of datetime.date, we can group by year and then sum the result:

    yearly_burn = daily_burn.groupby(lambda x: x.year).sum()

To set the gap for GLPK, pass the command line option when setting up the initial GLPK object:

    pulp.GLPK(options=['--mipgap', 0.1]).solve(prob)

Naturally, finding a good gap is problem-specific and requires some experimentation.

Push to Amazon Glacier using amazonka

Here is a small Haskell package for pushing files to Amazon Glacier: https://github.com/carlohamalainen/glacier-push. It uses Brendan Hay’s amazonka API, in particular amazonka-glacier.

One thing that I couldn’t find in amazonka was a way to calculate the tree hash of a file. The Glacier API needs this for each part that is uploaded as well as the whole file. Amazon explains how to calculate the tree hash in their Glacier docs and provides sample code in Java and C++. Since the algorithm is recursive, it is quite short in Haskell:


oneMb :: Int64
oneMb = 1024*1024

treeHash :: BS.ByteString -> Hash
treeHash s = treeHash' $ map sha256 $ oneMbChunks s
  where
    treeHash' []  = error "Internal error in treeHash'."
    treeHash' [x] = B16.encode x
    treeHash' xs  = treeHash' $ next xs

    next :: [BS.ByteString] -> [BS.ByteString]
    next []       = []
    next [a]      = [a]
    next (a:b:xs) = sha256 (BS.append a b) : next xs

    oneMbChunks :: BS.ByteString -> [BS.ByteString]
    oneMbChunks x
      | BS.length x <= oneMb = [x]
      | otherwise            = BS.take oneMb x : oneMbChunks (BS.drop oneMb x)

    sha256 :: BS.ByteString -> BS.ByteString
    sha256 = cs . SHA256.hashlazy

To push a large file to Glacier we do three things: initiate the multipart upload, upload each part (say, 100Mb chunks), and then finalize the upload.

Initiate the multipart upload

We do this to get an uploadId which is then used for each of the multipart uploads. We use initiateMultipartUpload, and need to set the part size.


initiateMulti env vault _partSize = send' env mpu
  where
    mpu = initiateMultipartUpload accountId vault
            & imuPartSize .~ (Just $ cs $ show _partSize)

Upload the parts

With the response from initiating the multipart upload (the mu parameter in uploadOnePart) we can push a single part using uploadMultipartPart. Here we have to also set the checksum and content range:


uploadOnePart env vault mu p = do
    let Part{..} = p

    body <- toHashed <$> getPart _path (_partStart, _partEnd)

    uploadId <- case mu ^. imursUploadId of
                    Nothing     -> throw MissingUploadID
                    Just uid    -> return uid

    let ump = uploadMultipartPart accountId vault uploadId body
                & umpChecksum .~ (Just $ cs $ p ^. partHash)
                & umpRange    .~ Just cr

    send' env ump

  where

    contentRange :: Int64 -> Int64 -> Text
    contentRange x y = "bytes " `append` cs (show x) `append` accountId `append` cs (show y) `append` "/*"

Complete the multipart upload

Completing the multipart upload lets Glacier know that it should start its job of assembling all the parts into a full archive. We have to set the archive size and the tree hash of the entire file.


completeMulti env vault mp mu = do
    uploadId <- case mu ^. imursUploadId of
                    Nothing     -> throw MissingUploadID
                    Just uid    -> return uid

    let cmu = completeMultipartUpload accountId vault uploadId
                & cmuArchiveSize .~ (Just $ cs $ show $ mp ^. multipartArchiveSize)
                & cmuChecksum    .~ (Just $ cs $ mp ^. multipartFullHash)

    send' env cmu

Notes

In each of these functions I used a helper for sending the request:


send' env x = liftIO $ runResourceT . runAWST env $ send x

I run the main block of work in a KatipContextT since I am using katip for structured logging. Adding new key-value info to the log context is accomplished using katipAddContext.


go vault' path = do
    $(logTM) InfoS "Startup."

    let vault = cs vault'

    let myPartSize = 128*oneMb

    mp  <- liftIO $ mkMultiPart path myPartSize

    env <- liftIO $ newEnv'
    mu  <- liftIO $ initiateMulti env vault myPartSize

    let uploadId = fromMaybe (error "No UploadId in response, can't continue multipart upload.")
                 $ mu ^. imursUploadId

    partResponses <- forM (mp ^. multiParts) $ \p ->
        katipAddContext (sl "uploadId" uploadId) $
        katipAddContext (sl "location" $ fromMaybe "(nothing)" $ mu ^. imursLocation) $ do
            doWithRetries 10 (uploadOnePart env vault mu p)

    case lefts partResponses of
        []   -> do $(logTM) InfoS "All parts uploaded successfully, now completing the multipart upload."
                   catch (do completeResponse <- completeMulti env vault mp mu
                             katipAddContext (sl "uploadId" uploadId)                             $
                              katipAddContext (sl "archiveId" $ completeResponse ^. acoArchiveId) $
                               katipAddContext (sl "checksum" $ completeResponse ^. acoChecksum ) $
                                katipAddContext (sl "location" $ completeResponse ^. acoLocation) $ do
                                  $(logTM) InfoS "Done"
                                  liftIO exitSuccess)
                         (\e -> do logServiceError "Failed to complete multipart upload." e
                                   $(logTM) ErrorS "Failed."
                                   liftIO exitFailure)

        errs -> do forM_ errs (logServiceError "Failed part upload.")
                   $(logTM) ErrorS "Too many part errors."
                   liftIO exitFailure

logServiceError msg (ServiceError e)
    = let smsg :: Text
          smsg = toText $ fromMaybe "" $ e ^. serviceMessage

          scode :: Text
          scode = toText $ e ^. serviceCode

        in katipAddContext (sl "serviceMessage" smsg) $
            katipAddContext (sl "serviceCode" scode)  $
             (headersAsContext $ e ^. serviceHeaders) $
               $(logTM) ErrorS msg

logServiceError msg (TransportError e)
    = let txt :: Text
          txt = toText $ show e
        in katipAddContext (sl "TransportError" txt) $
            $(logTM) ErrorS msg

logServiceError msg (SerializeError e)
    = let txt :: Text
          txt = toText $ show e
        in katipAddContext (sl "SerializeError" txt) $
            $(logTM) ErrorS msg

I found it handy to write this little helper function to turn each header from an amazonka ServiceError into a Katip context key/value pair:


headersAsContext :: KatipContext m => [Header] -> m a -> m a
headersAsContext hs = foldl (.) id $ map headerToContext hs
  where
    headerToContext :: KatipContext m => Header -> m a -> m a
    headerToContext (h, x) = let h' = cs $ CI.original h :: Text
                                 x' = cs x               :: Text
                               in katipAddContext (sl h' x')

Katip can write to ElasticSearch using katip-elasticsearch. Then you’d be able to search for errors on specific header fields, etc.

Sample run

glacier-push-exe basement myfile
[2017-08-30 12:44:47][glacier-push.main][Info][x4][1724][ThreadId 7][main:Main app/Main.hs:300:7] Startup.
[2017-08-30 12:44:50][glacier-push.main][Info][x4][1724][ThreadId 7][location:/998720554704/vaults/basement/multipart-uploads/vZMCGNsLGhfTJ_hJ-CJ_OF_juCAY1IaZDl_A3YqOZXnuQRH_AtPiMaUE-K1JDew-ZwiIuDZgR3QbjsJIEfWtGeMNeKDs][partEnd:134217727][partStart:0][uploadId:vZMCGNsLGhfTJ_hJ-CJ_OF_juCAY1IaZDl_A3YqOZXnuQRH_AtPiMaUE-K1JDew-ZwiIuDZgR3QbjsJIEfWtGeMNeKDs][main:Main app/Main.hs:213:15] Uploading part.
[2017-08-30 12:45:45][glacier-push.main][Info][x4][1724][ThreadId 7][location:/998720554704/vaults/basement/multipart-uploads/vZMCGNsLGhfTJ_hJ-CJ_OF_juCAY1IaZDl_A3YqOZXnuQRH_AtPiMaUE-K1JDew-ZwiIuDZgR3QbjsJIEfWtGeMNeKDs][partEnd:268435455][partStart:134217728][uploadId:vZMCGNsLGhfTJ_hJ-CJ_OF_juCAY1IaZDl_A3YqOZXnuQRH_AtPiMaUE-K1JDew-ZwiIuDZgR3QbjsJIEfWtGeMNeKDs][main:Main app/Main.hs:213:15] Uploading part.
[2017-08-30 12:46:37][glacier-push.main][Info][x4][1724][ThreadId 7][location:/998720554704/vaults/basement/multipart-uploads/vZMCGNsLGhfTJ_hJ-CJ_OF_juCAY1IaZDl_A3YqOZXnuQRH_AtPiMaUE-K1JDew-ZwiIuDZgR3QbjsJIEfWtGeMNeKDs][partEnd:293601279][partStart:268435456][uploadId:vZMCGNsLGhfTJ_hJ-CJ_OF_juCAY1IaZDl_A3YqOZXnuQRH_AtPiMaUE-K1JDew-ZwiIuDZgR3QbjsJIEfWtGeMNeKDs][main:Main app/Main.hs:213:15] Uploading part.
[2017-08-30 12:46:55][glacier-push.main][Info][x4][1724][ThreadId 7][main:Main app/Main.hs:320:22] All parts uploaded successfully, now completing the multipart upload.
[2017-08-30 12:46:57][glacier-push.main][Info][x4][1724][ThreadId 7][archiveId:bImG6jM0eQGNC7kIJTsC_wtcAXdPDUtJ-NyfstrkxeyTtXC_iEgkvenH-h_eQH-LYbhVKWJM7WuZlb7OHKtgKJNEpOtVaqxEhlNRTHphUtLCurcHAQDHKkiTnIXTpFxgPgvP9Q0axA][checksum:4f08645d8f3705dc222eef7547591c400362806abb7a6298b9267ebf2be7d901][location:/998720554704/vaults/basement/archives/bImG6jM0eQGNC7kIJTsC_wtcAXdPDUtJ-NyfstrkxeyTtXC_iEgkvenH-h_eQH-LYbhVKWJM7WuZlb7OHKtgKJNEpOtVaqxEhlNRTHphUtLCurcHAQDHKkiTnIXTpFxgPgvP9Q0axA][uploadId:vZMCGNsLGhfTJ_hJ-CJ_OF_juCAY1IaZDl_A3YqOZXnuQRH_AtPiMaUE-K1JDew-ZwiIuDZgR3QbjsJIEfWtGeMNeKDs][main:Main app/Main.hs:326:37] Done

The lines are pretty long (as they are intended for consumption into ElasticSearch, not human parsing) so here is one with line breaks:

[2017-08-30 12:45:45]
 [glacier-push.main]
 [Info]
 [x4]
 [1724]
 [ThreadId 7]
 [location:/998720554704/vaults/basement/multipart-uploads/vZMCGNsLGhfTJ_hJ-CJ_OF_juCAY1IaZDl_A3YqOZXnuQRH_AtPiMaUE-K1JDew-ZwiIuDZgR3QbjsJIEfWtGeMNeKDs]
 [partEnd:268435455]
 [partStart:134217728]
 [uploadId:vZMCGNsLGhfTJ_hJ-CJ_OF_juCAY1IaZDl_A3YqOZXnuQRH_AtPiMaUE-K1JDew-ZwiIuDZgR3QbjsJIEfWtGeMNeKDs]
 [main:Main app/Main.hs:213:15] Uploading part.

Checking out and compiling

Use Stack. Then:

$ git clone https://github.com/carlohamalainen/glacier-push.git
$ cd glacier-push
$ stack build

To browse the source on github, have a look at: