Folds and parallelism

A Monday recap

Someone asked a thoughtful question on Monday about folds, specifically about foldr vs foldl and foldl'.

I wasn't very happy with the answers I came up with, because I was in a bit of a rush.

Always be suspicious of teachers who get into a muddle if you spring an unexpected question!
Do they really know what they're talking about, or are they just good at reading slides?

Today, let's slow down and think about folds a little more.

Folding from the right

Here's the definition of foldr:

foldr :: (a -> b -> b) -> b -> [a] -> b
foldr f = go
  where
    go (x:xs) = f x (go xs)
    go z      = z

It associates to the right, so here's what a use of foldr looks like, expanded:

foldr f z [a,b,c,d] = f a (f b (f c (f d z)))

foldr and strictness: 0 of n

Suppose we want to sum the elements of a list using foldr:

foldr (+) 0 [3..6]

In order for the (+) operator to add two numbers, it has to know what each one is, so at least for the usual types we work with, it's strict in both of its arguments.

(Note: (+) is not required to be strict. That behaviour depends on the instance of Num in use.)

foldr and strictness: 1 of n

Let's do a stepwise expansion of the foldr-based sum.

foldr (+) 0 (3:{-...-}) = 3 + {-...-}

We can't produce an answer after consuming the first element of the list, because + is strict in its arguments.

foldr and strictness: 2 of n

foldr (+) 0 (3:4:{-...-}) = 3 + (4 + {-...-})

Neither can we produce an answer after the second element.

In fact, we're building up an expression that we can't reduce to a smaller form, because we're associating to the right and we haven't seen the rest of the list yet (see the definition of foldr).

foldr and strictness: n of n

In fact, not until we reach the end of the list do we have a number on the right hand side of a +:

foldr (+) 0 (3:4:5:6:[]) = 3 + (4 + (5 + (6 + 0)))

What we've done is build up a big expression that we can't evaluate until we've seen the entire list.

This is due in equal part to two factors:

foldr is right associative.
The function we're folding with is strict in both arguments.

So how about foldl?

If foldr was getting us all screwed up due to the associativity, maybe foldl will do better.

foldl :: (a -> b -> a) -> a -> [b] -> a
foldl f = go
  where
    go z (x:xs) = go (f z x) xs
    go z _      = z

Let's see an expansion.

foldl f z [a,b,c,d] = f (f (f (f z a) b) c) d

Ouch! Maybe we can make that clearer with infix syntax?

foldl (+) z [a,b,c,d] = (((z + a) + b) + c) + d

foldl and strictness

For summing elements, this looks better than foldr on the face of things, because at every step through the list, the function we're folding with can see both of its arguments.

But there's a catch.

None of the intermediate results are visible to the caller of foldl, so the caller cannot force those results to be evaluated.

In other words, we're still building up a big thunk!

A strict left fold

This function is defined for us in Data.List, and it has the same type as foldl:

{-# LANGUAGE BangPatterns #-}

foldl' :: (a -> b -> a) -> a -> [b] -> a
foldl' f = go
  where
    go !z (x:xs) = go (f z x) xs
    go !z  _     = z

The crucial difference lies with the strictness annotation, which forces each intermediate result to be evaluated to WHNF.

foldl' and strictness: 1 of n

That same sum again.

foldl' (+) 0 [1,2,3]

First step:

foldl' (+) 0 (1:{-...-}) = foldl' (0+1) {-...-}
                         = foldl' 1     {-...-}

foldl' and strictness: 2 of n

Second step:

foldl' (+) 1 (2:{-...-}) = foldl' (1+2) {-...-}
                         = foldl' 2     {-...-}

The structure of foldl' causes us to replace the accumulator z with a new value at each step.

The strictness annotation (we could have used seq too) causes that accumulator to be evaluated to WHNF before we continue.

foldl' and strictness: n of n

Last step:

foldl' (+) 3 [] = 3

I hope this makes the difference between foldl' and foldr more apparent!

Don't use foldl

It should be clear by now that plain old foldl is very rarely useful.

In fact, I've never personally found a situation where it was the right kind of fold to use.

If you find yourself thinking "I need a left fold", it's safe to assume that foldl' is what you'll need.

foldr and laziness

We've discussed the fact that using foldr with a strict function is a bad idea, because it generates a chain of thunks.

What if we used a function that was lazy in its second argument?

Hmm. Conveniently enough, we have an everyday example of just such a function: the list constructor!

What does this function do?

foldr (:) []

A handy way to think about foldr

Let's do another expansion, but this time we'll align the input list and result expression lexically:

foldr (<+>) z 
  (a  :  (b  :  (c  :  (d  :  [])))) =
  (a <+> (b <+> (c <+> (d <+> z ))))

Notice that each (:) in the input list is "replaced" by an application of (<+>) in the result, and the [] at the end is replaced with z.

(It doesn't matter what (<+>) actually does here. We're purely interested in the structures of the two expressions.)

Quick exercise 1: map

Write a version of map that does not pattern match on its input list, but instead uses foldr:

map :: (a -> b) -> [a] -> [b]

map (+1) [1,2,3] == [2,3,4]

Quick exercise 1: my implementation

myMap f = foldr go []
  where go x xs = f x : xs

Quick exercise 2: filter

Write a version of filter that again uses foldr:

filter :: (a -> Bool) -> [a] -> [a]

filter odd [0..5] == [1,3,5]

Quick exercise 2: my version

myFilter p = foldr go []
  where
    go x xs
      | p x       = x : xs
      | otherwise = xs

Quick exercise 3: append

Write an append function that has the same behaviour as the (++) operator:

(++) :: [a] -> [a] -> [a]

[1,2,3] ++ [4,5,6] == [1,2,3,4,5,6]

For bonus points, make your implementation as short as possible.

Quick exercise 3: mine

append = flip (foldr (:))

Parallelism vs concurrency

A handy rule of thumb for distinguishing the two:

Concurrency: "I have a number of potentially unrelated tasks to deal with."
Parallelism: "I want my single question to be answered sooner if I can throw more resources at the problem."

Many languages blur these notions together, by providing a "one size fits all" toolkit for concurrency and parallelism.

What does a concurrent system need?

We do not need multiple CPU cores to achieve concurrent processing, nor do we even need threads.

However, a good language runtime and programming environment will offer improved throughput and latency on concurrent problems if it can use multiple cores.

What about a parallel system?

The essence of parallel programming is to have multiple computing elements process parts of a problem simultaneously.

GHC provides a "baked in" way of using multiple CPU cores to do this.

To take advantage of this, we must build our executables with -threaded.

This links a Haskell program against the "threaded runtime", which can schedule lightweight Haskell threads across multiple CPU cores, and which can also run parallel programs on those cores.

Low-level parallel programming

In the Control.Parallel module lives an intriguing function.

par :: a -> b -> b

This is the parallel cousin to seq. It tells the threaded runtime that it may be beneficial to evaluate the first argument in parallel with the second.

Like seq, par returns the value of its second argument.

pseq

As it happens, we need another oddball function to work with par.

pseq :: a -> b -> b

This is semantically identical to seq, but with a subtle operational difference.

seq is strict in both arguments, so the compiler can take an expression like this:

a `seq` b

And transform it into this:

b `seq` a `seq` b

This can be a problem when annotating code for parallelism, when we need more control over the order of evaluation.

In contrast to seq, pseq is only strict in its first argument.

This restricts the transformations that the compiler can perform, and lets us retain control of the evaluation order.

When to use par

We use par when two conditions are true:

The first argument is somewhat expensive to compute.
We don't have an immediate need for the result of the first computation.

If the first argument is too cheap, then the book-keeping overhead to evaluate it in parallel will diminish or erase any performance gain.

Futures

The par operator makes use of the overlap between lazy evaluation and futures.

To implement lazy evaluation, we need thunks: a representation for expressions which are not yet evaluated but whose value may later be demanded.

A future is a computation whose value is being evaluated in parallel and which we may wait for.

par offers a way to annotate a lazy computation as being potentially profitable to evaluate in parallel.

In other words, it turns a lazy computation into a future.

O rly?

In practice, par and pseq are a pain in the neck to use.

To use par effectively, we have to get some surprisingly subtle things right:

Pass an unevaluated computation to par
The unevaluated computation must be somewhat expensive
Ensure that its value will not be required by the enclosing computation for a while
Ensure that the result is shared by the rest of the program

If we miss any of 1 through 3, then we achieve little or no speedup.

If we miss 4, then the GC may get rid of the computed result before it can be used.

In practice, even experts fall afoul of these requirements on a regular basis.

The Par monad

The monad-par package provides a library that makes parallelism easier to achieve and reason about.

cabal install monad-par

It gives us a type named Par:

newtype Par a

instance Functor Par
instance Applicative Par
instance Monad Par

To evaluate a Par computation, we use runPar:

runPar :: Par a -> a

Notice that this has no side effects, so it will run deterministically.

Building blocks

To start a parallel computation, we use the fork action:

fork :: Par () -> Par ()

Forked tasks need a way to communicate with each other, for which we use the IVar type:

data IVar a
    deriving (Eq)

new :: Par (IVar a)
get :: IVar a -> Par a
put :: (NFData a) => IVar a -> a -> Par ()

The IVar type is a write-once mutable reference. The get function blocks until a put has been performed.

The NFData class

We know that seq evaluates an expression to WHNF, but sometimes we want to completely evaluate an expression to normal form.

Enter the deeqseq package:

import Control.DeepSeq

class NFData a where
    rnf :: a -> ()

We can write an NFData instance for any custom type. Our rnf implementation must reduce an expression to normal form.

data Foo a = Foo Int a

instance (NFData a) => NFData (Foo a) where
    rnf (Foo x y) = rnf x `seq` rnf y

Higher level operations

An extremely common pattern is for a thread to fork several children and wait for them to finish.

We can easily capture this idea with a suitable combinator.

spawn :: (NFData a) => Par a -> Par (IVar a)
spawn act = do
  i <- new
  fork (put i =<< act)
  return i

In fact, usually all we want is to simply wait for all children and return a list of their results.

parMapM :: (NFData b) => (a -> Par b) -> [a] -> Par [b]
parMapM f acts = do 
  ivars <- mapM (spawn . f) acts
  mapM get ivars

Computing the mean of a sample

Suppose we need the mean of a list of numbers. The naive way to do so would be sum xs / length xs, but this has numerical stability problems.

Here's a more stable method:

import Data.List

data T a = T !a !Int

mean :: (RealFrac a) => [a] -> a
mean = fini . foldl' go (T 0 0)
  where
    fini (T a _) = a
    go (T m n) x = T m' n'
      where m' = m + (x - m) / fromIntegral n'
            n' = n + 1

Questionable numbers

Supposing we have a set of sample data that we know little about, in particular its precision and variance.

This is exactly the kind of problem that the criterion benchmarking library faces: we have performance data, but it's dirty and doesn't follow any particular statistical distribution.

A technique called the jackknife lets us estimate these parameters.

We successively recompute a function (such as mean) over subsets of a sample, each time with a sliding window cut out.

For a w-width window and k samples, the jackknife has a cost of O((k - w)²).

jackknife :: ([a] -> b) -> [a] -> [b]
jackknife f = map f . resamples 500

Resampling

It's easy to write a resampling function using familiar building blocks.

resamples :: Int -> [a] -> [[a]]
resamples k xs =
    take (length xs - k) $
    zipWith (++) (inits xs) (map (drop k) (tails xs))

Our function resamples a list with a window size of k.

>> resamples 2 [0..5]
[[    2,3,4,5],
 [0,    3,4,5],
 [0,1,    4,5],
 [0,1,2,    5]]

Speeding up the jackknife

The nice thing about the jackknife is that each element of the result list is independent, so it's an "embarrassingly parallel" problem.

The monad-par package is very helpful in making this trivial to parallelize.

import Control.Monad.Par (runPar, parMap)

jackknifeP f = runPar . parMap f . resamples 500

Let's try it out!

Suitably noisy data

Noise!

A test program

import System.Random.Mersenne

crud = zipWith (\x a -> sin (x / 300)**2 + a) [0..]

main = do
  (xs,ys) <- splitAt 1500 . take 6000 <$> (randoms =<< getStdGen)
  let rs = crud xs ++ ys
  putStrLn $ "sample mean:    " ++ show (mean rs)

  let j = jackknifeP mean rs
  putStrLn $ "jack mean min:  " ++ show (minimum j)
  putStrLn $ "jack mean max:  " ++ show (maximum j)

Compiling

We have to remember to build using both -threaded and -rtsopts.

ghc -O --make -rtsopts -threaded Jackknife

-threaded gives us the threaded runtime.
We'll need -rtsopts to tell the RTS how many cores to use. It will default to just one if we do not specify otherwise.

To use e.g. 3 cores:

./Jackknife +RTS -N3

If we want to use all cores:

./Jackknife +RTS -N

Other approaches to parallelism

Meanwhile, par and the monad-par packages are not the only ways to achieve parallel speedups in Haskell.

There are other more "researchy" projects under way:

For multicore machines: Data Parallel Haskell
For GPU computing: the accelerate package

Data Parallel Haskell

DPH is the name of a research project to support nested data parallelism under GHC on multicore CPUs.

Nested data parallelism extends the programming model of flat data parallelism, as known from modern Fortran dialects, to:

Irregular parallel computations (such as divide-and-conquer algorithms) and
Irregular data structures (such as sparse matrices and tree structures).

This project is still a work in progress, and is quite technically intricate.

Accelerate

The accelerate library defines an embedded language for regular, multi-dimensional array computations with multiple backends to facilitate high-performance implementations.

Currently, there are two backends:

An interpreter that serves as a reference implementation of the intended semantics of the language.
A CUDA backend generating code for CUDA-capable NVIDIA GPUs.

Busting a myth

You'll sometimes hear uninformed people claim that functional programming is some kind of magic pixie dust for parallel programming.

This is not really true.

Certainly, aspects of the programming model make incidental parts of the problem easier.

However, tricky considerations like data dependencies, memory placement, and thread scheduling have every bit as insidious an effect on parallel performance in Haskell as in C.

Parallel programming is a subtle art no matter which language you do it in.