Someone asked a thoughtful question on Monday about folds, specifically about foldr
vs foldl
and foldl'
.
I wasn't very happy with the answers I came up with, because I was in a bit of a rush.
Always be suspicious of teachers who get into a muddle if you spring an unexpected question!
Do they really know what they're talking about, or are they just good at reading slides?
Today, let's slow down and think about folds a little more.
Here's the definition of foldr
:
foldr :: (a -> b -> b) -> b -> [a] -> b
foldr f = go
where
go (x:xs) = f x (go xs)
go z = z
It associates to the right, so here's what a use of foldr
looks like, expanded:
foldr f z [a,b,c,d] = f a (f b (f c (f d z)))
Suppose we want to sum the elements of a list using foldr
:
foldr (+) 0 [3..6]
In order for the (+)
operator to add two numbers, it has to know what each one is, so at least for the usual types we work with, it's strict in both of its arguments.
(Note: (+)
is not required to be strict. That behaviour depends on the instance of Num
in use.)
Let's do a stepwise expansion of the foldr
-based sum.
foldr (+) 0 (3:{-...-}) = 3 + {-...-}
We can't produce an answer after consuming the first element of the list, because +
is strict in its arguments.
foldr (+) 0 (3:4:{-...-}) = 3 + (4 + {-...-})
Neither can we produce an answer after the second element.
In fact, we're building up an expression that we can't reduce to a smaller form, because we're associating to the right and we haven't seen the rest of the list yet (see the definition of foldr
).
In fact, not until we reach the end of the list do we have a number on the right hand side of a +
:
foldr (+) 0 (3:4:5:6:[]) = 3 + (4 + (5 + (6 + 0)))
What we've done is build up a big expression that we can't evaluate until we've seen the entire list.
This is due in equal part to two factors:
foldr
is right associative.
The function we're folding with is strict in both arguments.
If foldr
was getting us all screwed up due to the associativity, maybe foldl
will do better.
foldl :: (a -> b -> a) -> a -> [b] -> a
foldl f = go
where
go z (x:xs) = go (f z x) xs
go z _ = z
Let's see an expansion.
foldl f z [a,b,c,d] = f (f (f (f z a) b) c) d
Ouch! Maybe we can make that clearer with infix syntax?
foldl (+) z [a,b,c,d] = (((z + a) + b) + c) + d
For summing elements, this looks better than foldr
on the face of things, because at every step through the list, the function we're folding with can see both of its arguments.
But there's a catch.
None of the intermediate results are visible to the caller of foldl
, so the caller cannot force those results to be evaluated.
In other words, we're still building up a big thunk!
This function is defined for us in Data.List
, and it has the same type as foldl
:
{-# LANGUAGE BangPatterns #-}
foldl' :: (a -> b -> a) -> a -> [b] -> a
foldl' f = go
where
go !z (x:xs) = go (f z x) xs
go !z _ = z
The crucial difference lies with the strictness annotation, which forces each intermediate result to be evaluated to WHNF.
That same sum again.
foldl' (+) 0 [1,2,3]
First step:
foldl' (+) 0 (1:{-...-}) = foldl' (0+1) {-...-}
= foldl' 1 {-...-}
Second step:
foldl' (+) 1 (2:{-...-}) = foldl' (1+2) {-...-}
= foldl' 2 {-...-}
The structure of foldl'
causes us to replace the accumulator z
with a new value at each step.
The strictness annotation (we could have used seq
too) causes that accumulator to be evaluated to WHNF before we continue.
Last step:
foldl' (+) 3 [] = 3
I hope this makes the difference between foldl'
and foldr
more apparent!
It should be clear by now that plain old foldl
is very rarely useful.
In fact, I've never personally found a situation where it was the right kind of fold to use.
If you find yourself thinking "I need a left fold", it's safe to assume that foldl'
is what you'll need.
We've discussed the fact that using foldr
with a strict function is a bad idea, because it generates a chain of thunks.
What if we used a function that was lazy in its second argument?
Hmm. Conveniently enough, we have an everyday example of just such a function: the list constructor!
What does this function do?
foldr (:) []
Let's do another expansion, but this time we'll align the input list and result expression lexically:
foldr (<+>) z
(a : (b : (c : (d : [])))) =
(a <+> (b <+> (c <+> (d <+> z ))))
Notice that each (:)
in the input list is "replaced" by an application of (<+>)
in the result, and the []
at the end is replaced with z
.
(It doesn't matter what (<+>)
actually does here. We're purely interested in the structures of the two expressions.)
Write a version of map
that does not pattern match on its input list, but instead uses foldr
:
map :: (a -> b) -> [a] -> [b]
map (+1) [1,2,3] == [2,3,4]
myMap f = foldr go []
where go x xs = f x : xs
Write a version of filter
that again uses foldr
:
filter :: (a -> Bool) -> [a] -> [a]
filter odd [0..5] == [1,3,5]
myFilter p = foldr go []
where
go x xs
| p x = x : xs
| otherwise = xs
Write an append
function that has the same behaviour as the (++)
operator:
(++) :: [a] -> [a] -> [a]
[1,2,3] ++ [4,5,6] == [1,2,3,4,5,6]
For bonus points, make your implementation as short as possible.
append = flip (foldr (:))
A handy rule of thumb for distinguishing the two:
Concurrency: "I have a number of potentially unrelated tasks to deal with."
Parallelism: "I want my single question to be answered sooner if I can throw more resources at the problem."
Many languages blur these notions together, by providing a "one size fits all" toolkit for concurrency and parallelism.
We do not need multiple CPU cores to achieve concurrent processing, nor do we even need threads.
However, a good language runtime and programming environment will offer improved throughput and latency on concurrent problems if it can use multiple cores.
The essence of parallel programming is to have multiple computing elements process parts of a problem simultaneously.
GHC provides a "baked in" way of using multiple CPU cores to do this.
To take advantage of this, we must build our executables with -threaded
.
This links a Haskell program against the "threaded runtime", which can schedule lightweight Haskell threads across multiple CPU cores, and which can also run parallel programs on those cores.
In the Control.Parallel
module lives an intriguing function.
par :: a -> b -> b
This is the parallel cousin to seq
. It tells the threaded runtime that it may be beneficial to evaluate the first argument in parallel with the second.
Like seq
, par
returns the value of its second argument.
As it happens, we need another oddball function to work with par
.
pseq :: a -> b -> b
This is semantically identical to seq
, but with a subtle operational difference.
seq
is strict in both arguments, so the compiler can take an expression like this:
a `seq` b
And transform it into this:
b `seq` a `seq` b
This can be a problem when annotating code for parallelism, when we need more control over the order of evaluation.
In contrast to seq
, pseq
is only strict in its first argument.
This restricts the transformations that the compiler can perform, and lets us retain control of the evaluation order.
We use par
when two conditions are true:
The first argument is somewhat expensive to compute.
We don't have an immediate need for the result of the first computation.
If the first argument is too cheap, then the book-keeping overhead to evaluate it in parallel will diminish or erase any performance gain.
The par
operator makes use of the overlap between lazy evaluation and futures.
To implement lazy evaluation, we need thunks: a representation for expressions which are not yet evaluated but whose value may later be demanded.
A future is a computation whose value is being evaluated in parallel and which we may wait for.
par
offers a way to annotate a lazy computation as being potentially profitable to evaluate in parallel.
In other words, it turns a lazy computation into a future.
In practice, par
and pseq
are a pain in the neck to use.
To use par
effectively, we have to get some surprisingly subtle things right:
Pass an unevaluated computation to par
The unevaluated computation must be somewhat expensive
Ensure that its value will not be required by the enclosing computation for a while
Ensure that the result is shared by the rest of the program
If we miss any of 1 through 3, then we achieve little or no speedup.
If we miss 4, then the GC may get rid of the computed result before it can be used.
The monad-par
package provides a library that makes parallelism easier to achieve and reason about.
cabal install monad-par
It gives us a type named Par
:
newtype Par a
instance Functor Par
instance Applicative Par
instance Monad Par
To evaluate a Par
computation, we use runPar
:
runPar :: Par a -> a
Notice that this has no side effects, so it will run deterministically.
To start a parallel computation, we use the fork
action:
fork :: Par () -> Par ()
Forked tasks need a way to communicate with each other, for which we use the IVar
type:
data IVar a
deriving (Eq)
new :: Par (IVar a)
get :: IVar a -> Par a
put :: (NFData a) => IVar a -> a -> Par ()
The IVar
type is a write-once mutable reference. The get
function blocks until a put
has been performed.
We know that seq
evaluates an expression to WHNF, but sometimes we want to completely evaluate an expression to normal form.
Enter the deeqseq
package:
import Control.DeepSeq
class NFData a where
rnf :: a -> ()
We can write an NFData
instance for any custom type. Our rnf
implementation must reduce an expression to normal form.
data Foo a = Foo Int a
instance (NFData a) => NFData (Foo a) where
rnf (Foo x y) = rnf x `seq` rnf y
An extremely common pattern is for a thread to fork
several children and wait for them to finish.
We can easily capture this idea with a suitable combinator.
spawn :: (NFData a) => Par a -> Par (IVar a)
spawn act = do
i <- new
fork (put i =<< act)
return i
In fact, usually all we want is to simply wait for all children and return a list of their results.
parMapM :: (NFData b) => (a -> Par b) -> [a] -> Par [b]
parMapM f acts = do
ivars <- mapM (spawn . f) acts
mapM get ivars
Suppose we need the mean of a list of numbers. The naive way to do so would be sum xs / length xs
, but this has numerical stability problems.
Here's a more stable method:
import Data.List
data T a = T !a !Int
mean :: (RealFrac a) => [a] -> a
mean = fini . foldl' go (T 0 0)
where
fini (T a _) = a
go (T m n) x = T m' n'
where m' = m + (x - m) / fromIntegral n'
n' = n + 1
Supposing we have a set of sample data that we know little about, in particular its precision and variance.
This is exactly the kind of problem that the criterion benchmarking library faces: we have performance data, but it's dirty and doesn't follow any particular statistical distribution.
A technique called the jackknife lets us estimate these parameters.
We successively recompute a function (such as mean
) over subsets of a sample, each time with a sliding window cut out.
For a w-width window and k samples, the jackknife has a cost of O((k - w)2).
jackknife :: ([a] -> b) -> [a] -> [b]
jackknife f = map f . resamples 500
It's easy to write a resampling function using familiar building blocks.
resamples :: Int -> [a] -> [[a]]
resamples k xs =
take (length xs - k) $
zipWith (++) (inits xs) (map (drop k) (tails xs))
Our function resamples a list with a window size of k
.
>> resamples 2 [0..5]
[[ 2,3,4,5],
[0, 3,4,5],
[0,1, 4,5],
[0,1,2, 5]]
The nice thing about the jackknife is that each element of the result list is independent, so it's an "embarrassingly parallel" problem.
The monad-par
package is very helpful in making this trivial to parallelize.
import Control.Monad.Par (runPar, parMap)
jackknifeP f = runPar . parMap f . resamples 500
Let's try it out!
import System.Random.Mersenne
crud = zipWith (\x a -> sin (x / 300)**2 + a) [0..]
main = do
(xs,ys) <- splitAt 1500 . take 6000 <$> (randoms =<< getStdGen)
let rs = crud xs ++ ys
putStrLn $ "sample mean: " ++ show (mean rs)
let j = jackknifeP mean rs
putStrLn $ "jack mean min: " ++ show (minimum j)
putStrLn $ "jack mean max: " ++ show (maximum j)
We have to remember to build using both -threaded
and -rtsopts
.
ghc -O --make -rtsopts -threaded Jackknife
-threaded
gives us the threaded runtime.
We'll need -rtsopts
to tell the RTS how many cores to use. It will default to just one if we do not specify otherwise.
To use e.g. 3 cores:
./Jackknife +RTS -N3
If we want to use all cores:
./Jackknife +RTS -N
Meanwhile, par
and the monad-par
packages are not the only ways to achieve parallel speedups in Haskell.
There are other more "researchy" projects under way:
For multicore machines: Data Parallel Haskell
For GPU computing: the accelerate package
DPH is the name of a research project to support nested data parallelism under GHC on multicore CPUs.
Nested data parallelism extends the programming model of flat data parallelism, as known from modern Fortran dialects, to:
Irregular parallel computations (such as divide-and-conquer algorithms) and
Irregular data structures (such as sparse matrices and tree structures).
This project is still a work in progress, and is quite technically intricate.
The accelerate library defines an embedded language for regular, multi-dimensional array computations with multiple backends to facilitate high-performance implementations.
Currently, there are two backends:
An interpreter that serves as a reference implementation of the intended semantics of the language.
A CUDA backend generating code for CUDA-capable NVIDIA GPUs.
You'll sometimes hear uninformed people claim that functional programming is some kind of magic pixie dust for parallel programming.
This is not really true.
Certainly, aspects of the programming model make incidental parts of the problem easier.
However, tricky considerations like data dependencies, memory placement, and thread scheduling have every bit as insidious an effect on parallel performance in Haskell as in C.
Parallel programming is a subtle art no matter which language you do it in.