Intro

Hi!

Office hours:

Wednesdays, 14:30-15:00
The room full of grad students next to Gates 290
Will not be present Jan 20

Let’s talk about parsing

A whole lot of programming involves interacting with external sources of data:

Files containing junk
Network peers sending us junk

Our sad duty is to try to make sense of this stuff.

O hai, HTTP 1.1

The first lines of the request grammar (BNF) from RFC 2616, section 5:

Request = Request-Line              ; Section 5.1
          *(( general-header        ; Section 4.5
           | request-header         ; Section 5.3
           | entity-header ) CRLF)  ; Section 7.1
          CRLF
          [ message-body ]          ; Section 4.3

(Never read an RFC? See RFC 2616 section 2 for a quick tour of IETF BNF syntax.)

Parsing by hand: C++ parser fragment

Long ago, the ancients wrote all of their parsers by hand, because that was all they could do.

void Response::ProcessStatusLine( std::string const& line )
{
    const char* p = line.c_str();

    while( *p && *p == ' ' )
        ++p;

    while( *p && *p != ' ' )
        m_VersionString += *p++;
    while( *p && *p == ' ' )
        ++p;

    std::string status;
    while( *p && *p != ' ' )
        status += *p++;
    while( *p && *p == ' ' )
        ++p;

    while( *p )
        m_Reason += *p++;

    m_Status = atoi( status.c_str() );
    if( m_Status < 100 || m_Status > 999 )
        throw Wobbly( "BadStatusLine (%s)", line.c_str() );

    if( m_VersionString == "HTTP:/1.0" )
        m_Version = 10;
    else if( 0==m_VersionString.compare( 0,7,"HTTP/1." ) )
        m_Version = 11;
    else
        throw Wobbly( "UnknownProtocol (%s)", m_VersionString.c_str() );

    m_State = HEADERS;
    m_HeaderAccum.clear();
}

Parsing by hand: Java parser fragment, page 1

public class HttpResponseParser extends AbstractMessageParser {

    private final HttpResponseFactory responseFactory;
    private final CharArrayBuffer lineBuf;

    public HttpResponseParser(
            final SessionInputBuffer buffer,
            final LineParser parser,
            final HttpResponseFactory responseFactory,
            final HttpParams params) {
        super(buffer, parser, params);
        if (responseFactory == null) {
            throw new IllegalArgumentException("Response factory may not be null");
        }
        this.responseFactory = responseFactory;
        this.lineBuf = new CharArrayBuffer(128);
    }

Parsing by hand: Java parser fragment, page 2

    protected HttpMessage parseHead(
            final SessionInputBuffer sessionBuffer)
        throws IOException, HttpException, ParseException {

        this.lineBuf.clear();
        int i = sessionBuffer.readLine(this.lineBuf);
        if (i == -1) {
            throw new NoHttpResponseException("The target server failed to respond");
        }
        //create the status line from the status string
        ParserCursor cursor = new ParserCursor(0, this.lineBuf.length());
        StatusLine statusline = lineParser.parseStatusLine(this.lineBuf, cursor);
        return this.responseFactory.newHttpResponse(statusline, null);
    }
}

Commentary

Hand-written parsers can be quite efficient.

But…

Domain specific languages

%%{
  machine http_parser_common;

  scheme = ( alpha | digit | "+" | "-" | "." )* ;
  absolute_uri = (scheme ":" (uchar | reserved )*);

  path = ( pchar+ ( "/" pchar* )* ) ;
  query = ( uchar | reserved )* %query_string ;
  param = ( pchar | "/" )* ;
  params = ( param ( ";" param )* ) ;
  rel_path = ( path? %request_path (";" params)? ) ("?" %start_query query)?;
  absolute_path = ( "/"+ rel_path );

  Request_URI = ( "*" | absolute_uri | absolute_path ) >mark %request_uri;
  Fragment = ( uchar | reserved )* >mark %fragment;
  Method = ( upper | digit | safe ){1,20} >mark %request_method;

  http_number = ( digit+ "." digit+ ) ;
  HTTP_Version = ( "HTTP/" http_number ) >mark %http_version ;
  Request_Line = ( Method " " Request_URI ("#" Fragment){0,1} " " HTTP_Version CRLF ) ;

  field_name = ( token -- ":" )+ >start_field %write_field;

  field_value = any* >start_value %write_value;

  message_header = field_name ":" " "* field_value :> CRLF;

  Request = Request_Line ( message_header )* ( CRLF @done );

  main := Request;
}%%

What was that?

That last slide used a DSL named Ragel.

We squeezed most of a parser into one slide!

Problems:

A completely new language.
Generated code is hard to follow.
Error reporting and recovery can be nasty.

Advantages:

Concise.
DSL-powered parser generators (bison, ANTLR) can optimize the generated parser and detect certain classes of grammar bugs.

Parsing in Haskell

Let’s build a parser of our own.

The dumbest parser we can think of:

Accept a string to match against,
and another string as input,
then tell us if the input string matches our first string.

What should this parser’s type signature be?

string :: String -> String -> Bool

(Is this a good type signature??)

Parsing a number

Suppose we want to parse a decimal number.

What should our type signature be?

number :: String -> Int

(Is this a good type signature?)

A concrete example

import Data.List

string :: String -> String -> Bool
string = (==)

number :: String -> Int
number = read

This is a dead end

We can’t do anything useful with these functions.

What are we missing?

A new perspective:

A parser consumes as much input as it can.
It converts the input as necessary.
It returns the result of the conversion.

But most importantly:

It also returns the leftover input that it could not consume.

Abstractly

Given an input s and a desired result a, we could model a parser via this type:

type Parser s a = s -> (a, s)

What are we missing?

Parsers can fail.

type Parser s a = s -> Maybe (a, s)

Less crappy

import Data.Char
import Data.List
import Data.Maybe

type Parser s a = s -> Maybe (a,s)

string :: String -> Parser String String
string pat input =
  case stripPrefix pat input of
    Nothing   -> Nothing
    Just rest -> Just (pat, rest)

Please supply a type signature and body for number. You have 2 minutes.

number = undefined

Skeleton: http://cs240h.scs.stanford.edu/DumbP.hs
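One possible answer, for reference once you have tried it yourself. This is only a sketch: it assumes "number" means a non-empty run of decimal digits, with no sign and no overflow handling.

```haskell
import Data.Char (isDigit)

type Parser s a = s -> Maybe (a, s)

-- Consume the longest prefix of decimal digits.
-- Fail if the input does not start with a digit.
number :: Parser String Int
number input =
  case span isDigit input of
    ("", _)        -> Nothing
    (digits, rest) -> Just (read digits, rest)
```

In ghci, number "123abc" gives back the value 123 together with the leftover input "abc".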

Your turn!

Build a small parser from string and number.

If you call:

version "HTTP/1.1\r\n"

You should get back a result containing:

(1,1)

You have 5 minutes.

Horrible first version

Look at this staircase of case expressions!

versionDumb i0 =
  case string "HTTP/" i0 of
    Nothing -> Nothing
    Just (_,i1) ->
      case number i1 of
        Nothing -> Nothing
        Just (maj,i2) ->
          case string "." i2 of
            Nothing -> Nothing
            Just (n,i3) ->
              case number i3 of
                Nothing -> Nothing
                Just (min,i4) -> Just ((maj,min),i4)

What’s the pattern here?

Noticing things

versionDumb i0 =
  case string "HTTP/" i0 of
    Nothing -> Nothing
    Just (_,i1) ->
      case number i1 of
        {- ... -}

We have:

A parser is fed some input
We do case analysis of the result
On success, we pass the leftover input to the next parser

Turn the boilerplate into a function

andThen :: Parser s a -> (a -> Parser s b) -> Parser s b
andThen parse next = \input ->
  case parse input of
    Nothing          -> Nothing
    Just (a, input') -> next a input'

Now what?

version2 =
  string "HTTP/" `andThen` \_ ->
  number `andThen` \maj ->
  string "." `andThen` \_ ->
  number `andThen` \min ->
  {- ... give back (maj,min) -}

From 12 lines to 4!

All that’s missing:

How do we construct the (maj,min) tuple?

Handing back a result

We need to stuff our result into a Parser box:

stuff :: a -> Parser s a
stuff a = \input -> Just (a, input)

Thus:

version2 =
  string "HTTP/" `andThen` \_ ->
  number `andThen` \maj ->
  string "." `andThen` \_ ->
  number `andThen` \min ->
  stuff (maj,min)
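To try this out, here are all the pieces so far in one loadable file. The body of number is an assumption (it was left as an exercise earlier), and the tuple fields are renamed to major and minor only to avoid shadowing the Prelude's min.

```haskell
import Data.Char (isDigit)
import Data.List (stripPrefix)

type Parser s a = s -> Maybe (a, s)

string :: String -> Parser String String
string pat input =
  case stripPrefix pat input of
    Nothing   -> Nothing
    Just rest -> Just (pat, rest)

-- Assumed solution to the earlier exercise: one or more decimal digits.
number :: Parser String Int
number input =
  case span isDigit input of
    ("", _)        -> Nothing
    (digits, rest) -> Just (read digits, rest)

-- Run one parser, then feed its result and leftover input to the next.
andThen :: Parser s a -> (a -> Parser s b) -> Parser s b
andThen parse next = \input ->
  case parse input of
    Nothing          -> Nothing
    Just (a, input') -> next a input'

-- Succeed without consuming any input.
stuff :: a -> Parser s a
stuff a = \input -> Just (a, input)

version2 :: Parser String (Int, Int)
version2 =
  string "HTTP/" `andThen` \_ ->
  number `andThen` \major ->
  string "." `andThen` \_ ->
  number `andThen` \minor ->
  stuff (major, minor)
```

Calling version2 "HTTP/1.1\r\n" yields Just ((1,1),"\r\n"): the parsed pair plus the unconsumed line ending.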

Types

Look at this:

andThen :: Parser s a -> (a -> Parser s b) -> Parser s b
(>>=)   :: Monad m =>
           m        a -> (a -> m        b) -> m        b

And this:

stuff  ::            a -> Parser s a
return :: Monad m => a -> m        a

So close!

Making a “proper” monad

We don’t want to write a Monad instance for this type:

type Parser s a = s -> Maybe (a, s)

If we did, every function from s to Maybe (a, s) would be an instance of our weird monad.

Instead of type, let’s use newtype.

-- type Parser s a =             s -> Maybe (a, s)
newtype Parser s a = P { runP :: s -> Maybe (a, s) }

Our Monad instance:

instance Monad (Parser s) where
  (>>=)  = bind
  return = shove

return

Let’s look at our old parser (now named OldParser) and our new Parser side by side.

stuff :: a -> OldParser s a
shove :: a ->    Parser s a

stuff a =     \input -> Just (a, input)

shove a = P $ \input -> Just (a, input)

It should be clear that the only difference is some newtype machinery.

bind

andThen :: OldParser s a -> (a -> OldParser s b) -> OldParser s b
bind    ::    Parser s a -> (a ->    Parser s b) ->    Parser s b

andThen parse next =     \input ->
  case      parse input of
    Nothing          -> Nothing
    Just (a, input') ->      (next a) input'

bind parse next    = P $ \input ->
  case runP parse input of
    Nothing          -> Nothing
    Just (a, input') -> runP (next a) input'

What should the Functor instance look like?

class Functor f where
  fmap :: (a -> b) -> f a -> f b

Functor

instance Functor (Parser s) where
  fmap f parser = P $ \input ->
    case runP parser input of
      Nothing          -> Nothing
      Just (a, input') -> Just (f a, input')

Revisiting our smaller parser

Before:

version2 =
  string "HTTP/" `andThen` \_ ->
  number `andThen` \maj ->
  string "." `andThen` \_ ->
  number `andThen` \min ->
  stuff (maj,min)

Suppose we plumb P and runP into the right places in string and number.

After:

version3 = do
  string "HTTP/"
  maj <- number
  string "."
  min <- number
  return (maj,min)
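A sketch of what that plumbing might look like, as one loadable file. Two things here are assumptions rather than part of the slides: the body of number, and the Applicative instance, which current GHC requires before it will accept a Monad instance. The names bind and shove are inlined into the instances.

```haskell
import Control.Monad (ap)
import Data.Char (isDigit)
import Data.List (stripPrefix)

newtype Parser s a = P { runP :: s -> Maybe (a, s) }

instance Functor (Parser s) where
  fmap f parser = P $ \input ->
    case runP parser input of
      Nothing          -> Nothing
      Just (a, input') -> Just (f a, input')

instance Applicative (Parser s) where
  pure a = P $ \input -> Just (a, input)   -- same body as shove
  (<*>)  = ap

instance Monad (Parser s) where
  parse >>= next = P $ \input ->           -- same body as bind
    case runP parse input of
      Nothing          -> Nothing
      Just (a, input') -> runP (next a) input'

-- string and number are unchanged apart from the P wrapper.
string :: String -> Parser String String
string pat = P $ \input ->
  case stripPrefix pat input of
    Nothing   -> Nothing
    Just rest -> Just (pat, rest)

number :: Parser String Int
number = P $ \input ->
  case span isDigit input of
    ("", _)        -> Nothing
    (digits, rest) -> Just (read digits, rest)

version3 :: Parser String (Int, Int)
version3 = do
  _     <- string "HTTP/"
  major <- number
  _     <- string "."
  minor <- number
  return (major, minor)
```

Running runP version3 "HTTP/1.1\r\n" gives Just ((1,1),"\r\n"), the same answer as the andThen version.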

Can we do better?

We needed our Functor instance because Functor is a superclass of Applicative, which lets us write this:

import Control.Applicative
import Control.Monad (ap)

instance Applicative (Parser s) where
  pure  = return
  (<*>) = ap

Our parser now becomes:

version4 = (,) <$>
           (string "HTTP/" *> number <* string ".") <*>
           number

Nice!

What next? A small trick!

Let’s write less-polymorphic combinators to aid our notation:

(.*>) :: String -> Parser String b -> Parser String b
a .*> b = string a *> b
infixl 4 .*>

(<*.) :: Parser String a -> String -> Parser String a
a <*. b = a <* string b
infixl 4 <*.

And finally:

version5 = (,) <$> ("HTTP/" .*> number <*. ".")
               <*> number

From 12 lines to two!

Your turn: Alternative

How hard is it to build a parser that can choose another parse if the first one fails?

class Applicative f => Alternative f where
  empty :: f a
  (<|>) :: f a -> f a -> f a

Take 5 minutes to write your own.

instance Alternative (Parser s) where
    {- ... -}

Skeleton: http://cs240h.scs.stanford.edu/P.hs

Desired behaviour:

>>> let p = string "foo" <|> string "bar"
>>> runP p "foowhee"
Just ("foo","whee")
>>> runP p "quuxly"
Nothing
>>> runP p "barely"
Just ("bar","ely")

Alternative

My implementation took about one minute:

instance Alternative (Parser s) where
    empty = P $ \_ -> Nothing

    f <|> g = P $ \input ->
      case runP f input of
        Nothing -> runP g input
        result  -> result

Parsec

There’s a long history of functional programmers writing parsing libraries.

Parsec (Leijen 2001) was the first to be practical for real world use.

Parsec and attoparsec

Inspired by Parsec, I wrote a specialized derivative named attoparsec.

How do the two differ?

attoparsec is optimized for fast stream and file parsing.
attoparsec focuses on the ByteString and Text types.
Parsec is more general: it can parse String too (in fact arbitrary token types).
attoparsec does not attempt to give friendly error messages (no file names or line numbers). It’s aimed at data generated by machines.
Parsec might be a better choice for e.g. parsing source files, where the friendliness/performance tradeoff is a little different.

Mise en scène

Typical network protocol problem:

You receive a TCP segment off the network
You need to decode a variable-length message that lacks message length information (e.g. a JSON blob)
How do you tell whether your packet/segment contains enough data to parse the whole message?

More importantly:

How messed-up does your parser become to accommodate this need?

One approach

What Parsec does:

Built as a monad transformer (stackable on top of another monad)
Combines parsing with ability to perform IO (e.g. read more TCP segments)

Monad transformers

In general:

The composition of two monads is not itself a monad.
Monad transformers allow us to “stack” a series of “transformers” on top of a base monad, so that the stacked combination is a monad.

For our Parsec case, the base monad is IO, and the transformer stacked on top is ParsecT.

Read more on Haskellwiki if you need to know.
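To make the stacking idea concrete, here is a toy transformer built on our earlier Maybe-based parser. This is not Parsec's ParsecT: the names ParserT, liftBase, digits and loggedDigits are all hypothetical, and the sketch only shows how a parser layer can sit on top of an arbitrary base monad such as IO.

```haskell
import Data.Char (isDigit)

-- A parser whose result lives inside some base monad m.
newtype ParserT m a = ParserT { runParserT :: String -> m (Maybe (a, String)) }

instance Monad m => Functor (ParserT m) where
  fmap f p = ParserT $ \input -> do
    result <- runParserT p input
    return $ case result of
      Nothing          -> Nothing
      Just (a, input') -> Just (f a, input')

instance Monad m => Applicative (ParserT m) where
  pure a    = ParserT $ \input -> return (Just (a, input))
  pf <*> pa = pf >>= \f -> fmap f pa

instance Monad m => Monad (ParserT m) where
  p >>= next = ParserT $ \input -> do
    result <- runParserT p input
    case result of
      Nothing          -> return Nothing
      Just (a, input') -> runParserT (next a) input'

-- Run an action of the base monad without touching the input.
liftBase :: Monad m => m a -> ParserT m a
liftBase action = ParserT $ \input -> do
  a <- action
  return (Just (a, input))

-- Works over any base monad.
digits :: Monad m => ParserT m Int
digits = ParserT $ \input ->
  return $ case span isDigit input of
    ("", _)    -> Nothing
    (ds, rest) -> Just (read ds, rest)

-- Stacked on IO: parse a number and log it as a side effect.
loggedDigits :: ParserT IO Int
loggedDigits = do
  n <- digits
  liftBase (putStrLn ("parsed " ++ show n))
  return n
```

With IO as the base monad, the liftBase step could just as well read another chunk from a socket, which is the capability the previous slide is after.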

A simple attoparsec parser

Parse the Request-Line of an HTTP 1.1 request:

{-# LANGUAGE OverloadedStrings #-}

import Data.Attoparsec.ByteString.Char8 as A
import Control.Applicative

request = (,,) <$>
          (verb <* skipSpace) <*>
          (url <* skipSpace) <*>
          (version <* endOfLine)

verb = "GET" <|> "POST"
url = takeTill A.isSpace
version = (,) <$> ("HTTP/" *> decimal) <*> ("." *> decimal)

Let’s try this in ghci

>>> :load AttoHttp.hs
>>> parse request "GET /foo HTTP/1.1\r\n"
Done "" ("GET","/foo",(1,1))

What happens on incomplete input?

>>> parse request "GET /foo HTTP/"
Partial _

Feeding more input

The Partial constructor has a parameter that is a function. We can use feed to supply it with more input.

>>> let r = parse request "GET /foo HTTP/"
>>> r
Partial _
>>> r `feed` "1.1\r\n"
Done "" ("GET","/foo",(1,1))

How does this work?

Basic continuation based parsing

Instead of returning a value:

If a parse succeeds or fails, we call a function.

{-# LANGUAGE RankNTypes #-}

type ErrMsg = String

newtype ContP a = ContP {
    runP :: forall r.                        -- wat
            String                           -- input
         -> (ErrMsg -> Either ErrMsg r)      -- failure
         -> (String -> a -> Either ErrMsg r) -- success
         -> Either ErrMsg r                  -- result
  }

Rank-1 types

For polymorphic Haskell functions, there is a type quantifier that is not written.

id :: a -> a

Really means:

id :: forall a. a -> a

And forall is a type-level lambda, corresponding to the value-level lambda “\”.

So in the above signature, we say:

“Caller, provide me a type and I’ll bind it to the type variable a.”

We call this a rank-1 type because the forall is present at the outermost level, or rank.

Rank-2 types

Our continuation-based parser has a rank-2 type.

newtype ContP a = ContP {
    runP :: forall r.                        -- wat
            String                           -- input
         -> (ErrMsg -> Either ErrMsg r)      -- failure
         -> (String -> a -> Either ErrMsg r) -- success
         -> Either ErrMsg r                  -- result
  }

Whoever builds a ContP cannot pick the type of r: the wrapped function must work for every r.

Instead, the code that runs the parser (the caller of runP) picks it.
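A smaller illustration of rank-2 polymorphism, separate from the parser. The function applyBoth is hypothetical; it only shows that when a forall sits inside an argument type, the function receiving that argument decides how to instantiate it, and may do so more than once.

```haskell
{-# LANGUAGE RankNTypes #-}

-- The argument f must work for every element type x.
-- applyBoth itself chooses x, once as Bool and once as Char.
applyBoth :: (forall x. [x] -> Int) -> ([Bool], String) -> (Int, Int)
applyBoth f (bools, str) = (f bools, f str)
```

For instance, applyBoth length ([True,False], "abc") evaluates to (2,3). Without the inner forall, f would be fixed to a single element type and the second use would not typecheck.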

Continuations are tricky

Working with continuations is tough.

Here’s a simple example:

instance Functor ContP where
  fmap f p = ContP $ \bs0 fail0 succ0 ->
    runP p bs0 fail0 $ \bs1 a -> succ0 bs1 (f a)

Help me fumble through a Monad instance!
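Here is one way the Monad instance might come out, as a sketch rather than the answer from class. The Applicative instance is included because current GHC requires it; digit and runContP are hypothetical helpers added only so the instances can be exercised.

```haskell
{-# LANGUAGE RankNTypes #-}

import Data.Char (isDigit)

type ErrMsg = String

newtype ContP a = ContP {
    runP :: forall r.
            String                           -- input
         -> (ErrMsg -> Either ErrMsg r)      -- failure
         -> (String -> a -> Either ErrMsg r) -- success
         -> Either ErrMsg r                  -- result
  }

instance Functor ContP where
  fmap f p = ContP $ \bs0 fail0 succ0 ->
    runP p bs0 fail0 $ \bs1 a -> succ0 bs1 (f a)

instance Applicative ContP where
  pure a = ContP $ \bs0 _fail0 succ0 -> succ0 bs0 a
  pf <*> pa = ContP $ \bs0 fail0 succ0 ->
    runP pf bs0 fail0 $ \bs1 f ->
      runP pa bs1 fail0 $ \bs2 a -> succ0 bs2 (f a)

instance Monad ContP where
  -- Run p; on success, hand its leftover input and result to (k a),
  -- passing along the same failure and success continuations.
  p >>= k = ContP $ \bs0 fail0 succ0 ->
    runP p bs0 fail0 $ \bs1 a -> runP (k a) bs1 fail0 succ0

-- Hypothetical primitive: accept a single decimal digit.
digit :: ContP Char
digit = ContP $ \bs fail0 succ0 ->
  case bs of
    (c:rest) | isDigit c -> succ0 rest c
    _                    -> fail0 "expected a digit"

-- Hypothetical runner: supply terminal continuations.
runContP :: ContP a -> String -> Either ErrMsg (a, String)
runContP p input = runP p input Left (\rest a -> Right (a, rest))
```

Note that runContP instantiates r to (a, String) when it calls runP, while the instance bodies never get to assume anything about r.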

Attoparsec internals: the types

newtype Parser t a = Parser {
      runParser :: forall r. Input t -> Added t -> More
                -> Failure t   r
                -> Success t a r
                -> IResult t r
    }

type Failure t   r = Input t -> Added t -> More
                   -> [String] -> String -> IResult t r
type Success t a r = Input t -> Added t -> More
                   -> a -> IResult t r

newtype Input t = I {unI :: t} deriving (Monoid)
newtype Added t = A {unA :: t} deriving (Monoid)

-- | Have we reached EOF?
data More = Complete | Incomplete
            deriving (Eq, Show)

data IResult t r = Fail t [String] String
                 | Partial (t -> IResult t r)
                 | Done t r

Types I

We have to somehow make feed a function that is possible to write.

If we are given more input, we use these types to track it.

newtype Input t = I {unI :: t} deriving (Monoid)
newtype Added t = A {unA :: t} deriving (Monoid)

Input is all the input we’ve ever seen.

Added is the extra input we’ve been given via feed.

We keep track of the two separately because parsing can fail.

When parsing fails and we try another branch, we backtrack.

When we backtrack, we throw away the Input state.

We tack all the data our caller Added back on again, so we won’t forget it.

Types I.5

Suppose our caller runs out of input to give us on one branch.

But our parse fails.

We need to remember this so that after we backtrack, we won’t ask for more input on the other side of the branch.

data More = Complete | Incomplete
            deriving (Eq, Show)

Types II

If a parse fails, we report an error message, along with a stack of context information that might help with debugging.

type Failure t   r = Input t -> Added t -> More
                   -> [String] -> String -> IResult t r

If the parse succeeds, we call our successor continuation with our result.

type Success t a r = Input t -> Added t -> More
                   -> a -> IResult t r

The result is just a fancier Either with the possibility of saying “feed me more before I answer you”.

data IResult t r = Fail t [String] String
                 | Partial (t -> IResult t r)
                 | Done t r
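Given that result type, feed becomes straightforward to write. The version below is a sketch under the assumption that extra input handed to a finished parse is simply appended to its leftovers; awaitLine is a hypothetical hand-built Partial used only to exercise it, not something attoparsec provides.

```haskell
data IResult t r = Fail t [String] String
                 | Partial (t -> IResult t r)
                 | Done t r

-- Supply more input to a result.
feed :: Monoid t => IResult t r -> t -> IResult t r
feed (Fail t ctxs msg) more = Fail (t <> more) ctxs msg  -- still failed
feed (Partial k)       more = k more                     -- resume the parse
feed (Done t r)        more = Done (t <> more) r         -- keep as leftover

-- Toy resumable "parser": wait until a full line has arrived.
awaitLine :: String -> IResult String String
awaitLine acc =
  case break (== '\n') acc of
    (line, '\n' : rest) -> Done rest line
    _                   -> Partial (\more -> awaitLine (acc ++ more))
```

Starting from awaitLine "GET /foo" we get a Partial; feeding it " HTTP/1.1\nrest" produces a Done holding the completed line and the leftover "rest".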

Running and ending a parse

As is usually the case with monads, the user-visible “run this monad” function is quite simple.

parse :: Monoid t => Parser t a -> t -> IResult t a
parse m s = runParser m (I s) (A mempty) Incomplete failK successK

The only slight complication:

We need to create “terminal continuations”
i.e. continuations that do not chain up yet another continuation, but instead return us to our usual mode of computation

failK :: Failure t a
failK i0 _a0 _m0 stack msg = Fail (unI i0) stack msg

successK :: Success t a a
successK i0 _a0 _m0 a = Done (unI i0) a

Lab 1

http://www.scs.stanford.edu/16wi-cs240h/labs/lab1.html

Write a simple Haskell version of the UNIX tr program.