Parsing

Intro

Hi!

Office hours:

Let’s talk about parsing

A whole lot of programming involves interacting with external sources of data:

Our sad duty is to try to make sense of this stuff.

O hai, HTTP 1.1

The very first lines of BNF from RFC 2616:

Request = Request-Line              ; Section 5.1
          *(( general-header        ; Section 4.5
           | request-header         ; Section 5.3
           | entity-header ) CRLF)  ; Section 7.1
          CRLF
          [ message-body ]          ; Section 4.3

(Never read an RFC? See RFC 2616 section 2 for a quick tour of IETF BNF syntax.)

Parsing by hand: C++ parser fragment

Long ago, the ancients wrote all of their parsers by hand, because that was all they could do.

void Response::ProcessStatusLine( std::string const& line )
{
    const char* p = line.c_str();

    while( *p && *p == ' ' )
        ++p;

    while( *p && *p != ' ' )
        m_VersionString += *p++;
    while( *p && *p == ' ' )
        ++p;

    std::string status;
    while( *p && *p != ' ' )
        status += *p++;
    while( *p && *p == ' ' )
        ++p;

    while( *p )
        m_Reason += *p++;

    m_Status = atoi( status.c_str() );
    if( m_Status < 100 || m_Status > 999 )
        throw Wobbly( "BadStatusLine (%s)", line.c_str() );

    if( m_VersionString == "HTTP:/1.0" )
        m_Version = 10;
    else if( 0==m_VersionString.compare( 0,7,"HTTP/1." ) )
        m_Version = 11;
    else
        throw Wobbly( "UnknownProtocol (%s)", m_VersionString.c_str() );

    m_State = HEADERS;
    m_HeaderAccum.clear();
}

Parsing by hand: Java parser fragment, page 1

public class HttpResponseParser extends AbstractMessageParser {

    private final HttpResponseFactory responseFactory;
    private final CharArrayBuffer lineBuf;

    public HttpResponseParser(
            final SessionInputBuffer buffer,
            final LineParser parser,
            final HttpResponseFactory responseFactory,
            final HttpParams params) {
        super(buffer, parser, params);
        if (responseFactory == null) {
            throw new IllegalArgumentException("Response factory may not be null");
        }
        this.responseFactory = responseFactory;
        this.lineBuf = new CharArrayBuffer(128);
    }

Parsing by hand: Java parser fragment, page 2

    protected HttpMessage parseHead(
            final SessionInputBuffer sessionBuffer)
        throws IOException, HttpException, ParseException {

        this.lineBuf.clear();
        int i = sessionBuffer.readLine(this.lineBuf);
        if (i == -1) {
            throw new NoHttpResponseException("The target server failed to respond");
        }
        //create the status line from the status string
        ParserCursor cursor = new ParserCursor(0, this.lineBuf.length());
        StatusLine statusline = lineParser.parseStatusLine(this.lineBuf, cursor);
        return this.responseFactory.newHttpResponse(statusline, null);
    }
}

Commentary

Hand-written parsers can be quite efficient.

But…

Commentary

Hand-written parsers can be quite efficient.

But…

Domain specific languages

%%{
  machine http_parser_common;

  scheme = ( alpha | digit | "+" | "-" | "." )* ;
  absolute_uri = (scheme ":" (uchar | reserved )*);

  path = ( pchar+ ( "/" pchar* )* ) ;
  query = ( uchar | reserved )* %query_string ;
  param = ( pchar | "/" )* ;
  params = ( param ( ";" param )* ) ;
  rel_path = ( path? %request_path (";" params)? ) ("?" %start_query query)?;
  absolute_path = ( "/"+ rel_path );

  Request_URI = ( "*" | absolute_uri | absolute_path ) >mark %request_uri;
  Fragment = ( uchar | reserved )* >mark %fragment;
  Method = ( upper | digit | safe ){1,20} >mark %request_method;

  http_number = ( digit+ "." digit+ ) ;
  HTTP_Version = ( "HTTP/" http_number ) >mark %http_version ;
  Request_Line = ( Method " " Request_URI ("#" Fragment){0,1} " " HTTP_Version CRLF ) ;

  field_name = ( token -- ":" )+ >start_field %write_field;

  field_value = any* >start_value %write_value;

  message_header = field_name ":" " "* field_value :> CRLF;

  Request = Request_Line ( message_header )* ( CRLF @done );

  main := Request;
}%%

What was that?

That last slide used a DSL named Ragel.

We squeezed most of a parser into one slide!

Problems:

Advantages:

Parsing in Haskell

Let’s build a parser of our own.

The dumbest parser we can think of:

What should this parser’s type signature be?

Parsing in Haskell

Let’s build a parser of our own.

The dumbest parser we can think of:

What should this parser’s type signature be?

string :: String -> String -> Bool

(Is this a good type signature??)

Parsing a number

Suppose we want to parse a decimal number.

What should our type signature be?

Parsing a number

Suppose we want to parse a decimal number.

What should our type signature be?

int :: String -> Int

(Is this a good type signature?)

A concrete example

import Data.List

string :: String -> String -> Bool
string = (==)

number :: String -> Int
number = read

This is a dead end

We can’t do anything useful with these functions.

What are we missing?

Not is a dead end

We can’t do anything useful with these functions.

What are we missing?

A new perspective:

But most importantly:

Abstractly

Given an input s and a desired result a, we could model a parser via this type:

type Parser s a = s -> (a, s)

What are we missing?

Abstractly

Given an input s and a desired result a, we could model a parser via this type:

type Parser s a = s -> (a, s)

What are we missing?

type Parser s a = s -> Maybe (a, s)

Less crappy

import Data.Char
import Data.List
import Data.Maybe

type Parser s a = s -> Maybe (a,s)

string :: String -> Parser String String
string pat input =
  case stripPrefix pat input of
    Nothing   -> Nothing
    Just rest -> Just (pat, rest)

Please supply a type signature and body for number. You have 2 minutes.

number = undefined

Skeleton: http://cs240h.scs.stanford.edu/DumbP.hs

Your turn!

Build a small parser from string and number.

If you call:

version "HTTP/1.1\r\n"

You should get back a result containing:

(1,1)

You have 5 minutes.

Horrible first version

Look at this staircase of case expressions!

versionDumb i0 =
  case string "HTTP/" i0 of
    Nothing -> Nothing
    Just (_,i1) ->
      case number i1 of
        Nothing -> Nothing
        Just (maj,i2) ->
          case string "." i2 of
            Nothing -> Nothing
            Just (n,i3) ->
              case number i3 of
                Nothing -> Nothing
                Just (min,i4) -> Just ((maj,min),i4)

What’s the pattern here?

Noticing things

versionDumb i0 =
  case string "HTTP/" i0 of
    Nothing -> Nothing
    Just (_,i1) ->
      case number i1 of
        {- ... -}

We have:

Turn the boilerplate into a function

andThen :: Parser s a -> (a -> Parser s b) -> Parser s b
andThen parse next = \input ->
  case parse input of
    Nothing          -> Nothing
    Just (a, input') -> next a input'

Now what?

version2 =
  string "HTTP/" `andThen` \_ ->
  number `andThen` \maj ->
  string "." `andThen` \_ ->
  number `andThen` \min ->
  {- ... give back (maj,min) -}

From 12 lines to 4!

All that’s missing:

Handing back a result

We need to stuff our result into a Parser box:

stuff :: a -> Parser s a
stuff a = \input -> Just (a, input)

Thus:

version2 =
  string "HTTP/" `andThen` \_ ->
  number `andThen` \maj ->
  string "." `andThen` \_ ->
  number `andThen` \min ->
  stuff (maj,min)

Types

Look at this:

andThen :: Parser s a -> (a -> Parser s b) -> Parser s b
(>>=)   :: Monad m =>
           m        a -> (a -> m        b) -> m        b

And this:

stuff  ::            a -> Parser s a
return :: Monad m => a -> m        a

So close!

Making a “proper” monad

We don’t want to write a Monad instance for this type:

type Parser s a = s -> (a, s)

If we did, every function from s to (a,s) would be an instance of our weird monad.

Instead of type, let’s use newtype.

-- type Parser s a =             s -> Maybe (a, s)
newtype Parser s a = P { runP :: s -> Maybe (a, s) }

Our Monad instance:

instance Monad (Parser s) where
  (>>=)  = bind
  return = shove

return

Let’s look at our old parser (now named OldParser) and our new Parser side by side.

stuff :: a -> OldParser s a
shove :: a ->    Parser s a

stuff a =     \input -> Just (a, input)

shove a = P $ \input -> Just (a, input)

It should be clear that the only difference is some newtype machinery.

bind

andThen :: OldParser s a -> (a -> OldParser s b) -> OldParser s b
bind    ::    Parser s a -> (a ->    Parser s b) ->    Parser s b

andThen parse next =     \input ->
  case      parse input of
    Nothing          -> Nothing
    Just (a, input') ->      (next a) input'

bind parse next    = P $ \input ->
  case runP parse input of
    Nothing          -> Nothing
    Just (a, input') -> runP (next a) input'

What should the Functor instance look like?

class Functor where
  fmap :: (a -> b) -> f a -> f b

Functor

instance Functor (Parser s) where
  fmap f parser = P $ \input ->
    case runP parser input of
      Nothing          -> Nothing
      Just (a, input') -> Just (f a, input')

Revisiting our smaller parser

Before:

version2 =
  string "HTTP/" `andThen` \_ ->
  number `andThen` \maj ->
  string "." `andThen` \_ ->
  number `andThen` \min ->
  stuff (maj,min)

Suppose we plumb P and runP into the right places in string and number.

After:

version3 = do
  string "HTTP/"
  maj <- number
  string "."
  min <- number
  return (maj,min)

Can we do better?

We needed our Functor instance so we can write this:

import Control.Applicative
import Control.Monad (ap)

instance Applicative (Parser s) where
  pure  = return
  (<*>) = ap

Our parser now becomes:

version4 = (,) <$>
           (string "HTTP/" *> number <* string ".") <*>
           number

Nice!

What next? A small trick!

Let’s write less-polymorphic combinators to aid our notation:

(.*>) :: String -> Parser String b -> Parser String b
a .*> b = string a *> b
infixl 4 .*>

(<*.) :: Parser String a -> String -> Parser String a
a <*. b = a <* string b
infixl 4 <*.

And finally:

version5 = (,) <$> ("HTTP/" .*> number <*. ".")
               <*> number

From 12 lines to two!

Your turn: Alternative

How hard is it to build a parser that can choose another parse if the first one fails?

class Applicative f => Alternative f where
  empty :: f a
  (<|>) :: f a -> f a -> f a

Take 5 minutes to write your own.

instance Alternative (Parser s) where
    {- ... -}

Skeleton: http://cs240h.scs.stanford.edu/P.hs

Desired behaviour:

>>> let p = string "foo" <|> string "bar"
>>> runP p "foowhee"
Just ("foo","whee")
>>> runP p "quuxly"
Nothing
>>> runP p "barely"
Just ("bar","ely")

Alternative

My implementation took about one minute:

instance Alternative (Parser s) where
    empty = P $ \_ -> Nothing

    f <|> g = P $ \input ->
      case runP f input of
        Nothing -> runP g input
        result  -> result

Parsec

There’s a long history of functional programmers writing parsing libraries.

Parsec (Leijen 2001) was the first to be practical for real world use.

Parsec and attoparsec

Inspired by Parsec, I wrote a specialized derivative named attoparsec.

How do the two differ?

Mise en scene

Typical network protocol problem:

More importantly:

One approach

What Parsec does:

Monad transformers

In general:

For our Parsec case, the base monad is IO, and the transformer stacked on top is ParsecT.

Read more on Haskellwiki if you need to know.

Monad transformers

In general:

For our Parsec case, the base monad is IO, and the transformer stacked on top is ParsecT.

Read more on Haskellwiki if you dare.

A simple attoparsec parser

Parse the Request-Line of an HTTP 1.1 request:

{-# LANGUAGE OverloadedStrings #-}

import Data.Attoparsec.ByteString.Char8 as A
import Control.Applicative

request = (,,) <$>
          (verb <* skipSpace) <*>
          (url <* skipSpace) <*>
          (version <* endOfLine)

verb = "GET" <|> "POST"
url = takeTill A.isSpace
version = (,) <$> ("HTTP/" *> decimal) <*> ("." *> decimal)

Let’s try this in ghci

>>> :load AttoHttp.hs
>>> parse request "GET /foo HTTP/1.1\r\n"
Done "" ("GET","/foo",(1,1))

What happens on incomplete input?

>>> parse request "GET /foo HTTP/"
Partial _

Feeding more input

The Partial constructor has a parameter that is a function. We can use feed to supply it with more input.

>>> let r = parse request "GET /foo HTTP/"
>>> r
Partial _
>>> r `feed` "1.1\r\n"
Done "" ("GET","/foo",(1,1))

How does this work?

Basic continuation based parsing

Instead of returning a value:

{-# LANGUAGE RankNTypes #-}

type ErrMsg = String

newtype ContP a = ContP {
    runP :: forall r.                        -- wat
            String                           -- input
         -> (ErrMsg -> Either ErrMsg r)      -- failure
         -> (String -> a -> Either ErrMsg r) -- success
         -> Either ErrMsg r                  -- result
  }

Rank-1 types

For polymorphic Haskell functions, there is a type quantifier that is not written.

id :: a -> a

Really means:

id :: forall a. a -> a

And forall is a type-level lambda, corresponding to the value-level lambda “\”.

So in the above signature, we say:

We call this a rank-1 type because the forall is present at the outermost level, or rank.

Rank-2 types

Our continuation-based parser has a rank-2 type.

newtype ContP a = ContP {
    runP :: forall r.                        -- wat
            String                           -- input
         -> (ErrMsg -> Either ErrMsg r)      -- failure
         -> (String -> a -> Either ErrMsg r) -- success
         -> Either ErrMsg r                  -- result
  }

The caller of runP cannot pick the type of r.

Continuations are tricky

Working with continuations is tough.

Here’s a simple example:

instance Functor ContP where
  fmap f p = ContP $ \bs0 fail0 succ0 ->
    runP p bs0 fail0 $ \bs1 a -> succ0 bs1 (f a)

Help me fumble through a Monad instance!

Attoparsec internals: the types

newtype Parser t a = Parser {
      runParser :: forall r. Input t -> Added t -> More
                -> Failure t   r
                -> Success t a r
                -> IResult t r
    }

type Failure t   r = Input t -> Added t -> More
                   -> [String] -> String -> IResult t r
type Success t a r = Input t -> Added t -> More
                   -> a -> IResult t r

newtype Input t = I {unI :: t} deriving (Monoid)
newtype Added t = A {unA :: t} deriving (Monoid)

-- | Have we reached EOF?
data More = Complete | Incomplete
            deriving (Eq, Show)

data IResult t r = Fail t [String] String
                 | Partial (t -> IResult t r)
                 | Done t r

Types I

We have to somehow make feed a function that is possible to write.

If we are given more input, we use these types to track it.

newtype Input t = I {unI :: t} deriving (Monoid)
newtype Added t = A {unA :: t} deriving (Monoid)

Input is all the input we’ve ever seen.

Added is the extra input we’ve been given via feed.

We keep track of the two separately because parsing can fail.

When we backtrack, we throw away the Input state.

We tack all the data our caller Added back on again, so we won’t forget it.

Types I.5

Suppose our caller runs out of input to give us on one branch.

But our parse fails.

We need to remember this so that after we backtrack, we won’t ask for more input on the other side of the branch.

data More = Complete | Incomplete
            deriving (Eq, Show)

Types II

If a parse fails, we report an error message, along with a stack of context information that might help with debugging.

type Failure t   r = Input t -> Added t -> More
                   -> [String] -> String -> IResult t r

If the parse succeeds, we call our successor continuation with our result.

type Success t a r = Input t -> Added t -> More
                   -> a -> IResult t r

The result is just a fancier Either with the possibility of saying “feed me more before I answer you”.

data IResult t r = Fail t [String] String
                 | Partial (t -> IResult t r)
                 | Done t r

Running and ending a parse

As is usually the case with monads, the user-visible “run this monad” function is quite simple.

parse :: Monoid t => Parser t a -> t -> IResult t a
parse m s = runParser m (I s) (A mempty) Incomplete failK successK

The only slight complication:

failK :: Failure t a
failK i0 _a0 _m0 stack msg = Fail (unI i0) stack msg

successK :: Success t a a
successK i0 _a0 _m0 a = Done (unI i0) a

Lab 1

http://www.scs.stanford.edu/16wi-cs240h/labs/lab1.html

Write a simple Haskell version of the UNIX tr program.