% Parsing
# Intro
Hi!
Office hours:
* Wednesdays, 14:30-15:00
* The room full of grad students next to Gates 290
* **Will not be present** Jan 20
# Let's talk about parsing
A whole lot of programming involves interacting with external sources
of data:
* Files containing junk
* Network peers sending us junk
Our sad duty is to try to make sense of this stuff.
# O hai, HTTP 1.1
The very first lines of BNF from RFC 2616:
~~~~ {.commonlisp}
Request = Request-Line ; Section 5.1
*(( general-header ; Section 4.5
| request-header ; Section 5.3
| entity-header ) CRLF) ; Section 7.1
CRLF
[ message-body ] ; Section 4.3
~~~~
(Never read an RFC? See
[RFC 2616 section 2](http://www.w3.org/Protocols/rfc2616/rfc2616-sec2.html#sec2)
for a quick tour of IETF BNF syntax.)
# Parsing by hand: C++ parser fragment
Long ago, the ancients wrote all of their parsers by hand, because
that was all they could do.
~~~~ {.cpp}
void Response::ProcessStatusLine( std::string const& line )
{
const char* p = line.c_str();
while( *p && *p == ' ' )
++p;
while( *p && *p != ' ' )
m_VersionString += *p++;
while( *p && *p == ' ' )
++p;
std::string status;
while( *p && *p != ' ' )
status += *p++;
while( *p && *p == ' ' )
++p;
while( *p )
m_Reason += *p++;
m_Status = atoi( status.c_str() );
if( m_Status < 100 || m_Status > 999 )
throw Wobbly( "BadStatusLine (%s)", line.c_str() );
if( m_VersionString == "HTTP:/1.0" )
m_Version = 10;
else if( 0==m_VersionString.compare( 0,7,"HTTP/1." ) )
m_Version = 11;
else
throw Wobbly( "UnknownProtocol (%s)", m_VersionString.c_str() );
m_State = HEADERS;
m_HeaderAccum.clear();
}
~~~~
# Parsing by hand: Java parser fragment, page 1
~~~~ {.java}
public class HttpResponseParser extends AbstractMessageParser {
private final HttpResponseFactory responseFactory;
private final CharArrayBuffer lineBuf;
public HttpResponseParser(
final SessionInputBuffer buffer,
final LineParser parser,
final HttpResponseFactory responseFactory,
final HttpParams params) {
super(buffer, parser, params);
if (responseFactory == null) {
throw new IllegalArgumentException("Response factory may not be null");
}
this.responseFactory = responseFactory;
this.lineBuf = new CharArrayBuffer(128);
}
~~~~
# Parsing by hand: Java parser fragment, page 2
~~~~ {.java}
protected HttpMessage parseHead(
final SessionInputBuffer sessionBuffer)
throws IOException, HttpException, ParseException {
this.lineBuf.clear();
int i = sessionBuffer.readLine(this.lineBuf);
if (i == -1) {
throw new NoHttpResponseException("The target server failed to respond");
}
//create the status line from the status string
ParserCursor cursor = new ParserCursor(0, this.lineBuf.length());
StatusLine statusline = lineParser.parseStatusLine(this.lineBuf, cursor);
return this.responseFactory.newHttpResponse(statusline, null);
}
}
~~~~
# Commentary
Hand-written parsers can be quite efficient.
But...
# Commentary
Hand-written parsers can be quite efficient.
But...
# Domain specific languages
~~~~
%%{
machine http_parser_common;
scheme = ( alpha | digit | "+" | "-" | "." )* ;
absolute_uri = (scheme ":" (uchar | reserved )*);
path = ( pchar+ ( "/" pchar* )* ) ;
query = ( uchar | reserved )* %query_string ;
param = ( pchar | "/" )* ;
params = ( param ( ";" param )* ) ;
rel_path = ( path? %request_path (";" params)? ) ("?" %start_query query)?;
absolute_path = ( "/"+ rel_path );
Request_URI = ( "*" | absolute_uri | absolute_path ) >mark %request_uri;
Fragment = ( uchar | reserved )* >mark %fragment;
Method = ( upper | digit | safe ){1,20} >mark %request_method;
http_number = ( digit+ "." digit+ ) ;
HTTP_Version = ( "HTTP/" http_number ) >mark %http_version ;
Request_Line = ( Method " " Request_URI ("#" Fragment){0,1} " " HTTP_Version CRLF ) ;
field_name = ( token -- ":" )+ >start_field %write_field;
field_value = any* >start_value %write_value;
message_header = field_name ":" " "* field_value :> CRLF;
Request = Request_Line ( message_header )* ( CRLF @done );
main := Request;
}%%
~~~~
# What was that?
That last slide used a DSL named Ragel.
We squeezed most of a parser into one slide!
Problems:
* A completely new language.
* Generated code is hard to follow.
* Error reporting and recovery can be nasty.
Advantages:
* Concise.
* DSL-powered parsers (bison, antlr) can optimize parsers and find
certain classes of grammar bugs.
# Parsing in Haskell
Let's build a parser of our own.
The dumbest parser we can think of:
* Accept a string to match against,
* and another string as input,
* then tell us if the input string matches our first string.
What should this parser's type signature be?
# Parsing in Haskell
Let's build a parser of our own.
The dumbest parser we can think of:
* Accept a string to match against,
* and another string as input,
* then tell us if the input string matches our first string.
What should this parser's type signature be?
~~~~ {.haskell}
string :: String -> String -> Bool
~~~~
(Is this a *good* type signature??)
# Parsing a number
Suppose we want to parse a decimal number.
What should our type signature be?
# Parsing a number
Suppose we want to parse a decimal number.
What should our type signature be?
~~~~ {.haskell}
int :: String -> Int
~~~~
(Is *this* a good type signature?)
# A concrete example
~~~~ {.haskell}
import Data.List
string :: String -> String -> Bool
string = (==)
number :: String -> Int
number = read
~~~~
# This is a dead end
We can't do anything useful with these functions.
What are we missing?
# Not is a dead end
We can't do anything useful with these functions.
What are we missing?
A new perspective:
* A parser consumes as much input as it can.
* It converts the input as necessary.
* It returns the result of the conversion.
But most importantly:
* It *also* returns *the leftover input* that it could not consume.
# Abstractly
Given an input `s` and a desired result `a`, we could model a parser
via this type:
~~~~ {.haskell}
type Parser s a = s -> (a, s)
~~~~
What are we missing?
# Abstractly
Given an input `s` and a desired result `a`, we could model a parser
via this type:
~~~~ {.haskell}
type Parser s a = s -> (a, s)
~~~~
What are we missing?
* Parsers can fail.
~~~~ {.haskell}
type Parser s a = s -> Maybe (a, s)
~~~~
# Less crappy
~~~~ {.haskell}
import Data.Char
import Data.List
import Data.Maybe
type Parser s a = s -> Maybe (a,s)
string :: String -> Parser String String
string pat input =
case stripPrefix pat input of
Nothing -> Nothing
Just rest -> Just (pat, rest)
~~~~
Please supply a type signature and body for `number`. You have 2 minutes.
~~~~ {.haskell}
number = undefined
~~~~
Skeleton: [`http://cs240h.scs.stanford.edu/DumbP.hs`](http://cs240h.scs.stanford.edu/DumbP.hs)
# Your turn!
Build a small parser from `string` and `number`.
If you call:
~~~~ {.haskell}
version "HTTP/1.1\r\n"
~~~~
You should get back a result containing:
~~~~ {.haskell}
(1,1)
~~~~
You have 5 minutes.
# Horrible first version
Look at this staircase of `case` expressions!
~~~~ {.haskell}
versionDumb i0 =
case string "HTTP/" i0 of
Nothing -> Nothing
Just (_,i1) ->
case number i1 of
Nothing -> Nothing
Just (maj,i2) ->
case string "." i2 of
Nothing -> Nothing
Just (n,i3) ->
case number i3 of
Nothing -> Nothing
Just (min,i4) -> Just ((maj,min),i4)
~~~~
What's the pattern here?
# Noticing things
~~~~ {.haskell}
versionDumb i0 =
case string "HTTP/" i0 of
Nothing -> Nothing
Just (_,i1) ->
case number i1 of
{- ... -}
~~~~
We have:
* A parser is fed some input
* We do `case` analysis of the result
* On success, we pass the leftover input to the next parser
# Turn the boilerplate into a function
~~~~ {.haskell}
andThen :: Parser s a -> (a -> Parser s b) -> Parser s b
andThen parse next = \input ->
case parse input of
Nothing -> Nothing
Just (a, input') -> next a input'
~~~~
# Now what?
~~~~ {.haskell}
version2 =
string "HTTP/" `andThen` \_ ->
number `andThen` \maj ->
string "." `andThen` \_ ->
number `andThen` \min ->
{- ... give back (maj,min) -}
~~~~
From 12 lines to 4!
All that's missing:
* How do we construct the `(maj,min)` tuple?
# Handing back a result
We need to stuff our result into a `Parser` box:
~~~~ {.haskell}
stuff :: a -> Parser s a
stuff a = \input -> Just (a, input)
~~~~
Thus:
~~~~ {.haskell}
version2 =
string "HTTP/" `andThen` \_ ->
number `andThen` \maj ->
string "." `andThen` \_ ->
number `andThen` \min ->
stuff (maj,min)
~~~~
# Types
Look at this:
~~~~ {.haskell}
andThen :: Parser s a -> (a -> Parser s b) -> Parser s b
(>>=) :: Monad m =>
m a -> (a -> m b) -> m b
~~~~
And this:
~~~~ {.haskell}
stuff :: a -> Parser s a
return :: Monad m => a -> m a
~~~~
# So close!
# Making a "proper" monad
We **don't** want to write a `Monad` instance for this type:
~~~~ {.haskell}
type Parser s a = s -> (a, s)
~~~~
If we did, *every* function from `s` to `(a,s)` would be an instance
of our weird monad.
Instead of `type`, let's use `newtype`.
~~~~ {.haskell}
-- type Parser s a = s -> Maybe (a, s)
newtype Parser s a = P { runP :: s -> Maybe (a, s) }
~~~~
Our `Monad` instance:
~~~~ {.haskell}
instance Monad (Parser s) where
(>>=) = bind
return = shove
~~~~
# return
Let's look at our old parser (now named `OldParser`) and our new
`Parser` side by side.
~~~~ {.haskell}
stuff :: a -> OldParser s a
shove :: a -> Parser s a
stuff a = \input -> Just (a, input)
shove a = P $ \input -> Just (a, input)
~~~~
It should be clear that the *only* difference is some `newtype`
machinery.
# bind
~~~~ {.haskell}
andThen :: OldParser s a -> (a -> OldParser s b) -> OldParser s b
bind :: Parser s a -> (a -> Parser s b) -> Parser s b
andThen parse next = \input ->
case parse input of
Nothing -> Nothing
Just (a, input') -> (next a) input'
bind parse next = P $ \input ->
case runP parse input of
Nothing -> Nothing
Just (a, input') -> runP (next a) input'
~~~~
What should the `Functor` instance look like?
~~~~ {.haskell}
class Functor where
fmap :: (a -> b) -> f a -> f b
~~~~
# Functor
~~~~ {.haskell}
instance Functor (Parser s) where
fmap f parser = P $ \input ->
case runP parser input of
Nothing -> Nothing
Just (a, input') -> Just (f a, input')
~~~~
# Revisiting our smaller parser
Before:
~~~~ {.haskell}
version2 =
string "HTTP/" `andThen` \_ ->
number `andThen` \maj ->
string "." `andThen` \_ ->
number `andThen` \min ->
stuff (maj,min)
~~~~
Suppose we plumb `P` and `runP` into the right places in `string` and
`number`.
After:
~~~~ {.haskell}
version3 = do
string "HTTP/"
maj <- number
string "."
min <- number
return (maj,min)
~~~~
# Can we do better?
We needed our `Functor` instance so we can write this:
~~~~ {.haskell}
import Control.Applicative
import Control.Monad (ap)
instance Applicative (Parser s) where
pure = return
(<*>) = ap
~~~~
Our parser now becomes:
~~~~ {.haskell}
version4 = (,) <$>
(string "HTTP/" *> number <* string ".") <*>
number
~~~~
Nice!
# What next? A small trick!
Let's write less-polymorphic combinators to aid our notation:
~~~~ {.haskell}
(.*>) :: String -> Parser String b -> Parser String b
a .*> b = string a *> b
infixl 4 .*>
(<*.) :: Parser String a -> String -> Parser String a
a <*. b = a <* string b
infixl 4 <*.
~~~~
And finally:
~~~~ {.haskell}
version5 = (,) <$> ("HTTP/" .*> number <*. ".")
<*> number
~~~~
From 12 lines to two!
# Your turn: Alternative
How hard is it to build a parser that can choose another parse if the
first one fails?
~~~~ {.haskell}
class Applicative f => Alternative f where
empty :: f a
(<|>) :: f a -> f a -> f a
~~~~
Take 5 minutes to write your own.
~~~~ {.haskell}
instance Alternative (Parser s) where
{- ... -}
~~~~
Skeleton: [`http://cs240h.scs.stanford.edu/P.hs`](http://cs240h.scs.stanford.edu/P.hs)
Desired behaviour:
~~~~ {.haskell}
>>> let p = string "foo" <|> string "bar"
>>> runP p "foowhee"
Just ("foo","whee")
>>> runP p "quuxly"
Nothing
>>> runP p "barely"
Just ("bar","ely")
~~~~
# Alternative
My implementation took about one minute:
~~~~ {.haskell}
instance Alternative (Parser s) where
empty = P $ \_ -> Nothing
f <|> g = P $ \input ->
case runP f input of
Nothing -> runP g input
result -> result
~~~~
# Parsec
There's a long history of functional programmers writing
parsing libraries.
Parsec
([Leijen 2001](http://legacy.cs.uu.nl/daan/download/papers/parsec-paper.pdf))
was the first to be practical for real world use.
# Parsec and attoparsec
Inspired by Parsec, I wrote a specialized derivative named
[attoparsec](http://hackage.haskell.org/package/attoparsec).
How do the two differ?
* attoparsec is optimized for *fast* stream and file parsing.
* attoparsec focuses on the `ByteString` and `Text` types.
* Parsec is more general: it can parse `String` too (in fact
arbitrary token types).
* attoparsec does *not* attempt to give friendly error messages (no
file names or line numbers). It's aimed at data generated by machines.
* Parsec might be a better choice for e.g. parsing source files, where
the friendliness/performance tradeoff is a little different.
# Mise en scene
Typical network protocol problem:
* You receive a TCP segment off the network
* You need to decode a variable-length message that lacks message
length information (e.g. a JSON blob)
* How do you tell whether your packet/segment contains enough data to
parse the whole message?
More importantly:
* How messed-up does your parser become to accommodate this need?
# One approach
What Parsec does:
* Built as a *monad transformer* (stackable on top of another monad)
* Combines parsing with ability to perform IO (e.g. read more TCP
segments)
# Monad transformers
In general:
* The composition of two monads is *not* itself a monad.
* Monad transfomers allow us to "stack" a series of "transformers" on
top of a base monad, so that the stacked combination *is* a monad.
For our Parsec case, the base monad is `IO`, and the transformer
stacked on top is `ParsecT`.
Read more on Haskellwiki if you need to know.
# Monad transformers
In general:
* The composition of two monads is *not* itself a monad.
* Monad transfomers allow us to "stack" a series of "transformers" on
top of a base monad, so that the stacked combination *is* a monad.
For our Parsec case, the base monad is `IO`, and the transformer
stacked on top is `ParsecT`.
Read more on Haskellwiki **if you dare**.
# A simple attoparsec parser
Parse the `Request-Line` of an HTTP 1.1 request:
~~~~ {.haskell}
{-# LANGUAGE OverloadedStrings #-}
import Data.Attoparsec.ByteString.Char8 as A
import Control.Applicative
request = (,,) <$>
(verb <* skipSpace) <*>
(url <* skipSpace) <*>
(version <* endOfLine)
verb = "GET" <|> "POST"
url = takeTill A.isSpace
version = (,) <$> ("HTTP/" *> decimal) <*> ("." *> decimal)
~~~~
# Let's try this in ghci
~~~~ {.haskell}
>>> :load AttoHttp.hs
>>> parse request "GET /foo HTTP/1.1\r\n"
Done "" ("GET","/foo",(1,1))
~~~~
What happens on incomplete input?
~~~~ {.haskell}
>>> parse request "GET /foo HTTP/"
Partial _
~~~~
# Feeding more input
The `Partial` constructor has a parameter that is a function. We can
use `feed` to supply it with more input.
~~~~ {.haskell}
>>> let r = parse request "GET /foo HTTP/"
>>> r
Partial _
>>> r `feed` "1.1\r\n"
Done "" ("GET","/foo",(1,1))
~~~~
# How does this work?
# Basic continuation based parsing
Instead of returning a value:
* If a parse succeeds or fails, we *call a function*.
~~~~ {.haskell}
{-# LANGUAGE RankNTypes #-}
type ErrMsg = String
newtype ContP a = ContP {
runP :: forall r. -- wat
String -- input
-> (ErrMsg -> Either ErrMsg r) -- failure
-> (String -> a -> Either ErrMsg r) -- success
-> Either ErrMsg r -- result
}
~~~~
# Rank-1 types
For polymorphic Haskell functions, there is a type quantifier that is
not written.
~~~~ {.haskell}
id :: a -> a
~~~~
Really means:
~~~~ {.haskell}
id :: forall a. a -> a
~~~~
And `forall` is a type-level lambda, corresponding to the value-level
lambda "`\`".
So in the above signature, we say:
* "Caller, provide me a type and I'll bind it to the type variable `a`."
We call this a *rank-1 type* because the `forall` is present at the
outermost level, or rank.
# Rank-2 types
Our continuation-based parser has a rank-2 type.
~~~~ {.haskell}
newtype ContP a = ContP {
runP :: forall r. -- wat
String -- input
-> (ErrMsg -> Either ErrMsg r) -- failure
-> (String -> a -> Either ErrMsg r) -- success
-> Either ErrMsg r -- result
}
~~~~
The *caller* of `runP` cannot pick the type of `r`.
* Instead, the *callee* controls it.
# Continuations are tricky
Working with continuations is tough.
Here's a simple example:
~~~~ {.haskell}
instance Functor ContP where
fmap f p = ContP $ \bs0 fail0 succ0 ->
runP p bs0 fail0 $ \bs1 a -> succ0 bs1 (f a)
~~~~
Help me fumble through a `Monad` instance!
# Attoparsec internals: the types
~~~~ {.haskell}
newtype Parser t a = Parser {
runParser :: forall r. Input t -> Added t -> More
-> Failure t r
-> Success t a r
-> IResult t r
}
type Failure t r = Input t -> Added t -> More
-> [String] -> String -> IResult t r
type Success t a r = Input t -> Added t -> More
-> a -> IResult t r
newtype Input t = I {unI :: t} deriving (Monoid)
newtype Added t = A {unA :: t} deriving (Monoid)
-- | Have we reached EOF?
data More = Complete | Incomplete
deriving (Eq, Show)
data IResult t r = Fail t [String] String
| Partial (t -> IResult t r)
| Done t r
~~~~
# Types I
We have to somehow make `feed` a function that is possible to write.
If we are given more input, we use these types to track it.
~~~~ {.haskell}
newtype Input t = I {unI :: t} deriving (Monoid)
newtype Added t = A {unA :: t} deriving (Monoid)
~~~~
`Input` is all the input we've ever seen.
`Added` is the *extra* input we've been given via `feed`.
We keep track of the two separately because parsing can fail.
* When parsing fails and we try another branch, we *backtrack*.
When we backtrack, we throw away the `Input` state.
We tack all the data our caller `Added` back on again, so we won't
forget it.
# Types I.5
Suppose our caller runs out of input to give us on one branch.
But our parse fails.
We need to remember this so that after we backtrack, we won't ask for
more input on the other side of the branch.
~~~~ {.haskell}
data More = Complete | Incomplete
deriving (Eq, Show)
~~~~
# Types II
If a parse fails, we report an error message, along with a stack of
context information that might help with debugging.
~~~~ {.haskell}
type Failure t r = Input t -> Added t -> More
-> [String] -> String -> IResult t r
~~~~
If the parse succeeds, we call our successor continuation with our
result.
~~~~ {.haskell}
type Success t a r = Input t -> Added t -> More
-> a -> IResult t r
~~~~
The result is just a fancier `Either` with the possibility of saying
"feed me more before I answer you".
~~~~ {.haskell}
data IResult t r = Fail t [String] String
| Partial (t -> IResult t r)
| Done t r
~~~~
# Running and ending a parse
As is usually the case with monads, the user-visible "run this monad"
function is quite simple.
~~~~ {.haskell}
parse :: Monoid t => Parser t a -> t -> IResult t a
parse m s = runParser m (I s) (A mempty) Incomplete failK successK
~~~~
The only slight complication:
* We need to create "terminal continuations"
* i.e. continuations that *do not* chain up yet another continuation,
but instead return us to our usual mode of computation
~~~~ {.haskell}
failK :: Failure t a
failK i0 _a0 _m0 stack msg = Fail (unI i0) stack msg
successK :: Success t a a
successK i0 _a0 _m0 a = Done (unI i0) a
~~~~
# Lab 1
http://www.scs.stanford.edu/16wi-cs240h/labs/lab1.html
Write a simple Haskell version of the UNIX `tr` program.