Hi!
Office hours:
Wednesdays, 14:30-15:00
The room full of grad students next to Gates 290
Will not be present Jan 20
A whole lot of programming involves interacting with external sources of data:
Files containing junk
Network peers sending us junk
Our sad duty is to try to make sense of this stuff.
The very first lines of BNF from RFC 2616:
Request = Request-Line ; Section 5.1
*(( general-header ; Section 4.5
| request-header ; Section 5.3
| entity-header ) CRLF) ; Section 7.1
CRLF
[ message-body ] ; Section 4.3
(Never read an RFC? See RFC 2616 section 2 for a quick tour of IETF BNF syntax.)
Long ago, the ancients wrote all of their parsers by hand, because that was all they could do.
void Response::ProcessStatusLine( std::string const& line )
{
const char* p = line.c_str();
while( *p && *p == ' ' )
++p;
while( *p && *p != ' ' )
m_VersionString += *p++;
while( *p && *p == ' ' )
++p;
std::string status;
while( *p && *p != ' ' )
status += *p++;
while( *p && *p == ' ' )
++p;
while( *p )
m_Reason += *p++;
m_Status = atoi( status.c_str() );
if( m_Status < 100 || m_Status > 999 )
throw Wobbly( "BadStatusLine (%s)", line.c_str() );
if( m_VersionString == "HTTP:/1.0" )
m_Version = 10;
else if( 0==m_VersionString.compare( 0,7,"HTTP/1." ) )
m_Version = 11;
else
throw Wobbly( "UnknownProtocol (%s)", m_VersionString.c_str() );
m_State = HEADERS;
m_HeaderAccum.clear();
}
public class HttpResponseParser extends AbstractMessageParser {
private final HttpResponseFactory responseFactory;
private final CharArrayBuffer lineBuf;
public HttpResponseParser(
final SessionInputBuffer buffer,
final LineParser parser,
final HttpResponseFactory responseFactory,
final HttpParams params) {
super(buffer, parser, params);
if (responseFactory == null) {
throw new IllegalArgumentException("Response factory may not be null");
}
this.responseFactory = responseFactory;
this.lineBuf = new CharArrayBuffer(128);
}
protected HttpMessage parseHead(
final SessionInputBuffer sessionBuffer)
throws IOException, HttpException, ParseException {
this.lineBuf.clear();
int i = sessionBuffer.readLine(this.lineBuf);
if (i == -1) {
throw new NoHttpResponseException("The target server failed to respond");
}
//create the status line from the status string
ParserCursor cursor = new ParserCursor(0, this.lineBuf.length());
StatusLine statusline = lineParser.parseStatusLine(this.lineBuf, cursor);
return this.responseFactory.newHttpResponse(statusline, null);
}
}
Hand-written parsers can be quite efficient.
But…
Hand-written parsers can be quite efficient.
But…
%%{
machine http_parser_common;
scheme = ( alpha | digit | "+" | "-" | "." )* ;
absolute_uri = (scheme ":" (uchar | reserved )*);
path = ( pchar+ ( "/" pchar* )* ) ;
query = ( uchar | reserved )* %query_string ;
param = ( pchar | "/" )* ;
params = ( param ( ";" param )* ) ;
rel_path = ( path? %request_path (";" params)? ) ("?" %start_query query)?;
absolute_path = ( "/"+ rel_path );
Request_URI = ( "*" | absolute_uri | absolute_path ) >mark %request_uri;
Fragment = ( uchar | reserved )* >mark %fragment;
Method = ( upper | digit | safe ){1,20} >mark %request_method;
http_number = ( digit+ "." digit+ ) ;
HTTP_Version = ( "HTTP/" http_number ) >mark %http_version ;
Request_Line = ( Method " " Request_URI ("#" Fragment){0,1} " " HTTP_Version CRLF ) ;
field_name = ( token -- ":" )+ >start_field %write_field;
field_value = any* >start_value %write_value;
message_header = field_name ":" " "* field_value :> CRLF;
Request = Request_Line ( message_header )* ( CRLF @done );
main := Request;
}%%
That last slide used a DSL named Ragel.
We squeezed most of a parser into one slide!
Problems:
A completely new language.
Generated code is hard to follow.
Error reporting and recovery can be nasty.
Advantages:
Concise.
DSL-powered parsers (bison, antlr) can optimize parsers and find certain classes of grammar bugs.
Let’s build a parser of our own.
The dumbest parser we can think of:
Accept a string to match against,
and another string as input,
then tell us if the input string matches our first string.
What should this parser’s type signature be?
Let’s build a parser of our own.
The dumbest parser we can think of:
Accept a string to match against,
and another string as input,
then tell us if the input string matches our first string.
What should this parser’s type signature be?
string :: String -> String -> Bool
(Is this a good type signature??)
Suppose we want to parse a decimal number.
What should our type signature be?
Suppose we want to parse a decimal number.
What should our type signature be?
int :: String -> Int
(Is this a good type signature?)
import Data.List
string :: String -> String -> Bool
string = (==)
number :: String -> Int
number = read
We can’t do anything useful with these functions.
What are we missing?
We can’t do anything useful with these functions.
What are we missing?
A new perspective:
A parser consumes as much input as it can.
It converts the input as necessary.
It returns the result of the conversion.
But most importantly:
Given an input s
and a desired result a
, we could model a parser via this type:
type Parser s a = s -> (a, s)
What are we missing?
Given an input s
and a desired result a
, we could model a parser via this type:
type Parser s a = s -> (a, s)
What are we missing?
type Parser s a = s -> Maybe (a, s)
import Data.Char
import Data.List
import Data.Maybe
type Parser s a = s -> Maybe (a,s)
string :: String -> Parser String String
string pat input =
case stripPrefix pat input of
Nothing -> Nothing
Just rest -> Just (pat, rest)
Please supply a type signature and body for number
. You have 2 minutes.
number = undefined
Skeleton: http://cs240h.scs.stanford.edu/DumbP.hs
Build a small parser from string
and number
.
If you call:
version "HTTP/1.1\r\n"
You should get back a result containing:
(1,1)
You have 5 minutes.
Look at this staircase of case
expressions!
versionDumb i0 =
case string "HTTP/" i0 of
Nothing -> Nothing
Just (_,i1) ->
case number i1 of
Nothing -> Nothing
Just (maj,i2) ->
case string "." i2 of
Nothing -> Nothing
Just (n,i3) ->
case number i3 of
Nothing -> Nothing
Just (min,i4) -> Just ((maj,min),i4)
What’s the pattern here?
versionDumb i0 =
case string "HTTP/" i0 of
Nothing -> Nothing
Just (_,i1) ->
case number i1 of
{- ... -}
We have:
A parser is fed some input
We do case
analysis of the result
On success, we pass the leftover input to the next parser
andThen :: Parser s a -> (a -> Parser s b) -> Parser s b
andThen parse next = \input ->
case parse input of
Nothing -> Nothing
Just (a, input') -> next a input'
version2 =
string "HTTP/" `andThen` \_ ->
number `andThen` \maj ->
string "." `andThen` \_ ->
number `andThen` \min ->
{- ... give back (maj,min) -}
From 12 lines to 4!
All that’s missing:
(maj,min)
tuple?We need to stuff our result into a Parser
box:
stuff :: a -> Parser s a
stuff a = \input -> Just (a, input)
Thus:
version2 =
string "HTTP/" `andThen` \_ ->
number `andThen` \maj ->
string "." `andThen` \_ ->
number `andThen` \min ->
stuff (maj,min)
Look at this:
andThen :: Parser s a -> (a -> Parser s b) -> Parser s b
(>>=) :: Monad m =>
m a -> (a -> m b) -> m b
And this:
stuff :: a -> Parser s a
return :: Monad m => a -> m a
We don’t want to write a Monad
instance for this type:
type Parser s a = s -> (a, s)
If we did, every function from s
to (a,s)
would be an instance of our weird monad.
Instead of type
, let’s use newtype
.
-- type Parser s a = s -> Maybe (a, s)
newtype Parser s a = P { runP :: s -> Maybe (a, s) }
Our Monad
instance:
instance Monad (Parser s) where
(>>=) = bind
return = shove
Let’s look at our old parser (now named OldParser
) and our new Parser
side by side.
stuff :: a -> OldParser s a
shove :: a -> Parser s a
stuff a = \input -> Just (a, input)
shove a = P $ \input -> Just (a, input)
It should be clear that the only difference is some newtype
machinery.
andThen :: OldParser s a -> (a -> OldParser s b) -> OldParser s b
bind :: Parser s a -> (a -> Parser s b) -> Parser s b
andThen parse next = \input ->
case parse input of
Nothing -> Nothing
Just (a, input') -> (next a) input'
bind parse next = P $ \input ->
case runP parse input of
Nothing -> Nothing
Just (a, input') -> runP (next a) input'
What should the Functor
instance look like?
class Functor where
fmap :: (a -> b) -> f a -> f b
instance Functor (Parser s) where
fmap f parser = P $ \input ->
case runP parser input of
Nothing -> Nothing
Just (a, input') -> Just (f a, input')
Before:
version2 =
string "HTTP/" `andThen` \_ ->
number `andThen` \maj ->
string "." `andThen` \_ ->
number `andThen` \min ->
stuff (maj,min)
Suppose we plumb P
and runP
into the right places in string
and number
.
After:
version3 = do
string "HTTP/"
maj <- number
string "."
min <- number
return (maj,min)
We needed our Functor
instance so we can write this:
import Control.Applicative
import Control.Monad (ap)
instance Applicative (Parser s) where
pure = return
(<*>) = ap
Our parser now becomes:
version4 = (,) <$>
(string "HTTP/" *> number <* string ".") <*>
number
Nice!
Let’s write less-polymorphic combinators to aid our notation:
(.*>) :: String -> Parser String b -> Parser String b
a .*> b = string a *> b
infixl 4 .*>
(<*.) :: Parser String a -> String -> Parser String a
a <*. b = a <* string b
infixl 4 <*.
And finally:
version5 = (,) <$> ("HTTP/" .*> number <*. ".")
<*> number
From 12 lines to two!
How hard is it to build a parser that can choose another parse if the first one fails?
class Applicative f => Alternative f where
empty :: f a
(<|>) :: f a -> f a -> f a
Take 5 minutes to write your own.
instance Alternative (Parser s) where
{- ... -}
Skeleton: http://cs240h.scs.stanford.edu/P.hs
Desired behaviour:
>>> let p = string "foo" <|> string "bar"
>>> runP p "foowhee"
Just ("foo","whee")
>>> runP p "quuxly"
Nothing
>>> runP p "barely"
Just ("bar","ely")
My implementation took about one minute:
instance Alternative (Parser s) where
empty = P $ \_ -> Nothing
f <|> g = P $ \input ->
case runP f input of
Nothing -> runP g input
result -> result
There’s a long history of functional programmers writing parsing libraries.
Parsec (Leijen 2001) was the first to be practical for real world use.
Inspired by Parsec, I wrote a specialized derivative named attoparsec.
How do the two differ?
attoparsec is optimized for fast stream and file parsing.
attoparsec focuses on the ByteString
and Text
types.
Parsec is more general: it can parse String
too (in fact arbitrary token types).
attoparsec does not attempt to give friendly error messages (no file names or line numbers). It’s aimed at data generated by machines.
Parsec might be a better choice for e.g. parsing source files, where the friendliness/performance tradeoff is a little different.
Typical network protocol problem:
You receive a TCP segment off the network
You need to decode a variable-length message that lacks message length information (e.g. a JSON blob)
How do you tell whether your packet/segment contains enough data to parse the whole message?
More importantly:
What Parsec does:
Built as a monad transformer (stackable on top of another monad)
Combines parsing with ability to perform IO (e.g. read more TCP segments)
In general:
The composition of two monads is not itself a monad.
Monad transfomers allow us to “stack” a series of “transformers” on top of a base monad, so that the stacked combination is a monad.
For our Parsec case, the base monad is IO
, and the transformer stacked on top is ParsecT
.
Read more on Haskellwiki if you need to know.
In general:
The composition of two monads is not itself a monad.
Monad transfomers allow us to “stack” a series of “transformers” on top of a base monad, so that the stacked combination is a monad.
For our Parsec case, the base monad is IO
, and the transformer stacked on top is ParsecT
.
Read more on Haskellwiki if you dare.
Parse the Request-Line
of an HTTP 1.1 request:
{-# LANGUAGE OverloadedStrings #-}
import Data.Attoparsec.ByteString.Char8 as A
import Control.Applicative
request = (,,) <$>
(verb <* skipSpace) <*>
(url <* skipSpace) <*>
(version <* endOfLine)
verb = "GET" <|> "POST"
url = takeTill A.isSpace
version = (,) <$> ("HTTP/" *> decimal) <*> ("." *> decimal)
>>> :load AttoHttp.hs
>>> parse request "GET /foo HTTP/1.1\r\n"
Done "" ("GET","/foo",(1,1))
What happens on incomplete input?
>>> parse request "GET /foo HTTP/"
Partial _
The Partial
constructor has a parameter that is a function. We can use feed
to supply it with more input.
>>> let r = parse request "GET /foo HTTP/"
>>> r
Partial _
>>> r `feed` "1.1\r\n"
Done "" ("GET","/foo",(1,1))
Instead of returning a value:
{-# LANGUAGE RankNTypes #-}
type ErrMsg = String
newtype ContP a = ContP {
runP :: forall r. -- wat
String -- input
-> (ErrMsg -> Either ErrMsg r) -- failure
-> (String -> a -> Either ErrMsg r) -- success
-> Either ErrMsg r -- result
}
For polymorphic Haskell functions, there is a type quantifier that is not written.
id :: a -> a
Really means:
id :: forall a. a -> a
And forall
is a type-level lambda, corresponding to the value-level lambda “\
”.
So in the above signature, we say:
a
.”We call this a rank-1 type because the forall
is present at the outermost level, or rank.
Our continuation-based parser has a rank-2 type.
newtype ContP a = ContP {
runP :: forall r. -- wat
String -- input
-> (ErrMsg -> Either ErrMsg r) -- failure
-> (String -> a -> Either ErrMsg r) -- success
-> Either ErrMsg r -- result
}
The caller of runP
cannot pick the type of r
.
Working with continuations is tough.
Here’s a simple example:
instance Functor ContP where
fmap f p = ContP $ \bs0 fail0 succ0 ->
runP p bs0 fail0 $ \bs1 a -> succ0 bs1 (f a)
Help me fumble through a Monad
instance!
newtype Parser t a = Parser {
runParser :: forall r. Input t -> Added t -> More
-> Failure t r
-> Success t a r
-> IResult t r
}
type Failure t r = Input t -> Added t -> More
-> [String] -> String -> IResult t r
type Success t a r = Input t -> Added t -> More
-> a -> IResult t r
newtype Input t = I {unI :: t} deriving (Monoid)
newtype Added t = A {unA :: t} deriving (Monoid)
-- | Have we reached EOF?
data More = Complete | Incomplete
deriving (Eq, Show)
data IResult t r = Fail t [String] String
| Partial (t -> IResult t r)
| Done t r
We have to somehow make feed
a function that is possible to write.
If we are given more input, we use these types to track it.
newtype Input t = I {unI :: t} deriving (Monoid)
newtype Added t = A {unA :: t} deriving (Monoid)
Input
is all the input we’ve ever seen.
Added
is the extra input we’ve been given via feed
.
We keep track of the two separately because parsing can fail.
When we backtrack, we throw away the Input
state.
We tack all the data our caller Added
back on again, so we won’t forget it.
Suppose our caller runs out of input to give us on one branch.
But our parse fails.
We need to remember this so that after we backtrack, we won’t ask for more input on the other side of the branch.
data More = Complete | Incomplete
deriving (Eq, Show)
If a parse fails, we report an error message, along with a stack of context information that might help with debugging.
type Failure t r = Input t -> Added t -> More
-> [String] -> String -> IResult t r
If the parse succeeds, we call our successor continuation with our result.
type Success t a r = Input t -> Added t -> More
-> a -> IResult t r
The result is just a fancier Either
with the possibility of saying “feed me more before I answer you”.
data IResult t r = Fail t [String] String
| Partial (t -> IResult t r)
| Done t r
As is usually the case with monads, the user-visible “run this monad” function is quite simple.
parse :: Monoid t => Parser t a -> t -> IResult t a
parse m s = runParser m (I s) (A mempty) Incomplete failK successK
The only slight complication:
We need to create “terminal continuations”
i.e. continuations that do not chain up yet another continuation, but instead return us to our usual mode of computation
failK :: Failure t a
failK i0 _a0 _m0 stack msg = Fail (unI i0) stack msg
successK :: Success t a a
successK i0 _a0 _m0 a = Done (unI i0) a
http://www.scs.stanford.edu/16wi-cs240h/labs/lab1.html
Write a simple Haskell version of the UNIX tr
program.