Parsers
A parser is a function that
- converts unstructured data (e.g.
String
, array ofByte
,…) - into structured data (e.g. JSON object, Markdown, Video…)
type Parser = String -> StructuredObject
Every large software system contains a Parser
System | Parses |
---|---|
Compiler | Source code |
Shell Scripts | Command-line options |
Database | SQL queries |
Web Browser | HTML, CSS, JS, … |
Games | Level descriptors |
Routers | Packets |
How to build Parsers?
Two standard methods
Regular Expressions
- Doesn’t really scale beyond simple things
- No nesting, recursion
Parser Generators
- Specify grammar via rules
Expr : Var { EVar $1 }
| Num { ENum $1 }
| Expr Op Expr { EBin $1 $2 $3 }
| '(' Expr ')' { $2 }
;
- Tools like
yacc
,bison
,antlr
,happy
- convert grammar into executable function
Grammars Don’t Compose!
If we have two kinds of structured objects Thingy
and Whatsit
.
Thingy : rule { action }
;
Whatsit : rule { action }
;
To parse sequences of Thingy
and Whatsit
we must repeat ourselves:
Thingies : Thingy Thingies { ... }
EmptyThingy { ... }
;
Whatsits : Whatsit Whatsits { ... }
EmptyWhatsit { ... }
;
No way to:
- Define a generic parser for sequences
- Reuse this generic parser for
Thingy
andWhatsit
A New Hope: Parsers as Functions
Lets think of parsers directly as functions that
- Take as input a
String
- Convert a part of the input into a
StructuredObject
- Return the remainder unconsumed to be parsed later
data Parser a = P (String -> (a, String))
A Parser a
- Converts a prefix of a
String
- Into a structured object of type
a
and - Returns the suffix
String
unchanged
Parsers Can Produce Many Results
Sometimes we want to parse a String
like
"2 - 3 - 4"
into a list of possible results
Minus (Minus 2 3) 4), Minus 2 (Minus 3 4)] [(
So we generalize the Parser
type to
data Parser a = P (String -> [(a, String)])
EXERCISE
Given the definition
data Parser a = P (String -> [(a, String)])
Implement a function
runParser :: Parser a -> String -> [(a, String)]
= ??? runParser p s
QUIZ
Given the definition
data Parser a = P (String -> [(a, String)])
Which of the following is a valid oneChar :: Parser Char
that returns the first Char
from a string (if one exists)
-- A
= P (\s -> head s)
oneChar
-- B
= P (\s -> case s of
oneChar -> [('', s)]
[] :cs -> [c, cs])
c
-- C
= P (\s -> (head s, tail s))
oneChar
-- D
= P (\s -> [(head s, tail s)])
oneChar
-- E
= P (\s -> case s of
oneChar -> []
[] :cs -> [(c, cs)]) c
Lets Run Our First Parser!
>>> runParser oneChar "hey!"
'h', "ey")]
[(
>>> runParser oneChar "yippee"
'y', "ippee")]
[(
>>> runParser oneChar ""
[]
Failure to parse means result is an empty list!
EXERCISE
Your turn: Write a parser to grab first two chars
twoChar :: Parser (Char, Char)
= P (\cs -> ???) twoChar
When you are done, we should get
>>> runParser twoChar "hey!"
'h', 'e'), "y!")]
[((
>>> runParser twoChar ""
[]
>>> runParser twoChar "h"
[]
Composing Parsers
We can write twoChars
from scratch like so:
twoChar :: Parser (Char, Char)
= P (\cs -> case cs of
twoChar :c2:cs' -> [((c1, c2), cs')]
c1-> []) _
But this is sad: would be much better to reuse oneChar
!
twoChar :: Parser (Char, Char)
= pairP oneChar oneChar twoChar
QUIZ
twoChar :: Parser (Char, Char)
= pairP oneChar oneChar twoChar
What must the type of pairP
be?
A. Parser (Char, Char)
B. Parser Char -> Parser (Char, Char)
C. Parser a -> Parser a -> Parser (a, a)
D. Parser a -> Parser b -> Parser (a, b)
E. Parser a -> Parser (a, a)
A Pair Combinator
Lets implement pairP
!
pairP :: Parser a -> Parser b -> Parser (a, b)
= ??? pairP aP bP
High-level idea:
- Run
aP
on the input string - For each result
(a, s')
ofaP
, runbP
ons'
- For each result
(b, s'')
ofbP
, return((a, b), s'')
-- | We have just seen this "for-each" function in the previous lecture!
forEach :: [a] -> (a -> [b]) -> [b]
= concatMap f xs
forEach xs f
pairP :: Parser a -> Parser b -> Parser (a, b)
= P (\s -> forEach (runParser aP s) (\(a, s') ->
pairP aP bP ->
forEach (runParser bP s') (\(b, s'')
[((a, b), s'')]) ))
This works, but is was a bit of a pain to write!
Does this code look familiar?
It’s like doing mutable state + non-determinism the hard way!
- i.e. without monads
Also, take a second look at the Parser
type:
data Parser a = P (String -> [(a, String)])
data State s a = S (s -> (a, s))
Parser is a Monad!
Like State
and []
, Parser
is a monad!
We need to implement two functions
returnP :: a -> Parser a
bindP :: Parser a -> (a -> Parser b) -> Parser b
QUIZ
Which of the following is a valid implementation of returnP
data Parser a = P (String -> [(a, String)])
returnP :: a -> Parser a
= P (\s -> []) -- A
returnP a
= P (\s -> [(a, s)]) -- B
returnP a
= P (\s -> (a, s)) -- C
returnP a
= P (\s -> [(a, "")]) -- D
returnP a
= P (\s -> [(s, a)]) -- E returnP a
HINT: what did return
do for State
and []
?
Bind for Parsers
Next, lets implement bindP
bindP :: Parser a -> (a -> Parser b) -> Parser b
= ??? bindP aP fbP
High-level idea:
- Run
aP
on the input string - For each result
(a, s')
ofaP
, runbp = fbP a
ons'
- For each result
(b, s'')
ofbP
, return(b, s'')
In other words, this is just a generalization of pairP
!
bindP :: Parser a -> (a -> Parser b) -> Parser b
= P (\s ->
bindP aP fbP ->
forEach (runParser aP s) (\(a, s') ->
forEach (runParser (fbP a) s') (\(b, s'')
[(b, s'')]
)
) )
The Parser Monad
We can now make Parser
an instance of Monad
instance Monad Parser where
>>=) = bindP
(return = returnP
And now, let the wild rumpus begin!
Parser Combinators
Lets write lots of high-level operators to combine parsers!
Here’s a cleaned up pairP
pairP :: Parser a -> Parser b -> Parser (a, b)
= do
pairP aP bP <- aP
a <- bP
b return (a, b)
Failures are the Pillars of Success!
Surprisingly useful, always fails
- i.e. returns
[]
no successful parses
failP :: Parser a
= P (\_ -> []) failP
QUIZ
Consider the parser
satP :: (Char -> Bool) -> Parser Char
= do
satP p <- oneChar
c if p c then return c else failP
What is the value of
= runParser (satP (\c -> c == 'h')) "hello"
quiz1 = runParser (satP (\c -> c == 'h')) "yellow" quiz2
quiz1 |
quiz2 |
|
---|---|---|
A | [] |
[] |
B | [('h', "ello")] |
[('y', "ellow")] |
C | [('h', "ello")] |
[] |
D | [] |
[('y', "ellow")] |
Parsing Alphabets and Numerics
We can now use satP
to write
-- parse ONLY the Char c
charP :: Char -> Parser Char
= satP (\c' -> c == c')
charP c
-- parse ANY ALPHABET
alphaCharP :: Parser Char
= satP isAlpha
alphaCharP
-- parse ANY NUMERIC DIGIT
digitCharP :: Parser Char
= satP isDigit digitCharP
QUIZ
We can parse a single Int
digit
digitP :: Parser Int
= do
digitP <- digitCharP -- parse the Char c
c return (read [c]) -- convert Char to Int
What is the result of
= runParser digitP "92"
quiz1 = runParser digitP "cat" quiz2
quiz1 |
quiz2 |
|
---|---|---|
A | [] |
[] |
B | [('9', "2")] |
[('c', "at")] |
C | [(9, "2")] |
[] |
D | [] |
[('c', "at")] |
A Choice Combinator
Lets write a combinator p1 <|> p2
that
- returns the results of
p1
or, else if those are empty
- returns the results of
p2
E.g. <|>
lets us build a parser that produces an alphabet OR a numeric character
alphaNumCharP :: Parser Char
= alphaCharP <|> digitCharP alphaNumCharP
Which should produce
>>> runParser alphaNumCharP "cat"
'c', "at")]
[(
>>> runParser alphaNumCharP "2cat"
'2', "cat")] [(
QUIZ
(<|>) :: Parser a -> Parser a -> Parser a
<|> p2 = ??? -- produce results of `p1` if non-empty
p1 -- OR-ELSE results of `p2`
Which of the following implements (<|>)
?
do -- (A)
<- p1
r1s <- p2
r2s return (r1s ++ r2s)
do -- (B)
<- p1
r1s case r1s of
-> p2
[] -> return r1s
_
-- (C)
P (\s -> runParser p1 s ++ runParser p2 s)
-- (D)
P (\s ->
case runParser p1 s of
-> runParser p2 s
[] -> r1s
r1s )
A Simple Expression Parser
Let’s write a parser calc :: Parser Int
that parses and evaluates simple arithmetic expressions
- an expression is
a + b
,a - b
,a * b
,a / b
- where
a
andb
are single-digit integers
>>> runParser calc "8/2"
4,"")]
[(
>>> runParser calc "8+2cat"
10,"cat")]
[(
>>> runParser calc "8-2cat"
6,"cat")]
[(
>>> runParser calc "8*2cat"
16,"cat")] [(
-- 1. First, parse the operator
opP :: Parser (Int -> Int -> Int)
= plus <|> minus <|> times <|> divide
opP where
= do { _ <- charP '+'; return (+) }
plus = do { _ <- charP '-'; return (-) }
minus = do { _ <- charP '*'; return (*) }
times = do { _ <- charP '/'; return div }
divide
-- 2. Now parse the expression!
calc :: Parser Int
= do x <- digitP
calc <- opP
op <- digitP
y return (x `op` y)
QUIZ
What will quiz
evaluate to?
calc :: Parser Int
= do x <- digitP
calc <- opP
op <- digitP
y return (x `op` y)
= runParser calc "10+2" quiz
A. [(12,"")]
B. []
C. [(10, "+2")]
D. [(1, "0+2")]
E. Run-time exception
Next: Recursive Parsing
Its cool to parse individual Char
…
… but way more interesting to parse recursive structures!
"((2 + 10) * (7 - 4)) * (5 + 2)"
EXERCISE: Parsing Many Times
Often we want to repeat parsing some object
- E.g. to parse an
Int
we need to parse a digit many times
Implement a function manyP
that:
- applies a parser
p
as many times as it can - returns a list of the results
-- | `manyP p` repeatedly runs `p` to return a list of [a]
manyP :: Parser a -> Parser [a]
= ???
manyP
>>> runParser (manyP digitCharP) "2022cat"
"2022","cat")]
[(
>>> runParser (manyP digitCharP) "cat"
"","cat")] [(
-- | `manyP p` repeatedly runs `p` to return a list of [a]
manyP :: Parser a -> Parser [a]
= m1 <|> m0
manyP p where
= return []
m0 = do { x <- p; xs <- manyP p; return (x:xs) } m1
Parsing Numbers
Now, we can write an Int
parser as:
int :: Parser Int
= do { xs <- manyP digitChar; return (read xs) } int
… but is this correct?
QUIZ: Parsing Numbers
What will `quiz`` evaluate to?
int :: Parser Int
= do { xs <- manyP digitChar; return (read xs) }
int
= runParser int "cat" quiz
A. Type Error
B. [(0,"cat")]
C. []
D. Run-time exception
Parsing Numbers: Take 2
A number has one or more digits, not zero or more!
manyOneP :: Parser a -> Parser [a]
= do { x <- p; xs <- manyP p; return (x:xs) }
manyOneP p
int :: Parser Int
= do { xs <- manyOneP digitChar; return (read xs) }
int
>>> runParser int "2022cat"
2022,"cat")]
[(
>>> runParser int "cat"
-- as expected! []
Parsing Arithmetic Expressions
Now we can build a proper calculator!
calc :: Parser Int
= binExp <|> int
calc
binExp :: Parser Int
= do
binExp <- int
x <- opP
o <- calc
y return (x `o` y)
Works pretty well!
>>> runParser calc "11+22+33"
66,"")]
[(
>>> runParser calc "11+22-33"
0,"")] [(
WARNING
The real (<|>)
combinator in Parsec has different semantics:
It only runs p2
if p1
fails without consuming any input
- This makes parsers more efficient
- And helps with error reporting
To choose between objects that can have the same prefix, you need to use try
:
-- In Parsec
calc :: Parser Int
= try binExp <|> int calc
Otherwise, runParser calc "10"
would fail
- because
binExp
consumes the10
- so
int
is never tried
QUIZ
calc :: Parser Int
= binExp <|> int
calc
binExp :: Parser Int
= do
binExp <- int
x <- opP
o <- calc
y return (x `o` y)
What does quiz
evaluate to?
= runParser calc "10-5-5" quiz
A. [(0, "")]
B. []
C. [(10, "")]
D. [(10, "-5-5")]
E. [(5, "-5")]
Problem: Right-Associativity
binExp :: Parser Int
= do
binExp <- int
x <- opP
o <- calc
y return (x `o` y)
"10-5-5"
gets parsed as 10 - (5 - 5)
because
The calc
parser implicitly forces each operator to be right associative
doesn’t matter for
+
,*
but is incorrect for
-
,div
QUIZ
binExp :: Parser Int
= do
binExp <- int
x <- opP
o <- calc
y return (x `o` y)
What does quiz
get evaluated to?
= runParser calc "10*2+100" quiz
A. [(1020,"")]
B. [(120,"")]
C. [(120,""), (1020, "")]
D. [(1020,""), (120, "")]
E. []
The calc
parser implicitly forces each operator to be right associative
… at the same precedence level!
Left Associativity
How can we make the parser left associative
- i.e. parse “10-5-5” as
(10 - 5) - 5
?
Lets flip the order!
binExp :: Parser Int
= do
binExp <- calc -- now calc is on the left
x <- intOp
o <- int
y return (x `o` y)
But …
>>> runParser calc "2+2"
...
Infinite loop! calc --> binExp --> calc --> binExp --> ...
- without consuming any input :-(
Solution: Grammar Factoring
Any expression is a sum-of-products!
:== sum
expr sum :== (((prod "+" prod) "+" prod) "+" ... "+" prod)
:== (((atom "*" atom) "*" atom) "*" ... "*" atom)
prod :== "(" expr ")" | int atom
Or in Haskell:
= foldLP prod (plus <|> minus)
expr = foldLP atom (times <|> div)
prod = parensP expr <|> int atom
All that is left is to implement parensP
and foldLP
!
EXERCISE: Parsing Parenthetic Expressions
Implement the function parensP
:
-- | Parse what `p` parses, surrounded by parentheses:
parensP :: Parser a -> Parser a
= ???
parensP p
>>> runParser (parensP int) "(10)"
10,"")]
[(
>>> runParser (parensP int) "10"
[]
Fold Left Parser
Lets implement foldLP aP oP
as a combinator
vP
parses a singlea
valueoP
parses a binary operatora -> a -> a
foldLP vP oP
parses and returns the result((v1 o v2) o v3) o v4) o ... o vn)
But how?
grab the first
v1
usingvP
continue by
- either trying
oP
thenv2
… and recursively continue withv1 o v2
- or else (no more
o
) just returnv1
- either trying
foldLP :: Parser a -> Parser (a -> a -> a) -> Parser a
= do {v1 <- vP; continue v1}
foldLP vP oP where
= do { o <- oP; v2 <- vP; continue (v1 `o` v2) }
continue v1 <|> return v1
Lets make sure it works!
>>> runParser expr "10-5-5"
0,"")]
[(
>>> runParser expr "10*2+100"
120,"")]
[(
>>> doParse sumE2 "100+10*2"
120,"")] [(
Parser combinators
That was a taste of Parser Combinators
- Transferred from Haskell to many other languages.
Many libraries including Parsec used in your homework - oneOrMore
is called chainl
Read more about the theory - in these recent papers
Read more about the practice - in this recent post that I like JSON parsing from scratch