Tutorial
in which we will build an arithmetic expression parser step by step. We will cover every combinator most grammars need.
By the end of this tutorial you'll have a parser that takes strings
like 2 * (3 + 4) - 5 and returns 9.
The complete file is examples/arith.carp. We build it in five steps,
each runnable on its own.
0. setup
(load "git@github.com:carpentry-org/parsec@0.1.0")
Parser.parse is strict, it succeeds only if the parser consumes the
whole input. We'll use that throughout.
1. parse a single integer
The simplest grammar is just a number. The library ships with
Parser.Lexer.integer, which handles an optional sign and skips no
trailing whitespace.
(defn main []
(match (Parser.parse (Parser.Lexer.integer) "42")
(Result.Success n) (println* &n)
(Result.Error e) (IO.errorln &(Parser.format-error &e))))
When you run it you should see 42. Try -7, 12345. Try forty-two to
see the error.
2. handle whitespace
Real input has spaces. Parser.Lexer.lexeme runs a parser, then skips
trailing whitespace. So (Parser.Lexer.lexeme (Parser.Lexer.integer))
matches an integer followed by any whitespace.
We also want leading whitespace to be tolerated. The cleanest way is to wrap the top-level parser:
(defn arith []
(Parser.bind (Parser.Lexer.whitespace)
(fn [_] (Parser.Lexer.lexeme (Parser.Lexer.integer)))))
Parser.bind runs the first parser, then feeds the result into a
function that returns the next parser. We discard the whitespace's
result and yield the integer. This is the monadic chaining you'd write
as whitespace >> integer in Haskell, in case you already know Parsec.
3. multiplication
We want 2 * 3 to evaluate to 6. The grammar is:
term = integer ('*' integer | '/' integer)*
This translates to “an integer, then zero or more (operator, integer) pairs,
fold the pairs over the first integer”.
(defn apply-op [acc op rhs]
(cond
(Char.= op \*) (Int.* acc rhs)
(Char.= op \/) (Int./ acc rhs)
acc))
(defn mul-op []
(Parser.alt
(Parser.map (Parser.Lexer.symbol @"*") (fn [_] \*))
(Parser.map (Parser.Lexer.symbol @"/") (fn [_] \/))))
Parser.alt tries the first parser. If it fails without consuming,
it tries the second. Parser.Lexer.symbol matches a literal string
and skips trailing whitespace.
(Parser.map f p) runs p and applies f to its result. Here we throw
away the matched "*" and produce the Char \*.
The chaining looks like this:
(defn term []
(Parser.map
(Parser.seq
(Parser.Lexer.lexeme (Parser.Lexer.integer))
(Parser.many (Parser.seq (mul-op)
(Parser.Lexer.lexeme (Parser.Lexer.integer)))))
(fn [outer]
(let [first @(Pair.a &outer)
pairs (Pair.b &outer)]
(Array.reduce &(fn [acc p] (apply-op acc @(Pair.a p) @(Pair.b p)))
first
pairs)))))
If you test it, you should get:
"2 * 3" => 6
"2 * 3 * 4" => 24
"100 / 10" => 10
4. addition with precedence
For 1 + 2 * 3 to evaluate to 7 (not 9), addition has to be
lower precedence than multiplication. The standard trick is that an
expression is a sum of terms, where each term is a product of factors.
expr = term ('+' term | '-' term)*
term = factor ('*' factor | '/' factor)*
Notice the chainl1 shape repeats. Let's factor it out:
(defn chainl1 [head op next]
(Parser.map
(Parser.seq head (Parser.many (Parser.seq op next)))
(fn [outer]
(let [h @(Pair.a &outer)
pairs (Pair.b &outer)]
(Array.reduce &(fn [acc p]
(apply-op acc @(Pair.a p) @(Pair.b p)))
h
pairs)))))
Now term and expr are one-liners:
(defn term [] (chainl1 (factor) (mul-op) (factor)))
(defn expr [] (chainl1 (term) (add-op) (term)))
Here, add-op is just mul-op with \+ and \-.
5. parens, and the recursion problem
The remaining piece is factor:
factor = integer | '(' expr ')'
A factor is either a literal integer or a parenthesized expression.
The recursion path is factor -> expr -> term -> factor.
In Carp, this is a problem. If we try the obvious thing:
(defn factor []
(Parser.alt
(Parser.Lexer.lexeme (Parser.Lexer.integer))
(Parser.between (Parser.Lexer.symbol @"(")
(Parser.Lexer.symbol @")")
(expr)))) ; recurse into expr
we hang at construction. (factor) calls (expr) calls (term)
calls (factor) calls (expr)... Carp evaluates eagerly, so we
construct an infinite loop when building the parser tree.
The fix: Parser.recurse runs a parser via a stable static reference.
We declare the recursive position in a top-level def, fill it in
once at startup, and reference it from sub-parsers.
(def *expr* (the (Parser Int) (Parser.placeholder)))
(defn factor []
(Parser.alt
(Parser.Lexer.lexeme (Parser.Lexer.integer))
(Parser.Lexer.lexeme
(Parser.between (Parser.Lexer.symbol @"(")
(Parser.Lexer.symbol @")")
(Parser.recurse &*expr*)))))
(defn term [] (chainl1 (factor) (mul-op) (factor)))
(defn expr-impl []
(chainl1 (term) (add-op) (term)))
; Init *expr* at module load.
(def _init (do (set! *expr* (expr-impl)) 0))
Parser.placeholder is a parser that always fails (until you set!
the cell). The _init def has a side-effecting initializer that runs
once.
Now (arith) parses anything — primitives, parens, mixed precedence:
"1 + 2" => 3
"1 + 2 * 3" => 7
"(1 + 2) * 3" => 9
"10 - 4 - 3" => 3 (left-associative)
"2 * (3 + 4) - 5" => 9
What you've used
By the end of this tutorial you've touched all the combinators that most real grammars need:
- Primitives —
byte,Lexer.integer,Lexer.symbol,Lexer.lexeme,Lexer.whitespace - Chaining —
bind,seq,map - Choice —
alt - Repetition —
many - Bracketing —
between - Recursion —
recurseagainst aplaceholder-initialized cell
Look at examples/arith.carp for the assembled file, and at
examples/lisp.carp for the same recursion pattern applied to
s-expressions with a Box-recursive AST.
What's next
- Rich error messages:
Parser.label "<name>" preplaces the expected-set on failure.Parser.format-errorrenders aParseErrto a string. - Backtracking nuances:
Parser.trylets you backtrack after a partially-consumed alternative.Parser.stringis atomic by default and doesn't needtry. - Pitfalls covers the linear-types interactions you'll eventually run into.