parsec

Pitfalls

in which I discuss some idioms and gotchas you'll likely hit while building a grammar.

Prefer `recurse` over `lazy` for recursive grammars

Parser.lazy defers parser construction until run time, so it can close over self-references. The cost is that every invocation rebuilds the parser sub-tree, so for input of depth N, that's N allocations per parse.

Parser.recurse runs a parser through a stable cell and does not rebuild. Declare the recursive position in a top-level def, set it once, and reference it from sub-parsers. See examples/lisp.carp for the pattern. Use lazy only when the grammar is shallow or short-lived, or when performance matters less than clarity.

A `lazy` thunk must call a sibling, not itself

; Broken: the thunk calls parens, the function it is defined in.
(defn parens []
  (Parser.between \( \)
    (Parser.optional (Parser.lazy (fn [] (parens))))))

This loops at parse time. The self-call inside the thunk does not terminate. Split the recursive position into two functions:

; The thunk in parens-content calls its sibling parens-pair, not itself.
(defn parens-content []
  (Parser.alt (Parser.lazy (fn [] (parens-pair)))
              (Parser.pure ())))

(defn parens-pair []
  (Parser.between \( \) (parens-content)))

Using Parser.recurse against a stable cell sidesteps this entirely.
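To make the cost difference concrete, here is a small model in Python rather than Carp. It is a sketch, not the library's implementation: char, between, optional, lazy, and Cell are hypothetical stand-ins that only mimic the behavior described above (a lazy thunk rebuilds on every invocation, a cell is set once and shared). Parsers in this model take an input and an index and return the new index, or None on failure.

```python
# Hypothetical model of the two approaches; none of these names are the
# library's API. A parser is a function (inp, i) -> new index or None.
stats = {"rebuilds": 0}  # how often a lazy thunk rebuilt its sub-tree

def char(c):
    def run(inp, i):
        return i + 1 if i < len(inp) and inp[i] == c else None
    return run

def between(open_, close, inner):
    def run(inp, i):
        i = open_(inp, i)
        if i is None:
            return None
        i = inner(inp, i)
        if i is None:
            return None
        return close(inp, i)
    return run

def optional(p):
    def run(inp, i):
        j = p(inp, i)
        return i if j is None else j  # on failure, consume nothing
    return run

def lazy(thunk):
    def run(inp, i):
        stats["rebuilds"] += 1        # the thunk builds a fresh sub-tree
        return thunk()(inp, i)
    return run

# Lazy version, split into two siblings as in the example above.
def parens_content():
    return optional(lazy(parens_pair))

def parens_pair():
    return between(char("("), char(")"), parens_content())

class Cell:
    """A stable slot: created empty, set once, then shared by reference."""
    def __init__(self):
        self.parser = None
    def __call__(self, inp, i):
        return self.parser(inp, i)

# Cell version: the tree is built once and refers back to the cell.
stable = Cell()
stable.parser = between(char("("), char(")"), optional(stable))

end_lazy = parens_pair()("((()))", 0)
rebuilds_during_parse = stats["rebuilds"]
end_stable = stable("((()))", 0)
```

On "((()))" both versions consume all six characters. The lazy version invokes its thunk once per nesting level, so three rebuilds here, while the cell version reuses the single tree it was given.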

Use `String.byte-slice`, not `String.slice`

In a custom combinator that extracts a substring, use String.byte-slice. The library does so internally. String.slice walks the input twice (chars, then from-chars); String.byte-slice is a direct memcpy. The cost difference is two orders of magnitude.

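The tradeoff here is UTF-8: String.byte-slice works on byte offsets, so an offset that lands inside a multi-byte character yields invalid UTF-8. Make sure your slice boundaries fall on character boundaries.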
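A quick way to see the UTF-8 hazard, shown in Python rather than Carp. This is only an analogy: slicing a bytes value stands in for String.byte-slice and slicing a str stands in for String.slice. The helper is_valid_utf8 is a name made up for this sketch.

```python
# Analogy only: bytes slicing ~ byte offsets, str slicing ~ character offsets.
text = "na\u00efve"             # the third character takes two bytes in UTF-8
raw = text.encode("utf-8")      # six bytes for five characters

def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

char_slice = text[0:3]          # character offsets: always whole characters
byte_slice = raw[0:3]           # byte offset 3 cuts the two-byte character in half
aligned_slice = raw[0:4]        # byte offset 4 lands on a character boundary
```

The byte slice ending at offset 3 is not valid UTF-8, while the one ending at offset 4 decodes to the same text as the character slice. For ASCII-only grammars every byte offset is a character boundary, so the problem does not arise.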

Pass `(fn ...)` to combinators by value, not by reference

Combinators like Parser.map, Parser.bind, Parser.satisfy, and Parser.take-while take their function arguments by value:

(Parser.map (Parser.byte \a) (fn [c] (Char.to-int c)))    ; right
(Parser.map (Parser.byte \a) &(fn [c] (Char.to-int c)))   ; wrong

The &fn form compiles for capture-free closures (Carp hoists those to static functions) but produces a dangling reference once the closure captures a local. The failure mode is a runtime segfault, not a compile error.

This should be resolved in Carp eventually, but until then I want to flag it.

`Parser.parse` is strict

Parser.parse p input succeeds only if p consumes all of input. To allow unconsumed trailing input, use Parser.parse-partial instead, which returns a Pair of the parsed value and the remaining input as a String.
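The difference between the two entry points can be modelled in a few lines of Python. This is a sketch of the semantics described above, not the library's code: digits, parse, and parse_partial are hypothetical stand-ins, and failure is represented as None, which is an assumption of the sketch.

```python
# Hypothetical model: a parser maps input to (value, rest), or None on failure.
def digits(inp):
    """Consume one or more ASCII digits from the front of inp."""
    n = 0
    while n < len(inp) and "0" <= inp[n] <= "9":
        n += 1
    return (inp[:n], inp[n:]) if n > 0 else None

def parse_partial(p, inp):
    """Run p and hand back both the value and whatever input is left."""
    return p(inp)

def parse(p, inp):
    """Run p and succeed only if nothing is left over."""
    result = p(inp)
    if result is None:
        return None
    value, rest = result
    return value if rest == "" else None

strict_ok = parse(digits, "123")            # whole input consumed
strict_fail = parse(digits, "123abc")       # trailing "abc" makes this fail
partial = parse_partial(digits, "123abc")   # value plus the remainder
```

In this model the strict call on "123abc" fails, while the partial call returns the pair ("123", "abc") and leaves the caller to decide what to do with the remainder.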
