FootlessParser

A simple implementation of a parser combinator in Swift, supporting infinite lookahead, non-ambiguous parsing with error reporting.

After the introduction of Swift at WWDC 2014 the web was suddenly full of articles about Swift and Functional Programming. Having only experience with procedural and object oriented programming I was fascinated by this different way of thinking. It keeps things as generic as possible, which I like, and it is simple (but not easy). I read the Functional Programming in Swift book and everything was going well until one of the last chapters about “Parser Combinators”. I simply could not comprehend it.

Since reading about it didn’t work I decided to try Learning by Doing instead. So I made FootlessParser, which is quite different from the framework in the book:

  • It has error reporting, which makes it easier to write parsers with it.
  • I didn’t understand why each parse in the book returns a sequence of results, so FootlessParser returns just one result. Later I realised it was because the code in the book supports ambiguous parsing (it returns all possible ways of parsing the input), which means my decision made FootlessParser non-ambiguous. This is just as well as it makes things simpler.
  • It can parse any collection of anything, not just text. So each individual parser is generic both in its input and its output. It should be fairly easy to let it parse any sequence too.

Readme

In short, FootlessParser lets you define parsers like this:

let parser = function1 <^> parser1 <*> parser2 <|> parser3

function1 and parser3 return the same type.

parser will pass the input to parser1 followed by parser2, pass their results to function1 and return its result. If that fails it will pass the original input to parser3 and return its result.

Definitions

Parser

A function which takes some input (a sequence of tokens) and returns either the output and the remaining unparsed part of the input, or an error description if it fails.

Token

A single item from the input. Like a character from a string, an element from an array or a string from a list of command line arguments.

Parser Input

Most often text, but can also be an array or any collection of anything, provided it conforms to CollectionType.

Parsers

The general idea is to combine simple parsers into more complex ones. So char("a") creates a parser which checks if the next token from the input is an “a”. If it is it returns that “a”, otherwise it returns an error. You can then use operators and functions like zeroOrMore and optional to create ever more complex parsers. For more check out the full list of functions.

Operators

<^> (map)

function <^> parser1 creates a new parser which runs parser1. If it succeeds it passes the output to function and returns the result.

<*> (apply)

function <^> parser1 <*> parser2 creates a new parser which first runs parser1. If it succeeds it runs parser2. If that also succeeds it passes both outputs to function and returns the result.

The <*> operator requires its left parser to return a function and is normally used together with <^>. function must take 2 parameters of the correct types, and it must be curried, like this:

func function (a:A)(b:B) -> C

This is because <*> returns the output of 2 parsers and it doesn’t know what to do with them. If you want them returned in a tuple, an array or e.g. added together you can do so in the function before <^> .

If there are 3 parsers and 2 <*> the function must take 3 parameters, and so on.

<*

The same as the <*> above, except it discards the result of the parser to its right. Since it only returns one output it doesn’t need the <^> . Unless you want the output converted to something else.

*>

The same as <* , but discards the result of the parser to its left.

<|> (choice)

parser1 <|> parser2 <|> parser3

This operator tries all the parsers in order and returns the result of the first one that succeeds.

>>- (flatmap)

parser1 >>- ( o -> parser2 )

This does the same as the flatmap functions in the Swift Standard Library. It creates a new parser which first tries parser1. If it fails it returns the error, if it succeeds it passes the output to the function which uses it to create parser2. It then runs parser2 and returns its output or error.

Examples

CSV parser

let delimiter = "," as Character
let quote = "\"" as Character
let newline = "n" as Character

let cell = char(quote) *> zeroOrMore(not(quote)) <* char(quote)
    <|> zeroOrMore(noneOf([delimiter, newline]))

let row = extend <^> cell <*> zeroOrMore( char(delimiter) *> cell ) <* char(newline)
let csvparser = zeroOrMore(row)

Here a cell (or field) either:

  • begins with a “, then contains anything, including commas, tabs and newlines, and ends with a ” (both quotes are discarded)
  • or is unquoted and contains anything but a comma or a newline.

Each row then consists of one or more cells, separated by commas and ended by a newline. The extend function joins the cells together into an array. Finally the csvparser collects zero or more rows into an array.

To perform the actual parsing:

do {
    let output = try parse(csvparser, csvtext)
    // output is an array of all the rows, 
    // where each row is an array of all its cells.
} catch {
    // handle errors
}

XML tag

let opentag = char("<") *> oneOrMore(not(">")) <* char(">")
let closetag = { (tagname: String) in char("<") *> tokens(tagname) <* tokens("/>") }

let tag = tuple <^> opentag <*> oneOrMore(not("<")) >>- { (name, content) in
    return { _ in (name, content) } <^> closetag(name)
}

do {
    let (name, content) = try parse(tag, "<a>content<a/>")
    // name == "a", content == "content"
} catch {
    // handle errors
}

This one is more complex because to parse the closing tag you need the name of the opening tag. closetag is a closure which takes the name of the tag and returns a parser for the closing tag with that name.

In the tag parser everything before >>- forms a parser for the opening tag and any content before the closing tag, and returns the result as a (name, content) tuple. Then after the >>- there is a closure which takes name and content as input and returns a parser which parses the closing tag, discards the result and returns name and content instead.

Recursive expression

func add (a:Int)(b:Int) -> Int { return a + b }
func multiply (a:Int)(b:Int) -> Int { return a * b }

let nr = {Int($0)!} <^> oneOrMore(oneOf("0123456789"))

var expression: Parser<Character, Int>!

let factor = nr <|> lazy( char("(") *> expression <* char(")") )

var term: Parser<Character, Int>!
term = lazy( multiply <^> factor <* char("*") <*> term <|> factor )

expression = lazy( add <^> term <* char("+") <*> expression <|> term )

do {
    let result = try parse(expression, "(1+(2+4))+3")
} catch { }

expression parses expressions like "12", "1+2+3", "(1+2)", "12*3+1" and "12*(3+1)" and returns the result as an Int.

Unfortunately recursion forces us to commit two sins: mutable state and explicitly unwrapped optionals. All parsers which refer to themselves directly or indirectly must be pre-declared: var expression: Parser<Character, Int>!. And to avoid infinte recursion the definitions must use the lazy function.

Development

It was fascinating to see how Functional Programming lets you start from the bottom with small and simple functions, structures and operators, and combine them to build a framework capable of expressing pretty complex grammars.

In fact many of the functions and operators are so small and focused that you can tell how they must be implemented just by looking at the parameters and return type. Like the >>- operator for Parser:

(p: Parser<T,A>, f: A -> Parser<T,B>) -> Parser<T,B>

The only way this can return Parser<T,B> is by calling function f. And it must get its one parameter of type A from Parser p. It almost writes itself.

Self-imposed challenges

I started this project because I wanted to learn more about Functional Programming in general and parser combinators specifically. So I tried to follow these rules:

  • only pure functions (they only use their parameters).
  • no global state (with only pure functions there’s not much point really).
  • no mutable state.

The 2 first ones I follow anyway, but I had to break the last one because the “extend” functions have variables as they work with RangeReplaceableCollectionType, which only has mutable methods.

And for bonus points:

  • write the body of each function in one line (while still using sensible code formatting and line lengths).

This was possible surprisingly often, and I believe without sacrificing code readability.

Keeping things generic

FootlessParser can parse any collection of anything, not just text. And parsers which return several items use arrays. This makes text parsing cumbersome because you get back arrays of characters instead of strings. So all parsers which return arrays have specialised versions which take character parsers and return strings.

This means if you parse arrays of characters you will get back strings, which is probably not what you expect. But all in all I think it’s a good trade-off. There is in any case no way around this since ParserInput doesn’t know what type of collection its source is, nor should it.

The future

If I was going to make any major changes, I would start with removing the Result framework and throw errors instead when parsing fails. That should simplify the code and probably make it noticeably faster since it wouldn’t need to create a new enum for every parse.

Other possible improvements:

  • Better error reporting.
  • Make more parsers for speed testing.
  • Let parsers return lazy sequences instead of arrays. Right now a lot of arrays are created and thrown away. Combined with the ability to parse sequences this would let you create tokenisers whose output could then be parsed again. So for example the CSV parser described previously could be updated to also check that each row contains the same number of fields, and each column only has values of the same type.
  • Let ParserInput take advantage of the Standard Library, so instead of just returning tokens one by one it could return subsequences.
  • Make separate versions of Parser and ParserInput just for text. This should speed things up considerably since they can work directly with String.CharacterView. It would also fix the problem mentioned above where parsing arrays of characters return strings.

So ultimately FootlessParser would become less like Haskell and more like Swift 2.0. Which is a good thing. When in Rome etc.

Conclusion

I’m quite pleased with how this project turned out, as I learned a lot from it and that was always the main goal. But I’m not sure I would recommend it for use in production code as it is. For one thing it’s slow. Very slow. Parsing a 361 kB CSV file using the parser in the example above takes over 6 seconds, whereas naoty/SwiftCSV does it in about 0.2 seconds. Sure my parser handles quotes and multi-line fields, but 30 times faster is still a lot.

But for things like automation and prototyping it could definitely be useful.

Links