Khanlou | Regexes vs Combinatorial Parsing

December 3, 2019

Regexes vs Combinatorial Parsing

Recently, I’ve been working on a music app that needs to get a musical sequence (like a melody) from the server to the client. To do this, we use a format called ABC. (You can read about how ABC music notation works here.) We chose ABC because it’s more readable than MIDI, it’s pretty concise, and it’s a standard, so we wouldn’t be re-inventing some format over JSON.

ABC is a simple format, and it does a great job of representing music in strings. For example, here’s the little bit from Beethoven’s “Für Elise” that everyone knows:

e ^d e d e B =d c A2

The letters correspond to specific pitches, the characters before the letters represent accidentals (sharps and flats), and the numbers after them represent the duration that that note should be played for. ABC has a lot more features (it supports rests, ties, beams, barlines, and so on), but this simple example gives you a sense for how the format works.

The app needs to transform the string into some kind of struct that the sequencer knows how to play. In short, we need to parse this string.

Regular Expressions

At first blush, the problem seems easy enough. Split on spaces, and write some regular expressions to convert each note into some kind of value. For note parsing, the expression looks something like this:

^(\\^+|_+|=)?([a-gA-G])(,+|'+)?(\\d*)(\\/)?(\\d*)$

There are 3 main problems with this regular expression.

It’s inscrutable

It is nearly impossible to determine what this regex does by looking at it. Especially for something like ABC, which uses punctuation marks for things like accidentals (^, _, and =) as well as octave adjustments (' and ,), it’s really hard to tell which are literal characters that we are trying to match, and which are regex control characters that tell the pattern to match zero or more of this one, optionally match that one, and so on.

One thing I will grant here is that despite their inscrutability, regexes are one of the tersest languages you can write.

It’s not composable

Another problem with regular expressions is that the constituent components don’t compose easily. For example, our Beethoven sequence above doesn’t include any rests, but they typically look like just like a note with a z instead of the letter of the pitch. So for example, a half rest might look like z/2.

Ideally, we’d like to be able to reuse the duration pattern, but it’s currently trapped inside that giant regex. In order to compose these regexes together, we have to break them apart first, and then apply some string concatenation:

let accidentalPattern = "(\\^+|_+|=)?"
let letterPattern = "([a-gA-G])"
let octaveAdjustmentPattern = "(,+|'+)?"
let durationPattern = "(\\d*)(\\/)?(\\d*)"
let restPattern = "z"

let noteRegex = try! NSRegularExpression(pattern: accidentalPattern + letterPattern + octaveAdjustmentPattern + durationPattern)

let restRegex = try! NSRegularExpression(pattern: restPattern + durationPattern

This is really the only way to get composability with regexes. And of course, because these are just strings that we’re concatenating, there’s no way to know if our concatenation represents a valid transformation of the underlying regular expression because it isn’t checked at compile time.

It requires additional conversion

Lastly, even once we have successfully composed our regex pattern, compiled it into a regular expression, and parsed a string, we still don’t actually have anything useful! We have a bunch of capture groups that we need to operate on to get our final result. For example, the first capture group in the regex gives us our accidental string. This can either be one more carets (^) (sharp), one more underscores (_) (flat), or an equals sign (=) (natural), but the regex doesn’t tell us which it is, so we need to parse it further.

let numAccidentals = match.captureGroups[0].text.count
let accidentalChar = match.captureGroups[0].text.first

switch accidentalChar {
case "^":
    return .accidental(numAccidentals)
case "_":
    return .accidental(-numAccidentals)
case "=":
    return .natural
case "":
    return Accidental.none
default:
    return nil
}

Requiring a manual step where we have to convert the matched substring into our domain object is kind of annoying.

Combinatorial Parsing

My first clue that we needed to look closer at this problem was a function signature. Our strategy for parsing sequences was to consume small pieces of the front of a string. For example, for a note, we’d have something like this:

func consumeNote(abcString: Substring) -> (note: Note?, remainder: Substring)

A function that returns a tuple of (T?, Substring) looked very familiar. I’d seen it referenced in the Pointfree code about combinatorial parsing:

extension Parser {
    run(_ str: String) -> (match: A?, rest: Substring)
}

This is called a combinatorial parser because it relates to a somewhat obscure part of computer science called combinators. I don’t understand them fully, but I will kick it over to Daniel Steinberg to share some very cool examples.

I figured if we’re trying to parse text and return something in this format anyway, maybe the combinatorial parser could be a good fit. So I did a little research on the topic and started the process of reimplementing our parsing with this new strategy. The crux of the parser is a struct with a single function:

struct Parser<A> {
    let run: (inout Substring) -> A?
}

You give it a string (a substring in this case, for performance reasons) and it consumes characters from the front of the buffer to see if it can construct a generic A from them. If that succeeds, it returns the A. If it fails, it returns nil, ensuring no modifications to the input parameter.

Using parser combinators, it’s possible to write concise, type-safe, composable code that parses strings just as effectively as regexes. For example, here’s how you check for a literal:

static func literal(_ string: String) -> Parser<Void> {
    return Parser<Void>(run: { input in
        if input.hasPrefix(string) {
            input.removeFirst(string.count)
            return ()
        } else {
            return nil
        }
    })
}

If the input has a prefix of the string we’re looking for, drop that many characters off the front and return a valid value (Void in this case). If not, return nil — no match. I won’t go into detail on all the other helpers here — you can find a lot of them in the Pointfree sample code.

Using this helper and others, you can rewrite the accidental parsing from above into something like this:

let accidentalParser: Parser<Accidental> = Parsers.oneOf([
    Parsers.repeating("^").map({ .accidental($0.count) }),
    Parsers.repeating("_").map({ .accidental(-$0.count) }),
    Parsers.literal("=").map({ .natural }),
    Parsers.literal("").map({ .none }),
])

This avoids basically every pitfall that the regex approach has.

First, it’s very readable. You can see what literal characters it’s matching because they’re still strings, but all the regex control characters have changed into named functions, which are approachable for people who aren’t regex experts.

Second, it’s very composable. This parser is made up of smaller parsers, like .repeating or .literal, and is itself composed into bigger parsers, like those for notes, rests, and so on. Here’s the accidental parser in the note parser.

var noteParser: Parser<Note> {
    return Parsers.zip(accidentalParser, pitchParser, durationParser)
        .map({ accidental, pitch, duration in
            return Note(accidental: accidental, pitch: pitch, duration: duration)
        })
}

(zip matches all of the parsers that are given to it, in order, and fails if any of them fail.) In this example, it’s also really great to see how all the types line up cleanly. You get a lot of assurance from the compiler that everything is in order.

The last benefit it has over regex is that it produces the actual value that you’re expecting, an Accidental, as opposed to substring capture groups that you have to process further. Especially using the map operation, you can transform simple results, like a substring or Void into domain specific results, like Accidental or Note.

The only real place that this loses out to regexes is terseness, and when you add the transformation code that regexes require, it’s basically a wash.

Now, we’ve completely handrolled a parsing solution from scratch, in pure Swift, using only tools like substrings and blocks. Surely regexes, which are highly tuned and written in C, should be able to massively outperform this code in terms of speed?

Wrong. By using this new approach, we were able to get 25% faster parsing across the board. This shocked me, to be honest. I will usually take a small speed hit if it makes the code nicer (assuming it’s not a critical path, not noticable by the user, etc), but in this case, that wasn’t even a concern, since the nicer code was also the faster code. Win-win! I think a lot of the speed benefit comes from using substrings. Trimming off the front of a substring is effectively free, which makes a lot of these operations super fast.

There are still a few areas which I’d like to learn more about, like unparsing (what if we want to turn our sequence struct back into an ABC string), and how parsers perform when the needle is in an arbitrary place in the haystack, but it’s going to be pretty hard to get me to go back to regexes after these results.