How to parse a C-like language with lexy

Introduction: what is the difference between a parser combinator and a parser generator?

Parsing is not easy.

What is parsing?

Parsing is (on average) building a parse tree from a text input. When one wants to build a programming language, parsing is usually called the front-end. The backend can be many other things: outputting an executable, using JIT, intermediate representation, web assembly, executing your program directly with an interpreter, or compiling it into a bytecode format... this article will only cover front-end/parsing.

What is a parser generator?

A parser generator will output a source code which will be your parser. Parser generators are usually popular (example: ANTLR), but they are a bit complicated to use in practice.

What is a parser combinator?

A parser combinator is given a grammar (usually BNF or eBNF or variants) and a text input, and will output a parse tree directly.

Parser combinators are simpler to use in general, but handling BNF grammars is not trivial and not optimal.

What is lexy?

Lexy is a parser combinator, but it doesn't use a grammar format like BNF: instead, the grammar is written in C++ directly, with templates. A big advantage is performance.

Its author explained Lexy in a 1hr zoom presentation, and also provides extensive tutorials and documentation.


What sort of language are you parsing?

I had quite an atypical project:

Important note: I did not finish this project, I only use lexy for parsing.


Python indentation

Python indentation cannot be parsed by a parser combinator, because it is not context-free.

Solving this problem is trivial enough: generate indent, dedent and new line tokens with python code:

for (diffprev, diffnext, line, spaces) in lines:
    lines_tokens.append((
        line,
        diffprev, diffnext,
        1 if diffprev == 1 else 0, # indent
        diffnext if diffnext else 0, # dedent
        spaces,
        diffnext >= 0
        # True
            if diffnext != None and line
            else False, # endline
        ))

Full indent module

This transforms the left side into the right side:

if nope:                             ➡️   if nope:
    hum = cally(1,2)                 ➡️   __INDENT__
    duh = (43, fcall(2.4,(abc, 5)))  ➡️       hum = cally(1,2) __ENDLINE__
    while 1:                         ➡️       duh = (43, fcall(2.4,(abc, 5))) __ENDLINE__
        break                        ➡️       while 1:
                                     ➡️       __INDENT__
    return 32+2*4                    ➡️           break __ENDLINE__
                                     ➡️
                                     ➡️       __DEDENT__
                                     ➡️       return 32+2*4 __ENDLINE__
                                     ➡️
                                     ➡️   __DEDENT__

I am planning to rewrite this part in C++.


Lexy grammar

How does lexy work?

There are several core concepts in lexy:

To quote the tutorial:

A grammar consists of productions. Each production defines a rule, which controls what is being parsed, and produces a value. In lexy, productions are structs with a static constexpr member called rule. The rule is defined by the DSL: lexy provides simple rules that can be composed to parse more complex things. A grammar consists of multiple of those structs; [...]

Again, I recommend to read the tutorial, and play around with the repl, which allows to test lexy without compiling it.

The grammar

Right now, my grammar fits in about 350 lines:

floats and ints

Lexy cannot differentiate floating points from integer out-of-the-box: .0, 0. and 0.0 are problematic. To remedy this, a number rule is introduced to differentiate.

expressions

Expressions are common constructs in programming languages. Lexy has lexy::expression_production which does one part of the job:

struct expression : lexy::expression_production {

parenthesized expressions

Since I like python, I want to parse tuples and translate them as structs. The idea is weird and my language doesn't work yet, but I can still parse those parenthesized expression if I want to!

struct expression : lexy::expression_production {
    struct paren_expr {
        static constexpr auto rule =
            dsl::parenthesized.list(dsl::p<struct expression>, dsl::sep(dsl::comma));
    };

atom

The atom is how the expression will be parsed. p<> means production:

static constexpr auto atom = [] {
    return
        dsl::p<paren_expr>
        | dsl::p<identifier>
        | dsl::p<string_literal>
        | dsl::p<number>
        | dsl::error<expected_operand>;
}();

differentiate between functions and tuples

Function and tuples are similar: (1,3) is a tuple, but thing(1,3) is a function call. To differentiate, we use a dsl::postfix_op and a prefix : dsl::prefix_op:

struct call : dsl::postfix_op {
    static constexpr auto op = dsl::op(
        dsl::parenthesized.list(dsl::p<struct expression>, dsl::sep(dsl::comma)));
    using operand = dsl::atom;
};

struct prefix : dsl::prefix_op {
    static constexpr auto op = dsl::op(dsl::lit_c<'-'>);
    using operand = call;
};

math operators

What comes next is just a suite of the common math operators, with their precedence: product sum bit_shift inequality equality bit_and bit_xor bit_or bool_and bool_or equal.

statements

Remember the indent and dedent tokens? Here they are:

struct compound_stmt {
    static constexpr auto rule =
        lexy::dsl::brackets(LEXY_LIT("__INDENT__"),
            LEXY_LIT("__DEDENT__")).list(dsl::recurse<statement>);
};

C semantics have this particularity between statements and expression, so there is the "expression statement":

struct expression_stmt {
    static constexpr auto rule =
        dsl::p<expression> +LEXY_LIT("__ENDLINE__");
};

Combining statements:

struct statement {
    static constexpr auto whitespace = dsl::ascii::space;
    static constexpr auto rule =
        dsl::p<compound_stmt>
        | dsl::p<if_stmt>
        | dsl::p<while_stmt>
        | dsl::p<for_stmt>
        | dsl::p<jump_stmt>
        | dsl::else_ >> dsl::p<expression_stmt>
        ;
};

And we're done! That is the grammar!

Of course, you can look at the code for the full grammar.

The parse tree

That code:

func sunrungunbunpunfunjunction(roubidou:char, i: int) (float, char):
    if nope:
        hum = cally(1,2)
        duh = (43, fcall(2.4,(abc, 5)))
    while 1:
        continue
        return
        return 32-2*4%3

The code is gibberish and makes no sense: all we did is parse it, but I want to use a C compiler as a backend, so for now, it doesn't matter.

Lexy doesn't output to json, so I wrote some code to output json. I added a step to make it a bit more compact, but it still has 17 times as many line as the original source:

{
 "func_decl": [{
   "func_signature": [
    "func",
    {
     "identifier": [
      "sunrungunbunpunfunjunction"
     ]},
    {
     "func_args": [{
       "decl_ltr": [{
         "identifier": [
          "roubidou"
         ]},
        {
         "base_type": [
          "char"
         ]}]},
      {
       "decl_ltr": [{
         "identifier": [
          "i"
         ]},
        {
         "base_type": [
          "int"
         ]}]}]},
    {
     "return_values": [{
       "base_type": [
        "float"
       ]},
      {
       "base_type": [
        "char"
       ]}]}]},
  {
   "compound_stmt": [{
     "statement": [{
       "if_stmt": [
        "if",
        {
         "expression": [{
           "identifier": [
            "nope"
           ]}]},
        {
         "compound_stmt": [{
           "statement": [{
             "expression_stmt": [{
               "expression": [{
                 "expression::equal": [{
                   "identifier": [
                    "hum"
                   ]},
                  "=",
                  {
                   "expression::call": [{
                     "identifier": [
                      "cally"
                     ]},
                    {
                     "expression": [{
                       "number": [
                        "1"
                       ]}]},
                    {
                     "expression": [{
                       "number": [
                        "2"
                       ]}]}]}]}]}]}]},
          {
           "statement": [{
             "expression_stmt": [{
               "expression": [{
                 "expression::equal": [{
                   "identifier": [
                    "duh"
                   ]},
                  "=",
                  {
                   "expression::paren_expr": [{
                     "expression": [{
                       "number": [
                        "43"
                       ]}]},
                    {
                     "expression": [{
                       "expression::call": [{
                         "identifier": [
                          "fcall"
                         ]},
                        {
                         "expression": [{
                           "number": [
                            "2",
                            ".",
                            "4"
                           ]}]},
                        {
                         "expression": [{
                           "expression::paren_expr": [{
                             "expression": [{
                               "identifier": [
                                "abc"
                               ]}]},
                            {
                             "expression": [{
                               "number": [
                                "5"
                               ]}]}]}]}]}]}]}]}]}]}]}]}]}]},
    {
     "statement": [{
       "while_stmt": [
        "while",
        {
         "expression": [{
           "number": [
            "1"
           ]}]},
        {
         "compound_stmt": [{
           "statement": [{
             "jump_stmt": [{
               "continue_stmt": [
                "continue"
               ]}]}]},
          {
           "statement": [{
             "jump_stmt": [{
               "return_stmt_value": [
                "return"
               ]}]}]},
          {
           "statement": [{
             "jump_stmt": [{
               "return_stmt_value": [
                "return",
                {
                 "expression": [{
                   "expression::sum": [{
                     "number": [
                      "32"
                     ]},
                    "-",
                    {
                     "expression::product": [{
                       "expression::product": [{
                         "number": [
                          "2"
                         ]},
                        "*",
                        {
                         "number": [
                          "4"
                         ]}]},
                      "%",
                      {
                       "number": [
                        "3"
                       ]}]}]}]}]}]}]}]}]}]}]}]}

Walking the parse with python 3.10's new match case

Python 3.10 adds a match-case syntax, which is quite useful for walking a parse tree.

PEP 636 – Structural Pattern Matching: Tutorial

For example, here is how we match func_signature:

match entry:
    case {'func_signature':['func', idtf, func_args, return_values]}:
        walk(return_values)
        walk(idtf)
        walk(func_args)

parse tree walker

It generates this output, which looks a bit more like C code, but we are not there yet!

sunrungunbunpunfunjunctionroubidou char;i int;{
  if(nope) {
    hum=cally(1, 2);
    duh=(43, fcall(2.4, (abc, 5)));
  }
  while(1) {
    continue;
    return;
    return 32-2*4%3;
  }
}