Parsing is not easy.
Parsing is, roughly speaking, building a parse tree from a text input. When building a programming language, the parsing part is usually called the front-end. The back-end can be many other things: outputting an executable, using a JIT, an intermediate representation, WebAssembly, executing the program directly with an interpreter, or compiling it into a bytecode format... This article will only cover the front-end/parsing.
A parser generator outputs source code which becomes your parser. Parser generators are quite popular (ANTLR, for example), but they are a bit complicated to use in practice.
A parser combinator is given a grammar (usually BNF, eBNF, or a variant) and a text input, and outputs a parse tree directly, with no code-generation step.
Parser combinators are simpler to use in general, but handling BNF grammars at runtime is neither trivial nor optimal.
Lexy is a parser combinator, but it doesn't use a grammar format like BNF: instead, the grammar is written directly in C++, with templates. A big advantage of this approach is performance.
Its author explained lexy in a one-hour Zoom presentation, and also provides extensive tutorials and documentation.
I had quite an atypical project:
A "cousin" of C in terms of semantics
Python indentation
"quality of life" features like a string type, lists, dicts/hash tables, vector2/3/matrix types, and other things that are lacking in C
compile that language to C:
this is what TypeScript (to JavaScript), cppfront (to C++), and Nim (to C) are doing; the output is C code in this case
I target video game programming, so performance is a concern.
C compiles fast enough
I had no ambition to get involved with a backend: C is "good enough", and it makes it easy to reuse existing C code, which is a bonus.
C semantics are "known territory": I don't want to innovate, I aim for a "conservative" language. In my mind this would improve adoption.
Important note: I did not finish this project; I only use lexy for the parsing part.
Python indentation cannot be parsed by a parser combinator, because it is not context-free: the parser would have to track the current indentation depth.
Solving this problem is simple enough: generate indent, dedent, and end-of-line tokens with Python code:
lines_tokens = []
for (diffprev, diffnext, line, spaces) in lines:
    lines_tokens.append((
        line,
        diffprev, diffnext,
        1 if diffprev == 1 else 0,    # indent
        diffnext if diffnext else 0,  # dedent
        spaces,
        diffnext >= 0
        if diffnext is not None and line
        else False,                   # endline
    ))
This transforms this input:

if nope:
    hum = cally(1,2)
    duh = (43, fcall(2.4,(abc, 5)))
    while 1:
        break

    return 32+2*4

into this token stream:

if nope:
__INDENT__
hum = cally(1,2) __ENDLINE__
duh = (43, fcall(2.4,(abc, 5))) __ENDLINE__
while 1:
__INDENT__
break __ENDLINE__

__DEDENT__
return 32+2*4 __ENDLINE__

__DEDENT__
I am planning to rewrite this part in C++.
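As a rough idea of what that rewrite could look like, here is a minimal sketch (my assumptions, not finished code: 4-space indentation, and block openers are exactly the lines ending with ':'):

#include <sstream>
#include <string>
#include <vector>

// Sketch only: emit __INDENT__/__DEDENT__/__ENDLINE__ markers from
// indentation depth, mirroring the Python pass above.
std::string tokenize_indentation(const std::string& source, int indent_width = 4) {
    std::istringstream in(source);
    std::ostringstream out;
    std::vector<int>   levels{0};
    std::string        line;
    while (std::getline(in, line)) {
        auto first = line.find_first_not_of(' ');
        if (first == std::string::npos) { // blank line: keep it, no tokens
            out << '\n';
            continue;
        }
        int level = int(first) / indent_width;
        if (level > levels.back()) {
            levels.push_back(level);
            out << "__INDENT__\n";
        }
        while (level < levels.back()) {
            levels.pop_back();
            out << "__DEDENT__\n";
        }
        auto body = line.substr(first);
        if (body.back() == ':') // block opener: no __ENDLINE__
            out << body << '\n';
        else
            out << body << " __ENDLINE__\n";
    }
    while (levels.back() > 0) { // close any blocks still open at EOF
        levels.pop_back();
        out << "__DEDENT__\n";
    }
    return out.str();
}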
There are several core concepts in lexy:
To quote the tutorial:
A grammar consists of productions. Each production defines a rule, which controls what is being parsed, and produces a value. In lexy, productions are structs with a static constexpr member called rule. The rule is defined by the DSL: lexy provides simple rules that can be composed to parse more complex things. A grammar consists of multiple of those structs; [...]
Again, I recommend reading the tutorial and playing around with the REPL, which lets you test lexy grammars without compiling anything.
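To make the "productions are structs" idea concrete, here is a self-contained toy production (a minimal example for illustration, not part of my grammar):

#include <lexy/action/match.hpp>
#include <lexy/dsl.hpp>
#include <lexy/input/string_input.hpp>

namespace dsl = lexy::dsl;

// A production is a struct with a static constexpr member called rule.
struct word {
    static constexpr auto rule = dsl::identifier(dsl::ascii::alpha);
};

bool is_word(const char* str) {
    // lexy::match reports whether the production matches the input.
    return lexy::match<word>(lexy::zstring_input(str));
}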
Right now, my grammar fits in about 350 lines:
Lexy cannot differentiate floating-point numbers from integers out of the box: .0, 0., and 0.0 are all problematic. To remedy this, a number rule is introduced to differentiate them.
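Something along these lines works (a sketch of the idea, not my exact rule):

struct number {
    static constexpr auto rule = [] {
        auto digits   = dsl::capture(dsl::digits<>);
        auto fraction = dsl::capture(dsl::period) >> dsl::opt(digits);
        // "0", "0." and "0.0" start with digits; ".0" starts with the period.
        return dsl::peek(dsl::digits<>) >> digits + dsl::opt(fraction)
             | dsl::peek(dsl::period) >> fraction;
    }();
};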
Expressions are common constructs in programming languages. Lexy has lexy::expression_production, which does one part of the job:
struct expression : lexy::expression_production {
Since I like Python, I want to parse tuples and translate them to structs. The idea is weird and my language doesn't work yet, but I can still parse those parenthesized expressions if I want to!
struct expression : lexy::expression_production {
    struct paren_expr {
        static constexpr auto rule =
            dsl::parenthesized.list(dsl::p<struct expression>, dsl::sep(dsl::comma));
    };
The atom is the smallest unit an expression is built from; dsl::p<> means production:
static constexpr auto atom = [] {
    return dsl::p<paren_expr>
         | dsl::p<identifier>
         | dsl::p<string_literal>
         | dsl::p<number>
         | dsl::error<expected_operand>;
}();
Functions and tuples are similar: (1,3) is a tuple, but thing(1,3) is a function call. To differentiate them, we use a dsl::postfix_op for the call, and a dsl::prefix_op for the unary minus:
struct call : dsl::postfix_op {
    static constexpr auto op = dsl::op(
        dsl::parenthesized.list(dsl::p<struct expression>, dsl::sep(dsl::comma)));
    using operand = dsl::atom;
};

struct prefix : dsl::prefix_op {
    static constexpr auto op = dsl::op(dsl::lit_c<'-'>);
    using operand = call;
};
What comes next is just the usual chain of common math operators, with their precedence: product, sum, bit_shift, inequality, equality, bit_and, bit_xor, bit_or, bool_and, bool_or, equal.
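Each of these is another small struct nested in the expression production, in the same style as call and prefix above. As a sketch (the exact operator sets are my reconstruction), the first two levels look like this:

struct product : dsl::infix_op_left {
    static constexpr auto op = dsl::op(dsl::lit_c<'*'>)
                             / dsl::op(dsl::lit_c<'/'>)
                             / dsl::op(dsl::lit_c<'%'>);
    using operand = prefix; // prefix binds tighter than product
};

struct sum : dsl::infix_op_left {
    static constexpr auto op = dsl::op(dsl::lit_c<'+'>) / dsl::op(dsl::lit_c<'-'>);
    using operand = product; // product binds tighter than sum
};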
Remember the indent and dedent tokens? Here they are:
struct compound_stmt {
    static constexpr auto rule =
        dsl::brackets(LEXY_LIT("__INDENT__"),
                      LEXY_LIT("__DEDENT__")).list(dsl::recurse<statement>);
};
C semantics make a particular distinction between statements and expressions, so there is the "expression statement":
struct expression_stmt {
    static constexpr auto rule =
        dsl::p<expression> + LEXY_LIT("__ENDLINE__");
};
Combining statements:
struct statement {
    static constexpr auto whitespace = dsl::ascii::space;
    static constexpr auto rule =
          dsl::p<compound_stmt>
        | dsl::p<if_stmt>
        | dsl::p<while_stmt>
        | dsl::p<for_stmt>
        | dsl::p<jump_stmt>
        | dsl::else_ >> dsl::p<expression_stmt>;
};
And we're done! That is the grammar!
Of course, you can look at the code for the full grammar.
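For reference, driving the grammar and getting a parse tree out of lexy looks roughly like this (a sketch; the file name is a placeholder, and I use statement as the entry production):

#include <cstdio>
#include <lexy/action/parse_as_tree.hpp>
#include <lexy/input/file.hpp>
#include <lexy/visualize.hpp>
#include <lexy_ext/report_error.hpp>

int main() {
    auto file = lexy::read_file<lexy::utf8_encoding>("input.lang"); // placeholder name
    if (!file)
        return 1;

    lexy::parse_tree_for<decltype(file.buffer())> tree;
    auto result = lexy::parse_as_tree<statement>(tree, file.buffer(),
                                                 lexy_ext::report_error);
    if (!result)
        return 2;

    lexy::visualize(stdout, tree, {lexy::visualize_fancy}); // debug dump
    return 0;
}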
Now, take this code:
func sunrungunbunpunfunjunction(roubidou:char, i: int) (float, char):
    if nope:
        hum = cally(1,2)
        duh = (43, fcall(2.4,(abc, 5)))
    while 1:
        continue
        return
        return 32-2*4%3
The code is gibberish and makes no sense: all we did was parse it. But I want to use a C compiler as a backend, so for now, it doesn't matter.
Lexy doesn't output JSON, so I wrote some code to do it (a sketch of such a dumper follows the output below). I added a step to make it a bit more compact, but it still has 17 times as many lines as the original source:
{
"func_decl": [{
"func_signature": [
"func",
{
"identifier": [
"sunrungunbunpunfunjunction"
]},
{
"func_args": [{
"decl_ltr": [{
"identifier": [
"roubidou"
]},
{
"base_type": [
"char"
]}]},
{
"decl_ltr": [{
"identifier": [
"i"
]},
{
"base_type": [
"int"
]}]}]},
{
"return_values": [{
"base_type": [
"float"
]},
{
"base_type": [
"char"
]}]}]},
{
"compound_stmt": [{
"statement": [{
"if_stmt": [
"if",
{
"expression": [{
"identifier": [
"nope"
]}]},
{
"compound_stmt": [{
"statement": [{
"expression_stmt": [{
"expression": [{
"expression::equal": [{
"identifier": [
"hum"
]},
"=",
{
"expression::call": [{
"identifier": [
"cally"
]},
{
"expression": [{
"number": [
"1"
]}]},
{
"expression": [{
"number": [
"2"
]}]}]}]}]}]}]},
{
"statement": [{
"expression_stmt": [{
"expression": [{
"expression::equal": [{
"identifier": [
"duh"
]},
"=",
{
"expression::paren_expr": [{
"expression": [{
"number": [
"43"
]}]},
{
"expression": [{
"expression::call": [{
"identifier": [
"fcall"
]},
{
"expression": [{
"number": [
"2",
".",
"4"
]}]},
{
"expression": [{
"expression::paren_expr": [{
"expression": [{
"identifier": [
"abc"
]}]},
{
"expression": [{
"number": [
"5"
]}]}]}]}]}]}]}]}]}]}]}]}]}]},
{
"statement": [{
"while_stmt": [
"while",
{
"expression": [{
"number": [
"1"
]}]},
{
"compound_stmt": [{
"statement": [{
"jump_stmt": [{
"continue_stmt": [
"continue"
]}]}]},
{
"statement": [{
"jump_stmt": [{
"return_stmt_value": [
"return"
]}]}]},
{
"statement": [{
"jump_stmt": [{
"return_stmt_value": [
"return",
{
"expression": [{
"expression::sum": [{
"number": [
"32"
]},
"-",
{
"expression::product": [{
"expression::product": [{
"number": [
"2"
]},
"*",
{
"number": [
"4"
]}]},
"%",
{
"number": [
"3"
]}]}]}]}]}]}]}]}]}]}]}]}
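The JSON dumper itself can be structured like this (a sketch, not my exact code): walk the lexy parse tree, emitting a quoted string for each token and a {"production": [children]} object for each production node.

#include <lexy/parse_tree.hpp>
#include <string>

// Sketch: recursive parse-tree-to-JSON dump (string escaping and the
// compacting step are omitted).
template <typename Node>
void dump_json(const Node& node, std::string& out) {
    if (node.kind().is_token()) {
        auto lexeme = node.lexeme();
        out += '"';
        out.append(lexeme.begin(), lexeme.end());
        out += '"';
        return;
    }
    out += "{\"";
    out += node.kind().name(); // production name, e.g. "expression"
    out += "\": [";
    bool first = true;
    for (auto child : node.children()) {
        if (!first)
            out += ", ";
        first = false;
        dump_json(child, out);
    }
    out += "]}";
}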
match case
Python 3.10 added the match-case syntax, which is quite useful for walking a parse tree.
PEP 636 – Structural Pattern Matching: Tutorial
For example, here is how we match func_signature:
match entry:
    case {'func_signature': ['func', idtf, func_args, return_values]}:
        walk(return_values)
        walk(idtf)
        walk(func_args)
It generates this output, which looks a bit more like C code, but we are not there yet!
sunrungunbunpunfunjunctionroubidou char;i int;{
if(nope) {
hum=cally(1, 2);
duh=(43, fcall(2.4, (abc, 5)));
}
while(1) {
continue;
return;
return 32-2*4%3;
}
}