This is fantastic work. I believe learning to write a compiler is similar to lea...

shakna · on Jan 8, 2020

> I'm curious why BNF was chosen vs EBNF. I'm new to parsers and grammar. Isn't EBNF easier/simpler to write complex rules?

I can't speak to the specific reasons the OP chose, but I can speak to generalities.

Generally speaking, every parser engine out there that supports EBNF supports its own set of extensions to EBNF [+]. Grammars aren't portable from one engine to another, and that can lock you in. So if you get frustrated with the engine at some point and want to ditch it, that becomes harder to do.

BNF does make it harder to write complex rules, but if your goal is a self-hosted compiler, like this one appears to be, you might use BNF for the host compiler, and then choose something more expressive once you can use the language you've built.

There are tradeoffs in everything. BNF tends to be faster, and more portable with more documentation. EBNF handles more complex cases, but you might have to learn one specific tool rather than a standard you can use everywhere.

[+] There is an EBNF standard. However, parser/lexer engines still extend it in their own incompatible ways. ISO tried to standardise it, and just ended up adding to the chaos, and now even they don't recommend their own.

chenhan · on Jan 8, 2020

For the curious, the ISO standard is ISO/IEC 14977. It is available here: https://standards.iso.org/ittf/PubliclyAvailableStandards/

Direct link to the standard: https://standards.iso.org/ittf/PubliclyAvailableStandards/s0...

But of course, as shakna commented, it isn't a standard that is followed in practice. This is just a demonstration of https://xkcd.com/927/ in real life.

cwzwarich · on Jan 8, 2020

> I'm curious why BNF was chosen vs EBNF.

He wrote a BNF grammar, but his parser is closer to a handwritten parser for an EBNF grammar. Generally speaking, when people write recursive descent parsers by hand in a procedural style, they parse sequences directly rather than using a recursive descent following the grammar.
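To make that distinction concrete, here is a minimal sketch, assuming a made-up grammar for a comma-separated list of names (it is not taken from the OP's project, and the function names are hypothetical). The BNF form would be `list ::= NAME | NAME "," list`; the EBNF form would be `list ::= NAME { "," NAME }`. The first function recurses exactly as the BNF rule does, the second parses the sequence directly with a loop, which is what most handwritten parsers end up doing.

```python
def parse_list_bnf(tokens, pos=0):
    """Follows the BNF rule  list ::= NAME | NAME "," list  literally."""
    name = tokens[pos]
    pos += 1
    if pos < len(tokens) and tokens[pos] == ",":
        rest, pos = parse_list_bnf(tokens, pos + 1)  # recurse on the tail
        return [name] + rest, pos
    return [name], pos


def parse_list_loop(tokens, pos=0):
    """Follows the EBNF rule  list ::= NAME { "," NAME }  with a loop."""
    names = [tokens[pos]]
    pos += 1
    while pos < len(tokens) and tokens[pos] == ",":
        names.append(tokens[pos + 1])  # the NAME after the comma
        pos += 2
    return names, pos


tokens = ["a", ",", "b", ",", "c"]
print(parse_list_bnf(tokens))
print(parse_list_loop(tokens))
```

Both functions return the same names and the same end position for well-formed input; neither does any error handling, so treat this as an illustration only.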

e19293001 · on Jan 8, 2020

Compiler Construction Using Java, JavaCC, and Yacc Book by Anthony J. Dos Reis

It uses a similar approach, with much more detail on theory and implementation. I learned so much from this book. It teaches you to write a compiler by hand as well as with tools like JavaCC and Yacc. But I really love how the author explains the nuts and bolts of creating a handwritten compiler. Get this book if you're interested in creating your own parser/compiler/interpreter. Improve your programming skills by doing the exercises at the end of each chapter. I promise, you'll learn so much from this book. Good luck.

pjmlp · on Jan 8, 2020

JavaCC was my favorite way of playing with compilers back in early 2000. Sadly, it appears not to be developed any longer, with a couple of forks floating around the Internet.

Any idea what its actual state is?

e19293001 · on Jan 8, 2020

github repo is very active: https://github.com/javacc/javacc

pjmlp · on Jan 8, 2020

Thanks. I wasn't sure if that was the right one.

kazinator · on Jan 8, 2020

ISO C doesn't use EBNF, so why would you use it in constructing a compiler?

The ISO C grammar is even factored out to eliminate ambiguities without resorting to precedence rules; it has nodes like "multiplicative-expression", "additive-expression" and such.

You would have to convert that to EBNF and maintain it.
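As a rough illustration of how that factoring encodes precedence, here is a sketch of a simplified two-level fragment (the real ISO C grammar has many more levels, such as cast-expression and unary-expression). The `Parser` class, `tokenize`, and `parse` are names made up for this example. Each grammar level gets its own function, and because a handwritten recursive descent parser cannot follow a left-recursive rule directly, the left recursion is written as a loop.

```python
import re


def tokenize(text):
    # Integers, the four operators, and parentheses; whitespace is skipped.
    return re.findall(r"\d+|[-+*/()]", text)


class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def advance(self):
        tok = self.peek()
        self.pos += 1
        return tok

    # additive-expression:
    #     multiplicative-expression
    #     additive-expression ("+" | "-") multiplicative-expression
    def additive(self):
        left = self.multiplicative()
        while self.peek() in ("+", "-"):
            op = self.advance()
            left = (op, left, self.multiplicative())
        return left

    # multiplicative-expression:
    #     primary
    #     multiplicative-expression ("*" | "/") primary
    def multiplicative(self):
        left = self.primary()
        while self.peek() in ("*", "/"):
            op = self.advance()
            left = (op, left, self.primary())
        return left

    def primary(self):
        tok = self.advance()
        if tok == "(":
            inner = self.additive()
            self.advance()  # consume ")"; no error checking in this sketch
            return inner
        return int(tok)


def parse(text):
    return Parser(tokenize(text)).additive()


print(parse("1 + 2 * 3"))
print(parse("8 - 3 - 2"))
```

No precedence table appears anywhere: `*` binds tighter than `+` only because `additive` calls `multiplicative` for its operands, and the loops give left associativity.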

Complex rules are not required in C and they are actually harmful in parser construction, because their output is complex, and has to essentially be parsed again by the semantic actions.

For instance, suppose that in a Yacc-like parser generator, we have a * (star) operator on the right hand side of rules for "zero or more of". Say we have some "decl *" in a rule corresponding to $4. What should the type of that be? It has to be a list of some kind. So now Yacc has to provide a list structure for accessing into these collections. That list structure won't match what the parser wants to build, so the semantic action will be doing silly things like walking over the list of items using the parser generator's data structure, to convert those items to the AST nodes it actually wants.
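A sketch of that mismatch, under loud assumptions: `GeneratorList` stands in for whatever container a hypothetical EBNF-aware generator would hand to the action, and `Decl` and `Block` stand in for the AST the programmer actually wants. None of these names come from a real tool. The first action has to walk the generator's structure just to rebuild its own; the BNF-style actions build the result incrementally, one reduction at a time.

```python
from dataclasses import dataclass, field


@dataclass
class Decl:
    name: str


@dataclass
class Block:
    decls: list = field(default_factory=list)


class GeneratorList:
    """Hypothetical generator-owned container for the values of `decl *`."""

    def __init__(self, values):
        self._values = list(values)

    def items(self):
        return iter(self._values)


# Style 1 (hypothetical EBNF rule):  block : '{' decl* '}'
def block_action_ebnf(matched):
    block = Block()
    for decl in matched.items():  # walk the generator's list...
        block.decls.append(decl)  # ...only to copy into our own structure
    return block


# Style 2 (plain BNF):  decls : /* empty */ | decls decl
def decls_empty_action():
    return Block()


def decls_append_action(block, decl):  # plays the role of $1 and $2
    block.decls.append(decl)
    return block


decl_values = [Decl("x"), Decl("y")]

via_ebnf = block_action_ebnf(GeneratorList(decl_values))

via_bnf = decls_empty_action()
for d in decl_values:  # each iteration stands for one `decls decl` reduction
    via_bnf = decls_append_action(via_bnf, d)

print(via_ebnf == via_bnf)
```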

The parsers generated by classic "Yacc-style" parser generators do not have to construct ASTs at all; that's just one possible use case. A parser can directly evaluate as it is parsing, for instance; a classic parser generator can produce an expression evaluator that doesn't allocate any nodes. The parser skeleton works with a push-down stack and some tables. It invokes programmer-defined rule bodies, which access the symbols corresponding to the rule elements, producing semantic values that bubble up as reductions are made. The semantic actions never have to deal with any sort of complex data structure dictated by the parser generator.
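Here is a toy sketch of that skeleton, assuming a three-rule grammar for sums of numbers. The `ACTION` and `GOTO` tables were worked out by hand for this grammar rather than produced by Yacc, and all names are made up for the example. The rule bodies receive the semantic values of the right-hand side (the equivalents of `$1`, `$2`, `$3`) and return the value for the left-hand side, so the sum is computed during parsing and no tree is ever built.

```python
# Grammar (rule number: production):
#   1: expr -> expr '+' term
#   2: expr -> term
#   3: term -> NUM
RULES = {
    # rule: (left-hand side, right-hand side length, rule body)
    1: ("expr", 3, lambda v: v[0] + v[2]),  # $$ = $1 + $3
    2: ("expr", 1, lambda v: v[0]),         # $$ = $1
    3: ("term", 1, lambda v: v[0]),         # $$ = $1
}

# Hand-built LR tables for the grammar above.
ACTION = {
    (0, "NUM"): ("shift", 3),
    (1, "+"): ("shift", 4),
    (1, "$"): ("accept", None),
    (2, "+"): ("reduce", 2), (2, "$"): ("reduce", 2),
    (3, "+"): ("reduce", 3), (3, "$"): ("reduce", 3),
    (4, "NUM"): ("shift", 3),
    (5, "+"): ("reduce", 1), (5, "$"): ("reduce", 1),
}
GOTO = {(0, "expr"): 1, (0, "term"): 2, (4, "term"): 5}


def evaluate(tokens):
    """tokens: list of (kind, value) pairs, e.g. ("NUM", 2) or ("+", None)."""
    tokens = tokens + [("$", None)]  # end-of-input marker
    states = [0]   # push-down stack of parser states
    values = []    # parallel stack of semantic values
    i = 0
    while True:
        kind, value = tokens[i]
        entry = ACTION.get((states[-1], kind))
        if entry is None:
            raise SyntaxError(f"unexpected {kind!r}")
        op, arg = entry
        if op == "shift":
            states.append(arg)
            values.append(value)
            i += 1
        elif op == "reduce":
            lhs, length, body = RULES[arg]
            result = body(values[-length:])  # run the rule body
            del states[-length:]
            del values[-length:]
            states.append(GOTO[(states[-1], lhs)])
            values.append(result)            # value bubbles up
        else:  # accept
            return values[-1]


print(evaluate([("NUM", 1), ("+", None), ("NUM", 2), ("+", None), ("NUM", 3)]))
```

The only data the rule bodies ever see is a short slice of the value stack, which is the point being made above: nothing in the driver forces a particular tree or list type on the programmer.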

There are going to be issues of parser generator syntax under EBNF. Under BNF, everything in the right hand side of a rule, except for the | operator, is a symbol. We can treat the variants separated by | as separate rules with their own semantic action bodies. Those bodies can refer to the rule symbols simply as $1, $2, $3 .... Simple counting of whitespace-delimited items confirms what number refers to what element of the rule. It's not obvious how this simple ergonomics can be carried over into a version of EBNF augmented with semantic actions. If there is a net loss of readability, then EBNF ironically becomes a burden rather than a boon.

Lastly, where do you see BNF in the project, other than the README.md files? The code uses recursive-descent hand-written parsing techniques.

If you're writing a parser by hand, and documenting the grammar informally, you don't want EBNF, because that increases the distance between your parsing code and its documentation. With a (carefully crafted) BNF spec, we have a shot at achieving decent traceability between the spec and the hand-written recursive descent parsing.

fjfaase · on Jan 8, 2020

With a Yacc-like parser, it is indeed hard to define what the AST should be, but because AST generation is most of the time rather straightforward, you could automate it, by simply specifying some string at the end of a rule and have a generic mechanism for dealing with the option, plus and star-operators. For an example of such a grammar for C have a look at: https://www.iwriteiam.nl/c_gr.txt and for a parser which can interpret this, have a look at: https://www.iwriteiam.nl/MM.html

Only a few people work on developing production compilers, while many more people will have to work on simple parsers at some point during their working life. If performance is not crucial, I believe it is better to use a parsing system that is easy to use and supports EBNF.