Lately I found myself working on a pet project for which I needed a parser for an assembly-like programming language for a real architecture. Since I wanted to use this opportunity to learn something new, I decided to use a combination of tools, i.e. Flex (for lexing) and Bison (for parsing)... obviously in C++ :)
However I found myself facing a pretty interesting problem for which I could not find a workaround, therefore I set myself with the goal of filling that void with this post.
The problem is easily explained, in the language I am parsing (let's call it X for brevity), there is a large amount of opcodes, e.g. add, sub, bra. These operations have different properties (e.g. accepting different number and type of parameter) and I wanted to have a centralized specification for those properties which I can pull-out as needed throughout my project. This is a typical case where X-macros should be used. However since flex and bison specification files are not a C-like language, I cannot use the preprocessor capabilities. Before we jump to the problem I wanted to recap what X-Macros are and why they are awesome.
Let us consider the following example:
By properly defining the two macros in the opcodes.def file, and thus take advantage of the C preprocessor, we can reshape the information in different ways. As an example consider a function which determines if an argument is valid for a given operation:
As you expect the output of the code above is:
Is 'sub(_,_,_,reg)' valid? false
Is 'add(_,reg,_)' valid? true
Is 'add(_,imm,_)' valid? false
Is 'sub(_,_,imm)' valid? true
Is 'sub(_,_,lab)' valid? false
Cool eh? and think how easy is going to be adding a new opcode or change their semantics. This is quite standard practice in compiler-related project, beware that working with macros can get messy... but what not to love here? If you start to get weird errors by your compiler, sometimes it is instructive to run only through the preprocessor and have a look at the generated code. This can be done in GCC using the -E option (checkout an example for the code above here).
In my experience many programmers consider these meta-programming techniques (i.e., macros and templates) harmful and therefore to avoid like the plague. Of course everyone is entitled to his opinion but in a problem like this one, X-macros can save you an huge amount of time. The generated code is also a lot faster and more memory efficient then any implementation based on lookup tables.
Anyhow I am digressing... I hope all those macro haters have had enough by now and they left this post so that we can start to get things to the next level. Without all those 'C is better than C++' holding us back, can move on solving the actual problem which inspired this post. The idea is to use the the properties of X-macros inside the lexer and parser specification for a language parser. Since we are working in C++, the natural choice of tools is Flex (for lexing) and Bison (for parsing) [another choice could be Boost.Spirit, but I never felt the tool to be mature enough to be practical useful, feel free to comment on this]. For those who are not aware of parser generator tools, these are (meta-)programs which take a language specification (usually in a EBNF form) and produce code to determine whether an input stream matches that specification.
This is not going to be a tutorial on how to use Flex and Bison, there are other tutorials/blogs you can check out for that (Google is your friend). I start by setting up a basic flex and bison specification for parsing our language X. X allows 1 instruction per line, where each instruction is composed by an operator and a list of comma-separated arguments. Labels are allowed as well. For example the following is valid X program:
add r1, r2, r3
sub r1, r1, #10
bra r1, exit
...
exit:
...
Follows the Flex and Bison specification files for the X language:
and
In order to compile this code we need a Makefile which triggers the generation of the lexer and parser. We are going to base our setup on cmake, checkout the CMakeLists.txt file for details. You can test out the code so far by checking out the following github repository (refer to the no_autogen branch).
You may start to notice the problem now. Our opcodes, which we defined in the opcodes.def file, need to be defined (again and again) both in flex and bison specifications. Since these are not C/C++ files, we cannot rely on the preprocessor for generating these rules automatically. This makes adding new opcodes to the X language very cumbersome since many files need to be modified. However there is my solution to the problem.
The idea is to generate temporary C/C++ files used to reshape the information contained in the opcodes.def file in a form that it is suitable for flex and bison. After that we replace the content generated by X-Macros in the language specification. The workflow is not particularly difficult, however we the main effort would be on automatizing this entire process so that specification files are refreshed when needed (i.e. when we lunch make).
Let's start with the Flex specification file (lexer.ll). My solution of the problem is the following. We define a placeholder, i.e. @@{{OPCODES}}@@ (you can choose any name for the placeholder, however make sure that it is unique within your flex specification), which we are going to replace with some content automatically generated by the preprocessor. We also rename the file into lexer.in.ll which we are now going to call the template file. Next thing we need to do is to add to CMake the necessary action for generating and replacing the placeholder with content.
Similar thing can be done for the bison specification file. Note that here things are a little bit more complex since we need to declare opcodes both in the token section (line 28 of the parser.y file) and later in the grammar itself within the parsing rules (line 53 of the parser.y file). The full CMakeLists.txt file is available here.
In order to make replacement of placeholders with the preprocessor output I wrote a simple python script (i.e. xmacro_patcher.py). The script does remove empty lines and comments produced by the preprocessor and collapse the lines if needed. I actually spent lot of time finding out a way of doing this without the need of an additional script. I found a command line which will do the same work but unfortunately the way CMake treats the command line arguments in the COMMAND section made it impossible to use (damn you CMake). Anyway, in the case anyone is curious
And that's really it. We can now add new operators to our opcodes.def, press make... and our parser will now accept the broader language.
C++ <3
Is 'sub(_,_,_,reg)' valid? false
Is 'add(_,reg,_)' valid? true
Is 'add(_,imm,_)' valid? false
Is 'sub(_,_,imm)' valid? true
Is 'sub(_,_,lab)' valid? false
Cool eh? and think how easy is going to be adding a new opcode or change their semantics. This is quite standard practice in compiler-related project, beware that working with macros can get messy... but what not to love here? If you start to get weird errors by your compiler, sometimes it is instructive to run only through the preprocessor and have a look at the generated code. This can be done in GCC using the -E option (checkout an example for the code above here).
In my experience many programmers consider these meta-programming techniques (i.e., macros and templates) harmful and therefore to avoid like the plague. Of course everyone is entitled to his opinion but in a problem like this one, X-macros can save you an huge amount of time. The generated code is also a lot faster and more memory efficient then any implementation based on lookup tables.
Anyhow I am digressing... I hope all those macro haters have had enough by now and they left this post so that we can start to get things to the next level. Without all those 'C is better than C++' holding us back, can move on solving the actual problem which inspired this post. The idea is to use the the properties of X-macros inside the lexer and parser specification for a language parser. Since we are working in C++, the natural choice of tools is Flex (for lexing) and Bison (for parsing) [another choice could be Boost.Spirit, but I never felt the tool to be mature enough to be practical useful, feel free to comment on this]. For those who are not aware of parser generator tools, these are (meta-)programs which take a language specification (usually in a EBNF form) and produce code to determine whether an input stream matches that specification.
This is not going to be a tutorial on how to use Flex and Bison, there are other tutorials/blogs you can check out for that (Google is your friend). I start by setting up a basic flex and bison specification for parsing our language X. X allows 1 instruction per line, where each instruction is composed by an operator and a list of comma-separated arguments. Labels are allowed as well. For example the following is valid X program:
add r1, r2, r3
sub r1, r1, #10
bra r1, exit
...
exit:
...
Follows the Flex and Bison specification files for the X language:
and
In order to compile this code we need a Makefile which triggers the generation of the lexer and parser. We are going to base our setup on cmake, checkout the CMakeLists.txt file for details. You can test out the code so far by checking out the following github repository (refer to the no_autogen branch).
You may start to notice the problem now. Our opcodes, which we defined in the opcodes.def file, need to be defined (again and again) both in flex and bison specifications. Since these are not C/C++ files, we cannot rely on the preprocessor for generating these rules automatically. This makes adding new opcodes to the X language very cumbersome since many files need to be modified. However there is my solution to the problem.
The idea is to generate temporary C/C++ files used to reshape the information contained in the opcodes.def file in a form that it is suitable for flex and bison. After that we replace the content generated by X-Macros in the language specification. The workflow is not particularly difficult, however we the main effort would be on automatizing this entire process so that specification files are refreshed when needed (i.e. when we lunch make).
Let's start with the Flex specification file (lexer.ll). My solution of the problem is the following. We define a placeholder, i.e. @@{{OPCODES}}@@ (you can choose any name for the placeholder, however make sure that it is unique within your flex specification), which we are going to replace with some content automatically generated by the preprocessor. We also rename the file into lexer.in.ll which we are now going to call the template file. Next thing we need to do is to add to CMake the necessary action for generating and replacing the placeholder with content.
Similar thing can be done for the bison specification file. Note that here things are a little bit more complex since we need to declare opcodes both in the token section (line 28 of the parser.y file) and later in the grammar itself within the parsing rules (line 53 of the parser.y file). The full CMakeLists.txt file is available here.
In order to make replacement of placeholders with the preprocessor output I wrote a simple python script (i.e. xmacro_patcher.py). The script does remove empty lines and comments produced by the preprocessor and collapse the lines if needed. I actually spent lot of time finding out a way of doing this without the need of an additional script. I found a command line which will do the same work but unfortunately the way CMake treats the command line arguments in the COMMAND section made it impossible to use (damn you CMake). Anyway, in the case anyone is curious
awk -vf2="$(while read x; do echo $x; done)" '/$1/{print;print f2;next}1' $2
This like will read content from stdin into variable 'f2', then match the input file $2 with the regex provided in $1 and append the content of f2 at that location. The patched file will be retuned on the standard output.
C++ <3