Categories
Binary World Code Work

First Steps with Boost::Spirit

Imagine there were some expres­sion pseudo language you need to convert into JavaS­cript or some other, pro­pri­et­ary language. That was the task at hand some months ago for which I wrote a Regular Expres­sions-based parser. At first it sounded like a good idea: no nested function calls and all quite simple boolean expres­sions. But then there came nested function calls the need for more elaborate trans­form­a­tions than VARIABLE=5 to VARIABLE===5.

This is when you realise that a parser cannot really be avoided any more. But what if Lexx and Yacc are not your favourite toolset? And ANTLR does not have the best possible support coverage for C++ (but still is a very convenien parser generator)? Ah right, we’re talking C++, so that’s the baseline here.

After a bit of digging and Duck­DuckGo’ing, I found Boost Spirit. Yes, there’s a parser generator in Boost, and even a header-only imple­ment­a­tion! After well more than 10 years using Boost, I was quite surprised. Boost Spirit makes it easy to create attribute grammars and even gen­er­at­ors. So perfect for the task at hand!

What’s the task again? Well, it’s the following: create a trans­form­a­tion from an expres­sion pseudo language to JavaS­cript. The language looks a bit like the following:

( FIELD1 = EMPTY OR FIELD1 > 1 ) AND FIELD2 ONENOTIN ( EMPTY )
NUMFIELD1 = 0 AND ( NUMFIELD2 = EMPTY ODER NUMFIELD3 = EMPTY ODER NUMFIELD4 = EMPTY )
FIELD5 IN SomeSetOfValues AND NUMFIELD5 IN ( 1; 2 )
NUMFIELD6 IN ( 2 ) AND NUMFIELD7 NONEIN ( 88, 89, 90, 91, 92, 93 ) AND NUMFIELD8 = 1
(NUMFIELD9 / ( NUMFIELD10 * NUMFIELD10 ) ) * 10000 >= 30 AND FIELD6 ONEIN ( 10 ) NUMFIELD11 > 0 AND FIELD7 NONEIN ( 20 ) AND NOT FIELD8 ALLIN ( EMPTY )
someFunction( calculateSomethingElse( FIELD9 ), FIELD10 ) > someOtherFunction( FIELD11 )

Quite weird, right? Also note the different list sep­ar­at­ors and some strange checks for the empty set. But that’s not what the language parser needs to take of. Well, even though the language is quite simple and easy to read, it’s at least as difficult as a math­em­at­ic­al expres­sion parser. But that’s basically it.

So let’s begin. I’ll outline the main learnings and what is needed for a Boost Spirit-based grammar. In a second article, I might also try to introduce the generator part. But let’s for now focus on the gen­er­a­tion of an abstract syntax tree (AST) which can be post­pro­cessed if required.

I’ve decided to split the grammar into the following main parts:

  • The Expres­sion rule starts with an addition term.
  • Term rules are math­em­at­ic­al terms made up of the four main oper­a­tions (add, sub, mul, div), e.g. NUMFIELD10 * NUMFIELD10.
  • Equation rules combine math­em­at­ic­al equations (lt, le, gt, ge, eq), set oper­a­tions, and negation, e.g. NUMFIELD7 NONEIN ( 88, 89, 90, 91, 92, 93 ).
  • Rule rules are basically boolean expres­sions (and, or), e.g. ( FIELD1 = EMPTY OR FIELD1 <> 1 ).

Then we also have the basic atoms like variable names, function names, numbers, lists, strings, dates, and other atoms. I cannot post the complete imple­ment­a­tion here but I try to walk you through the basics to get started. Boost Spirit exploits C++ operator over­load­ing to a great extend and so creates it’s own domain-specific language (DSL) which emulates the EBNF form. Once you got used to how it behaves, it’s really straight­for­ward to write even complex (attribute) grammars.

So now, it’s get started. What’s required? We’ll include a bunch of headers which will add the func­tion­al­ity we need:

#include <boost/config/warning_disable.hpp>

#include <boost/spirit/include/qi.hpp>

#include <boost/spirit/include/phoenix_core.hpp>
#include <boost/spirit/include/phoenix_operator.hpp>
#include <boost/spirit/include/phoenix_fusion.hpp>
#include <boost/spirit/include/phoenix_stl.hpp>

#include <boost/fusion/include/adapt_struct.hpp>

#include <boost/variant/recursive_variant.hpp>

#include <array>
#include <vector>

The boost::phoenix headers are required for attribute binding. Boost Fusion makes C/C++ struc­tures first class citizens in Boost Spirit.

The first couple of defines cover the atoms in our language.

using qi::int_;
using qi::uint_parser;
using qi::lit;
using qi::lexeme;

using namespace qi::labels;

using ascii::space;
using ascii::char_;
using ascii::digit;

using phoenix::at_c;

qi::rule< Iterator, std::string(), ascii::space_type > SIGN;
qi::rule< Iterator, variable(), ascii::space_type > VARIABLE;
qi::rule< Iterator, std::string(), ascii::space_type > NAME;
qi::rule< Iterator, std::string(), ascii::space_type > NUMBER;
qi::rule< Iterator, std::string(), ascii::space_type > LIST_ELEMENT;
qi::rule< Iterator, std::string(), ascii::space_type > STRING;

SIGN %= -( lit('+') | lit('-') ); // Number sign
VARIABLE = lexeme[ char_("a-zA-Z_") [at_c<0>(_val) = _1] >> *( char_("a-zA-Z_0-9") [at_c<0>(_val) += _1] ) ]; // Typical variable names [a-zA-Z_][a-zA-Z_0-9]*
NAME %= lexeme[ char_("a-zA-Z_") >> *( char_("a-zA-Z_0-9") ) ];
NUMBER %= -SIGN >> lexeme[ +digit >> -( char_(',')[_val += '.'] >> +digit ) ]; // Numbers with German decimal point (comma)
LIST_ELEMENT %= lexeme[ +(char_ - '(' - ')' - ';' - ',')];
STRING = lit('\'') >> lexeme[ +(char_ - '\'') ] >> lit('\'');
ATOM %= VARIABLE | NUMBER | STRING;

The lexeme parts make sure that there is no separator involved. The strange [at_c] attrib­utes are where the captures are assigned to the AST data struc­tures. In order to mark the variable, for example, as a special entity, I’ve decided to provide a dedicated structure for it:

nsmespace myns {
  struct variable { std::string name; };
}

Simple but efficient. In the post­pro­cessing and gen­er­a­tion steps this will be handled dif­fer­ently than other names such as function names. But we cannot just use that structure in boost::spirit just like that. Currently, there’s real no notion of “first element” or “n‑th element” of the structure, This is, however used by the odd at_c<n>(...) func­tion­al­ity: it binds the output attribute of the respect­ive capture to the given field. The field may either be a variable (syntax then reads like ref(v) where v is the variable), a member variable of a structure, or even a member function (the at_c<n>(...) syntax). One could also bind functions or lambdas. We really do have some kind of flex­ib­il­ity here!

The char_ represent a character, digit a digit. The under­score was intro­duced purely because of name clashes with the basic types.

But what is required for the structure variable to be usable in boost::spirit? Not more than the following piece of code defined in the *global* namespace:

BOOST_FUSION_ADAPT_STRUCT(
  myns::variable,
  (std::string, name)
)

It’s basically the same for all the other struc­tures you’ll need. If you have a vector of strings (i.e. std::vector< std::string >) named values you’ll just need to define (std::vector< myns::expression_value >, values). If you want to define a recursive structure using boost::recursive_wrapper or a variant using boost::variant, you just do the same:

nsmespace myns
{
  struct rule;
  boost::variant< boost::recursive_wrapper< rule >, equation_value > rule_value;
  struct rule {
    rule() : operation(-1) {}
    int operation;
    std::vector< rule_value > equations;
  };
}

BOOST_FUSION_ADAPT_STRUCT(
  myns::rule,
  (int, operation)
  (std::vector< myns::rule_value >, equations)
)

And how do the remaining rules look like? Basically just like the atom defin­i­tions above. I’ve decided to split up the ADD/SUB and MUL/DIV rules as my AST forms an non-binary expres­sion tree (i.e. for one operation, more than two elements are listed like, e.g. +(a b c)) and wrapping ADD and SUB grammars in one rule would not work here. We need to generate a new node the the AST for each operation.

my_expression = my_term_add_sub [push_back(at_c<1>(_val), _1)];

my_term_add_sub %= my_term_add | my_term_sub | my_term_mul_div;
my_term_add = // a+b, ...
my_term_mul_div [push_back(at_c<1>(_val), _1)]
  >> +(
    lit('+') [at_c<0>(_val) = 0]
    >> (
      my_term_sub [push_back(at_c<1>(_val), _1)] |
      my_term_mul_div [push_back(at_c<1>(_val), _1)]
    )
  );
my_term_sub = // The same as for add but reverse (don't call my_term_sub but my_term_add)

my_term_mul_div %= my_term_mul | my_term_div | my_term;
my_term_mul = // a*b, ...
my_term [push_back(at_c<1>(_val), _1)]
  >> +(
    lit('*') [at_c<0>(_val) = 2]
    >> (
      my_term_div [push_back(at_c<1>(_val), _1)] |
      my_term [push_back(at_c<1>(_val), _1)]
    )
  );
my_term_div = // The same as for add but reverse (don't call my_term_div but my_term_mul)

my_term_set =
  lit("LEER") |
  lit('(') >> lit("LEER") >> lit(')') |
  (
    lit('(')
    >> LIST_ELEMENT [push_back(_val, _1)] % char_(",;")
    >> lit(')')
  );
my_term_sub_expression =
  lit('(')
  >> my_expression [_val = _1]
  >> lit(')');
my_term_function =
  NAME [at_c<0>(_val) = _1]
  >> lit('(')
  >> my_equation [push_back(at_c<1>(_val), _1)] % char_(",;")
  >> lit(')');
my_term %=
  my_term_function |
  my_term_sub_expression |
  my_term_set |
  ATOM;

The AST expres­sion tree is modelled as follows:

namespace myns
{
  struct equation;
  typedef boost::recursive_wrapper< equation > equation_value;
  struct func {
    std::string name;
    std::vector< equation_value > arguments;
  };

  struct expression;
  typedef boost::variant< boost::recursive_wrapper< expression >,
    std::vector< std::string >,
    variable,
    func,
    std::string > expression_value;

  struct expression {
    expression() : operation(-1) {}
    int operation;
    std::vector< expression_value > values;
  };

  struct equation {
    equation() : negate(false), operation(-1) {}
    bool negate;
    expression_value left;
    int operation;
    expression_value right;
  };
}

BOOST_FUSION_ADAPT_STRUCT(
  qslang::func,
  (std::string, name)
  (std::vector< qslang::equation_value >, arguments)
)
BOOST_FUSION_ADAPT_STRUCT(
  qslang::expression,
  (int, operation)
  (std::vector< qslang::expression_value >, values)
)
BOOST_FUSION_ADAPT_STRUCT(
  qslang::equation,
  (bool, negate)
  (qslang::expression_value, left)
  (int, operation)
  (qslang::expression_value, right)
)

I’ll leave this uncom­men­ted. It’s basically EBNF with attrib­utes and the data struc­tures should be fairly self-explan­at­ory (haha; I’m just lazy, agreed). Sadly, I cannot post the complete source and maybe it’s not even the best way to model the grammar. But it should help inter­ested coders getting the idea.