First Steps with Boost::Spirit

First Steps with Boost::Spirit

Imagine there were some expression pseudo language you need to convert into JavaScript or some other, proprietary language. That was the task at hand some months ago for which I wrote a Regular Expressions-based parser. At first it sounded like a good idea: no nested function calls and all quite simple boolean expressions. But then there came nested function calls the need for more elaborate transformations than VARIABLE=5 to VARIABLE===5.

This is when you realise that a parser cannot really be avoided any more. But what if Lexx and Yacc are not your favourite toolset? And ANTLR does not have the best possible support coverage for C++ (but still is a very convenien parser generator)? Ah right, we’re talking C++, so that’s the baseline here.

After a bit of digging and DuckDuckGo‘ing, I found Boost Spirit. Yes, there’s a parser generator in Boost, and even a header-only implementation! After well more than 10 years using Boost, I was quite surprised. Boost Spirit makes it easy to create attribute grammars and even generators. So perfect for the task at hand!

What’s the task again? Well, it’s the following: create a transformation from an expression pseudo language to JavaScript. The language looks a bit like the following:

  1. ( FIELD1 = EMPTY OR FIELD1 > 1 ) AND FIELD2 ONENOTIN ( EMPTY )
  2. NUMFIELD1 = 0 AND ( NUMFIELD2 = EMPTY ODER NUMFIELD3 = EMPTY ODER NUMFIELD4 = EMPTY )
  3. FIELD5 IN SomeSetOfValues AND NUMFIELD5 IN ( 1; 2 )
  4. NUMFIELD6 IN ( 2 ) AND NUMFIELD7 NONEIN ( 88, 89, 90, 91, 92, 93 ) AND NUMFIELD8 = 1
  5. (NUMFIELD9 / ( NUMFIELD10 * NUMFIELD10 ) ) * 10000 >= 30 AND FIELD6 ONEIN ( 10 ) NUMFIELD11 > 0 AND FIELD7 NONEIN ( 20 ) AND NOT FIELD8 ALLIN ( EMPTY )
  6. someFunction( calculateSomethingElse( FIELD9 ), FIELD10 ) > someOtherFunction( FIELD11 )

Quite weird, right? Also note the different list separators and some strange checks for the empty set. But that’s not what the language parser needs to take of. Well, even though the language is quite simple and easy to read, it’s at least as difficult as a mathematical expression parser. But that’s basically it.

So let’s begin. I’ll outline the main learnings and what is needed for a Boost Spirit-based grammar. In a second article, I might also try to introduce the generator part. But let’s for now focus on the generation of an abstract syntax tree (AST) which can be postprocessed if required.

I’ve decided to split the grammar into the following main parts:

  • The Expression rule starts with an addition term.
  • Term rules are mathematical terms made up of the four main operations (add, sub, mul, div), e.g. NUMFIELD10 * NUMFIELD10.
  • Equation rules combine mathematical equations (lt, le, gt, ge, eq), set operations, and negation, e.g. NUMFIELD7 NONEIN ( 88, 89, 90, 91, 92, 93 ).
  • Rule rules are basically boolean expressions (and, or), e.g. ( FIELD1 = EMPTY OR FIELD1 <> 1 ).

Then we also have the basic atoms like variable names, function names, numbers, lists, strings, dates, and other atoms. I cannot post the complete implementation here but I try to walk you through the basics to get started. Boost Spirit exploits C++ operator overloading to a great extend and so creates it’s own domain-specific language (DSL) which emulates the EBNF form. Once you got used to how it behaves, it’s really straightforward to write even complex (attribute) grammars.

So now, it’s get started. What’s required? We’ll include a bunch of headers which will add the functionality we need:

  1. #include < boost/config/warning_disable.hpp >
  2.  
  3. #include < boost/spirit/include/qi.hpp >
  4.  
  5. #include < boost/spirit/include/phoenix_core.hpp >
  6. #include < boost/spirit/include/phoenix_operator.hpp >
  7. #include < boost/spirit/include/phoenix_fusion.hpp >
  8. #include < boost/spirit/include/phoenix_stl.hpp >
  9.  
  10. #include < boost/fusion/include/adapt_struct.hpp >
  11.  
  12. #include < boost/variant/recursive_variant.hpp >
  13.  
  14. #include < array >
  15. #include < vector >

The boost::phoenix headers are required for attribute binding. Boost Fusion makes C/C++ structures first class citizens in Boost Spirit.

The first couple of defines cover the atoms in our language.

  1. using qi::int_;
  2. using qi::uint_parser;
  3. using qi::lit;
  4. using qi::lexeme;
  5.  
  6. using namespace qi::labels;
  7.  
  8. using ascii::space;
  9. using ascii::char_;
  10. using ascii::digit;
  11.  
  12. using phoenix::at_c;
  13.  
  14. qi::rule< Iterator, std::string(), ascii::space_type > SIGN;
  15. qi::rule< Iterator, variable(), ascii::space_type >    VARIABLE;
  16. qi::rule< Iterator, std::string(), ascii::space_type > NAME;
  17. qi::rule< Iterator, std::string(), ascii::space_type > NUMBER;
  18. qi::rule< Iterator, std::string(), ascii::space_type > LIST_ELEMENT;
  19. qi::rule< Iterator, std::string(), ascii::space_type > STRING;
  20.  
  21. SIGN         %= -( lit('+') | lit('-') ); // Number sign
  22. VARIABLE      = lexeme[ char_("a-zA-Z_") [at_c<0>(_val) = _1] >> *( char_("a-zA-Z_0-9") [at_c<0>(_val) += _1] )  ];  // Typical variable names [a-zA-Z_][a-zA-Z_0-9]*
  23. NAME         %= lexeme[ char_("a-zA-Z_") >> *( char_("a-zA-Z_0-9") ) ];
  24. NUMBER       %= -SIGN >> lexeme[ +digit >> -( char_(',')[_val += '.'] >> +digit ) ]; // Numbers with German decimal point (comma)
  25. LIST_ELEMENT %= lexeme[ +(char_ - '(' - ')' - ';' - ',')];
  26. STRING        = lit('\'') >> lexeme[ +(char_ - '\'') ] >> lit('\'');
  27. ATOM         %= VARIABLE | NUMBER | STRING;

The lexeme parts make sure that there is no separator involved. The strange [at_c] attributes are where the captures are assigned to the AST data structures. In order to mark the variable, for example, as a special entity, I’ve decided to provide a dedicated structure for it:

  1. nsmespace myns
  2. {
  3.     struct variable
  4.     {
  5.         std::string name;
  6.     };
  7. }

Simple but efficient. In the postprocessing and generation steps this will be handled differently than other names such as function names. But we cannot just use that structure in boost::spirit just like that. Currently, there’s real no notion of “first element” or “n-th element” of the structure, This is, however used by the odd at_c<n>(...) functionality: it binds the output attribute of the respective capture to the given field. The field may either be a variable (syntax then reads like ref(v) where v is the variable), a member variable of a structure, or even a member function (the at_c<n>(...) syntax). One could also bind functions or lambdas. We really do have some kind of flexibility here!

The char_ represent a character, digit a digit. The underscore was introduced purely because of name clashes with the basic types.

But what is required for the structure variable to be usable in boost::spirit? Not more than the following piece of code defined in the *global* namespace:

  1. BOOST_FUSION_ADAPT_STRUCT(
  2.     myns::variable,
  3.     (std::string, name)
  4. )

It’s basically the same for all the other structures you’ll need. If you have a vector of strings (i.e. std::vector< std::string >) named values you’ll just need to define (std::vector< myns::expression_value >, values). If you want to define a recursive structure using boost::recursive_wrapper or a variant using boost::variant, you just do the same:

  1. nsmespace myns
  2. {
  3.     struct rule;
  4.     boost::variant< boost::recursive_wrapper< rule >, equation_value > rule_value;
  5.     struct rule
  6.     {
  7.         rule() : operation(-1) {}
  8.         int operation;
  9.         std::vector< rule_value > equations;
  10.     };
  11. }
  12. BOOST_FUSION_ADAPT_STRUCT(
  13.     myns::rule,
  14.     (int, operation)
  15.     (std::vector< myns::rule_value >, equations)
  16. )

And how do the remaining rules look like? Basically just like the atom definitions above. I’ve decided to split up the ADD/SUB and MUL/DIV rules as my AST forms an non-binary expression tree (i.e. for one operation, more than two elements are listed like, e.g. +(a b c)) and wrapping ADD and SUB grammars in one rule would not work here. We need to generate a new node the the AST for each operation.

  1. my_expression =
  2.            my_term_add_sub                [push_back(at_c<1>(_val), _1)];
  3.  
  4. my_term_add_sub %= my_term_add | my_term_sub | my_term_mul_div;
  5. my_term_add = // a+b, ...
  6.            my_term_mul_div                [push_back(at_c<1>(_val), _1)]
  7.         >> +(
  8.                   lit('+')                [at_c<0>(_val) = 0]
  9.                >> (
  10.                          my_term_sub      [push_back(at_c<1>(_val), _1)]
  11.                       |  my_term_mul_div  [push_back(at_c<1>(_val), _1)]
  12.  
  13.                   )
  14.             );
  15. my_term_sub = // The same as for add but reverse (don't call my_term_sub but my_term_add)
  16.  
  17. my_term_mul_div %= my_term_mul | my_term_div | my_term;
  18. my_term_mul = // a*b, ...
  19.            my_term                        [push_back(at_c<1>(_val), _1)]
  20.         >> +(
  21.                   lit('*')                [at_c<0>(_val) = 2]
  22.                >> (
  23.                          my_term_div      [push_back(at_c<1>(_val), _1)]
  24.                       |  my_term          [push_back(at_c<1>(_val), _1)]
  25.                   )
  26.             );
  27. my_term_div = // The same as for add but reverse (don't call my_term_div but my_term_mul)
  28.  
  29. my_term_set =
  30.            lit("LEER")
  31.         |  lit('(') >> lit("LEER") >> lit(')')
  32.         |  (
  33.                    lit('(')
  34.                 >> LIST_ELEMENT           [push_back(_val, _1)] % char_(",;")
  35.                 >> lit(')')
  36.            );
  37. my_term_sub_expression =
  38.            lit('(')
  39.         >> my_expression                  [_val = _1]
  40.         >> lit(')');
  41. my_term_function =
  42.            NAME                           [at_c<0>(_val) = _1]
  43.         >> lit('(')
  44.         >> my_equation                    [push_back(at_c<1>(_val), _1)] % char_(",;")
  45.         >> lit(')');
  46. my_term %=
  47.            my_term_function
  48.         |  my_term_sub_expression
  49.         |  my_term_set
  50.         |  ATOM;

The AST expression tree is modelled as follows:

  1. namespace myns
  2. {
  3.     struct equation;
  4.  
  5.     typedef boost::recursive_wrapper< equation > equation_value;
  6.  
  7.     struct func
  8.     {
  9.         std::string name;
  10.         std::vector< equation_value > arguments;
  11.     };
  12.  
  13.     struct expression;
  14.     typedef boost::variant<
  15.                 boost::recursive_wrapper< expression >,
  16.                 std::vector< std::string >,
  17.                 variable,
  18.                 func,
  19.                 std::string
  20.             > expression_value;
  21.  
  22.     struct expression
  23.     {
  24.         expression() : operation(-1) {}
  25.         int operation;
  26.         std::vector< expression_value > values;
  27.     };
  28.  
  29.     struct equation
  30.     {
  31.         equation() : negate(false), operation(-1) {}
  32.         bool             negate;
  33.         expression_value left;
  34.         int              operation;
  35.         expression_value right;
  36.     };
  37. }
  38.  
  39. BOOST_FUSION_ADAPT_STRUCT(
  40.     qslang::func,
  41.     (std::string, name)
  42.     (std::vector< qslang::equation_value >, arguments)
  43. )
  44. BOOST_FUSION_ADAPT_STRUCT(
  45.     qslang::expression,
  46.     (int, operation)
  47.     (std::vector< qslang::expression_value >, values)
  48. )
  49. BOOST_FUSION_ADAPT_STRUCT(
  50.     qslang::equation,
  51.     (bool, negate)
  52.     (qslang::expression_value, left)
  53.     (int, operation)
  54.     (qslang::expression_value, right)
  55. )

I’ll leave this uncommented. It’s basically EBNF with attributes and the data structures should be fairly self-explanatory (haha; I’m just lazy, agreed). Sadly, I cannot post the complete source and maybe it’s not even the best way to model the grammar. But it should help interested coders getting the idea.

Leave a Reply

Your email address will not be published. Required fields are marked *