4 Procedural Macros

So far, only the high-level syntax for procedural macros has been shown. ZL also allows the macro-transformer function to be defined directly. This chapter presents the low-level interface and the associated API. Many of the API functions can also be used with the higher-level procedural macro form. The chapter ends with an extended example that shows how many of the API components work together.

4.1 Low Level Procedural Macros

Figure 4.1 demonstrates the essential parts of any procedural macro. The macro is defined as a function that takes a syntax object (and optionally an environment), and returns a transformed syntax object. Syntax is created using the syntax form. The match_f function is used to decompose the input while the replace function is used to rebuild the output. Finally, make_macro is used to create a macro from a function. More interesting macros use additional API functions to take action based on the input. Figure 4.2 defines the key parts of the macro API, which we describe in the rest of this section.

  Syntax * or(Syntax * p) {
    Match * m = match_f(NULL, syntax (x, y), p);
    return replace(syntax
                     {({typeof(x) t = x; t ? t : y;});},
                   m, new_mark());
  }
  make_macro or;

Figure 4.1: Procedural macro version of or macro from Section 2.2.

Types: UnmarkedSyntax, Syntax, Match, and Mark

Syntax forms:
new_mark() — returns Mark *
syntax (...)|{...}|ID — returns UnmarkedSyntax *
raw_syntax (...) — returns UnmarkedSyntax *
make_macro ID [ID];

Callback functions:
Match * match_f(Match * prev, UnmarkedSyntax * pattern, Syntax * with)
Syntax * match_var(Match *, UnmarkedSyntax * var);
Syntax * replace(UnmarkedSyntax *, Match *, Mark *)
size_t ct_value(Syntax *, Environ *)

Figure 4.2: Basic macro API.

Syntax is created using the syntax and raw_syntax forms. The different forms create different types of code fragments. In most cases, the syntax {...} form can be used, such as when a code fragment is part of the resulting expansion; the braces will not be in the resulting syntax. If an explicit list is needed, for example, when passed to match_f as in Figure 4.1, then the syntax (...) form should be used (in which the commas are part of the syntax used to create the list). Neither of these forms creates syntax directly, however; for example, syntax {x + y;} is first parsed as ("{}" "x + y;") before eventually becoming (plus x y). When it is necessary to create syntax directly, the syntax ID form can be used for simple identifiers. For more complicated fragments, the raw_syntax form can be used, in which the syntax is given in S-expression form.

The match_f function decomposes the input. It matches pattern variables (the second parameter) with the arguments of the macro (the third parameter). If it is successful, it prepends the results to prev (the first parameter) and returns the new list. If prev is NULL, then it is treated as an empty list. In the match pattern a _ can be used to mean “don’t care.”

The replace function is used to rebuild the output. It takes a syntax object (the first parameter, and generally created with syntax), replaces the pattern variables inside it with the values stored in the Match object (the second parameter), and returns a new Syntax object.

The final argument to replace is the mark, which is used to implement hygiene. A mark captures the lexical context at the point where it is created. Syntax objects created with syntax do not have any lexical information associated with them, and are thus unmarked (represented with the type UnmarkedSyntax). It is therefore necessary for replace to attach lexical information to the syntax object by using the mark created with the new_mark primitive (the third parameter to replace).

Match variables exist only inside the Match object. When it is necessary to access them directly, for example, to get a compile-time value, match_var can be used; it returns the variable as a Syntax object, or NULL if the match variable does not exist. If the compile-time value of a syntax object is needed, ct_value can be used, which will expand and parse the syntax object and return the value as an integer.

Once the function for a procedural macro is defined, it must be declared as a macro using make_macro.

4.2 Macro Transformer Arguments

The macro transformer can take one of four forms. So far, only the most basic form has been shown, that is, a transformer that takes a single argument, which is the syntax object to be transformed:

The passed-in syntax object holds the arguments to the macro call, not the call itself. When the call itself is needed, a two-argument form can be used:

In the case of a syntax macro, call and args point to the same object. In the case of a function-call macro, call has the form

The macro transformer can also accept an environment in either of the above forms:

ZL will also accept a transformer that expects a non-const environment, but that form is deprecated as directly modifying the environment in the transformer can lead to unpredictable results.

4.3 Macro API

The next few sections detail various aspects of the macro API. The API has both a class-like form and a procedural form; these sections present the class-like form. The mapping from the class API to the raw API is straightforward: the object name, in all lower case, is prepended to the method name, with an underscore separating the two, and the object is passed in as the first parameter. For example, the var method of Match (Section 4.6) corresponds to the match_var function of Figure 4.2, which takes the Match object as its first parameter.
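The naming scheme can be sketched in plain C as a small string helper. This is only an illustration of the rule as stated in the text; raw_api_name is a hypothetical name and is not part of the ZL API, and the sketch assumes the object name is lower-cased as a single word.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of ZL): builds the raw API name for a
   class-like method by lower-casing the object name, adding '_', and
   appending the method name. Returns 0 on success, or -1 if the output
   buffer is too small. */
int raw_api_name(const char *object, const char *method,
                 char *out, size_t outlen) {
    size_t olen = strlen(object);
    size_t mlen = strlen(method);
    if (olen + 1 + mlen + 1 > outlen)
        return -1;
    for (size_t i = 0; i < olen; i++)
        out[i] = (char)tolower((unsigned char)object[i]);
    out[olen] = '_';
    memcpy(out + olen + 1, method, mlen + 1); /* copies the final '\0' too */
    return 0;
}
```

Called with the object name Match and the method name var, the helper produces match_var, which agrees with the function listed in Figure 4.2.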

You may notice that many functions have seemingly pointless default parameters. In reality, these are expected to bind to a local fluid binding, which is used by high-level procedural macros and quasiquotes. The details of how both use these arguments are given in the next chapter. For now, it is sufficient to know that when using high-level procedural macros the mark and environ symbols will be defined for you.

4.4 The Syntax Object

Type UnmarkedSyntax

Type Syntax, subtype of UnmarkedSyntax, with methods:

: Syntax * num_parts(unsigned)
: Syntax * part(unsigned)
: Syntax * flag(UnmarkedSyntax *)
: bool simple()
: bool eq(UnmarkedSyntax *)
: Syntax * stash_ptr(void *) (static method)
: void * extract_ptr()

Figure 4.3: Syntax object API.

There are two syntax-object types, UnmarkedSyntax and Syntax. The first represents a syntax object that has not yet been marked (see 4.1), while the second represents one that has. A Syntax object will automatically convert to an UnmarkedSyntax. To go from UnmarkedSyntax to Syntax, however, the syntax object needs to be marked, which is generally done via replace.

Internally UnmarkedSyntax and Syntax are the same type. The distinction in the API is to avoid invalid use of unmarked syntax objects.

A syntax object consists of one or more parts and optional flags. The first part has special meaning and is used to identify the syntax, provided that it is simple. A simple syntax object is essentially a syntax object with just one part and no flags, although internally it is represented slightly differently. Parts other than the first are considered arguments.

Syntax objects can also have any number of optional flags. A flag is a named argument and is retrieved by name, rather than position. A flag itself is just a normal syntax object with the first part used to name the flag. Flags can be tested for existence using the Syntax’s flag method (which returns NULL if the flag does not exist) or matched with the match family of functions (see 4.6). Flags are primarily used when parsing declarations and can be created in macros by using the raw_syntax primitive. For example the following syntax object:

contains two flags, where flag1 is a flag without any value associated with it, while flag2 is a flag with a value. Flags can also be passed into function-call macros, in which case they are just another name for the already described keyword arguments.

Syntax objects can also contain other types of objects embedded within them. A syntax object of this form is considered an entity. The most common types of embedded objects are parsed syntax, either in the form of an AST node or a symbol. However, it is also possible to embed arbitrary objects such as pointers in a syntax object using the stash_ptr and extract_ptr methods. These methods are most commonly used in combination with Symbol properties, which will be described in 7.4.

Sometimes it is useful to get information on the syntax object without having to use match_f. For this, ZL provides a number of methods to directly access the syntax object and get basic information. The part and num_parts methods can be used for direct access. The eq and simple methods can be used to get basic properties of the syntax object. The eq method tests if the syntax object is equal to another, taking into account that the first one may be marked. The simple method tests if the syntax object is simple as previously described.

4.5 The Syntax List

A syntax list is a syntax object whose first part is a @. It represents a list of syntax objects (which can include flags). Lists have the effect of being spliced into the parent syntax object.

Syntax lists can be used as values for macro identifiers, in which case the results are spliced in. Macros can return syntax lists, but the results are not automatically spliced in. Rather, when a list of elements is parsed, any @ are flattened as the list is read in. It is an error to return a syntax list in a nonlist context.

Type SyntaxList, subtype of Syntax, with constructor:

: SyntaxList * new_syntax_list()

and methods:

: int empty()
: void append(Syntax *)
: void append_flag(Syntax *)
: SyntaxEnum * elements()

Type SyntaxEnum with methods:

: Syntax * next()
: SyntaxEnum * clone()

Figure 4.4: Syntax list API.

Syntax lists are created using the new_syntax_list function. Elements are then appended to the list using the append or append_flag method. The empty method returns true if the list has 0 elements. The elements method returns a SyntaxEnum for iterating through the elements. The next method of SyntaxEnum returns the next element in the list, or NULL if there is none, while the clone method returns a copy of the SyntaxEnum.
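The behavior of the list and enumeration methods can be modeled in plain C as follows. This is a sketch for illustration only and not ZL's implementation: the struct layouts are invented, flags are omitted, and the names syntax_list_empty, syntax_list_append, syntax_list_elements, syntax_enum_next, and syntax_enum_clone are hypothetical stand-ins for the methods of Figure 4.4.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins; in ZL these types are opaque. */
typedef struct Syntax { const char *name; } Syntax;
typedef struct SyntaxList { Syntax **items; size_t len, cap; } SyntaxList;
typedef struct SyntaxEnum { const SyntaxList *list; size_t pos; } SyntaxEnum;

/* Creates an empty list. */
SyntaxList *new_syntax_list(void) {
    SyntaxList *l = calloc(1, sizeof *l);
    if (!l) abort();
    return l;
}

/* Nonzero if the list has 0 elements. */
int syntax_list_empty(const SyntaxList *l) { return l->len == 0; }

/* Appends one element, growing the backing array when it is full. */
void syntax_list_append(SyntaxList *l, Syntax *s) {
    if (l->len == l->cap) {
        size_t cap = l->cap ? l->cap * 2 : 4;
        Syntax **items = realloc(l->items, cap * sizeof *items);
        if (!items) abort();
        l->items = items;
        l->cap = cap;
    }
    l->items[l->len++] = s;
}

/* Returns an enumeration positioned at the first element. */
SyntaxEnum *syntax_list_elements(const SyntaxList *l) {
    SyntaxEnum *e = malloc(sizeof *e);
    if (!e) abort();
    e->list = l;
    e->pos = 0;
    return e;
}

/* Returns the next element, or NULL when there is none. */
Syntax *syntax_enum_next(SyntaxEnum *e) {
    return e->pos < e->list->len ? e->list->items[e->pos++] : NULL;
}

/* Returns an independent copy that continues from the same position. */
SyntaxEnum *syntax_enum_clone(const SyntaxEnum *e) {
    SyntaxEnum *c = malloc(sizeof *c);
    if (!c) abort();
    *c = *e;
    return c;
}
```

In this model a clone taken partway through an iteration continues from the same position, and advancing the original does not advance the clone.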

4.6 Matching and Replacing

Type Match with methods:

: Syntax * var(UnmarkedSyntax *)
: SyntaxEnum * varl(UnmarkedSyntax *)

and related functions:

: Match * match_f(Match * prev, UnmarkedSyntax * pattern, Syntax * with)
: Match * match_parts_f(Match *, UnmarkedSyntax * pattern, Syntax * with)
: Match * match_local(Match *, ...)

Callback function:

: Syntax * replace(UnmarkedSyntax *, Match *, Mark *)

Figure 4.5: Match and replace API.

The match_f and replace functions have already been described. The var method is identical to the previously described match_var function. The varl method is like var except that it returns an enumeration for iterating through the elements of a syntax object that is also a list. The fact that it returns an enumeration rather than a list is deliberate, since syntax lists are mutable objects and the results from a match are not.

The match_f function matches the arguments of a syntax object, which excludes the first part (generally the name of the syntax object). The match_parts_f function, by contrast, matches the complete syntax object, including the first part.

When it is necessary to build syntax directly from syntax objects, the match_local function provides a convenient way to do so. It takes a match object and a list of syntax objects terminated by NULL, and assigns each syntax object a numeric match variable of the form $NUM, the first being $1.
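The calling convention of match_local, a variable-length argument list terminated by NULL with values numbered from $1, can be modeled in plain C. The sketch below is not ZL code: Binding and bind_numbered are hypothetical names, and the values are untyped pointers rather than syntax objects.

```c
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical record pairing a generated variable name with a value. */
typedef struct Binding {
    char name[16];
    const void *value;
} Binding;

/* Binds each variadic argument, up to the terminating NULL, to the
   names "$1", "$2", ... and returns the number of bindings written.
   At most `max` bindings are stored in `out`. */
size_t bind_numbered(Binding *out, size_t max, ...) {
    va_list ap;
    size_t n = 0;
    va_start(ap, max);
    for (;;) {
        const void *v = va_arg(ap, const void *);
        if (v == NULL || n == max)
            break;
        snprintf(out[n].name, sizeof out[n].name, "$%zu", n + 1);
        out[n].value = v;
        n++;
    }
    va_end(ap);
    return n;
}
```

As with any NULL-terminated variadic function in C, the terminator should be passed as a pointer, for example (const void *)NULL, so that it has the type the callee reads.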

4.7 Match Patterns

A pattern to be matched against is expected to either be a simple list of the form syntax (a, b, ...) or be fully parsed, i.e., created with raw_syntax. The difference is that pattern variables matched with the former will need to be reparsed, while pattern variables matched with the latter will not.

The syntax () form is designed to be used when matching parameters passed in via a function-call macro. The pattern contains a list of the following (with some restrictions on order):

ID matches a normal parameter. The second item, “ID = VALUE”, gives a parameter a default value to use if it is omitted. A _ can be used anywhere an identifier would be used when the value is irrelevant. Parameters can also be optional if they are after the special @ instruction, in which case they will simply be omitted from the match list. The @ID form will match any remaining parameters and store them in a syntax list. Flags (otherwise known as keyword arguments) can also be matched with any of the :FLAG forms. Flags, in the current implementation, are always optional; however, any matched flags will not appear in the syntax list matched with @ID.

For example, in the pattern (X, _, Y, Z = 9, :flag1, :flag2 _, :flag3 F2 = 8, @, A, B, @REST), the first three positional parameters are required, but we do not care about the value of the second. In addition the flags flag1 and flag2 are required, with the second one also requiring a value. In addition to the required parameters, the next three positional parameters will be stored in Z, A, and B respectively. If the fourth parameter is not given it will get the default value 9, while if the other two are not given they will simply not be present in the match list. Any additional parameters passed in will be stored in REST as a list. Finally, the flag flag3 may also be given, but if it is not, it will assume the default value 8.

A pattern can also be specified in raw_syntax form, which is designed to be used with syntax macros. In the raw_syntax form a pattern can represent anything that a match list can. In addition, it is possible to match the subparts of an expression using (pattern (WHAT ...)). For example, to match the list of declarations inside of a class body which is represented as (class foo ({...} decl1 decl2)) into the pattern variable body, the (_ _ (pattern ({...} @body))) pattern can be used.

It is also possible to use the raw_syntax form with function-call macros; however, when doing so it is important to know that the macro parameters are not parsed. For example, if f is a function-call macro, the parameter of the call f(x+2) is passed in as (parm "x+2"). When using the syntax forms for matching, ZL’s normal parsing process (see 3, C.2) parses the string at the right time. But the raw_syntax form skips this step. Thus, it is necessary to manually instruct ZL to parse the parameter passed in by using (reparse ID). For example, to match the parameter in the f macro above use:

4.8 Creating Marks

Marks (see 4.1) are used to implement lexical scope, and the API is listed in Figure 4.6. The new_mark primitive is actually a macro that calls the callback function new_mark_f and uses the primitive environ_snapshot() to capture the environment.

Type EnvironSnapshot with related syntax form:

: environ_snapshot() — returns EnvironSnapshot *

Type Mark with related function:

: Mark * new_mark_f(EnvironSnapshot *)

and macros:

: macro new_mark(es = NULL) {new_mark_f(es ? es : environ_snapshot());}
: macro new_empty_mark() {new_mark_f(0);}

Figure 4.6: Mark API.

4.9 Partly Expanding Syntax

In complex syntax macros, it is often necessary to decompose the parts passed in. However, in most cases, those parts are not yet expanded; thus it is necessary to expand them first. To support this expansion, ZL provides a way to partly expand a syntax object in the same way it does internally; the API is shown in Figure 4.7.

Callback functions:

: Syntax * partly_expand(Syntax *, Position pos, Environ *)
: SyntaxEnum * partly_expand_list(SyntaxEnum *, Position pos, Environ *)

and enum Position with possible values:

: NoPos, OtherPos, TopLevel, FieldPos, StmtDeclPos, StmtPos, ExpPos

Figure 4.7: Expander API.

The pos parameter tells ZL what position the syntax object is in; the values of the Position enum can be bitwise or’ed together. This parameter will affect how the expansion and, if necessary, reparsing is done. Common values are TopLevel for declarations, StmtPos for statements, and ExpPos for expressions. The Environ parameter is the environment as passed into the macro.
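Combining positions with bitwise or can be illustrated with a plain C enum. The numeric values below are assumptions made for this sketch; the text only states that the values can be or’ed together, and ZL’s actual values may differ. The helper position_includes is likewise hypothetical.

```c
/* ASSUMPTION: the numeric values are illustrative only. Each position
   other than NoPos gets its own bit so that values can be combined. */
enum Position {
    NoPos       = 0,
    OtherPos    = 1 << 0,
    TopLevel    = 1 << 1,
    FieldPos    = 1 << 2,
    StmtDeclPos = 1 << 3,
    StmtPos     = 1 << 4,
    ExpPos      = 1 << 5
};

/* Hypothetical helper: nonzero if the combined set `allowed`
   includes the single position `p`. */
int position_includes(unsigned allowed, enum Position p) {
    return (allowed & (unsigned)p) != 0;
}
```

Under these assumed values, StmtPos | ExpPos describes a syntax object that may appear either as a statement or as an expression.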

If the parts of a syntax object represent a list of some kind, it is best to use partly_expand_list. The function partly_expand_list is like partly_expand, except that it expects a list of elements in the form of a SyntaxEnum, and it automatically flattens any syntax lists (i.e., @) found inside the list. The elements of the list are expanded as they are iterated through, rather than all at once when the function is called.

4.10 Compile-Time Reflection

Often it is necessary to do more than just decompose syntax. Sometimes, it is necessary to get compile-time information on the syntax objects or the environment itself, for example, to get the numerical value of an expression, as is done with fix_size in Section 4.12, or to check if a symbol exists, as is done in foreach in Section 2.5. Figure 4.8 shows some of the available API functions for compile-time reflection.

Callback functions:

: unsigned ct_value(Syntax *, const Environ * = environ)
: bool symbol_exists(UnmarkedSyntax * sym, Syntax * where,
Mark * = mark, const Environ * = environ)
: Environ * temp_environ(const Environ *)
: Syntax * pre_parse(Syntax *, Environ *)

Figure 4.8: Compile time reflection API.

The ct_value function (which is used in the fix_size example) takes a syntax object, expands the expression, parses the expansion, and evaluates the parsed expression as an integer to determine its value. An error is thrown if the expression passed in is not a compile-time constant.

To see if a symbol exists in the current environment or an object that is a user type (as was done in the foreach example), the symbol_exists function can be used. The first argument is the symbol to check for. The second argument is the user type to check that the symbol exists in; if it is NULL then the current environment will be checked instead. The third argument provides the context in which to look up the current symbol, and finally the last argument is the environment to use.

Sometimes, in order to get compile-time information, it is necessary to add additional symbols to the environment. For this the temp_environ and pre_parse functions are used, as is done in the fix_size macro. The temp_environ function creates a new temporary environment, while pre_parse parses a declaration just enough to get basic information on it and then adds it to the environment. The creation of a temporary environment avoids affecting the outside environment with any temporary objects added with pre_parse.

4.11 Misc API Functions

Sometimes it is necessary to create syntax on the fly, such as creating syntax from a number that is computed at run time. The string_to_syntax function, shown in Figure 4.9, converts a raw string to a syntax object.

Callback functions:

: UnmarkedSyntax * string_to_syntax(const char *)
: const char * syntax_to_string(UnmarkedSyntax *)
: void dump_syntax(UnmarkedSyntax *)
: Syntax * error(Syntax *, const char *, ...)

Figure 4.9: Misc API functions.

The string passed in is the same as given for the syntax form, which can be specified at run time.

The syntax_to_string function does the reverse, which is primarily useful for checking an identifier for a literal value. It is also useful for debugging to see the results of a complex macro. However, for large syntax objects the dump_syntax function is more efficient. For complex syntax objects the output of both functions is designed to be human readable and as such the output is not suitable for reparsing with string_to_syntax.

The error function is used to return an error condition. It creates a syntax object that results in an error when it is parsed. The first argument is used to determine the location where the error will be reported; the location associated with this syntax object is used as the location of the error.

4.12 An Extended Example

To get a better idea of how procedural macros work, this section gives the code of a macro that fixes the size of a class. Fixing the size of a class is useful because changing the size often breaks binary compatibility, which forces code using that class to be recompiled. Additional examples of how ZL can be used to mitigate the problem of binary compatibility are given in our previous work [3].

1  Syntax * parse_myclass(Syntax * p, Environ * env) {
2    Mark * mark = new_mark();
3    Match * m = match_f
4        (0, raw_syntax(name @ (pattern ({...} @body))
5                         :(fix_size fix_size) @rest), p);
6    Syntax * body = match_var(m, syntax body);
7    Syntax * fix_size_s = match_var(m, syntax fix_size);
8
9    if (!body || !fix_size_s) return parse_class(p, env);
10
11    size_t fix_size = ct_value(fix_size_s, env);
12
13    m = match(m, syntax dummy_decl,
14              replace(syntax {char dummy;}, NULL, mark));
15    Syntax * tmp_class = replace(raw_syntax
16        (class name ({...} @body dummy_decl) @rest),
17                                 m, mark);
18    Environ * lenv = temp_environ(env);
19    pre_parse(tmp_class, lenv);
20    size_t size = ct_value
21        (replace(syntax(offsetof(name, dummy)), m, mark),
22         lenv);
23
24    if (size == fix_size)
25      return replace(raw_syntax
26                       (class name ({...} @body) @rest),
27                     m, mark);
28    else if (size < fix_size) {
29      char buf[32];
30      snprintf(buf, 32, "{char d[%u];}", (unsigned)(fix_size - size));
31      m = match(m, syntax buf,
32              replace(string_to_syntax(buf), NULL, mark));
33      return replace(raw_syntax
34                      (class name ({...} @body buf) @rest),
35                     m, mark);
36    } else
37      return error(p,"Size of class larger than fix_size");
38  }
39  make_syntax_macro class parse_myclass;

Figure 4.10: Macro to fix the size of a class. All ... in this figure are literal.

The macro to fix the size of the class is shown in Figure 4.10. To support this macro the grammar has been enhanced to support fixing the size. The syntax for the new class form is:

which will allow a macro to fix the size of the class C to 20 bytes. The enhancement involved modifying the STRUCT_UNION_PARMS production to support the fix_size construct using the new_syntax form (see Section 3.3):

Most of the syntax is already described in Section 3.2. The only new thing is :<>, which constructs a property to be added to the parent syntax object, which in this case is class. The {...} (in which the ... are literal) is the name of the syntax object for the class body.

The macro in Figure 4.10 redefines the built-in class macro. It works by parsing the class declaration and taking its size. If the size is smaller than the required size, an array of characters is added to the end of the class to make it the required size.

The details are as follows. Lines 2–7 decompose the class syntax object to extract the relevant parts of the class declaration. A @ by itself in a pattern makes the parts afterward optional. The pattern form matches the sub-parts of a syntax object; the first part of the object (the {...} in this case) is a literal to match against, and the other parts of the object are pattern variables. A @ followed by an identifier matches any remaining parameters and stores them in a syntax list; thus, body contains a list of the declarations for the class. Finally, :(fix_size fix_size) matches an optional keyword argument; the first fix_size is the keyword to match, and the second fix_size is a pattern variable to hold the matched argument.

If the class does not have a body (i.e., a forward declaration) or a declared fix_size, then the class is passed on to the original class macro in line 9. Line 11 compiles the fix_size syntax object to get an integer value.

Lines 13–22 involve finding the original size of the class. Due to alignment issues the sizeof operator cannot be used, since a class such as “class D {int x; char c;}” has a packed size of 5 on most 32 bit architectures, but sizeof(D) will return 8. Thus, to get the packed size a dummy member is added to the class. For example, the class D will become “class D {int x; char c; char dummy;}” and then the offset of the dummy member with respect to the class D is taken. This new class is created in lines 13–17. Here, the @ before the identifier in the replacement template splices in the values of the syntax list.
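The difference between sizeof and the packed size can be reproduced in plain C with the offsetof macro from <stddef.h>. The sketch below mirrors the class D from the text using structs; the names D_with_dummy, padded_size_of_D, and packed_size_of_D are ours and are not part of the macro.

```c
#include <stddef.h>

/* The class D from the text, written as a C struct. */
struct D { int x; char c; };

/* The temporary class: the same members plus a trailing dummy member. */
struct D_with_dummy { int x; char c; char dummy; };

/* Size including trailing padding, as reported by sizeof. */
size_t padded_size_of_D(void) { return sizeof(struct D); }

/* Packed size: the offset of the dummy member, which is where the
   original members end. */
size_t packed_size_of_D(void) {
    return offsetof(struct D_with_dummy, dummy);
}
```

On a typical platform with a 4-byte int, packed_size_of_D returns 5 while padded_size_of_D returns 8, which matches the numbers given above.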

To take the offset of the dummy member of the temporary class, it is necessary to parse the class and get it into an environment. However, we do not want to affect the outside environment with the temporary class. Thus, a new temporary environment is created in line 18 using the temp_environ macro API function. Line 19 then parses the new class and adds it to the temporary environment. The pre_parse API function partly expands the passed-in syntax object and then parses just enough of the result to get basic information about symbols.

With the temporary class now parsed, lines 20–22 get the size of the class using the offsetof primitive.

Lines 24–37 then act based on the size of the class. If the size is the same as the desired size, there is nothing to do and the class is reconstructed without the fix_size property (lines 24–27). If the class size is smaller than the desired size, then the class is reconstructed with an array of characters at the end to get the desired size (lines 28–35). (The string_to_syntax API function simply converts a string to a syntax object.) Finally, an error is returned if the class size is larger than the desired size (lines 36–37).
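The three-way decision in lines 24–37 can be isolated as a small plain C function. This is a sketch of the logic only, not part of the macro: make_padding_decl is a hypothetical name, and the sketch uses %zu because the difference of two size_t values is itself a size_t.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper mirroring the decision in lines 24-37.
   Returns  0 if the class already has the desired size (buf is ""),
            1 if a padding member declaration was written to buf,
           -1 if the class is larger than fix_size or buf is too small. */
int make_padding_decl(char *buf, size_t buflen,
                      size_t size, size_t fix_size) {
    if (buflen == 0)
        return -1;
    buf[0] = '\0';
    if (size > fix_size)
        return -1;               /* corresponds to the error in line 37 */
    if (size == fix_size)
        return 0;                /* nothing to add */
    int n = snprintf(buf, buflen, "{char d[%zu];}", fix_size - size);
    if (n < 0 || (size_t)n >= buflen)
        return -1;               /* output was truncated */
    return 1;
}
```

For a class with a packed size of 5 and a fix_size of 20, the helper writes {char d[15];} into the buffer, which is the string that line 32 of the macro would pass to string_to_syntax.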

The last line declares the function parse_myclass as a syntax macro for the class syntax form.
