3 Parsing and Expanding

The macros shown so far have been fairly simply. Writing more sophisticated macros, such as those required to implement classes, requires some knowledge of parsing and macro expansion in ZL. This chapter gives the necessary background material, while the next chapter details how to write such macros.

3.1 Parsing Overview

To deal with C’s idiosyncratic syntax while also allowing the syntax to be extensible, ZL does not parse a program in a single pass. Instead, it uses an iterative-deepening approach to parsing. The program is first separated into a list of partly parsed declarations by a Packrat [11, 12] parser that effectively groups tokens at the level of declarations, statements, grouping curly braces, and parentheses. Each declaration is then parsed. As it is being parsed and macros are expanded, subparts, such as code between grouping characters, are further separated.

ZL’s iterative-deepening strategy is needed because ZL does not initially know how to parse any part of the syntax involved with a macro. When ZL encounters something that looks like a function call, such as f(x + 2, y), it does not know if it is a true function call or a macro use. If it is a macro use, the arguments could be expressions, statements, or arbitrary syntax fragments, depending on the context in which they appear in the expansion. Similarly, ZL cannot directly parse the body of a macro declaration, as it does not know the context in which the macro will ultimately be used.

More precisely, the ZL parsing process involves three intertwined phases as shown in Figure 3.1.

[Picture]

Figure 3.1: Overview of ZL’s parsing process.

In the first phase raw text, such as (x+2), is parsed. Raw text is converted into an intermediate form known as a syntax object, which can still have raw-text components. (Throughout this paper we show syntax objects as S-expressions, such as ("()" "x+2").) In the second phase, the syntax object is expanded as necessary and transformed into other syntax objects by expanding macros until a fixed point is reached. In the third phase, the fully expanded syntax object is compiled into an AST.

[Picture]

Figure 3.2: How ZL compiles a simple program. The body of f is reparsed and expanded as it is being compiled.

The top box contains a simple program as raw text, which is first parsed. The result is a syntax list (internally represented as a @) of stmt’s where each stmt is essentially a list of tokens, as shown in the second box. Each statement is then expanded and compiled in turn, and is added to the top-level environment (which can be thought of as an AST node). The third box in the figure shows how this is done, which requires recursive parsing and expansion. The first stmt is compiled into the funf, while the body of the function is left unparsed. Next, fun is compiled into an AST (shown as a rounded rectangle). During the compilation, the body is expanded. Since it is raw text, this process involves parsing it further, which results in a block. Parsing the block involves expanding and compiling the subparts. Eventually, all of the subparts are expanded and compiled, and the fully parsed AST is added to the top-level environment. This process is repeated for the function main, after which the program is fully compiled.

3.2 The Parser

The ZL grammar is specified through a PEG [12], but with a few extensions to the usual PEG notation, and a Packrat [11] parser is used to convert strings of characters to syntax objects. A simplified version of ZL’s initial grammar is shown in Figure 3.3.

  TOP = <top> SPACING {STMT}+;

  STMT = <<mid PARM>> {MID} ";"
       / <if> "if" "(" {EXP} ")" {STMT} ("else" {STMT})?
       / <while> "while" "(" {EXP} ")" {STMT}
       / <break> "break" ";"
       / <return> "return" {EXP} ";"
       / {BLOCK}
       # other statements ...
       / {CUSTOM_STMT}
       / <stmt> ({TOKEN_}+ {PAREN} {BRACE} / {TOKEN}+ ";");
  CUSTOM_STMT = !__;

  EXP = <exp> {TOKEN}+;

  BLOCK = <block> "{" {STMT}* "}";

  TOKEN_ =  <<mid PARM>> {MID} / {STRUCT_UNION} /
            {BRACK} / {CONST} / {ID} / {SYM} /
  TOKEN = TOKEN_ / PAREN;

  PAREN =  <<reparse ()>> "(" {RAW_TOKEN*} ")";
  BRACE =  <<reparse {}>> "{" {RAW_TOKEN*} "}";
  BRACEL = <<reparse @{}>> "{" {RAW_TOKENS} "}";
  BRACK =  <<reparse []>> "[" {RAW_TOKEN*} "]";

  CONST = <f> ...  / <l> ... / # float, numeric literal
          <s> ... / <c> ...    # string, character
  ID = <<mid>> {MID} / {[@$\a_][\a_\d]*} SPACING;
  SYM = {’...’ / ’==’ / ’+’ / ...} SPACING;

  STRUCT_UNION = <%> {"struct"/"union"/"class"} {ID/}
                     {STRUCT_UNION_PARMS} {<{...}> {BRACEL} }?;
  STRUCT_UNION_PARMS = (:<public> ":" "public" {ID})?;

  RAW_TOKEN = STRING / CHAR / SYM / BRACE / PAREN /
              BRACK / COMMENT / [^\)\]\}];

  STRING = ’"’ (’\\’_/[^"])+ ’"’ SPACING;
  CHAR   = ’\’’ (’\\’_/[^’])+ ’\’’ SPACING;

  SPACING = [\s]* COMMENT?;

  COMMENT = ...;

Figure 3.3: Simplified PEG grammar.

For readers not familiar with PEGs, the two most important things to note are that PEGs work with characters rather than tokens, and the / operator defines a prioritized choice. A prioritized choice is similar to the | operator used in Backus-Naur Form, except that it unconditionally uses the first successful match. For example, given the rule “A = ’a’ / ’ab’” the string ab will never match because the first choice is always taken. The PEG specification more closely resembles regular expression syntax (as used in grep) than it does Backus-Naur Form. The (), [], ?, *, +, and _ (otherwise known as .) operators are all used in the same manner as they are in regular expressions.

Anything between single quotes is a literal string. The double quote is like the single quote, except that special rules make them behave similarly to tokens. For example, "for" will match the for in for(, but it will not match the prefix of foreach.

The {} and <> are extensions to the standard PEG syntax and are used for constructing syntax objects in the obvious ways. The special <<reparse>> operator creates a syntax object with a raw-text component that will be parsed later. The special <<mid>> operator and MID production are explained later in Section C.2.

The CUSTOM_STMT production serves as a hook to allow easily adding new statements to the grammar. The __ operator always matches and the ! inverts the results; thus, !__ never matched. This production therefor does nothing by default.

3.3 Extending the Parser

ZL currently provides very basic support for expanding the parser by rewriting productions. Like macros, parser extensions in ZL are given in source code using the new_syntax form (as oppose to using a separate file). For example, the following add a new foreach statement form (from Section 2.4 in the tutorial)

by adding a new prioritized choice to the special CUSTOM_STMT production in which the special _cur production refers to the previous version of the CUSTOM_STMT. The other common way to add new syntax is to add a new sequence item; for example,

The new_syntax form works by creating a new parser, which is then used for any text, within the current scope, that has not yet been parsed. This has not precisely defined yet, but for the most part it will work as expected. A consequence of this rule is that redefining fundamental aspects of the grammar (such as strings, comments, deliminators) is unlikely to work.

3.4 Built-in Macros

The grammar serves to separate individual statements and declarations, and to recognize forms that are convenient to recognize using a Packrat parser. As such, it creates syntax objects that need additional processing before they can be compiled into an AST. The expander has several built-in macros for this purpose: stmt, exp, (), [], and {}.

The stmt macro recognizes declarations and expressions. It first tries the declarations expander, which is a handwritten parser designed to deal with C’s idiosyncratic syntax for declarations. If the declarations expander fails, then the expression expander is tried, which is an operator-precedence parser [9]. The exp macro is like the stmt macro, but only the expression expander is tried.

The macros (), [], and {} are used for reparsing strings. The () and [] macros reparse the string as an expression using the EXP production in the grammar, where as the {} generally reparses the string as a block using the BLOCK production.

Chapter 3Parsing and Expanding

3.1 Parsing Overview

3.2 The Parser

3.3 Extending the Parser

3.4 Built-in Macros

Chapter 3
Parsing and Expanding