What is Yacc Tool

The parser generator Yacc

This page is the HTML version of the elaboration of my Yacc presentation in the software development tools seminar.


I. What is Yacc?

I.1 What is Yacc?

YACC is a parser generator, i.e. a program with which you can create parsers. It was developed by S. C. Johnson in the early 1970s and is now available on all UNIX systems. The abbreviation YACC stands for "yet another compiler-compiler", which indicates that YACC is also used for the development of compilers, but it can be used wherever programs have to read text input. This is the case in many applications, e.g. when reading in configuration files or when implementing communication protocols (e.g. TCP / IP or HTTP). YACC shows its strength above all where these inputs turn out to be very versatile and different.
As input, YACC expects a file that contains a so-called grammar (similar to the Backus-Naur form). This grammar represents the description of the set of rules for the parser that you want to generate. The parser can recognize structures that follow these rules and execute defined actions. YACC generates a C source program from this grammar, which contains the complete parser.

I.2 How can you use Yacc?

A simple example of the use of YACC: Assume you have created a grammar that is located in the file grammar.y (the file extension .y is the convention for YACC source files). You call YACC with yacc grammar.y and you get the output file y.tab.c, the C source code of the parser for the grammar grammar.y. This can then be translated with any C compiler, e.g. with cc y.tab.c. The executable file a.out is then obtained, which contains the finished, executable parser.


II. The syntax of Yacc

II.1 Basic functionality of the parser

The parser generated by YACC works with different types of symbols. These are divided into terminal and non-terminal symbols, and terminal symbols can in turn either be literals or tokens.

Literals are single characters, i.e. letters, numbers or other characters. In contrast to literals, tokens can also consist of several characters or represent a whole class of characters such as digits; they are, so to speak, a more general form of terminal symbols than literals. Terminal symbols are the smallest units that the parser can handle.
Non-terminal symbols consist of one or usually several terminal symbols. For example, a terminal symbol could be a number or a mathematical operator, and a non-terminal symbol could be an entire expression made up of numbers and operators.
The parser consists of two important and a few less important parts that are implemented as functions.
The first function is yylex (), which does the job of a scanner. The scanner reads the input from the parser character by character and converts it into a stream of tokens. It does not deliver all tokens at once per call, but only the next token. All tokens must of course be defined beforehand. If a character cannot be identified as a token, it is transferred directly as a literal (it is generally advisable to define all or as many input symbols as possible as tokens; this delegates the task of literal recognition from the parser to the scanner. This has the advantage of that literals declared as tokens only have to be changed once in the scanner, instead of having to check and change every occurrence of the literal in the grammar). At the end of the input there is the predefined end token $ end, which has the value 0 or a negative value. The scanner function reads in characters, recognizes the corresponding token or literal from them and returns its type as an integer value. The actual content of the token, for example the value of the number for a number token or a string for a text token, is transferred in the global variable yylval.
You can write the yylex () function yourself as C source text in the YACC program, or generate it with the help of LEX (see also Chapter V and the previous lecture on LEX).
The second important function is yyparse (), the actual parser. This is the part of the program that is generated by YACC. This function repeatedly calls the scanner in order to read in the next token, the so-called lookahead token. The grammar rules to be specified in the parser grammar are applied to the token stream that is gradually created in this way. Such a rule could for example read in plain text: "An expression consists of a number or an expression, an operator and another expression." The tokens are a number and an operator, the expression is a non-terminal symbol. If they occur in the sequence number operator number, they can be reduced to expression operator expression and again to expression according to the rule, i.e. the token number operator number is replaced by the token expression in two steps. At the same time, C actions can be defined that are executed when the rule is processed. All of this will become more apparent later with an example.
You also need a main program that simply calls the parser yyparse (). This is done in C in the main () function. You can write it yourself or include it from the YACC library by compiling the C source text generated by YACC with the C compiler option -ly, e.g. cc y.tab.x -ly. There are also some other functions such as the error handling function yyerror (), which will be discussed in Chapter II.6.
The scanner divides the text to be read into terminal symbols, i.e. into tokens and literals. The parser uses the grammar rules to combine these into non-terminal symbols, i.e. reduce them until the entire input has been processed and only one symbol (the start symbol) remains or an error has occurred. This procedure is called the bottom-up principle, since the grammar rules are worked through from the bottom up. The reverse method is accordingly called a top-down strategy. There, non-terminal symbols are replaced until only terminal symbols remain. However, compared to the bottom-up method, this has the disadvantage that several tokens have to be read in at once and what is known as "backtracking" may be necessary. (More on this in the lecture on PCCTS). The start symbol must be non-terminal, and all other symbols must be accessible from it, i.e. every other symbol must be able to be reduced to the start symbol in some form alone or together with other symbols. In addition, all non-terminal symbols must be defined by grammar rules, so they must appear at least once on the left side of a rule.

II.2 Structure of a Yacc source program

A YACC source program consists of a declaration part, a part with the actual grammar rules and a part with additional functions. In addition to the actual YACC declarations, the declaration section also optionally contains some C declarations.The three sections are separated from one another by double percent signs (%%):

The declarations and the function part are optional, so a YACC source program consists at least of the grammar rules:

II.3 grammar rules

The most important part of a YACC source program are the grammar rules. Such a grammar rule consists of a non-terminal symbol on the left, followed by a colon (:) in the middle and one or more rule alternatives on the right, separated by a pipe character (|). Each alternative rule can also be written down individually (as a single rule). These rule alternatives consist of a string of further non-terminal and terminal symbols, with literals being placed in quotation marks and tokens always being capitalized by convention. The parser then “knows” that it can reduce such a sequence of symbols in the token stream to the symbol in front of the colon. The rule is ended with a semicolon (;). The empty rule alternative is also allowed, the parser can then reduce "nothing" to a symbol. The grammar rule mentioned above would formally read:

For each rule alternative, you can optionally specify C actions that are carried out after this rule alternative has been processed. These completely normal C statements are simply written in curly brackets after the rule alternative. There are also some variables available that start with the dollar sign. The variable $$ denotes the value of the symbol that precedes the colon, the variable $ 1 denotes the value of the first symbol of the rule alternative, the variable $ 2 the value of the second symbol, etc. Here is a somewhat more complicated example, the grammar rules for a simple " Calculator:

This means that an input consists of "nothing" or an input and a line (i.e. the symbols input and line, if they occur one after the other, can be reduced to the symbol input during parsing, as can "nothing at all"). The line should contain the expression to be calculated. Since the calculator should not always be closed after one line, the higher-level input symbol is introduced. A line consists either of just a line end character ('\ n') or an expression and a line end character. An expression consists either of just a number or an expression, an operator (as a literal) and another expression, or an expression in parentheses. The number is a token, i.e. a terminal symbol that cannot be further described by rules.
As you can see, symbols can also be defined recursively in the rules; this is even very often the case. Of course, as with all recursions, it is important that the recursion is also resolved. This would not be the case in the example if the rule alternative NUMBER were missing in the rules for the expression symbol.
Additional C-Actions are also used here in the rules. For example, if the rule has been successfully processed for line, the variable $ 1 (the value of the first symbol after the colon, i.e. the value of expression here) is displayed on the screen as a floating point number with ten places after the comma. This means that the result of the calculation is output on an input line of the pocket calculator. The value of Expression is calculated in the respective rules below: If the expression only consists of a number, Expression gets the value of Number with the instruction $$ = $ 1. (The statement part could also be omitted here, since the assignment $$ = $ 1 is carried out implicitly for every rule.) If the rule alternative Expression: Expression '+' Expression is used, the new expression gets the value of the first expression plus the value of the second Expression (ie the value of the first plus the value of the third symbol, $$ = $ 1 + $ 3), correspondingly for the other rule alternatives.
The parser receives this (numerical) value of a symbol from the scanner via the global variable yylval.

II.4 Declarations

All symbols that are not defined by grammar rules and are not literals (i.e. all tokens) must be declared. This is done with the% token keyword in the YACC declarations. For the calculator example one would have to declare at least the token ZAHL:

Several tokens can also be declared at once:

Further definition options for tokens can be found in Chapter IV.1 (left / right / non-associativity of operators).
In addition, you might want the calculator to not only calculate with whole numbers, because the standard data type of the tokens in YACC is integer. However, this can be specified with the YYSTYPE macro in the C declarations, for example as follows:

The YYSTYPE macro assigns a different data type to the tokens, in this case the type double. This means that the value or content of the token, which is addressed via the variable yylval, is a floating point number. C include statements can also be specified here.

II.5 functions

In addition, you have to specify the mentioned functions yylex () and possibly main (). They are not required by YACC, but by the C compiler if you want to generate the executable parser. As already mentioned, the main () function can also be integrated via the YACC library (C compiler option -ly). The yylex () function can also be developed with LEX (see Chapter V). If you do without it, you write it down in the third part of the YACC source program according to declarations and grammar rules in C, for example:

Tokens are identified internally as numbers above 255, literals by their ASCII value (number less than or equal to 255). However, this is usually of no concern for the user of YACC, as these numbers are set automatically by YACC.
The yylex () function in the above example ignores spaces and tabs, returns the token NUMBER for numbers, writes its value to the global variable yylval and returns all other characters directly as an ASCII value, i.e. as a literal.

II.6 Error handling

If an error occurs while parsing a text, the parser generated by YACC is terminated by default with the error message "syntax error". It is output using the yyerror () function. This is also included with the C compiler option -ly, or you can specify it yourself. It could then simply look like this:

However, this behavior is not acceptable for many YACC applications. For example, our pocket calculator does not want it to simply break off after an incorrect entry, but rather it should resume its work after recognizing the error and allow a new entry. This is exactly what the following extension of the grammar rule for line does:

A new rule alternative is simply inserted, which takes effect if the predefined token error and a newline character occur. Error means that an error occurred while parsing the token stream. For example, if the parser encounters the token stream expression '+' '+' '\ n', expression '+' '+' does not meet any grammar rules and is therefore reduced to the token error. Then the YACC function yyerrok () is called. This calls the yyerror () function to output the error message and then returns to the last token or literal read in without terminating the parser. The last literal read in is then '\ n', which fulfills an alternative rule for line.
It doesn't matter where yyerrok () is used, however. If it is placed incorrectly, the desired effect will not occur in the best case, in the worst case the parser will start at a point that again produces an error and thus produces an endless loop. If, for example, the instruction yyerrok () is placed directly after error (i.e. without further literals or tokens), then after executing yyerrok () it is set up again directly in front of error, which in turn calls yyerrok (), etc. You can also use yyerrok in write another rule alternative if you are sure that you can start again there without errors.
Unfortunately, it is also not that easy to change the text of the error message or to output error messages that depend on the particular situation. These changes would have to be made "manually" in the yyparse () function generated by YACC, in some cases even in main (). This is of course very cumbersome, and it would go beyond the scope of this here to go into it. In my opinion, this deficiency is a real weakness of YACC.


III. How does the parser created by Yacc work?

III.1 Procedure

The parser works with a stack, operations on this stack and states that are stored on this stack. At the same time, a value stack is created that records the (numerical) values ​​of the symbols. The most important operations on the stack are the shift and reduce operations, as well as the goto, accept and error operations.
With a shift operation, the look-ahead token is read and the resulting new status is placed on the stack. The look-ahead token is then deleted.
A reduce operation is carried out when the parser has found a grammar rule that matches the current state; the look-ahead token may also have to be queried. Then the right side of the rule is replaced by the left side, i.e. the state on the stack is changed.
A goto operation causes a new state to be put on the stack, similar to a shift operation. However, the lookahead token is not deleted. In the event that an error occurs, i.e. if the parser determines that it cannot process the input token according to the specification, there is the error operation. This leads to an error message as described in Chapter II.6. If no error was found and the look-ahead token contains the end marker $ end (token '\ 0'), the accept operation is carried out.
In the following table, the states on the stack are represented by the symbols processed.As a simplification, the goto operation is not taken into account. The "Step" column specifies the next operation, the "Rule" column the grammar rule according to which the next step is reduced. The "Value stack" column shows the content of the value stack. The crossbar (#) means that no value is passed in yylval for the corresponding symbol (e.g. operators and other literals), i.e. yylval is indefinite. The greater than sign (>) indicates (only for reduce operations) the position from which the value stack can be accessed in the rules with the variables $ 1, $ 2 etc. The parse process of the input "3 * (4 + 2) \ n", i.e. "NUMBER '*' '(' NUMBER '+' NUMBER ')' '\ n'" would proceed as follows:

StackInput token stringrulestepValue stack
 NUMBER '*' '(' NUMBER '+' NUMBER ')' '\ n' $ end shift0
NUMBER'*' '(' NUMBER '+' NUMBER ')' '\ n' $ end5reduce0,> 3
Expression '*' '(' NUMBER '+' NUMBER ')' '\ n' $ end shift0,3
Expression '*''(' NUMBER '+' NUMBER ')' '\ n' $ end shift0,3,#
Expression '*' '('NUMBER '+' NUMBER ')' '\ n' $ end shift0,3,#,#
Expression '*' '(' NUMBER'+' NUMBER ')' '\ n' $ end5reduce0,> 4
Expression '*' '(' expression'+' NUMBER ')' '\ n' $ end shift0,3,#,#,4
Expression '*' '(' expression '+'NUMBER ')' '\ n' $ end shift0,3,#,#,4,#
Expression '*' '(' Expression '+' NUMBER')' '\ n' $ end5reduce0,3,#,#,4,#,> 2
Expression '*' '(' Expression '+' Expression')' '\ n' $ end6reduce 0,3,#,#,> 4,#,2
Expression '*' '(' expression')' '\ n' $ end shift0,3,#,#,6
Expression '*' '(' Expression ')''\ n' $ end10reduce0,3,#,> #,6,#
Expression '*' expression'\ n' $ end8reduce0,> 3,#,6
Expression'\ n' $ end shift0,18
Expression '\ n'$ end4reduce0,> 18,#
row$ end2reduce0,> 18
Enter line$ end1reduce0
Enter $ end  accept0

First the first token NUMBER is shifted onto the stack. This can then be reduced to an expression according to grammar rule 5. Tokens are then shifted again until another NUMBER can be reduced to an expression. This then continues until line 10, where expression '+' expression is reduced to an expression according to the grammar rule. Line 16 is also particularly interesting, where "Nothing" is reduced to an input according to grammar rule 1. The syntax recognition is then completed in line 18. The look-ahead token already contains $ end, but it is not queried again until line 18. This depends on the implementation of the parser in states, which is explained in the next section.

III.2 The file y.output

The y.output file can be created with the YACC option -v. It contains all possible states that the parser can assume. For each state, the so-called configuration (the state description), the available operations and the operations that are carried out in the event of a reduction are described. This is useful to understand how the generated parser works, for example for troubleshooting or for general understanding, because the actual sequence of parsing an expression can be reconstructed from this file.
To the example grammar

YACC creates the following file y.output:

The "state" 0 is always the start state and it always contains the start symbol. The line "$ accept: _expression $ end" describes the configuration: The underscore (_) indicates the current processing position, i.e. before an expression is recognized. Possible operation here is a shift operation to state 2 in the event that the look-ahead token contains an IDENTIFIER, otherwise an error operation. State 2 is after recognition of an IDENTIFIER. The only possible operation here is a reduce according to rule 2. Then the process jumps back to state 0, where the next step, since something has been reduced to an expression, is state 1 on the stack with a goto operation. It continues like this until the entire input has been processed, as shown in the following table for the input IDENTIFIER '-' IDENTIFIER:

stepStack (state)Look ahead tokensurgery
10IDENTIFIERshift 2
20,2 reduce 2
30 goto 1
40,1-shift 3
50,1,3IDENTIFIERshift 4
60,1,3,4 reduce 1
 0,1$ endaccept

IV. Ambiguities and Conflicts

IV.1. Left / right / non-associativity of operators

It is often desirable to define the associativity of tokens or operators, i.e. the "binding". In our example, the calculator, there is, for example, the operator "/" (divided). If it occurs several times in a row in an entry, for example in "1/2/3", it is not clear whether "(1/2) / 3" or "1 / (2/3)" is meant here. Usually, if you don't explicitly use parentheses, you mean the first alternative. This is left-associative because first the left and then the right divided symbol is evaluated. In a YACC grammar, this type of evaluation is achieved by declaring% left '/' (generally% left TOKEN or% left LITERAL). Correspondingly, legal associativity is declared with% right TOKEN or% right LITERAL. It is also possible to declare tokens as non-associative (% unassoc TOKEN or% unassoc LITERAL). This is useful for comparison operators in programming languages, for example. The entry "x

IV.2. Priority of operators

Another problem arises with the calculator example: that of the priority of the operators. The input "1 + 2 * 3" could be evaluated as "1+ (2 * 3)" as well as "(1 + 2) * 3". In order to achieve the usual rule "point calculation before line calculation", one must ensure that the YACC grammar is correctly declared. The order of precedence of operators (i.e. the order of the evaluation of tokens for the parser) is defined by the order of the declaration in the YACC grammar. The further back in the grammar a token is declared, the greater its priority. For example, the declaration

the well-known point calculation can be achieved before line calculation, and the operators are all defined as left-associative at the same time.
Sometimes the problem arises that a literal is used as a character for different operators. For example, the "-" (minus) sign is used for both subtraction and negation. If you want to define different priorities for the operations, a context-dependent priority is necessary. This can be solved in YACC for the example (and also in general) as follows:

The auxiliary token NEG is introduced here, which has a higher priority than all other operators. In the grammar rule for the negation, this is assigned the same priority as NEG through the keyword% prec (for "precedence").

IV.3. reduce / shift - conflicts

Another type of conflict occurs, for example, in the following (incomplete) example grammar:

When typing

there are two options for the parser to proceed after processing if y then win:

  • The parser could apply the first rule and perform a reduce operation. In the example, else would belong to the outer if x then ...
  • However, the parser could also apply the second rule and perform a shift operation. In the example, else would belong to the inner if y then win.
Such a conflict is therefore generally called a reduce / shift conflict. By default, YACC performs a shift operation, but issues a warning when the YACC source code is compiled. If you are sure that the shift operation is the one you want, you can suppress this warning with the declaration% expect n: If YACC encounters exactly n reduce / shift conflicts in the grammar, no warnings are issued. Often it is also possible to solve reduce / shift conflicts with the help of the associativity declarations (the question of associativity is also a reduce / shift conflict!). In general, it is best to formulate the grammar clearly, but in many cases (as in this example) this is an absolutely non-trivial task and sometimes not possible.

IV.4. reduce / reduce - conflicts

There is also a conflict in the following (incomplete) example grammar:

There are two options for the parser to parse a word:

  • either the parser can reduce a word to a perhaps word and then to a sentence,
  • or the parser can reduce "nothing" to a sentence and reduce this back to a sentence together with the word.
Such a conflict is generally called a reduce / reduce conflict, because there are two possibilities for a reduce operation. By default, YACC applies the rule that is above in the grammar. However, it is better to formulate the grammar clearly from the start, in the case of the example, for example:

In this example it seems trivial as normally nobody would use the first solution anyway. However, in more complicated cases, even with this conflict it can get difficult.


V. Cooperation with the Lex scanner

Instead of writing the yylex () function in C "by hand" yourself, it is particularly useful for scanners that are supposed to do a little more to create them with the help of LEX.
LEX creates the file lex.yy.c from the LEX source file (see the lecture on LEX), which contains the yylex () function, the scanner, already discussed. This should be included in the functional part of the YACC grammar with the C statement #include "lex.yy.c". Alternatively, you can omit the include statement, but then you must also specify the C source text generated by LEX when compiling. The procedure is: You create the scanner with LEX, the actual parser with YACC, and compile everything with a C compiler to form the finished parser (see also figure), for example like this:

The C compiler option -ll integrates the LEX standard library into the program.
For example, a LEX program that describes the scanner for our pocket calculator could look like this:

(Note: This scanner cannot read in floating point numbers. A LEX program that does this can be found, for example, in [4] on page 72)


Appendix A: The Calculator Grammar Complete


Appendix B: References

  1. Gottfried Staubach. Unix tools for text pattern processing. Springer, 1989.
  2. Charles Donelly and Richard M. Stallman. Bison - the YACC-compatible parser generator. Program
  3. Documentation, version 1.20, Free Software Foundation, December 1992.
  4. Alfred V. Aho, Ravi Sethi, and Jeffrey D. Ullman. Compiler: Principles, Techniques, and Tools. Addison Wesley, 1986.
  5. AT. Schreiner and M. G. Friedmann. Building Compilers with Unix - An Introduction. Carl Hanser, Munich, 1985.

© 1996, 2004 Eike Meinders