c++ – Development of a C language parser

Question:

The lab was asked to write a parser for the C language:

Write a parser that detects the largest number of errors for the following grammar (this grammar is a simplified version of the C grammar):

< program >: < type >   ’main’   ‘(‘   ‘)’   ‘{‘   < statement >   ‘}’
< type >: ‘int’
 | ‘bool’
 | ‘void’
<statement>: 
  | < declaration > ‘;’
  | ‘{‘ < statement > ‘}’
  | < for >   < statement >
  | < if >      < statement >
  | < return >
< declaration >: < type >   < identifier >   < assign >
< identifier >: < character >< id_end >
< character >: ‘a’ | ‘b’ | ‘c’ | ‘d’ | ‘e’ | ‘f’ | ‘g’ | ‘h’ | ‘i' | ‘j’ | ‘k’ | ‘l’ | ‘m’ | ‘n’ | ‘o’ | ‘p’ | ‘q’ | ‘r’ | ‘s’ | ‘t’ | ‘u’ | ‘v’ | ‘w’ | ‘x’ | ‘y’ | ‘z’ | ‘A’ | ‘B’ | ‘C’ | ‘D’ | ‘E’ | ‘F’ | ‘G’ | ‘H’ | ‘I’ | ‘J’ | ‘K’ | ‘L’ | ‘M’ | ‘N’ | ‘O’ | ‘P’ | ‘Q’ | ‘R’ | ‘S’ | ‘T’ | ‘U’ | ‘V’ | ‘W’ | ‘X’ | ‘Y’ | ‘Z’ | ‘_’
  <id_end>:
| <character><id_end>
<assign>:
  | ‘=’ <assign_end>
<assign_end>: <identifier>
  | <number>
<number>: <digit><number_end>
<digit>: ‘0’ | ‘1’ | ‘2’ | ‘3’ | ‘4’ |  ‘5’ | ‘6’ | ‘7’ | ‘8’ | ‘9’
<number_end>:
  | <digit><number_end>
<for>: ‘for’ ‘(‘ <declaration> ‘;’ <bool_expression> ‘;’ ‘)’
<bool_expression>: <identifier>   <relop>    <identifier>
  | <number>      <relop>    <identifier>
<relop>:  ‘<’ | ‘>’ | ‘==’ | ‘!=’
<if>: ‘if’ ‘(‘ <bool_expression> ‘)’
<return>: ‘return’ <number> ‘;’

I kind of understood the first 4 lines (this is a description of the main). What's next? What code options can this pattern implement? Maybe someone already has some developments in the C ++ language? I didn’t find anything on the net (because I don’t know what exactly I need, is it stupid to somehow analyze the code?). In any case, a code example would be helpful.

Answer:

@Alerr , the yacc or bison sources won't help you at all right now. Read Wirth.

You can also look for the source codes of "calculators". They are quite common and usually they use the recursive descent algorithm to translate an expression, for example, into reverse Polish notation (it will also work here).

I think recursive parsing will just be in the next lab.

In principle, there is nothing complicated (here you have already been told about this) here. First, write a lexical analyzer. This is a function that reads input and returns a token – a structure that describes that we have read an identifier, a keyword, a number, an operation sign, a bracket, etc.

At the same time, part of your grammar will go into this lexical analyzer and the grammar will be simplified.

Next, just write the feature set according to the grammar. Well, perhaps we need to figure out how to report errors, make a decision, end parsing at the first error, or try to "recover" and parse further (this is more difficult), etc.

I'll start some strongly pseudocode, I think it will become clearer

int programm () {
    int rc;
    lexem_t lx = get_lexem();

    if (lx.type == KEYWORD && 
          (lx.subtype == INT || lx.subtype == VOID || lx.subtype == BOOL)) {
       lx = get_lexem();
       if (lx.type == IDENT && strcmp(lx.value, "main") == 0) {
          // очевидно, в нормальном Си тут нужно вызывать разбор списка параметров
          // но в нашей грамматике должна идти пара '(' ')'
          if ((rc = parentheses()) == OK) {
             lx = get_lexem();
             if (lx.type == SPEC && lx.char_value == '{')
                if ((rc = statement()) == OK) { // уффф! почти добрались до конца
                   lx = get_lexem();
                   if (lx.type == SPEC && lx.char_value == '}')
                       return OK; // Ура!!!
                   rc = error_message(lx, ...); // ждали '}', а прочли ???
                }
                return rc;
             }
             rc = error_message(lx, ...); // ждали '{', а прочли ???
          }
          return rc;
      // далее в том же духе обработка ошибок, просто в этом "редакторе" мало строк помещается, набирать с indent трудно
    ...
    } // конец if (lx.type == KEYWORD ...
    return error_message(lx, ....); // ждали int | void | bool, а прочли ???
}

I hope it's clear in general terms.

One more thing, you will almost certainly need a function similar to ungetc(c) , but for a token. Don't forget to take this into account when thinking about data structures.

Scroll to Top