Question:
The source text of a Java program consists of whitespace, identifiers, literals, comments, operators, separators, and keywords.
What does the compiler do with each of these elements? Is anything omitted or modified along the way?
Answer:
A common practice when writing a compiler is to split it into phases. Traditionally, the first phase is lexical analysis, which divides the source text into lexemes. The code is read as a sequence of characters and represented as a sequence of tokens.
A token consists of a token type and a value (packaged in a single class).
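As a rough illustration, a token could be modeled in Java like this. The names TokenType and Token, and the five coarse categories, are my own assumptions for this sketch and are not taken from any real compiler. Real lexers often use a finer-grained set of types, for example one per keyword or per separator.

```java
// Hypothetical names: TokenType, Token, and the enum constants are illustrative only.
enum TokenType { KEYWORD, IDENTIFIER, LITERAL, OPERATOR, SEPARATOR }

final class Token {
    final TokenType type;  // which category of lexeme this is
    final String value;    // the text of the lexeme, e.g. "main" or "{"

    Token(TokenType type, String value) {
        this.type = type;
        this.value = value;
    }

    @Override
    public String toString() {
        // Renders the token as [TYPE "value"]
        return "[" + type + " \"" + value + "\"]";
    }
}
```

With this shape, new Token(TokenType.IDENTIFIER, "main") stands for the identifier main, and the later phases only need to look at the type and value fields, never at the raw characters.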
In this phase, whitespace (other than whitespace inside character and string literals) is usually discarded. Identifiers are turned into tokens of an "Identifier" type whose value is the string with the identifier's name. Literals are also turned into tokens. Comments usually do not survive lexical analysis and are simply discarded. Separators, such as parentheses and punctuation marks, each get a token type of their own, and operators are typically handled the same way. Keywords are also usually given token types of their own.
Example:
Original text
public class Example { // example
    public static void main(String[] args) {
        System.out.println(/* this text will be printed */"hello world");
    }
}
produces the following sequence of lexical tokens:
[public-keyword] [class-keyword] [ident "Example"] [separator-left-brace] [public-keyword]
[static-keyword] [void-keyword] [ident "main"] [separator-left-paren] [ident "String"]
[separator-left-brack] [separator-right-brack] [ident "args"] [separator-right-paren]
[separator-left-brace] [ident "System"] [separator-dot] [ident "out"] [separator-dot]
[ident "println"] [separator-left-paren] [string-literal "hello world"]
[separator-right-paren] [separator-semicolon] [separator-right-brace] [separator-right-brace]
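To make these rules concrete, here is a minimal hand-written lexer sketch. It is a toy built on stated assumptions: MiniLexer and tokenize are hypothetical names, it only knows the four keywords and eight separators that occur in the example, it does not handle operators, numeric or character literals, or escape sequences in strings, and it emits tokens as strings in the bracket notation used above rather than as objects.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical toy lexer; requires Java 9+ for Set.of and Map.of.
class MiniLexer {
    // Deliberately tiny subsets; a real Java lexer knows every keyword and separator.
    private static final Set<String> KEYWORDS = Set.of("public", "class", "static", "void");
    private static final Map<Character, String> SEPARATORS = Map.of(
            '{', "left-brace", '}', "right-brace",
            '(', "left-paren", ')', "right-paren",
            '[', "left-brack", ']', "right-brack",
            '.', "dot", ';', "semicolon");

    static List<String> tokenize(String src) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < src.length()) {
            char c = src.charAt(i);
            if (Character.isWhitespace(c)) {
                i++;                                     // whitespace: discarded
            } else if (src.startsWith("//", i)) {
                int end = src.indexOf('\n', i);          // line comment: skip to end of line
                i = (end < 0) ? src.length() : end;
            } else if (src.startsWith("/*", i)) {
                int end = src.indexOf("*/", i + 2);      // block comment: skip past the closing */
                if (end < 0) throw new IllegalArgumentException("unterminated comment");
                i = end + 2;
            } else if (c == '"') {
                int end = src.indexOf('"', i + 1);       // string literal (escapes not handled)
                if (end < 0) throw new IllegalArgumentException("unterminated string");
                tokens.add("[string-literal \"" + src.substring(i + 1, end) + "\"]");
                i = end + 1;
            } else if (Character.isJavaIdentifierStart(c)) {
                int start = i;
                while (i < src.length() && Character.isJavaIdentifierPart(src.charAt(i))) i++;
                String word = src.substring(start, i);
                // A word is either one of the known keywords or an identifier.
                tokens.add(KEYWORDS.contains(word)
                        ? "[" + word + "-keyword]"
                        : "[ident \"" + word + "\"]");
            } else if (SEPARATORS.containsKey(c)) {
                tokens.add("[separator-" + SEPARATORS.get(c) + "]");
                i++;
            } else {
                throw new IllegalArgumentException("unexpected character: " + c);
            }
        }
        return tokens;
    }
}
```

Calling MiniLexer.tokenize on the source text of the example should give the same 26 tokens as listed above: both comments and all whitespace disappear, while the text inside the string literal is kept intact because the string branch consumes it before the whitespace rule can see it.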
Later compilation phases take this token sequence and break it down into class definitions, methods, and operations, resolve names and bind them to the entities they refer to, perform semantic checks, optimize, and generate bytecode.
Lexical analysis is usually the simplest phase of compilation.
It is theoretically possible (and sometimes necessary) to build compilers in which lexical analysis is essentially merged with the subsequent compilation phases. In principle, nothing forces compiler authors to make lexical analysis a separate phase, but doing so is common practice.