terminology – What is lexical analysis?

Question:

I was looking at the source code of a popular PHP library called Twig (it's a template engine with its own syntax), and I came across classes, interfaces, and methods such as Lexical, Lex, and LexInterface.

I did some research and realized that the term refers to lexical analysis.

Although I understood some things, I was still confused about others.

For example, I'm used to seeing the term Parser or Parsing when it comes to transforming a given piece of data into another piece of data.

  • What would Lexical Analysis be?

  • Are Lexical Analysis and Parsing/Parser the same things, or are they actually different things?

Sorry if the question sounds confused, but I believe the community will help me with a good and insightful answer.

Answer:

Definition

Lexical analysis is the process performed on a text, say a computer program or a markup language such as HTML, that breaks it down into lexemes and converts them into a sequence of tokens, which are then used to feed a parser. A program that performs lexical analysis is usually called a lexer, scanner, or tokenizer.

tokens / lexemes

Lexemes are the units of text that are relevant in the lexer's context; for example, for a lexer aimed at a certain programming language, some lexemes could be: 1, "hello", for, ==, variableName, function.

A token is a structure that categorizes a lexeme: it contains an abstract name that represents the lexeme's group and an optional value for cases where the group does not identify the lexeme uniquely. Here are some example tokens (each token is represented in the format < token, optional-value > and "#" marks the start of a comment):

< digit, 1 > # any numeric value
< literal, "hello" > # values enclosed in double quotes
< for > # the characters f, o, r
< comparison, == > # the symbols <, >, <=, >=, == and !=
< id, variableName > # a sequence of letters naming a variable
< function > # the letters f, u, n, c, t, i, o, n
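
In code, a token can be represented by any small structure that carries the group name and an optional value. Below is a minimal sketch in JavaScript; the makeToken helper and the token names are just illustrative, not taken from Twig or any particular lexer:

// a hypothetical token representation: a group name plus an optional value
function makeToken(name, value) {
    return value === undefined ? { name: name } : { name: name, value: value };
}

// the examples above expressed as data
var tokens = [
    makeToken('digit', '1'),
    makeToken('literal', '"hello"'),
    makeToken('for'),
    makeToken('comparison', '=='),
    makeToken('id', 'variableName'),
    makeToken('function')
];

tokens.forEach(function (token) {
    console.log(token);
});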

difference between lexing and parsing

Lexers and parsers are closely linked, yet they are distinct concepts. The lexer specializes in extracting the relevant portions of the text and transforming them into structures with "meaning", in this case the tokens. The parser, on the other hand, has the job of analyzing the syntactic structure of a text, for example, saying whether in a certain programming language the expression "hello" 1 == "anotherliteral" is syntactically valid or invalid. The connection between them is that the structures produced by the lexer are what the parser uses as its source to perform the parsing, that is, the parser works with tokens. This is not mandatory; nothing prevents you from building a parser that works directly on plain text. However, separating the two tasks brings some advantages:

  1. Design simplification. A parser that also does the work of a lexer is significantly more complex.
  2. Optimization. By separating the two tasks (lexing and parsing) you have more freedom to apply specific optimization techniques for each task.
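
To make the lexer/parser split concrete, here is a minimal sketch of a parser that consumes tokens rather than raw text. It only recognizes the rule literal == literal; the token stream, the parseComparison function, and the expect helper are assumptions for illustration, not Twig's actual API:

// a hypothetical token stream, as a lexer might produce it
var tokens = [
    { name: 'literal', value: '"hello"' },
    { name: 'comparison', value: '==' },
    { name: 'literal', value: '"anotherliteral"' }
];

// a tiny parser for the grammar rule: expression -> literal comparison literal
function parseComparison(tokens) {
    var pos = 0;

    function expect(name) {
        if (pos < tokens.length && tokens[pos].name === name) {
            return tokens[pos++];
        }
        throw 'Syntax error: expected ' + name + ' at token ' + pos;
    }

    expect('literal');
    expect('comparison');
    expect('literal');

    if (pos !== tokens.length) {
        throw 'Syntax error: unexpected trailing tokens';
    }
    return true; // the token sequence is syntactically valid
}

console.log(parseComparison(tokens)); // true

Note that the parser never looks at the original text; it relies entirely on the token names produced by the lexer.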

a lexer in practice

Theory is essential, but nothing beats real code to aid comprehension. Here I show a handwritten JavaScript lexer that handles a subset of CSS; more specifically, it handles this example:

h1 {
    color: red;
    font-size: 30px;
}

body {
    background-color: yellow;
}

div {
    margin: 10px;
    padding: 10px;
}

The code can be run and will show the list of tokens generated after processing our target CSS:

// our CSS
var css = "h1 { \n\
    color: red; \n\
    font-size: 30px; \n\
} \n\
\n\
body { \n\
    background-color: yellow; \n\
} \n\
\n\
div { \n\
    margin: 10px; \n\
    padding: 10px; \n\
} \n\
";

/**
 * List defining our tokens and the respective regular expressions
 * that find them in the text.
 */
var TOKENS = [
    {name: 'EMPTY',           regex: /^(\s+)/ },
    {name: 'RULE_SET_START',  regex: /^({)/ },
    {name: 'RULE_SET_END',    regex: /^(})/ },
    {name: 'RULE_DEFINITION', regex: /^(:)/ },
    {name: 'RULE_END',        regex: /^(;)/ },
    {name: 'SELECTOR',        regex: /^([a-zA-Z]+[a-zA-Z0-9]*)(?=\s*{)/ },
    {name: 'RULE',            regex: /^([a-z][-a-z]+)(?=\s*:)/ },
    {name: 'RULE_VALUE',      regex: /^(\w+)(?=\s*(;|}))/ }
];

var text = css;
var outputTokenList = [];

while (text !== '') { // while there is still text to be consumed
    var hasMatch = false;

    /**
     * We iterate over the whole TOKENS list until we find one whose pattern
     * matches the beginning of our text.
     * When there is a match we add the lexeme with its respective token to
     * the lexer's output token list.
     * If no pattern matches the text, an exception is thrown printing the
     * line that contains the error.
     */
    for (var i = 0; i < TOKENS.length; i++) {
        var obj = TOKENS[i];
        var tokenName = obj.name;
        var tokenRegex = obj.regex;
        var matched = text.match(tokenRegex);

        if (!matched) {
            continue;
        }

        hasMatch = true;
        var lexeme = matched[1];

        // remove the matched lexeme from the text
        // so that another one can be considered
        // in the next iteration
        text = text.substring(lexeme.length);

        if (tokenName in {'SELECTOR': 1, 'RULE': 1, 'RULE_VALUE': 1}) {
            // we add both the token name and the lexeme
            outputTokenList.push([tokenName, lexeme]);
        } else if (tokenName in {'EMPTY': 1}) {
            // discard, we do not care about spaces and line breaks.
        } else {
            // in these cases the relationship between the token name
            // and the lexeme is 1<->1, so we do not need to add the lexeme.
            outputTokenList.push([tokenName]);
        }

        break;
    }

    if (!hasMatch) {
        throw 'Invalid pattern at:\n' + (text.substring(0, text.indexOf('\n')));
    }
}

outputTokenList.forEach(function(token) {
    document.write('< ' + token[0]);
    if (token.length > 1) {
        document.write(' , ' + token[1]);
    }
    document.write(' ><br>');
});
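
When run in a browser, the output for the first rule set should look roughly like this (one token per line):

< SELECTOR , h1 >
< RULE_SET_START >
< RULE , color >
< RULE_DEFINITION >
< RULE_VALUE , red >
< RULE_END >
< RULE , font-size >
< RULE_DEFINITION >
< RULE_VALUE , 30px >
< RULE_END >
< RULE_SET_END >

followed by the tokens for the body and div rule sets.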

I won't go into explanations of lexer implementations, as that is an extensive subject not directly related to the question. Note that this is just an illustration of how a lexer can work: it reads a target text and generates tokens as output. The code is neither efficient nor robust.
