Lexical Analysis

Lexical analysis is the first phase of a compiler. It reads raw source code and breaks it into a structured stream of tokens that the rest of the compiler can work with.

Computer Science
Lexical Analysis

Every program you write starts its journey through a compiler as a single, flat string of characters. Before the compiler can understand your intentions, check your syntax, or generate machine code, it must first make sense of that raw character stream. That is the job of the lexical analyzer.

Lexical analysis is the first phase of compilation. It reads source code character by character, groups those characters into meaningful units called lexemes, assigns each one a category code called a token, and passes the resulting stream to the syntax analyzer. Everything else in the compiler depends on this step being done correctly.

The Three Implementation Approaches

To understand where lexical analysis fits, it helps to understand the broader landscape of language implementation. There are three main approaches:

Compilation translates a high-level program entirely into machine code before execution. The resulting binary runs directly on hardware. C, C++, and COBOL are typical examples. Compilation produces fast executables, but the translation step takes time upfront.

Pure interpretation executes programs directly from their source form, with no translation step. A software interpreter reads and evaluates the program statement by statement. JavaScript embedded in HTML pages is a classic example. This is flexible and platform-independent, but slower at runtime.

Hybrid implementation translates source code into an intermediate form, which is then interpreted. Java and Python both follow this model, compiling to bytecode that runs on a virtual machine. Just-in-Time (JIT) compilers, which translate bytecode to machine code at runtime (the first time a method is called), bring hybrid systems close to compiled performance.

All three approaches use both a lexical analyzer and a syntax analyzer. Understanding what each does, and why they are kept separate, is the foundation of compiler theory.

Why Separate Lexical Analysis from Syntax Analysis?

Nearly all compilers divide the task of analyzing source text into two distinct phases: lexical analysis and syntax analysis. There are three compelling reasons for this separation:

Simplicity. The techniques required for lexical analysis are less complex than those needed for syntax analysis. Keeping them separate means each component can be simpler and easier to reason about. Removing low-level character-level details from the syntax analyzer makes it both smaller and cleaner.

Efficiency. Lexical analysis accounts for a significant portion of total compilation time, so it is worth optimizing. Syntax analysis, by contrast, benefits less from optimization. Keeping them separate allows selective tuning of whichever phase needs it.

Portability. The lexical analyzer reads input files and often manages buffering, making it somewhat platform-dependent. The syntax analyzer, which only sees the token stream, can be fully platform-independent. Isolating machine-dependent components is always good engineering practice.

What is a Lexical Analyzer?

A lexical analyzer is a pattern matcher. It scans an input string of characters, identifies substrings that match predefined patterns, and produces a structured output for each match. At its core, it does two things: recognize valid token patterns and report what it found.

When the syntax analyzer needs the next token, it calls the lexical analyzer. The lexical analyzer scans forward through the source, skips whitespace and comments, identifies the next lexeme, determines its token category, and returns both to the caller. This continues until the entire source file has been consumed.

The lexical analyzer also serves several secondary functions:

  • It skips whitespace and comments, which carry no semantic meaning
  • It inserts user-defined names into the symbol table, used by later compiler phases
  • It detects and reports low-level token errors, such as malformed numeric literals

Tokens and Lexemes

Two terms are central to understanding lexical analysis:

A lexeme is the actual character sequence extracted from the source code. It is the raw text that matched a pattern.

A token is the category assigned to a lexeme. Tokens are the symbolic codes the syntax analyzer works with. While they are typically represented as integer values internally, they are written as named constants for readability.

Consider this assignment statement:

result = oldsum - value / 100;

The lexical analyzer produces the following token-lexeme pairs:

TokenLexeme
IDENTresult
ASSIGN_OP=
IDENToldsum
SUB_OP-
IDENTvalue
DIV_OP/
INT_LIT100
SEMICOLON;

As another example, the C declaration:

int value = 100;

produces these tokens:

LexemeToken Category
intkeyword
valueidentifier
=operator
100constant
;symbol

Tokens include keywords, identifiers, constants, string literals, operators, and punctuation symbols. The exact set of token categories is defined by the grammar of the programming language.

Specifications of Tokens

To define what the lexical analyzer should recognize, we need precise vocabulary. Formal language theory provides the tools.

Alphabets

An alphabet is any finite, non-empty set of symbols. Examples:

  • {0, 1} is the binary alphabet
  • {0–9, A–F} is the hexadecimal alphabet
  • {a–z, A–Z} is the English letter alphabet

Strings

A string is any finite sequence of symbols drawn from an alphabet. The length of a string is the number of symbols it contains, written with vertical bars: |Compiler_Design| = 16. A string of zero length is the empty string, written as ε (epsilon).

Special Symbols

High-level programming languages use a rich set of special symbols, organized by purpose:

  • Arithmetic: +, -, *, /, %
  • Punctuation: ,, ;, ., ->
  • Assignment: =, +=, -=, *=, /=
  • Comparison: ==, !=, <, <=, >, >=
  • Logical: &, &&, |, ||, !
  • Shift: >>, >>>, <<, <<<
  • Preprocessor: #
  • Address: &

Language

A language is a finite or infinite set of strings over some alphabet. Computer languages are treated as mathematical sets, and set operations (union, concatenation, closure) can be performed on them.

Regular Expressions

The lexical analyzer needs to recognize a finite set of valid token patterns. Regular expressions are the standard notation for specifying those patterns.

A regular expression describes a set of strings. The set of strings matched by a regular expression is called the regular language (or regular set) it defines. Regular grammars are well-understood, efficiently implementable, and expressive enough to describe all programming language token patterns.

Why Regular Expressions?

  • They are a concise and readable notation for pattern specifications
  • They have efficient implementations via finite automata
  • They are composable through well-defined algebraic rules
  • The tokens of any programming language form a regular language

Operations on Languages

Three operations are fundamental:

Union of languages L and M produces all strings that appear in either:

L ∪ M = { p | p is in L or p is in M }

Concatenation of L and M produces all strings formed by taking a string from L followed by a string from M:

LM = { pq | p is in L and q is in M }

Kleene Closure (L*) produces zero or more concatenations of L, including the empty string:

L* = L0 ∪ L1 ∪ L2 ∪ ...

Positive Closure (L+) produces one or more concatenations of L, excluding the empty string:

L+ = L1 ∪ L2 ∪ L3 ∪ ...

Operator Precedence

When evaluating regular expressions, operators are applied in this order (highest to lowest):

  1. Kleene closure (*) - unary, highest precedence
  2. Concatenation - left-associative
  3. Union (|) - left-associative, lowest precedence

So ab*|c is read as (a(b*)) | c, not a(b*|c).

Algebraic Laws

Regular expressions obey algebraic laws that allow equivalent transformations:

LawDescription
r | s = s | rUnion is commutative
r | (s | t) = (r | s) | tUnion is associative
r(st) = (rs)tConcatenation is associative
r(s | t) = rs | rtConcatenation distributes over union
εr = rε = rε is the identity for concatenation
r* = (r | ε)*ε is guaranteed in closure
r** = r*Closure is idempotent

Regular Definitions

Regular definitions give names to regular expressions for reuse. A sequence of definitions takes the form:

letter   →  A | B | ... | Z | a | b | ... | z
digit    →  0 | 1 | 2 | ... | 9
id       →  letter (letter | digit)*

The identifier pattern letter (letter | digit)* reads: "a letter, followed by zero or more letters or digits." This is the token pattern for variable names in most programming languages.

Example: Language Described by Regular Expressions

Regular ExpressionLanguage
aThe set containing only the string "a"
r | sL(r) union L(s)
rsConcatenation of L(r) and L(s)
r*Kleene closure of L(r)

Finite Automata

A regular expression is useful for specification, but a compiler needs to execute pattern matching. The tool for this is a finite automaton (FA).

A finite automaton is a mathematical machine defined by:

  • A finite set of states Q
  • A finite set of input symbols Σ
  • A start state q0
  • A set of final (accepting) states F
  • A transition function δ: Q × Σ → Q

The automaton processes an input string symbol by symbol, following transitions between states. If it reaches a final state when the input is exhausted, the string is accepted. Otherwise, it is rejected.

A finite automaton built from a regular expression is called a recognizer for that language. It answers yes or no: does this input string belong to the language?

Deterministic vs Non-Deterministic Automata

Deterministic Finite Automata (DFA): For every state and every input symbol, exactly one transition is defined. The automaton's next state is always uniquely determined.

Non-Deterministic Finite Automata (NFA): Multiple transitions may be defined for the same state and symbol. Transitions on the empty string ε are also allowed, meaning the automaton can change state without consuming any input.

Both DFA and NFA recognize exactly the class of regular languages. Every NFA can be converted to an equivalent DFA. DFAs are more efficient to execute; NFAs are often easier to construct from a regular expression.

Read Also: Automata Theory - Finite Automata, DFA, and NFA

State Transition Diagrams

A state diagram is a directed graph representation of a finite automaton:

  • Nodes represent states
  • Directed edges (arcs) represent transitions, labeled with the input symbol that triggers them
  • Double circles indicate final (accepting) states
  • An arrow with no source points to the start state

For example, a DFA that accepts any three-digit binary string ending in 1:

FA = { Q(q0, q1, q2, qf), Σ(0,1), q0, qf, δ }

The automaton reads three symbols. On the third symbol, it only transitions to the final state if that symbol is 1.

The Longest Match Rule

When the lexical analyzer scans source code, it follows the Longest Match Rule: always recognize the longest possible lexeme for each token.

Consider:

int intvalue;

After reading i, n, t, the analyzer has matched the keyword int. But it must keep reading, because intvalue is a longer valid identifier. The longest match rule means intvalue is recognized as a single identifier token, not the keyword int followed by the identifier value.

When two patterns match a lexeme of the same length, rule priority applies: reserved words and keywords take priority over user-defined identifiers.

Converting Regular Expressions to DFA

Building a lexical analyzer requires translating regular expression token specifications into a DFA that can be executed efficiently. There are two primary methods.

Method 1: Thompson's Construction (RE to NFA to DFA)

Step 1: Convert the regular expression to an NFA.

Thompson's construction provides rules for each operator:

  • Union (r1 | r2): Create a new start state with ε-transitions to the start states of both r1 and r2. Create a new final state with ε-transitions from both final states.
  • Concatenation (r1 r2): Connect the final state of r1 to the start state of r2 via an ε-transition.
  • Closure (r1*): Add a new start and final state, with ε-transitions that allow skipping r1 entirely or looping back through it.

Step 2: Convert the NFA to a DFA using Subset Construction.

The key operation is ε-closure: the set of all states reachable from a given state by following ε-transitions alone.

The subset construction algorithm:

  1. Compute the ε-closure of the NFA start state. This becomes the DFA start state.
  2. For each DFA state (a set of NFA states) and each input symbol, compute the set of NFA states reachable by that symbol followed by any ε-transitions.
  3. This result is the next DFA state. If it has not been seen before, add it to the set of DFA states.
  4. Repeat until no new DFA states are produced.
  5. Any DFA state that contains an NFA final state is a DFA final state.

The transition function for the DFA is:

Dtran[state, symbol] = ε-closure(move(state, symbol))

Method 2: Direct Construction (RE to DFA)

The direct method constructs a DFA from a regular expression without the intermediate NFA step. It uses the augmented regular expression r#, where # is a special end marker.

The regular expression is represented as a syntax tree. Interior nodes correspond to operators (union, concatenation, closure). Leaf nodes correspond to input symbols, each assigned a unique position number.

Four functions are computed for each node in the syntax tree:

  • nullable(n): True if the subexpression at node n can generate the empty string ε.
  • firstpos(n): The set of positions that correspond to the first symbol of some string generated by the subexpression at n.
  • lastpos(n): The set of positions that correspond to the last symbol of some string generated by the subexpression at n.
  • followpos(i): The set of positions that can immediately follow position i in any string generated by the full expression.

Rules for computing these functions:

Node Typenullablefirstposlastpos
Leaf εtrue{}{}
Leaf at position ifalse{i}{i}
c1 | c2nullable(c1) or nullable(c2)firstpos(c1) ∪ firstpos(c2)lastpos(c1) ∪ lastpos(c2)
c1 c2nullable(c1) and nullable(c2)if nullable(c1): firstpos(c1) ∪ firstpos(c2), else firstpos(c1)if nullable(c2): lastpos(c1) ∪ lastpos(c2), else lastpos(c2)
c1*truefirstpos(c1)lastpos(c1)

Computing followpos:

Two rules define when position j is in followpos(i):

  1. If n is a concatenation node c1 c2, then for every position i in lastpos(c1), all positions in firstpos(c2) are in followpos(i).
  2. If n is a star node c1*, then for every position i in lastpos(n), all positions in firstpos(n) are in followpos(i).

The DFA states are sets of positions. The DFA start state is firstpos of the root. The DFA final states are all states that contain the position of the end marker #.

Building a Lexical Analyzer in Practice

There are three practical approaches to constructing a lexical analyzer:

Use a generator tool. Write formal token descriptions using a tool-specific notation based on regular expressions, and let a tool generate the lexical analyzer automatically. The classic UNIX tool lex (and its GNU successor flex) does exactly this. This is the most common approach in production compilers.

Implement a state transition diagram by hand. Design the DFA states and transitions manually for your language's token patterns, then code it as a program with a switch-case or if-else structure mirroring the state machine.

Build a table-driven implementation. Design the DFA manually, then encode the entire transition function as a two-dimensional table indexed by state and input symbol. The main loop simply looks up the next state in the table. This is highly efficient and easy to modify by changing the table.

Summary

Lexical analysis transforms raw source text into a clean, structured token stream. It is the first compiler phase precisely because everything that follows depends on having well-defined units to work with.

The theory behind it is clean and well-settled: tokens are regular languages, regular languages are described by regular expressions, and regular expressions compile to finite automata that execute efficiently. This pipeline from specification to implementation is one of the most elegant results in computer science, and it underpins every compiler and interpreter ever written.

Whether you are building a compiler, writing a parser, or designing a domain-specific language, lexical analysis is always where the work begins.

Share this article

Post on XLinkedInWhatsApp