Lexical Analysis
Lexical analysis is the first phase of a compiler. It reads raw source code and breaks it into a structured stream of tokens that the rest of the compiler can work with.

Every program you write starts its journey through a compiler as a single, flat string of characters. Before the compiler can understand your intentions, check your syntax, or generate machine code, it must first make sense of that raw character stream. That is the job of the lexical analyzer.
Lexical analysis is the first phase of compilation. It reads source code character by character, groups those characters into meaningful units called lexemes, assigns each one a category code called a token, and passes the resulting stream to the syntax analyzer. Everything else in the compiler depends on this step being done correctly.
The Three Implementation Approaches
To understand where lexical analysis fits, it helps to understand the broader landscape of language implementation. There are three main approaches:
Compilation translates a high-level program entirely into machine code before execution. The resulting binary runs directly on hardware. C, C++, and COBOL are typical examples. Compilation produces fast executables, but the translation step takes time upfront.
Pure interpretation executes programs directly from their source form, with no translation step. A software interpreter reads and evaluates the program statement by statement. JavaScript embedded in HTML pages is a classic example. This is flexible and platform-independent, but slower at runtime.
Hybrid implementation translates source code into an intermediate form, which is then interpreted. Java and Python both follow this model, compiling to bytecode that runs on a virtual machine. Just-in-Time (JIT) compilers, which translate bytecode to machine code at runtime (the first time a method is called), bring hybrid systems close to compiled performance.
All three approaches use both a lexical analyzer and a syntax analyzer. Understanding what each does, and why they are kept separate, is the foundation of compiler theory.
Why Separate Lexical Analysis from Syntax Analysis?
Nearly all compilers divide the task of analyzing source text into two distinct phases: lexical analysis and syntax analysis. There are three compelling reasons for this separation:
Simplicity. The techniques required for lexical analysis are less complex than those needed for syntax analysis. Keeping them separate means each component can be simpler and easier to reason about. Removing low-level character-level details from the syntax analyzer makes it both smaller and cleaner.
Efficiency. Lexical analysis accounts for a significant portion of total compilation time, so it is worth optimizing. Syntax analysis, by contrast, benefits less from optimization. Keeping them separate allows selective tuning of whichever phase needs it.
Portability. The lexical analyzer reads input files and often manages buffering, making it somewhat platform-dependent. The syntax analyzer, which only sees the token stream, can be fully platform-independent. Isolating machine-dependent components is always good engineering practice.
What is a Lexical Analyzer?
A lexical analyzer is a pattern matcher. It scans an input string of characters, identifies substrings that match predefined patterns, and produces a structured output for each match. At its core, it does two things: recognize valid token patterns and report what it found.
When the syntax analyzer needs the next token, it calls the lexical analyzer. The lexical analyzer scans forward through the source, skips whitespace and comments, identifies the next lexeme, determines its token category, and returns both to the caller. This continues until the entire source file has been consumed.
The lexical analyzer also serves several secondary functions:
- •It skips whitespace and comments, which carry no semantic meaning
- •It inserts user-defined names into the symbol table, used by later compiler phases
- •It detects and reports low-level token errors, such as malformed numeric literals
Tokens and Lexemes
Two terms are central to understanding lexical analysis:
A lexeme is the actual character sequence extracted from the source code. It is the raw text that matched a pattern.
A token is the category assigned to a lexeme. Tokens are the symbolic codes the syntax analyzer works with. While they are typically represented as integer values internally, they are written as named constants for readability.
Consider this assignment statement:
result = oldsum - value / 100;
The lexical analyzer produces the following token-lexeme pairs:
| Token | Lexeme |
|---|---|
IDENT | result |
ASSIGN_OP | = |
IDENT | oldsum |
SUB_OP | - |
IDENT | value |
DIV_OP | / |
INT_LIT | 100 |
SEMICOLON | ; |
As another example, the C declaration:
int value = 100;
produces these tokens:
| Lexeme | Token Category |
|---|---|
int | keyword |
value | identifier |
= | operator |
100 | constant |
; | symbol |
Tokens include keywords, identifiers, constants, string literals, operators, and punctuation symbols. The exact set of token categories is defined by the grammar of the programming language.
Specifications of Tokens
To define what the lexical analyzer should recognize, we need precise vocabulary. Formal language theory provides the tools.
Alphabets
An alphabet is any finite, non-empty set of symbols. Examples:
- •
{0, 1}is the binary alphabet - •
{0–9, A–F}is the hexadecimal alphabet - •
{a–z, A–Z}is the English letter alphabet
Strings
A string is any finite sequence of symbols drawn from an alphabet. The length of a string is the number of symbols it contains, written with vertical bars: |Compiler_Design| = 16. A string of zero length is the empty string, written as ε (epsilon).
Special Symbols
High-level programming languages use a rich set of special symbols, organized by purpose:
- •Arithmetic:
+,-,*,/,% - •Punctuation:
,,;,.,-> - •Assignment:
=,+=,-=,*=,/= - •Comparison:
==,!=,<,<=,>,>= - •Logical:
&,&&,|,||,! - •Shift:
>>,>>>,<<,<<< - •Preprocessor:
# - •Address:
&
Language
A language is a finite or infinite set of strings over some alphabet. Computer languages are treated as mathematical sets, and set operations (union, concatenation, closure) can be performed on them.
Regular Expressions
The lexical analyzer needs to recognize a finite set of valid token patterns. Regular expressions are the standard notation for specifying those patterns.
A regular expression describes a set of strings. The set of strings matched by a regular expression is called the regular language (or regular set) it defines. Regular grammars are well-understood, efficiently implementable, and expressive enough to describe all programming language token patterns.
Why Regular Expressions?
- •They are a concise and readable notation for pattern specifications
- •They have efficient implementations via finite automata
- •They are composable through well-defined algebraic rules
- •The tokens of any programming language form a regular language
Operations on Languages
Three operations are fundamental:
Union of languages L and M produces all strings that appear in either:
L ∪ M = { p | p is in L or p is in M }
Concatenation of L and M produces all strings formed by taking a string from L followed by a string from M:
LM = { pq | p is in L and q is in M }
Kleene Closure (L*) produces zero or more concatenations of L, including the empty string:
L* = L0 ∪ L1 ∪ L2 ∪ ...
Positive Closure (L+) produces one or more concatenations of L, excluding the empty string:
L+ = L1 ∪ L2 ∪ L3 ∪ ...
Operator Precedence
When evaluating regular expressions, operators are applied in this order (highest to lowest):
- •Kleene closure (
*) - unary, highest precedence - •Concatenation - left-associative
- •Union (
|) - left-associative, lowest precedence
So ab*|c is read as (a(b*)) | c, not a(b*|c).
Algebraic Laws
Regular expressions obey algebraic laws that allow equivalent transformations:
| Law | Description |
|---|---|
r | s = s | r | Union is commutative |
r | (s | t) = (r | s) | t | Union is associative |
r(st) = (rs)t | Concatenation is associative |
r(s | t) = rs | rt | Concatenation distributes over union |
εr = rε = r | ε is the identity for concatenation |
r* = (r | ε)* | ε is guaranteed in closure |
r** = r* | Closure is idempotent |
Regular Definitions
Regular definitions give names to regular expressions for reuse. A sequence of definitions takes the form:
letter → A | B | ... | Z | a | b | ... | z
digit → 0 | 1 | 2 | ... | 9
id → letter (letter | digit)*
The identifier pattern letter (letter | digit)* reads: "a letter, followed by zero or more letters or digits." This is the token pattern for variable names in most programming languages.
Example: Language Described by Regular Expressions
| Regular Expression | Language |
|---|---|
a | The set containing only the string "a" |
r | s | L(r) union L(s) |
rs | Concatenation of L(r) and L(s) |
r* | Kleene closure of L(r) |
Finite Automata
A regular expression is useful for specification, but a compiler needs to execute pattern matching. The tool for this is a finite automaton (FA).
A finite automaton is a mathematical machine defined by:
- •A finite set of states
Q - •A finite set of input symbols
Σ - •A start state
q0 - •A set of final (accepting) states
F - •A transition function
δ: Q × Σ → Q
The automaton processes an input string symbol by symbol, following transitions between states. If it reaches a final state when the input is exhausted, the string is accepted. Otherwise, it is rejected.
A finite automaton built from a regular expression is called a recognizer for that language. It answers yes or no: does this input string belong to the language?
Deterministic vs Non-Deterministic Automata
Deterministic Finite Automata (DFA): For every state and every input symbol, exactly one transition is defined. The automaton's next state is always uniquely determined.
Non-Deterministic Finite Automata (NFA): Multiple transitions may be defined for the same state and symbol. Transitions on the empty string ε are also allowed, meaning the automaton can change state without consuming any input.
Both DFA and NFA recognize exactly the class of regular languages. Every NFA can be converted to an equivalent DFA. DFAs are more efficient to execute; NFAs are often easier to construct from a regular expression.
State Transition Diagrams
A state diagram is a directed graph representation of a finite automaton:
- •Nodes represent states
- •Directed edges (arcs) represent transitions, labeled with the input symbol that triggers them
- •Double circles indicate final (accepting) states
- •An arrow with no source points to the start state
For example, a DFA that accepts any three-digit binary string ending in 1:
FA = { Q(q0, q1, q2, qf), Σ(0,1), q0, qf, δ }
The automaton reads three symbols. On the third symbol, it only transitions to the final state if that symbol is 1.
The Longest Match Rule
When the lexical analyzer scans source code, it follows the Longest Match Rule: always recognize the longest possible lexeme for each token.
Consider:
int intvalue;
After reading i, n, t, the analyzer has matched the keyword int. But it must keep reading, because intvalue is a longer valid identifier. The longest match rule means intvalue is recognized as a single identifier token, not the keyword int followed by the identifier value.
When two patterns match a lexeme of the same length, rule priority applies: reserved words and keywords take priority over user-defined identifiers.
Converting Regular Expressions to DFA
Building a lexical analyzer requires translating regular expression token specifications into a DFA that can be executed efficiently. There are two primary methods.
Method 1: Thompson's Construction (RE to NFA to DFA)
Step 1: Convert the regular expression to an NFA.
Thompson's construction provides rules for each operator:
- •Union (
r1 | r2): Create a new start state with ε-transitions to the start states of both r1 and r2. Create a new final state with ε-transitions from both final states. - •Concatenation (
r1 r2): Connect the final state of r1 to the start state of r2 via an ε-transition. - •Closure (
r1*): Add a new start and final state, with ε-transitions that allow skipping r1 entirely or looping back through it.
Step 2: Convert the NFA to a DFA using Subset Construction.
The key operation is ε-closure: the set of all states reachable from a given state by following ε-transitions alone.
The subset construction algorithm:
- •Compute the ε-closure of the NFA start state. This becomes the DFA start state.
- •For each DFA state (a set of NFA states) and each input symbol, compute the set of NFA states reachable by that symbol followed by any ε-transitions.
- •This result is the next DFA state. If it has not been seen before, add it to the set of DFA states.
- •Repeat until no new DFA states are produced.
- •Any DFA state that contains an NFA final state is a DFA final state.
The transition function for the DFA is:
Dtran[state, symbol] = ε-closure(move(state, symbol))
Method 2: Direct Construction (RE to DFA)
The direct method constructs a DFA from a regular expression without the intermediate NFA step. It uses the augmented regular expression r#, where # is a special end marker.
The regular expression is represented as a syntax tree. Interior nodes correspond to operators (union, concatenation, closure). Leaf nodes correspond to input symbols, each assigned a unique position number.
Four functions are computed for each node in the syntax tree:
- •nullable(n): True if the subexpression at node n can generate the empty string ε.
- •firstpos(n): The set of positions that correspond to the first symbol of some string generated by the subexpression at n.
- •lastpos(n): The set of positions that correspond to the last symbol of some string generated by the subexpression at n.
- •followpos(i): The set of positions that can immediately follow position i in any string generated by the full expression.
Rules for computing these functions:
| Node Type | nullable | firstpos | lastpos |
|---|---|---|---|
Leaf ε | true | {} | {} |
Leaf at position i | false | {i} | {i} |
c1 | c2 | nullable(c1) or nullable(c2) | firstpos(c1) ∪ firstpos(c2) | lastpos(c1) ∪ lastpos(c2) |
c1 c2 | nullable(c1) and nullable(c2) | if nullable(c1): firstpos(c1) ∪ firstpos(c2), else firstpos(c1) | if nullable(c2): lastpos(c1) ∪ lastpos(c2), else lastpos(c2) |
c1* | true | firstpos(c1) | lastpos(c1) |
Computing followpos:
Two rules define when position j is in followpos(i):
- •If n is a concatenation node
c1 c2, then for every position i in lastpos(c1), all positions in firstpos(c2) are in followpos(i). - •If n is a star node
c1*, then for every position i in lastpos(n), all positions in firstpos(n) are in followpos(i).
The DFA states are sets of positions. The DFA start state is firstpos of the root. The DFA final states are all states that contain the position of the end marker #.
Building a Lexical Analyzer in Practice
There are three practical approaches to constructing a lexical analyzer:
Use a generator tool. Write formal token descriptions using a tool-specific notation based on regular expressions, and let a tool generate the lexical analyzer automatically. The classic UNIX tool lex (and its GNU successor flex) does exactly this. This is the most common approach in production compilers.
Implement a state transition diagram by hand. Design the DFA states and transitions manually for your language's token patterns, then code it as a program with a switch-case or if-else structure mirroring the state machine.
Build a table-driven implementation. Design the DFA manually, then encode the entire transition function as a two-dimensional table indexed by state and input symbol. The main loop simply looks up the next state in the table. This is highly efficient and easy to modify by changing the table.
Summary
Lexical analysis transforms raw source text into a clean, structured token stream. It is the first compiler phase precisely because everything that follows depends on having well-defined units to work with.
The theory behind it is clean and well-settled: tokens are regular languages, regular languages are described by regular expressions, and regular expressions compile to finite automata that execute efficiently. This pipeline from specification to implementation is one of the most elegant results in computer science, and it underpins every compiler and interpreter ever written.
Whether you are building a compiler, writing a parser, or designing a domain-specific language, lexical analysis is always where the work begins.


