“Maximal-munch” tokenization in linear time

Authors:
Thomas Reps
Affiliations:
Univ. of Wisconsin, Madison
Venue:
ACM Transactions on Programming Languages and Systems (TOPLAS)
Year:
1998

Abstract

The lexical-analysis (or scanning) phase of a compiler attempts to partition an input string into a sequence of tokens. The convention in most languages is that the input is scanned left to right, and each token identified is a “maximal munch” of the remaining input: the longest prefix of the remaining input that is a token of the language. Although most of the standard compiler textbooks present a way to perform maximal-munch tokenization, the algorithm they describe can, for certain sets of token definitions, cause the scanner to exhibit quadratic behavior in the worst case. In this article, we show that maximal-munch tokenization can always be performed in time linear in the size of the input.
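
To make the quadratic case concrete, the following Python sketch contrasts a backtracking maximal-munch scanner of the kind the abstract attributes to the textbooks with a variant that remembers (state, position) pairs already known to lead to no further token. Everything in it is an illustrative assumption rather than material taken from the article: the token set (`a` and `a*b`), the hand-built DFA, the names `tokenize_naive` and `tokenize_memo`, and the step counters. The memoizing variant is a sketch of the tabulation idea, not a transcription of the article's algorithm.

```python
# Hypothetical DFA for the two token definitions `a` and `a*b`.
# State 0: start; state 1: read "a" (accepting); state 2: read "aa..." (not
# accepting, still hoping for a final "b"); state 3: read a*b (accepting).
START = 0
ACCEPTING = {1, 3}
DELTA = {
    (0, "a"): 1, (0, "b"): 3,
    (1, "a"): 2, (1, "b"): 3,
    (2, "a"): 2, (2, "b"): 3,
}


def tokenize_naive(text, delta=DELTA, start=START, accepting=ACCEPTING):
    """Backtracking maximal munch: run the DFA as far as it goes, then back
    up to the last accepting position. Returns (tokens, transitions taken)."""
    tokens, steps, pos = [], 0, 0
    while pos < len(text):
        state, i, last_accept = start, pos, None
        while i < len(text) and (state, text[i]) in delta:
            state = delta[(state, text[i])]
            i += 1
            steps += 1
            if state in accepting:
                last_accept = i
        if last_accept is None:
            raise ValueError(f"no token starts at position {pos}")
        tokens.append(text[pos:last_accept])
        pos = last_accept  # characters after last_accept get rescanned
    return tokens, steps


def tokenize_memo(text, delta=DELTA, start=START, accepting=ACCEPTING):
    """Same result as tokenize_naive, but (state, position) pairs from which
    no accepting state was reached are recorded and never explored again."""
    failed = set()
    tokens, steps, pos = [], 0, 0
    while pos < len(text):
        state, i, last_accept = start, pos, None
        trail = []  # pairs visited since the most recent accepting state
        while (i < len(text) and (state, text[i]) in delta
               and (state, i) not in failed):
            trail.append((state, i))
            state = delta[(state, text[i])]
            i += 1
            steps += 1
            if state in accepting:
                last_accept = i
                trail.clear()
        failed.update(trail)  # none of these pairs reached a later accept
        if last_accept is None:
            raise ValueError(f"no token starts at position {pos}")
        tokens.append(text[pos:last_accept])
        pos = last_accept
    return tokens, steps


if __name__ == "__main__":
    for n in (100, 200, 400):
        _, naive_steps = tokenize_naive("a" * n)
        _, memo_steps = tokenize_memo("a" * n)
        print(n, naive_steps, memo_steps)
```

On an input of n copies of `a`, the backtracking scanner reads to the end of the input looking for a `b` before emitting each one-character token, so it takes n(n+1)/2 transitions in total. In the memoizing variant, every pair that is explored and not consumed by the emitted token is marked as failed, so the work is bounded by the number of DFA states times the input length.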