Class UkkonenSuffixTreeBuilder
Builds objects, such as edges and nodes, for SuffixTreeNode structures, using the Ukkonen algorithm.
Inheritance
Implements
Inherited Members
Namespace: MoreStructures.SuffixTrees.Builders
Assembly: MoreStructures.dll
Syntax
public class UkkonenSuffixTreeBuilder : IBuilder<SuffixTreeEdge, SuffixTreeNode>
Remarks
Iterative implementation of the Ukkonen algorithm: see https://en.wikipedia.org/wiki/Ukkonen%27s_algorithm for an introduction and https://www.cs.helsinki.fi/u/ukkonen/SuffixT1withFigs.pdf for the original paper.
ALGORITHM OPTIMIZATIONS
The algorithm applies some optimizations to achieve linear complexity:
- Edge Labels compression: strings on edges are stored as (start, end) indexes. This makes the space
used by a single edge constant, i.e. O(1), because it always consists of two integers, regardless of
how many chars of the text it represents.
- Global End: all leafs have a dynamic end index, autoincremented every new phase. This makes the time
used to apply Rule 1 Extension constant, i.e. O(1), because incrementing the global end also increments
all leaf implicitely.
- Suffix Links: all internal nodes point to another internal node sharing the suffix. This makes more
efficient traversal, because storing and jumping the suffix link when needed means traversal doesn't
have to be done again.
- Terminators Index Map: a function is used, mapping each index i of the full input text (consolidated
TextWithTerminator including all TextWithTerminator instances passed in the
input) to the index of the next terminator appearing in the full text not before i. Such function is used
to perform edge trimming of the mutable tree structure built by the Ukkonen algorithm in linear time.
ALGORITHM
The algorithm is structured in phases, as many as the number of chars in the text.
At the beginning of a new phase, both the number of remaining suffixes to take care of, and the global end
pointer are both increased by 1.
Each phase is composed of at least 1 iteration, each one taking care of remaining suffixes.
At the beginning
USECASES
Not limited by call stack depth. Convenient with long input text (i.e. string having a length < ~1K chars).
COMPLEXITY
- The Time Complexity of the Ukkonen algorithm, building the mutable tree from the input texts, is O(t).
Space Complexity is O(2 * t) where t = length of the text to match.
- While there are as many phases as number of chars in text (t), and there can be multiple iterations per
phase (as many as the number of remaining suffixes to process), the complexity is still linear, ~ 2t.
- Each node of the resulting mutable tree (there are O(n) of them, and at most 2 * n), is processed by the
mutable tree conversion method at most twice: once for leaves and twice for intermediate nodes, which are
re-pushed onto the stack before its children, so that they can be processed after all their children have
already been converted.
- Each stack frame processing by the mutable tree conversion method performs constant-time operations only:
SuffixTreeNode and SuffixTreeEdge instances creation, children dictionary
instantiation and edge trimming (trimming the label of edges in generalized Suffix Trees, to the first
terminator appearing in the label, when multiple are present - i.e. when the label spans multiple
concatenated texts).
- Edge trimming is a constant-time operation once the cost of calculating the Terminators Index Map is paid,
which is a linear cost in time and space (linear in t).
- Therefore, mutable tree conversion, and the overall algorithm, is linear in time and space over t.
Methods
| Improve this Doc View SourceBuildTree(TextWithTerminator[])
Declaration
public SuffixTreeNode BuildTree(params TextWithTerminator[] texts)
Parameters
Type | Name | Description |
---|---|---|
TextWithTerminator[] | texts |
Returns
Type | Description |
---|---|
SuffixTreeNode |