Class SuffixTreeBasedSnssFinder
A ISnssFinder implementation and SuffixStructureBasedSnssFinder concretion which uses a SuffixTreeNode structure to implement Find(IEnumerable<Char>, IEnumerable<Char>).
Implements
Inherited Members
Namespace: MoreStructures.SuffixStructures.Matching
Assembly: MoreStructures.dll
Syntax
public class SuffixTreeBasedSnssFinder : SuffixStructureBasedSnssFinder, ISnssFinder
Remarks
ADVANTAGES AND DISADVANTAGES
- Compared to the naive implementation of NaiveSnssFinder, has both average and worst case better
runtime, at the cost of space used (which is O(1) for the naive implementation).
- Compared to SuffixTrieBasedSnssFinder, it has better Time and Space Complexity, due to Branches
Coaleascing combined with Path Compression. It's, however, more complex to implement, visualize and debug
step-by-step.
ALGORITHM
- First uses the UkkonenSuffixTreeBuilder to build a suffix tree of the
concatenation of first text, Terminator1, second text and
Terminator2, stopping branches at the first terminator
encountered. This structure is also known as Generalized Suffix Tree.
- Visit the suffix tree breadth-first, stopping at the first node such that the root-to-node prefix is
substring of text1 but not of text2.
- The root-to-node prefix is a substring of text1 when there is a path-to-leaf which contains an edge including
Terminator1.
- The root-to-node prefix is NOT a substring of text2 when there is no path-to-leaf which doesn't contain
Terminator1.
- Such substring of text1 is guaranteed to be the shortest in number of edges by the visit order imposed by the
breadth-first search. However, unlike for Suffix Tries, in Suffix Trees the number of chars per edges varies.
- Therefore, all pontential results have to be collected, then sorted by actual prefix length.
COMPLEXITY
- Validating the input requires going through text1 and text2, and takes linear time in the the number of chars
n of the concatenated text "text1 ## separator1 ## text2 ## separator2", and constant space.
- Building the Generalized Suffix Tree via the Ukkonen algorithm takes time and space at least proportional to
the number of nodes of the tree, which is linear with n.
- For each level of the breadth-first traversal of the trie, all node-to-leaf paths are checked (in the worst
case).
- There are at most n levels in the tree, since there can't be a path longer than a suffix of the concatenated
text. The higher is the level, the shorter are node-to-leaf paths. However, their number is always the same
or lower.
- For each node there are as many node-to-leaf paths as leaves, and there are at most n leaves in the tree
(since each suffix can add at most a single intermediate node and a single leaf, having terminator 1 or
terminator2 as incoming edge).
- Checking whether a path contains terminator1 takes constant space and a time proportional to the number of
nodes in the path, which is O(n).
- The following optimization is implemented: if a path P1 starting from a node N1 identifies a prefix p1 which
is a potential SNSS, all paths starting from nodes Pi which are descendants of N1 would identify prefixes pi
which would be longer than p1, so they can be excluded.
- Rebuilding the string from each identified path takes O(n) time and space. Sorting would take time
O(m * log(m) * n) where m is the number of potential SNSS (which is in average much smaller than n).
- So in conclusion, Time Complexity is O(n^2) and Space Complexity is O(n).
Constructors
| Improve this Doc View SourceSuffixTreeBasedSnssFinder(Char, Char)
Builds an instance with the provided terminators, validating the input by default. All text passed to Find(IEnumerable<Char>, IEnumerable<Char>) must conform with these two terminators.
Declaration
public SuffixTreeBasedSnssFinder(char terminator1, char terminator2)
Parameters
Type | Name | Description |
---|---|---|
System.Char | terminator1 | |
System.Char | terminator2 |
Methods
| Improve this Doc View SourceFind(IEnumerable<Char>, IEnumerable<Char>)
This implementation builds and uses a SuffixTreeNode structure to perform the search.
Declaration
public override IEnumerable<string> Find(IEnumerable<char> text1, IEnumerable<char> text2)
Parameters
Type | Name | Description |
---|---|---|
IEnumerable<System.Char> | text1 | |
IEnumerable<System.Char> | text2 |
Returns
Type | Description |
---|---|
IEnumerable<System.String> |