Class SuffixTreeBasedSnssFinder

A ISnssFinder implementation and SuffixStructureBasedSnssFinder concretion which uses a SuffixTreeNode structure to implement Find(IEnumerable<Char>, IEnumerable<Char>).

Inheritance

System.Object

SuffixStructureBasedSnssFinder

SuffixTreeBasedSnssFinder

Implements

ISnssFinder

Inherited Members

SuffixStructureBasedSnssFinder.Terminator1

SuffixStructureBasedSnssFinder.Terminator2

SuffixStructureBasedSnssFinder.ValidateInput

SuffixStructureBasedSnssFinder.ValidateTexts(IEnumerable<Char>, IEnumerable<Char>)

SuffixStructureBasedSnssFinder.CollectPrefixChars<TEdge, TNode>(TextWithTerminator, TNode, IDictionary<TNode, TreeTraversalVisit<TEdge, TNode>>)

System.Object.Equals(System.Object)

System.Object.Equals(System.Object, System.Object)

System.Object.GetHashCode()

System.Object.GetType()

System.Object.MemberwiseClone()

System.Object.ReferenceEquals(System.Object, System.Object)

System.Object.ToString()

Namespace: MoreStructures.SuffixStructures.Matching

Assembly: MoreStructures.dll

Syntax

public class SuffixTreeBasedSnssFinder : SuffixStructureBasedSnssFinder, ISnssFinder

Remarks

ADVANTAGES AND DISADVANTAGES
- Compared to the naive implementation of NaiveSnssFinder, has both average and worst case better runtime, at the cost of space used (which is O(1) for the naive implementation).
- Compared to SuffixTrieBasedSnssFinder, it has better Time and Space Complexity, due to Branches Coaleascing combined with Path Compression. It's, however, more complex to implement, visualize and debug step-by-step.

ALGORITHM
- First uses the UkkonenSuffixTreeBuilder to build a suffix tree of the concatenation of first text, Terminator1, second text and Terminator2, stopping branches at the first terminator encountered. This structure is also known as Generalized Suffix Tree.
- Visit the suffix tree breadth-first, stopping at the first node such that the root-to-node prefix is substring of text1 but not of text2.
- The root-to-node prefix is a substring of text1 when there is a path-to-leaf which contains an edge including Terminator1.
- The root-to-node prefix is NOT a substring of text2 when there is no path-to-leaf which doesn't contain Terminator1.
- Such substring of text1 is guaranteed to be the shortest in number of edges by the visit order imposed by the breadth-first search. However, unlike for Suffix Tries, in Suffix Trees the number of chars per edges varies.
- Therefore, all pontential results have to be collected, then sorted by actual prefix length.

COMPLEXITY
- Validating the input requires going through text1 and text2, and takes linear time in the the number of chars n of the concatenated text "text1 ## separator1 ## text2 ## separator2", and constant space.
- Building the Generalized Suffix Tree via the Ukkonen algorithm takes time and space at least proportional to the number of nodes of the tree, which is linear with n.
- For each level of the breadth-first traversal of the trie, all node-to-leaf paths are checked (in the worst case).
- There are at most n levels in the tree, since there can't be a path longer than a suffix of the concatenated text. The higher is the level, the shorter are node-to-leaf paths. However, their number is always the same or lower.
- For each node there are as many node-to-leaf paths as leaves, and there are at most n leaves in the tree (since each suffix can add at most a single intermediate node and a single leaf, having terminator 1 or terminator2 as incoming edge).
- Checking whether a path contains terminator1 takes constant space and a time proportional to the number of nodes in the path, which is O(n).
- The following optimization is implemented: if a path P1 starting from a node N1 identifies a prefix p1 which is a potential SNSS, all paths starting from nodes Pi which are descendants of N1 would identify prefixes pi which would be longer than p1, so they can be excluded.
- Rebuilding the string from each identified path takes O(n) time and space. Sorting would take time O(m * log(m) * n) where m is the number of potential SNSS (which is in average much smaller than n).
- So in conclusion, Time Complexity is O(n^2) and Space Complexity is O(n).

Constructors

| Improve this Doc View Source

SuffixTreeBasedSnssFinder(Char, Char)

Builds an instance with the provided terminators, validating the input by default. All text passed to Find(IEnumerable<Char>, IEnumerable<Char>) must conform with these two terminators.

Declaration

public SuffixTreeBasedSnssFinder(char terminator1, char terminator2)

Parameters

Type	Name	Description
System.Char	terminator1
System.Char	terminator2

Methods

| Improve this Doc View Source

Find(IEnumerable<Char>, IEnumerable<Char>)

This implementation builds and uses a SuffixTreeNode structure to perform the search.

Declaration

public override IEnumerable<string> Find(IEnumerable<char> text1, IEnumerable<char> text2)

Parameters

Type	Name	Description
IEnumerable<System.Char>	text1
IEnumerable<System.Char>	text2

Returns

Type	Description
IEnumerable<System.String>

Overrides

SuffixStructureBasedSnssFinder.Find(IEnumerable<Char>, IEnumerable<Char>)

Remarks

Implements

ISnssFinder

Extension Methods

SuffixStructureNodeExtensions.GetAllSuffixesFor<TEdge, TNode>(TNode, TextWithTerminator)