Compression Method of Factor Oracle by Triple-Array Structures
- Koji Bando, Takato Nakano, Kazuhiro Morita, and Masao Fuketa
Pattern matching is an important technique in text processing and is used for character string replacement and search. A factor oracle is a data structure for pattern matching, and it is a finite state automaton that can search substrings. This data structure consists of internal and external transitions and has the characteristic property of accepting at least all substrings. The automaton including the factor oracle is represented by a two-dimensional array called Table, Johnson method, etc. Search using Table is fast, but the memory capacity is large. The Johnson method has the feature of representing the factor oracle using a small amount of storage, and since it can be achieved with a transition speed of O(1), it is considered as fast as Table. The memory of the Johnson method depends on the size of the element itself and the number of elements. In other words, by reducing these, the applicability of the factor oracle to enormous data will be further enhanced. In this research, using compact double-array, we propose a method consisting of compressing the size of the elements themselves, improving the Johnson method in order to achieve a construction using external transitions only, and compressing the number of elements. As shown by the experimental results, the proposed method has higher storage efficiency compared to the Johnson method and is capable under certain conditions of high-speed search.