Int J Performability Eng ›› 2019, Vol. 15 ›› Issue (4): 1247-1254.doi: 10.23940/ijpe.19.04.p20.12471254

Previous Articles     Next Articles

Compression Method of Factor Oracle by Triple-Array Structures

Koji Bandoa, Takato Nakanob, Kazuhiro Moritac, and Masao Fuketac, *   

  1. a NTT Plala Inc., 24th Florr, Sunshine 60, 3-1-1 Higashi Ikebukuro, Toshima-ku, Tokyo, 170-6024, Japan;
    b Nichia Corporation, 491 Oka, Kaminaka-Cho, Anan-Shi, Tokushima, 774-8601, Japan;
    c Tokushima University, 2-1 Minami-Josanjima, Tokushima, 770-8506, Japan
  • Revised on ; Accepted on
  • About author:Koji Bando received his B.E. degree from Tokushima University, Tokushima, Japan, in 1977. After graduating in 1977, He joined NTT Corporation, and he is currently the President and CEO of NTT Plala Inc., Tokyo, Japan. In 2003, he received Telecommunications Association's IT Business Encouragement Special Award. In 2018, he began participating in a doctoral course at Tokushima University. His research interest includes information retrieval, natural language processing, and artificial intelligence. Takato Nakano received his B.Sc. degree in information science and intelligent systems from Tokushima University in 2017. He currently works for Nichia Corporation. Kazuhiro Morita received his B.Sc., M.Sc., and Ph.D. degrees in information science and intelligent systems from Tokushima University, Japan, in 1995, 1997, and 2000, respectively. He was a research assistant from 2000 to 2006 in information science and intelligent systems at Tokushima University. He is currently an associate professor in the Department of Information Science and Intelligent Systems at Tokushima University. His research interests are sentence retrieval from huge text databases, double-array structures, and binary search tree. Masao Fuketa received his B.Sc., M.Sc., and Ph.D. degrees in information science and intelligent systems from Tokushima University in 1993, 1995, and 1998, respectively. He was a research assistant and an associate professor respectively from 1998 to 2000 and 2000 to 2015 in information science and intelligent systems at Tokushima University. He is currently a professor in the Department of Information Science and Intelligent Systems at Tokushima University. He is a member of the Information Processing Society in Japan and the Association for Natural Language Processing of Japan. His research interests are information retrieval and natural language processing.

Abstract: Pattern matching is an important technique in text processing and is used for character string replacement and search. A factor oracle is a data structure for pattern matching, and it is a finite state automaton that can search substrings. This data structure consists of internal and external transitions and has the characteristic property of accepting at least all substrings. The automaton including the factor oracle is represented by a two-dimensional array called Table, Johnson method, etc. Search using Table is fast, but the memory capacity is large. The Johnson method has the feature of representing the factor oracle using a small amount of storage, and since it can be achieved with a transition speed of O(1), it is considered as fast as Table. The memory of the Johnson method depends on the size of the element itself and the number of elements. In other words, by reducing these, the applicability of the factor oracle to enormous data will be further enhanced. In this research, using compact double-array, we propose a method consisting of compressing the size of the elements themselves, improving the Johnson method in order to achieve a construction using external transitions only, and compressing the number of elements. As shown by the experimental results, the proposed method has higher storage efficiency compared to the Johnson method and is capable under certain conditions of high-speed search.

Key words: factor oracle, pattern matching, triple-array, double-array