blog2.shibu.jp


Performance of the FM-index

Since last week I have been tuning the index file size. The search algorithm I use, the FM-index, is flexible: programmers can easily control its performance trade-offs.

I changed the sorting algorithm inside the Burrows-Wheeler transform from quicksort to induced sorting. The complexities of these sorting algorithms are O(n log n) and O(n) respectively, so as the input data gets bigger, induced sorting finishes earlier. Now my search engine can create index files from big inputs in a reasonable time. Induced sorting is described in the following paper:
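To make the transform concrete, here is a minimal sketch of the comparison-sorting approach I replaced: build the suffix array by sorting all suffixes (the O(n log n) way), then read the BWT off it. The "$" sentinel and the function names are my own illustration, not the engine's actual code.

```javascript
// Naive suffix-array construction by comparison sort: the O(n log n)
// baseline that induced sorting (SA-IS) replaces. The "$" sentinel is
// assumed to sort before every other character.
function suffixArray(text) {
  const s = text + "$";
  const sa = Array.from({ length: s.length }, (_, i) => i);
  sa.sort((a, b) => (s.slice(a) < s.slice(b) ? -1 : 1));
  return sa;
}

// The BWT is read off the suffix array: for each suffix, take the
// character that precedes it in the text (wrapping around).
function bwt(text) {
  const s = text + "$";
  return suffixArray(text)
    .map((i) => s[(i + s.length - 1) % s.length])
    .join("");
}

console.log(bwt("abracadabra"));  // "ard$rcaaaabb"
```

Sorting string slices like this is quadratic in practice; it only serves to show what the suffix array and the BWT are.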

One of the biggest documents around me that needs an efficient search algorithm (like the FM-index) is the Japanese translation of the Python programming language documentation. The PDF version of the document is 3,000 pages. Last week's version of my search engine couldn't process this document (it took more than 10 minutes), but now it finishes in 60 seconds. So I could check the performance with data that is suitable for practical use.

Inside the FM-index there is a data structure derived from the original source document, called the suffix array; it is the most important part. But computing an arbitrary entry of it at search time takes much time, so all the FM-index implementations I have read keep pre-calculated cache data, and my search engine has this feature too. I compared performance across cache densities, using the Python-JP translation document as the source data (4.6 million characters) and searching for the word "構造体" ("structure"; it is used more than 100 times in the document):

Cache Density    Search Response    Cache Size
0.01%            22 s               7 KB
0.1%             2 s                74 KB
0.3%             600 ms             228 KB
1%               190 ms             738 KB
25%              11 ms              18.5 MB

Cache density and cache size increase commensurately, and search response time is inversely proportional to them. In my use case the search engine works in browsers, so index files are downloaded over the Internet and the size of each index file is an important factor. If you run the engine on a server, file size matters much less, and you can get extreme performance by using big cache data. The cache size is proportional to the input text size, so you can use a high cache density for small documents. In that case you can do real-time search (updating the results as the user types each character) even in the browser.

Reducing Index File Size

If the index file becomes smaller, you can use a bigger cache and get more performance. Now I am implementing the following features:

  • Compressing arrays of numbers:

    It is a simple compression algorithm: Run-Length Encoding (RLE). The data contains many blank areas (long runs of the number 0). For data like the following, storing just the count of the zero items compresses it well. It reduces the input data by 40%.

    [1, 0, 0, 0, 0, 0, 0, 2] -> [1, [0 * 6], 2]
    

    Plain RLE has a weak point: the result becomes bigger than the original data when there are few continuous values. So I implemented a modified version of the algorithm, Switched Run-Length Encoding.

    There are many compression algorithms with higher compression ratios, but web servers and browsers can already compress transferred data with gzip to increase transfer speed, and stacking similar algorithms on top of that is not efficient. RLE is faster to decompress, its result size is not bad, and it doesn't interfere with the servers' and browsers' own compression.
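    As a sketch, a plain zero-run RLE matching the example above could look like this. This is the unmodified algorithm; the real Switched RLE adds a mode switch to avoid growing sparse data, and its exact on-disk format may differ.

    ```javascript
    // Zero-run RLE: runs of zeros become a (0, runLength) pair, other
    // values pass through unchanged.
    function rleEncode(nums) {
      const out = [];
      for (let i = 0; i < nums.length; ) {
        if (nums[i] === 0) {
          let run = 0;
          while (i < nums.length && nums[i] === 0) { run++; i++; }
          out.push(0, run);  // zero marker followed by the run length
        } else {
          out.push(nums[i++]);
        }
      }
      return out;
    }

    function rleDecode(encoded) {
      const out = [];
      for (let i = 0; i < encoded.length; i++) {
        if (encoded[i] === 0) {
          out.push(...new Array(encoded[++i]).fill(0));
        } else {
          out.push(encoded[i]);
        }
      }
      return out;
    }

    console.log(rleEncode([1, 0, 0, 0, 0, 0, 0, 2]));  // [1, 0, 6, 2]
    ```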

  • Remapping character code:

    JavaScript uses UTF-16 as its character encoding, so every character occupies 2 bytes in memory. My search engine uses a wavelet matrix to store the BWT data; it is an inverted table of character bits against string positions. Here is an example of the memory image. (A real wavelet matrix stores the data in each row sorted, but the following sample is left unsorted for the sake of explanation.)

    String "ABC" -> [0000000001000001] [0000000001000010] [0000000001000011]
    
    Wavelet-Matrix "ABC" -> [000] [000] [000] [000] [000] [000] [000] [000] [000] [111] [000] [000] [000] [000] [011] [101]
    

    As you can see, there are many empty rows. If you remap the character codes so that 2 bits are enough to store each character, you can reduce the wavelet matrix depth significantly.

    String "ABC" -> [00] [01] [10]
    Wavelet-Matrix "ABC" -> [001] [010]
    

    The Python Japanese translation uses 1,557 distinct characters (11 bits), so the result file size becomes 11/16 of the original. The Python English documentation uses just 147 characters (8 bits), so it will be half.
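    A minimal sketch of this remapping step (hypothetical names, not the engine's real code): collect the distinct characters, assign them dense codes, and the wavelet matrix then needs only ceil(log2(alphabet size)) bit planes instead of UTF-16's 16.

    ```javascript
    // Remap characters to dense codes and report the bit depth needed.
    // For the Japanese docs' 1,557 distinct characters this gives 11 bit
    // planes (2^10 < 1557 <= 2^11), and 8 for the English docs' 147.
    function remap(text) {
      const alphabet = [...new Set(text)].sort();
      const code = new Map(alphabet.map((c, i) => [c, i]));
      const bits = Math.max(1, Math.ceil(Math.log2(alphabet.length)));
      return { bits, codes: [...text].map((c) => code.get(c)) };
    }

    console.log(remap("ABC"));  // { bits: 2, codes: [0, 1, 2] }
    ```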

The Python Japanese document is 9.2 MB in UTF-16. The expected result file size is 5.1 MB (the cache data at 1% density is 740 KB of that). The actual transfer size, compressed by gzip, becomes 2.4 MB. My first version exported 26 MB, so it is now about 1/10!

Remaining Tasks

I have implemented a search command for the console and base64-encoded index file export. I want to release the browser interface this week.

  • Browser Interface
  • Some small parts (storing stemmed words, parsing CSV and text files), etc.