blog2.shibu.jp


Performance of the FM-index

Since last week, I have been tuning the index file size. The search algorithm, FM-index, is flexible: it lets programmers control its performance easily.

I changed the sorting algorithm inside the Burrows-Wheeler transform from quicksort to induced sorting (SA-IS). Their costs are O(n log n) and O(n) respectively, so as the input grows, induced sorting finishes sooner. Now my search engine can create index files from big inputs in a reasonable time. The algorithm is described in the paper "Linear Suffix Array Construction by Almost Pure Induced-Sorting" by Nong, Zhang and Chan.
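For contrast, here is a minimal sketch (in TypeScript; my illustration, not the engine's actual JSX code) of the comparison-sort construction that induced sorting replaces. Each of the O(n log n) comparisons can scan up to n characters, which is why it falls behind on big inputs:

// Naive suffix-array construction by comparison sorting. Each
// comparison may scan (and here even copy) up to n characters;
// induced sorting avoids these repeated scans.
function naiveSuffixArray(text: string): number[] {
    const suffixes = Array.from({ length: text.length }, (_, i) => i);
    suffixes.sort((a, b) => {
        const sa = text.slice(a);
        const sb = text.slice(b);
        return sa < sb ? -1 : sa > sb ? 1 : 0;
    });
    return suffixes;
}

// Suffixes of "abracadabra" in sorted order:
console.log(naiveSuffixArray("abracadabra")); // [10, 7, 0, 3, 5, 8, 1, 4, 6, 9, 2]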

One of the biggest documents around me that needs a more efficient search algorithm (like FM-index) is the Japanese translation of the Python language documentation. The PDF version of the document is 3,000 pages. Last week's version of my search engine couldn't process this document (it took more than 10 minutes), but now it finishes in 60 seconds. So I could check the performance with data that is representative of practical use.

Inside the FM-index there is a data structure derived from the original source document, called the suffix array; it is the most important piece. Recomputing its values at search time takes a long time, though, so every FM-index implementation I have read stores pre-calculated values as a cache, and my search engine has this feature too. I compared performance at several cache densities, using the Python-JP translation document as source data (4.6 million characters) and the query word "構造体" (it means "structure" and appears more than 100 times in the document):

Cache Density   Search Response   Cache Size
0.01%           22 s              7 KB
0.1%            2 s               74 KB
0.3%            600 ms            228 KB
1%              190 ms            738 KB
25%             11 ms             18.5 MB

Cache density and cache size grow together, and search response time is roughly inversely proportional to them. In my use case the search engine runs in browsers, so index files are downloaded over the Internet and the size of each index file is an important factor. If you run the engine on a server, file size matters much less, and you can get extreme performance from a big cache. The cache size is proportional to the input text size, so for small documents you can afford a high cache density. In that case you can even do real-time search (updating the result each time the user types a character) in the browser.
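Here is a minimal TypeScript sketch of the caching technique itself, suffix-array sampling, which is how FM-index implementations usually pre-calculate this data (I assume the cache density above is the sampling rate; the lf function is a stand-in for the index's last-to-first mapping, not my engine's actual interface):

// Sampled suffix array: keep only every k-th entry (by text
// position). Recovering an unsampled entry walks the LF-mapping,
// so a lookup costs at most k steps. A denser cache (smaller k)
// answers faster but stores proportionally more data.
class SampledSuffixArray {
    private samples = new Map<number, number>();

    constructor(suffixArray: number[], k: number,
                private lf: (row: number) => number) {
        suffixArray.forEach((pos, row) => {
            if (pos % k === 0) {
                this.samples.set(row, pos); // keep roughly 1/k of the entries
            }
        });
    }

    // Recover SA[row]: walk LF until a sampled row is found, then
    // add back the number of steps taken.
    lookup(row: number): number {
        let steps = 0;
        while (!this.samples.has(row)) {
            row = this.lf(row);
            steps++;
        }
        return this.samples.get(row)! + steps;
    }
}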

Reducing Index File Size

If the index file gets smaller, you can afford a bigger cache and get more performance. Now I am implementing the following features:

  • Compressing arrays of numbers:

    This is a simple compression scheme (run-length encoding, RLE). The data contains many blank areas (long runs of the number 0). For data like the following, storing just the count of the zero items compresses it well; it reduces the input data by 40%.

    [1, 0, 0, 0, 0, 0, 0, 2] -> [1, [0 * 6], 2]
    

    Plain RLE has a weak point: the result becomes bigger than the original data when there are few consecutive repeated values. So I implemented a modified version of the algorithm, switched run-length encoding (see the sketch after this list).

    There are many compression algorithms with higher compression ratios, but web servers and browsers already compress transferred data with gzip to increase transfer speed, and layering a similar algorithm on top of that is inefficient. RLE is fast to decompress, its output size is acceptable, and it doesn't interfere with the servers'/browsers' own compression.

  • Remapping character codes:

    JavaScript uses UTF-16 as its character encoding, so every character occupies 2 bytes in memory. My search engine uses a wavelet matrix to store the BWT data. It transposes the bit representation of the string: row j holds bit j of every character. Here is an example memory image. (A real wavelet matrix stores the data sorted within each row, but the following sample is left unsorted for clarity.)

    String "ABC" -> [0000000001000001] [0000000001000010] [0000000001000011]
    
    Wavelet-Matrix "ABC" -> [000] [000] [000] [000] [000] [000] [000] [000] [000] [111] [000] [000] [000] [000] [011] [101]
    

    As you can see, there are many empty rows. If you remap the character codes onto just the bits needed to distinguish the characters (2 bits here), you can reduce the wavelet matrix depth significantly.

    String "ABC" -> [00] [01] [10]
    Wavelet-Matrix "ABC" -> [001] [010]
    

    The Python Japanese translation uses 1,557 distinct characters (11 bits), so the result file shrinks to 11/16 of the original size. The Python English documentation uses just 147 distinct characters (8 bits), so it will be half.
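Here is the TypeScript sketch of switched run-length encoding mentioned above. It is one plausible variant, alternating literal blocks with zero-run counts; the actual encoder in my engine may differ in details:

// Switched RLE (one plausible variant): the output alternates a
// literal block [count, values...] with the length of the zero run
// that follows it. Isolated values only pay the block-count
// overhead, so the output cannot blow up the way plain RLE can.
function switchedRLE(input: number[]): number[] {
    const output: number[] = [];
    let i = 0;
    while (i < input.length) {
        // literal block: values up to the next zero
        const litStart = i;
        while (i < input.length && input[i] !== 0) i++;
        output.push(i - litStart, ...input.slice(litStart, i));
        // zero run: just the count of consecutive zeros
        const zeroStart = i;
        while (i < input.length && input[i] === 0) i++;
        output.push(i - zeroStart);
    }
    return output;
}

// [1, 0, 0, 0, 0, 0, 0, 2] -> [1, 1, 6, 1, 2, 0]
// (one literal "1", six zeros, one literal "2", zero zeros)
console.log(switchedRLE([1, 0, 0, 0, 0, 0, 0, 2]));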

The Python Japanese document is 9.2 MB in UTF-16. The expected index file size is 5.1 MB (the cache data at 1% density is 740 KB of that). The actual transfer size, compressed by gzip, becomes 2.4 MB. My first version exported 26 MB, so it is now about 1/10 of that!

Remaining Tasks

I have implemented a console search command and base64-encoded index file export. I want to release the browser interface within this week. The remaining items are:

  • Browser Interface
  • Some small parts (storing stemmed words, parsing CSV and text files), etc.

Full Text Search Engine on JS(2)

I have almost finished the first implementation. Here is the current status.

All steps of the search engine

Here is a sample program that uses this search engine. It searches HTTP status codes and prints the matching lines:

$ git clone git@github.com:shibukawa/oktavia.git
$ cd oktavia
$ make
$ ./httpstatus 30
300: Multiple Choices
301: Moved Permanently
302: Found
303: See Other
304: Not Modified
305: Use Proxy

This program contains all the parts of the search engine. Using an index-based engine involves a number of steps:

  • Create an index file from source text.
    • Read source contents.
    • Set up the target file structure (register metadata).
    • Store an index file.
  • Search content and show result.
    • Read an index file.
    • Receive and parse a user query.
    • Search words.
    • Summarise each word’s result.
    • Show result.

This program does all the steps itself, but splitting them across two or more programs (typically an indexer and a searcher) is more common.

Read source contents

This part varies a lot between users: some want to search all the documents on their computers, others want to search their e-mail, and so on. My program provides the following input methods:

  • Local HTML files.
  • A single local CSV file.
  • Local text files.

In the future, I want to implement a Sphinx index generator that reads Sphinx's internal document data (the doctree).

Note

A future version will support multiple index files. For now it can handle a single CSV file, but that is not a big limitation.

Set up a target file structure

In my program this is called "metadata". The search target texts are combined into one big text inside the index file, and the metadata supplies the more detailed structure. My search engine provides the following four types of metadata:

  • Section

    This is a named content boundary. It is used for file names, a document's subsection titles and so on.

  • Splitter

    This is an anonymous content boundary. It is used for paragraphs, lines, words and so on.

  • Block

    This is a range of text. It is used for adjusting the score or filtering content.

  • Table

    It is used for databases.

For example, httpstatus's input data is a simple text file. It uses a Splitter at line breaks and merges the words detected on the same line, so each matched line is shown only once.
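To show how an anonymous boundary works, here is a minimal TypeScript sketch of mapping a match position back to its line. The Splitter class here is a hypothetical illustration of my own, not the engine's actual API; the real engine keeps these boundaries in bit-vectors, as described below:

// A Splitter as a sorted array of unit start positions in the
// combined text. A binary search maps a matched character
// position to the unit (line) that contains it.
class Splitter {
    // boundaries[i] = start position of unit i in the combined text
    constructor(private boundaries: number[]) {}

    unitOf(position: number): number {
        let lo = 0;
        let hi = this.boundaries.length - 1;
        while (lo < hi) {
            const mid = (lo + hi + 1) >> 1;
            if (this.boundaries[mid] <= position) lo = mid;
            else hi = mid - 1;
        }
        return lo;
    }
}

// Three lines starting at positions 0, 25 and 58:
const lines = new Splitter([0, 25, 58]);
console.log(lines.unitOf(30)); // -> 1 (the second line)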

Store and read an index file

The engine generates a binary index file containing a wavelet matrix, some pre-cached search results, and bit-vectors and name lists for the metadata. There is already an option to drop the pre-cached search results to reduce the file size. In the future, I want to add the following options:

  • Compress index by using zlib.
  • Export as source code (.jsx or .js)

On browsers, I assume index files are read with XMLHttpRequest. A search engine for browsers should use small index files because they have to be downloaded. Some browsers can't use XMLHttpRequest for local content (URLs starting with file:///), so JavaScript-source-style index files are provided as a workaround.

Receive and parse a user query

The receiving part is under construction. I want to provide a JavaScript library for browsers.

My search engine already supports Google-style search query syntax, except numeric range queries:

  • All words: word1 word2
  • Exact words or phrase: "word1 word2"
  • Any of these words: word1 OR word2
  • None of these words: -word1

For example, the following query returns the lines that contain "40" and don't contain "not":

$ ./httpstatus 40 -not
400: Bad Request
401: Unauthorized
402: Payment Required
403: Forbidden
407: Proxy Authentication Required
408: Request Timeout
409: Conflict
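As an illustration of this syntax, here is a minimal TypeScript sketch of a parser for the four query forms above; it is my own simplified version, not the engine's actual parser:

// Parse a Google-style query into terms:
//   word      -> required word
//   "a b"     -> exact phrase
//   -word     -> excluded word
//   a OR b    -> either word
interface QueryTerm {
    text: string;
    phrase: boolean;   // came from "..."
    excluded: boolean; // prefixed with -
    or: boolean;       // joined to the previous term with OR
}

function parseQuery(query: string): QueryTerm[] {
    const tokens = query.match(/-?"[^"]*"|\S+/g) ?? [];
    const terms: QueryTerm[] = [];
    let orNext = false;
    for (const token of tokens) {
        if (token === "OR") { orNext = true; continue; }
        const excluded = token.startsWith("-");
        const body = excluded ? token.slice(1) : token;
        const phrase = body.startsWith('"');
        terms.push({
            text: phrase ? body.slice(1, -1) : body,
            phrase,
            excluded,
            or: orNext,
        });
        orNext = false;
    }
    return terms;
}

// '40 -not' -> [{text: "40", ...}, {text: "not", excluded: true, ...}]
console.log(parseQuery('40 -not'));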

Search words and summarise each word’s result

The FM-index algorithm returns the positions of matched words. My search engine first groups each word's positions by a "search-unit": if the search-unit is "file", it creates a list of matched files; if it is "line", it creates a list of matched lines.

After creating these lists for each query word, the search engine applies the query's set operations (intersection, union, complement) to produce the final result, then calculates a score for each search-unit and sorts them.
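A minimal TypeScript sketch of this combination step (my illustration; for readability, HTTP status codes stand in for line indexes):

// Combine per-word unit lists with the query's set operations.
// Each set holds the indexes of matched "search-units" (files or lines).
function intersection(a: Set<number>, b: Set<number>): Set<number> {
    return new Set([...a].filter((x) => b.has(x)));
}

function union(a: Set<number>, b: Set<number>): Set<number> {
    return new Set([...a, ...b]);
}

function complement(a: Set<number>, universeSize: number): Set<number> {
    const result = new Set<number>();
    for (let i = 0; i < universeSize; i++) {
        if (!a.has(i)) result.add(i);
    }
    return result;
}

// "40 -not": units matching "40", minus units matching "not".
const with40 = new Set([400, 401, 402, 403, 404, 405, 406, 407, 408, 409]);
const withNot = new Set([304, 404, 405, 406, 501]); // "Not Modified", "Not Found", ...
const result = intersection(with40, complement(withNot, 600));
// -> {400, 401, 402, 403, 407, 408, 409}, matching the output above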

Remaining Tasks

Almost all the basic features are implemented now. To make this search engine usable on websites, I still need to implement:

  • User interface (receiving a user query and showing a result)
  • Source code style output (base64 encoding, decoding)
  • Index file loading
  • Some small parts (storing stemmed words, parsing CSV and text files), etc.

It is very close to a first release.

Full Text Search Engine on JS

I am creating a full text search engine in JavaScript. My source code is uploaded here.

Goal of my project

My goal is to create an alternative search engine for Sphinx's HTML output.

Sphinx is a document generator written in Python. It reads simply formatted text (reStructuredText) and converts it into several formats such as HTML and PDF. It is a very popular tool for writing programming language reference manuals, software instructions, library and API documentation, and so on.

Sphinx's HTML output has a simple built-in search engine: a document generated by Sphinx includes a search window, and the user can search the whole document set. It is written in JavaScript and downloads an index file and searches on the client browser, so a document provider doesn't have to set up a search engine on his/her server. Just deploy the html folder on a hosting server and search works.

I liked this concept, but originally it supported only English. I sent a patch that adds an API for supporting other languages, along with a Japanese plug-in for Sphinx. I heard it was merged into the latest stable version, 1.1.

When I visited Japan on a business trip (I live in the Bay Area now), I bought the book "World of high-speed string analysis". It describes an amazing algorithm, FM-index, which is well suited to a client-side search engine. I think it can improve Sphinx's search engine even more.

I am also a founder of sphinx-users.jp. I have since handed the leader position to @tk0miya and @shimizukawa, but I still like Sphinx.

In addition, there are many simple HTML hosting services now, like bitbucket.org, github.com, PyPI, Google Drive and more, and HTML5 provides an application cache feature. I think static HTML documents (like Sphinx's output) without a search engine server will keep increasing, so providing a pure JavaScript search engine will help many users.

FM-index

This engine uses the FM-index algorithm, which was invented by Paolo Ferragina and Giovanni Manzini. Compared to an inverted index, FM-index has the following advantages:

  • It doesn’t need word splitting before a search

    Some East Asian languages don't put spaces between words. Splitting such text into words requires language-specific information (called a "dictionary"). For example, the famous Japanese word splitter MeCab needs a big dictionary file (> 10 MB). I don't know Chinese and Korean well, but supporting many languages with an inverted index is difficult.

  • It can recreate the original document from the index file

    An inverted index only stores word positions in documents. To show document excerpts on the search result page, it needs the original documents in addition to the index file; FM-index needs only the index file both to search for words and to show excerpts. Sphinx uses the original document source files for this purpose, but that works poorly with the translation feature, because the source files are written in a different language, and the user has to enable the source-exporting option. A sketch of this reconstruction appears after this list.

  • It is the fastest full text search algorithm using a compressed index file

    The book "World of high-speed string analysis" says so. I don't have other evidence, because I have never implemented a compressed index for an inverted index algorithm, but smaller and faster is good for my goal.
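To illustrate the second advantage concretely, here is a minimal TypeScript sketch of recovering a text from its Burrows-Wheeler transform with the LF-mapping, the same mechanism the FM-index uses (a real implementation answers these rank queries on the compressed index instead of materialising plain arrays):

// Invert a BWT. The i-th occurrence of a character in the last
// column (the BWT itself) corresponds to the i-th occurrence of
// that character in the first column (the BWT, sorted).
function inverseBWT(bwt: string, primary: number): string {
    // first-column start offset of each character (counting sort)
    const counts = new Map<string, number>();
    for (const c of bwt) counts.set(c, (counts.get(c) ?? 0) + 1);
    const starts = new Map<string, number>();
    let total = 0;
    for (const c of [...counts.keys()].sort()) {
        starts.set(c, total);
        total += counts.get(c)!;
    }
    // lf[row] = the row whose rotation starts one character earlier
    const seen = new Map<string, number>();
    const lf: number[] = [];
    for (const c of bwt) {
        const rank = seen.get(c) ?? 0;
        lf.push(starts.get(c)! + rank);
        seen.set(c, rank + 1);
    }
    // walk backwards from the row that holds the original text
    let row = primary;
    const out: string[] = [];
    for (let i = 0; i < bwt.length; i++) {
        out.push(bwt[row]);
        row = lf[row];
    }
    return out.reverse().join("");
}

// The BWT of "banana$" is "annb$aa", and the original text is
// row 4 of the sorted rotation matrix:
console.log(inverseBWT("annb$aa", 4)); // -> "banana$"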

I ported the FM-index library Shellinford, which was written in C++ by @echizen_tm.

Other tools

The search algorithm is already ported and passes the unit tests, but a useful search engine needs several other parts.

First, I ported the HTML parsing library. It analyses HTML and extracts the text from existing HTML files; the index generator tool uses it.

Now I am porting stemming algorithms. Stemming converts words into their base/root form (for example, "running" becomes "run"). It makes the search engine more flexible.

I am modifying Snowball's source code generator. Snowball is a tool with a long history in this area: it has stemming algorithms for more than 15 natural languages and generates C/C++/Java source code for each algorithm. I added a JSX generator and am adding a Python one (Python is needed to support Sphinx).

I selected JSX as the implementation language. It was created by my coworkers at DeNA Co., Ltd. It generates faster and safer JavaScript code: its powerful static type system detects many errors before the program runs.

First Milestone

I will provide the search engine and an index file for the JSX website later this month.

Qt + JavaScript(1): Install QtScript Binding

Nowadays JavaScript is the most used language for HTML clients, and some people have started using it on servers with node.js. JavaScript can also be used with the most powerful cross-platform GUI framework, Qt.

Qt includes a JavaScript engine: Qt4 uses JavaScriptCore, made by Apple, and Qt5 uses V8, made by Google. Qt programmers can use the JavaScript engine through the following features:

  • QtScript
  • QtQuick

QtScript is a standard JavaScript (ECMAScript) environment. It provides a simple API for registering existing C++ classes that are based on QObject. The engine itself doesn't have any special features for Qt GUI programming, but you can register extensions for Qt widgets and use them for GUI programming.

QtQuick is one of the first-class programming tools for Qt5. It is a declarative programming environment whose main source code consists of JavaScript code segments inside QML. Qt5 has started supporting Android and iOS through QML.

On my blog, I will describe the QtScript environment.

QtScript Binding Generators

The QtScript Binding Generator generates source code that connects almost all Qt widgets to the QtScript engine. This tool allows you to implement GUI programs in JavaScript.

Unfortunately, there are many repositories of the QtScript Binding Generator scattered across Google Code, an old Qt Labs page and so on. The better ones are the two used in the clone commands below: the qt-labs repository on Gitorious for Qt4 and svalaskevicius's fork on GitHub for Qt5.

The binding generator for Qt5 currently supports fewer classes/modules than the Qt4 one. For example, it doesn't generate WebView, OpenGL or QGraphicsView support, and so on.

Install QtScript Binding Generator on MacOS

Install the Qt library for your machine; I am using Mac OS X 10.8, so I downloaded the installer from http://qt-project.org/downloads. Then install the git command to get the QtScript Binding Generator. MacPorts and Homebrew are common ways to install command-line development tools. My recommendation is MacPorts, because since version 2.0 it installs pre-built binary packages whenever possible, which saves time and electricity.

Get the QtScript Binding Generator source code and set the environment variables to build it:

# for Qt4.X
$ git clone git://gitorious.org/qt-labs/qtscriptgenerator.git
$ export QTDIR=~/QtSDK/Desktop/Qt/4.8.1/gcc/
$ export PATH=$QTDIR/bin:$PATH
# for Qt5
$ git clone git://github.com/svalaskevicius/qtscriptgenerator.git
$ export QTDIR=~/Qt5.0.02/5.0.0/clang_64/
$ export PATH=$QTDIR/bin:$PATH

First, build the binding generator, and then generate the registration code:

$ cd qtscriptgenerator/generator
$ qmake
$ make
(wait..)
$ ./generator --include-paths=${QTDIR}/gcc/include
(wait..)
Classes in typesystem: 532
Generated:
  - classes...: 514 (80)
  - header....: 327 (0)
  - impl......: 327 (0)
  - modules...: 10 (9)
  - pri.......: 10 (0)

Check the number of generated classes. If it is much smaller than shown above, check the QTDIR environment variable.

If your Mac OS X version is 10.8, you should use one of the following commands instead of plain qmake:

# for Qt 4.X
$ qmake -spec unsupported/macx-clang

# for Qt5
$ qmake -spec macx-clang

Let's build the generated QtScript binding code:

$ cd ../qtbindings/
$ qmake
$ make
(wait...)

Try a sample script like this:

$ ./qs_eval/qs_eval ../examples/AnalogClock.js
(Screenshot: the AnalogClock example window.)