blog2.shibu.jp

Python JavaScript Qt

Full Text Search Engine on JS(2)

I almost finished first implementation. I describe the current status.

All steps of the search engine

Here is a sample program that uses this search engine. It searches HTTP status codes and returns:

$ git clone git@github.com:shibukawa/oktavia.git
$ cd oktavia
$ git make
$ ./httpstatus 30
300: Multiple Choices
301: Moved Permanently
302: Found
303: See Other
304: Not Modified
305: Use Proxy

This program contains all parts of the search engine. To use an index engine, there are many steps to do like the following:

  • Create an index file from source text.
    • Read source contents.
    • Set up a target file structure (registers meta data).
    • Store an index file.
  • Search content and show result.
    • Read an index file.
    • Receive and parse a user query.
    • Search words.
    • Summarise each word’s result.
    • Show result.

This program does all steps, but splitting into more than two programs is common in most cases.

Read source contents

This part has many variations for each user. Some users want to search all documents in their computers and others want to search their e-mail. Gathering information has many options. My program provide the following methods:

  • Local HTML files.
  • A single local CSV file.
  • local text files.

In future, I want to implement the Sphinx index generator to read Sphinx’s internal document data (doctree).

Note

I will add a multiple index file support to future version will support. It can have single CSV file now, but it is not a big limitation.

Set up a target file structure

In my program, this is called “meta data”. Search target texts are combined into one big text in an index file. These meta data provide more detailed information. My search engine provides the following four types of meta data:

  • Section

    This is a named content boundary. It is used for file names, document’s sub section titles and so on.

  • Splitter

    This is an anonymous content boundary. It is used for paragraphs, lines, words and so on.

  • Block

    This is a range of text. It is used for adjusting the score or filtering content.

  • Table

    It is used for data bases.

For example, httpstatus‘s input data is a simple text. It uses the Splitter for line breaks. It combines detected words in the same lines. So each matched line is shown once.

Store&Read an index file

It generates a binary index file. It contains a wavelet-matrix, some pre-cached search results, bit-vectors and name lists for meta data. Now there is an option to remove pre-cached search results to reduce file size. In future, I want to add the following options:

  • Compress index by using zlib.
  • Export as source codes (.jsx or .js)

I assume read index files with XMLHTTPRequest exists on browsers. The search engine for browsers should use small index files because they are downloaded. Some browsers can’t use XMLHTTPRequest for local contents (the URL starts with file:///), so JavaScript source code style index files are used as a workaround.

Receive and parse a user query

The receiving part is under construction. I want to provide a JavaScript library for the browsers.

Now my search engine provides Google style search query syntax, except number ranging queries:

  • All words: word1 word2
  • Exact words or phrase: "word1 word2"
  • Any of these words: word1 OR word2
  • None of these words: -word1

For example, the following query returns lines that contain “40” and don’t contain “not”:

$ ./httpstatus 40 -not
400: Bad Request
401: Unauthorized
402: Payment Required
403: Forbidden
407: Proxy Authentication Required
408: Request Timeout
409: Conflict

Search words and summarise each word’s result

The FM-index algorithm can return matched word positions. At first, my search engine combines the positions of each word by using a “search-unit”. If the “search-unit” is a “file”, it creates a matched file list. If it is “line”, it creates a matched line list.

After creating the lists for each query word, the search engine applies set operations of the query (intersection, union, complement) and makes a final result. Then it calculates scores for each “search-unit” and sorts.

Remained Tasks

Now almost all basic features are implemented. To provide this search engine for websites, I need to implement:

  • User interface (receiving a user query and showing a result)
  • Source code style output (base64 encoding, decoding)
  • Index file loading
  • Some small parts (stores stemmed words, parses CSV and text files), etc...

It is very close to a first release.