Full Text Search Engine on JS(2)
I have almost finished the first implementation. This post describes the current status.
All steps of the search engine
Here is a sample program that uses this search engine. It searches HTTP status codes:

$ git clone git@github.com:shibukawa/oktavia.git
$ cd oktavia
$ make
$ ./httpstatus 30
300: Multiple Choices
301: Moved Permanently
302: Found
303: See Other
304: Not Modified
305: Use Proxy
This program contains all the parts of the search engine. Using a search engine involves several steps:
- Create an index file from source text.
  - Read source contents.
  - Set up the target file structure (register metadata).
  - Store the index file.
- Search content and show results.
  - Read the index file.
  - Receive and parse the user query.
  - Search for each word.
  - Summarise each word’s results.
  - Show the results.
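As a rough illustration of the two phases above, here is a minimal sketch in JavaScript. All names are hypothetical and the index is just the raw combined text; the real engine builds an FM-index instead:

```javascript
// Phase 1: build an "index" from source documents. This sketch only
// combines the texts and remembers where each document starts; the
// real engine would build an FM-index over the combined text.
function buildIndex(documents) {
  let combined = '';
  const boundaries = [];
  for (const doc of documents) {
    boundaries.push({ name: doc.name, start: combined.length });
    combined += doc.text + '\0'; // separator between documents
  }
  return { combined, boundaries };
}

// Phase 2: search the index and map each hit back to its document.
function search(index, word) {
  const results = [];
  let pos = index.combined.indexOf(word);
  while (pos !== -1) {
    // The last boundary at or before this position owns the match.
    const doc = index.boundaries.filter(b => b.start <= pos).pop();
    results.push({ name: doc.name, position: pos - doc.start });
    pos = index.combined.indexOf(word, pos + 1);
  }
  return results;
}
```

In practice the indexer (phase 1) and the searcher (phase 2) would be separate programs sharing the stored index file.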
This sample program performs all of these steps, but in practice they are usually split across two or more programs.
Read source contents
This part varies greatly from user to user: some want to search all the documents on their computers, others want to search their e-mail, and so on. There are many ways to gather information. My program provides the following sources:
- Local HTML files.
- A single local CSV file.
- Local text files.
In the future, I want to implement a Sphinx index generator that reads Sphinx’s internal document data (doctree).
A future version will also support multiple index files. For now only a single CSV file can be used, but that is not a big limitation.
Set up a target file structure
In my program, this is called “metadata”. The search target texts are combined into one big text in the index file, and the metadata adds more detailed structure on top of it. My search engine provides the following four types of metadata:

- Section: a named content boundary. It is used for file names, documents’ subsection titles and so on.
- Splitter: an anonymous content boundary. It is used for paragraphs, lines, words and so on.
- Block: a range of text. It is used for adjusting scores or filtering content.
- Table: it is used for databases.
For example, httpstatus‘s input data is plain text. It uses a Splitter at each line break and combines words detected on the same line, so each matched line is shown only once.
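The Splitter behaviour described above can be sketched like this. The real engine stores the boundaries as a bit-vector with rank/select support; a plain sorted array of line-break positions stands in here, and all function names are illustrative:

```javascript
// Record the position of every line break in the combined text.
function lineBoundaries(text) {
  const breaks = [];
  for (let i = 0; i < text.length; i++) {
    if (text[i] === '\n') breaks.push(i);
  }
  return breaks;
}

// Count how many boundaries precede `pos` (a rank query): that count
// is the index of the line containing the match.
function lineOf(breaks, pos) {
  let line = 0;
  while (line < breaks.length && breaks[line] < pos) line++;
  return line;
}

// Combine match positions so each matched line appears only once.
function matchedLines(text, positions) {
  const breaks = lineBoundaries(text);
  return [...new Set(positions.map(p => lineOf(breaks, p)))];
}
```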
Store and read an index file
The engine generates a binary index file. It contains a wavelet matrix, some pre-cached search results, bit-vectors, and name lists for the metadata. There is already an option to drop the pre-cached search results to reduce the file size. In the future, I want to add the following options:
- Compress the index using zlib.
- Export the index as source code (.jsx or .js).
Receive and parse a user query
My search engine now supports Google-style search query syntax, except for number-range queries:
- All words: word1 word2
- Exact words or phrase: "word1 word2"
- Any of these words: word1 OR word2
- None of these words: -word1
For example, the following query returns lines that contain “40” and don’t contain “not”:
$ ./httpstatus 40 -not
400: Bad Request
401: Unauthorized
402: Payment Required
403: Forbidden
407: Proxy Authentication Required
408: Request Timeout
409: Conflict
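A parser for this query syntax can be sketched as follows. This is my own minimal take on the four rules above, not the real Oktavia parser, and all names are illustrative:

```javascript
// Split a Google-style query into plain words, quoted phrases,
// OR groups, and negated words.
function parseQuery(query) {
  // A token is either a quoted phrase or a run of non-spaces.
  const tokens = query.match(/"[^"]*"|\S+/g) || [];
  const result = { all: [], phrases: [], any: [], not: [] };
  for (let i = 0; i < tokens.length; i++) {
    const token = tokens[i];
    if (token.startsWith('"')) {
      result.phrases.push(token.slice(1, -1)); // exact phrase
    } else if (token.startsWith('-')) {
      result.not.push(token.slice(1)); // negated word
    } else if (tokens[i + 1] === 'OR') {
      // Collect the whole chain: word1 OR word2 OR word3 ...
      const group = [token];
      while (tokens[i + 1] === 'OR' && tokens[i + 2]) {
        group.push(tokens[i + 2]);
        i += 2;
      }
      result.any.push(group);
    } else if (token !== 'OR') {
      result.all.push(token); // plain required word
    }
  }
  return result;
}
```

For example, the query shown above, `40 -not`, parses to one required word (`40`) and one negated word (`not`).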
Search words and summarise each word’s result
The FM-index algorithm returns the positions of matched words. First, my search engine groups each word’s positions by a “search-unit”. If the search-unit is a file, it creates a list of matched files; if it is a line, it creates a list of matched lines.
After creating a list for each query word, the search engine applies the query’s set operations (intersection, union, complement) to produce the final result. It then calculates a score for each search-unit and sorts by score.
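The set operations over per-word result lists can be sketched like this, where each list holds the ids of matched search-units (files or lines); the helper names are mine, not the engine’s:

```javascript
// Units matched by BOTH words (query: word1 word2).
function intersection(a, b) {
  const set = new Set(b);
  return a.filter(x => set.has(x));
}

// Units matched by EITHER word (query: word1 OR word2).
function union(a, b) {
  return [...new Set([...a, ...b])];
}

// Units in `a` that the negated word did NOT match (query: word1 -word2).
function complement(a, b) {
  const set = new Set(b);
  return a.filter(x => !set.has(x));
}
```

For the earlier query `40 -not`, the final result would be something like `complement(unitsMatching40, unitsMatchingNot)`.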
Almost all of the basic features are now implemented. To make this search engine usable on websites, I still need to implement:
- User interface (receiving a user query and showing a result)
- Source code style output (base64 encoding, decoding)
- Index file loading
- Some small parts (storing stemmed words, parsing CSV and text files), etc.
It is very close to a first release.