Full Text Search Engine on JS
Goal of my project
My goal is to create an alternative search engine for Sphinx's HTML output.
Sphinx is a documentation generator written in Python. It reads simply formatted text (reStructuredText) and converts it into several formats such as HTML and PDF. It is a very popular tool for writing programming language reference manuals, software manuals, library and API documentation, and so on.
I liked this concept, but originally it supported only English. I sent a patch to support other languages (Sphinx provides an API for language support) along with a Japanese plug-in. I heard they were merged into the latest stable version, 1.1.
When I visited Japan on a business trip (I now live in the Bay Area), I bought the book "World of high-speed string analysis". This book describes an amazing algorithm called "FM-index", which is well suited to a client-side search engine. I think it can improve Sphinx's search engine even further.
This engine uses the FM-index algorithm, which was invented by Paolo Ferragina and Giovanni Manzini. Compared to an "inverted index", FM-index has the following advantages:
It doesn’t need word splitting before a search
Some East Asian languages do not put spaces between words. Splitting text into words therefore requires language-specific information (called a "dictionary"). For example, the well-known Japanese word splitter MeCab needs a large dictionary file (> 10 MB). I don't know Chinese or Korean well, so supporting many languages with an inverted index is difficult.
It can recreate the original document from the index file
An inverted index only stores word positions within documents, so to show document excerpts on the search result page it needs the original documents in addition to the index file. FM-index needs only the index file both to search for words and to show excerpts. Sphinx uses the original source files for this purpose, but that works poorly with the translation feature, because the source files are written in a different language, and it also requires the user to enable the source-exporting option.
It is the fastest full-text search algorithm that uses a compressed index file
The book "World of high-speed string analysis" makes this claim. I can't verify it independently, since I have no experience implementing a compressed index for inverted-index algorithms, but smaller and faster is exactly what my goal needs.
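To make the FM-index idea above concrete, here is a toy backward search in TypeScript. It builds the Burrows-Wheeler transform naively (by sorting all rotations) and counts pattern occurrences using only the transformed string, without ever consulting the original text. All names are illustrative; a real index replaces the linear `occ` scan with a rank structure such as a wavelet tree.

```typescript
// Build the BWT of a text by sorting all rotations of text + sentinel.
// "\0" sorts before every printable character in JS string comparison.
function bwt(text: string): string {
  const s = text + "\0";
  const rotations: string[] = [];
  for (let i = 0; i < s.length; i++) {
    rotations.push(s.slice(i) + s.slice(0, i));
  }
  rotations.sort();
  return rotations.map(r => r[r.length - 1]).join("");
}

// C[ch] = number of characters in the text strictly smaller than ch.
function buildC(last: string): Map<string, number> {
  const counts = new Map<string, number>();
  for (const ch of last) counts.set(ch, (counts.get(ch) ?? 0) + 1);
  const c = new Map<string, number>();
  let total = 0;
  for (const ch of [...counts.keys()].sort()) {
    c.set(ch, total);
    total += counts.get(ch)!;
  }
  return c;
}

// Occ(ch, i): occurrences of ch in last[0..i). Linear scan for clarity.
function occ(last: string, ch: string, i: number): number {
  let n = 0;
  for (let k = 0; k < i; k++) if (last[k] === ch) n++;
  return n;
}

// Backward search: scan the pattern right-to-left, narrowing the range
// [lo, hi) of sorted rotations that begin with the pattern.
function countOccurrences(
  last: string,
  c: Map<string, number>,
  pattern: string
): number {
  let lo = 0;
  let hi = last.length;
  for (let i = pattern.length - 1; i >= 0; i--) {
    const ch = pattern[i];
    if (!c.has(ch)) return 0;
    lo = c.get(ch)! + occ(last, ch, lo);
    hi = c.get(ch)! + occ(last, ch, hi);
    if (lo >= hi) return 0;
  }
  return hi - lo;
}

const last = bwt("abracadabra");
const c = buildC(last);
console.log(countOccurrences(last, c, "abra")); // 2
console.log(countOccurrences(last, c, "cad"));  // 1
```

Note that the search never touches `"abracadabra"` itself after indexing; this is the property that lets the real engine ship only an index file and still show excerpts.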
The search algorithm has already been ported and passes its unit tests, but a useful search engine needs several other parts.
First, I ported an HTML parsing library. It analyzes HTML and extracts the text from existing HTML files; the index generator tool uses it.
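The index generator's text-extraction step can be sketched like this. This is not the ported library, just a minimal illustration of the idea: a real HTML parser handles nesting, attributes, and the full entity set properly, where this regex version only covers the common cases.

```typescript
// Minimal sketch: pull indexable plain text out of an HTML string.
// Illustrative only; a real parser is needed for correct results.
function extractText(html: string): string {
  return html
    .replace(/<script[\s\S]*?<\/script>/gi, " ") // drop script bodies
    .replace(/<style[\s\S]*?<\/style>/gi, " ")   // drop style bodies
    .replace(/<[^>]+>/g, " ")                    // strip remaining tags
    .replace(/&amp;/g, "&")                      // a few common entities
    .replace(/&lt;/g, "<")
    .replace(/&gt;/g, ">")
    .replace(/\s+/g, " ")                        // collapse whitespace
    .trim();
}

console.log(extractText("<p>Hello <b>FM-index</b> world</p>"));
// "Hello FM-index world"
```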
Now I am porting stemming algorithms. Stemming reduces words to their base/root form (for example, a query for "search" can then also match "searching" and "searched"), which makes the search engine more flexible.
I am modifying Snowball's source code generator. Snowball is a tool with a long history in this area: it provides stemming algorithms for more than 15 natural languages and generates C/C++/Java source code for each of them. I have added a JSX generator and am now adding a Python one (Python is needed to support Sphinx).
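For readers unfamiliar with stemming, here is a toy suffix-stripping stemmer in the spirit of what a generated Snowball stemmer does. The real Porter/Snowball rules are far more careful (they track a "measure" of the stem and handle many exceptions); this sketch and its suffix list are purely illustrative.

```typescript
// Toy English stemmer: strip the first matching suffix, but only if a
// stem of at least 3 characters remains. Illustrative, not Snowball.
const SUFFIXES = ["ement", "ing", "ed", "es", "s"];

function stem(word: string): string {
  for (const suf of SUFFIXES) {
    if (word.endsWith(suf) && word.length - suf.length >= 3) {
      return word.slice(0, word.length - suf.length);
    }
  }
  return word;
}

console.log(stem("searching")); // "search"
console.log(stem("indexes"));   // "index"
console.log(stem("text"));      // "text" (no suffix stripped)
```

Snowball's value is that rule sets like this, but linguistically sound, are written once in its small DSL and then compiled to each target language, which is why adding a JSX (and Python) back end to its generator covers all supported languages at once.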
I will publish the search engine, together with an index file for the JSX website, later this month.