CoolTools - SphinxSearch

In the previous posts, we've set up a virtual machine, installed Ubuntu, got the prerequisite server software ready and got a basic install of MediaWiki running and reachable as your intranet wiki.
But we're not quite ready to go live. In this and the next post, we'll be installing Sphinx as your back-end search engine and doing some additional configuration work.

People will be continuously searching through your wiki for that exact bit of information they need, so it follows you'll be needing a good search functionality. The basic search included in your wiki installation is not a good search functionality.
SphinxSearch is a (stand-alone) lightning-fast indexing and searching engine. We'll install it, then give it access to the MediaWiki MySQL database. Finally, we'll install the SphinxSearch extension for MediaWiki, which passes search queries from wiki users to Sphinx and the results from Sphinx back to your wiki.

I realize this post is a bit extensive, but rest assured, it's worth it.




SphinxSearch installation

Using Firefox, download your preferred version from http://sphinxsearch.com/downloads/. For myself, I've chosen the most recent beta. Pick the Ubuntu “.deb” package (12.04 LTS at time of writing; i386 for a 32-bit Ubuntu install, x86_64 for a 64-bit one, whether your processor was made by Intel or AMD).
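Not sure which of the two packages you need? The machine type reported by uname -m settles it. A minimal sketch; the function name sphinx_pkg_arch is made up for this example, and it only knows the two variants mentioned above.

```shell
# Map a machine type (as printed by `uname -m`) to the package variant to download.
# sphinx_pkg_arch is a hypothetical helper name, not part of Sphinx or Ubuntu.
sphinx_pkg_arch() {
    case "${1:-$(uname -m)}" in
        x86_64|amd64)        echo "x86_64" ;;   # 64-bit Ubuntu
        i386|i486|i586|i686) echo "i386" ;;     # 32-bit Ubuntu
        *)                   echo "unknown" ;;
    esac
}

# Without an argument, it reports on the machine you run it on.
sphinx_pkg_arch
```

Run it inside the virtual machine, not on the host: what counts is the Ubuntu install that Sphinx will live on.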
After clicking it, choose not to just download it but to open it with the Ubuntu Software Center. Once the file is downloaded, the Software Center will analyze it and offer to install it. Click the Install button and let Ubuntu do the work.
You might wonder if there isn't a sudo apt-get install something instead; after all, it’s been going great that way. Well, there is. But we're not using it this time because the version in the apt repositories is terribly outdated and rarely updated.

Alright, so the install is done. Now, we'll need to provide a configuration file for Sphinx to work nicely with MediaWiki. A preconfigured one (for MediaWiki-use) is provided by the SphinxSearch extension (we'll get to that in the next section). There is also a configuration file provided by Sphinx itself - this is a general one which you can use as a starting point when you’re configuring Sphinx for other purposes. This configuration file is automatically loaded and will mess things up for us as we will have a preconfigured one in a different location. So first thing you need to do is delete this configuration file.

Open the terminal and type:
gksu nautilus
which will start the GUI file manager with super-user permission. Use it to navigate to File system, etc, sphinxsearch. There you will find a file sphinx.conf. You need to delete this file; we'll be setting up a different configuration file in a minute.
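If you'd rather stay in the terminal, the same step can be sketched as below. Instead of deleting, this variant renames the file so you keep a copy; that's my own variation, and it assumes Sphinx only picks up a file named exactly sphinx.conf. The function name and the SUDO variable are just for this example.

```shell
# Move the packaged default config out of the way, keeping a backup copy.
# retire_default_conf and SUDO are illustrative names for this sketch.
SUDO="${SUDO-sudo}"

retire_default_conf() {
    conf="${1:-/etc/sphinxsearch/sphinx.conf}"
    if [ -f "$conf" ]; then
        $SUDO mv -- "$conf" "$conf.orig"
        echo "moved $conf to $conf.orig"
    else
        echo "nothing to do: $conf not found"
    fi
}

# On the server:
# retire_default_conf
```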


Configuration - MediaWiki extension

Using Firefox inside your virtual machine, download the extension from:
http://www.mediawiki.org/wiki/Special:ExtensionDistributor/SphinxSearch

When the download is complete, open the terminal and enter the following commands:
cd Downloads
tar xvzf wikimedia-mediawiki-extensions-SphinxSearch*.tar.gz 
Move the extracted folder into your wiki's extensions folder. Check the output given during extraction for the exact name of the folder for your version (the part after “SphinxSearch-” may be different):
sudo mv wikimedia-mediawiki-extensions-SphinxSearch-f6f56dd /etc/w/extensions/SphinxSearch
(note that this is one long line)
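If you missed the folder name in the extraction output, you can also ask the archive itself. A small sketch, assuming the tarball unpacks into a single top-level folder; tarball_top_dir is a name I made up.

```shell
# Print the top-level folder of a .tar.gz archive.
# tarball_top_dir is a hypothetical helper name.
tarball_top_dir() {
    # List the archive, take the first entry and drop everything from the first slash on.
    tar -tzf "$1" | sed -n '1s#/.*##p'
}

# Example (run inside ~/Downloads):
# tarball_top_dir wikimedia-mediawiki-extensions-SphinxSearch*.tar.gz
```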


Now to move the sphinx.conf file and prepare a folder structure for Sphinx. At the terminal which you still have open, type the following series of commands:
cd /etc/w/extensions/SphinxSearch
sudo mkdir -p /usr/local/var/data/sphinx/wiki_main
sudo mkdir -p /usr/local/var/data/sphinx/wiki_incremental
sudo mkdir -p /usr/local/var/log/sphinx
sudo mv sphinx.conf /usr/local/var/data/sphinx
cd /usr/local/var/data/sphinx
Update the default configuration using nano:
nano sphinx.conf
Enter your database name (mediawiki), username (mediawiki) and password (you've set this previously). 
You also need to change all the paths to match the folder structure we've created above. This means you need to
  • replace all occurrences of the /var/data/... paths with /usr/local/var/data/...
  • replace all occurrences of /var/log/... with /usr/local/var/log/...
Don't forget the paths always start with a forward slash (/). Forget the slash, and it won't work. When you're done, press CTRL+o, enter, and close with CTRL+x.
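If you don't trust yourself to catch every path by hand, sed can do the two replacements. Treat this as a sketch: it assumes the paths in sphinx.conf appear as "setting = /var/data/..." or "setting = /var/log/...", so have a look at your file first. The function name and the SUDO variable are made up for this example, and the .bak copy is there in case the result isn't what you expected.

```shell
# Prefix /var/data/... and /var/log/... values with /usr/local, keeping a .bak copy.
# relocate_sphinx_paths and SUDO are illustrative names for this sketch.
SUDO="${SUDO-sudo}"

relocate_sphinx_paths() {
    # The "=" in front of the path keeps already-converted lines from matching again.
    $SUDO sed -i.bak \
        -e 's#=\([[:space:]]*\)/var/data/#=\1/usr/local/var/data/#' \
        -e 's#=\([[:space:]]*\)/var/log/#=\1/usr/local/var/log/#' \
        "$1"
}

# On the server:
# relocate_sphinx_paths /usr/local/var/data/sphinx/sphinx.conf
# Then list every remaining path and check that each starts with /usr/local:
# grep -n '/var/' /usr/local/var/data/sphinx/sphinx.conf
```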


Indexer & Folder Access

The indexer is a process which will read the articles in your wiki (by means of the queries in the sphinx.conf-file), and create an index on this textual data. The index will allow very quick searching.
A little in-between theory on indexers: indexing all your articles (when you have tens of thousands) will take some time. On the other hand, you want new articles added to your index very quickly, because they can't be found as long as they haven't been indexed. To address this dual requirement, two indexes are used:

  • The main index, which will index all your articles but will only run once a day.
  • The incremental index, which will only index the articles added/updated since the last run of the main index and so doesn't take much time to complete. We'll configure it to run every few minutes.

Enough theory. Let's run the indexer to check if it’s working (and at the same time create the initial index):
cd ../../../../bin/
sudo indexer --config /usr/local/var/data/sphinx/sphinx.conf --all
If it doesn't give you any errors, then you're good. Otherwise, read the error and try to fix whatever went wrong.


We've told Sphinx to read and write in /usr/local/var/data/sphinx and /usr/local/var/log/sphinx, but we created those folders with sudo, so Sphinx can only get into them when we run it with sudo. When Sphinx is being queried by MediaWiki, it won’t have access. So we need to change this. At the terminal, type:
gksu nautilus
This will start up the GUI file manager in superuser mode again. Use it to navigate to the above mentioned folders: first click File System in the left window of the file manager, then you'll find the /usr folder in the right window. Navigate further through local, var to find the data and log folders. 
Now right-click the sphinx folder (do this both in the data and log folder), choose Properties and go to the permissions tab. In the group list, pick the group that has the same name as your username (as set when installing Ubuntu). Then set the access to let this user group create and delete files, and read and write in existing files. Don't forget to apply the permissions to enclosed files.
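The same permission change can be made from the terminal. A sketch, assuming you want your own primary group (the one named after your username on a default Ubuntu install) to get the access described above; the function name and the SUDO variable are made up.

```shell
# Hand a folder tree to the current user's primary group and let that group
# read, write, create and delete in it.
# grant_group_access and SUDO are illustrative names for this sketch.
SUDO="${SUDO-sudo}"

grant_group_access() {
    $SUDO chgrp -R "$(id -gn)" "$1"
    # Capital X adds "enter" permission on folders, and execute permission on
    # files only if they were already executable for someone.
    $SUDO chmod -R g+rwX "$1"
}

# On the server:
# grant_group_access /usr/local/var/data/sphinx
# grant_group_access /usr/local/var/log/sphinx
```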


SearchDaemon & Cron

The SearchDaemon is a process that listens continuously on a certain port and accepts queries from other processes (in our case: the MediaWiki-SphinxSearch-extension). It then performs the search on the indexes prepared by the indexing process (which we executed once in the previous section).
To make the search daemon start up automatically when your server starts up, you need to add it to the startup script (rc.local):
cd ../../etc
sudo nano rc.local
Before the line with “exit 0”, add:

/usr/bin/searchd --config /usr/local/var/data/sphinx/sphinx.conf >> /usr/local/var/log/sphinx/sphinx-startup.log 2>&1
(one long line, space before and after the >>)

End with CTRL+o, enter, CTRL+x.

Now, to set up an automated task for the indexers:
crontab -e
This will open your crontab in nano. Add the following below the comments:
0 7 * * * /usr/bin/indexer --quiet --config /usr/local/var/data/sphinx/sphinx.conf wiki_main --rotate >/dev/null 2>&1; /usr/bin/indexer --quiet --config /usr/local/var/data/sphinx/sphinx.conf wiki_incremental --rotate >/dev/null 2>&1
(one long line)

That will let the main and incremental index run at 7 AM every day. To let the incremental indexer run with a higher frequency during the day, add another line below the previous one:
*/5 * * * * /usr/bin/indexer --quiet --config /usr/local/var/data/sphinx/sphinx.conf wiki_incremental --rotate >/dev/null 2>&1
(again, one long line)

This will make the incremental indexer work every 5 minutes. You may want to lower the frequency if you’re having performance problems (e.g. */10 = every 10 minutes, */30 = every 30 minutes).

Again CTRL+o, enter, CTRL+x.
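If you want to script this step rather than edit the crontab by hand, a filter like the one below appends a line only when it isn't there yet, so running it twice doesn't leave duplicates. It's a sketch; ensure_cron_line is a made-up helper.

```shell
# Read a crontab on stdin, print it on stdout, and append the given line
# unless an identical line is already present.
# ensure_cron_line is a hypothetical helper name.
ensure_cron_line() {
    LINE="$1" awk '
        { print }
        $0 == ENVIRON["LINE"] { found = 1 }
        END { if (!found) print ENVIRON["LINE"] }
    '
}

# Example, using the every-5-minutes line from above:
# JOB='*/5 * * * * /usr/bin/indexer --quiet --config /usr/local/var/data/sphinx/sphinx.conf wiki_incremental --rotate >/dev/null 2>&1'
# crontab -l 2>/dev/null | ensure_cron_line "$JOB" | crontab -
```

Afterwards, crontab -l shows what is actually installed.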


MediaWiki extension

We need to make the Sphinx PHP API available to the extension. We do this by copying sphinxapi.php from /usr/share/sphinxsearch/api to /etc/w/extensions/SphinxSearch (mind the capitals; this is the folder we created earlier). Do this using the file manager.

Then we need to tell our wiki to use Sphinx for searching. Using the file manager, go to /etc/w and open the LocalSettings.php file. At the bottom of the file, add
$wgSearchType = 'SphinxMWSearch';
require_once( "$IP/extensions/SphinxSearch/SphinxSearch.php" );
$wgEnableMWSuggest = true;
$wgEnableSphinxPrefixSearch = true;
Save and close.


Done!

Reboot the server to let the search-daemon start up (as configured in rc.local). After reboot, Sphinx will be serving your wiki-users their search results.
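After the reboot you can check from the terminal whether the daemon really came up. A sketch that assumes your sphinx.conf has searchd write its pid file to /usr/local/var/log/sphinx/searchd.pid (look for the pid_file setting in your own file); the function name is made up, and the /proc lookup is Linux-specific.

```shell
# Succeed if the pid file exists and the process it names is still alive.
# searchd_is_running is a hypothetical helper; the default path is an assumption.
searchd_is_running() {
    pidfile="${1:-/usr/local/var/log/sphinx/searchd.pid}"
    [ -r "$pidfile" ] || return 1
    pid="$(cat "$pidfile")"
    # On Linux, every live process has a folder under /proc named after its pid.
    [ -n "$pid" ] && [ -d "/proc/$pid" ]
}

if searchd_is_running; then
    echo "searchd is running"
else
    echo "searchd is not running; check /usr/local/var/log/sphinx/sphinx-startup.log"
fi
```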

In the following article, we'll be installing and configuring some other nice extras. We're not done pimping your workplace yet! But what you have so far is good to go, so at this point you can start adding information to your wiki and letting your colleagues/employees know about it.

2 comments:

  1. Question about the Sphinxsearch installation package.
    Does it matter which processor you are using if you install it in a virtual machine?

    Pick the Ubuntu “.deb” package (12.04 LTS at time of writing, i386 for Intel, x86_64 for AMD processor).

    1. Yes. It depends on which processor the environment (your virtual machine "player") is offering your virtual machine. This will probably be the same as the one you're using on your actual underlying machine.
