I’ve been looking around on the net for a little while for snippets of code to help with a “Google Suggest” style Ajax search suggestion tool I am developing but there seems to be a distinct lack of topics which actually discuss the data source you need for creating the list of suggestions. I’m fine with the JavaScript/Ajax bit and I am quite happy building the inteface to get it to work – but – I want a relevant list of suggestions based directly on a search of my site content – I couldn’t seem to find any sensible answers anywhere (feel free to point me in the direction of some I may have missed btw!).
So, I have had to come up with my own solution. The technology used is XHTML/CSS->AJAX/JS->PHP->MySQL – but I’m just going to focus on the MySQL bit in this post.
OK – to cover the basics… if you don’t know what I am talking about with regards to a suggestion tool then have alook around here: http://www.google.co.uk/search?q=ajax+suggest
The main issue is that I have a site with about 150ish pages on and I want the list of suggestions to actually be terms which are found in the site. Therefore I need to get my suggestion tool to search the database for the query being entered. It’s not a difficult concept.
Firstly I want to get a list of words which I know are actually used throughout the site – and ideally I’d like to know how often they are used and what density they have to the site/page so I can order the results by relevancy. The only place I could think of where this information is readily available is within the FULL-TEXT index for the table/s I want to search. For those of us who are perhaps not quite as geeky as we’d like let me explain what this is – in MySQL you can set indexes for various columns in a particular table – the most obvious of these is the PRIMARY key, but others include UNIQUE columns and FULL-TEXT (FT) indexes. The FT index will record all the words used in your selected table columns (over a particular size – default 4+) and then lists them with how many times the word is used and what weighting it has in the index. The FT index also ignores most commonly used words (stopwords) so you get a nice usable list. As I already have my columns set up with FT indexing I just want to get this index data into another table in my database. In comes myisam_ftdump! – I got the info I needed from – http://www.databasejournal.com/features/mysql/article.php/3512461/MySQLs-Full-Text-Formulas.htm
Myisam_ftdump allows you to dump all this information into a text file which can then be parsed back into a mySQL DB table. It took me a few minutes to work out how to get it to do what I wanted but basically you do the following:
- Work out where your MySQL data is stored – e.g. something like /usr/local/mysql/data/yourDB/var or C:/MySQL/data/yourDB/
- Make sure you have a file in there named after the table you have indexed suffixed with .MYI – e.g. mytable.myi
- Work out what the ID of the index reference is – to find this look at your table structure in something like phpmyadmin and see how many indexes you have – I have PRIMARY first then a UNIQUE pageURL then my FULL-TEXT – this means my PRIMARY is ID 0 (zero), my UNIQUE id ID 1 and my FULL-TEXT is 2.
- Work out where you want to save the dump file
- Open terminal/shell/command prompt
- Use the following syntax:
myisam_ftdump \path\to\your\data\table(do not add .myi) ID-of-Your-FT-Index -c > location\of\text\output.txt
The code I used was:
myisam_ftdump \\DEV\c$\MySQL\data\mydatabase\content 2 -c > \\DEV\htdocs\mysite\ftcalc.txt
To explain: the -c switch is used to output the calculated values for each word (count and relevance) and the > sign is used to write each line output to the text file.
OK – have a look at the output file and you should see a list of words with counts and a relevance score. We now need to get this data into a new table in MySQL You can do this however you like but I wrote this little script just to do it for me – I am well aware of it not being optimised/efficient and that you could make it smaller etc etc but I don’t really care – it’s a one off (or very infrequent) run – script as a .txt file here: http://mj7.co.uk/jmm it uses an INSERT or UPDATE statement so you can re-run it over the top of your existing data to updated previously entered values and add new ones. You can (if you want) manually go through it and add/remove words to hone your suggestions.
So now I have a table with all my individual words in which I can use for my suggestions. I sorted the rest of my code out and this worked a treat, however, I was not quite happy. Although this gave great results it only returns single words and I wanted some common phrases to appear as well. So back to MySQL!
Basically I used the query below to search through my table using regular expression matching and I asked it to return 50 characters to the right of the search phrase so I could capture the context of the word queried.
SELECT TRIM(SUBSTRING(Concat(columnOne,' ',columnTwo)
FROM (LOCATE('YOUR-SEARCH-QUERY',LOWER(Concat(columnOne,' ',columnTwo))))
FOR (LENGTH('YOUR-SEARCH-QUERY')+50))) AS idx_word
FROM content
WHERE Concat(columnOne,' ',columnTwo) REGEXP 'YOUR-SEARCH-QUERY'
As you can see I used the CONCAT statement to merge multiple columns together – this is totally optional – I only used it because I had two columns which held data I wanted to search.
Now – you need to be aware that this will probably return half words where the 50 characters limit cuts through words. To get around this I just trimmed off everything from the last space character using PHP. The other issue I hit was that my data returned HTML so I had to run various filters to return just the text I wanted – but in the end it worked well.
Finally I wanted to return both the single word data and the phrases together from one query but wanted to give priority to the single words – so I just UNION’ed the two SQL statements and mannually added priority values to each – like this (written in PHP):
$q = 'FIND THIS';
$sql = "(SELECT idx_word, 1 as priority FROM content_index
WHERE idx_word REGEXP '".$q."'
ORDER BY idx_count DESC
) UNION (
SELECT TRIM(SUBSTRING(Concat(columnOne,' ',columnTwo)
FROM (LOCATE('".$q."',LOWER(Concat(columnOne,' ',columnTwo))))
FOR (LENGTH('".$bq."')+50))) AS idx_word, 5 as priority
FROM contentTable
WHERE Concat(pageName,' ',pageBody) REGEXP '".$q."'
)
ORDER BY priority
LIMIT 50"; // reduce query to save a little time
And that’s about it! – obviously there’s a little tweaking to be done behind the scenes but as a quick and easy suggestion list it worked well for me.
I’d love to hear your thoughts on this, whether you have used it, if you think it is rubbish etc – or especially if you have a better solution. Comments are most welcome.
P.s : this code was written for one of my clients – see it working at http://www.seriousdigitalideas.com