Speedy Search

Of all the things I could’ve been doing over the holidays, I’ve spent the time rewriting the search engine for Blogwise from scratch.

Blogwise search is a bit of an embarrassment at the moment – search results take forever to appear (that’s 9 seconds or more) and it consistently holds the spot of the slowest search on Grabperf.

The reason for this is three-fold: lack of scalability, lack of decent hardware and lack of time. Over the past few months, page requests to the search have at least trebled. In the same time, the database itself has doubled in size. The search, which is currently live on Version 2, is a bit of a kludge. It has its own database system, thus removing the demand on the main server (a huge problem with Version 1), but it completely lacks any kind of scalability. When you run a search, you’re effectively tying up an entire computer for the few seconds that it’s dealing with your search.

Although I originally had three servers load-balancing the search results, it wasn’t distributing the load very well, so a search was taking 9 seconds on one server while the other two could have been idle.

Version 3 was a first stab at resolving this, by breaking up the database into three chunks (assuming three servers) and having each one deal with a third of the database. With a blog database of 60,000 blogs this meant each one served results for 20,000 blogs – theoretically a better break-up of the load.
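As a rough illustration of that split, here is a minimal sketch in Python. It assumes blogs have integer IDs and assigns each one to a server by taking the ID modulo the number of servers; the function names and the modulo rule are hypothetical, not how Version 3 actually divided the data.

```python
def shard_for(blog_id: int, num_servers: int = 3) -> int:
    """Return the index of the server responsible for this blog (assumed rule: ID modulo server count)."""
    return blog_id % num_servers


def partition(blog_ids, num_servers: int = 3):
    """Group blog IDs into one list per server."""
    shards = [[] for _ in range(num_servers)]
    for blog_id in blog_ids:
        shards[shard_for(blog_id, num_servers)].append(blog_id)
    return shards


# 60,000 blogs over three servers gives 20,000 blogs each.
shards = partition(range(60_000))
print([len(shard) for shard in shards])
```

A fixed rule like this keeps the split even, but it takes no account of how powerful or busy each server is.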

I had to drop the rewrite suddenly due to the usual lack of time, and never really got back to it. However, with the glory of 9 straight days of home-time I’ve been able to get back in front of the computer and rewrite the entire search system as Version 4.

Results are looking promising. Because of the way I’ve redesigned the database structure and the algorithms, the search is already giving results in under 1.2 seconds on a good day – that may not sound like much, but this is before I put the new load balancing in place. The breakdown of the search time is the key bit – gathering search results takes almost all of the time; the final arrangement and rendering is a minuscule 0.1 second at the most.

A good load-balancing system should see that time drop every time I increase the number of servers – with the three servers back in action on Version 4 code, search results should come in at around 0.4 seconds. That’ll move me from the twenty slowest sites on Grabperf to just below the twenty fastest – neat!
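The arithmetic behind that estimate can be sketched as follows. This is a back-of-envelope model, not a measurement: it assumes the gathering step splits perfectly across servers, that rendering stays fixed at the 0.1 second upper bound quoted above, and it ignores network overhead. The function name is made up for the example.

```python
def estimated_search_time(gather_seconds: float, render_seconds: float, servers: int) -> float:
    """Estimate total search time when only the gathering step is spread across servers."""
    return gather_seconds / servers + render_seconds


# 1.2 s in total on one server: roughly 1.1 s gathering plus 0.1 s rendering.
for servers in (1, 2, 3, 6):
    print(servers, round(estimated_search_time(1.1, 0.1, servers), 2))
```

Dividing the whole 1.2 seconds by three gives the 0.4 second figure; keeping the rendering step fixed puts three servers slightly higher, at just under half a second.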

The load balancing system is already mapped out on paper. Every few hours the index will be refreshed. This is then divided up according to servers’ various demands and resource availability. The new data is shipped to each search server and the aggregator is then updated with a new map of indexes. Give or take TCP and mapping overheads, this should crudely mean that more servers = faster speed. I like that kind of scalability!
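The plan above can be sketched in a few lines of Python. Everything here is an assumption for illustration: the server names, the integer "capacity" weights, the in-memory dictionary standing in for the index, and the use of threads in place of real network calls. It shows the two moving parts, dividing the index in proportion to each server's capacity and an aggregator that fans a query out and merges the results.

```python
from concurrent.futures import ThreadPoolExecutor


def divide_index(doc_ids, capacities):
    """Split doc_ids into contiguous slices sized in proportion to each server's capacity."""
    total = sum(capacities.values())
    servers = list(capacities)
    index_map, start = {}, 0
    for position, server in enumerate(servers):
        if position == len(servers) - 1:
            end = len(doc_ids)  # the last server takes whatever remains
        else:
            end = start + len(doc_ids) * capacities[server] // total
        index_map[server] = doc_ids[start:end]
        start = end
    return index_map


def search_shard(docs, shard_ids, term):
    """Stand-in for one search server: scan only the documents it holds."""
    return [doc_id for doc_id in shard_ids if term in docs[doc_id]]


def aggregate(docs, index_map, term):
    """Query every shard in parallel and merge the hits into one sorted list."""
    with ThreadPoolExecutor() as pool:
        parts = list(pool.map(lambda ids: search_shard(docs, ids, term), index_map.values()))
    return sorted(hit for part in parts for hit in part)


docs = {1: "python search", 2: "cooking blog", 3: "search engines", 4: "travel"}
index_map = divide_index(list(docs), {"a": 2, "b": 1, "c": 1})
print(index_map)
print(aggregate(docs, index_map, "search"))
```

Rebuilding `index_map` whenever the index is refreshed corresponds to handing the aggregator a new map of indexes.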

The search rewrite also coincides with a huge increase in the amount of data being searched – one thing I failed to mention is that the 1.2 seconds covers both the previous keyword index and a new index of full-text RSS feeds, i.e. the search will be indexing content as well as metadata (finally).

As I get this thing rolled out, I’ll write up more here. In the meantime, hope you’re having a good time!

 

Full content feeds

Owing to popular demand (one person), the RSS feed is now full-text. Formatting is still a bit iffy, but for the most part it works OK.

I’m still undecided about the whole full-text versus partial feeds thing. On a non-profit site like this I don’t really mind how you read my blog, but for an advert-supported blog/service it’s a whole load of money they’re missing out on if people don’t visit the site (and/or deny ads in RSS feeds).

Still, it’s there now. Let me know how it works out.

 

Google Books and Ye Olde Engliſh

I was poking around Google Books (formerly Google Print) this morning and discovered an interesting "oddity".

In older English texts, the lower-case letter ‘s’, when appearing at the start of or within a word, was written as a sort-of ‘f’ character – more specifically, ſ (the long s).

Turns out that Google Books can’t cope with this – it reads these as the letter ‘f’. When searching old texts, be sure to allow for this: for example, searching for impoffible will work and highlight the correct words, but impossible won’t.
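A small helper can generate the misread spelling automatically. This is a minimal sketch that applies only the simplified rule described above (every lower-case ‘s’ except a word-final one becomes ‘f’); real long-s typography has more exceptions, and the function names are made up for the example.

```python
import re


def long_s_ocr_variant(text: str) -> str:
    """Replace each lower-case 's' that is followed by another word character with 'f'."""
    return re.sub(r"s(?=\w)", "f", text)


def query_variants(text: str) -> list:
    """Return the original query plus its misread form, if the two differ."""
    variant = long_s_ocr_variant(text)
    return [text] if variant == text else [text, variant]


print(query_variants("impossible"))
print(query_variants("has"))
```

Searching for both variants covers texts that were scanned correctly as well as those where the long s was read as an ‘f’.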

It’s hardly an earth-shattering bug, but it’s a useful reminder to Google and other would-be book search services to check that their OCR software can cope with 18th-century texts!

 

Alexa Web Search

OK, now this is interesting.

http://websearch.alexa.com/welcome.html

Alexa (i.e. Amazon) are offering developers access to their search index. Essentially that means that if you wanted to, say, create a video search engine, you could run your scripts against Alexa’s index and do what you need with the data. Actually, I can see this being far more likely to appeal to market researchers – if Company X wants to know everything on the Internet about their competitor, Company Y, or even how others refer to X, this is the perfect place to run it. The difference between this and, say, running a Google search is that it’s your search algorithms, not Google’s. You can get to the raw data here.

I have to say (how geeky is this) that the documentation and API at http://pages.alexa.com/awsp/docs/WebHelp/AWSP_User_Guide.htm are far more interesting. They provide an insight into how a large search engine organises its files and indexes. I suspect there’s a lot to learn from this.

From your friends at Google

Just got an unexpected package through the mail – a gift set from Google, straight from San Jose. The box contains a whole bunch of branded goodies: wireless mouse, USB key, USB hub, carry wallet and light. A very nice surprise – I can only imagine they’re sending these out to select AdSense customers; there’s no note inside and the sender’s address is fairly ambiguous (I had no idea it was even from Google until I opened it).

London Geek Dinner

Right, over a day since I got back from the London Geek Dinner and I figured I ought to write something about it.

The event was interesting, enjoyable and useful. It was held at the Texas Embassy just off Trafalgar Square – a regular meeting place for these geek dinners, apparently, and I can see why. They have cracking food and I’m told the margaritas are excellent (unfortunately, yours truly was driving…)

As a networking event it was useful to get around and meet people. This is one of my first such events, and I’ve yet to get over the whole approaching-strangers thing. Despite that I managed to meet some pretty interesting people.

Definitely worthwhile – it looks like there’ll be a smaller one for Sussex people (I’m Sussex-ish) in Uckfield in January – I’m already on the list there and it should be another useful and interesting time.

I didn’t take a camera, but loads of photos can be had at Flickr including ego-feeding photos here and here from Andrew and Jen respectively – thanks!

Off to London

I am writing this on a train to London on a palm computer with barely enough battery to last for this post. Today I am off to the London Geek Dinner at the Texas Embassy on Trafalgar Square.

This is a bit of a follow-on from last week’s trip to Ireland, where I met a bunch of really nice and interesting Irish bloggers. I am starting to make a point of going to these events; they are a great way to meet people and to make my own name more recognisable. Networking, basically.

This evening I’m looking forward to meeting a whole new group of people. There’ll be some reasonably high-profile people there. Robert Scoble (who I met in Ireland), Dan Gillmor and Hugh MacLeod stand out in the guest list.

In other news, full-text searching in Blogwise is but a gnat’s breath away. I am already successfully parsing RSS feeds and discovering new ones through Pings and Autodiscovery. Search by keyword has been completely rewritten and is now far faster – no longer will Blogwise search be on Grabperf’s ‘slowest site’ list!
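For anyone unfamiliar with feed autodiscovery: a page advertises its feed with a `<link rel="alternate">` tag in its HTML head, and a crawler only has to look for it. The sketch below shows the idea using Python's standard-library HTML parser; it is an illustration of the convention, not the Blogwise code, and the class name, function name and sample page are invented for the example.

```python
from html.parser import HTMLParser

FEED_TYPES = {"application/rss+xml", "application/atom+xml"}


class FeedLinkFinder(HTMLParser):
    """Collect the href of every <link rel="alternate"> tag that points at a feed."""

    def __init__(self):
        super().__init__()
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attributes = {name: (value or "") for name, value in attrs}
        rels = attributes.get("rel", "").lower().split()
        is_feed = attributes.get("type", "").lower() in FEED_TYPES
        if "alternate" in rels and is_feed and attributes.get("href"):
            self.feeds.append(attributes["href"])


def discover_feeds(html: str) -> list:
    """Return the feed URLs advertised in a page's HTML."""
    finder = FeedLinkFinder()
    finder.feed(html)
    return finder.feeds


page = (
    '<html><head><title>A blog</title>'
    '<link rel="alternate" type="application/rss+xml" href="http://example.com/feed.xml">'
    '<link rel="stylesheet" href="style.css">'
    '</head><body></body></html>'
)
print(discover_feeds(page))
```

The returned hrefs may be relative, so a real crawler would resolve them against the page's own URL before fetching.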

All this should coincide with a redesign, which will finally include text ads on every page. This is something I’ve put off for as long as possible but is finally necessary to keep Blogwise running. I want these to be as unobtrusive as possible and the redesign will be structured to display the ads effectively but not offensively.

Russell Beattie has been running ads in an interesting way. He only shows ads to those visitors who’ve followed a direct link (i.e. through search or from others’ blogs). This makes all kinds of sense: his regulars either bookmark his homepage, read his RSS feed or type his address directly. They visit his site with the specific intent of reading his content, and are therefore far less likely to click ads than the casual surfers who came from Google and probably won’t stick around. His results so far seem promising, and I wonder whether a similar technique could work on Blogwise.
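One way such a rule could be expressed is a check on the HTTP Referer header. This is a guess at the general idea rather than Russell's actual implementation: the function name and the example hosts are hypothetical, and it assumes that an empty or same-site referrer means a regular reader.

```python
from urllib.parse import urlparse


def should_show_ads(referrer: str, own_host: str) -> bool:
    """Show ads only to visitors who arrived via a link on another site."""
    if not referrer:
        return False  # typed address, bookmark or feed reader: treat as a regular
    host = urlparse(referrer).hostname or ""
    if not host:
        return False  # unparseable referrer: err on the side of no ads
    is_internal = host == own_host or host.endswith("." + own_host)
    return not is_internal


print(should_show_ads("http://www.google.com/search?q=blogs", "example.com"))
print(should_show_ads("", "example.com"))
print(should_show_ads("http://example.com/archive", "example.com"))
```

Referrers can be missing or stripped by browsers and proxies, so a check like this will always misclassify some visitors.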

In the next few days I should have more on Ireland, on Blogwise and on stuff in general. Stay tuned!