WordPress 2.0

I may try switching to WordPress 2.0 some time later this week – expect wonkiness. I’ve already written most of the import script, so the transition should be fairly painless, but the new design (I’m not bothering with templating right away) will be a bit of a shock.

Update – if you can read this, the upgrade worked and the DNS switch has gone through. Excellent, now I just need to categorise everything.

Google Maps

Yet another little niggling bug with one of Google’s services (I’m not complaining though – they sent me a Happy New Year card 🙂)

Google Misleading directions

In the directions from work to the Sussex Geek Dinner I’m going to tonight, Google recommended I turn left at the Stockbridge Roundabout. I’m sure it meant ‘right’ or ‘third exit’, since left is a tad misleading.

Us darn Brits with our crazy roundabouts though – it’s not hard to see how the software got confused!

American Beauty

Just watched American Beauty for the first time. Suddenly a whole load of Family Guy scenes make sense.

Speedy Search

Of all the things I could’ve been doing over this holiday, I’ve spent it rewriting the search engine for Blogwise from scratch.

Blogwise search is a bit of an embarrassment at the moment – search results take forever to appear (that’s 9 seconds and up), and it consistently holds the spot of the slowest search on Grabperf.

The reason for this is three-fold: lack of scalability, lack of decent hardware and lack of time. Over the past few months, page requests to the search have at least trebled. In the same time, the database itself has doubled in size. The search, which is currently live on Version 2, is a bit of a kludge. It has its own database system, thus removing the demand on the main server (a huge problem with Version 1), but it completely lacks any kind of scalability. When you run a search, you’re effectively tying up an entire computer for the few seconds that it’s dealing with your search.

Although I originally had three servers load-balancing the search results, it wasn’t distributing the load very well, so a search was taking 9 seconds on one server while the other two could have been idle.

Version 3 was a first stab at resolving this, by breaking up the database into three chunks (assuming three servers) and having each one deal with a third of the database. With a blog database of 60,000 blogs this meant each one served results for 20,000 blogs – theoretically a better break-up of the load.
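The partitioning idea is simple enough to sketch. This isn’t Blogwise’s actual code – just a minimal illustration (function and variable names are mine) of splitting a 60,000-blog index into equal shards, one per server:

```python
def assign_shards(blog_ids, num_servers):
    """Split a collection of blog IDs into roughly equal shards, one per server."""
    shards = [[] for _ in range(num_servers)]
    for blog_id in blog_ids:
        # Simple modulo partitioning: blog 0 -> server 0, blog 1 -> server 1, ...
        shards[blog_id % num_servers].append(blog_id)
    return shards

# 60,000 blogs over three servers -> 20,000 blogs per shard
shards = assign_shards(range(60000), 3)
print([len(s) for s in shards])  # -> [20000, 20000, 20000]
```

Each server then only ever searches its own 20,000-blog slice, so no single machine handles a whole query on its own.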

I had to drop the rewrite suddenly due to the usual lack of time, and never really got back to it. However, with the glory of 9 straight days of home-time I’ve been able to get back in front of the computer and rewrite the entire search system as Version 4.

Results are looking promising. Because of the way I’ve redesigned the database structure and the algorithms, the search is already giving results in under 1.2 seconds on a good day – that may not sound like much, but this is before I put the new load balancing in place. The breakdown of the search timing is the key bit – gathering search results takes almost all of the time; the final arrangement and rendering is a minuscule 0.1 seconds at most.

A good load-balancing system should see that time drop every time I increase the number of servers – with the three servers back in action on Version 4 code, search results should come in at around 0.4 seconds. That’ll move me from the twenty slowest sites on Grabperf to just below the twenty fastest – neat!

The load balancing system is already mapped out on paper. Every few hours the index will be refreshed. This is then divided up according to servers’ various demands and resource availability. The new data is shipped to each search server and the aggregator is then updated with a new map of indexes. Give or take TCP and mapping overheads, this should crudely mean that more servers = faster speed. I like that kind of scalability!
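To make the refresh step concrete, here’s a hedged sketch of the “divide the index according to server capacity” idea described above. The server names and capacity weights are hypothetical – the point is only that a beefier server gets a proportionally larger slice, and the aggregator keeps a map of which server holds which slice:

```python
def divide_index(index_entries, server_capacities):
    """Split index entries across servers in proportion to each server's capacity.

    Returns a map of server name -> slice of the index, which the
    aggregator can use to route query fan-out to the right servers.
    """
    total = sum(server_capacities.values())
    index_map = {}
    start = 0
    servers = list(server_capacities.items())
    for i, (server, capacity) in enumerate(servers):
        if i == len(servers) - 1:
            end = len(index_entries)  # last server takes the remainder
        else:
            end = start + round(len(index_entries) * capacity / total)
        index_map[server] = index_entries[start:end]
        start = end
    return index_map

# A server with twice the capacity gets twice the index slice.
entries = list(range(60000))
index_map = divide_index(entries, {"search1": 2, "search2": 1, "search3": 1})
print({s: len(e) for s, e in index_map.items()})
# -> {'search1': 30000, 'search2': 15000, 'search3': 15000}
```

Adding a fourth server is then just another entry in the capacity map and a re-shipment of the index slices – which is exactly the “more servers = faster speed” property I’m after.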

The search rewrite also coincides with a huge increase in the amount of data being searched – one thing I failed to mention is that the 1.2 seconds covers both the previous keyword index and a new index of full-text RSS feeds. i.e. the search will finally be indexing content as well as metadata.

As I get this thing rolled out, I’ll write up more here. In the meantime, hope you’re having a good time!


Full content feeds

Owing to popular demand (one person), the RSS feed is now full-text. Formatting is still a bit iffy, but for the most part it works OK.

I’m still undecided about the whole full-text versus partial feeds thing. On a non-profit site like this I don’t really mind how you read my blog, but for an advert-supported blog/service it’s a whole load of money they’re missing out on if people don’t visit the site (and/or block ads in RSS feeds).

Still, it’s there now. Let me know how it works out.


Google Books and Ye Olde Engliſh

I was poking around Google Books (formerly Google Print) this morning and discovered an interesting "oddity".

In older English texts, the lower-case letter ‘s’, when appearing at the start of or within a word, was written as a sort-of ‘f’ character – more specifically, the long s: ſ.

Turns out that Google Books can’t cope with this – it reads these as the letter ‘f’, so when searching old texts be sure to accommodate this – for example, searching for impoffible will work and highlight the correct words, but impossible won’t.
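You could even automate the workaround. This is just a toy sketch of the typographic rule (my own function, nothing to do with Google’s code): every non-final lower-case ‘s’ was set as a long s – which the OCR reads as ‘f’ – while a final ‘s’ was printed as a normal short ‘s’ and left alone:

```python
def long_s_variant(word):
    """Rewrite a modern word the way 18th-century type (and the OCR) renders it:
    every non-final lower-case 's' becomes an 'f'; a word-final 's' was
    printed as a normal short 's', so it is left unchanged."""
    if len(word) < 2:
        return word
    body, last = word[:-1], word[-1]
    return body.replace("s", "f") + last

print(long_s_variant("impossible"))  # -> "impoffible"
print(long_s_variant("hats"))        # -> "hats" (final s unchanged)
```

Feed the variant into the search instead of the modern spelling and the old texts turn up as expected.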

It’s hardly an earth-shattering bug, but it’s an interesting note to Google and other would-be book search services to check their OCR software is compatible with 18th century texts!


Alexa Web Search

Ok, now this is interesting.

http://websearch.alexa.com/welcome.html

Alexa (ie. Amazon) are offering developers access to their search index. Essentially that means that if you wanted to, say, create a video search engine, you could run your scripts against Alexa’s index and do what you need to the data. Actually, I can see this far more likely to appeal to market researchers – if Company X wants to know everything on the Internet about their competitor, Company Y – or even how others refer to X – this is the perfect place to run it. The difference between this and, say, running a Google search is that it’s your search algorithms, not Google’s. You can get to the raw data here.

I have to say (how geeky is this) that the documentation and API at http://pages.alexa.com/awsp/docs/WebHelp/AWSP_User_Guide.htm is far more interesting. It provides an insight into how a large search engine organises their files and indexes. I suspect there’s a lot to learn from this.