On CHAR, Unicode and String searching

As part of my wifi project, I need to store the hashed MAC address of devices so we can run analysis and gather daily figures. We use MariaDB, based on MySQL, for this particular application. There are various tables used for this, and all JOINed by the common hash.

When I first put the system together in haste, I used VARCHAR(20) as it wasn’t obvious at the time which route we’d go down, or how things would be processed. You can be a little more wasteful with VARCHAR as the storage space is n+1 (where n is number of chars), so the field shrinks to suit. Fixed strings using CHAR always pre-allocate a fixed space, so are less flexible.

Roll forward a year. Things have grown very fast, and we’re starting to see performance limits being hit, and storage space becoming an issue, so it’s time to review the schema.

 

Now we know the maximum bytes for any hash will always be four. They can be smaller than this (depends on application) but eight is the limit.

So, CHAR(8) would seem a reasonable way to store eight bytes in hex. If you’re screaming about INT/BLOB/VARBINARY storage, I’ll come back to that in a moment…

Issue one – UNICODE

First problem is that, if you use a multi-byte charset like utf-8, CHAR is potentially much worse than VARCHAR. As a fixed field, CHAR must allocate the maximum space for all supported Unicode characters, so three bytes are reserved for each character (MySQL utf8 doesn’t support supplemental unicode characters, so three is the limit).

Compare to VARCHAR, where the field size is dynamic and therefore grows according to the characters in use. For anybody using straightforward Latin characters, this needn’t be more than one byte per character.

In the case of utf8, VARCHAR can be far more efficient than CHAR.

If you’re only using basic characters or storing hex values in CHAR, explicitly set the latin1 character set on this field:

ALTER TABLE `mytable` MODIFY `col` CHAR(8) CHARSET latin1

Issue two – Comparing CHAR and VARCHAR

As I was testing this I converted some tables hoping that the comparison would be fairly straightforward, but it seems there’s a big performance hit when comparing a VARCHAR field to a CHAR field (possibly not helped by charset differences). I was seeing queries slow by a factor of 50+ in some cases.

SELECT a.id, b.category from a inner join b on b.fk=a.id

(fk is a varchar; id is a char – both contain the same string)

This is easily solved by explicitly setting a cast on the VARCHAR field, thus:

SELECT a.id, b.category from a inner join b on CAST(b.fk AS CHAR(8) CHARSET latin1)=a.id

Performance will be drastically improved. I believe this is due to the order and type of string comparison going on, and this addition explicitly converts all the potential strings up-front and avoids doing it on an ad-hoc basis.

Next steps

Performance is now much better, but this is still a fairly wasteful way of storing four bytes. The advantage is that, for debugging, it’s easy enough to test queries and investigate issues without having to deal with binary fields.

In terms of efficiency, the next step might be to switch to an UNSIGNED INT, which is allocated four bytes. This is readable in query output, and can be passed in the browser without too much fuss.

I suspect INT and BINARY also have further performance gains. I haven’t tested the overhead, but I suspect the collation (the system of testing equivalence in characters) is adding some work to string comparisons – none of which we need here.

I need to test further before using UNSIGNED INT, and the update process is marginally more involved as the field needs to be converted and modified.

Right now, the performance is back at decent levels, so this is probably something to park for a while!

 

 

 

Reporting Data with Daylight Savings Time

At the end of October it’ll be time to put the clocks back. We’ll get an extra hour in bed to compensate for the hour lost at the end of March, and the bi-annual tradition of media articles asking whether we should ditch the whole thing will run again.

For anybody dealing with dates and times, daylight savings can be a pain. If your job involves managing computers, you might well keep everything in UTC (/GMT) to save headaches. This is also sensible advice if you deal with multiple time-zones. By sticking to UTC, you avoid the annoying issue of living out the same hour twice one morning somewhere in Autumn.

In city centres, retail and business, our data is very much dictated by local time, not UTC. If the shops open at 8am, it will be based on local time. Work starts at 9 sharp, even if your previous night’s sleep was rudely shortened by sixty minutes.

For this reason, when we (meaning I – my company) report on city centre figures, we use local time. It keeps the peaks and troughs of the day in order. If we used UTC, it would be harder to compare a July day with a November day – they’d be off by an hour.

We also store data in local time, mainly because of the complexities involved in constantly switching between UTC and local time. It’s a minor issue, buts adds time to every query and makes the underlying system more complicated. Some might store in UTC, and convert. However you do this you’ll still need to decide what to do at 2am at the end of October, when time steps backwards.

Springboard

For some useful observations, I needed to look for high-traffic places with a decent night-time economy. These are from Heart of London (West End):

heartlondon-springboard

First observation is that all the times appear to be based on local time. The general peaks/troughs appear to line up irrespective of daylight savings time. Sunday is shown in grey on these charts.

There’s not a lot to go on here, but March 2015 looks like a straight line between 1am and 3am. March 2014 has a data point in the non-existent time, but there’s a noticeable drop. By comparison, October (where we live 1am-2am twice) seems to have a bump at 3am. This is also what I saw in the 2014 data – not shown here.

Given that there’s data in the March 2014 slot, I wonder if this is being accommodated in Springboard’s stats – and if they’re similarly compensating in the early hours of October’s backwards step.

Highways England

Looking elsewhere, Highways England publishes traffic data across its network on a 15 minute basis. They also use local time – again, this makes sense as traffic demands are dictated by the clock.

In March, they simply skip over the non-existent time. 1am to 2am is missing in the data, so for one day per year there are 23 hours’ worth of records.

In October, there is something weirder. The hour is repeated, but the data is inconsistent. This is the traffic data on a section of the M25 over October, with 1am-2am counted twice. Time Period shows the end of the 15 minute section as measured; the last column shows the number of vehicles of a given length over this period.

he-october

It looks like the sensor has managed to send something across throughout the affected time, but the data is largely missing. Interesting that the time period reports as :59 seconds only in this highlighted period, and for the two records where this doesn’t happen (01:44:00 and 01:59:00) we seem to have data. I wonder if this is a bug of some kind.

Conclusions

These are the two main data suppliers I have an interest in, but it’d be useful to gauge feedback from elsewhere. This is a tricky issue. The ultimate goal is to show something which is meaningful to the reader, but we need to do this in a way that does not affect comparisons at the hourly level.

In one way this is a fairly moot point. The volume of traffic at 2am on an out-of-season Sunday is likely inconsequential for many. However it does raise an interesting challenge for reports and figures, and is just one of many subtleties to consider in this sort of analysis.

Final thought: is it possible that the change in daylight savings time actually attracts people?

MariaDB Indexes on DATETIME and DATE functions

Quick note in haste, but important for me to remember.

SELECT * FROM table WHERE DATE(`period`)='2016-09-05'

is orders of magnitude slower than

SELECT * FROM table WHERE `period`>='2016-09-05 00:00:00' AND `period`<'2016-09-06 00:00:00'

where `period` is an indexed DATETIME field. The former uses an assortment of WHERE and INDEX clauses; the latter relies on the INDEX and uses a RANGE search, which is much faster.