September 2016 – Yet Another Blog

On CHAR, Unicode and String searching

As part of my wifi project, I need to store the hashed MAC address of devices so we can run analysis and gather daily figures. We use MariaDB, based on MySQL, for this particular application.Â There are various tables used for this, and all JOINed by the common hash.

When I first put the system togetherÂ in haste, I used VARCHAR(20) as it wasn’t obvious at the time which route we’d go down, or how things would be processed.Â You can be a little more wasteful with VARCHAR as the storage space is n+1 (where n is number of chars), so the field shrinks to suit. Fixed strings using CHAR always pre-allocate a fixed space, so are less flexible.

Roll forward a year. Things have grown very fast, and we’re starting to see performance limits being hit, and storage space becoming an issue, so it’s time to review the schema.

Now we know the maximum bytes for any hash will always be four. They can be smaller than this (depends on application) but eight is theÂ limit.

So, CHAR(8) would seem a reasonable way to store eight bytes in hex.Â If you’re screaming about INT/BLOB/VARBINARYÂ storage, I’ll come back to that in a moment…

Issue one – UNICODE

First problem is that, if you use a multi-byte charset like utf-8, CHAR is potentially much worse than VARCHAR. As a fixed field, CHAR must allocate the maximum space for all supported Unicode characters, so three bytes are reserved for each character (MySQL utf8 doesn’t support supplemental unicode characters, so three is the limit).

Compare to VARCHAR, where the field size is dynamic and therefore grows according to the characters in use. For anybody using straightforward Latin characters, this needn’t be more than one byte per character.

In the case of utf8, VARCHAR canÂ be far more efficient than CHAR.

If you’re only using basic characters or storing hex values in CHAR, explicitly set the latin1 character set on this field:

ALTER TABLE `mytable` MODIFY `col` CHAR(8) CHARSET latin1

Issue two – Comparing CHAR and VARCHAR

As I was testing this I converted some tables hoping that the comparison would be fairly straightforward, but it seems there’s a big performance hit when comparing a VARCHAR field to a CHAR field (possibly not helped by charset differences). I was seeing queries slow by a factor of 50+ in some cases.

SELECT a.id, b.category from a inner join b on b.fk=a.id

(fk is a varchar; id is a char – both contain the same string)

This is easily solved by explicitly setting a cast on the VARCHAR field, thus:

SELECT a.id, b.category from a inner join b on CAST(b.fk AS CHAR(8) CHARSET latin1)=a.id

Performance will be drastically improved. I believe this is due to the order and type of string comparison going on, and this addition explicitly converts all the potential strings up-front and avoids doing it on an ad-hoc basis.

Next steps

Performance is now much better, but this is still a fairly wasteful way of storing four bytes. The advantage is that, for debugging, it’s easy enough to test queries and investigate issues without having toÂ deal with binary fields.

In terms of efficiency, the nextÂ step might be to switch to an UNSIGNED INT, whichÂ is allocated four bytes. This is readable in query output, and can be passed in the browser without too much fuss.

I suspect INT and BINARY also have further performance gains.Â I haven’t tested the overhead, but I suspect the collation (the system of testing equivalence in characters)Â isÂ adding some workÂ to string comparisons – none of which we need here.

I need to test further before using UNSIGNED INT, and the update process is marginally more involved as the field needs to be converted and modified.

Right now, the performance is back at decent levels, so this is probably something to park for a while!

Reporting Data with Daylight Savings Time

At the end of October it’ll be time to put the clocks back. We’ll get an extraÂ hour in bed to compensate for the hour lost at the end of March, and the bi-annual tradition ofÂ mediaÂ articlesÂ asking whether we should ditch the whole thing will run again.

For anybody dealing with dates and times, daylight savings can be a pain. If your job involves managing computers, you might well keep everything in UTC (/GMT) to save headaches. This is also sensible advice if you deal with multiple time-zones. By sticking to UTC, you avoid the annoying issue of living out the same hour twiceÂ one morningÂ somewhereÂ in Autumn.

InÂ city centres,Â retail and business, our data is very much dictated by local time, not UTC. IfÂ the shops open at 8am, it will be based on local time. Work starts at 9 sharp, even if your previous night’s sleep was rudely shortened by sixty minutes.

For this reason, when we (meaning I – my company) report on city centre figures, we use local time. It keeps the peaks and troughs of the day in order. If weÂ used UTC, it would be harder to compare a July day with a November day -Â they’d be off by an hour.

We also store data in local time, mainly because of the complexitiesÂ involved inÂ constantly switching between UTC andÂ local time. It’s a minor issue, buts adds time to every query and makes the underlying systemÂ more complicated. Some might store in UTC, and convert. However you do this you’ll still need to decide what to do at 2am at the end of October, when time steps backwards.

Springboard

For some useful observations, I needed to look for high-traffic places with a decent night-time economy. These are from Heart of London (West End):

heartlondon-springboard

First observation is that all the times appear to be based on local time. The general peaks/troughs appear toÂ line up irrespective of daylight savings time. Sunday is shown in grey on these charts.

There’s not a lot to go on here, but March 2015 looks like a straight line between 1am and 3am. March 2014 hasÂ a data point in the non-existent time, but there’s a noticeable drop. By comparison, October (where we liveÂ 1am-2am twice)Â seems to have a bump at 3am. This is also what I saw in the 2014 data – not shown here.

Given that there’s data in the March 2014 slot, I wonder if this is being accommodated inÂ Springboard’s stats – and if they’re similarly compensating in theÂ early hours of October’s backwards step.

Highways England

Looking elsewhere, Highways England publishes traffic data across its network on a 15 minute basis. They also use local time – again, this makes senseÂ as traffic demands areÂ dictated by the clock.

In March, they simply skip over the non-existent time. 1am to 2am is missing in the data, so for one day per year there are 23 hours’ worth of records.

In October,Â there is something weirder. The hour is repeated, but the data is inconsistent. This is theÂ traffic data on a section of the M25 over October, with 1am-2am counted twice. Time Period shows the end of the 15 minute section as measured; the last column shows the number of vehicles of a given length over this period.

he-october

It looks like the sensor has managed to send somethingÂ across throughout the affected time, but the data is largely missing. Interesting that the time period reports as :59 seconds only in this highlighted period, and for the two records where thisÂ doesn’tÂ happen (01:44:00 and 01:59:00) we seem to have data. I wonder if this is aÂ bug of some kind.

Conclusions

These areÂ the two main data suppliers I have an interest in, but it’d be useful to gauge feedback from elsewhere. This is a tricky issue. TheÂ ultimate goal is to show something which is meaningful to the reader, but we need to do this in a wayÂ that does not affectÂ comparisons at the hourly level.

In one wayÂ this is a fairly moot point. The volume of trafficÂ at 2am on an out-of-season Sunday is likely inconsequential for many. However it does raise an interesting challenge for reports and figures, and is just one of many subtleties to consider in this sort of analysis.

Final thought: is it possible that the change in daylight savings time actually attracts people?

MariaDB Indexes on DATETIME and DATE functions

Quick note in haste, but important for me to remember.

SELECT * FROM table WHERE DATE(`period`)='2016-09-05'

is orders of magnitude slower than

SELECT * FROM table WHERE `period`>='2016-09-05 00:00:00' AND `period`<'2016-09-06 00:00:00'

where `period` is an indexed DATETIME field. The former uses an assortment of WHERE and INDEX clauses; the latter relies on the INDEX and uses a RANGE search, which is much faster.