On CHAR, Unicode and String searching

As part of my wifi project, I need to store the hashed MAC address of devices so we can run analysis and gather daily figures. We use MariaDB, based on MySQL, for this particular application. There are various tables used for this, and all JOINed by the common hash.

When I first put the system together in haste, I used VARCHAR(20) as it wasn’t obvious at the time which route we’d go down, or how things would be processed. You can be a little more wasteful with VARCHAR as the storage space is n+1 bytes (where n is the length of the stored string), so the field shrinks to suit. Fixed strings using CHAR always pre-allocate a fixed space, so are less flexible.

Roll forward a year. Things have grown very fast, and we’re starting to see performance limits being hit, and storage space becoming an issue, so it’s time to review the schema.


Now we know the hash will always be at most four bytes. It can be smaller than this (depends on application), but four is the limit.

So, CHAR(8) would seem a reasonable way to store those four bytes in hex. If you’re screaming about INT/BLOB/VARBINARY storage, I’ll come back to that in a moment…

Issue one – UNICODE

First problem is that, if you use a multi-byte charset like utf-8, CHAR is potentially much worse than VARCHAR. As a fixed field, CHAR must allocate the maximum space for all supported Unicode characters, so three bytes are reserved for each character (MySQL utf8 doesn’t support supplemental unicode characters, so three is the limit).

Compare to VARCHAR, where the field size is dynamic and therefore grows according to the characters in use. For anybody using straightforward Latin characters, this needn’t be more than one byte per character.

In the case of utf8, VARCHAR can be far more efficient than CHAR.

If you’re only using basic characters or storing hex values in CHAR, explicitly set the latin1 character set on this field:

ALTER TABLE `mytable` MODIFY `col` CHAR(8) CHARSET latin1

Issue two – Comparing CHAR and VARCHAR

As I was testing this I converted some tables hoping that the comparison would be fairly straightforward, but it seems there’s a big performance hit when comparing a VARCHAR field to a CHAR field (possibly not helped by charset differences). I was seeing queries slow by a factor of 50+ in some cases.

SELECT a.id, b.category from a inner join b on b.fk=a.id

(fk is a varchar; id is a char – both contain the same string)

This is easily solved by explicitly setting a cast on the VARCHAR field, thus:

SELECT a.id, b.category from a inner join b on CAST(b.fk AS CHAR(8) CHARSET latin1)=a.id

Performance will be drastically improved. I believe this is down to how the mismatched types and charsets are compared: the explicit cast converts everything up-front instead of on an ad-hoc, row-by-row basis.
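If the join is permanent, the longer-term fix is to make both columns match so the cast isn’t needed at all. A minimal sketch, re-using the table and column names from the example above (back up and check dependent queries first):

ALTER TABLE `b` MODIFY `fk` CHAR(8) CHARSET latin1;
ALTER TABLE `a` MODIFY `id` CHAR(8) CHARSET latin1;

With both sides stored as CHAR(8) in latin1, the join compares like-for-like and the CAST can be dropped.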

Next steps

Performance is now much better, but this is still a fairly wasteful way of storing four bytes. The advantage is that, for debugging, it’s easy enough to test queries and investigate issues without having to deal with binary fields.

In terms of efficiency, the next step might be to switch to an UNSIGNED INT, which is allocated four bytes. This is readable in query output, and can be passed in the browser without too much fuss.

I suspect INT and BINARY also have further performance gains. I haven’t tested the overhead, but I suspect the collation (the system of testing equivalence in characters) is adding some work to string comparisons – none of which we need here.

I need to test further before using UNSIGNED INT, and the update process is marginally more involved as the field needs to be converted and modified.
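For reference, a rough sketch of what that conversion might look like, re-using the `mytable`/`col` names from earlier and a hypothetical `hash_int` column (untested against the real schema):

ALTER TABLE `mytable` ADD COLUMN `hash_int` INT UNSIGNED;
-- CONV() turns the hex string into its decimal value
UPDATE `mytable` SET `hash_int` = CAST(CONV(`col`, 16, 10) AS UNSIGNED);
-- once everything joins on the new column, the CHAR(8) field can go
ALTER TABLE `mytable` DROP COLUMN `col`;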

Right now, the performance is back at decent levels, so this is probably something to park for a while!


Reporting Data with Daylight Savings Time

At the end of October it’ll be time to put the clocks back. We’ll get an extra hour in bed to compensate for the hour lost at the end of March, and the bi-annual tradition of media articles asking whether we should ditch the whole thing will run again.

For anybody dealing with dates and times, daylight savings can be a pain. If your job involves managing computers, you might well keep everything in UTC (/GMT) to save headaches. This is also sensible advice if you deal with multiple time-zones. By sticking to UTC, you avoid the annoying issue of living out the same hour twice one morning somewhere in Autumn.

In city centres, retail and business, our data is very much dictated by local time, not UTC. If the shops open at 8am, it will be based on local time. Work starts at 9 sharp, even if your previous night’s sleep was rudely shortened by sixty minutes.

For this reason, when we (meaning I – my company) report on city centre figures, we use local time. It keeps the peaks and troughs of the day in order. If we used UTC, it would be harder to compare a July day with a November day - they’d be off by an hour.

We also store data in local time, mainly because of the complexities involved in constantly switching between UTC and local time. It’s a minor issue, but it adds time to every query and makes the underlying system more complicated. Some might store in UTC and convert. However you do this, you’ll still need to decide what to do at 2am at the end of October, when time steps backwards.
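For anyone storing UTC and converting on the way out, MariaDB/MySQL do provide CONVERT_TZ() (once the time zone tables have been loaded, e.g. with mysql_tzinfo_to_sql). A minimal sketch, assuming a hypothetical `visits` table holding a UTC `period` column:

SELECT CONVERT_TZ(`period`, 'UTC', 'Europe/London') AS local_time, COUNT(*) AS hits
FROM `visits`
GROUP BY local_time;

Even then, the reverse mapping is ambiguous for the repeated 1am-2am hour in October, so the decision doesn’t go away.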

Springboard

For some useful observations, I needed to look for high-traffic places with a decent night-time economy. These are from Heart of London (West End):

[Chart: Springboard footfall data, Heart of London (West End)]

First observation is that all the times appear to be based on local time. The general peaks/troughs appear to line up irrespective of daylight savings time. Sunday is shown in grey on these charts.

There’s not a lot to go on here, but March 2015 looks like a straight line between 1am and 3am. March 2014 has a data point in the non-existent time, but there’s a noticeable drop. By comparison, October (where we live 1am-2am twice) seems to have a bump at 3am. This is also what I saw in the 2014 data – not shown here.

Given that there’s data in the March 2014 slot, I wonder if this is being accommodated in Springboard’s stats – and if they’re similarly compensating in the early hours of October’s backwards step.

Highways England

Looking elsewhere, Highways England publishes traffic data across its network on a 15 minute basis. They also use local time – again, this makes sense as traffic demands are dictated by the clock.

In March, they simply skip over the non-existent time. 1am to 2am is missing in the data, so for one day per year there are 23 hours’ worth of records.

In October, there is something weirder. The hour is repeated, but the data is inconsistent. This is the traffic data on a section of the M25 over October, with 1am-2am counted twice. Time Period shows the end of the 15 minute section as measured; the last column shows the number of vehicles of a given length over this period.

[Table: Highways England 15-minute traffic counts on a section of the M25 over the October clock change]

It looks like the sensor has managed to send something across throughout the affected time, but the data is largely missing. Interesting that the time period reports as :59 seconds only in this highlighted period, and for the two records where this doesn’t happen (01:44:00 and 01:59:00) we seem to have data. I wonder if this is a bug of some kind.

Conclusions

These are the two main data suppliers I have an interest in, but it’d be useful to gauge feedback from elsewhere. This is a tricky issue. The ultimate goal is to show something which is meaningful to the reader, but we need to do this in a way that does not affect comparisons at the hourly level.

In one way this is a fairly moot point. The volume of traffic at 2am on an out-of-season Sunday is likely inconsequential for many. However it does raise an interesting challenge for reports and figures, and is just one of many subtleties to consider in this sort of analysis.

Final thought: is it possible that the change in daylight savings time actually attracts people?

MariaDB Indexes on DATETIME and DATE functions

Quick note in haste, but important for me to remember.

SELECT * FROM table WHERE DATE(`period`)='2016-09-05'

is orders of magnitude slower than

SELECT * FROM table WHERE `period`>='2016-09-05 00:00:00' AND `period`<'2016-09-06 00:00:00'

where `period` is an indexed DATETIME field. Wrapping the column in DATE() stops the optimiser from using the index on `period`, so every row has to be evaluated; the second form relies on the INDEX and uses a RANGE search, which is much faster.
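A quick way to see the difference is to prefix both queries with EXPLAIN and compare the access type: the first typically shows a full scan, while the second is reported as a range read on the index.

EXPLAIN SELECT * FROM table WHERE DATE(`period`)='2016-09-05';
EXPLAIN SELECT * FROM table WHERE `period`>='2016-09-05 00:00:00' AND `period`<'2016-09-06 00:00:00';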

554 Message not allowed – Headers are not RFC compliant[291]

If you get this message when sending emails to Yahoo addresses from PHP, it’s quite possible the subject line is being duplicated and this is causing your email to be rejected.

PHP’s mail() function takes a Subject as a parameter. It also accepts a custom list of headers. The mail() function seems to always append a subject header irrespective of the contents of the custom list.

So, if you have the Subject already defined in your custom headers, PHP will add another. This is not RFC compliant, and Yahoo has a particularly strict checker for incoming emails (most other mail servers seem to ignore this, as far as I can see).

Take the Subject line out of your custom headers, make sure it’s in the mail() call itself, and the mail should send (assuming no other issues….)
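As a minimal sketch (addresses and wording purely illustrative), the subject goes in the dedicated parameter and stays out of the headers string:

<?php
$to      = 'someone@example.com';
$subject = 'Hello there';              // mail() adds the Subject: header from this
$message = 'Message body';
$headers = "From: me@example.com\r\n" .
           "Reply-To: me@example.com\r\n";   // note: no Subject: line in here
mail($to, $subject, $message, $headers);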

Huawei E3131

It seems like I’m having mixed success with devices at the moment. To be fair, this one isn’t a product mismatch but a manufacturer issue, and thankfully seems fairly easy to get around.

This post is as much a personal note/reminder as anything else. If anybody needs detailed notes please drop me a line 🙂

The Huawei E3131 has been a pretty good 3G modem for the Raspberry Pi. After a bit of hacking about, it lights up and connects to the Internet.

The first batch (black cover in the photo) I bought worked as follows:

[Photo: the two batches of E3131 devices, black and white covers]

A network interface appears on eth1. The Pi gets an IP address 192.168.1.5 and the modem appears as 192.168.1.1. The modem has a mini web server which can be interrogated to control & get information about the connection.

The latest batch (white ones) work as follows:

Network interface is now on 192.168.8.100 and the modem is on 192.168.8.1. The web server still works, but it has some kind of basic security mechanism built into it, so you need to get a token first.

The original device had a USB id of 12d1:14db; the newer ones have a USB id of 12d1:14dc

So, not only has the IP address/local interface changed but there’s a new security mechanism involved. Time for some extra work!

Useful links

http://tjworld.net/wiki/Huawei/E3131UsbHspa

http://www.3g-modem-wiki.com/page/Huawei+E3131

Update: Looks like it validates the Referer field, perhaps for cross-site protection – I set it to http://192.168.8.1/html/index.html and requests work now.
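For my own notes, the sort of request that now works, with the Referer set via curl. The exact API paths depend on the firmware version; the two below are the commonly reported HiLink ones, so treat them as an assumption rather than gospel.

# grab a session token first (the newer firmware expects one)
curl -s -e "http://192.168.8.1/html/index.html" http://192.168.8.1/api/webserver/SesTokInfo

# then query the connection status, sending the same Referer
curl -s -e "http://192.168.8.1/html/index.html" http://192.168.8.1/api/monitoring/status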

Raspberry Pi Wifi Dongle on Amazon

Slightly annoyed to discover that my most recent batch of USB wifi adapters, sold as “for Raspberry Pi” are not quite compatible.

For my various wifi projects, I’ve been buying up lots of little wifi adapters. All are labelled in the same way, “USB Wifi Adapters for the Raspberry Pi – By New IT”. It looks like they’re all based on this product from supplier ‘New IT’ (whom I’ve no reason to believe are involved in the issues here).

[Photo: the two USB wifi adapters, side by side]

The first few batches worked exactly as I wanted. Plug and play.

The most recent batch has, however, been an entirely different story. They’re simply different.

The picture shows two USB devices. On the left is the most recent, non-working device. On the right is the working device. The most notable difference is the ‘802.11n’ label is on a different side for each.

What’s the difference?

It’s all in the internals. Despite there being hundreds of wifi products they’re all based on a surprisingly small number of variants.

The working version is based on a chipset from Ralink, the RT5370. This works out of the box for the Raspberry Pi, seems to be a good performer, and – crucially for me – supports monitor mode (so it can act as an access point too).

The problematic version uses the Realtek 8188 (RTL8188) chipset. This doesn’t have full support in the Raspberry Pi, and frankly it’s been a pain in the backside to work with. I have a box full of old USB wifi dongles; most seem to be based on the RTL8188 and I never really had much luck with them.

It’s possible to get them up and running, by compiling special drivers, but frankly it isn’t worth the effort and I’ve had lots of memory leaks and other issues with them.
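The quickest way to check what you’ve actually been sent is to plug it in and run lsusb; the exact strings depend on the USB ID database, but the chipset name is normally obvious:

lsusb
# a working dongle typically identifies itself as a Ralink RT5370 device;
# the problematic ones show up as a Realtek RTL8188 variant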

Be careful what you buy

So, back to Amazon. It seems that some devices sold as ‘for Raspberry Pi’ are in fact not ideal – and take a lot of work to get going. This shows up in the reviews as it looks like a few people have been stung.

The whole thing isn’t helped by Amazon’s peculiar supplier system, where it opts for the cheapest supplier based on the product you choose. It looks like one or two suppliers have these RT8188 chips and are selling them under the wrong label – no idea if intentional or not.

Thankfully it’s fairly easy to pick the right supplier. Make sure you choose ‘New IT’. I’ve just ordered another four devices from them and they work fine, straight out of the box, for the Raspberry Pi. Yes, they’re a bit more expensive but they work!

Now to decide whether it’s worth my effort negotiating Amazon’s support line and sending the two dud devices back…

Visual Studio/ASP.NET Code Behind

I’m going to need to remember this, as it runs slightly counter to what I was expecting (although it makes sense).

Code compiled in App_Code is available globally, so any other piece of code can reference it.

Code compiled in code-behind (i.e. the .cs files ‘behind’ ASPX pages) is only available to its corresponding ASPX page unless you explicitly reference it.

This can be done by adding the following to the ASPX page:

<%@ Reference Page="......link to .aspx" %>
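As a concrete (entirely hypothetical) example: if Default.aspx needs the class compiled behind OtherPage.aspx, the directive goes at the top of Default.aspx alongside the Page directive:

<%@ Page Language="C#" CodeFile="Default.aspx.cs" Inherits="_Default" %>
<%@ Reference Page="~/OtherPage.aspx" %>
<%-- the code-behind in Default.aspx.cs can now refer to the OtherPage type --%>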

One to remember.

Wifi Visitor Counting

It’s been a little while since my last post about visitor tracking, and some updates are long overdue.

The good news is that it’s ready to launch, with lots of interest. This has been a challenging project in many respects, and I’ll be documenting the various aspects shortly.

Initial results are very promising. We’re seeing figures that have never before been seen; stats and information that were until now completely invisible. It’s thrown up issues about data transfer, privacy and many other factors.

Privacy is a huge issue for me – there are lots of companies that (in my mind) are far too aggressive on this issue – so expect more on this as I continue to share my work.

It’s exciting stuff, and I’m keen to share my experiences and thoughts for others. Stay tuned!

Avoiding SD Card Corruption on a Raspberry Pi

As part of my visitor mapping project, the bulk of my time has been spent throwing curse words at the SD cards.

Raspberry Pis depend on SD cards. There’s little other choice. More specifically, the newer devices use Micro SD cards. It’s fairly incredible to realise just how much data can be stored on something around the size of a thumbnail, but I digress.

The problem with these cards (& flash memory in general) is that they’re very sensitive to corruption. Any power fluctuation or interrupted write is a potential hazard, with a real chance that the next boot will end in failure/disaster.

In my experience (ten devices in lab conditions; regular reboots) failure rate was as high as one in ten reboots. Slightly oddly, software resets (i.e. rebooting from the command line) appeared to fail more often than unpredictable power loss.

Read-Only is a winner

There is a way to boot a Raspberry Pi in read-only mode, although it requires a bit of care. I found a couple of ways to do this, by editing the boot config file and by setting the fstab to boot as read only.

The consequences are hopefully clear – nothing is retained. However, it’s also possible to temporarily remount the file system in writeable mode (and revert shortly after). My software is capable of self-updating, and writes its logs to the filesystem in case of Internet problems. Both work very well.

This website gave the best instructions. It includes removing a bunch of daemons that – frankly – I had no need for. I also removed fake-hwclock (for reasons which I’ll describe on another post) and preferred mounting ramfs on places like /var/log/ over linking.
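As a rough sketch of the end result (device names and options from memory, so double-check against whichever guide you follow):

# /etc/fstab: SD card partitions mounted read-only, volatile paths kept in RAM
/dev/mmcblk0p1  /boot     vfat   defaults,ro           0  2
/dev/mmcblk0p2  /         ext4   defaults,noatime,ro   0  1
ramfs           /var/log  ramfs  defaults              0  0
ramfs           /tmp      ramfs  defaults              0  0

# temporarily allow writes (e.g. for a self-update), then lock it down again
mount -o remount,rw /
mount -o remount,ro /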

The results have been pretty impressive. Since switching to this method I haven’t had a single corruption issue (despite many tens of reboots).

Power is a concern

I suspect that one of the big problems is power, and began testing with a voltmeter. My suspicion is that brownouts – i.e. dips in power – were causing reboots and corruption.

The latest Raspberry Pi hardware supports a configuration that boosts the current available to USB devices. Setting max_usb_current=1 in /boot/config.txt appears to raise the USB supply limit to 1.2A, which avoids many of the power-related issues with high-demand peripherals.
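For reference, the line in question, plus a couple of firmware checks that save reaching for the voltmeter every time (get_throttled only exists on newer firmware, so treat that one as optional):

# /boot/config.txt
max_usb_current=1

# check the core voltage, and whether under-voltage has been flagged
vcgencmd measure_volts core
vcgencmd get_throttled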

My non-scientific and unmeasured experience confirms this. I’ve seen much better performance and reliability as a result, and – although can’t be sure – I suspect this has helped the SD card woes as well.

In any case – touch wood and all that – this appears to have solved the corruption issues.

The 5 Weekends 823 Years Thing

There’s a thing going around Facebook about ‘this month has five weekends, it only happens once every 823 years’.

At first glance, it is pretty amazing. The dates line up just right to grant us five whole weekends, including Friday. Bliss.

[Image: the Facebook graphic making the claim]

Sadly, it’s rubbish. Or, to put it another way: if you like 5 weekend months, I’ve got good news for you!

The only way a month has even a chance of five weekends (I’m counting Fri+Sat+Sun) is if it has 31 days. Any shorter, and at least one of those weekends would be cut short.

That means only January, March, May, July, August, October and December can qualify.

We can also only achieve this if the first of the month is a Friday, since it’s the only way we can fit all five weekends into a single month.

So, of the seven months of the year that fit the criteria, at least one of those needs to start on a Friday. You’ve seven months; seven days. The odds are pretty good.

I make it 1-(6/7)^7, around 66% chance that any given year will have at least one 31-day month that starts on a Friday.

But wait, there’s more…

Years are pretty regular things. Apart from the small annoyance of February (damn you February) every month is a fixed length – and even the changeable one is predictable.

This is partly how people manage to tell you the day of the week, given any date in recent history. They don’t memorise each day – it’s systematic.

So if January is a ‘5 weekend month’ then October of the same year will also be one, provided it’s not a leap year. If it is, then July is also a 5 weekend month.

This affects the probabilities a little, since the dates aren’t truly random, but 66% is a fairly good ballpark for most purposes.
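If you’d rather not trust the arithmetic, a throwaway PHP sketch lists the ‘5 weekend months’ (31 days, starting on a Friday) for a range of years, and agrees with the list below:

<?php
// a month can only hold five full Fri+Sat+Sun weekends if it has 31 days and the 1st is a Friday
for ($year = 2015; $year <= 2019; $year++) {
    foreach ([1, 3, 5, 7, 8, 10, 12] as $month) {
        $first = new DateTime(sprintf('%04d-%02d-01', $year, $month));
        if ($first->format('N') == 5) {      // ISO day of week: 5 = Friday
            echo $first->format('F Y'), PHP_EOL;
        }
    }
}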

Good news for fans of the weekend

2015 has one month that fits the ‘5 weekend month’ rule: May (not August as the original email suggested).

2016 has two: January and July

In 2017 we’ll have to wait until December to celebrate again.

2018 there won’t be any – so we’ll have to wait until March 2019 before we pull out the party poppers again.

Whatever the outcome, it’s certainly not once every 823 years so don’t worry if you missed it.