Don’t blame software for missing tests – it’s down to a lack of basic data handling skills

I spent 20 years doing data analysis and it is clear to me that the latest government cock-up happened because data audit disciplines were jettisoned

Jane Fae
Wednesday 07 October 2020 14:03 BST
In the dark? Health secretary Matt Hancock should focus on a low-tech solution that actually works (Reuters)

So 16,000 Covid-19 results went walkabout and it’s all the fault of those silly folks at Public Health England. Or the NHS. Or Microsoft, for allowing us to use an old format when more modern versions are available. Or perhaps it was the software used. Who knows? Meanwhile, IT experts were rushing to assert that, with a “proper” database, none of this would have happened.

Most worrying, though, is the smugness that seems to have gripped a commentariat not hitherto renowned for its data analysis skills. This has led them to focus on the wrong thing. The BBC suggested that “the badly thought-out use of Microsoft’s Excel software” was at fault and that “the problem is that PHE’s own developers picked an old file format”.

Of course the real focus, at least on the basis of government statements, ought to be the lack of basic data handling skills.

But the BBC was far from alone. Other major publications followed suit and Twitter… Well, the less said about Twitter, the better. I found myself sounding ever more like Victor Meldrew, exploding into “I don’t believe it” rants, and observing that it wasn’t like this in my day.

My earliest days as a systems analyst entailed drawing detailed flow-charts that mapped the progress of every job through complex IT systems and, critically, logged records in and out at every process stage.

Matt Hancock, on the other hand, when challenged to publish relevant data-process diagrams, in order to crowdsource error-checking, agreed to consider doing so, but added: “The challenge of a maximum-file-size error is that it wouldn't necessarily appear on those sorts of flowcharts.”

Rubbish! If you aren’t constantly recording and checking the number of records input and output to a system critical process, you just aren’t doing system stuff right.
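
To make that concrete: a minimal sketch of such a check, in Python, might count the records going into a collation step and the records coming out, and refuse to proceed if they don’t reconcile. (The file names and structure here are purely illustrative – this is not PHE’s actual pipeline.)

    # Illustrative record-count audit for a hypothetical collation step.
    # File names are invented for the example; this is not PHE's pipeline.
    import csv
    import sys

    IN_PATH = "lab_results_in.csv"      # hypothetical input feed
    OUT_PATH = "collated_results.csv"   # hypothetical collated output

    def count_records(path):
        """Count data rows in a CSV file, excluding the header."""
        with open(path, newline="") as f:
            return sum(1 for _ in csv.reader(f)) - 1

    records_in = count_records(IN_PATH)
    records_out = count_records(OUT_PATH)

    if records_in != records_out:
        sys.exit(f"AUDIT FAILED: {records_in} records in, {records_out} out "
                 f"({records_in - records_out} unaccounted for)")

    print(f"AUDIT OK: {records_in} records reconciled in and out")

Nothing about a check like this depends on which version of Excel, or which database, happens to sit in the middle.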

Besides, the real problem, obscured by this focus on using “old Excel” versus “new Excel” – or some other software variant – is an obsession that has bedevilled our anti-coronavirus efforts from the start. Let’s go hi-tech. Let’s build a pretty app. And yet, all that folks are crying out for is a low-tech solution that actually works.

I spent 20 years doing data analysis, teaching data analysis, and writing about it. I started out using little more than beta versions of what have since become mainstream stats packages (SPSS and SAS), and I was using them to analyse the best part of a million records on an old 20MB hard drive. Yes, you read that right. 20MB total storage, not RAM!

It was wing-and-a-prayer stuff, and I would not recommend it to anyone. But it was manageable, so long as one came at it with decent data management disciplines.

The bane of my life, though, was a dominant ideology, forever trying to move data analysis out of a stats-based environment and into something traditional programmers were more comfortable with.

I discovered one of the most egregious cock-ups caused by this approach when checking (internally generated) reports submitted to a major office of state, and then reported to parliament. Based on a “standard database”, these double- and triple-counted figures in areas of postcode overlap and, inexplicably, had deleted Southport.

This same department also rejected a proposal to cut the cost of annual reporting from £750,000 to less than one tenth of that amount, using lower tech solutions, simply because they were used to paying mainframe prices and didn’t believe it could be done for less.

A close second was a tendency on the part of software developers to hide the guts of their software, so that analysts had to do extra work just to understand what was going on when processing data. Here, most likely, lie the origins of the latest embarrassment. For just as sat-nav has untaught an entire generation the art of map-reading, so clever people have been conditioned to believe that the outputs from cleverer and cleverer software must be true, simply because a computer says so.

The real issue is not what version of Excel was used: indeed, unless you know precisely how it was being used and what for, it is very hard to make any sensible observation on its use. Rather, we seem to have jettisoned good old-fashioned data audit disciplines. That is the problem.
