Is inflation really 8%? ONS web scraping trial problems

*** UPDATE: on 26 October 2015, ONS admitted to a significant error in this report. The revised report can be found here. ***

On 1 September 2015, ONS published a report on “Research indices using web scraped data” (original ONS report here and data here). It was an update on their initial analysis published in June on a trial using web scraped data to compile price indices.

The objectives of this trial appear to be to determine if and how the ONS might be able to use web scraping as an alternative method of data collection. Their hope, like many trying to use big data, is to reduce costs (by cutting out manual data collectors) and to improve quality (by increasing the number of prices checked and their frequency).

The results show that the ONS seems a very long way from perfecting the use of big data. Somewhat confusingly, their report initially focuses on the differences between calculating price indices using some statistical wizardry called “chain linking” and calculating them by comparing the average unit prices of products. These differences have led to a number of media headlines claiming that inflation is being severely underestimated, e.g. Cost of everyday items for sale in supermarkets rockets 8 per cent in last year (Daily Mirror).

These reports are completely misleading, as I’ll discuss below. ONS should bear some responsibility for the confusion. Their report should probably have led with what you have to wait until section 4.3 to read, i.e. a “comparison with CPI”. The latter shows that when you scrape prices in a limited way and try to replicate what you are doing offline as best you can, you can end up with fairly similar results. (Though strangely, the results published using this method in September differ markedly from those published in June, and the report makes no reference to what was done differently to make the fit better.)

[Figure: June report showing a big difference between CPI and its web scraped version]

[Figure: September report showing the two to be very similar]

But let us return to the hype being reported in the media. It relates to the results of the ONS’s attempts to produce price indices on a more frequent basis (e.g. daily or weekly) from the web scraped data. The problem with scraping as ONS have done it (i.e. letting a computer collect the price of everything with the label “whisky” on it from the Tesco, Sainsbury’s and Waitrose websites) is that you can end up with a load of data that is difficult to analyse and sometimes misleading (e.g. Tesco include rum in with whisky!). There are also the problems that supermarkets frequently change their labels for products, products go out of stock or get delisted, and sometimes no data is collected due to computer problems. All of this means that comparing prices consistently over a period of time is almost impossible without some sort of major compromise.

The compromises you make then impact the results you get back. ONS looked at two main approaches. The first is to find just the subset of products that they could track at some point in every month and compare those, called the unit price index. This subset is less than a quarter of the original one and is sometimes down to just one line per supermarket (in the case of bananas): hardly big data then!
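To make the first approach concrete, here is a minimal sketch in Python. It is my own illustration, not the ONS method: the function name `unit_price_index`, the product names and the prices are all made up, and I am assuming the index is simply the average price of the tracked subset relative to the first month.

```python
from collections import defaultdict
from statistics import mean


def unit_price_index(observations):
    """Index (first month = 100) of the average price of the products
    seen at least once in every month.

    observations: iterable of (product, month, price) tuples.
    Hypothetical helper for illustration only.
    """
    by_product = defaultdict(lambda: defaultdict(list))
    months = set()
    for product, month, price in observations:
        by_product[product][month].append(price)
        months.add(month)
    months = sorted(months)

    # Keep only the products that can be tracked in every month.
    tracked = sorted(
        p for p, seen in by_product.items() if all(m in seen for m in months)
    )
    if not tracked:
        raise ValueError("no product was observed in every month")

    # Average each product's scraped prices within a month, then average
    # across the tracked products.
    averages = [mean(mean(by_product[p][m]) for p in tracked) for m in months]
    return tracked, [100 * avg / averages[0] for avg in averages]


# Made-up scrape: whisky_c is delisted and whisky_d is new, so both drop out.
scraped = [
    ("whisky_a", "2015-06", 20.0), ("whisky_a", "2015-07", 21.0),
    ("whisky_b", "2015-06", 30.0), ("whisky_b", "2015-07", 30.0),
    ("whisky_c", "2015-06", 15.0),
    ("whisky_d", "2015-07", 12.0),
]
tracked, index = unit_price_index(scraped)
print(tracked, index)
```

Note how half of the scraped products are thrown away even in this tiny example, which is the compromise described above.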

The alternative* is to take each pair of consecutive days over the year, match the products present on both days, and link each day’s result to the previous day’s with some clever stats, called chain linking. The latter is an ingenious idea, but in practice it means that the price indices so created drift ever higher, possibly because supermarkets bring in new lines on promotion at a discount price and then, when they return to full price, many get delisted as people stop buying them.
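The suspected drift mechanism can be shown with a toy example. This is a sketch under my own assumptions, not the ONS calculation: each link is taken to be the geometric mean of price changes for the products present in both periods, and the data (one staple whose price never changes, plus a new promotional line each period) is invented purely to illustrate the point.

```python
from math import prod


def chained_index(periods):
    """Chain-linked index, base 100, from a list of {product: price} dicts.

    Each link uses only the products present in both adjacent periods
    (an assumed geometric-mean formula, for illustration only).
    """
    index = [100.0]
    for prev, curr in zip(periods, periods[1:]):
        matched = sorted(set(prev) & set(curr))
        if not matched:
            index.append(index[-1])  # nothing to compare: carry forward
            continue
        relatives = [curr[p] / prev[p] for p in matched]
        link = prod(relatives) ** (1 / len(relatives))
        index.append(index[-1] * link)
    return index


# Each promo line launches at 0.80, returns to 1.00, then is delisted.
periods = [
    {"staple": 1.00, "promo_b": 0.80},
    {"staple": 1.00, "promo_b": 1.00, "promo_c": 0.80},
    {"staple": 1.00, "promo_c": 1.00, "promo_d": 0.80},
    {"staple": 1.00, "promo_d": 1.00},
]
index = chained_index(periods)
print([round(level, 1) for level in index])
```

No full price ever changes in this made-up data, yet the chained index climbs every period. Each promotion’s return to full price is counted as a rise, while the matching drop never enters the chain because the product is either not yet listed or already delisted.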

[Figure from the ONS September report: chain-linked index]

Neither of the above is a viable solution for creating a price index and so the ONS goal of maybe producing daily price indices seems a long way off.

Instead, what ONS need to take out of this trial is the basic issue with big data: making it work accurately requires a lot of data cleaning and manipulation (called data wrangling). Without that you risk the “rubbish in, rubbish out” scenario that has dogged so many such projects.

Indeed, ONS might find that it is actually cheaper and better quality just to send out the interviewers to collect the prices as they do now. Having said that, the ideal solution is probably a mixture of the two, i.e. set up an online interface so that an intelligent human data collector can go online and collect the prices from supermarkets that have online shops. They can then deal with the product renamings and, when required, propose substitutions if products genuinely become delisted. You then end up with similar data to now but, hopefully, collected a bit quicker. (Note the word “bit”. I suspect the challenging usability of many online shops means that it might actually be quicker to visit the vegetable aisle of Tesco and deal with the substitutions in person than to try to do it online.)

Finally, to reiterate, inflation for food is not running at 8%. The latest CPI figures, published today, estimate it to be -2.4% (RPI -2.0%).

* Note, ONS also report a variant of chain linking called GEKS, which appears to suffer less from upward drift, but its results appear inconsistent and so it appears to be no solution either.
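For the curious, GEKS is usually described as comparing two periods via every other period in the window and taking the geometric mean of all those routes. The sketch below follows that textbook description. It is not the ONS implementation: the pairwise formula (a geometric mean of price changes over matched products), the function names and the toy prices are all my own assumptions.

```python
from math import prod


def bilateral(a, b):
    """Geometric mean of price changes from period a to period b over the
    products present in both (assumed pairwise formula)."""
    matched = sorted(set(a) & set(b))
    if not matched:
        raise ValueError("no matched products between the two periods")
    return prod(b[p] / a[p] for p in matched) ** (1 / len(matched))


def geks_index(periods):
    """GEKS index, base 100 at the first period, over the whole window."""
    n = len(periods)
    levels = []
    for target in periods:
        # Route the comparison through every period in the window.
        routes = [
            bilateral(periods[0], link) * bilateral(link, target)
            for link in periods
        ]
        levels.append(100 * prod(routes) ** (1 / n))
    return levels


# Invented data: full prices never change, but each promo line launches
# at a discount, returns to full price, then is delisted.
periods = [
    {"staple": 1.00, "promo_b": 0.80},
    {"staple": 1.00, "promo_b": 1.00, "promo_c": 0.80},
    {"staple": 1.00, "promo_c": 1.00, "promo_d": 0.80},
    {"staple": 1.00, "promo_d": 1.00},
]
geks = geks_index(periods)
print([round(level, 1) for level in geks])
```

On this made-up data the GEKS figure still ends above 100 even though no full price has moved, but by much less than simply chaining the period-to-period links would give.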