Will ONS scraping and wrangling lower CPI inflation in the future?

On 8th June 2015, ONS published the results of a trial in which they measured inflation online for just under a year by directly sourcing price data from three UK supermarkets' websites. The purpose of the trial was:

  1. To develop and test methods of automatically collecting price information (“web scraping”)
  2. To determine its quality by looking at how much cleaning and manipulation it needed (“data wrangling”)
  3. To see how the results compared with the traditional CPI method.

In a nutshell, most of the price indices created this way showed lower rates of inflation, as the chart below illustrates.

[Chart: Scraping trial. Food, beverages and tobacco index, June 2014=100. Source: ONS.]

The trial covered 35 items in the food and alcoholic drinks sector. Some 2,886 products were included. Prices were collected daily and about 1.5 million price quotes in total were scraped. However, ONS discovered that the amount of wrangling needed to distil these down to quotes which could be correctly compared over time was very high, so much so that about half of the data had to be excluded.

Therefore any cost savings from not sending out real price collectors might be largely consumed by paying for back-office data wranglers to sort out the mess of the scrapings. For example, ONS said that rum often appeared in shops' listings for whisky, as did apple juice in their listings for apples. Add to that, supermarkets often changed their identifier codes for product lines, which made linking up the data automatically fraught with error.
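To give a feel for the kind of cleaning involved, here is a minimal sketch of a keyword filter that flags products sitting in the wrong listing. It is purely illustrative and not ONS's method: the `EXCLUDE_TERMS` table, the function name and the product names are all made up for the example.

```python
# Hypothetical exclusion terms per CPI item; a real list would be far longer.
EXCLUDE_TERMS = {
    "apples": ("juice", "sauce", "pie"),
    "whisky": ("rum", "vodka", "gin"),
}

def is_misclassified(item, product_name):
    """Return True if the product name contains a word that rules it out."""
    # Match on whole words so that "gin" does not fire on "ginger".
    words = product_name.lower().split()
    return any(term in words for term in EXCLUDE_TERMS.get(item, ()))

scraped = [
    ("apples", "Pink Lady Apples 4 Pack"),
    ("apples", "Pressed Apple Juice 1L"),
    ("whisky", "Spiced Rum 70cl"),
]
kept = [(item, name) for item, name in scraped if not is_misclassified(item, name)]
```

A filter like this only catches the mistakes someone has already thought of, which is part of why the wrangling bill can be so high.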

The concept of web scraping prices is not new. Arguably ONS is playing catch-up with this trial. In the US, MIT set up the Billion Prices Project (BPP), which has been successfully scraping prices there for seven years now and is also doing so in some South American countries. In the EU, the Dutch seem ahead of the game and are now using not only direct scraping but also collector-assisted web scraping interfaces to collect prices from places such as cinemas and for driving lessons – see recent research paper here.

The BPP data has not shown lower inflation with scraping. Indeed their indices in Argentina are higher than the official statistics, and in the US they also tend to be slightly higher. It is therefore probable that the lower inflation seen in the UK trial is related to some other factor.

The most likely reason is sample differences. The trial covered just three supermarkets, which more than likely included Britain's top retailers such as Tesco and Sainsbury's. It is likely that price reductions here were markedly larger than those seen elsewhere in the retail network. (I have asked ONS to quickly check this hypothesis, but apparently they don't have the resource to do so!)

Add to that, it is wrong to assume that UK retailers' online prices are necessarily the same as those in-store. In-store prices can vary by category of store (small metro stores being higher) and, to some extent, by region. The web price is most likely to exhibit the largest price reductions of all because of the competition faced online.

That said, there are many other differences between the two methods that might also have an impact. For example, in CPI, price collectors visit maybe 100 stores but they only collect the price of one item of each line, normally the most frequently bought. In the scraping trial, the prices of all lines are included, which could be dozens of items or even 100+. However, to my mind this logically argues for the normal CPI to be lower (not higher), as the main price reductions over the last year have probably focused on just a few key popular lines.

Having said that, ONS have only included in their main monthly analysis products that they scraped consistently over the trial period. In correspondence with me about it, they noted that they had to exclude a lot of products which were delisted periodically. A number of these were then relisted at higher prices. Indeed it is possible that retailers normally follow a policy of delisting temporarily before they raise prices. If true, this may be another significant factor in why the scraping trial appeared to show deflation (as items with price rises were systematically excluded).
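The exclusion effect is easy to illustrate with a toy calculation. The sketch below uses made-up prices for three hypothetical products and an unweighted geometric mean of price relatives (a Jevons-style index). The formula is my assumption for the illustration and may not be the one ONS used.

```python
from math import prod

# Hypothetical prices over four periods; None means the product was delisted.
prices = {
    "cola_2l": [1.00, 0.95, 0.95, 0.90],
    "bread_800g": [1.20, 1.20, 1.15, 1.15],
    "coffee_200g": [3.00, None, None, 3.60],  # delisted, then relisted higher
}

def jevons(price_pairs):
    """Geometric mean of (last / first) price relatives, scaled to base 100."""
    relatives = [last / first for first, last in price_pairs]
    return 100 * prod(relatives) ** (1 / len(relatives))

# Sample A: only products scraped in every period (the delisted one drops out).
consistent = [v for v in prices.values() if None not in v]
# Sample B: any product priced in both the first and the last period.
endpoints = [v for v in prices.values() if v[0] is not None and v[-1] is not None]

index_consistent = jevons([(v[0], v[-1]) for v in consistent])
index_endpoints = jevons([(v[0], v[-1]) for v in endpoints])

print(f"consistently scraped only: {index_consistent:.1f}")
print(f"including relisted item:   {index_endpoints:.1f}")
```

With these invented numbers the consistently scraped sample falls below 100 while the sample that keeps the relisted product rises above it, which is the direction of bias described above.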

Note that the method for dealing with out-of-stock products is quite different in CPI. When a product cannot be found by a price collector, they carry on using the old price for up to three months. After that period, they start again by tracking a different product.
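As a rough sketch of that carry-forward rule (the function name and the `max_gap` parameter are mine, and the real CPI procedure has more detail than this), missing monthly prices can be filled like so:

```python
def carry_forward(observed, max_gap=3):
    """Fill missing monthly prices with the last seen price for up to max_gap months.

    A None in the result marks a month where the gap has run out (or nothing
    has been observed yet), i.e. where a replacement product would be needed.
    """
    filled = []
    last_price = None
    gap = 0
    for price in observed:
        if price is not None:
            last_price, gap = price, 0
            filled.append(price)
        elif last_price is not None and gap < max_gap:
            gap += 1
            filled.append(last_price)
        else:
            gap += 1
            filled.append(None)
    return filled

# Product missing for four months: three are carried forward, the fourth is not.
print(carry_forward([1.0, None, None, None, None, 1.2]))
```

The contrast with the scraping trial is that a temporarily missing product stays in the CPI sample at its old price, rather than dropping out of the comparison altogether.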

Day of collection is another possible factor. CPI is normally collected on the second or third Tuesday of the month. The web-scraped data was collected every day. But again there is no logical reason for this to affect the numbers. (Though apparently when ONS tried to compile a daily price index, they ended up with much higher inflation rates because prices were being chain-linked so often.)
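Why frequent chain-linking can inflate an index is easiest to see with a toy example. The sketch below uses an arithmetic mean of price relatives (a Carli-style index), chosen only because it shows the effect cleanly; I do not know that this is the formula ONS used for their daily index, and the prices are invented.

```python
def carli(prev, curr):
    """Arithmetic mean of price relatives between two days."""
    relatives = [c / p for p, c in zip(prev, curr)]
    return sum(relatives) / len(relatives)

# Two hypothetical products over three days; the first goes on a half-price
# offer for one day and then returns to its original price.
daily_prices = [
    [1.00, 1.00],
    [0.50, 1.00],
    [1.00, 1.00],
]

# Chained index: multiply the day-on-day links together.
chained = 100.0
for prev, curr in zip(daily_prices, daily_prices[1:]):
    chained *= carli(prev, curr)

# Direct index: compare the last day straight with the first.
direct = 100.0 * carli(daily_prices[0], daily_prices[-1])

print(chained, direct)  # 112.5 100.0
```

Every price ends where it started, yet the chained index has risen by 12.5% while the direct comparison shows no change. With supermarket offers coming and going daily, drift of this kind can build up quickly.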

All the above said, moving towards more collection of data online seems a sensible direction of travel for ONS in the 21st century. It has the potential to harness a much broader set of prices that might more accurately represent the multiplicity of choice the consumer faces today. Also, collecting across the month has the potential to remove some of the weird quirks you sometimes see in the monthly data due to, for example, the timing of Easter.

However, rather than scraping massive amounts of data automatically, the answer is probably to create systems that a human price collector can use online, so that the process is more efficient and representative but the quality of the data is maintained, rather than trying to wrangle it into some shape later. Also, any move to reduce the number of sampling points for each item should be resisted.

So, returning to the initial question of whether scraping will result in lower inflation rates in the future, I sincerely hope not. However, one is reminded that even slight differences in the way you calculate inflation can have a profound effect on the numbers that come out the other end.

As Paul Johnson, director of the Institute for Fiscal Studies, noted in the FT, “the real finding of the initial research was not that inflation is too high, but the method of collecting prices matters rather a lot”.