Hans' Blog

Reading Prices in Grocery Stores using AI

As part of a personal project I needed to autonomously read the contents of pricetags in grocery stores. The exact layout and contents of these vary by store, but they are generally epaper tags with at least the price, product name and some product number.

Image showing a grid of different pricetags taken in different conditions to illustrate diversity.
Some examples of the different tags, layouts and conditions seen in the wild. A different model crops the regions.

Images are often bad, and the model needs to handle this gracefully: glare, blur, occlusions and even mirrored captures (some shelves have mirrors) all show up in real data.

Base model #

The base model I chose for this task was donut-base.

Donut is an encoder-decoder transformer model, with a Swin Transformer serving as the encoder and BART serving as the decoder. The base model has been pretrained on a large synthetic dataset to do OCR.

We will fine-tune the model to extract semantic data from the tags.

Barcodes #

You may have noticed most of the pricetag layouts contain some form of barcode. This is sometimes an EAN-13 barcode containing the product number, sometimes an ITF barcode containing an internal store number, and sometimes something else.

We want the model to be able to read the entire tag, including barcodes.

Primer on barcodes #

1D barcodes come in various forms, each differing in data capacity, error correction levels, and encoding complexity.

Beneath the encoding schemes, 1D barcodes are represented as a linear sequence of bars of varying lengths. Data is encoded into the width of the individual black and white bars. Some barcodes use only narrow and wide bars, while others use several bar widths to encode symbols.

The number 75 encoded using a Code 128 barcode. The numbers represent the relative widths of the black and white bars.

The figure above is a Code 128 barcode. We can see that this format uses four different widths for both the black and the white bars to represent the data.

So how do we tokenize this data for the model?

The obvious approach (and why it doesn't work) #

The approach that immediately comes to mind is to introduce a set of tokens, each representing a bar of a certain width.

<b1/> - black bar of width 1
<b2/> - black bar of width 2
<b3/> - black bar of width 3
<b4/> - black bar of width 4

<w1/> - white bar of width 1
<w2/> - white bar of width 2
<w3/> - white bar of width 3
<w4/> - white bar of width 4

These tokens can then be assembled into a sequence, probably surrounded by a pair of <barcode>/</barcode> tokens to mark the beginning and end of the barcode.

In reality, a model trained with this tokenization scheme would not work well. Absolute widths only have meaning relative to the barcode's overall scale, so before emitting the very first token the model would have to look across the entire barcode to establish that scale.

This puts a heavy computational load on the first token in the barcode, and gives the model no ability to self-correct.

If what we fundamentally care about is the relative width of bars, how can this be best represented for the model?

Relative encoding scheme #

A more effective approach is using tokens that represent the relative width of bars, contrasting each bar against its neighbors, rather than defining absolute widths.

I chose to make a single token look at a set of 3 bars, two black and the white between them. Each token uses the initial black bar as a reference, and encodes the relative widths of the succeeding white and black bars.

<b0,0/> - all 3 bars are the same width
<b1,0/> - the white bar is 1 unit wider than the reference bar
<b-1,0/> - the white bar is 1 unit narrower than the reference bar
<b0,-1/> - the second black bar is 1 unit narrower than the reference bar
+ all other relative widths

Using this scheme, we end up with more tokens (7×7 = 49 vs. 4×2 = 8), but they are much easier for the model to utilize.
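To make the scheme concrete, here is a minimal sketch of such an encoder in Python. The token format and the `encode_bars` helper are my own illustration of the scheme described above, not code from the project; widths are the integer module counts of the bars, alternating black/white, starting and ending with black.

```python
def encode_bars(widths):
    """Turn alternating black/white bar widths into relative tokens.

    Each token covers three bars (black, white, black): the first black
    bar is the reference, and the token stores how much wider (+) or
    narrower (-) the white bar and the second black bar are. Consecutive
    tokens overlap on one black bar, which becomes the next reference.
    """
    tokens = []
    for i in range(0, len(widths) - 2, 2):
        ref, white, black = widths[i], widths[i + 1], widths[i + 2]
        tokens.append(f"<b{white - ref},{black - ref}/>")
    return tokens


# e.g. bars of widths black=2, white=1, black=3, white=1, black=2
print(encode_bars([2, 1, 3, 1, 2]))  # ['<b-1,1/>', '<b-2,-1/>']
```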

Since this is a different format than the normal decoding algorithms for most barcode formats take by default, a conversion step has to be done in code. This is overall not very complicated.
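As a sketch of that conversion (the function name and the assumption that the first bar's width is known, e.g. from a format's fixed start pattern, are mine, not the project's code):

```python
def tokens_to_widths(tokens, first_bar_width):
    """Recover absolute bar widths from relative tokens.

    Each token's reference is the last black bar recovered so far, so
    the whole sequence unrolls once the first bar's width is fixed.
    Standard barcode decoders can then consume the absolute widths.
    """
    widths = [first_bar_width]
    for tok in tokens:
        # "<b-1,1/>" -> deltas (-1, 1) for the white and next black bar
        dw, db = map(int, tok[2:-2].split(","))
        ref = widths[-1]
        widths.append(ref + dw)  # white bar
        widths.append(ref + db)  # next black bar (next reference)
    return widths
```

Round-tripping through the encoder and this decoder reproduces the original width sequence, which is an easy property to unit-test.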

Dataset #

I generated a synthetic dataset of 10000 document images containing both text and barcodes to train the model to read barcodes. The dataset was generated using a modified version of synthdog, the tool the authors of donut used to train the base model.

Example image from the dataset. You can see mixed text and barcodes.

The full dataset is uploaded here.

Fine tuning for pricetags #

As with training the model for reading barcodes, we need to decide on two main points when finetuning for our specific purpose:

Pricetag tokenization scheme #

I went with an XML-ish tokenization scheme, where fields are represented using distinct open and close tokens.


Each of the XML tags in the text above is a separate token.
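As an illustration, the field vocabulary could be built like this. The field names are taken from the outputs shown in the Results section; the helper and the commented Hugging Face registration lines are my own sketch, not the project's code.

```python
# Field names observed in the model outputs; one open + one close token each.
FIELDS = ["heading", "subheading", "cost", "alt_cost", "product_number",
          "barcode_text", "product_barcode_text", "text"]

def field_tokens(fields):
    """Build the distinct open/close special tokens for each field."""
    return [t for f in fields for t in (f"<{f}>", f"</{f}>")]

# With a Hugging Face tokenizer these would be registered roughly like:
#   processor.tokenizer.add_special_tokens(
#       {"additional_special_tokens": field_tokens(FIELDS)})
#   model.decoder.resize_token_embeddings(len(processor.tokenizer))
```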

Training data #

Although generating synthetic data for pricetag reading was an option, I concluded that the return on investment would be higher with real-world data annotation due to its inherent accuracy and diversity. I believed this to be a better use of my time because:

  1. The main bulk of teaching the model to read is done on the synthetic dataset; the dataset required to teach the model the specific pricetag layouts should not need to be massive.
  2. By using real-world data we don't have to worry about subtle biases in synthetic data; the data we train on is guaranteed to be in distribution.

I found that around 100 hand-annotated samples were enough for the model to start performing reasonably well. After this point I started using the model to interactively assist me in annotating more data. After around 500 human-validated samples the model was more than good enough to be deployed.

Results #

Here I have deliberately picked challenging cases to show how the model performs.

Glare obscures large parts of the tag. Future work might involve introducing tokens which allow the model to express obstructions.
<heading> RØ T.MOZZARELLA</heading>
<subheading> KTG TINE</subheading>
<cost> 73.90 Pr/Stk</cost>
<product_barcode_text> 70.90</cost>
Tag with moderate amounts of blur.
<heading> FRIGGS MAISKAKER</heading>
<subheading> 25 g cheese glutenfri</subheading>
<alt_cost> Pr/KG 14.90</alt_cost>
<cost> 14.90</cost>
<product_number> 7075062854306</product_number>
<barcode_text> 518546</barcode_text>
<text> ASKO NT</text>
Sometimes shelves have mirrors, this tag is mirrored. Notice the mismatched field start/end tokens and bogus data.
<heading> J 98r 603 2个月</heading>
<cost> 87. V23003400090</product_barcode_text>
<cost> 08.81.905</product_barcode_text>
The offer text on top is rare in stores, and is out of distribution. Ideally it should be annotated as its own field. The price fields are also malformed.
<heading> 2.for 30.7 MONSTER ULTRA WATERMELON</heading>
<subheading> 0.5 l bx</subheading>
<cost> 27.90</cost>
<product_number> 5060896621869</product_number>
<barcode_text> 5931068</barcode_text>
<cost> 27.90 Pr/L55,80 +pant</cost>
Tag with moderate amounts of blur.
<heading> CHIAKNEKKEBRØD</heading>
<subheading> GLUTENFRI 140G SEMPER</subheading>
<cost> 28.40 Pr/Stk</cost>
<alt_cost> 202.86 Pr/kg</alt_cost>
<product_barcode_text> 7310100602282</product_barcode_text>
Tag with large amounts of blur. The model goes into a loop in the <cost> field, generation is terminated after the field length limit is exceeded.
<heading> KMEKOKE BRØD</heading>
<subheading> 405kWBØRØ 166836.90.905</barcode_text>
<cost> 36.50 Pr/Stk 385k 385k 385k 385k 385k 385k 385k 385k 385k 385k 385k 385

The model performs more than satisfactorily in the real world.

I considered doing more rigorous benchmarking of the model, but since this is a hobby project I would rather spend my time on other parts of it now that this model works well enough.

Deployment #

The model is mainly deployed in two different settings:

For both of these I use the service, which I have been reasonably happy with. I have managed to get cold start down to around 20 seconds, but there is probably more room for improvement here since the model is only around 600 MB.

Future work #

There are many opportunities for improvement on the model itself. A lot of these relate to the format of the sequence the model is trained to output.

OCR from multiple images #

The nature of having an app running in real time on a mobile device means there are often multiple captures available for a given tag. In many cases each capture contains glare or occlusions over different areas of the tag.


It should be possible to train the model to predict a special token whenever it is unsure about a part of the sequence. A special sampling strategy could then be used to decode in tandem with multiple captures.
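A minimal sketch of what one step of such tandem decoding could look like. The `<unc/>` token name and the majority-vote combination are my assumptions for illustration; one could just as well average log-probabilities across captures.

```python
from collections import Counter

UNC = "<unc/>"  # hypothetical "I'm unsure here" special token

def tandem_step(candidates):
    """Pick the next output token given each capture's greedy candidate.

    Captures that emit the uncertainty token abstain; the remaining
    captures vote, so glare on one photo gets outvoted by clean ones.
    """
    votes = Counter(t for t in candidates if t != UNC)
    if not votes:
        return UNC  # every capture was unsure at this position
    return votes.most_common(1)[0][0]
```

A full decoder would run one such step per position, feeding the agreed-upon token back into every capture's decoder state.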

Training the model in this regime is probably a bit more tricky, but doable. I imagine a four-step process:

  1. Images are captured and manually annotated. These captures should be fully readable.
  2. Introduce artificial imperfections (occlusion/glare/scratches), generating several variations for each annotated capture. Imperfections can be placed randomly.
  3. Decode both the original capture and the augmented capture with the model we already trained. Wherever the two outputs differ is where we want to train the model to predict our new "uncertainty" token.
  4. Continue training the model on these augmented samples.
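Step 3 could be sketched like this, assuming for simplicity that the two decoded sequences align token-for-token (in practice an edit-distance alignment would be safer); the `<unc/>` token name is my own placeholder:

```python
UNC = "<unc/>"  # hypothetical uncertainty token

def uncertainty_targets(clean_out, augmented_out):
    """Build training targets for the augmented capture.

    Wherever the already-trained model's output on the augmented image
    disagrees with its output on the clean image, the target becomes
    the uncertainty token instead of a (probably wrong) guess.
    """
    return [c if c == a else UNC
            for c, a in zip(clean_out, augmented_out)]
```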

This should hopefully train the model to do two major things:

Sampling strategy #

Right now I sample the model using simple greedy sampling plus per-field and total generation length limits.
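Roughly, that loop looks like the sketch below. The token conventions and the `next_token` callback are placeholders for the real decoder, not the project's code; the forced close mirrors how a runaway field (like the looping `<cost>` output above) gets cut off.

```python
def sample(next_token, field_limit=8, total_limit=64):
    """Greedy decoding with per-field and total length limits.

    next_token: callable returning the greedy next token for the
    sequence generated so far.
    """
    seq = []
    open_field, field_len = None, 0
    while len(seq) < total_limit:
        tok = next_token(seq)
        if tok == "</s>":
            break
        if open_field is not None:
            if tok == f"</{open_field}>":
                open_field, field_len = None, 0
            else:
                field_len += 1
                if field_len > field_limit:
                    # force-close a runaway field instead of appending tok
                    seq.append(f"</{open_field}>")
                    open_field, field_len = None, 0
                    continue
        elif tok.startswith("<") and not tok.startswith("</") \
                and not tok.endswith("/>"):
            open_field, field_len = tok[1:-1], 0
        seq.append(tok)
    return seq
```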

A lot of clever stuff can be done, especially when it comes to barcodes with error correcting mechanisms.

Conclusion #

Overall, training the model has been a large success. It performs a lot better than any off the shelf OCR solution for my very specific use case, and is relatively cheap and quick to run.