
How do you choose products in stores?

SAS corporate blog, Data Mining, Machine learning, Social networks and communities
Original author: Dreamastiy

«The most important single ingredient in the formula of success is knowing how to get along with people.» Theodore Roosevelt

In the previous article I tried to cover the basics of pricing analytics. Now I'd like to talk about something more interesting.

Have you ever thought about why you choose certain products in stores, why you prefer them to other similar ones? Many shopping trips are spontaneous, so it's probably impossible to give a clear answer for all the times you go shopping. But the general idea is obvious: you go shopping for a specific reason (to get food, a gadget, for entertainment, to play blackjack). In this article I'm going to use available data from grocery retailers to talk about how a set of basic logical assumptions and community analysis can help us determine the way customers choose products.


When it comes to classic stories about retail, I can't help but think about recommendation systems, which have relied on receipt analysis for a long time: the famous stories about Target coupons, or about beer and diapers.


These cases use the well-known Market Basket Analysis (MBA), or affinity analysis, approach. The main idea is to develop a set of rules that look like «when they buy X, they usually buy Y» and then use them in further operations (personal recommendations, visual merchandising, etc.). These rules are used to determine complements, i.e. goods which complement each other. This approach is quite popular because it is easy to implement and its results are easy to interpret. The problem is that it's not always clear how to put the findings to use, or how to identify substitute goods in addition to complements. Let's try to improve this approach: we can group products based on customer needs and then figure out how consumers make purchasing decisions.

Making MBA more complex, determining substitute goods

Let's make the MBA approach a bit more complex and study the information from loyalty cards issued by many retailers (for online stores, you could use customer IDs). We can carry out an MBA analysis for the loyalty cards instead of the receipts (using card ID/customer ID instead of receipt number). This will give us pairs of products that are related on a customer level, i.e. if a customer buys X, they also buy Y. The key here is that they may buy Y on a different trip to the store.
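To make the difference between the two levels concrete, here is a minimal sketch that counts product pairs once per receipt and once per loyalty card. The purchase log, its column order and the `pair_counts` helper are made up for illustration; real data would come from the retailer's transaction tables.

```python
from collections import Counter
from itertools import combinations

# Hypothetical purchase log: (card_id, receipt_id, product)
purchases = [
    ("card1", "r1", "detergent 150"),
    ("card1", "r1", "bread"),
    ("card1", "r2", "detergent 300"),
    ("card2", "r3", "detergent 150"),
    ("card2", "r4", "detergent 300"),
    ("card2", "r4", "bread"),
]


def pair_counts(rows, key_index):
    """Count, for each unordered product pair, how many baskets contain both.

    A "basket" is every product sharing the same value in column key_index:
    0 groups by loyalty card, 1 groups by receipt.
    """
    baskets = {}
    for row in rows:
        baskets.setdefault(row[key_index], set()).add(row[2])

    counts = Counter()
    for products in baskets.values():
        # Sorting gives every pair one canonical (a, b) order.
        for pair in combinations(sorted(products), 2):
            counts[pair] += 1
    return counts


by_receipt = pair_counts(purchases, key_index=1)
by_card = pair_counts(purchases, key_index=0)

pair = ("detergent 150", "detergent 300")
print(by_receipt[pair], by_card[pair])
```

In this toy log the two detergents never share a receipt, yet both cardholders buy both of them, which is exactly the kind of pair the customer-level view brings out.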

Let's think about how we can determine substitute products. We can make a logical assumption that people don't tend to buy substitute goods together (I assume you don't buy 150 and 300 fl oz laundry detergent at the same time very often). This is the most important assumption in the entire analysis, and it works very well for grocery/household goods retailers and, with a few adjustments, for other retailers as well. This assumption allows us to conclude that if the same customers often buy two particular products, but those two products are rarely found in one receipt, then they are probably substitutes. This is quite a serious claim that requires a prior qualitative analysis of pairs: we need to eliminate statistically insignificant pairs, remove the «bananas» (products so popular that they pair with almost anything), etc. For the remaining connections, we can introduce a W metric that reflects how much more often products are bought within one loyalty card than within one receipt.

In the end we'll have pairs of products that look like «products X and Y rarely occur in one receipt, but are often bought by the same people» with a certain W connection metric. The higher the connection metric, the more confident we can be that these products are substitutes.
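The exact formula for W is not spelled out here, so the sketch below is only one plausible reading: the share of cards that contain both products, divided by a smoothed share of receipts that contain both. The function names, the smoothing term, the `min_cards` support threshold and all the numbers are assumptions for illustration.

```python
def w_metric(cards_both, n_cards, receipts_both, n_receipts, smoothing=1.0):
    """Assumed W: card-level co-occurrence share over receipt-level share.

    The smoothing term keeps the ratio finite for pairs that never
    share a receipt.
    """
    card_share = cards_both / n_cards
    receipt_share = (receipts_both + smoothing) / (n_receipts + smoothing)
    return card_share / receipt_share


def substitute_candidates(pair_stats, n_cards, n_receipts, min_cards=10):
    """Drop pairs with too little support, then rank the rest by W."""
    scored = {
        pair: w_metric(cards_both, n_cards, receipts_both, n_receipts)
        for pair, (cards_both, receipts_both) in pair_stats.items()
        if cards_both >= min_cards  # crude filter for insignificant pairs
    }
    return sorted(scored.items(), key=lambda item: item[1], reverse=True)


# Hypothetical counts: pair -> (cards with both, receipts with both)
pair_stats = {
    ("detergent 150", "detergent 300"): (40, 1),
    ("bread", "milk"): (60, 55),
    ("rare a", "rare b"): (2, 0),
}

ranked = substitute_candidates(pair_stats, n_cards=100, n_receipts=1000)
for pair, w in ranked:
    print(pair, round(w, 1))
```

Note that even an ordinary pair like bread and milk gets W above 1 under this definition, simply because a card accumulates many receipts. What matters is the ranking and where you put the cut-off, not the absolute value.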

From MBA to SNA

The next logical step is to look at all pairs of goods as a whole. We can represent each pair as an edge of a graph, weighted by its W value. If we create a visual representation of all the connections, it will look something like this:


Here we can clearly see the groups of products that have strong connections. Let's apply SNA (social network analysis) algorithms and take a look at the results. I've used the Louvain method as an example. We should end up with groups of substitute products. Let's look at the potential result:

• DANONE ACTIVIA cherry 2.9% 150 g
• DANONE ACTIVIA strawberry 2.4% 150 g
• DANONE ACTIVIA blueberry 2.9% 150 g
• DANONE ACTIVIA muesli 2.4% 150 g
• DANONE ACTIVIA fiber and cereal 2.9% 150 g

The results look promising: these products indeed look like substitutes that cover the customer need for DANONE yogurts. All product groups determined in the analysis are in line with the intuitive perception of substitute goods. There are, of course, some less obvious examples of products that the retailer has assigned to different groups, partially because of the brand, but from the consumer perspective they still cover the same need:

• Lux Face Moisturizer for dry skin
• Yantar Face Moisturizer for normal to dry skin
• Nevskaya Kosmetika Carrot Face Moisturizer for dry and sensitive skin
• Nevskaya Kosmetika Cucumber Face Moisturizer for oily and combination skin
• Nevskaya Kosmetika Olive Face Moisturizer for dry and normal skin
• Nevskaya Kosmetika Ginseng Eye Cream
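For readers who want to see what the clustering step optimizes: the Louvain method greedily searches for a partition with high modularity. The sketch below does not implement Louvain itself, it only computes weighted modularity for a given partition, so you can compare a sensible grouping with a trivial one. The edge weights and product names are invented; in practice you would hand the W-weighted graph to a library implementation (NetworkX, for example, ships one).

```python
from collections import defaultdict


def modularity(edges, community_of):
    """Weighted modularity of a partition of an undirected graph.

    edges: {(u, v): weight}, each undirected edge listed once, no self-loops.
    community_of: {node: community label}.
    """
    total_weight = sum(edges.values())  # m in the usual formula
    internal = defaultdict(float)       # edge weight inside each community
    degree_sum = defaultdict(float)     # summed weighted degree per community

    for (u, v), w in edges.items():
        degree_sum[community_of[u]] += w
        degree_sum[community_of[v]] += w
        if community_of[u] == community_of[v]:
            internal[community_of[u]] += w

    return sum(
        internal[c] / total_weight - (degree_sum[c] / (2 * total_weight)) ** 2
        for c in degree_sum
    )


# Hypothetical W-weighted pairs: two tight groups joined by one weak edge.
edges = {
    ("yogurt a", "yogurt b"): 5.0,
    ("yogurt b", "yogurt c"): 5.0,
    ("yogurt a", "yogurt c"): 5.0,
    ("cream x", "cream y"): 4.0,
    ("cream y", "cream z"): 4.0,
    ("cream x", "cream z"): 4.0,
    ("yogurt c", "cream x"): 0.5,
}

by_need = {node: node.split()[0] for pair in edges for node in pair}
all_in_one = {node: "everything" for node in by_need}

q_by_need = modularity(edges, by_need)
q_all_in_one = modularity(edges, all_in_one)
print(round(q_by_need, 3), round(q_all_in_one, 3))
```

The split into yogurts and creams scores clearly higher than lumping everything together, which is the signal a modularity-based algorithm follows when it forms groups of substitutes.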

Now, for the hierarchy

The Louvain method can be used to create hierarchies of product groups. In simple terms, let's build product groups of different sizes, turn them into a tree (customer decision tree) and look at the results:


Yes! Our tree can be easily interpreted in terms of both business logic and intuition – consumers know they want condensed milk, then they choose between a can and a doypack, choose the price, and they're ready to buy. Now we know what criteria people use to satisfy their need for condensed milk – the type of packaging and the price. In this particular example, the choice was not determined by brand or anything else that people can often attribute to products.
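Mechanically, turning nested groups into a tree is simple. Here is a minimal sketch, assuming every product already has a path of group labels from the coarsest level to the finest. The clustering itself only yields unlabeled groups, so the labels below (and the products) are hypothetical human interpretations, not output of the analysis.

```python
def build_tree(paths):
    """Turn {product: [coarse_group, ..., fine_group]} into a nested dict.

    Inner levels are dicts keyed by group label; the finest level maps
    a label to the list of products in that group.
    """
    tree = {}
    for product, path in paths.items():
        node = tree
        for label in path[:-1]:
            node = node.setdefault(label, {})
        node.setdefault(path[-1], []).append(product)
    return tree


# Hypothetical products and group labels, coarse to fine.
paths = {
    "product A": ["condensed milk", "can", "low price"],
    "product B": ["condensed milk", "can", "high price"],
    "product C": ["condensed milk", "doypack", "low price"],
    "product D": ["condensed milk", "doypack", "low price"],
}

tree = build_tree(paths)
print(tree)
```

Reading the result from the root down reproduces the decision sequence: need first, then packaging, then price.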

Nice tree, what's next

This tree helps us determine customer needs (lower levels of the tree) and product characteristics that affect the final choice (according to the tree hierarchy). The results can be applied to different areas of retail:

  • ideally, at least one product should cover each need. So, every store in the chain should have goods that cover customer needs. Instead of having 20 cans of condensed milk, it's better to have 10 cans and 10 doypacks.
  • within one customer need, products have the highest cannibalization rate. This narrows down the set of products for which we need to calculate cross-effects for pricing and demand forecasting.
  • this tree helps with visual merchandising (or placement of products online)
  • for personal recommendations, it's an addition to the classic MBA and helps form cross-sale offers
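The first point in the list above is easy to check automatically. A minimal sketch, assuming we already have a product-to-need mapping from the tree and each store's assortment as a set of products; all names and data here are hypothetical:

```python
def uncovered_needs(need_of, assortment):
    """Return the needs for which the store stocks no product at all."""
    all_needs = set(need_of.values())
    covered = {need_of[p] for p in assortment if p in need_of}
    return all_needs - covered


# Hypothetical mapping taken from the leaves of a customer decision tree.
need_of = {
    "condensed milk can 1": "condensed milk, can",
    "condensed milk can 2": "condensed milk, can",
    "condensed milk doypack 1": "condensed milk, doypack",
}

# Hypothetical store assortments.
stores = {
    "store 1": {"condensed milk can 1", "condensed milk can 2"},
    "store 2": {"condensed milk can 1", "condensed milk doypack 1"},
}

gaps = {name: uncovered_needs(need_of, items) for name, items in stores.items()}
print(gaps)
```

Store 1 stocks two cans but no doypack, so the doypack need shows up as a gap, while store 2 covers both needs with the same number of products.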

To sum up: we've made the classic MBA slightly more complex, and achieved results that can be used in different retail operations. It's been quite an interesting task – I've had to apply logical thinking, analyze data and cluster graphs.

I hope you enjoyed it! Optimize processes, cluster graphs, optimize data storage (because Garbage In, Garbage Out) and get amazing results.