Market Basket Analysis in Alteryx.

Note: this blog focuses on ‘association rules’ analysis within the wider ‘market basket analysis’ field.


Market basket analysis is a technique used to assess the likelihood of particular products being bought together.

It doesn’t have to be at the product level either; you can assess which colours of items people buy together, or which types of items people buy together.

This can be hugely valuable to all manner of industries, but most prominently retail.

A good retail example would be a department store (think John Lewis, House of Fraser, or for my US friends, Bloomingdales) which sells goods from a number of brands. If they can assess the likelihood of their customers buying particular brands together, then they can optimize the store layout accordingly, and thus maximize the potential for sales.

Market basket analysis is a hugely discussed topic online, and it’s easy to find examples of real-life use cases where it can benefit businesses. This blog is a great place to start.


So how does market basket analysis work?

Firstly we start with our items. An item is a single product; we want to see whether the purchase of this item is influenced by the purchase of another item or items.

In order to do this we need a set of transactions. A transaction allows us to identify which products are bought together through a unique identifier.

Most commonly a transactional dataset will look something like this…

[Screenshot: sample transactional dataset]
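For readers without Alteryx to hand, here is a minimal Python sketch (the item names are hypothetical, but the shape matches the TransactionID/Item columns the R script below expects) showing how a flat transactional table groups into one basket per transaction:

```python
# Hypothetical flat transactional data: one row per (transaction ID, item) pair.
rows = [
    ("T1", "Polo Ralph Lauren"),
    ("T1", "Armani Jeans"),
    ("T1", "Hugo Boss"),
    ("T2", "Armani Jeans"),
    ("T2", "Hugo Boss"),
]

# Group into one basket (a set of items) per transaction ID.
baskets = {}
for txn_id, item in rows:
    baskets.setdefault(txn_id, set()).add(item)

print(baskets)
```

This is exactly what the `split()` call in the R script later does for us in one line.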

From this dataset we can create a list of rules. Essentially a rule is the relationship between the item and other item(s).

{Polo Ralph Lauren, Armani Jeans} => {Hugo Boss}

That is: how does buying Polo Ralph Lauren and Armani Jeans items influence the purchase of Hugo Boss clothing? The above rule can also be referred to as an ‘item set’.

For each rule we generate three key metrics, ‘Support’, ‘Confidence’ and ‘Lift’. These metrics help us define just how significant a relationship there is between the two sides.

Support is the percentage of transactions in which the rule’s item set appears. It is ideal to have item sets with large support values.

Confidence is the probability that, if the items on the left side of the rule appear in a transaction, the item on the right side also appears in that transaction.

Finally, lift is the probability that all the items appear in the same transaction (our support value), divided by the probability of the items in that item set occurring independently. A lift value of greater than 1 suggests at least ‘some usefulness’ in the rule.
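To make the three metrics concrete, here is a toy Python illustration (made-up data, not the Alteryx macro) computing support, confidence and lift for the rule {A, B} => {C} over five transactions:

```python
# Five toy transactions, each a set of items.
transactions = [
    {"A", "B", "C"},
    {"A", "B"},
    {"A", "C"},
    {"B", "C"},
    {"A", "B", "C"},
]
n = len(transactions)

def support(items):
    """Fraction of transactions containing every item in `items`."""
    return sum(items <= t for t in transactions) / n

lhs, rhs = {"A", "B"}, {"C"}
sup = support(lhs | rhs)   # P(A, B and C appear together) = 2/5
conf = sup / support(lhs)  # P(C present | A and B present) = 2/3
lift = conf / support(rhs) # observed vs. what independence would predict
print(sup, conf, lift)
```

Here lift is roughly 0.83, i.e. below 1, so in this toy data buying A and B together actually makes C slightly *less* likely than its baseline popularity would suggest.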


Completing Market Basket Analysis in Alteryx.

Well that’s easy thanks to this Alteryx macro that I have created (Alteryx also have a series of Market Basket tools available, but I decided to build my own in order to aid my understanding of the mechanics required to build association rules).

But how does it work? Well, actually the heavy lifting is done using the R tool (though I’d be massively interested to see if anyone can repeat this solely with native Alteryx tools).

The following script will run successfully off an input that looks something like…

[Screenshot: sample input data]

[Screenshot: the macro in Alteryx]

So let’s break down the script step by step.

## Load the arules (association rules) library, library can be downloaded from https://cran.r-project.org/web/packages/arules/index.html

library(arules)

## Read in the Alteryx data stream as a data frame.

data <- read.Alteryx("#1", mode="data.frame")

## replace the transaction IDs with numeric IDs as required for a table of transaction class.

data$Num <- as.numeric(factor(data$TransactionID,levels=unique(data$TransactionID)))

## Create a single vector for each transaction which contains a list of items within it

AggPosData <- split(data$Item,data$TransactionID)

## Convert our data into a object of transaction class

Txns <- as(AggPosData, 'transactions')

## Compute market basket analysis using apriori algorithm

MarketBasket <- apriori(Txns, parameter = list(sup = 0.00001, conf = 0.5, maxlen = 3, target = 'rules'))

## Convert the output to a data frame

MarketBasketData <- as(MarketBasket, 'data.frame')

## Write the data out into an Alteryx data stream

write.Alteryx(MarketBasketData, 1)

[Screenshot: the output of the R script]


Parameterising Support, Confidence and Length.

Within the R script you may have noticed that, when computing the actual market basket analysis using the apriori algorithm, I passed in the parameter ‘sup’. This parameter represents the minimum level for our key metric, support.

MarketBasket <- apriori(Txns, parameter = list(sup = 0.00001, conf = 0.5, maxlen = 3, target = 'rules'))

This value is passed into the algorithm to ensure it runs in a reasonable amount of time; market basket analysis is traditionally resource intensive.

The reason is simple: as the number of different ‘items’ increases, the number of possible item sets, or rules, rises exponentially.

Due to the nature of the support metric, as soon as we find a single item set (or item) with support lower than the specified threshold, we know that it cannot be contained in any larger frequent item set, because the larger set’s support can only be lower still (remember that support is essentially the % of total transactions that the item set appears in). This lets the algorithm prune huge parts of the search space.
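This pruning idea (the heart of apriori) can be sketched in a few lines of Python on made-up data — with n items there are 2^n − 1 possible non-empty item sets, so discarding items early matters enormously:

```python
from itertools import combinations

# Toy transactions (assumed data, not the blog's macro).
transactions = [{"A", "B"}, {"A", "C"}, {"A", "B"}, {"B", "C"}, {"D"}]
n = len(transactions)
min_support = 0.4

def support(items):
    """Fraction of transactions containing every item in `items`."""
    return sum(items <= t for t in transactions) / n

# Level 1: keep only the single items that meet the threshold.
items = {i for t in transactions for i in t}
frequent = {frozenset([i]) for i in items if support({i}) >= min_support}

# Level 2: build candidate pairs only from surviving items. "D" has
# support 0.2 < 0.4, so no pair containing "D" is ever even counted.
candidates = {a | b for a, b in combinations(frequent, 2)}
frequent_pairs = {c for c in candidates if support(c) >= min_support}
print(sorted(tuple(sorted(c)) for c in frequent_pairs))
```

A real implementation such as the arules package repeats this level-by-level until `maxlen` is reached, then derives rules from the surviving item sets.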

Further to this, you will also notice that ‘conf’ (confidence) and ‘maxlen’ (the maximum length of an item set) are also parameterised. This ensures the output data is trimmed to just what we need. Remember, confidence essentially refers to how confident we can be that the relationship is significant, therefore lower values are largely irrelevant for further analysis. Meanwhile, it is almost impossible to instigate change that caters for all the conditions of a large item set.


Visualising the output.

Traditionally, people visualise market basket analysis with a scatter plot, which allows them to encode each of our three key metrics, with each point representing a rule.

[Screenshot: scatter plot of association rules]

Of course this is easily repeatable in Tableau, and Tableau’s interactive filters can be used to define further thresholds, beyond those passed into the R script, to identify rules of interest to the end user.

[Screenshot: the scatter plot rebuilt in Tableau]

I feel that whilst this is a hugely impactful way of showing the data, a list of all the rules and their support, confidence and lift values can also be hugely beneficial, especially in a dynamic platform such as Tableau, where we can allow our audience to select which metric they wish to sort their rules by.

Other people have created network diagrams of association rules (with each node representing a different item set), but of course these are notoriously difficult to visualise in an impactful manner within Tableau.

Ben.

 
