Market Basket Analysis in Alteryx.

Note. This blog focusses on ‘association rules’ type analysis within the wider ‘market basket analysis’ field.


Market basket analysis is a technique used to assess the likelihood of particular products being bought together.

It doesn't have to be at the product level either; you can assess what colours of items people buy together, or what types of items people buy together.

This can be hugely valuable to all manner of industries, but most prominently in retail.

A good retail example would be a department store (think John Lewis, House of Fraser, or for my US friends, Bloomingdales) which sells goods from a number of brands. If they can assess the likelihood of their customers buying particular brands together then they can optimize the store layout as a result, and thus maximize the potential for sales.

Market basket analysis is a hugely discussed topic online, and it’s easy to find examples of real life use cases of where it can benefit businesses. This blog is a great place to start.


So how does market basket analysis work?

Firstly we start with our items. An item is a single product; we want to see whether the purchase of this item is influenced by the purchase of another item or items.

In order to do this we need a set of transactions. A transaction allows us to identify which products are bought together through a unique identifier.

Most commonly a transactional dataset will look something like this…

[Image: an example transactional dataset, with a TransactionID column and an Item column]

From this dataset we can create a list of rules. Essentially a rule is the relationship between the item and other item(s).

{Polo Ralph Lauren, Armani Jeans} => {Hugo Boss}

That is, how does buying Polo Ralph Lauren and Armani Jeans items influence the purchase of Hugo Boss clothing? The set of items involved in a rule is also known as an 'item set'.

For each rule we generate three key metrics, ‘Support’, ‘Confidence’ and ‘Lift’. These metrics help us define just how significant a relationship there is between the two sides.

Support defines the % of transactions in which the rule or item set appears. It is ideal to have item sets with large support values.

Confidence refers to the probability that, if the items on the left side of the rule are in a transaction, the item on the right side will also be in that transaction.

Finally, lift equates to the probability that all the items exist within the same transaction (our support value), divided by the likelihood of the items within that item set occurring independently. A lift value greater than 1 suggests at least 'some usefulness' in the rule.
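To make the three metrics concrete, here's a quick worked example with made-up numbers (purely illustrative):

## Illustrative example: 100 transactions; 20 contain both Polo Ralph Lauren and
## Armani Jeans, 25 contain Hugo Boss, and 10 contain all three items.

support    <- 10 / 100                               # 0.10: the item set appears in 10% of transactions
confidence <- 10 / 20                                # 0.50: half the {PRL, AJ} baskets also contain Hugo Boss
lift       <- (10 / 100) / ((20 / 100) * (25 / 100)) # 2.00: twice what independence would predict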


Completing Market Basket Analysis in Alteryx.

Well that’s easy thanks to this Alteryx macro that I have created (Alteryx also have a series of Market Basket tools available, but I decided to build my own in order to aid my understanding of the mechanics required to build association rules).

But how does it work? Well actually the heavy lifting is done using the R tool (though I'd be massively interested to see whether anyone can repeat this solely with native Alteryx tools).

The following script will run successfully off an input that looks something like…

[Images: an example input dataset and the macro workflow]

So let's break down the script step by step.

## Load the arules (association rules) library; it can be downloaded from https://cran.r-project.org/web/packages/arules/index.html

library(arules)

## Read in the Alteryx data stream as a data frame.

data <- read.Alteryx("#1", mode = "data.frame")

## Replace the transaction IDs with numeric IDs as required for a table of transaction class.

data$Num <- as.numeric(factor(data$TransactionID,levels=unique(data$TransactionID)))

## Create a single vector for each transaction which contains a list of items within it

AggPosData <- split(data$Item,data$TransactionID)

## Convert our data into an object of transaction class

Txns <- as(AggPosData, 'transactions')

## Compute market basket analysis using the apriori algorithm

MarketBasket <- apriori(Txns, parameter = list(sup = 0.00001, conf = 0.5, maxlen = 3, target = 'rules'))

## Convert the output to a data frame

MarketBasketData <- as(MarketBasket, 'data.frame')

## Write the data out into an Alteryx data stream

write.Alteryx(MarketBasketData, 1)

[Image: the output rules, each with their support, confidence and lift values]
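As an aside, should you wish to eyeball the strongest rules without leaving R, arules can sort and print them; a quick sketch (assuming at least ten rules were generated):

## Print the ten rules with the highest lift to the R output log

inspect(sort(MarketBasket, by = 'lift')[1:10])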


Parameterising Support, Confidence and Length.

Within the R script you may have noticed that when computing the actual market basket analysis using the apriori algorithm I passed in the parameter 'sup'. This parameter essentially represents the minimum level for our key metric, 'support'.

MarketBasket <- apriori(Txns, parameter = list(sup = 0.00001, conf = 0.5, maxlen = 3, target = 'rules'))

This value is passed into the algorithm in order to ensure it runs in a reasonable amount of time; market basket analysis is traditionally resource intensive.

The reason is simple. As the number of different 'items' increases, the number of possible item sets, or rules, rises exponentially.

Due to the nature of the support metric, as soon as we find a single item set (or item) which has support lower than the threshold specified, we know that that item set (or item) cannot be contained in any larger item set we need to consider, because the larger set's support can be no higher (remember that support is essentially the % of total transactions that the item set exists upon).

Further to this, you will also notice that 'conf' (confidence) and 'maxlen' (maximum length of an item set) are also parameterised. This is to ensure the output data is trimmed as far as we need it. Remember, confidence essentially refers to how confident we can be that the result is significant, therefore lower values are largely irrelevant for further analysis. Meanwhile, it is almost impossible to instigate change that caters for all the conditions of a large item set.
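If you would rather keep the thresholds in apriori loose and trim afterwards, the arules subset() function can filter the generated rules instead; a small sketch:

## Keep only rules that clear stricter thresholds than those passed to apriori()

StrongRules <- subset(MarketBasket, subset = lift > 1 & confidence > 0.6)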


Visualising the output.

Traditionally people visualise market basket analysis with a scatter plot, which allows them to encode each of our three key metrics, with each point representing a rule.

[Image: a scatter plot of association rules]

Of course this is something easily repeatable with Tableau, and Tableau’s interactive filters can be used to further define thresholds, beyond those passed into the R script, to identify rules of interest to the end user.

[Image: the same scatter plot built in Tableau]

I feel that whilst this is a hugely impactful way of showing the data, a list of all the rules and their given support, confidence and lift values can also be hugely beneficial, especially in a dynamic platform such as Tableau where we can allow our audience to select which metric they wish to sort their rules by.

Other people have created network diagrams of association rules (with each node representing a different item set), but of course these are notoriously difficult to visualise in an impactful manner within Tableau.

Ben.


Special characters and the Cross Tab tool.

Let's say you have data that looks something like this…

[Image: the data before the cross tab]

And you want to change it to read something like this…

[Image: the data after the cross tab]

Well with Alteryx that's pretty simple: we just drag a Cross Tab tool onto the canvas, configure it as appropriate, and we have the correct result.

[Image: the Cross Tab tool configuration]

But what you may not have tried before is to create your headers from a field that contains dimension members with special characters. Let's take our original dataset and replace the spaces with a hyphen (-).

[Images: the hyphenated dataset before and after the cross tab]

Arr, we have a problem: you will see that the hyphens have been replaced with underscores (_). I don't know the exact reason why the tool has been built like this (I'd be interested in finding out), but in the meantime here's my way around it.

The Cross Tab Real tool.

[Image: the Cross Tab Real tool]

This tool has been designed to provide the same functionality as the Cross Tab tool within Alteryx, but maintains the integrity of special characters within headers.


Here’s how it works.

  1. Create an alias for each of the dimension members within our dataset. This alias will not have special characters within it. To do this I first sort my data by the field I wish to make my headers from, and then use the multi-row tool to create a unique ID for each member (see the formula sketch after this list).

    [Image: the multi-row formula configuration]

  2. Then we will cross-tab our data using the Alias as the header field.

    [Images: the cross tab configuration and its output, with alias headers]

  3. Perform a 2nd cross tab from our multi-row tool. We will again configure it with our alias field as the header field, but instead of choosing our true value field as the values in our output we will use our actual header field, in my case, ship mode.

    [Image: the 2nd cross tab configuration]

  4. Take a sample row from this 2nd cross tab stream. This will leave us with one row containing our actual header values underneath the alias names.

    [Image: the single header row]

  5. I will now union this header stream on top of the stream created in step 2.

    [Image: the unioned streams]

  6. Now it's just a simple case of using the Dynamic Rename tool on our alias fields with the 'Take Field Names from First Row of Data' option, and voila, we now have our cross tabbed data with the true headers.

[Image: the final output with the true headers]
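For reference, the multi-row formula in step 1 can be built something like the sketch below. This is one way of doing it, assuming the data is sorted by the header field ([Ship Mode] here), the new field is named Alias, and the tool is configured to treat rows that don't exist as 0:

// Increment the alias whenever the header value changes

IF [Ship Mode] = [Row-1:Ship Mode]
THEN [Row-1:Alias]
ELSE [Row-1:Alias] + 1
ENDIF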


Now I am definitely not the first person to solve this problem; Alteryx engineer Adam Riley has posted a solution on the Alteryx community, and there is also a 2nd solution, again on the Alteryx community, which can be found here.

If you wish to view the workflow used in this post then it can be found here. And remember to try out the Cross Tab Real tool which packages this workflow up into a single tool for easy use.

Ben

Embedding a Tableau visualisation within a Jive site.

Jive is an online platform designed to help develop an online community for businesses through encouraging sharing and collaboration between employees to drive creativity.

One use-case for Jive might be to provide feedback and encourage interaction with an in development or in production Tableau report.

Now the great thing about Jive is that embedding a Tableau visualisation is incredibly easy, which can make the whole process smoother: rather than having your users navigate through links and to a new tab, your users have the content directly in front of them, inside the Jive site/post where they are expected to make contributions.

It's not as simple as just pasting the embed link from Tableau Server, but it is close: we simply need to use an iframe to present our content to the audience.



1. Take the link (NOT THE EMBED LINK) of your visualisation from Tableau Server or Tableau public.

[Image: copying the viz link from Tableau Public]

2. Amend the link as appropriate should you wish to add specific URL parameters, such as hiding the toolbar, or hiding tabs.

3. Insert your link into the html iframe code shown below

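The iframe itself can be as minimal as the sketch below, where the src is your link from step 1 (plus any URL parameters from step 2), and the width and height are placeholder values to suit your page:

<iframe src="YOURVIZLINKHERE" width="1000" height="850" frameborder="0"></iframe>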

4. Copy your new html code and paste it in the appropriate location in the html editor of Jive.

5. Done

Ben


A 'loose' method for enriching your geographical trade areas with demographic information.

Having in-depth demographic information can be seen as the holy grail in the eyes of some data analysts, and there is a good reason why: the context that it can add to your analytics is superb.

The problem is, up-to-date demographic information is extremely difficult to get hold of and even more difficult to map into your data.

I may not be able to solve the former issue, but I will give you an example of how you can ‘loosely’ (there are some assumptions) map population data into your data to solve the latter issue.

For the purpose of this example I will use information generated by my Data School colleague, Pablo Sáenz de Tejada, in this blog about geocoding addresses with Alteryx.

The dataset produced gives the latitude and longitude for each Tesco store within London. I have then created voronoi style trade areas for each of these stores, following the steps outlined in this blog by Will Griffiths.


Finding demographic information.

In order to maximise the accuracy of our demographic mapping, it is important that we obtain the demographic information at the lowest level of detail available (and of course the latest).

Within England and Wales, the Office for National Statistics releases mid-year estimates of population at the level of Lower Super Output Area (LSOA). These estimates are broken down by gender, which gives an additional demographic variable.

LSOAs have been developed to provide areas that contain an average of 1,500 residents and 650 households, whilst other characteristics such as social homogeneity also play a role in defining their boundaries.

Additional information can also be found at this level, such as indices of deprivation.

It is unlikely that you will be able to find a free data source at a lower level than this for England and Wales.


‘Loose’ mapping.

Now that we have our demographic detail we can look at mapping it into our dataset.

In this example I will show a scenario where we want to assess the potential footfall for each of our Tesco stores alongside some demographic attributes, specifically age and gender.

My methodology is something like this…

  1. Take the store trade areas and spatially match them against the LSOA areas
  2. Create, for each matched LSOA, an intersection object between itself and the trade area
  3. Identify the size of the original LSOA shape
  4. Identify the size of the intersection object shape
  5. From these two values calculate the % of the total LSOA shape that is within the trade area
  6. Multiply this value by whatever demographic variables we have (see the formula sketch after this list)
  7. Done.
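Steps 5 and 6 boil down to a simple formula; a sketch, using hypothetical field names for the two areas and a population measure:

// Estimated residents of the LSOA attributable to the trade area

[Population] * [Intersection Area] / [LSOA Area]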

So how do we do all of that in Alteryx?


Alteryx work.

First things first, we need to input our trade areas.

Then we need to input our LSOA shape files, which can be found here. Whilst I had this object I also used the spatial info tool to return the area value.

Now we need to spatially match these two objects. In my case I configured the spatial match to return objects that ‘touched or intersected’ the second object.

[Image: the Spatial Match configuration]

For each of your trade areas you will now have a list of the LSOA areas that are, at least in part, within your trade areas.

[Image: the list of LSOAs matched to each trade area]

Now that we know which LSOA objects are linked to which trade areas, and have the actual spatial objects for each, we can create our intersection object using the Spatial Process tool.

[Image: the Spatial Process tool configuration]

Now we have our intersection object we can use the Spatial Info tool to identify the size of this area, before calculating our % overlap.

[Images: the Spatial Info configuration and the % overlap calculation]

Once we have these values it is just a case of identifying the demographic attributes we wish to bring in, joining them to this file and multiplying them by the percentage overlap.


You can find my complete workflow here, and I have also produced an Alteryx application which returns demographic information (age and gender) for each individual shape that exists within a shape file, that can be found here (note this will only enrich shape files within England and Wales).

Ben


Finding the last instance of a character within a text string in Alteryx

Earlier I was playing around with manipulating file path strings, looking to develop an application that unions all files in the same location and of the same type as a template file (yes, I know these macros already exist, but I wanted to develop my own to understand the workings involved).

The route I went for was to have the user browse to one of the files; I would then manipulate this path, essentially replacing the filename with the Alteryx wildcard character '*'.

This means that I wanted to manipulate a string that looks like this…

C:\Users\Ben Moss\Desktop\filenamehere.filetypehere

into something that looked like this…

C:\Users\Ben Moss\Desktop\*.filetypehere

Given the syntax of file paths, I knew that anything after the last backslash (\) character and before the first (and only) full stop represented the filename.

So my challenge was to acknowledge the positions of these characters.


Finding the location of a character is relatively simple, and most of you will probably know that, in Alteryx, the following formula…

findstring([field], ".")

Will return the position of the first full stop within the field (note that FindString is zero-based, and returns -1 if the target isn't found).

It is possible, using a series of nested findstrings in combination with the right and length functions, to find the nth occurrence of a character.

Something like

findstring([Field1], "\")

//position of the 1st backslash

+

findstring(right([Field1], length([Field1]) - findstring([Field1], "\") - 1), "\")

//position of the 2nd backslash within the string trimmed to start after the 1st backslash

+1

//add one because the 1st backslash is cut out of the trimmed string but still forms part of the original

Would return the position of the 2nd instance, for example. However, what if we don't know the value of N? With file paths we don't (they can contain any number of backslashes), but we do know we need the final one, that's for sure.


That’s when, whilst browsing Alteryx’s string functions, I fell upon the ‘ReverseString()’ function, which does exactly what it says on the tin.

ReverseString('Ben'), for instance, would return 'neB'.

So

ReverseString("C:\Users\Ben Moss\Desktop\filenamehere.filetypehere")

would return

erehepytelif.erehemanelif\potkseD\ssoM neB\sresU\:C

Which, whilst looking like absolute nonsense, is actually very useful, as we can now use the findstring() function to return the position of the 1st backslash in our reversed string, before taking that away from the length of the string to return its actual position.

[Image: the formula returning the position of the last backslash]

Now we have our position, we just need to combine all of our formulas to develop our final wildcard path (see the sketch below).

[Images: the combined formula and the resulting wildcard path]
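Pieced together, the final formula looks something like the sketch below (written against a field called [Field1]; because FindString is zero-based the lengths line up without further offsets):

// Directory (up to and including the last backslash) + wildcard + file extension

Left([Field1], Length([Field1]) - FindString(ReverseString([Field1]), "\"))
+ "*."
+ Right([Field1], Length([Field1]) - FindString([Field1], ".") - 1)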

And there you go, that's how I resolved my string issue.

Ben

I want more than 16 columns in my Tableau table.

So my colleague Ravi Mistry posted a blog a fair few months ago now, giving a neat tip to increase the maximum number of levels of columns and rows within a table type view.

He hinted at a tip about hacking the XML to get more than 16 levels; here I am giving him a post to refer to.



  1. Go to 'Analysis' on the top toolbar, navigate to 'Table Layout' and then 'Advanced'; a table options menu will appear
  2. Here you will see a 'Maximum levels of column labels' setting; by default this is set to 6, and you can only set a maximum of 16 through the UI. In order to make this value visible in the XML we need to change it from the default value, so I will set it to 16.

    [Image: the table options menu]

  3. Save your file as a .twb file
  4. Open the .twb file in a text editor; in my example I will use Notepad
  5. Find the line of XML that defines the maximum level of rows (which actually relates to column headers); it looks something like…

    <format attr='row-levels' value='16' />

    [Image: the row-levels line in the XML]

  6. Amend the placeholder value we gave (the maximum number of column levels, in my case 16) to the number of columns you want in your table; I'm going to put 17.
  7. Save the file
  8. Open the Tableau workbook and build out your 17-column (or however many you wish) table
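For my example, the amended line from step 6 then reads:

<format attr='row-levels' value='17' />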

Of course you can do a similar trick with the 'maximum number of horizontal row labels' and 'maximum number of column labels' by just amending the default value to ensure the value is assigned in the XML, and then finding it and changing it accordingly.

Ben

Colour encoding in preparation.

In Tableau we may have scenarios where we have a list of dimension members, and this list may not be exhaustive (complete). However, we may know of other dimension members that could exist in the future (but currently don't).

An example of this would be on a critical risk dashboard. You have data set up that returns a 'FALSE' value if there is a fatal error within your system. It is very unlikely that this fatal error will exist during the dashboard design phase, and therefore you can only ever colour encode the value 'TRUE' (meaning no fatal error).

Here I will outline two methods of 'encoding in preparation' for such dimension members.

In these examples I will set the scenario that I have a statement that at present only returns the value FALSE. I want to encode it in case TRUE values appear in the future.


Method 1. Directly in Tableau.

  1. Create a calculated field. This calculated field will be the one in which your final calculation, or copy of the dimension will exist.
  2. Within this calculation return, one by one, the values that you predict will exist in the future.

    In my case I have used the formula 1<2 to return the value TRUE

  3. Drag this calculation to the colour shelf
  4. Encode the colour as appropriate

    [Image: the colour legend with the predicted value encoded]

  5. If you have more dimension members that you believe will exist in the future, loop through steps 2-4.
  6. Create your actual calculation.
  7. Check out your encoded values once they are ‘active’.

    [Image: the encoded values once they become active]


Method 2. Hacking the XML.

  1. Place your dimension or calculated field upon the colour shelf.
  2. Encode the variables with a colour palette (not automatic).

    [Image: encoding the variables with a colour palette]

  3. Save your workbook as a .twb file, before closing and opening in notepad.
  4. Identify the location where the current dimension members have been assigned within the XML

    [Image: the encoding section within the XML]

  5. Build out the encoding XML for the missing dimension members and union it underneath the current lines, but before the closing </encoding> tag.

    <map to='#HEXCODEHERE'>
    <bucket>DIMENSIONMEMBERHERE</bucket>
    </map>

    [Image: the new XML unioned beneath the existing encoding lines]

  6. Save the workbook
  7. Open your workbook and wait for the dimension member to exist

    [Image: the prepared colour appearing once the dimension member exists]


And that’s how you can ‘colour encode in preparation’.

Ben