How train fares just got fairer in the UK

Okay, so the headline doesn’t portray an entirely accurate picture of the latest release of train fares in the UK.

According to the BBC, the latest price release represented a 3.4% average increase in fares across the UK.

Let’s get one thing straight, train fares are always going to rise, for the foreseeable future at least. The cost of maintaining the railways has increased, demand for train services has increased, and consumers are demanding a more punctual service.

If you take just those three things into account and still seriously expected train prices to fall, then it’s fair to say you may be slightly foolish.

This 3.4% figure is also extremely crude; there are over 2,500 stations in the UK, meaning there are well over 6 million potential routes for a train passenger. These routes vary greatly in their distance, speed, flexibility, busyness and, not least, service provider.

I’ve taken a very different approach to reviewing the difference in fare prices between 2017 and 2018. Instead of focusing on the apparent disgrace of the increase in cost, I’ve looked deeper and found a story that suggests train fares may in fact have just got fairer.

North West v North East

I stumbled across the potential for this story after comparing the visualisation to the left, which shows the % increase in the cost of an anytime return into London, with the visualisation on the right, which shows the cost per mile of an anytime return into London from 2017.

It’s quite evident, just from these two views, that there hasn’t just been a blanket rise in train fares across the UK. Fig 2 shows us that there appears to have been a price freeze in the North West, whilst fig 1 shows us that historically these areas represent those with the greatest cost per mile, specifically areas in North Wales and around Liverpool.

You could argue these views are still crude; train fares are not linear with distance. If I live in, say, Leicester, which is ~86 miles from its London terminus, and an adult fare is £157 for an anytime return (true in 2017), then I wouldn’t expect someone who lives roughly double the distance away, in York say, to pay double the price; I would expect them to pay more, but at a rate that tails off as the distance gets greater.

The views above do not make it easy to compare the cost of journeys of a similar length, so let’s do this differently: let’s plot distance in miles against cost per mile during 2017, with the % change on colour (as shown below in figure 3).

This view shows exactly the same detail as the view in figure 2, but shows the disparity in growth against fares of a similar distance much more clearly.

Here you can see that the lighter colours (lower increase) appear, certainly in most cases, to lie above the darker colours (higher increase) at a similar length of travel. Given that our y-axis represents the cost per mile of fares in 2017, this suggests that journeys that were cheaper in 2017 are being brought closer to the more expensive fares from 2017.

The Fare Setters

The divide between fare growth in the West and East points us towards the two major service providers of these areas, Virgin Trains West Coast and Virgin Trains East Coast (it’s probably important to say at this point that despite their names, the two services are legally separate), and how they may have implemented different pricing strategies in the latest fare release.

If we change the colour on the previous chart from YoY % increase to the ‘fare setter’, it’s possible to see a clear pattern.

The values representing ‘Virgin Trains West Coast’ line up in every case with the circles that represent low growth, and by plotting growth instead of cost per mile in 2017 this becomes even clearer.

Holy shit, there was a complete price freeze on Virgin Trains West Coast routes into its London terminus; did anyone read about that this week?

Okay, so it might be a well overdue price freeze for customers on the West Coast Main Line who, as per figure 4, appear to have historically seen costs above the going cost per mile for routes of a similar distance, but at least they got it.

North v South

When looking at figures 3 and 4, you may have noticed something that I haven’t pointed out (yet): a huge cluster of stations where the cost per mile into central London doesn’t appear to match what we would expect. They could definitely be classed as outliers (fig 6). This was something that I wanted to look into more deeply, so I started with our ‘fare setters’ and overlaid these onto colour (fig 7).


These points appear to come from a very specific set of service providers, largely operating around South and East London. Now this doesn’t exactly seem fair; there are people who live in Axminster who pay £131.20 for a ~135 mile journey, and then there are people from Uttoxeter who pay £290.00 for a ~120 mile journey (15 miles shorter!).

Unlike the West Coast and East Coast lines, where there has been a distinct shift towards fairer pricing, this isn’t quite the story for the operators that work these shorter routes.


The above visual (fig 8) supports this statement and shows that for those operators whose fares are distinctly lower than their peers’ (operators around the South and East of London), the YoY % increase in fares is also lower, i.e. the gap between these two operator groups is actually increasing rather than decreasing.

Why this gap exists is hard to be sure of without researching the topic in more detail, but it is more than likely a result of rail subsidies, designed to decrease the cost of rail travel to and from certain areas due to the socio-economic benefits it can bring to the towns and people being served.

Where Next

Now it’s important to note this post only focuses on train travel into central London with the greatest level of flexibility (an anytime return on any permitted route). It certainly tells us some stories, but it doesn’t paint the whole picture.

For example, it would be good to develop this analysis further by investigating the different geographies in more detail; instead of focusing purely on London, perhaps looking at transport to all major destinations in the UK, alongside routes to all major towns/cities within that specific origin’s region. This would allow us to get a better picture of whether these stories of geographical divides are ‘London centric’ or actually a wider problem related to the cost of travel in those areas.

Additional data around, for example, the cost of maintaining each of Britain’s key railway lines would also be beneficial. Perhaps the higher cost for passengers on the west coast line compared to the east coast line is justified by higher maintenance costs. As mentioned at the beginning of the article, you would also expect things such as the number of services to the destination and the busyness of the service to act as drivers of higher train fares, but unfortunately data in this area is notoriously difficult to gather, let alone make publicly available (though TFL are making a good stab at it).

Data Summary

Data scraped from

2017 data is from 24/12/2017, 2018 data is from 02/01/2018.

All journeys are from their origin to their London terminus.

Fare data is based on an anytime return fare from an origin outside of the London Travelcard area, valid on any permitted route.

Distances based on straight line distance from origin to London St. Pancras in miles (technically a return covers this distance x 2 but all values would be relative).


APIs, WTF are they?

A quick google reveals the following answer…


But to myself and many others, this is far too vague a definition to understand their use cases for us data analysts.

Searching for what API actually stands for doesn’t really clear anything up either (Application Programming Interface for anyone interested).

So I’m going to try and help clear things up by creating a ‘layman’s’ definition for us data analysts, alongside an example use case from the API made available by Transport for London (TFL).

So what do I think APIs are?

Well, to me, they represent the ability for us to query an online data source by passing information (or parameters) into a URL.

Once we make our request, by pressing enter, we will be passed a resulting string of text, usually maintained in either an XML or JSON format.
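As a sketch of that idea, here’s how a parameterised request URL might be built in Python. The endpoint and parameter names below are made up for illustration; the real ones always come from the provider’s API documentation.

```python
from urllib.parse import urlencode

# Hypothetical endpoint and parameters, for illustration only --
# the real names come from the provider's API documentation.
base_url = "https://api.example.com/v1/arrivals"
params = {"stopId": "12345", "app_key": "YOUR_KEY"}

# The parameters are encoded directly into the URL itself.
request_url = f"{base_url}?{urlencode(params)}"
print(request_url)
# https://api.example.com/v1/arrivals?stopId=12345&app_key=YOUR_KEY

# The request could then be made with urllib.request.urlopen(request_url),
# and the XML or JSON response parsed from there.
```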

API Documentation

This is always the first place to start. This will allow you to understand what data is available, what parameters you can pass into the query, how the query should be formatted, and the output type.

Lets have a look at the API documentation available on the TFL site.


1st things first, most APIs will require you to register your ‘app’; this is so the provider can identify who is making each request. It also allows them to control the number of requests you make; most will have some sort of limit per minute/hour/day.


The 2nd thing to notice is that most APIs will actually contain a number of different ways in which you can query the database; each query will bring back a different set of information and is designed to answer the array of different questions that users may have.

It is entirely possible to use the results from one request to drive a follow-up request; for example, you can make one query to return a list of the IDs of all of your followers, then use these follower IDs to make a query against the user lookup API to return further information about each of your followers.
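That chaining pattern can be sketched in a few lines of Python. The endpoints here are invented and the responses are stubbed so the example is self-contained; in practice `get_json` would wrap a real HTTP fetch plus `json.loads`.

```python
def get_json(url):
    # Stubbed responses so the sketch runs without a network; in practice
    # this function would fetch the URL and parse the JSON body.
    fake_api = {
        "https://api.example.com/followers/ids": {"ids": [101, 202]},
        "https://api.example.com/users/101": {"id": 101, "name": "Ada"},
        "https://api.example.com/users/202": {"id": 202, "name": "Ben"},
    }
    return fake_api[url]

# Request 1: fetch the list of follower IDs...
follower_ids = get_json("https://api.example.com/followers/ids")["ids"]

# ...then use each ID to drive a follow-up request for that user's details.
followers = [get_json(f"https://api.example.com/users/{fid}") for fid in follower_ids]
print([f["name"] for f in followers])  # ['Ada', 'Ben']
```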



So give me an example.



Market Basket Analysis in Alteryx.

Note. This blog focuses on ‘association rules’ type analysis within the wider ‘market basket analysis’ field.

Market basket analysis is a technique used to assess the likelihood of particular products being bought together.

It doesn’t have to be at the product level either, you can assess what colours of items people buy together, or what type of items people buy together.

This can be hugely valuable to all manner of industries, but most prominently in retail.

A good retail example would be a department store (think John Lewis, House of Fraser, or, for my US friends, Bloomingdales) which sells goods from a number of brands. If they can assess the likelihood of their customers buying particular brands together, then they can optimise the store layout as a result, and thus maximise the potential for sales.

Market basket analysis is a hugely discussed topic online, and it’s easy to find examples of real life use cases of where it can benefit businesses. This blog is a great place to start.

So how does market basket analysis work?

Firstly we start with our items. An item is a single product; we want to see whether the purchase of this item is influenced by the purchase of another item or items.

In order to do this we need a set of transactions. A transaction allows us to identify which products are bought together through a unique identifier.

Most commonly a transactional dataset will look something like this…


From this dataset we can create a list of rules. Essentially a rule is the relationship between the item and other item(s).

{Polo Ralph Lauren, Armani Jeans} => {Hugo Boss}

That is: how does buying Polo Ralph Lauren and Armani Jeans items influence the purchase of Hugo Boss clothing? The above rule can also be known as an ‘item set’.

For each rule we generate three key metrics, ‘Support’, ‘Confidence’ and ‘Lift’. These metrics help us define just how significant a relationship there is between the two sides.

Support defines the % of transactions in which the rule or item set appears. It is ideal to have item sets with large support values.

Confidence refers to the probability that, if the items on the left side of the rule are in a transaction, the item on the right side will also be in that transaction.

Finally, lift equates to the probability that all the items exist in the same transaction (our support value), divided by the likelihood of the items within that item set occurring independently. A lift value of greater than 1 suggests at least ‘some usefulness’ in the rule.
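To make those definitions concrete, here’s a hand-rolled calculation of the three metrics on a toy set of four transactions (invented for illustration), using the rule above.

```python
# Toy transactions, invented for illustration.
transactions = [
    {"Polo Ralph Lauren", "Armani Jeans", "Hugo Boss"},
    {"Polo Ralph Lauren", "Armani Jeans"},
    {"Polo Ralph Lauren", "Hugo Boss"},
    {"Armani Jeans"},
]

def support(items):
    """Fraction of transactions containing every item in the set."""
    return sum(items <= t for t in transactions) / len(transactions)

lhs = {"Polo Ralph Lauren", "Armani Jeans"}
rhs = {"Hugo Boss"}

rule_support = support(lhs | rhs)                          # 1/4  = 0.25
confidence = support(lhs | rhs) / support(lhs)             # 0.25 / 0.5  = 0.5
lift = support(lhs | rhs) / (support(lhs) * support(rhs))  # 0.25 / 0.25 = 1.0
```

A lift of exactly 1 means the two sides occur together no more often than chance would predict; values above 1 mark the rules worth investigating.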

Completing Market Basket Analysis in Alteryx.

Well that’s easy thanks to this Alteryx macro that I have created (Alteryx also have a series of Market Basket tools available, but I decided to build my own in order to aid my understanding of the mechanics required to build association rules).

But how does it work? Well actually the heavy lifting is done using the R tool (though I’d be massively interested to see if anyone can repeat this solely with native Alteryx tools).

The following script will run successfully off an input that looks something like…



So let’s break down the script step by step.

## Load the arules (association rules) library, available from CRAN.

library(arules)


## Read in the Alteryx data stream as a data frame.

data <- read.Alteryx("#1", mode="data.frame")

## Replace the transaction IDs with numeric IDs, as required for a table of transaction class.

data$Num <- as.numeric(factor(data$TransactionID,levels=unique(data$TransactionID)))

## Create a single vector for each transaction which contains a list of items within it

AggPosData <- split(data$Item,data$TransactionID)

## Convert our data into an object of transaction class

Txns <- as(AggPosData, 'transactions')

## Compute market basket analysis using apriori algorithm

MarketBasket <- apriori(Txns, parameter = list(sup = 0.00001, conf = 0.5, maxlen = 3, target = 'rules'))

## Convert the output to a data frame

MarketBasketData <- as(MarketBasket, 'data.frame')

## Write the data out into an Alteryx data stream

write.Alteryx(MarketBasketData, 1)


Parameterising Support, Confidence and Length.

Within the R script you may have noticed that when computing the actual market basket analysis using the apriori algorithm I passed in the parameter ‘sup’. This parameter essentially represents the minimum level for our key metric, ‘support’.

MarketBasket <- apriori(Txns, parameter = list(sup = 0.00001, conf = 0.5, maxlen = 3, target = 'rules'))

This value is passed into the algorithm in order to ensure it runs in a reasonable amount of time; market basket analysis is traditionally resource intensive.

The reason is simple: as the number of different ‘items’ increases, the number of possible item sets, or rules, rises exponentially.

Due to the nature of the support metric, as soon as we find a single item set (or item) with support lower than the threshold specified, we know that it cannot be contained in any larger item set that meets the threshold, because that larger set’s support would be lower still (remember that support is essentially the % of total transactions that the item set exists in).
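That pruning logic can be sketched in a few lines of Python (transactions and threshold are made up): any item whose own support falls below the minimum is discarded before pairs are even enumerated, which is what keeps the search tractable.

```python
from itertools import combinations

# Toy data: four transactions and a 50% minimum support threshold.
transactions = [{"a", "b"}, {"a", "b"}, {"a", "c"}, {"d"}]
min_support = 0.5

def support(items):
    return sum(items <= t for t in transactions) / len(transactions)

# Pass 1: keep only the single items that meet the support threshold.
all_items = set().union(*transactions)
frequent_items = {i for i in all_items if support({i}) >= min_support}

# Pass 2: only pairs built from surviving items ever need checking --
# a pair containing an infrequent item cannot itself be frequent.
frequent_pairs = [
    set(p) for p in combinations(sorted(frequent_items), 2)
    if support(set(p)) >= min_support
]
print(frequent_items, frequent_pairs)
```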

Further to this, you will also notice that ‘conf’ (confidence) and ‘maxlen’ (maximum length of an item set) are also parameterised. This is to ensure the output data is trimmed as far as we need it. Remember, confidence essentially refers to how confident we can be that the result is significant, therefore lower values are largely irrelevant for further analysis. Meanwhile, it is almost impossible to instigate change that caters for all the conditions of a large item set.

Visualising the output.

Traditionally people visualise market basket analysis with a scatter plot, which allows them to encode each of our three key metrics, with each point representing a rule.


Of course this is something easily repeatable with Tableau, and Tableau’s interactive filters can be used to further define thresholds, beyond those passed into the R script, to identify rules of interest to the end user.


I feel that whilst this is a hugely impactful way of showing the data, a list of all the rules and their given support, confidence and lift values can also be hugely beneficial, especially in a dynamic platform such as Tableau where we can allow our audience to select which metric they wish to sort their rules by.

Other people have created network diagrams of association rules (with each node representing a different item set), but of course these are notoriously difficult to visualise in an impactful manner within Tableau.



Special characters and the Cross Tab tool.

Let’s say you have data that looks something like this…


And you want to change it to read something like this…


Well, with Alteryx that’s pretty simple: drag a Cross Tab tool onto the canvas, configure it as appropriate, and you have the correct result.


But what you may not have tried before is to create your headers from a field that contains dimension members with special characters. Let’s take our original dataset and replace the spaces with a hyphen (-).



Arr, we have a problem: you will see that the hyphens have been replaced with underscores (_). I don’t know the exact reason why the tool has been built like this, but I’d be interested in finding out. In the meantime, here’s my way around it.

The Cross Tab Real tool.


This tool has been designed to provide the same functionality as the Cross Tab tool within Alteryx, but maintains the integrity of special characters within headers.

Here’s how it works.

  1. Create an alias for each of the dimension members within our dataset. This alias will not have special characters within it. To do this I first sort my data by the field I wish to make my headers from, and then use the Multi-Row Formula tool to create a unique ID for each member.
  2. Cross-tab our data using the alias as the header field.
  3. Perform a 2nd cross tab from our Multi-Row Formula tool. This time we will configure it again with our alias field as the header field, but instead of choosing our true value field as the values in our output we will use our actual header field, in my case, Ship Mode.
  4. Take a sample row from this 2nd cross tab stream. This will leave us with one row containing our actual header values underneath the alias names.
  5. Union this header stream on top of the stream created in step 2.
  6. Now it’s just a simple case of using the Dynamic Rename tool on our alias fields with the ‘Take Field Names from First Row of Data’ option and voilà, we now have our cross tabbed data with the true headers.
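As an aside, the underscore substitution is a behaviour of the Cross Tab tool rather than anything fundamental about pivoting. A quick dictionary-based pivot in Python (sample data invented, loosely modelled on the Ship Mode example above) keeps hyphenated headers intact:

```python
# Invented sample rows, loosely modelled on the Superstore 'Ship Mode' field.
rows = [
    {"Region": "North", "Ship-Mode": "First-Class", "Sales": 100},
    {"Region": "North", "Ship-Mode": "Same-Day", "Sales": 50},
    {"Region": "South", "Ship-Mode": "First-Class", "Sales": 75},
]

pivot = {}
for r in rows:
    # Headers come straight from the data, special characters and all.
    pivot.setdefault(r["Region"], {})[r["Ship-Mode"]] = r["Sales"]

print(pivot)
# {'North': {'First-Class': 100, 'Same-Day': 50}, 'South': {'First-Class': 75}}
```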


Now, I am definitely not the first person to solve this problem; Alteryx engineer Adam Riley has posted a solution on the Alteryx community, and there is also a 2nd solution, again on the Alteryx community, which can be found here.

If you wish to view the workflow used in this post then it can be found here. And remember to try out the Cross Tab Real tool which packages this workflow up into a single tool for easy use.


Embedding a Tableau visualisation within a Jive site.

Jive is an online platform designed to help develop an online community for businesses through encouraging sharing and collaboration between employees to drive creativity.

One use-case for Jive might be to provide feedback and encourage interaction with an in development or in production Tableau report.

Now, the great thing about Jive is that embedding a Tableau visualisation is incredibly easy, which can make the whole process smoother; rather than having your users navigate through links to a new tab, they have the content directly in front of them, inside the Jive site/post where they are expected to make contributions.

It’s not quite as simple as pasting the embed code from Tableau Server, but it is simple in that we just need to use an iframe to present our content to the audience.


1. Take the link (NOT THE EMBED LINK) of your visualisation from Tableau Server or Tableau public.


2. Amend the link as appropriate should you wish to add specific URL parameters, such as hiding the toolbar, or hiding tabs.

3. Insert your link into the html iframe code shown below
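The wrapper looks something like the snippet below; the URL, width and height are placeholders to swap for your own values (`:embed=y` and `:showVizHome=no` are standard Tableau URL parameters for embedded views).

```html
<iframe src="https://public.tableau.com/views/YourWorkbook/YourSheet?:embed=y&:showVizHome=no"
        width="1000" height="850" frameborder="0">
</iframe>
```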




4. Copy your new html code and paste it in the appropriate location in the html editor of Jive.

5. Done





A ‘loose’ method for enriching your geographical trade areas with demographic information.

Having in-depth demographic information can be seen as the holy grail in the eyes of some data analysts, and there is a good reason why: the context that can be added to your analytics as a result is superb.

The problem is, up-to-date demographic information is extremely difficult to get hold of and even more difficult to map into your data.

I may not be able to solve the former issue, but I will give you an example of how you can ‘loosely’ (there are some assumptions) map population data into your data to solve the latter issue.

For the purpose of this example I will use information generated by my Data School colleague, Pablo Sáenz de Tejada, in this blog about geocoding addresses with Alteryx.

The dataset produced gives the latitude and longitude for each Tesco store within London. I have then created voronoi style trade areas for each of these stores, following the steps outlined in this blog by Will Griffiths.

Finding demographic information.

It is important, in order to maximise the accuracy of our demographic mapping, that we obtain the demographic information at the lowest level of detail available (and of course the latest).

Within England and Wales, the Office for National Statistics releases mid-year estimates of population at the level of Lower Super Output Area (LSOA). These estimates are broken down by gender, which gives an additional demographic variable.

LSOAs have been developed to provide areas that contain an average of 1,500 residents and 650 households, whilst other characteristics such as social homogeneity also play a role in defining their boundaries.

Additional information can also be found at this level, such as indices of deprivation.

It is unlikely that you will be able to find a free data source at a lower level than this for England and Wales.

‘Loose’ mapping.

Now that we have our demographic detail we can look at mapping it into our dataset.

In this example I will show a scenario where we want to assess the potential footfall for each of our Tesco stores alongside some demographic attributes, specifically age and gender.

My methodology is something like this…

  1. Take the store trade areas and spatially match them against the LSOA areas
  2. Create, for each matched LSOA, an intersection object between itself and the trade area
  3. Identify the size of the original LSOA shape
  4. Identify the size of the intersection object shape
  5. From these two values calculate the % of the total LSOA shape that is within the trade area
  6. Multiply this value by the demographic variables we may have
  7. Done.
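The arithmetic in steps 5 and 6 can be sketched in a few lines of Python (the numbers below are invented for illustration):

```python
def apportion(lsoa_population, lsoa_area, intersection_area):
    """Scale an LSOA's population by the share of its area inside the trade area."""
    overlap_pct = intersection_area / lsoa_area
    return lsoa_population * overlap_pct

# An LSOA of 1,500 residents with 25% of its area inside the trade area
# contributes an estimated 375 people to the store's potential footfall.
print(apportion(1500, 4.0, 1.0))  # 375.0
```

The ‘loose’ part is the underlying assumption that population is spread evenly across each LSOA; that is rarely exactly true, but it is a reasonable approximation at this level of detail.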

So how do we do that in Alteryx?

Alteryx work.

1st things first we need to input our trade areas.

Then we need to input our LSOA shape files, which can be found here. Whilst I had this object, I also used the Spatial Info tool to return the area value.

Now we need to spatially match these two objects. In my case I configured the spatial match to return objects that ‘touched or intersected’ the second object.


For each of your trade areas you will now have a list of the LSOA areas that are, at least in part, within your trade areas.


Now that we know which LSOA objects are linked to which trade areas, and have the actual spatial objects for each, we can create our intersect object using the Spatial Process tool.


Now that we have our intersect object, we can use the Spatial Info tool to identify the size of this area before calculating our % overlap.



Once we have these values it is just a case of identifying the demographic attributes we wish to bring in, joining these to this file and multiplying them by the percentage overlap.

You can find my complete workflow here, and I have also produced an Alteryx application which returns demographic information (age and gender) for each individual shape that exists within a shape file, that can be found here (note this will only enrich shape files within England and Wales).



Finding the last instance of a character within a text string in Alteryx

Earlier I was doing some playing around, looking to manipulate file path strings to develop an application that unions all files in the same location and of the same type as a template file (yes, I know these macros already exist, but I wanted to develop my own to understand the workings).

The route I went for was to have the user browse to one of the files; I would then manipulate this path, essentially replacing the filename with the Alteryx wildcard character ‘*’.

This means that I wanted to manipulate a string that looks like this…

C:\Users\Ben Moss\Desktop\filenamehere.filetypehere

into something that looked like this…

C:\Users\Ben Moss\Desktop\*.filetypehere

Given the syntax of file paths, I knew that anything between the last backslash (\) character and the first (and only) full stop represented the filename.

So my challenge was to find the positions of these characters.

Finding the location of a character is relatively simple, and most of you will probably know that, in Alteryx, the following formula…


Will return me the position of the first full stop within the field.

It is possible, using a series of nested FindString functions in combination with the Right and Length functions, to find the nth occurrence of a character.

Something like


//position of 1st backslash



//position of 2nd backslash in string trimmed from after first back slash


//add character because the first backslash is taken out of the equation but it still forms part of the string

Would return the 2nd instance, for example. However, we don’t know the value of N here (file paths can contain any number of backslashes); all we know is that we need the final one, that’s for sure.


That’s when, whilst browsing Alteryx’s string functions, I fell upon the ‘ReverseString()’ function, which does exactly what it says on the tin.

ReverseString(‘Ben’) for instance would return ‘neB’.


ReverseString(‘C:\Users\Ben Moss\Desktop\filenamehere.filetypehere’)

would return

erehepytelif.erehemanelif\potkseD\ssoM neB\sresU\:C

Whilst this looks like absolute nonsense, it is actually very useful: we can now use the FindString() function to return the position of the first backslash in our reversed string, and then subtract that from the length of the string to get the backslash’s actual position.


Now we have found our position, we just need to combine all of our formulas to develop our final wildcard path.
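For anyone who wants to sanity-check the logic outside Alteryx, here is the same technique sketched in Python: reverse the path, find the first backslash in the reversed string, subtract from the length, then splice in the wildcard.

```python
path = r"C:\Users\Ben Moss\Desktop\filenamehere.filetypehere"

# Reverse the path, find the first backslash in the reversed string,
# then subtract from the total length to locate the LAST backslash.
reversed_path = path[::-1]
last_backslash = len(path) - reversed_path.index("\\")

# The first (and only) full stop marks the start of the file extension.
first_dot = path.index(".")

# Splice the wildcard in between the two positions.
wildcard_path = path[:last_backslash] + "*" + path[first_dot:]
print(wildcard_path)  # C:\Users\Ben Moss\Desktop\*.filetypehere
```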



And there you go, that’s how I resolved my string issue.