Text Mining in Alteryx. Parsing hashtags from a list of tweets.

As part of this week's #MakeoverMonday series, Tableau Zen Master Andy Kriebel and his partner in crime Eva Murray posted a data set listing the tweets that President Donald Trump has either written himself or retweeted over the last 6 years, during his run-up to taking the White House hot seat.

Immediately I saw 3 ways of analysing this data.

  1. Time series analysis
  2. Text analysis
  3. Time series and text analysis

The notion of text analytics is something I, and most of us, find fascinating, but it is notoriously difficult to get real insight from.

In the visualisation I created below I went for something quite simple. I looked to analyse hashtags used within those tweets.

2017-01-20_10-40-51.png

In this post I will show you how, with two Alteryx tools and one regex formula, I created a list of all hashtags used within the data.


2017-01-20_07-00-04

  1. Input your data (obviously). The data used in this project can be found here.
  2. Use the RegEx tool to tokenize and parse the text column with the expression #[[:alnum:]]+, which finds each hashtag and takes the alphanumeric characters that follow it, stopping at the first character that is not a letter or digit (a space or a punctuation mark, for example).
    # = look for a hashtag
    [[:alnum:]] = look for an alphanumeric character
    + = look for one or more consecutive alphanumeric characters

    2017-01-20_07-50-22

    You will also note that I used the split to rows option. This is because a tweet can contain any number of hashtags, so each hashtag gets its own row. If we wished to have a row per tweet, we could always bring our data back into that structure using the Cross Tab tool.

    We can now run the workflow and…

    2017-01-20_08-39-57.png
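For anyone following along without Alteryx, the tokenize and split to rows step can be roughly sketched in Python. This is only an illustration, not the workflow itself: Python's re module does not support the POSIX class [[:alnum:]], so [A-Za-z0-9] stands in for it, and the function names and sample tweets are made up for the example.

```python
import re
from collections import defaultdict

# Python's re module has no POSIX classes, so [A-Za-z0-9] is used here
# as a stand-in for [[:alnum:]] (an assumption for this sketch).
HASHTAG = re.compile(r"#[A-Za-z0-9]+")


def split_to_rows(tweets):
    """Return one (tweet_index, hashtag) row per hashtag found."""
    rows = []
    for index, text in enumerate(tweets):
        for tag in HASHTAG.findall(text):
            rows.append((index, tag))
    return rows


def regroup_by_tweet(rows):
    """Collect the rows back into one list of hashtags per tweet."""
    grouped = defaultdict(list)
    for index, tag in rows:
        grouped[index].append(tag)
    return dict(grouped)


# Made-up sample tweets, not taken from the #MakeoverMonday data set.
sample_tweets = [
    "Great rally tonight! #MAGA #Trump2016",
    "No hashtags here.",
    "Thank you Iowa #VoteTrump!",
]

rows = split_to_rows(sample_tweets)
per_tweet = regroup_by_tweet(rows)

for index, tag in rows:
    print(index, tag)
```

Tweets without a hashtag simply produce no rows, and the trailing exclamation mark in the last sample tweet is left out of the match because it is not alphanumeric. The regroup function plays a similar role to reshaping the rows back to one record per tweet.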


There was some further data preparation needed to get the data into the correct shape to build the visualisation above and the finished workflow can be found here.

2017-01-20_10-20-34.png

Safe.

Ben
