Twitter Analysis for #Emcworld 2014 using SpringXD, Pivotal, Hawq, Isilon and Tableau


 

Did you ever wonder what people were mainly talking about at #EMCWORLD? Which products grabbed most of the attendees' attention? Who were the top Twitter contributors? How many tweets did we get for #EMCWORLD this year?

#EMCWORLD 2014 Twitter Analysis Results

Disclaimer

EMC World ran from the 5th to the 8th of May 2014 in Las Vegas, and I was able to collect around 30,000 tweets. Unfortunately there were many missing tweets due to Twitter API limitations; Twitter restricts the rate at which data can be pulled, so I was not able to consistently get every update. Still, I am covering no less than 70% of the Twitter data for that period, which is usually enough to get an idea of the top things that were going on.

I have done my best to keep this simple; however, due to the limited time I had to put it together, there may be some missing pieces or steps that are not explained in depth. I will be refining them over the next couple of weeks, so I appreciate your patience.

I have used many products to get this going; make sure you read the terms and conditions before downloading and accept them at your own risk (although most of them offer at least a 30-day free trial, and the rest are free for personal/educational purposes).

I have not covered the Isilon storage nodes piece in this article; however, I will be updating it shortly to show you how to connect Isilon to PHD 1.1, so please stay tuned.

Introduction

Running analysis on Twitter is one emerging way of analysing human responses to products, politics and events in general. For many organizations out there, understanding market behaviour comes from what people mention on Twitter, who the most influential tweeters are, and so on.

The outcome of such analysis is usually GOLD to marketing departments and support organizations, which can track customer satisfaction by capturing the outcomes of marketing events, surveys, product support, general product satisfaction and so on.

The Hadoop Part

In this exercise, all tweets for a specific hashtag will be streamed to Hadoop. In my case I am using Pivotal HD and storing the data on Isilon HDFS storage (not local disk). Having scale-out HDFS storage like Isilon usually gives enterprises peace of mind, as you separate your compute layer from the storage layer, protect the data better with tiering of disk pools, and can add more compute without adding storage, while still serving file server, home directory and many more workloads on the same storage. For more information on why Isilon and Hadoop, click here.

SpringXD

Another piece of the puzzle is SpringXD. It allows you to connect to many sources such as mail, HTTP, Twitter, syslog, Splunk and so on; it has ready-made APIs to connect to those sources and grab the information you are after, which is then written to a sink of your choice, e.g. HDFS, file, JDBC, Splunk, etc. In our case the source is Twitter and the sink is HDFS. For more information on how to connect different sources and sinks, click here.

Hawq

Once SpringXD is up and running we need a way to query the data, and that is why we are using Pivotal HAWQ, the fastest SQL-like implementation on Hadoop to date. HAWQ will enable us to create external tables through PXF on the fly; once we are done, we can create views, manipulate the data in SQL if we want to, and export it later on.

Tableau

Tableau comes into play after all the data is gathered, so we can create a nice visualization from a mixture of text data coming out of SpringXD and table data coming out of HAWQ. You can always use different software such as Spotfire from TIBCO, or Cognos.

Now let's start.

Prep Work

– Download the PHD VM 1.1 (it is currently up to 2.0, but unfortunately SpringXD only supports PHD 1.1 at the moment; the same applies to Isilon) from here
– Download SpringXD M6 from here
– Create a Twitter account if you don't have one
– Sign in to dev.twitter.com, create an application and generate keys; take note of all of them.

– Download WinSCP from here and use it to transfer SpringXD to your Pivotal HD host

winscp2

Deploying Pivotal HD

This is a straightforward process: just fire the VM up with VMware Workstation and it should work properly. Make sure you assign enough memory and CPU; by default it takes 4GB of RAM and 1 CPU with 1 core. I would bring it up to 8GB and 2 CPUs, or at least 2 cores, if possible.

By the way, the credentials are usually root/password and gpadmin/Gpadmin1.

Deploying SpringXD

Once you have PHD 1.1 running, move the SpringXD package to /home/gpadmin as the root user, then run the following sequence of commands:
using root:

#chown gpadmin:hadoop /home/gpadmin/<path to SpringXD>
#chmod 770 /home/gpadmin/<path to SpringXD>

using gpadmin:
#unzip /home/gpadmin/<path to SpringXD>

unzipSpringXD
#vi ~/.bashrc
Once you are in the file, add the following rows:
export XD_HOME=<path to SpringXD>
export JAVA_HOME=<path to JDK directory>

springxd export

We now need to start our HDFS platform and SpringXD (using gpadmin):

#~/Desktop/start_all.sh

startup_all

#vi /<path-to-xd>/xd/config/servers.yml

Uncomment all the lines from spring down to fsUri:
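
From what I remember of the M6 configuration file (double-check it against your own copy of servers.yml), the block to uncomment looks roughly like this, with fsUri pointing at your PHD namenode:

spring:
  hadoop:
    fsUri: hdfs://localhost:8020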

servers-yml-config
#~/<path to SpringXD>/xd/bin/start-singlenode --hadoopDistro phd1

starting_springxd

Twitter’s API

Twitter uses two APIs for communication, TwitterSearch and TwitterStream; you can read about the differences in depth here. In my case, TwitterStream kept coming back with rate-limit issues due to the volume of tweets I was getting, but I will still show you how to create both of them. Before that, let's check that everything is ready for SpringXD.

Reformatting Tweets with a Groovy Script

Using gpadmin, run #hadoop fs -ls / ; at this point you should see something like this:
hadoop_ls

If you can't see it, then PHD was not started, so start it up first. After that, let's work on a little script that I have actually borrowed from the guys here. This script, which you can download from here, will allow us to reformat the tweets from JSON; however, we will need to modify it a little bit to also extract the text of the tweets, as in the script below:

import groovy.json.JsonSlurper

// Parse the raw tweet JSON payload delivered by the SpringXD twitter source
def slurper = new JsonSlurper()
def jsonPayload = slurper.parseText(payload)

def fromUser = jsonPayload?.fromUser
def hashTags = jsonPayload?.entities?.hashTags
def followers = jsonPayload?.user?.followersCount
def createdAt = jsonPayload?.createdAt
def languageCode = jsonPayload?.languageCode
def retweetCount = jsonPayload?.retweetCount
def retweet = jsonPayload?.retweet
def text1 = jsonPayload?.text
def id = jsonPayload?.id

// Emit one tab-delimited line per hashtag (or a single line with '-' when
// the tweet has no hashtags) so the output loads cleanly into HAWQ later
def result = ""
if (hashTags == null || hashTags.size() == 0) {
    result = result + jsonPayload.id + '\t' + fromUser + '\t' + createdAt + '\t' + '-' + '\t' + followers + '\t' + text1 + '\t' + retweetCount + '\t' + retweet
} else {
    hashTags.each { tag ->
        if (result.size() > 0) {
            result = result + "\n"
        }
        result = result + jsonPayload.id + '\t' + fromUser + '\t' + createdAt + '\t' + tag.text.replace('\r', ' ').replace('\n', ' ').replace('\t', ' ') + '\t' + followers + '\t' + text1 + '\t' + retweetCount + '\t' + retweet
    }
}

return result

Save this script under ~/<path to SpringXD>/xd/modules/processors/scripts/reformat.script

Creating SpringXD Streams

Now we are ready to create our first stream. Let's start with TwitterSearch; it is usually easier and gets the job done. Remember that the API keys are masked in this example, so make sure you replace them with your own. For this example we will query a hashtag, let it be #Australia, which means we will get all tweets that contain the #Australia hashtag. Make sure you set the delay to at least 10000, otherwise you will hit a rate-limit error very soon.

Using gpadmin, run ~/<path-to-springxd>/shell/bin/xd-shell:

xd-shell> hadoop config fs --namenode local

springxd-hadoop-setup

xd-shell> stream create --name Emc --definition "twittersearch --consumerKey=xxxxxoxRYkIQZzDgMicxxxx --consumerSecret=bZvxxxxx51hld3UXObHlCy4rc6bmXJhEMH3TjGIndDxxxxxN --query='#Australia' --fixedDelay=10000 --outputType=application/json | transform --script=reformat.script | hdfs --rollover=1000" --deploy
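
Once the stream is created, you can confirm it from the same shell; stream list simply lists the streams you have created along with their definitions:

xd-shell> stream list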

Now let's create a stream using the TwitterStream API, in addition to the search one that we have just created:

xd-shell> stream create --name TestTwitterStream --definition "twitterstream --consumerKey=xxxxxoxRYkIQZzDgMicxxxx --consumerSecret=bZvxxxxx51hld3UXObHlCy4rc6bmXJhEMH3TjGIndDxxxxxN --accessToken=xxxxx71851-pf9IWlpWiKT9mgk84pR9K6GFOrZiyWxxxxxxxxx --accessTokenSecret=pgQ2Yxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx --track='#Emc' | hdfs --rollover=1000" --deploy

 

By the way, SpringXD is also supported on Windows, so you are welcome to run it from an external client, as long as you configure the IP address or DNS name when you start xd-shell, using the command:

 

hadoop config fs --namenode hdfs://<hdfs-IP-ADDRESS-OR-DNS>:8020

Validating the Data

Once we have all the tweets required for the analysis (say you were gathering tweets for hours or days on a specific subject; in my case I was tracking the #Emcworld event in Las Vegas for 4 days), let's see what the files look like from the shell, using the following command as the gpadmin user:

#hadoop fs -ls /xd/<the folder name>

hadoop files

You can check the sizes using the following command:

#hadoop fs -du -s -h /xd/

Connecting to HAWQ

Now that all the data is looking good, I need to create my tweets table using the information I have, so I can run some awesome queries and fetch some results. I have used the same command used by some Pivotal experts, like here:

CREATE EXTERNAL TABLE tweets(
id BIGINT, from_user VARCHAR(255), created_at TIMESTAMP, hash_tag VARCHAR(255),
followers INTEGER, tweet_text VARCHAR(255), retweet_count INTEGER, retweet BOOLEAN)
LOCATION ('pxf://pivhdsne:50070/xd/<directory of your tweet texts>*.txt?Fragmenter=HdfsDataFragmenter&Accessor=TextFileAccessor&Resolver=TextResolver')
FORMAT 'TEXT' (DELIMITER = E'\t');

The output will show a table that maps over all the HDFS files ingested into the HAWQ table; from there you can start querying the table and getting the information you are after. Sometimes, depending on how big the files are, you might need to clean up the data, either with another Python script or manually in Excel (which means exporting the data, merging and reformatting it in Excel, then bringing it back to the cluster once you are done). Most of the clean-up tasks will be around deleting null or empty values; you might also want to remove duplicates, if any, or re-sort the tweets.
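
If you prefer to do the basic clean-up in SQL rather than externally, a minimal sketch (the view name is my own, and it assumes the tweets external table above) could look like this:

-- drop empty tweet rows and collapse exact duplicates before analysis
create view tweets_clean as
select distinct * from tweets
where tweet_text is not null and tweet_text <> '';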

Now, let's look at how HAWQ queries the data. For example, let's see how many tweets I have (to enter the psql prompt, just type psql).

Let's see the total number of tweets that we received during EMC World:
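
The exact statement is not shown in the screenshot below; a minimal sketch against the tweets table would be something like this (counting distinct ids, since the reformat script emits one row per hashtag):

select count(distinct id) from tweets;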

 

count

 

Now let's see who the top tweeters were:
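
Again, the query itself is not in the original screenshot; a hedged sketch that produces a tweets-per-user ranking like the one below would be:

select from_user, count(distinct id) as tweets
from tweets
group by from_user
order by tweets desc
limit 20;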

no_of_tweets_per_user

I was also wondering how many retweets we had versus original tweets:
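
A minimal sketch of that comparison, using the retweet boolean column from the external table definition, would be:

select retweet, count(distinct id) as tweets
from tweets
group by retweet;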

non-retweeted

retweeted

Now, I wanted to know how many mentions we had for most of the EMC products, so I created a table containing all the products, and then created a view joining the tweets table and the products table to get the mentions per product, using the code below:

create view products_tweets as select b.name product, count(a.*) product_count from tweets a, emc_products b where a.tweet_text like '%'||b.name||'%' group by b.name order by count(a.*);
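
The emc_products table referenced by the view is not shown in the post; a minimal sketch of how it could be built (the product names here are just illustrative examples) is:

create table emc_products(name varchar(255));
insert into emc_products values ('Isilon'), ('VMAX'), ('VNX'), ('ViPR'), ('XtremIO');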

The result was fascinating
products_count

Exporting Data to Tableau

After all the fun with SQL commands and queries, I wanted a proper visualisation of the results, so I exported the files and the output of the view to text files, ingested them into Tableau, and started playing; below is the outcome.
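
I have not kept the exact export command; one hedged way of doing it from psql on the HAWQ master (the output path is just an example) is to materialise the view and copy it out as a tab-delimited file that Tableau can open:

-- materialise the view, then write it to a local file on the master
create table products_tweets_export as select * from products_tweets;
copy products_tweets_export to '/tmp/products_tweets.txt' with delimiter E'\t';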

If you are new to Tableau, you can go through some free tutorials and examples that will give you an idea of how to start. It is usually pretty simple, unless you start building more complex models, in which case proper training is usually recommended.

 

Below is one of the worksheets in Tableau, showing the top hashtags being used, without running a SQL query.

Tableau Results

Top Hashtags

Remember to comment on or like my post if you enjoyed it, and please feel free to ask any questions. I am regularly updating this post to make sure there are no issues with the downloads, and I am refining the commands regularly to make it easier to deploy such an analysis.

 

