
Using Apache Nifi to Stream Live Twitter Feeds to Hadoop


Introduction

Data sources are producing more data than ever, and with the rise of big data and IoT, enterprises are continuously trying to capture new sources. This creates the need for a way to architect, visualise and monitor data flows, watching for the noise that may become tomorrow's signal and direction for a decision, while also meeting enterprise requirements such as encryption and data quality at read time rather than in post-processing, which demands an extensive amount of time from resources.

Apache Nifi (recently acquired by Hortonworks) is a web-based data flow management and transformation tool, with features like configurable back pressure and configurable latency vs. throughput that allow Nifi to tolerate failures in networks and disks, software crashes, or just human mistakes.

A full description and user guide can be found here.

In this example, we will show how to build a simple Twitter stream data flow that collects tweets mentioning the word “hadoop” over time and pushes them as JSON files into HDFS.

Prerequisites:

  • Hortonworks Sandbox
  • 8 GB of RAM and preferably 4 processor cores.
  • Twitter dev account with OAuth tokens; follow the steps here to set one up if you don’t have one.

Installation:

Once you have downloaded and started the Hortonworks Sandbox, connect to it over ssh. Once you are in, download and extract Apache Nifi using the following commands:

cd /hadoop
wget http://apache.uberglobalmirror.com//nifi/0.2.1/nifi-0.2.1-bin.tar.gz
tar -xvzf nifi-0.2.1-bin.tar.gz

Now that Apache Nifi is extracted, we need to change the HTTP web port from 8080 to 8089 so it doesn’t conflict with Ambari. You can do this by editing /hadoop/nifi-0.2.1/conf/nifi.properties:

nifi.web.http.port=8089
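
If you prefer to make the change from the shell rather than in an editor, a one-liner along these lines should do it (assuming the default port of 8080 and the extraction path used above):

# change the NiFi web UI port from the default 8080 to 8089
sed -i 's/^nifi.web.http.port=8080$/nifi.web.http.port=8089/' /hadoop/nifi-0.2.1/conf/nifi.properties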

Confirm the installation is successful by starting Apache Nifi and logging in to the web interface:

/hadoop/nifi-0.2.1/bin/nifi.sh start

Now point your browser to the IP address of the Sandbox followed by port 8089; in my setup it looks like the following:

http://172.16.61.130:8089/nifi
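
If the page does not load straight away, NiFi may still be starting up; a quick way to check from the shell (assuming the same install path as above):

/hadoop/nifi-0.2.1/bin/nifi.sh status          # confirm the NiFi process is running
tail -f /hadoop/nifi-0.2.1/logs/nifi-app.log   # watch the startup log until the web server is up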

Creating a Data Flow

In order to connect to Twitter, we can either create a whole data flow from scratch or use one of the Twitter templates already available from Apache Nifi here. From these templates we will download the Pull_from_Twitter_Garden_Hose.xml file and place it on your computer:

wget 'https://cwiki.apache.org/confluence/download/attachments/57904847/Pull_from_Twitter_Garden_Hose.xml?version=1&modificationDate=1433234009000&api=v2' -O Pull_from_Twitter_Garden_Hose.xml

Once the template is downloaded, we can go to the web interface and add it by clicking the template button in the left-hand corner, then browsing for the downloaded template.

Template Uploading

Now browse to the XML file downloaded in the previous step and upload it here.

Template Confirmation

You will see the template that you have installed, as marked in red. Once this step is completed, let’s add the template to the workspace and start configuring the processors. You can do that by adding a template using the button in the top right corner, as shown below.

Add Template

Choose Template

Once you add the template, you will end up with something like this:

Twitter Template

You can start exploring each processor, but in a nutshell this is what every processor does:

Grab Garden Hose: connects to Twitter and downloads tweets matching the search terms provided, using the Twitter streaming API.

Pull Key Attributes: evaluates one or more expressions against each tweet; in our case these are SCREEN_NAME, TEXT, LANGUAGE and USERID, and we want to make sure none of them is NULL, otherwise the tweet will not mean much to us. The routing is set to “Matched”, which means only tweets matching the criteria are passed to the next processor (see the sketch after these descriptions).

Find only Tweets: filters out tweets that have no body message, retweets, etc.

Copy of Tweets: this is an output port that copies the tweets into the Apache Nifi folder under $NIFI_HOME/content_repository.
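
To make the Pull Key Attributes step more concrete: the expressions it evaluates are JSON paths into the standard Twitter status object. The exact attribute names are defined in the processor’s Properties tab; as an illustrative sketch they map roughly like this:

SCREEN_NAME: $.user.screen_name
TEXT: $.text
LANGUAGE: $.lang
USERID: $.user.id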

Now, let’s create a new processor that copies the data to HDFS, but before we do that, let’s create the destination folder on Hadoop using the following command:

hadoop fs -mkdir -p /nifi/tweets
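
It is worth verifying the directory exists, and if the PutHDFS processor later fails with a permission error (NiFi writes as whichever user started it), you can open the directory up. That is fine on a throwaway sandbox, though you would set proper ownership on a real cluster:

hadoop fs -ls /nifi                  # confirm the directory was created
hadoop fs -chmod -R 777 /nifi        # only if PutHDFS later reports a permission error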

Now let’s add the processor by clicking on the processor button in the top right corner.

Add Processor

Now, let’s choose the PutHDFS processor; you can easily search for it in the search bar at the top.

putHDFS Processor

Connect the PutHDFS processor to the Find only Tweets processor and choose “tweet” as the relationship.


Now right-click on the PutHDFS processor and choose Configure. You need to decide how the processor will terminate the flow, and since this is the last processor in the flow and it won’t pass any data beyond, tick all the auto-termination boxes.


Then go to the Properties tab and add the locations of hdfs-site.xml and core-site.xml; on the latest 2.3 sandbox they are usually under /etc/hadoop/2.3.0.0-2557/0/. Also don’t forget to add the folder that we created earlier on HDFS as the output directory.
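
For reference, the two PutHDFS properties that matter here end up looking roughly like the following (the config file paths depend on your sandbox version, and the directory is the one created earlier):

Hadoop Configuration Resources: /etc/hadoop/2.3.0.0-2557/0/core-site.xml,/etc/hadoop/2.3.0.0-2557/0/hdfs-site.xml
Directory: /nifi/tweets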


Hopefully after this you will see a red stop button instead of a warning icon on the processor box; if not, check what the error is by hovering the cursor over the warning icon.

Let’s configure the Twitter processor and add the consumer key and access token information. Notice that the consumer secret and the token secret are encrypted and won’t be visible. Also make sure you change the Twitter endpoint to “Filter Endpoint”, otherwise your search terms won’t be active.


As you can see, in the search terms we have added the word “hadoop”. Once you are done, verify that the warning icon has disappeared from the box.
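
To summarise, the Grab Garden Hose (GetTwitter) properties used in this example boil down to something like the following (keys and secrets come from your Twitter dev account):

Twitter Endpoint: Filter Endpoint
Consumer Key: <your consumer key>
Consumer Secret: <your consumer secret>
Access Token: <your access token>
Access Token Secret: <your access token secret>
Terms to Filter On: hadoop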

Now we are ready to start the flow. Simply right-click on each box and select Start; you should begin seeing counters and back pressure information. After a while you can stop the flow and verify that JSON files are now stored on HDFS.
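
A quick way to confirm the data has landed once the flow has run for a minute or two (the file name below is a placeholder; use one from the listing):

hadoop fs -ls /nifi/tweets                                                # list the files NiFi has written
hadoop fs -cat /nifi/tweets/<flowfile-name> | python -m json.tool | head  # peek at one to confirm it is tweet JSON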


nifi-output

You can always add more processors to encrypt, compress or transform the files, for example to Avro or SEQ files, prior to dropping them in HDFS. A complete list of the available processors can be found here.