Data sources are producing more data over time. With the growth of big data and IoT, and with the steady stream of new sources that enterprises try to capture, there is a need for a mechanism to architect and visualise the data flow, to monitor it, and to watch for the noise that may become tomorrow's signal and guide the next decision. Enterprises also require security features such as encryption, and data quality checks at read time rather than in a post-processing step that consumes a great deal of time and resources.
Apache NiFi (whose commercial backer, Onyara, was recently acquired by Hortonworks) is a web-based data flow management and transformation tool. Its distinctive features, such as configurable back pressure and a configurable trade-off between latency and throughput, allow NiFi to tolerate network failures, disk failures, software crashes, or plain human mistakes.
A full description and user guide can be found here.
In this example, we will build a simple Twitter stream data flow that collects tweets mentioning the word “hadoop” over time and pushes them as JSON files into HDFS.
- Hortonworks Sandbox
- 8 GB of RAM and preferably 4 processor cores.
- A Twitter developer account with an OAuth token; follow the steps here to set one up if you don't have one.
Once you have downloaded and started the Hortonworks Sandbox, connect to it over SSH. Once you are in, download and extract Apache NiFi using the following commands:
cd /hadoop
wget http://apache.uberglobalmirror.com//nifi/0.2.1/nifi-0.2.1-bin.tar.gz
tar -xvzf nifi-0.2.1-bin.tar.gz
Now that Apache NiFi is extracted, change the HTTP web port from 8080 to 8089 so that it doesn't conflict with Ambari. You can do this by editing the nifi.web.http.port property in the file /hadoop/nifi-0.2.1/conf/nifi.properties.
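If you would rather not open an editor, the port change can be scripted. This is a minimal sketch: the helper name set_nifi_http_port is made up for illustration, and it assumes the properties file still contains a line starting with nifi.web.http.port= at the path used above.

```shell
# Path from the steps above; adjust if NiFi was extracted elsewhere.
NIFI_PROPS=/hadoop/nifi-0.2.1/conf/nifi.properties

# set_nifi_http_port FILE PORT (hypothetical helper):
# rewrite the nifi.web.http.port line in place, keeping a .bak copy.
set_nifi_http_port() {
    sed -i.bak "s/^nifi\.web\.http\.port=.*/nifi.web.http.port=$2/" "$1"
}

# Only touch the file if it exists, then show the resulting line.
if [ -f "$NIFI_PROPS" ]; then
    set_nifi_http_port "$NIFI_PROPS" 8089
    grep '^nifi.web.http.port=' "$NIFI_PROPS"
fi
```

The original file is kept next to it as nifi.properties.bak, so the change is easy to undo.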
Confirm that the installation was successful by starting Apache NiFi and logging in to the web server.
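Starting NiFi is done with the nifi.sh script that ships in the bin folder of the archive. A rough sketch, assuming the install location used above:

```shell
# Install location from the extraction step; adjust if yours differs.
NIFI_HOME=/hadoop/nifi-0.2.1

if [ -x "$NIFI_HOME/bin/nifi.sh" ]; then
    "$NIFI_HOME/bin/nifi.sh" start    # launch NiFi in the background
    "$NIFI_HOME/bin/nifi.sh" status   # report whether it is running
    # Startup messages are written to this log file.
    tail -n 20 "$NIFI_HOME/logs/nifi-app.log"
else
    echo "nifi.sh not found under $NIFI_HOME/bin"
fi
```

NiFi can take a minute or so to come up, so the log may still be filling when you first look at it.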
Now point your browser to the IP address of the Sandbox followed by port 8089; in my setup it looks like the following:
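If the page doesn't load in the browser, you can check from the SSH session whether the web server answers at all. This is only a sketch: nifi_url is a made-up helper, and 127.0.0.1 assumes you run it inside the Sandbox itself.

```shell
# Build the UI address; NiFi serves its web UI under the /nifi path.
nifi_url() {
    echo "http://$1:$2/nifi"
}

# Assumption: run from inside the Sandbox, so localhost reaches NiFi.
URL=$(nifi_url 127.0.0.1 8089)

# Print only the HTTP status code; 200 or a redirect code means the UI is up.
curl -s -o /dev/null --max-time 5 -w '%{http_code}\n' "$URL" \
    || echo "NiFi not reachable at $URL"
```

From your own machine, replace 127.0.0.1 with the Sandbox IP address.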
Creating a Data Flow
To connect to Twitter, we can either create a whole data flow from scratch or use a Twitter template that is already available among the Apache NiFi templates here. From these templates, download the pull_from_twitter_garden_hose.xml file and save it on your computer.
Once the template is downloaded, go to the web UI and add the template by clicking the template button in the left corner, then browse for the downloaded template to add it.
Now browse to the XML file downloaded in the previous step and upload it here.
You will see the template that you have installed, marked in red. Once this step is complete, let's add the template to the workspace and start configuring the processors. You can do that by adding a template using the button in the top right corner, as follows:
Once you add the template, you will end up with something like this:
Grab Garden Hose: connects to Twitter and downloads tweets based on the search terms provided, using the Twitter Streaming API.
Pull Key Attributes: evaluates one or more expressions against the tweet to set a group of values; in our case these are SCREEN_NAME, TEXT, LANGUAGE and USERID. We want to make sure these values are not NULL, otherwise the tweet will not mean much to us. The criteria is set to “Matched”, which means only matching tweets are passed to the next processor.
Find only Tweets: filters out tweets that have no message body, retweets, etc.
Copy of Tweets: an output port that copies the tweets into the Apache NiFi folder under $NIFI_HOME/content_repository.
Now let's create a new processor that copies the data to HDFS. Before we do that, let's create the destination folder on Hadoop using the following command:
hadoop fs -mkdir -p /nifi/tweets
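It is worth confirming that the folder really exists before wiring up the processor. A small sketch; check_hdfs_dir is a made-up helper name wrapped around the standard hadoop fs -test -d check.

```shell
# check_hdfs_dir PATH (hypothetical helper):
# report whether PATH exists as a directory on HDFS.
check_hdfs_dir() {
    if hadoop fs -test -d "$1"; then
        echo "ok: $1"
    else
        echo "missing: $1"
    fi
}

# Only run the check where the hadoop client is installed.
if command -v hadoop >/dev/null 2>&1; then
    check_hdfs_dir /nifi/tweets
fi
```

Keep in mind that the user running NiFi also needs write permission on this folder, otherwise the HDFS processor will fail later on.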
Now let's add the processor by clicking the processor button in the top right corner.
Now let's choose the PutHDFS processor; you can easily find it using the search bar at the top.
Connect the Find only Tweets processor to the PutHDFS processor and choose “tweet” as the relationship.
Now right-click the PutHDFS processor and choose Configure. You need to specify how the processor terminates the flow; since this is the last processor in the flow and it won't pass any data further, tick all the auto-terminate boxes.
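Then go to the Properties tab and add the locations of hdfs-site.xml and core-site.xml in the Hadoop Configuration Resources property, separated by a comma. If you downloaded the latest 2.3 Sandbox, they are usually under /etc/hadoop/2.3.0.0-2557/0/, for example /etc/hadoop/2.3.0.0-2557/0/core-site.xml. Also don't forget to set the Directory property to the folder that we created earlier on HDFS, /nifi/tweets.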
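The version folder in that path differs between Sandbox builds, so it can be easier to search for the two files than to guess. A minimal sketch, where find_hadoop_conf is a made-up helper and /etc/hadoop is assumed to be the root of the configuration folders:

```shell
# find_hadoop_conf DIR (hypothetical helper):
# list every core-site.xml and hdfs-site.xml below DIR, following symlinks.
find_hadoop_conf() {
    find -L "$1" \( -name core-site.xml -o -name hdfs-site.xml \) 2>/dev/null | sort
}

find_hadoop_conf /etc/hadoop
```

The same file may show up more than once if folders are symlinked; any one pair of paths from the same folder will do for the processor property.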
Hopefully, after this you should see a red stop icon instead of a warning icon on the processor box. If not, hover the cursor over the warning icon to see what the error is.
Let's configure the Twitter processor and add the consumer key and access token information. Notice that the consumer secret and the token secret are encrypted and won't be visible. Also make sure you change the Twitter endpoint to “Filter Endpoint”, otherwise your search terms won't be active.
As you can see, we have added the word “hadoop” to the search terms. Once you are done, verify that the warning icon has disappeared from the box.
Now we are ready to start the flow: simply right-click each box and choose Start. You should start seeing numbers and back pressure information. After a while, you can stop the flow and verify that the JSON files are now stored on HDFS.
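One way to do that verification from the SSH session is sketched below. The helper first_hdfs_file is made up for illustration; it picks the first regular file out of the hadoop fs -ls listing and assumes file names without spaces.

```shell
TWEET_DIR=/nifi/tweets

# first_hdfs_file (hypothetical helper): read `hadoop fs -ls` output on
# stdin and print the path of the first regular file (lines starting with -).
first_hdfs_file() {
    awk '/^-/ { print $NF; exit }'
}

if command -v hadoop >/dev/null 2>&1; then
    hadoop fs -ls "$TWEET_DIR" | tail -n 5      # a few of the stored files
    hadoop fs -du -s -h "$TWEET_DIR"            # total size collected so far
    sample=$(hadoop fs -ls "$TWEET_DIR" | first_hdfs_file)
    # Peek at the start of one tweet to confirm that it is JSON.
    [ -n "$sample" ] && hadoop fs -cat "$sample" | head -c 500
fi
```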
You can always add more processors to encrypt, compress, or transform the files, for example to Avro or sequence files, before dropping them in HDFS. A complete list of the available processors can be found here.