Tag Archives: ned shawa

Using Apache Nifi to Stream Live Twitter Feeds to Hadoop



Data sources are producing more data over time. With the evolution of big data and IoT, and the continuous stream of new sources that enterprises are trying to capture, there is a need for a mechanism to architect and visualise the data flow, and to monitor the noise that could become a signal and drive a decision the next day. On top of that come enterprise requirements for security, like encryption, and for data quality at read time rather than in post-processing, which takes an extensive amount of time from resources.

Apache Nifi (whose backing company, Onyara, was recently acquired by Hortonworks) is a web-based data flow management and transformation tool with unique features like configurable back pressure and configurable latency vs. throughput trade-offs, which allow Nifi to tolerate failures in networks, disks and software, or just human mistakes.

A full description and user guide can be found here.

In this example, we will show how to build a simple Twitter stream data flow that collects tweets mentioning the word “hadoop” over time and pushes them as JSON files into HDFS.


Prerequisites:

  • Hortonworks Sandbox
  • 8GB of RAM and preferably 4 processor cores
  • Twitter dev account with an OAuth token; follow the steps here to set it up if you don’t have one


Once you have downloaded and started the Hortonworks Sandbox, connect to it over SSH. Once you are in, download and extract Apache Nifi using the following commands:

cd /hadoop
wget http://apache.uberglobalmirror.com//nifi/0.2.1/nifi-0.2.1-bin.tar.gz
tar -xvzf nifi-0.2.1-bin.tar.gz

Now that Apache Nifi is extracted, we need to change the HTTP web port from 8080 to 8089 so it doesn’t conflict with Ambari. You can do this by editing the file /hadoop/nifi-0.2.1/conf/nifi.properties.
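If you prefer not to open an editor, the port change can be sketched as a one-line substitution. This is a minimal sketch that assumes the port is stored under the `nifi.web.http.port` key in nifi.properties; the helper name `set_nifi_http_port` is made up for illustration, so check the file before running it.

```shell
# Rewrite the web HTTP port in a Nifi properties file.
# $1 = path to nifi.properties, $2 = new port number
# Assumes the key is nifi.web.http.port (verify it in your copy of the file).
set_nifi_http_port() {
  sed "s/^nifi\.web\.http\.port=.*/nifi.web.http.port=$2/" "$1" > "$1.tmp" &&
    mv "$1.tmp" "$1"
}

# Example on the Sandbox (path taken from the step above):
#   set_nifi_http_port /hadoop/nifi-0.2.1/conf/nifi.properties 8089
#   grep '^nifi.web.http.port=' /hadoop/nifi-0.2.1/conf/nifi.properties
```

The function writes to a temporary file and moves it into place, so it does not depend on any particular `sed -i` behaviour.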


Confirm the installation is successful by starting Apache Nifi and logging in to the web server:

/hadoop/nifi-0.2.1/bin/nifi.sh start

Now point your browser to the IP address of the Sandbox followed by port 8089; in my setup it looks like the following:

Creating a Data Flow

In order to connect to Twitter, we can either create a whole data flow from scratch or use a Twitter template that is already available among the Apache Nifi templates here. From these templates, download the Pull_from_Twitter_Garden_Hose.xml file and place it on your computer.

wget -O Pull_from_Twitter_Garden_Hose.xml "https://cwiki.apache.org/confluence/download/attachments/57904847/Pull_from_Twitter_Garden_Hose.xml?version=1&modificationDate=1433234009000&api=v2"

Once the template is downloaded, go to the web UI, click the template button in the left corner and browse for the downloaded template to add it.

Template Uploading

Now browse to the XML file downloaded in the previous step and upload it here.

Template Confirmation

You will see the template that you have installed, as marked in red. Once this step is completed, let’s add the template to the workspace and start configuring the processors. You can do that by adding a template using the button in the top right corner, as follows:

Add Template

Choose Template

Once you add the template, you will end up with something like this:

Twitter Template

You can explore every processor, but in a nutshell this is what each processor does:

Grab Garden Hose: connects to Twitter and downloads the tweets matching the search terms provided, using the Twitter streaming API.

Pull Key Attributes: evaluates one or more expressions against the set values; in our case these are SCREEN_NAME, TEXT, LANGUAGE and USERID. We want to make sure the value is not NULL for those, otherwise the tweet will not mean much to us. The criteria is set to “Matched”, which means only tweets that match will be passed to the next processor.

Find only Tweets: filters out tweets that have no message body, retweets, etc.

Copy of Tweets: this is an output port that copies the tweets into the Apache Nifi folder under $NIFI_HOME/content_repository.

Now let’s create a new processor that copies the data to HDFS. Before we do that, let’s create the destination folder on Hadoop using the following command:

hadoop fs -mkdir -p /nifi/tweets

Now let’s add the processor by clicking the processor button in the top right corner.

Add Processor

Now let’s choose the PutHDFS processor; you can easily search for it in the search bar at the top.

PutHDFS Processor

Connect the Find only Tweets processor to the PutHDFS processor and choose “tweet” as the relationship.

Now right-click on the PutHDFS processor and choose Configure. You need to choose how the processor will terminate the flow; since this is the last processor in the flow and it won’t pass any data further, tick all the auto-termination boxes.

Then go to the Properties tab and add the locations of hdfs-site.xml and core-site.xml; they are usually under /etc/hadoop/ if you downloaded the latest 2.3 Sandbox. Also don’t forget to add the folder that we created earlier on HDFS.

Hopefully after this you should get a red stop button instead of a warning icon on the processor box; if not, check what the error is by hovering the cursor over the warning icon.

Let’s configure the Twitter processor and add the consumer key and the access token information. Notice that the consumer secret and the token secret are encrypted and won’t be visible. Also make sure you change the Twitter endpoint to “Filter Endpoint”, otherwise your search terms won’t be active.

As you can see, we have added the word “hadoop” to the search terms. Once you are done, verify that the warning icon has disappeared from the box.

Now we are ready to start the flow: simply right-click on each box and choose Start. You should start seeing numbers and back pressure information. After a while you can stop the flow and verify that JSON files are now stored on HDFS.
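One way to do that last check from the Sandbox shell is to list the destination folder and count the files in it. The sketch below assumes the /nifi/tweets folder created earlier and the usual `hadoop fs -ls` listing, where file entries start with “-” in the permissions column; `count_hdfs_files` is just an illustrative helper name.

```shell
# Count regular files in `hadoop fs -ls` output read from stdin.
# File entries start with "-" in the permissions column; directories start with "d".
count_hdfs_files() {
  grep -c '^-'
}

# Example on the Sandbox (requires the hadoop CLI):
#   hadoop fs -ls /nifi/tweets | count_hdfs_files
#   hadoop fs -ls /nifi/tweets | head
```

If the count keeps growing while the flow is running, tweets are landing in HDFS.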


You can always add more processors to encrypt, compress or transform the files, for example to Avro or SequenceFile format, before dropping them in HDFS. A complete list of the available processors can be found here.

PACS Systems and Scale out NAS

Today I had an interesting meeting with an intelligent engineer from a major PACS provider, discussing how the technology is evolving and how a true scale-out NAS can add value, reduce the amount of work needed and provide a better user experience when it comes to PACS.

I have personally worked around HIS, RIS and imaging systems for a long time, knowing only the basics. I was responsible for designing and implementing the infrastructure for over 20 hospitals that would eventually host HIS, LAB, RIS, PACS, etc.

Our discussion today was around the challenges, how PACS integrates with RIS, and so on. I started the conversation with “How can storage help?”, and the first answer he came up with was “By avoiding PACS migrations!!” Then we started talking about the pains of PACS migrations and the problems everyone faces. Before I get to migrations, let me take you through PACS and the components around it:


PACS stands for Picture Archiving and Communication System. From the name you can tell we are mainly managing pictures; the source of these pictures can be a variety of imaging machines like CT, MRI, X-ray, etc. PACS organizes the pictures in a specific order and connects them to RIS via certain keys (patient no., visit, etc.), although you can run PACS as a standalone system for small practices where RIS is not necessary.

PACS uses DICOM as the standard for transmitting, receiving and storing the images. PACS is usually application/database software that runs on a normal transactional database (e.g. Oracle/SQL, etc.); most of the time PACS runs a standalone database to store the DICOM image headers.

PACS has many vendors in the market such as GE, AGFA, FUJI, KODAK, MCKESSON, ACUO, etc. Many vendors are now moving to virtualize PACS by providing it as a virtual appliance, to ease deployment and reduce costs. However, the integration with RIS/HIS is still manual due to the different HIS/RIS suppliers out there and their different communication methods.


RIS (Radiology Information System) usually takes care of scheduling machine availability and appointments, storing results and findings, and billing (especially if it’s a standalone RIS/PACS with no HIS). Further features can be derived from the HIS (Hospital Information System) if it is available and integrated with RIS. Integration with RIS is through another standard called HL7.


DICOM stands for Digital Imaging and Communications in Medicine. It is mainly a standard that handles all the different operations of transmitting, receiving and storing images, with its own file format. It was developed by members of NEMA (National Electrical Manufacturers Association), which currently holds the copyright for DICOM.

How does it work?

Once a modality (a medical term referring to an imaging device) operator starts the imaging procedure, the modality starts transmitting the images to the PACS system, and the PACS system responds with error messages if any failure occurs during the transmission phase. Once all the images are received in the PACS, it starts creating headers (metadata) for the images. The header is usually derived from RIS/HIS, or entered manually if RIS/HIS doesn’t exist. The header is written according to the DICOM format and usually contains tags, which are numbers that translate into standard key attributes (e.g. (0008,0070) refers to the modality manufacturer); the same goes for patient ID, visit and over 100 other attributes (mostly fetched from HIS/RIS). Eventually all the attributes are stored in the PACS database.
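To make the tag idea more concrete, here is a tiny lookup sketch covering a handful of well-known tags from the DICOM data dictionary. The function name `dicom_tag_name` is made up for illustration, and a real PACS reads these from the full dictionary rather than from a hard-coded list.

```shell
# Map a few DICOM (group,element) tags to their attribute names.
# Only four common tags are listed here; anything else returns "unknown".
dicom_tag_name() {
  case "$1" in
    "0008,0060") echo "Modality" ;;
    "0008,0070") echo "Manufacturer" ;;
    "0010,0010") echo "Patient's Name" ;;
    "0010,0020") echo "Patient ID" ;;
    *) echo "unknown" ;;
  esac
}

dicom_tag_name "0008,0070"   # prints: Manufacturer
dicom_tag_name "0010,0020"   # prints: Patient ID
```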


So why do we need to migrate our PACS data?

There are many reasons to migrate PACS, like a vendor change, VNA, archiving, etc., which all take us back to the same basic element in migrations:

Storage Limitation

Most non-scale-out storage systems usually fall victim here: once the storage reaches its maximum capacity, or it’s time for new storage, or you are migrating archival PACS to separate storage, or the new vendor suggests different storage to work with, or you are changing protocols, eventually you have to take the old storage out and bring the new one in. But guess what!! You have to migrate all the data to the new storage first 🙂

PACS Migration

Imagine an MRI machine producing approximately 400 images per patient, each image around 4MB. It’s booked for 1 hour per visit, and you can have 8 patients a day. Let’s say the imaging center is moving to new storage after 3 years; according to simple math (about 12.8GB per working day) we have to worry about around 10TB of data for that one machine. Now start adding CT scanning (around 100 images) and X-ray, and multiply by the number of machines. It’s not rare to see petabytes at some hospitals; depending on how long the images are retained and the compliance regulations around retention, the data stored can get really massive.
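The simple math above can be sketched in a few lines of shell. The number of operating days per year is my assumption (it is not stated above): roughly 250 working days lands close to the 10TB figure, while running every day of the year gives about 14TB. The function name is illustrative only.

```shell
# Estimate raw image storage in GB (decimal, 1GB = 1000MB).
# $1 images per patient, $2 MB per image, $3 patients per day,
# $4 operating days per year (assumption), $5 years of retention
pacs_storage_gb() {
  echo $(( $1 * $2 * $3 * $4 * $5 / 1000 ))
}

pacs_storage_gb 400 4 8 250 3   # working days only: prints 9600
pacs_storage_gb 400 4 8 365 3   # running every day: prints 14016
```

Either way, one modality already sits in the 10TB range, before headers, CT, X-ray or additional machines are counted.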

Migrating data takes one of two approaches. The first is the easy one, which mainly moves images with no headers; this is usually quick and easy, and depends on how big your data is and how fast you can transmit it (what media you are using). The other approach is the dirty one, which uses the vendor migration tools and can take several months to years, especially when moving between vendors: each image header is inspected, the database is queried, and a new database record is inserted in the new system. A complete pre-assessment engagement usually takes place first to check the integrity of the files and headers; after cleanup and assessment, the migration can start in phases, with a complete assessment after each phase, including log reviews, etc.

Common Issues with PACS Migration

One of the most common problems affecting data integrity is the header (metadata) of the images. Let’s say the patient ID was 1234, there was no HIS/RIS integration, and the operator manually entered the wrong patient ID, 1235, at the time of the scan. Once the images are transmitted to PACS, the header of the image will have 1235 as the patient ID, and so will the PACS. Based on DICOM, some tags can never change; therefore even if you update the PACS with the correct patient ID, only a reference to the correct patient is written in the PACS, pointing to the wrong patient ID in the header. So you end up with 1234 in PACS and 1235 in the header, with PACS keeping a link between them. Such problems are fixed during the pre-assessment stage, which is why it may be time-consuming and costly.

Choosing the right storage

Choosing scale-out NAS over traditional storage can eliminate the need for storage migrations, for the following reasons:

– Can expand on the fly with no interruption

– Can be upgraded on the fly with no interruption

– Can add slower nodes with cheaper disks for archiving on the fly, using file policies to move the images to the slower disks without changing the path of the files (so the database can still reference the images at the same path)

Building Smart Home Labs

Of course there will be many people out there asking why we need a home lab when we can leverage virtualization to build a lab on our laptop. You know what, you are completely right, but how far can you go?

Now, home labs are not for everyone. First you need to be passionate about technology: trying new stuff, making stuff work, testing, testing, testing until you achieve what you are trying to do. At some point I was able to demonstrate full NAS home directories and shares with Windows/Mac clients integrated over Active Directory with a virtual VNX storage, all from my lab through a VPN. The customer thought it was our EMC lab in Germany 🙂 I was impressed.

Below is one of the impressive home labs from another blog


Building your first home lab can look deceptively easy at the beginning, and you will think it’s a piece of cake. Yes, you might be an SE who did many architectures for banks and governments, but believe me, a smart home lab is nothing like what you thought; wait until problems start to arise and threaten your investment in the equipment you already bought.

Identifying the purpose

Do you want to experiment with the latest IT technologies? Or maybe you are after a specific experience with a specific software? Network? Virtualization? Storage? Or even gaming? Of course each purpose has a completely different shopping list and design, and eventually a different way of putting stuff together.


As a rule of thumb, make sure you look at second-hand prices on eBay: if there is a 30% difference compared to buying new, then go with eBay, unless you are in a good financial position with a comfy budget to build your lab. A home lab will usually cost you anything between $300 and $20,000, so start by identifying what you are after.

In general, you have to watch for the following each time you plan to buy a new toy for your lab:

  • Power Consumption (Watts)
  • Heat in BTU/hr
  • Noise in dB

Power Consumption

Usually this is the most constant running expense that you will face after the lab is up and running, and based on the power consumption you will decide how long the lab should be powered on for. With the right components you may be able to keep your lab up and running most of the time with minimal effect on your monthly electricity bill.

Most of the recent Intel CPUs are anywhere between 80% and 95% power efficient, meaning power is drawn based on the usage of the cores rather than simply because the CPU is powered on, so in an idle state you are looking at great power efficiency when it comes to your CPUs.

As an example of what wattage you should be looking for, a micro server with a Centrino CPU consumes around 35W, a single-CPU workstation is around 135W, a dual-CPU workstation is around 160W, and a 4-CPU enterprise-class server can go up to 700W.
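To turn those wattages into something you can compare against your bill, here is a rough sketch that converts a constant draw into kWh per 30-day month and an estimated cost. The $0.25 per kWh price is a made-up placeholder, so plug in your own tariff, and keep in mind that real draw varies with load.

```shell
# kWh used in a 30-day month for a constant draw.
# $1 = watts, $2 = hours powered on per day
monthly_kwh() {
  awk -v w="$1" -v h="$2" 'BEGIN { printf "%.1f\n", w * h * 30 / 1000 }'
}

# Estimated monthly cost. $3 = price per kWh (placeholder value below).
monthly_cost() {
  awk -v w="$1" -v h="$2" -v p="$3" 'BEGIN { printf "%.2f\n", w * h * 30 / 1000 * p }'
}

monthly_kwh 35 24           # micro server, always on: 25.2
monthly_kwh 160 24          # dual-CPU workstation, always on: 115.2
monthly_kwh 700 24          # 4-CPU enterprise server, always on: 504.0
monthly_cost 160 24 0.25    # at a hypothetical $0.25/kWh: 28.80
```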


Unless you have garage space or a sound-isolated room, noise is one of the important factors for you and your housemates; running noisy labs can lead to serious relationship problems! Most enterprise-class servers can be noisy because of their powerful redundant fans. Try to avoid enterprise-class gear, since enterprise features are not a real concern in home labs; of course this doesn’t apply if you can spare a couple of meters in your garage for the lab.


One of the common problems people run into after setting up home labs is the amount of heat they produce. Make sure to watch out for BTU/hr to avoid overheating the room.
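If a spec sheet only lists watts, you can approximate the heat output yourself, since 1 watt is about 3.412 BTU/hr. The sketch below assumes that practically all the power a machine draws ends up as heat in the room; the function name is illustrative.

```shell
# Convert a power draw in watts to heat output in BTU/hr (1 W is about 3.412 BTU/hr).
watts_to_btu_hr() {
  awk -v w="$1" 'BEGIN { printf "%.0f\n", w * 3.412 }'
}

watts_to_btu_hr 160   # dual-CPU workstation: 546
watts_to_btu_hr 700   # 4-CPU enterprise server: 2388
```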

Servers Vs Workstations

The first thing you might think about is whether to buy a server or a workstation. This is going to be your decision, as there is good and bad in each, but due to the diversity of products and manufacturers it’s really difficult to make a definitive list. I personally tend to like workstations since they can be roomier in IO slots (PCI, disks, etc.), but on the other hand you sometimes end up cabling the disks and PCI cards in a Mickey Mouse way just to get stuff working (e.g. missing disk power cables, SAS cables, extensions, fittings, etc.).

Servers and workstations carry the same CPUs most of the time. The motherboards are usually different, allowing servers to hold more memory, while in a 3-year-old workstation the maximum you can typically get is around 32GB due to the smaller motherboard and the limited number of DIMM slots.

An average dual-CPU workstation with at least 16GB of memory shouldn’t cost you more than $800. In home labs, memory is more important than CPU if you are after a multipurpose lab; if you will be running many parallel compute workloads (e.g. VMs), then more CPU is definitely required.

Servers, on the other hand, are a great option if noise is not an issue: they are usually scalable, can host more memory DIMMs and CPUs, and are upgradable and easy to work with, though they might be a bit more power-hungry than workstations in some cases.

But again, since we are talking about smart home labs here, workstations are the way to go for me.

The Shopping List


  • Management PC

This can usually be your desktop at home or your personal laptop. I still prefer having something fixed at home, but there is no harm in using a laptop for that.

There is no need for a powerful PC or workstation here, just the minimum to run an operating system; 4GB of memory should be sufficient.

  • Server/Workstation

Here is where the grunt should be. Try to invest in this, since it will last you for years. Compare prices; my preference is usually workstations due to the lower power consumption, noise and heat. Make sure you compare CPUs: more cores is better, but more memory is more important. Small things to look for are the number of PCIe slots (the more the better) and the number of supported disks (caddies; again, the more the better). Check the prices of aftermarket memory since you might need to expand later, and check parts availability on eBay. Finally, bargain, bargain, bargain: eBay sellers out there are negotiable, so make sure you talk to many sellers before you make the purchase, and make sure the seller gives you at least a 30-day warranty.

I can’t tell you which CPU and how many. I would prefer a 2-socket capable machine so I can expand later if needed. Memory should be at least 8GB; 12GB is preferable, but again it depends on what you are planning to run on it.

  • Storage (I will elaborate a bit here, I am a NAS SE at the end of the day 🙂 )

Here is where it becomes a bit confusing. The options are endless, but as a summary you can have your storage as follows:

DAS: you can use the internal disks of your workstation. Make sure you have different speeds and capacities, e.g. a mix of SSD and SATA disks. Check your options with SATA3 vs. SATA2: spinning disks are usually cheaper for SATA2, but the difference is massive when it comes to SSDs. You can always buy a $30 PCIe SATA3 card from eBay, so don’t worry much about your workstation’s native SATA3 support.

USB 3.0 can’t be neglected: speeds are up to around 170MB/s, and the USB 3.1 spec has been approved for 10Gb/s, which will make a big difference in the future. It will get messier with USB, though, due to the limited scalability and the tangle of cables you will end up with when you expand; plus you need a USB3 PCIe card, again around $15.

SAN: if you have the budget, nothing is more appealing than the speed and latency of Fibre Channel. I rarely see home labs with fibre due to the price and the equipment, because you usually need a storage controller, these are usually enterprise-class, and you won’t be able to have a smart home lab with enterprise storage. But since 8Gb/s and 16Gb/s are what businesses are using now, you can grab real bargains on 4Gb/s FC these days.

NAS: usually this is the way to go. It’s scalable, you can connect your home stuff to it as well, it’s easy to deploy and migrate later on, and finally it’s cheap. Make sure it supports SATA3 and 1GbE interfaces, and jumbo frames if you care about them.

IOMEGA, Lenovo, etc. sell ready-made NAS solutions that scale from a single disk to over 48 disks. A built-in OS usually provides full NAS capabilities, e.g. NFS and CIFS, and some SAN capability such as iSCSI.


Pros:

  • Low cost, as a single-disk NAS can start from $150
  • Built-in RAID capabilities on selected models
  • Leverages your network architecture
  • Some models support jumbo frames
  • Low power consumption

Cons:

  • Gets more costly once advanced capabilities are needed (e.g. jumbo frames, 10GbE, support for more disk drives, snapshots, RAID)
  • Locked into a certain OS: each manufacturer supports its own OS on the NAS box, so there is no room for manual intervention, CLI access, or advanced and customized features (e.g. pools of storage, tiering, etc.)
  • Only selected types of disks are supported on different models
  • Mostly software RAID is used, which affects the overall performance

MYO (Make your own)

Many tech-savvy people have started to build their own NAS box with their own flavor of OS on top of it to provide NAS features (e.g. FreeNAS, ZFS, Linux, Windows, etc.).

Pros:

  • The most cost-effective option
  • No lock-in to a certain NAS OS with certain capabilities
  • You can choose what features you need
  • Your choice of OS
  • Your choice of disks

Cons:

  • You might need to buy a hardware RAID card if you want protection with low impact on performance
  • It might get a bit technical, and requires advanced knowledge of different operating systems
  • Assembling all the parts together might be time-consuming, especially if you are getting used equipment

  • Network switch

A switch is important, and no, you can’t skip it. There are many on eBay starting from $10 used. Get at least 4 ports; 8 or more is recommended. Stay away from enterprise switches even if they are cheap, as they are usually power-hungry and noisy. Jumbo frame support is a great advantage if you care about changing network frame sizes in the future for testing purposes. A managed switch is great, VLANs are nice to have, and 1GbE support is a must. Make sure you buy your cables as well; local shops are usually expensive when it comes to cables, so buy them online too.

  • Your choice of disks (min. 2)

Whatever you buy for your storage, make sure you get it diskless unless it is a bargain, and choose your own disks. MSI in Australia is a great supplier with very competitive online pricing, and you get warranty through them as well. Try to avoid used disks due to wear: SSDs have limited writes, spinning disks have moving parts that wear out, and every disk has an MTBF. When you buy SATA disks, make sure they are 7,200 RPM with cache; it’s usually the same price anyway. I usually prefer Barracuda or Seagate. SATA3 support doesn’t matter for spinning SATA disks, since you will never be able to saturate a SATA2 interface with 7,200 RPM disks anyway, but for SSDs it makes a big difference. Samsung is my favorite SSD brand, but you can buy any brand.

  • Software Licenses

Can you imagine that most of the cost sits here? Unless you plan well, you will end up paying top $$$ just for licenses. Lucky for me, I work for EMC and I get VMware licenses for free. If you don’t work for EMC or VMware, try to get workstations that come with a Windows license. If you are planning to use Linux, it’s much better; Ubuntu and Mint are my favorites. VMware will still provide you with 60-day licenses for any product.


  • RAID card

Whether it’s for your workstation if you are using DAS, or for your storage if you are going MYO, RAID cards are a good way of getting proper hardware-level RAID with a lower performance penalty. Having cache on the RAID card is good but usually costly. Keep your eye on LSI MegaRAID cards on eBay and grab a used one for a bargain; you might need to re-image or upgrade the firmware, which is not easy but is worth it.

  • Multiple Monitors

It might be a bit fancy or luxurious, but it’s always handy to have more than one display. Displays are cheap, and you can get a used one for $50, but you must have a display card that supports multiple monitors. NVS series cards from NVIDIA are usually cheap and handy, and there are many of them on eBay. Cards for higher video demands, such as gaming if you care about it, are more costly, but again it depends on how you use your home lab.

Finally, posted below are my home lab design and a picture of how it looks. I am still looking at buying a half-size cabinet to keep all my workstations stacked in, and I keep growing the lab every month, so good luck with yours.