
Apache Zeppelin Walk Through with Hortonworks


Introduction

Apache Zeppelin (in the Apache Incubator at the time of writing this post) is one of my favourite tools to position and present to anyone interested in analytics. It is 100% open source, with an intelligent international team behind it in Korea (NFLabs, soon moving to San Francisco), and it is mainly based on the interpreter concept, which allows any language or data-processing backend to be plugged into Apache Zeppelin.

It is very similar to IPython/Jupyter, except that the UI is arguably more appealing and the set of supported interpreters is richer; at the time of writing this blog, Zeppelin already shipped with a long list of interpreters out of the box.

With this rich set of interpreters, onboarding platforms like Apache Hadoop, or Data Lake concepts in general, becomes much easier: data is consolidated in one place, and different organizational units with different skill sets can access it and perform their day-to-day duties on it, such as data discovery, queries, data modelling, data streaming and, finally, data science using Apache Spark.
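
As a quick illustration of the interpreter concept, every notebook paragraph starts with a % directive that selects the backend it runs against, and the same note can mix several of them. A minimal sketch using the shell interpreter (the HDFS path here is just a hypothetical example):

%sh
# this paragraph is handed to the shell interpreter and runs as plain bash
# list a (hypothetical) dataset before exploring it with other interpreters
hadoop fs -ls /data/health_expenditure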

Apache Zeppelin Overview

With the notebook style editor and the ability to save notebooks on the fly, you can end up with some really cool notebooks, whether you are a data engineer, data scientist or a BI specialist.

Zeppelin Notebook Example

Dataset showing the Health Expenditure of the Australian Government over time by state.

Zeppelin also has basic, clean visualization views integrated with it, and it gives you control over what you want to include in your graph by dragging and dropping fields into your visualization, as below:

Zeppelin Drag and Drop
The sum of government budget healthcare expenditure in Australia by State

Also when you are done with your awesome notebook story, you can easily create a report out of it and either print it or send it out.

Car Accidents Fatalities in Melbourne

Car accident fatalities related to drink driving, showing the most fatal days on the roads and the most fatal accident types when alcohol is involved

Playing with Zeppelin

If you have never played with Zeppelin before, then visit this link for a quick way to start working with it using the latest Hortonworks tutorial. We are including Zeppelin as part of HDP as a technical preview, and official support may follow. Check it out here and try out the different interpreters and how they interact with Hadoop.

Zeppelin Hub

I was recently given access to the beta version of Hub. Hub is supposed to make life in organizations easier when it comes to sharing notebooks between different departments or people within the organization.

Let's assume an organization has Marketing, BI and Data Science practices. The three departments overlap with each other when it comes to the datasets being used, so there is no longer any need for each department to work completely isolated from the others: they can share their experience, show off their notebooks, and collaborate on the same notebook when it is complicated or when different skills are required.

Zeppelin Hub UI

Let's have a deeper look at Hub…

Hub Instances

An instance is backed by a Zeppelin installation somewhere (server, laptop, Hadoop cluster, etc.). Every time you create a new instance, a new token is generated; this token should be added to your local Zeppelin installation in /incubator_zeppelin/conf/zeppelin-env.sh, e.g.

export ZEPPELINHUB_API_TOKEN="f41d1a2b-98f8-XXXX-2575b9b189"

Once the token is added, you will be able to see the notebooks online whenever you connect to Hub (http://zeppelin.hub.com).
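
For the change to take effect, the local Zeppelin instance typically needs a restart after zeppelin-env.sh is edited. A minimal sketch, assuming Zeppelin is installed under /incubator_zeppelin:

# restart the Zeppelin daemon so it picks up the new ZEPPELINHUB_API_TOKEN
/incubator_zeppelin/bin/zeppelin-daemon.sh restart
# check that it came back up
/incubator_zeppelin/bin/zeppelin-daemon.sh status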

Hub Spaces

Once an instance is added, you will be able to see all the notebooks for each instance. Since every space is effectively either a department or a category of notebooks that needs to be shared across certain people, you can easily drag and drop notebooks into spaces, making them shared within that specific space.

Adding a Notebook to a Space

Showing a Notebook inside Zeppelin Hub

Very cool !

Since it is a beta, there is still a lot of work to be done, like executing notebooks from Hub directly, resizing and formatting, and some other minor issues. I am sure the all-star team at NFLabs will make it happen very soon, as they always have.

If you are interested in playing with the beta, you may request access on the Apache Zeppelin website here.

Hortonworks and Apache Zeppelin

Hortonworks is heavily adopting Apache Zeppelin, which shows in the contributions they have made to the product and to Apache Ambari. Ali Bajwa, one of the rockstars at Hortonworks, created an Apache Zeppelin View for Ambari, which gives Zeppelin authentication and gives users a single pane of glass for uploading datasets via the HDFS view in Apache Ambari Views, along with other operational needs.

Apache Ambari with Zeppelin View Integration

Apache Zeppelin Notebook editor from Apache Ambari

If you want to integrate Zeppelin in Ambari with Apache Spark as well, simply follow the steps at this link.

Helium

Project Helium is a revolutionary change in Zeppelin: it allows you to integrate almost any standard HTML, CSS or JavaScript as a visualization or a view inside Zeppelin.

A Helium application consists of a view, an algorithm and access to a resource; you can get more information about Helium here.

Using Apache Nifi to Stream Live Twitter Feeds to Hadoop


Introduction

Data sources are producing more data over time, big data and IoT keep evolving, and enterprises continuously try to capture new sources. This calls for a mechanism to architect and visualise the data flow, to monitor it, and to watch for the noise that may become signal and drive a decision the next day, along with enterprise requirements such as security (e.g. encryption) and data quality at read time rather than in a post-processing step that demands an extensive amount of time from resources.

Apache Nifi (recently acquired by Hortonworks) is a web-based data flow management and transformation tool, with unique features like configurable back pressure and configurable latency vs. throughput, which allow Nifi to tolerate failures in the network or disks, software crashes, or just human mistakes…

A full description and user guide can be found here.

In this example, we will show how to build a simple Twitter stream data flow that collects tweets mentioning the word “hadoop” over time and pushes them into JSON files in HDFS.

Prerequisites:

  • Hortonworks Sandbox
  • 8GB of RAM and preferably 4 processor cores.
  • Twitter dev account with an OAuth token; follow the steps here to set it up if you don't have one.

Installation:

Once you have downloaded and started the Hortonworks Sandbox, connect to it over SSH. Once you are in, download and extract Apache Nifi using the following commands:

cd /hadoop
wget http://apache.uberglobalmirror.com//nifi/0.2.1/nifi-0.2.1-bin.tar.gz
tar -xvzf nifi-0.2.1-bin.tar.gz

Now that Apache Nifi is extracted, we need to change the HTTP web port from 8080 to 8089 so it doesn't conflict with Ambari. You can do this by editing /hadoop/nifi-0.2.1/conf/nifi.properties:

nifi.web.http.port=8089
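
If you prefer to make the change non-interactively, a one-liner like the following does the same thing (a sketch, assuming the default port 8080 is still set in the file):

# switch the web port from 8080 to 8089 and confirm the change
sed -i 's/^nifi.web.http.port=8080/nifi.web.http.port=8089/' /hadoop/nifi-0.2.1/conf/nifi.properties
grep '^nifi.web.http.port' /hadoop/nifi-0.2.1/conf/nifi.properties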

Confirm the installation is successful by starting Apache Nifi and logging into the web UI:

/hadoop/nifi-0.2.1/bin/nifi.sh start

Now point your browser to the IP address of the Sandbox followed by port 8089; in my setup it looks like the following:

http://172.16.61.130:8089/nifi
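
If the page does not come up, the Nifi logs are the first place to look; a quick sketch, assuming the default log location under the extracted folder:

# application log (flow errors, processor issues)
tail -n 100 /hadoop/nifi-0.2.1/logs/nifi-app.log
# bootstrap log (useful if the JVM failed to start at all)
tail -n 50 /hadoop/nifi-0.2.1/logs/nifi-bootstrap.log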

Creating a Data Flow

In order to connect to Twitter, we can either create a whole data flow from scratch or use a Twitter template that is already available among the Apache Nifi templates here. From these templates we will download the Pull_from_Twitter_Garden_Hose.xml file and place it on your computer:

wget "https://cwiki.apache.org/confluence/download/attachments/57904847/Pull_from_Twitter_Garden_Hose.xml?version=1&modificationDate=1433234009000&api=v2" -O Pull_from_Twitter_Garden_Hose.xml

Once the template is downloaded, go back to the web UI and add it by clicking the template button in the left-hand corner, then browse for the downloaded file to add it.

Template Uploading

Now browse to the XML file downloaded in the previous step and upload it here.

Template Confirmation

You will see the template you have installed (marked in red). Once this step is completed, let's add the template to the workspace and start configuring the processors. You can do that by adding a template using the button in the top right corner, as shown below.

Add Template

Choose Template

Once you add the template, you will end up with something like this:

Twitter Template

You can explore each processor yourself, but in a nutshell this is what each one does:

Grab Garden Hose: connects to Twitter and downloads tweets matching the search terms provided, using the Twitter streaming API.

Pull Key Attributes: evaluates one or more expressions against the set values, in our case SCREEN_NAME, TEXT, LANGUAGE and USERID. We want to make sure those values are not NULL, otherwise the tweet will not mean much to us; the routing is set to “Matched”, which means only flow files matching the criteria are passed to the next processor.

Find only Tweets: filters out tweets that have no body message, retweets, etc.

Copy of Tweets: an output port that copies the tweets into the Apache Nifi folder under $NIFI_HOME/content_repository.

Now let's create a new processor that copies the data to HDFS, but before we do that, let's create the destination folder on Hadoop using the following command:

hadoop fs -mkdir -p /nifi/tweets
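
Depending on which user Nifi runs as on your sandbox, you may also need to open up permissions on that folder so the processor we are about to add can write to it. A quick, permissive sketch that is fine for a sandbox but not for production:

# assumption: Nifi does not run as the folder's owner, so relax permissions for the demo
hadoop fs -chmod -R 777 /nifi/tweets
hadoop fs -ls /nifi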

Now let's add the processor by clicking the processor button in the top right corner.

Add Processor

Now let's choose the PutHDFS processor; you can easily search for it in the search bar at the top.

putHDFS Processor

Connect the PutHDFS processor to the Find only Tweets processor and choose “tweet” as the relationship.


Now right-click the PutHDFS processor and choose configure. You need to decide how the processor terminates the flow; since this is the last processor in the flow and it won't pass any data beyond it, tick all the auto-terminate boxes.


Then go to the Properties tab and add the hdfs-site.xml and core-site.xml locations; they are usually under /etc/hadoop/2.3.0.0-2557/0/ if you are using the latest 2.3 sandbox. Also, don't forget to add the folder we created earlier on HDFS as the output directory.
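
If you are not sure where these configuration files live on your sandbox version, it is easy to check from the shell first; a small sketch (the symlinked /etc/hadoop/conf location is usually the safest bet):

ls -l /etc/hadoop/conf/core-site.xml /etc/hadoop/conf/hdfs-site.xml
# or search the whole /etc/hadoop tree for the versioned copies
find /etc/hadoop -name core-site.xml 2>/dev/null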


Hopefully after this you will get a red stop button instead of a warning icon on the processor box; if not, check what the error is by hovering the cursor over the warning icon.

Let's configure the Twitter processor and add the consumer key and access token information. Notice that the consumer secret and the token secret are encrypted and won't be visible. Also make sure you change the Twitter endpoint to “Filter Endpoint”, otherwise your search terms won't be active.



As you can see, we have added the word “hadoop” to the search terms. Once you are done, verify that the warning icon has disappeared from the box.

Now we are ready to start the flow. Simply right-click on each box and choose start; you should begin to see counters and back pressure information. After a while you can stop it and verify that JSON files are now stored in HDFS.
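
A quick way to verify from the command line (file names will differ on every run, so take one from the listing):

hadoop fs -ls /nifi/tweets
# peek at the beginning of one of the landed JSON files; replace the name with one from the listing above
hadoop fs -cat /nifi/tweets/<one-of-the-listed-files> | head -c 1000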


You can always add more processors to encrypt, compress or transform the files, for example to Avro or SequenceFile format, prior to dropping them into HDFS. A complete list of the available processors can be found here.

Connecting to Hive Thrift Server on Hortonworks using SQuirreL SQL Client


Introduction

Hive is one of the most commonly used databases on Hadoop, and the number of Hive users is roughly doubling every year thanks to the amazing enhancements and the addition of Tez and Spark, which let Hive move past the MapReduce era to in-memory execution and changed how people use Hive.

In this blog post, I will show you how to connect the SQuirreL SQL Client to Hive. The concept is similar for any other client out there; as long as you use open-source libraries matching the ones listed here, you should be fine.

Prerequisite

Download the following:

  • Hortonworks Sandbox with HDP 2.2.4
  • SQuirreL SQL Client

Step 1

Follow the SQuirreL documentation and run it on your Mac or PC.

Step 2

Follow the Hortonworks HDP installation guide for VirtualBox, VMware or Hyper-V and start up the virtual instance.

Step 3

Once HDP is up and running, connect to it using SSH as shown on the console. Once you are connected, you need to collect some JAR files in order to establish the connection.

Step 4

If you are using macOS, simply search for the following JARs while you are connected to your HDP instance, using the command:

root> find / -name JAR_FILE

Once you find the file you need, copy it to your laptop/PC using SCP:

root> scp JAR_FILE yourMacUser@yourIPAddress:/PATH_TO_JARS

The files you should look for are the following (versions will differ based on which sandbox you are running, but different versions are unlikely to cause a problem); a sketch that finds and copies them all in one go follows the list:

  • commons-logging-1.1.3.jar
  • hive-exec-0.14.0.2.2.4.2-2.jar
  • hive-jdbc-0.14.0.2.2.4.2-2.jar
  • hive-service-0.14.0.2.2.4.2-2.jar
  • httpclient-4.2.5.jar
  • httpcore-4.2.5.jar
  • libthrift-0.9.0.jar
  • slf4j-api-1.7.5.jar
  • slf4j-log4j12-1.7.5.jar
  • hadoop-common-2.6.0.2.2.4.2-2.jar
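
Rather than copying the JARs one by one, a small loop can do the work. This is only a sketch: it assumes the base names above, picks the first match it finds for each, and reuses the same placeholder user, host and destination path as the scp example earlier.

# run this on the HDP sandbox
for jar in commons-logging hive-exec hive-jdbc hive-service httpclient httpcore libthrift slf4j-api slf4j-log4j12 hadoop-common; do
  # the first match for each base name is good enough for a sandbox
  f=$(find / -name "${jar}-*.jar" 2>/dev/null | head -n 1)
  [ -n "$f" ] && scp "$f" yourMacUser@yourIPAddress:/PATH_TO_JARS/
done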

If you are running Windows, you might need to install WinSCP in order to grab the files from their locations.

Step 5

Once all the JARs above are downloaded to your local machine, open up SQuirreL, go to Drivers and add a new driver.

JDBC Driver Configuration for Hive

Name: Hive Driver (could be anything else you want)
Example URL: jdbc:hive2://localhost:10000/default
Class Name: org.apache.hive.jdbc.HiveDriver
Go to Extra Class Path and add all the JARs you downloaded.

You may change the port number or IP address if you are not running with the defaults.

Step 6

Log in to your Hadoop sandbox and verify that HiveServer2 is running using:

netstat -anp | grep 10000

If nothing is running, you can start HiveServer2 manually from the shell:

hiveserver2 &
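
Before going back to SQuirreL, you can also sanity-check the JDBC endpoint from the sandbox itself with Beeline, which ships with Hive. A minimal sketch, assuming the default port and the hive user:

# -u is the JDBC URL, -n the user, -e the statement to execute
beeline -u jdbc:hive2://localhost:10000/default -n hive -e "show databases;"
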
Step 7

Once you verify HiveServer2 is up and running, you are ready to test the connection in SQuirreL by creating a new alias as follows:

Alias Creation for JDBC connection

You are now ready to connect; once the connection is successful, you should get a screen like this:

Connection Established Screen

Step 8 (Optional)

With your first Hive query, SQuirreL can be buggy and complain about memory and heap size. If this ever occurs and you are on a Mac, right-click the app icon –> Show Package Contents –> open Info.plist and add the following snippet:

<key>Java</key> 
 <dict>
 <key>VMOptions</key> 
 <array> 
 <string>-Xms128m</string> 
 <string>-Xmx512m</string> 
 </array> 
</dict>

Now you can enjoy…

Hadoop+Strata San Jose 2015 Highlights

Being at Hadoop+Strata 2015 in San Jose is like being transported to the future and seeing what software technology will look like less than a decade from now.

I travelled to Hadoop+Strata all the way from Melbourne, Australia to catch up with all the new vendors in the market and bring that expertise back to the rest of the world, as many of these products focus on their own region.

Now let's get down to business…

Apache Spark

Apache Spark still looks like the way to go, with MapReduce fading and Apache Mahout being transitioned to Spark MLlib. Spark used to lack persistent table storage for databases; we could get away with registering temporary tables, but that is now changing with the announced BlinkDB, which will hold the metastore and table storage.

Spark Roadmap

R on Spark is a very aggressive move, as it eliminates the need to program R outside of the Spark ecosystem, allowing analytics within the ecosystem using the same RDDs (Resilient Distributed Datasets) that live in memory.

The main benefit of the Spark platform is the ability to stream data using Spark Streaming, query data using Spark SQL (previously Shark) and apply analytics using MLlib and R, all within the same platform and all utilizing the same RDDs held in memory. This boosts performance, since the data is moved to memory at the start, eliminating the need to read from spinning disks as long as there is sufficient memory to handle the data; even partially in memory works, as Spark by nature spills over to disk when there is not enough memory to keep all the datasets in.

The question I get asked all the time is: can Spark be used independently from Hadoop? The answer is yes; Spark can run in standalone mode or on a cluster manager like Apache Mesos. The only problem is that Spark then becomes another silo, rather than being integrated into Hadoop and utilizing the data stored by the mixed workloads in the Data Lake.
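
For anyone curious what standalone mode looks like in practice, this is roughly the sequence on a single box, using the scripts that ship with Spark (the host name is a placeholder and script names may differ slightly between Spark releases):

# from the Spark installation directory
./sbin/start-master.sh                              # standalone master; its web UI defaults to port 8080
./sbin/start-slave.sh spark://your-host:7077        # start a worker and point it at the master
./bin/spark-shell --master spark://your-host:7077   # interactive shell against the standalone cluster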

Unified Data Storage on Hadoop

As part of setting standards for working with Hadoop files, there is a big push for data standardization using Avro and Parquet, in order to avoid the chaos of different file formats and extensions.

Avro is a row-based storage format for Hadoop, while Parquet is columnar. Depending on how the data will be used and what type of file it is, we can easily decide how to store it. Most of the time Avro is the way to go, unless we are working with a dataset that has plenty of columns and we want to avoid scanning unneeded columns that are rarely used; in that case we can select the Parquet format and pre-select our columns in order to optimize queries in Hadoop.

Cloudera was showing the Kite SDK, which can easily convert data into Avro or Parquet format; we can also integrate Kite to do the conversion at the point of ingestion if needed. Most of the current databases on Hadoop can read Avro and Parquet formats (Pivotal HAWQ, Cloudera Impala, Hive, Spark, etc.).

Hive on Tez vs Hive on Spark

The future of relational databases is here now; from my point of view, it is only a matter of time until companies start replacing their existing traditional databases with Hadoop ones like Pivotal HAWQ or Apache Hive. I attended a smart talk by a smart performance engineer, Mostafa Mokhtar from Hortonworks, where they benchmarked every aspect of standalone Hive against Hive on Tez and Hive on Spark. We had been hearing a lot of noise about how disruptive Hive on Spark would be; surprisingly, Hortonworks was able to show that Hive on Tez is more than 70% faster than Hive on Spark, although Spark SQL is more than 60% faster than Hive on Spark and probably faster than Hive on Tez as well.

This is great news for current Hive users, as it is obvious that Hive on Tez is the way to go here.

The Rise of Analytics, Hadoop and The Data Lake – Analytics

In my current role, I am lucky enough to get insights from different customers about the rise of their analytics story, and about the baby steps companies are taking to get faster insights on their customer base before a competitor does, so they can retain that customer base and possibly grow it. Along with the baby steps, some giant mistakes are happening that set companies a thousand steps back, especially in a field where application and storage FINALLY meet beyond performance requirements and IOPS discussions, and instead talk about governance of the data, storage, access, protection and, most importantly, automation.

Over the last two decades of data storage, and with more and more applications being developed in organizations, data has ended up in storage silos: some of it is still tier one, served from tier-one SAN storage, while other data, such as shared folders and web logs, is stored on a slower, more economical and denser tier, which could be NAS storage.

The last big step in the database world was data warehousing. Many databases were living isolated from each other, not allowing for a collaborative way of achieving analytics efficiently in an organization, and over the last decade much of the CIO's task list was about the data warehousing strategy.

Analytics on databases is a way of running complex (sometimes not so complex) queries or searches against one or more data sources; others may call it data mining. In my opinion, there are three different stages of analytics.

Stage 1 started with single-database analytics. Stage 2 started a decade ago, with analytics performed on multiple data sources and databases, mostly using data warehousing where the data was largely structured. Stage 3 is the one organizations are discovering or just about to implement: analytics based on the same old structured data we used to query, plus the unstructured data.

Handling structured data is so mature now that most people understand that, for example, a relational database is needed to handle customer order records (although NoSQL databases are a valid replacement in this market). Unstructured data came with its own challenges; I could write a 3000-word article on each of them, but in short they are data ingestion and flow, where tools like Apache Storm and Pivotal Spring XD excel, and the software storage layer, where a file system like Hadoop is the preferred way to go (depending on the case you may use MongoDB, Cassandra, etc.). A friend of mine actually forwarded me a great article on when to use what: http://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis

Unstructured data ranges from machine-generated data (e.g. server logs), web logs (cookies, customers' movement on the web, tracking, etc.), social media like Twitter and Facebook, and mobile phone beacons (Bluetooth and WiFi), to any other information that relates to your business, customers or product.

These days organizations are taking analytics really seriously; I see job ads everywhere for data engineers and data scientists (yes, there is a major difference). I have met tons of these data scientists here, and no, not all of them are R experts. Some of them only know a couple of visualization tools like Tableau but are able to get great insights from the data; some of them don't even know basic shell commands on Unix, and some of them are Microsoft Windows-only users.

Retail, city councils, airports, insurance, finance, banking, government, defense, and even small startups have started doing something around analytics; in fact some of them talk about innovation and Gartner's pace-layered approach (open Data Lake). In airports we talk about flight delay prediction and Bluetooth beacons; city councils put parking sensors on the streets to send the inspector when time has expired on an occupied bay, and manage traffic lights using beacons; retail needs better insights, not the overnight ones from the EDW, and has to deal with a massively increasing footprint of data; insurers collect data from mobile apps and correlate customers with their information in order to retain them; health insurers are giving away Fitbits to offer better discounts to active, healthy customers; and I can't talk about government and defense.

In the next episodes I will be connecting more dots together from a technology perspective: Hadoop, in-memory databases, MPP databases, reporting tools, lambda architecture, Gartner material, etc. Stay tuned…

Building Smart Home Labs

Of course, there will be many people out there saying: why do we need a home lab when we can leverage virtualization to build a lab on our laptop? You know what, you are completely right, but how far can you go?

Home labs are not for everyone. First, you need to be passionate about technology, trying new stuff, making stuff work, and testing, testing, testing until you achieve what you are trying to do. At some point I was able to demonstrate full NAS home directories and shares, with Windows/Mac integration over Active Directory and a virtual VNX storage array, all from my lab through a VPN; the customer thought it was our EMC lab in Germany 🙂 I was impressed.

Below is one of the impressive home labs from another blog.


Building your first home lab can be deceptive at the beginning; you will think it's a piece of cake. Yes, you might be an SE who has designed many architectures for banks and governments, but believe me, a smart home lab is nothing like what you imagined. Wait until the problems start to arise and threaten the investment in the equipment you already bought.

Identifying the purpose

Do you want to experiment with the latest IT technologies? Or maybe you are after specific experience with a specific piece of software? Networking? Virtualization? Storage? Or even gaming? Of course, each purpose ends up with a completely different shopping list, design and build.

Considerations

As a rule of thumb, make sure you look at second-hand prices on eBay: if there is a 30% difference compared to buying new, go with eBay, unless you are in a good financial position with a comfortable budget to build your lab. A home lab will usually cost you anything between $300 and $20,000, so start by identifying what you are after.

In general, you have to watch for the following each time you plan to buy a new toy for your lab:

  • Power Consumption (Watts)
  • Heat in BTU/hr
  • Noise in dB

Power Consumption

This is usually the most constant running expense you will face once the lab is up and running, and based on the power consumption you will decide how long the lab should stay powered on. With the right components you may be able to keep your lab running most of the time with minimal effect on your monthly electricity bill.

Most recent Intel CPUs are anywhere between 80% and 95% power efficient, meaning power is drawn as the cores are used rather than simply because the CPU is powered up, so in an idle state you are looking at great power efficiency from your CPUs.

As an example of what wattage you should be looking at: a micro server with a Centrino CPU consumes around 35W, a single-CPU workstation is around 135W, a dual-CPU workstation is around 160W, and a 4-CPU enterprise-class server can go up to 700W.

Noise

Unless you have garage space or a sound-isolated room, noise is one of the important factors for you and your housemates; running noisy labs can lead to serious relationship problems! Most enterprise-class servers can be noisy because of their powerful redundant fans. Try to avoid enterprise class, since enterprise features are not a real concern in home labs, but of course this doesn't apply if you can spare a couple of square meters in your garage for the lab.

Heat

One of the common problems people run into after setting up a home lab is the amount of heat it produces; make sure to watch the BTU/hr figures to avoid overheating the room and raising the temperature.

Servers Vs Workstations

The first thing you will think about is whether to buy a server or a workstation. This is your decision, as there are pros and cons to each, and due to the diversity of products and manufacturers it's really difficult to make a definitive list. I personally tend to like workstations since they can be roomier in I/O slots (PCI, disks, etc.), but on the other hand you sometimes end up cabling the disks and PCI cards in a Mickey Mouse way just to get things working (e.g. lack of disk power cables, SAS cables, extensions, fitting, etc.).

Servers and workstations mostly carry the same CPUs, but the motherboards are usually different, allowing more memory in servers, while in a three-year-old workstation the maximum you might get is around 32GB due to the smaller motherboard and the limited number of DIMM slots.

An average dual-CPU workstation shouldn't cost you more than $800 with at least 16GB of memory. In home labs, memory is more important than CPU if you are after a multipurpose lab, unless you will be running many parallel compute farms (e.g. VMs), in which case more CPU is definitely required.

Servers, on the other hand, are a great option if noise is not an issue: they are usually scalable, can host more memory DIMMs and CPUs, and are upgradable and easy to work with, although they might be a bit more power-hungry than workstations in some cases.

But again, since we are talking smart home labs here, workstations are the way to go for me.

The Shopping List

Mandatory

  • Management PC

This can be your desktop at home or your personal laptop; I prefer having something fixed at home, but there is no harm in using a laptop for this.

There is no need for a powerful PC or workstation here, just the minimum to run an operating system; 4GB of memory should be sufficient.

  • Server/Workstation

This is where the grunt should be, so try to invest here since it will last you for years. Compare prices; my preference is usually workstations due to the lower power consumption, noise and heat. Make sure you compare CPUs (more cores is better, and more memory is even more important). Smaller things to look for are the number of PCIe slots (the more the better) and the number of supported disks/caddies (again, the more the better). Check the prices of aftermarket memory since you might need to expand later, check parts availability on eBay, and finally bargain, bargain, bargain: all eBay sellers are negotiable, so talk to many sellers before you buy and make sure the seller gives you at least 30 days' warranty.

I can't tell you which CPU and how many; I would prefer a machine capable of two sockets so I can expand later if needed, but memory should not be less than 8GB, and 12GB is preferable, though again it depends on what you are planning to run on it.

  • Storage (I will elaborate a bit here, I am a NAS SE at the end of the day 🙂 )

Here is where it becomes a bit confusing; the options are endless, but in summary you can set up your storage as follows:

DAS: you can use the internal disks of your workstations. Make sure you have different speeds and capacities, e.g. SSD and SATA mixed, and check your options for SATA 3 vs. SATA 2, depending on whether you want SATA 3 support or not. Disks are usually cheaper for SATA 2, but the difference is massive when it comes to SSDs; you can always buy a $30 PCIe SATA 3 card from eBay, so don't worry much about your workstation's native SATA 3 support.

USB 3.0 can't be neglected: speeds reach up to around 170MB/s, and the USB 3.1 spec has been approved for 10Gb/s, which will make a big difference in the future. It does get messier with USB due to the limited scalability and the mesh of cables you end up with as you expand, and you will also need a USB 3 PCIe card, again around $15.

SAN: if you have the budget, nothing is more appealing than the speed and latency of Fibre Channel. I rarely see home labs with fibre due to the price and equipment, because you usually need a storage controller, which is usually enterprise gear, and you won't be able to have a smart home lab with enterprise storage. But since 8Gb/s and 16Gb/s is what businesses are using now, you can grab real bargains on 4Gb/s FC these days.

NAS: usually this is the way to go. It's scalable, you can connect your home devices to it as well, it's easy to deploy and migrate later on, and finally it's cheap. Make sure it supports SATA 3 interfaces and 1GbE network interfaces, and make sure it supports jumbo frames if you care about them.

Iomega, Lenovo and others sell ready-made NAS solutions that scale from a single disk up to over 48 disks; a built-in OS usually provides full NAS capabilities, e.g. NFS and CIFS, plus some SAN capability such as iSCSI.

Advantages

  • Low cost: a single-disk NAS can start from $150
  • Built-in RAID capabilities on selected models
  • Leverages your network architecture
  • Some models support jumbo frames
  • Low power consumption

Disadvantages

  • Gets more costly once advanced capabilities are needed (e.g. jumbo frames, 10GbE, support for more disk drives, snapshots, RAID)
  • Locked into a certain OS: each manufacturer supports only their own OS on the NAS box, so there is no room for manual intervention, CLI, or advanced and customized features (e.g. storage pools, tiering, etc.)
  • Only selected types of disks are supported on different models
  • Mostly software RAID is used, which affects overall performance

MYO (Make your own)

Many tech-savvy people have started to build their own NAS box with their flavour of OS on top to provide NAS features (e.g. FreeNAS, ZFS, Linux, Windows, etc.).

Advantages

  • The most cost-effective option
  • No lock-in to a certain NAS OS with certain capabilities
  • You can choose which features you need
  • Your choice of OS
  • Your choice of disks

Disadvantages

  • You might need to buy a hardware RAID card if you want protection with a low impact on performance
  • Might get a bit technical, and requires advanced knowledge of different operating systems
  • Assembling all the parts together might be time-consuming, especially if you are getting used equipment

  • Network switch

The switch is important, and no, you can't skip it. There are many on eBay starting from $10 used. Get at least 4 ports (8 or more is recommended), and stay away from enterprise switches even if they are cheap; they are usually power-hungry and noisy. A switch with jumbo frame support is a great advantage if you care about changing network frame sizes for testing in the future; a managed switch is great, VLANs are nice to have, and 1GbE support is a must. Make sure you buy your cables too: local shops are usually expensive when it comes to cables, so buy those online as well.

  • Your choice of disks (min. 2)

Whatever you buy for your storage, make sure you get it diskless unless it is a bargain, and choose your own disks. MSI in Australia is a great supplier with very competitive online pricing, and you get warranty through them as well. Try to avoid used disks due to wear: SSDs have limited writes, spindle disks rotate, which wears the spindle arm, and there is an MTBF for every disk. When you buy SATA disks, make sure they are 7200 RPM with cache; it's usually the same price anyway, and I usually prefer Seagate Barracuda. SATA 3 support doesn't matter for spinning SATA disks, since you will never saturate a SATA 2 interface with 7200 RPM spindles anyway, but for SSDs it makes a big difference. Samsung is my favourite SSD, but you can buy any brand.

  • Software Licenses

Can you believe that most of the cost sits here? Unless this is planned well, you will end up paying top dollar just for licenses. Lucky for me, I work for EMC and get VMware licenses for free, but if you don't work for EMC or VMware, try to get workstations that come with a Windows license. If you are planning to use Linux it's much better; Ubuntu and Mint are my favourites. VMware will still provide you with 60-day licenses for any product.

Optional

  • RAID card

Whether it's for your workstation if you are using DAS, or for your storage if you are building your own, RAID cards are a good way of getting proper hardware-level RAID with a lower performance penalty. Cache on RAID cards is good but usually costly. Keep your eye on LSI MegaRAID cards on eBay and grab a used one for a bargain; you might need to re-image or upgrade the firmware, which is not easy, but it is worth it.

  • Multiple Monitors

It might be a bit fancy or luxurious, but it's always handy to have more than one display. Displays are cheap; you can get a used one for $50, but you must have a graphics card that supports it. The NVS series from NVIDIA is usually cheap and handy, and many are on eBay. Higher video demands for gaming, if you care about that, will be more costly, but again it depends on how you use your home lab.

Finally, below are my home lab design and a picture of how it looks. I am still looking at buying a half-size cabinet to keep all my workstations stacked in, and I keep growing the lab every month, so good luck with yours.

homelab