
The Rise of Analytics, Hadoop and The Data Lake – Analytics

In my current role I am lucky to get insights from different customers into how their analytics stories are taking shape, and into the baby steps companies are taking to learn more about their customer base before a competitor does, so they can retain that base and possibly grow it. Alongside the baby steps, some giant mistakes are being made that set a company back a thousand steps, especially in a field where application and storage FINALLY meet outside of performance requirements and IOPS discussions, and instead talk about data governance, storage, access, protection and, most importantly, automation.

Over the last two decades of data storage, with more and more applications being developed in organizations, data ended up in silos of storage. Some of it is still tier one data served from tier one SAN storage; the rest is shared folders and web logs stored on a slower, more economical and denser tier, which could be NAS storage.

The last big step in the database world was data warehousing. Before it, many databases were living isolated from each other, which did not allow for a collaborative, efficient way of achieving analytics in an organization. Over the last decade, the data warehousing strategy sat high on most CIOs' task lists.

Analytics on databases is a way of creating complex (sometimes not) queries or searches against one or more data sources; others may call it data mining. In my opinion, there are three different stages of analytics.

Stage 1 started with single database analytics. Stage 2 started a decade ago with analytics performed on multiple data sources and databases, mostly using data warehousing, where the data was mostly structured. Stage 3, which organizations are discovering now or are just about to implement, is analytics based on the same old structured data that we used to query, in addition to unstructured data.

Handling structured data is so mature now that most people understand that I need a relational database to handle my customer order records, for example (although NoSQL databases are a valid replacement in this market). Unstructured data came in with its own challenges. I could write a 3,000 word article on each of them, but in short, for this post, they are data ingestion and flow, where tools like Apache Storm and Pivotal Spring XD excel, and the software storage layer, where a file system like Hadoop (HDFS) is the preferred way to go, although depending on the case you may use MongoDB, Cassandra, etc. A friend of mine forwarded me a great article on when to use what: http://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis

Unstructured data ranges from machine-generated data (e.g. server logs) and web logs (cookies, customers' movements on the web, tracking, etc.) to social media like Twitter and Facebook, mobile phone beacons (Bluetooth and Wi-Fi) and any other information that relates to your business, customers or product.

These days organizations are taking analytics really seriously. I see job ads everywhere for data engineers and data scientists (yes, there is a major difference). I have met tons of these data scientists here, and no, not all of them are R experts. Some of them only know a couple of visualization tools like Tableau but are still able to get great insights from the data, some of them don't even know basic shell commands on Unix, and some of them are Microsoft Windows only users.

Retail, city councils, airports, insurance, finance, banking, government, defense and even small startups have started doing something around analytics. In fact, some of them talk about innovation and Gartner's Pace-Layered approach (Open Data Lake). In airports we talk about flight delay prediction and Bluetooth beacons. City councils use parking sensors on the streets to send an inspector when time has expired for an occupied bay, and manage traffic lights using beacons. Retail needs better insights than the overnight ones from the EDW, and has a massively increasing data footprint. Insurance collects data from mobile apps and correlates customers with their information in order to retain its footprint, and health insurers are giving away Fitbits so they can offer better discounts to active, healthy customers. I can't talk about government and defense.

In the next episodes I will be connecting more dots together from a technology perspective: Hadoop, in-memory databases, MPP databases, reporting tools, lambda architecture, Gartner stuff, etc. Stay tuned.

PACS Systems and Scale out NAS

Today I had an interesting meeting with an intelligent engineer from a major PACS provider, discussing how the technology is evolving and how a true scale-out NAS can add value, reduce the amount of work needed and provide a better user experience when it comes to PACS.

I have personally worked around HIS, RIS and imaging systems for a long time while only knowing the basics. I was responsible for designing and implementing the infrastructure for over 20 hospitals that would eventually host HIS, LAB, RIS, PACS, etc.

Our discussion today was around the challenges, how PACS integrates with RIS, and so on. I started the conversation with “How can storage help?” The first answer he came up with was “By avoiding PACS migrations!!” Then we started talking about the pains of PACS migrations and the problems everyone faces. Before I get to migrations, let me take you through PACS and the components around it:

PACS

PACS stands for Picture Archiving and Communication System. From the name you can tell we are mainly managing pictures. The source of these pictures can be a variety of imaging machines like CT, MRI, X-ray, etc. PACS organizes the pictures into a specific order and connects them to RIS via certain keys (patient number, visit, etc.), although you can run PACS as a standalone system for small practices where RIS is not necessary.

PACS uses DICOM as the standard for transmitting, receiving and storing images. PACS is usually application/database software that runs on a normal transactional database (e.g. Oracle/SQL, etc.); most of the time PACS runs a standalone database to store the DICOM image headers.

There are many PACS vendors in the market, such as GE, Agfa, Fuji, Kodak, McKesson, Acuo, etc. Many vendors are now moving to virtualizing PACS by providing it as a virtual appliance, to ease deployment and reduce costs. However, the integration with RIS/HIS is still manual due to the different HIS/RIS suppliers out there and their different communication methods.

RIS

RIS (Radiology Information System) usually takes care of scheduling machine availability and appointments, storing results and findings, and billing (especially if it's a standalone RIS/PACS with no HIS). Further features can be derived from the HIS (Hospital Information System) if it is available and integrated with RIS. Integration with RIS is through another standard called HL7.

DICOM

DICOM (Digital Imaging and Communications in Medicine) is mainly a standard that handles all the different operations, from transmitting and receiving to storing images, with its own file format. It is a standard developed by members of NEMA (National Electrical Manufacturers Association), which currently holds the copyright for DICOM.

How does it work?

Once a modality (the medical term for an imaging device) operator starts the imaging procedure, the modality starts transmitting the images to the PACS system, and the PACS system responds with error messages if any failure occurs during the transmission phase. Once all the images are received in the PACS, it starts creating headers (metadata) for the images. The header is usually derived from RIS/HIS, or entered manually if RIS/HIS doesn't exist. The header is written according to the DICOM format and contains tags, which are numbers that translate into standard key attributes (e.g. (0008,0070) refers to the modality manufacturer); the same goes for patient ID, visit and over 100 other attributes (mostly fetched from HIS/RIS). Eventually all the attributes are stored in the PACS database.
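To make the tag idea concrete, here is a minimal Python sketch of how a (group,element) tag number maps to an attribute name. The three tags listed are standard DICOM attributes, but the `TAG_NAMES` table and the `parse_tag` and `describe` helpers are hypothetical names used for illustration only; a real PACS would rely on a full DICOM data dictionary rather than a hand-written subset.

```python
# Illustrative subset of the DICOM data dictionary: (group, element) -> attribute name.
# A real dictionary holds thousands of entries; these three are just examples.
TAG_NAMES = {
    (0x0008, 0x0060): "Modality",
    (0x0008, 0x0070): "Manufacturer",
    (0x0010, 0x0020): "Patient ID",
}


def parse_tag(text):
    """Turn a tag written as '0008,0070' or '(0008,0070)' into a pair of integers."""
    group, element = text.strip("() ").split(",")
    # Tag numbers are conventionally written in hexadecimal.
    return int(group, 16), int(element, 16)


def describe(text):
    """Look up the attribute name for a tag, or say it is not in this small subset."""
    return TAG_NAMES.get(parse_tag(text), "unknown in this subset")


if __name__ == "__main__":
    print(describe("(0008,0070)"))  # prints "Manufacturer"
```

The lookup is deliberately a plain dictionary: the point is only that the header stores numbers, and the meaning comes from a shared table that both the modality and the PACS agree on.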

[Image: PACS]

So why do we need to migrate our PACS data?

There are many reasons to migrate PACS, like a vendor change, VNA, archiving, etc., which all take us back to the same basic element in migrations:

Storage Limitation

Non-scale-out storage is usually the victim. Once the storage's maximum capacity is reached, or it's time for new storage, or you are migrating archival PACS to separate storage, or the new vendor suggests different storage to work with, or you are changing protocols, eventually you have to take the old storage out and bring the new one in. But guess what!! You have to migrate all the data to the new storage first 🙂

PACS Migration

Imagine an MRI imaging device producing approximately 400 images per patient, each around 4MB. It's booked for 1 hour per visit, so you can have 8 patients a day. Let's say the imaging center is moving to new storage after 3 years; according to simple math we have to worry about around 10TB of data. Now add CT scanning (around 100 images) and X-ray, and multiply by the number of machines. It's not rare to see petabytes at some hospitals; depending on how long the images are retained and the compliance regulations around retention, the data stored can get really massive.
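The simple math behind that 10TB figure can be sketched in a few lines of Python. The image count, image size, patients per day and 3 year period come from the example above; the 260 working days per year (5 days a week) is my assumption, and the `mri_storage_tb` function name is just for illustration.

```python
def mri_storage_tb(images_per_patient=400, mb_per_image=4,
                   patients_per_day=8, working_days_per_year=260, years=3):
    """Estimate the data one MRI machine produces, in decimal terabytes."""
    total_mb = (images_per_patient * mb_per_image * patients_per_day
                * working_days_per_year * years)
    # 1 TB = 1,000,000 MB in decimal units.
    return total_mb / 1_000_000


if __name__ == "__main__":
    print(f"Weekdays only: {mri_storage_tb():.1f} TB")
    print(f"Every day of the year: {mri_storage_tb(working_days_per_year=365):.1f} TB")
```

With weekdays only the result is just under 10TB, and a machine that scans every day of the year lands at about 14TB, so the rough number holds either way for a single machine.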

Migrating data takes one of two approaches. The first is the easy one, which mainly involves images with no headers; this is usually quick and easy, and depends on how big your data is and how fast you can transmit it (what media you are using). The other approach is the dirty one, which uses the vendor migration tools and can take several months to years, especially when moving between vendors. Each image header is inspected, the database is queried, and a new database record is inserted in the new system. A complete pre-assessment engagement usually takes place first to check the integrity of the files and headers; after cleanup and assessment the migration can start in phases, with a complete assessment, including log review, after each phase is completed.

Common Issues with PACS Migration

One of the most common problems affecting data integrity is the header (metadata) of the images. Let's say the patient ID was 1234, there was no HIS/RIS integration, and the operator manually entered the wrong patient ID, 1235, at the time of the scan. Once the images are transmitted to PACS, both the image header and the PACS will have 1235 as the patient ID. Based on DICOM, some tags can never change, so even if you update the PACS with the correct patient ID, only a reference to the correct patient is written in the PACS, pointing at the wrong patient ID in the header: 1234 in PACS and 1235 in the header, with PACS creating a link between them. Such problems are fixed during the pre-assessment stage, which is why it can be time-consuming and costly.
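As a rough illustration of what a pre-assessment check looks for, the Python sketch below compares the patient ID in each image header with the patient ID recorded in the PACS database. Everything here is hypothetical: the `find_id_mismatches` function, the image identifiers and the two dictionaries stand in for data that a real vendor tool would read from the DICOM files and the PACS database.

```python
def find_id_mismatches(header_ids, pacs_ids):
    """Return the images whose header patient ID differs from the PACS record.

    Both arguments map an image identifier to a patient ID string.
    Images missing from the PACS mapping are reported as mismatches too.
    """
    return sorted(image for image, header_id in header_ids.items()
                  if pacs_ids.get(image) != header_id)


if __name__ == "__main__":
    # Hypothetical sample data matching the example: the header kept the
    # mistyped ID 1235 while the PACS record was later corrected to 1234.
    header_ids = {"img-001": "1235", "img-002": "5678"}
    pacs_ids = {"img-001": "1234", "img-002": "5678"}
    print(find_id_mismatches(header_ids, pacs_ids))  # prints ['img-001']
```

Finding the mismatches is the cheap part; deciding which side is right for each image is what makes the pre-assessment time-consuming.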

Choosing the right storage

Scale-out NAS storage over traditional storage will eliminate the need for storage migrations due to the following reasons:

– Can expand on the fly with no interruption

– Can be upgraded on the fly with no interruption

– Can have slower nodes with cheaper disks added on the fly for archiving, using file policies to move the images to slower disk without affecting the path of the files (so the database can still reference the images at the same path)
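The last point can be pictured with a small Python sketch of an age-based file policy. This is not any vendor's actual policy engine; the `ImageFile` class, the tier names and the 365 day threshold are all assumptions for illustration. What it shows is that the policy changes where the data physically lives while the path the PACS database references stays the same.

```python
from dataclasses import dataclass


@dataclass
class ImageFile:
    path: str            # logical path referenced by the PACS database
    age_days: int        # days since the image was last accessed
    tier: str = "fast"   # physical tier currently holding the data


def apply_archive_policy(files, max_age_days=365):
    """Move files older than max_age_days to the 'archive' tier.

    Only the tier changes; the logical path is never touched.
    Returns the paths of the files that were moved.
    """
    moved = []
    for f in files:
        if f.tier == "fast" and f.age_days > max_age_days:
            f.tier = "archive"
            moved.append(f.path)
    return moved


if __name__ == "__main__":
    files = [ImageFile("/pacs/2012/scan1.dcm", 900),
             ImageFile("/pacs/2015/scan2.dcm", 30)]
    print(apply_archive_policy(files))  # prints ['/pacs/2012/scan1.dcm']
    print([(f.path, f.tier) for f in files])
```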

Building Smart Home Labs

Of course there will be many people out there asking why we need a home lab when we can leverage virtualization to build a lab on a laptop. You know what, you are completely right, but how far can you go?

Home labs are not for everyone. First, you need to be passionate about technology: trying new stuff, making stuff work, and testing, testing, testing until you achieve what you are trying to do. At some point I was able to demonstrate full NAS home directories and shares with Windows/Mac integrated over Active Directory with a virtual VNX storage, all from my lab through a VPN. The customer thought it was our EMC lab in Germany 🙂 I was impressed.

Below is one of the impressive home labs from another blog:

[Image: a home lab from another blog]

Building your first home lab can look deceptively easy at the beginning, and you will think it's a piece of cake. Yes, you might be an SE who has done many architectures for banks and governments, but believe me, a smart home lab is nothing like what you expect. Wait until problems start to arise and threaten your investment in the equipment you have already bought.

Identifying the purpose

Do you want to experiment with the latest IT technologies? Or maybe you are after experience with a specific piece of software? Network? Virtualization? Storage? Or even gaming? Of course, each purpose has a completely different shopping list and design, and eventually a different way of putting stuff together.

Considerations

As a rule of thumb, make sure you look at second-hand prices on eBay. If there is a 30% difference compared with buying new, then go with eBay, unless you are in a good financial position with a comfy budget to build your lab. A home lab will usually cost you anything between $300 and $20,000, so start by identifying what you are after.

In general, you have to watch for the following each time you plan to buy a new toy for your lab:

  • Power Consumption (Watts)
  • Heat in BTU/hr
  • Noise in dB

Power Consumption

This is usually one of the most constant running expenses you will face after the lab is up and running, and based on the power consumption you will start deciding how long the lab should be powered on for. With the right components you may be able to keep your lab up and running most of the time with minimal effect on your monthly electricity bill.

Most recent Intel CPUs are anywhere between 80% and 95% power efficient, meaning power is drawn according to how much the cores are used rather than just because the CPU is powered on, so in an idle state you are looking at great power efficiency from your CPUs.

As an example of what wattage you should be looking for, a micro server with a Centrino CPU consumes around 35W, a single CPU workstation is around 135W, a dual CPU workstation is around 160W, and a 4 CPU enterprise class server can go up to 700W.
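To see what those wattages mean for the electricity bill, here is a small Python sketch. The wattages are the ones quoted above; the $0.25 per kWh price, the 30 day month and running 24 hours a day are assumptions, so plug in your own tariff and usage. The `monthly_cost` function name is just for illustration.

```python
# Approximate power draw in watts, taken from the examples above.
LAB_WATTS = {
    "micro server": 35,
    "single CPU workstation": 135,
    "dual CPU workstation": 160,
    "4 CPU enterprise server": 700,
}


def monthly_cost(watts, hours_per_day=24, days=30, price_per_kwh=0.25):
    """Estimate the monthly electricity cost of a device that draws `watts`."""
    kwh = watts / 1000 * hours_per_day * days
    return kwh * price_per_kwh


if __name__ == "__main__":
    for name, watts in LAB_WATTS.items():
        print(f"{name}: ${monthly_cost(watts):.2f} per month")
```

Under these assumptions the micro server costs a few dollars a month to leave on permanently, while the enterprise server costs roughly twenty times that, which is why the always-on question depends so heavily on what you buy.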

Noise

Unless you have garage space or a sound-isolated room, noise is one of the most important factors for you and your housemates; running a noisy lab can lead to serious relationship problems! Most enterprise class servers can be noisy because of their powerful redundant fans. Try to avoid enterprise class gear, since enterprise features are not a real concern in home labs, but of course this doesn't apply if you can spare a couple of meters in your garage for the lab.

Heat

One of the common problems people run into after setting up a home lab is the amount of heat it produces. Make sure to watch the BTU/hr rating to avoid overheating the room.
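If a spec sheet only lists watts, the heat output can be estimated, since practically all the power a computer draws ends up as heat and 1 watt is roughly 3.412 BTU/hr. A minimal Python sketch, using the example wattages from the power section (the `heat_btu_per_hour` name is just for illustration):

```python
WATTS_TO_BTU_PER_HOUR = 3.412  # 1 W is approximately 3.412 BTU/hr


def heat_btu_per_hour(watts):
    """Estimate heat output, assuming all the power drawn is released as heat."""
    return watts * WATTS_TO_BTU_PER_HOUR


if __name__ == "__main__":
    for watts in (35, 135, 160, 700):
        print(f"{watts} W is about {heat_btu_per_hour(watts):.0f} BTU/hr")
```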

Servers Vs Workstations

The first thing you might think about is whether to buy a server or a workstation. This is going to be your decision, as there is some good and bad in each, but due to the diversity of products and manufacturers it's really difficult to come up with a definitive list. I personally tend to like workstations since they can be roomier in I/O slots (PCI, disks, etc.), but on the other hand you sometimes end up cabling the disks and PCI cards in a Mickey Mouse way just to get stuff working (e.g. missing disk power cables, SAS cables, extensions, fittings, etc.).

Servers and workstations carry the same CPUs most of the time. The motherboards are usually different, allowing servers to hold more memory, while in a 3 year old workstation the maximum you will typically get is around 32GB due to the smaller motherboard and the limited number of DIMM slots.

An average dual CPU workstation with at least 16GB of memory shouldn't cost you more than $800. In home labs memory is more important than CPU if you are after a multipurpose lab, unless you will be running many parallel compute workloads (e.g. VMs), in which case more CPU is definitely required.

Servers, on the other hand, are a great option if noise is not an issue. They are usually scalable, can host more memory DIMMs and CPUs, and are upgradable and easy to work with, although they might be a bit more power-hungry than workstations in some cases.

But again, since we are talking smart home labs here, workstations are the way to go for me.

The Shopping List

Mandatory

  • Management PC

This can usually be your desktop at home or your personal laptop. I still prefer having something fixed at home, but there is no harm in using a laptop for that.

No need for a powerful PC or workstation here, just the minimum to run an operating system; 4GB of memory should be sufficient.

  • Server/Workstation

Here is where the grunt should be. Try to invest in this since it will last you for years. Compare prices; my preference is usually workstations due to the lower power consumption, noise and heat. Make sure you compare CPUs: more cores is better, but more memory is more important. Smaller things to look for are the number of PCIe slots (the more the better) and the number of supported disks (caddies), again the more the better. Check the prices of aftermarket memory since you might need to expand later, and check parts availability on eBay. Finally, bargain, bargain, bargain: eBay sellers are negotiable, so talk to many sellers before you make the purchase, and make sure the seller gives you at least a 30 day warranty.

I can’t tell you which CPU and how many. I would prefer a 2 socket capable machine so I can expand later if I need to. Memory should not be less than 8GB; 12GB is preferable, but again it depends on what you are planning to run on it.

  • Storage (I will elaborate a bit here, I am a NAS SE at the end of the day 🙂 )

Here is where it becomes a bit confusing. The options are endless, but in summary you can have your storage as follows:

DAS: you can use the internal disks of your workstation. Make sure you have a mix of speeds and capacities, e.g. SSD and SATA mixed. Check your options with SATA 3 vs. SATA 2; this depends on whether you want SATA 3 support or not. Disks are usually cheaper for SATA 2, but the difference is massive when it comes to SSDs. You can always buy a $30 PCIe SATA 3 card from eBay, so don't worry much about your workstation's native SATA 3 support.

USB 3.0 can't be neglected: speeds are up to around 170MB/s, and the USB 3.1 spec has been approved for 10Gb/s, which will make a big difference in the future. It does get messier with USB because of limited scalability and the tangle you end up with when you expand, plus you need a USB 3 PCIe card, which is another $15 or so.

SAN: if you have the budget, nothing is more appealing than the speed and latency of Fibre Channel. I rarely see home labs with Fibre Channel due to the price and the equipment, because you usually need a storage controller, those are usually enterprise class, and you won't be able to have a smart home lab with enterprise storage. But since 8Gb/s and 16Gb/s are what businesses are using now, you can grab real bargains on 4Gb/s FC these days.

NAS: this is usually the way to go. It's scalable, you can connect your home stuff to it as well, it's easy to deploy and to migrate later on, and finally it's cheap. Make sure it supports SATA 3 interfaces and 1GbE network interfaces, and that it supports jumbo frames if you care about them.

Iomega, Lenovo, etc. sell ready-made NAS solutions that scale from a single disk up to over 48 disks. A built-in OS is usually used to provide full NAS capabilities, e.g. NFS and CIFS, plus some SAN capability such as iSCSI.

Advantages

  • Low cost, as a single disk NAS can start from $150
  • Built in RAID capabilities for selected models
  • Leverage Network architecture
  • Some models support jumbo frames
  •   Low power consumption

Disadvantages

  • Gets more costly once advanced capabilities are needed (e.g. jumbo frames, 10GbE, support for more disk drives, snapshots, RAID)
  • Locked into a certain OS: each manufacturer supports their own OS on the NAS box, so there is no room for manual intervention, CLI, or advanced and customized features (e.g. pools of storage, tiering, etc.)
  • Only selected types of disks are supported on different models
  • Software RAID is mostly used, which affects overall performance

MYO (Make your own)

Many tech-savvy people have started to build their own NAS box with their flavor of OS on top of it to provide NAS features (e.g. FreeNAS, ZFS, Linux, Windows, etc.).

Advantages

  •  The most cost-effective option
  •  No locking into a certain NAS OS with certain capabilities
  • You can choose what features you need.
  •  Your choice of OS.
  •  Your choice of disks

Disadvantages

  • You might need to buy a hardware RAID card if you want protection with low impact on performance
  • Might get a bit technical, and requires advanced knowledge of different operating systems
  • Assembling all the parts together might be time-consuming, especially if you are getting used equipment

  • Network switch

A switch is important, and no, you can’t skip it. There are many on eBay starting from $10 used. Get at least 4 ports; 8 or more is recommended. Stay away from enterprise switches even if they are cheap, as they are usually power-hungry and noisy. Jumbo frame support is a great advantage if you care about changing network frame sizes in the future for testing purposes. A managed switch is great, VLANs are a nice-to-have, and 1GbE support is a must. Make sure you buy your cables too; local shops are usually expensive when it comes to cables, so buy them online as well.

  • Your choice of disks (min. 2)

Whatever you buy for your storage, make sure you get it diskless unless it is a bargain, and choose your own disks. MSI in Australia is a great supplier with very competitive online pricing, and you get warranty through them as well. Try to avoid used disks due to wear: SSDs have limited writes, spinning disks have moving parts that wear out, and there is an MTBF for every disk. When you buy SATA disks, make sure they are 7,200 RPM with cache; it's usually the same price anyway. I usually prefer Seagate Barracuda. SATA 3 support doesn't matter for spinning SATA disks, since you will never be able to saturate a SATA 2 interface with 7,200 RPM disks anyway, but for SSDs it makes a big difference. Samsung is my favorite SSD brand, but you can buy any brand.
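The reasoning about SATA 2 versus SATA 3 can be checked with a little arithmetic, sketched here in Python. The line rates (3Gb/s and 6Gb/s) and the 8b/10b encoding overhead are part of the SATA specifications; the device throughputs of 150MB/s for a 7,200 RPM disk and 500MB/s for a SATA SSD are rough assumptions for illustration, not measurements, and the function names are hypothetical.

```python
def usable_mb_per_s(line_rate_mbit):
    """Usable SATA bandwidth in MB/s for a given line rate in Mbit/s.

    SATA uses 8b/10b encoding, so only 8 of every 10 bits on the wire
    carry data, and 8 data bits make one byte.
    """
    return line_rate_mbit * 8 // 10 // 8


SATA2_MB_S = usable_mb_per_s(3000)  # 300 MB/s
SATA3_MB_S = usable_mb_per_s(6000)  # 600 MB/s


def is_interface_bottleneck(device_mb_per_s, interface_mb_per_s):
    """True when the device could move data faster than the interface allows."""
    return device_mb_per_s > interface_mb_per_s


if __name__ == "__main__":
    spinning_disk = 150  # assumed sequential throughput of a 7,200 RPM disk, MB/s
    ssd = 500            # assumed sequential throughput of a SATA SSD, MB/s
    print("Spinning disk limited by SATA 2:", is_interface_bottleneck(spinning_disk, SATA2_MB_S))
    print("SSD limited by SATA 2:", is_interface_bottleneck(ssd, SATA2_MB_S))
    print("SSD limited by SATA 3:", is_interface_bottleneck(ssd, SATA3_MB_S))
```

Under those assumptions the spinning disk has plenty of headroom on SATA 2, while the SSD is held back by it and fits within SATA 3.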

  • Software Licenses

Can you imagine that most of the cost sits here? Unless you plan well, you will end up paying top $$$ just for licenses. Lucky for me, I work for EMC and I get VMware licenses for free, but if you don’t work for EMC or VMware, try to get workstations that come with a Windows license. If you are planning to use Linux it's much better; Ubuntu and Mint are my favorites. VMware will still give you 60 day licenses for any product.

Optional

  • RAID card

Whether it's for your workstation if you are using DAS, or for your storage if you are going MYO, RAID cards are a good way of having proper hardware-level RAID with a lower performance penalty. Having cache on a RAID card is good but usually costly. Keep your eye on LSI MegaRAID cards on eBay and grab a used one for a bargain; you might need to re-image or upgrade the firmware, which is not easy, but it is worth it.

  • Multiple Monitors

It might be a bit fancy or luxurious, but it’s always handy to have more than one display. Displays are cheap, you can get a used one for $50, but you must have a graphics card that supports multiple displays. The NVS series from NVIDIA is usually cheap and handy, and many of them are on eBay. Cards for higher video demands such as gaming, if you care, are more costly, but again it depends on how you use your home lab.

Finally, posted below are my home lab design and a picture of how it looks. I am still looking at buying a half-size cabinet to keep all my workstations stacked in, and I keep growing the lab every month, so good luck with yours.

[Image: my home lab design and photo]