10 parameters for Big Data networks

Big Data and Hadoop clusters move heavy volumes of data, often at high velocity and in bursty traffic patterns. As these clusters find their way into enterprise data centers, network designers have a few more requirements to take care of. Listed below are 10 parameters to evaluate when designing a network for a Big Data or Hadoop cluster.

10) Available and resilient

- Allow network designs with multiple redundant paths between data nodes rather than one or two single points of failure
- Support upgrades without any disruption to the data nodes

9) Predictable

- Right-size the network configuration (1GbE/10GbE/100GbE switch capacity) to achieve predictable latency in the network
- Real-time latency may not be required for batch processing
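To make the right-sizing trade-off concrete, here is a rough back-of-the-envelope sketch (not from the original list; the 10 TB dataset size and decimal unit conversions are illustrative assumptions) comparing ideal transfer times at the three switch capacities:

```python
# Illustrative sketch: ideal time to move a dataset at different link
# speeds, ignoring protocol overhead and contention. The 10 TB figure
# is an assumption chosen only to show the scale of the differences.

def transfer_time_hours(dataset_tb, link_gbps):
    """Ideal transfer time in hours for dataset_tb terabytes over a link_gbps link."""
    bits = dataset_tb * 8 * 10**12       # terabytes -> bits (decimal units)
    seconds = bits / (link_gbps * 10**9)
    return seconds / 3600

for gbps in (1, 10, 100):
    print(f"{gbps:>3} GbE: {transfer_time_hours(10, gbps):.2f} h")
```

For batch jobs, the difference between 2.2 hours and 13 minutes may not justify the cost of the faster switches; for near-real-time workloads it might.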

8) Holistic network

- One network should support all workloads: Hadoop, NoSQL, data warehouse, ETL, and web
- Support Hadoop alongside existing storage systems such as DAS, SAN, or NAS

7) Multitenant

- Be able to consolidate and centralize Big Data projects
- Have the capability to leverage the fabric across multiple use cases

6) Network partitioning

- Support separating Big Data infrastructure from other IT resources on the network
- Support privacy and regulatory compliance norms

5) Scale Out

- Provide a seamless transition as projects increase in size and number
- Accommodate new traffic patterns and larger, more complex workloads

4) Converged/ unified fabric network

- Target a flatter, converged network with Big Data as an additional configurable workload
- Provide a virtual-chassis architecture that can logically manage multiple switches as a single device

3) Network intelligence

- Carry any-to-any traffic flows of Big Data as well as traditional clusters over Ethernet
- Manage a single network fabric irrespective of data requirements or storage design

2) Enough bandwidth for data node network

- Provision data nodes with enough bandwidth for efficient job completion
- Weigh the cost/benefit trade-off of increasing data node uplink capacity
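One common way to frame the uplink cost/benefit question is the oversubscription ratio at a top-of-rack switch. The sketch below is a hypothetical illustration (the port counts and speeds are assumptions, not recommendations from the article):

```python
# Hypothetical sketch: oversubscription ratio at a top-of-rack switch.
# A higher ratio means more contention for uplink bandwidth when many
# data nodes transmit at once; adding uplinks lowers it, at a cost.

def oversubscription(server_ports, port_gbps, uplinks, uplink_gbps):
    """Ratio of total downlink capacity to total uplink capacity."""
    downlink = server_ports * port_gbps
    uplink = uplinks * uplink_gbps
    return downlink / uplink

# 40 data nodes at 10GbE, comparing 4 vs 8 uplinks at 40GbE each
print(oversubscription(40, 10, 4, 40))  # 2.5  (2.5:1)
print(oversubscription(40, 10, 8, 40))  # 1.25 (1.25:1)
```

Whether halving the ratio is worth doubling the uplink count depends on how often the cluster's shuffle and replication phases saturate the uplinks.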

1) Support bursty traffic

- Support operations such as loading files into HDFS, which triggers replication of data blocks, or writing mapper output files; these drive higher network use over a short period, causing bursts of traffic in the network
- Provide optimal buffering in network devices to absorb these bursts
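The size of such a burst is easy to estimate. The sketch below is a hedged approximation: it assumes the writer is itself a data node (so the first replica stays local and only the remaining copies cross the network), and the 500 GB file and 20 Gbps of spare capacity are illustrative numbers, not figures from the article:

```python
# Hedged sketch: estimating the network burst caused by loading a file
# into HDFS. Assumes the writing client is a data node, so the first
# replica is written locally and (replication_factor - 1) copies of
# each block traverse the network via the write pipeline.

def replication_burst_gb(file_gb, replication_factor=3):
    """Gigabytes of replica traffic that cross the network for one file load."""
    return file_gb * (replication_factor - 1)

def burst_duration_s(file_gb, replication_factor, aggregate_gbps):
    """Seconds needed to absorb the burst at a given aggregate network rate."""
    bits = replication_burst_gb(file_gb, replication_factor) * 8 * 10**9
    return bits / (aggregate_gbps * 10**9)

# Loading 500 GB with the default replication factor of 3,
# over roughly 20 Gbps of spare aggregate capacity
print(replication_burst_gb(500))             # 1000 GB crossing the network
print(round(burst_duration_s(500, 3, 20)))   # ~400 seconds of elevated load
```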


Introducing wearable Hadoop technology

While Hadoop and MapReduce have long championed distributed parallel processing on commodity hardware, some Hadoop enthusiasts have taken this too far. Enter Datasayer from Edward J. Yoon, who has built wearable Hadoop technology. This means every time you walk, jump, blink your eyes, or move your hand, you would be using your kinetic energy to run a MapReduce job.

Read below to understand this breathtaking innovation.

  • Hadoop Glass – Inspired by Google Glass, this forms the client layer of your Hadoop cluster. Using the eyewear interface, you can trigger a MapReduce job or fire a Pig, Hive, or Hama query.
  • Hadoop Watch – Inspired by Samsung Gear (watch) technology, this forms the name node of your Hadoop cluster and stores the metadata for the data stored in data nodes. Further, the job tracker is also hosted on the watch itself and, using MRv1, controls the execution of tasks on different nodes.

  • Hadoop Shoes – This forms the data node layer of the cluster. By default, the replication factor is 2 for each block resident in a data node shoe. Each time you walk or jump, the kinetic energy is converted to CPU cycles and powers the tasks running via the tasktracker on each shoe.

  • Hadoop Kinect – Inspired by Microsoft’s Kinect technology, you can configure the shoes, a.k.a. data nodes, of another person within a vicinity of 100 meters. The architecture also leverages advanced wireless technology to communicate between nodes. From a scalability perspective, all you need to do is have more people jumping or walking with data node shoes in the vicinity.

If all this sounds too good to be true, well, you guessed it right. This is an April Fools’ Day prank by Datasayer.



Solving big aviation mysteries with Big Data

While the focus for the last few weeks has been on the aviation sector, let's take a quick look at the opportunities and challenges the sector offers for Big Data analytics. It may not be far-fetched to say that we could do better with integrated systems in aviation. The focus should shift to establishing industry-wide IT governance and data-sharing platforms. To that end, newer 21st-century technologies should be leveraged.

Shown below is an infographic giving an overview of the Big Data opportunity in the aviation sector.
[Infographic: Solving Big Aviation Mysteries with Big Data]

Decoding Hadoop ETL

Continuing our Q&A with Syncsort CEO Lonne Jaffe, we explore ETL use cases in the Hadoop ecosystem. Lonne explains some of the key distinguishing characteristics of ETL solutions and how they make a compelling use case with inexpensive implementations.



What makes a great ETL solution for Hadoop? Can you tell us the important characteristics?

Some legacy ETL products are too heavyweight to work well with Hadoop – they don’t run natively, they sit on edge nodes, or they generate a lot of inefficient code that needs to be maintained.

Our enterprise-grade Hadoop-based transformation engine sits on each node in a cluster to deliver accelerated performance and avoids generating code. We made an open source contribution that enabled our engine to run natively in the Hadoop environment, which was committed as MAPREDUCE-2454 in early 2013. We’re now delivering, to some of the largest and most sophisticated users of Hadoop in the world, a product that can handle complex data models that include disparate structured and unstructured data from varied data sources, including the mainframe. 

We’re also focusing our organic investments on making it as easy as possible to move legacy workloads and data into Hadoop. For example, we created a SQL analyzer that scans and creates maps of existing legacy SQL and assists in efficiently recreating those SQL-based legacy workloads in Hadoop. We also built a product that analyzes SMF records on the mainframe to identify the mainframe workloads that are best-suited to moving to Hadoop to save money, improve performance, and make the data accessible to next-generation analytics.

Expect continued improvements in these existing capabilities and more offerings like this from us. 


How critical is pricing in your segment? Do customers push back hard on the price point?

In 2013, many companies had no budget for Hadoop. Customers were just testing – no real dollars were committed to building out production Hadoop clusters. That has changed substantially in 2014. Organizations can now much more easily justify investment in Hadoop because they immediately realize cost reductions in legacy data warehouses, legacy ETL tools, and mainframes, saving much more money than it costs to create their Hadoop environment. Groups within large enterprises that are running offload projects are aggregating power and budget since they’re freeing up so much annual spend. Offloading legacy workloads and data into Hadoop doesn’t only save money; it also brings a new class of analytical compute power to the data. These organizations can quickly demonstrate the competitive benefits of the advanced analytics that Hadoop makes possible, giving them insights that can help grow the business – contributing to the top line. Anything that can generate top-line revenue growth and lower costs simultaneously is very valuable to enterprises today.


Among the various use cases of your products in Hadoop ecosystem, can you tell us about your most fascinating one?

What’s most fascinating about our customers’ use cases is how they have changed the economics of managing data. One customer who measured the cost of managing a terabyte of data in their enterprise data warehouse at $100,000 was able to offload and manage data in Hadoop at a cost of only $1,000 per terabyte – and ETL-like workloads can represent as much as 40-60% of the capacity of legacy enterprise data warehouse systems. In another case, by offloading mainframe data and processing into Hadoop, a major bank was able to save money and make new analytical capabilities available. Customers can also offload workloads from legacy ETL products into Hadoop.
We’re focusing much of our R&D and acquisition bandwidth going forward on building unique products that make this offload-to-Hadoop process as seamless as possible.
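The per-terabyte arithmetic behind these offload savings is easy to sketch. The calculation below is purely illustrative: it uses the $100,000/TB and $1,000/TB figures quoted above, but the function, the 100 TB warehouse size, and the 50% ETL share are assumptions for the example, not Syncsort figures:

```python
# Illustrative arithmetic only: savings from offloading ETL-like
# workloads out of a legacy warehouse into Hadoop, using the quoted
# per-terabyte management costs. All other numbers are assumptions.

def offload_savings(warehouse_tb, etl_share, dw_cost_per_tb=100_000,
                    hadoop_cost_per_tb=1_000):
    """Cost reduction from moving etl_share of warehouse capacity to Hadoop."""
    offloaded_tb = warehouse_tb * etl_share
    return offloaded_tb * (dw_cost_per_tb - hadoop_cost_per_tb)

# A hypothetical 100 TB warehouse where 50% of capacity is ETL-like work
print(offload_savings(100, 0.5))  # 4950000
```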

