Turning your big data into actionable, fast data
OCTOBER 1, 2015
by Jim Campigli
Chief Product Officer and Co-Founder, WANdisco, Inc.
Hadoop was designed to make it easy and inexpensive to store and process petabytes of structured, unstructured and semi-structured data—like clickstream data, financial ticker data, system logs and sensor data—generated at incredible speeds and written thousands of times per second. Once written to Hadoop, that data becomes “data-at-rest”: it is retrieved and analyzed later, either by batch applications built with MapReduce that run for hours or through NoSQL databases for interactive access.
However, the ultimate benefits of big data are lost if fresh, fast-moving data is simply dumped into the Hadoop Distributed File System (HDFS) rather than analyzed as it happens, because the opportunity to act now disappears. Fast data, in contrast, is about “data-in-motion” that demands an immediate response and action. The collection process for data-in-motion is the same as for data-at-rest; the difference is that analysis occurs in real time, as the data is generated and captured. Businesses can react instantly to changing market conditions, take advantage of big data velocity and make decisions with a direct impact on the bottom line. And as most business is now global, or going global, the ability to react immediately to information generated simultaneously from multiple locations worldwide, without downtime, is vital to competitive advantage.
This is not to say that data-in-motion and data-at-rest are mutually exclusive. On the contrary, they are very much complementary. Data-at-rest provides historical context, while data-in-motion tells you what’s happening now. The combination of the two allows decisions to be made in real time within a historical context, instead of a single-point-in-time vacuum.
What’s required is a toolset that handles both data-at-rest and data-in-motion, and Spark has proven itself to be the toolset of choice. Spark’s in-memory framework enables it to handle streaming data in real time before it’s written to HDFS, as well as process data-at-rest at speeds up to 100 times faster than MapReduce to perform regression analysis that highlights patterns in historical data. In addition, it has its own high-performance SQL engine for interactive queries against HDFS, and it can write to HDFS as well as other data stores. With the Spark framework and the data visualization and analytics applications that work with it, businesses can combine big data velocity with historical context, making informed decisions that immediately impact the bottom line.
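To make the pattern concrete, here is a minimal plain-Python sketch of the kind of decision such a pipeline enables (this is not Spark’s actual API—Spark Streaming and its SQL engine do this at scale): each incoming tick, the data-in-motion, is checked against a precomputed historical baseline, the data-at-rest, and only out-of-pattern events trigger action. The ticker symbols, prices and 10% threshold are illustrative assumptions.

```python
# Hypothetical historical baselines (the data-at-rest context), e.g.
# averages precomputed from HDFS by a batch job.
HISTORICAL_AVG_PRICE = {"ACME": 100.0, "GLOBEX": 50.0}

def flag_outliers(ticks, threshold=0.10):
    """Check each live tick (data-in-motion) against its historical
    average and flag anything deviating by more than `threshold`."""
    alerts = []
    for symbol, price in ticks:
        baseline = HISTORICAL_AVG_PRICE.get(symbol)
        if baseline is None:
            continue  # no historical context for this symbol yet
        if abs(price - baseline) / baseline > threshold:
            alerts.append((symbol, price))
    return alerts

live_ticks = [("ACME", 101.0), ("ACME", 125.0), ("GLOBEX", 40.0)]
print(flag_outliers(live_ticks))  # [('ACME', 125.0), ('GLOBEX', 40.0)]
```

The point is the shape of the decision: real time, but within historical context rather than a single-point-in-time vacuum.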
The most effective way to achieve the benefits of fast data is a hybrid approach built on active-active replication. It integrates in-house big data deployments running any Hadoop distribution—Cloudera’s Distribution Including Apache Hadoop (CDH), Hortonworks Data Platform (HDP), MapR, Pivotal, IBM, or plain Apache Hadoop—on HDFS or any Hadoop Compatible File System (HCFS) with cloud-based Spark-as-a-Service for real-time analytics applications. This allows an immediate ramp-up without bringing hard-to-find Spark-trained staff, additional hardware and other infrastructure in-house. The same architecture makes it possible to move raw data as well as end results into and out of the cloud in near-real time, without any disruption to existing on-premises Hadoop operations.
Active-active replication across in-house data centers, into and out of a Spark-as-a-Service cloud, addresses two other key requirements global organizations have for gaining real competitive advantage from fast data:
- The elimination of downtime and data loss
- The ability to handle huge volumes of data generated continuously across multiple locations
Eliminating downtime and data loss is critical for any application with stringent service-level agreements (SLAs) or regulatory compliance mandates. For fast data applications, the negative impact of downtime and data loss is orders of magnitude greater due to the increased risk of missed opportunity. As a by-product of ensuring data consistency across clusters and data centers, true peer-to-peer active-active replication would deliver continuous hot backup by default to protect against data loss, and would build on this to provide automated failover and recovery for the lowest possible recovery point objective (RPO) and recovery time objective (RTO). With an active-active solution, when a cluster or an entire data center went offline, whether for scheduled maintenance or because of hardware and network failures, users would still have read/write access to their data at other locations. After an outage, clusters that were offline would resynchronize automatically, reducing the risk of human error during recovery.
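As an illustration of the failover and resynchronization behavior described above, here is a toy Python model of an active-active deployment. It deliberately omits the hard parts a production implementation must solve, notably consensus on write ordering across wide-area networks and conflict handling, and the region names and data are invented for the example.

```python
class Replica:
    """One peer in a hypothetical active-active deployment."""
    def __init__(self, name):
        self.name = name
        self.online = True
        self.data = {}

class ActiveActiveCluster:
    """Toy model: every write is applied to all online peers, any
    online peer serves reads, and a recovering peer is resynced
    automatically before it rejoins."""
    def __init__(self, names):
        self.replicas = [Replica(n) for n in names]

    def write(self, key, value):
        for r in self.replicas:
            if r.online:
                r.data[key] = value  # real systems agree on write order first

    def read(self, key):
        for r in self.replicas:
            if r.online:
                return r.data.get(key)
        raise RuntimeError("no replica available")

    def fail(self, name):
        self._find(name).online = False

    def recover(self, name):
        r = self._find(name)
        # Automated resync from a surviving peer: no manual restore
        # step, which is what keeps RPO/RTO low and avoids human error.
        source = next(p for p in self.replicas if p.online)
        r.data = dict(source.data)
        r.online = True

    def _find(self, name):
        return next(r for r in self.replicas if r.name == name)

cluster = ActiveActiveCluster(["us-east", "eu-west", "apac"])
cluster.write("ticker:ACME", 125.0)
cluster.fail("us-east")              # data center goes offline
cluster.write("ticker:ACME", 126.5)  # reads and writes continue elsewhere
cluster.recover("us-east")           # automatic resynchronization
print(cluster.read("ticker:ACME"))   # 126.5
```

Even in this simplified form, the two properties the text calls out are visible: the outage never blocks access, and recovery requires no manual copy of data.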
In addition to eliminating downtime, global organizations must be able to process and analyze massive volumes of data generated on a 24x7 basis wherever it originates. These capabilities are necessary when dealing with industrial sensor data and other Internet of Things use cases where data is generated everywhere and timeliness and accuracy are vital. To address these requirements, organizations must:
- Ensure critical, time-sensitive information is processed close to the source, without the delay and risk involved in moving data over a write pipeline for analysis
- Replicate it to other locations as needed at the same time it’s analyzed and ingested at the point of origin
- Combine in-house big data-at-rest that provides historical context with fast data-in-motion, so informed, real-time decisions can be made with an immediate positive impact
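The first two requirements can be sketched in a few lines: a hypothetical edge node analyzes each sensor reading at the point of origin, with no wide-area round-trip before acting, and in the same ingest step hands the event to per-site replication queues. The sensor names, sites and temperature threshold are invented for illustration.

```python
def ingest_at_edge(event, local_alerts, replication_queues, threshold=75.0):
    """Process a sensor reading where it is generated and, at the same
    time, queue it for replication to the other sites."""
    # 1. Analyze close to the source: act before any data movement.
    if event["temp_c"] > threshold:
        local_alerts.append(event["sensor"])
    # 2. Replicate as it is ingested, not as an afterthought.
    for q in replication_queues.values():
        q.append(event)

alerts = []
queues = {"eu-west": [], "apac": []}
ingest_at_edge({"sensor": "pump-7", "temp_c": 92.0}, alerts, queues)
ingest_at_edge({"sensor": "pump-8", "temp_c": 41.0}, alerts, queues)
print(alerts)               # ['pump-7']
print(len(queues["apac"]))  # 2
```

The third requirement, combining this data-in-motion with historical data-at-rest, is the enrichment pattern shown earlier in the article.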