Did someone tell your data to shelter in place? That wouldn’t make any sense, would it? Ironically, for vast troves of valuable enterprise data, that might as well be the case, because massive, compute-bound data silos are practically everywhere in the corporate world.
Hadoop played a role in creating this scenario, because many large organizations sought to leverage the twin promises of low-cost storage and massively parallel computing for analytics. But a funny thing happened to the yellow elephant: It was largely obviated by cheap cloud storage.
Seemingly overnight, the price of cloud storage dropped so precipitously that the cost-benefit analysis of using the Hadoop Distributed File System (HDFS) on-premises for new projects turned upside down. Even the term “Hadoop” disappeared from the names of major conferences.
That’s not to say there isn’t valuable data in all those HDFS repositories. Many important initiatives used this technology in hopes of generating useful insights. But with budgets moving away from Hadoop, organizations need another strategy to keep extracting value from that data.
What about compute? Suffice it to say, cloud providers now offer robust services in this territory. And relatively recent innovations, such as the separation of compute from storage, have also paved the way for cloud-based computing to take on all manner of workloads.
So the cloud now easily eclipses most on-premises environments in all the major categories: speed, cost, ease of use, maintenance and scalability. But there are barriers to entry, or at least pathways that should be navigated carefully while making the move. Mistakes can be very costly!
But how do you get the data there? Amazon actually offers a literal truck, the AWS Snowmobile, that will come to your data center, load up your data and haul it, old-school, to its facility. That approach can certainly work as a quick-and-relatively-dirty migration, but it ignores the magic of the cloud.
Seeing Triple
As the concept of “cloud-native” gets hashed out on whiteboards in boardrooms around the business world, the reality taking shape is that a whole new generation of solutions is being born. These systems are augmented with high-powered analytics and artificial intelligence.
This new class of application is almost exclusively built on a microservices architecture with Kubernetes as the foundation. There is tremendous value to this approach, because scalability is built into its DNA. Taking advantage of this new approach requires a commitment to change.
Simply shipping your data and applications in toto to a cloud provider absolutely does not solve this challenge. In fact, it will likely result in a significant rise in total cost of ownership (TCO), thus undermining a major driver for moving to the cloud.
Another strategy involves porting specific data sets into the cloud to tap the power of all that computation. This often involves making copies of the data. While change data capture (CDC) can be used to keep these disparate environments in sync, there are downsides to this approach.
In the first place, CDC pipelines always need to be meticulously managed. Small amounts of data drift can quickly become larger problems. This is especially problematic when the derived analytics are used for mission-critical business decisions or customer-experience initiatives.
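To make that maintenance burden concrete, here is a rough sketch of the kind of reconciliation check a CDC pipeline needs to run on a schedule. It is purely illustrative: the table, columns and databases are hypothetical, with SQLite standing in for the real on-prem source and cloud replica.

```python
# Minimal drift-reconciliation sketch (illustrative only): compare row counts
# and a cheap aggregate between a source table and its CDC-fed replica.
# sqlite3 stands in for the real on-prem and cloud databases (an assumption);
# the table/column names are hypothetical.
import sqlite3

def table_fingerprint(conn, table):
    """Return (row_count, sum_of_ids) as a cheap drift signal."""
    cur = conn.execute(f"SELECT COUNT(*), COALESCE(SUM(id), 0) FROM {table}")
    return cur.fetchone()

source = sqlite3.connect("onprem_orders.db")   # hypothetical source system
replica = sqlite3.connect("cloud_orders.db")   # hypothetical CDC-fed replica

src_count, src_sum = table_fingerprint(source, "orders")
rep_count, rep_sum = table_fingerprint(replica, "orders")

if (src_count, src_sum) != (rep_count, rep_sum):
    # In production this would page an operator or trigger a re-sync.
    print(f"Drift detected: source={src_count} rows, replica={rep_count} rows")
else:
    print("Replica is in sync with source")
```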
Secondly, by going down this road, organizations risk the proliferation of even more data silos, this time in the cloud. And while cloud storage keeps getting cheaper, the cost of egress can creep up and throw budgets sideways, which is not good in a post-COVID world.
Remember, Hadoop’s standard redundancy was to keep three copies of every datum, which is good for disaster recovery but rather taxing overall, both in terms of throughput and complexity. As we move into the new world of cloud computing, we should avoid repeating old mistakes.
Agile Defined
A different approach to bridging the worlds of on-prem data centers and the growing variety of cloud computing services comes from a company called Alluxio. From its roots in UC Berkeley’s AMPLab, the company has been focused on solving exactly this problem.
Alluxio decided to bring the data to compute in a different way. Essentially, the technology provides a distributed, largely in-memory cache that nestles between compute frameworks and the storage systems underneath them, whether in the cloud or on-prem. Think of it as a new spin on data virtualization, one that leverages an array of cloud-era advances.
According to Alex Ma, director of solutions engineering at Alluxio: “We provide three key innovations around data: locality, accessibility and elasticity. This combination allows you to run hybrid cloud solutions where your data still lives in your data lake.”
The key, he said, is that “you can burst to the cloud for scalable analytics and machine-learning workloads where the applications have seamless access to the data and can use it as if it were local–all without having to manually orchestrate the movement or copying of that data.”
In this sense, Alluxio’s approach bridges the best of both worlds: You can preserve your investments in on-prem data lakes while opening a channel to high-powered analytics in the cloud, all without the encumbrance of moving massive amounts of data here or there.
“Data locality means bringing the data to compute, whether via Spark, Presto, or TensorFlow,” Ma said. In this scenario, Alluxio is installed alongside the compute framework and uses spare resources on those servers to provide caching tiers for the data.
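To picture what that seamless access looks like from the application side, consider a brief PySpark sketch. The master hostname, port and dataset path below are assumptions, and it presumes the Alluxio client library is already on Spark’s classpath; the point is simply that the job addresses an alluxio:// URI the same way it would hdfs:// or s3://.

```python
# Sketch: reading a dataset through Alluxio from Spark.
# Assumptions: hostname "alluxio-master", default master port 19998, a made-up
# dataset path, and the Alluxio client jar already on Spark's classpath.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("alluxio-locality-demo").getOrCreate()

# The application simply addresses an alluxio:// URI; hot blocks are served
# from the local caching tiers, cold blocks are fetched from the data lake.
events = spark.read.parquet("alluxio://alluxio-master:19998/datalake/events")
events.groupBy("event_type").count().show()
```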
Options, Options
There are various ways to deploy this caching layer, depending upon the topology of the existing information architecture. In some environments, if Presto is already consuming most of the memory, Alluxio can be pointed at SSDs on the appropriate machines for optimized caching.
If you’re tying into HDFS, Presto makes the request, and Alluxio’s intelligent multi-tiering then uses whichever approach is most efficient, spanning memory, SSD or spinning disk. It can all be optimized over time as Alluxio monitors data access patterns.
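As a hedged illustration of what that looks like from the analyst’s chair, here is a small Python sketch using the presto-python-client. The hostname and table are made up, as is the assumption that the table’s files sit behind an alluxio:// location; the takeaway is that the query itself doesn’t change at all, while Alluxio quietly decides which tier serves each block.

```python
# Sketch: a Presto query issued through the presto-python-client.
# The coordinator hostname, the "orders" table, and the assumption that its
# Hive location points at an alluxio:// path are all illustrative. Presto just
# runs the query; Alluxio decides whether blocks come from memory, SSD or disk.
import prestodb

conn = prestodb.dbapi.connect(
    host="presto-coordinator",  # assumed hostname
    port=8080,
    user="analyst",
    catalog="hive",
    schema="default",
)
cur = conn.cursor()
cur.execute("SELECT status, COUNT(*) FROM orders GROUP BY status")
for row in cur.fetchall():
    print(row)
```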
Regardless of which tools an organization uses (TensorFlow, Presto, Spark, Hive), there will be different usage patterns across CPU, GPU, TPU and RAM. When it comes to RAM and the available disk types, Alluxio can work with whatever resources are free.
“Spark is less memory-intensive,” Ma said, “so we can allocate some memory. So you have the choice to figure out what you want to allocate and where. Alluxio allows you to seamlessly access the data in the storage area, wherever it may be.”
There’s also the concept of a unified namespace. “What it allows you to do is have a storage configuration that’s centrally managed,” Ma said. “You’re not going into Spark and Presto to set it all up; you’re able to configure Alluxio once, and then Spark or Presto communicate with Alluxio.”
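Here is a hedged sketch of what that central configuration buys the application. It assumes an administrator has already mounted an on-prem HDFS path and a cloud object store bucket into Alluxio’s namespace; the paths, hostnames and dataset names are hypothetical, and Spark then reads both through the same alluxio:// prefix.

```python
# Sketch of the unified-namespace idea from the application's point of view.
# Assumption: an administrator has already mounted an on-prem HDFS path at
# /onprem and a cloud object store bucket at /cloud inside Alluxio's namespace
# (paths, hostnames and dataset names here are hypothetical).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("unified-namespace-demo").getOrCreate()

# Both datasets are addressed through the same Alluxio namespace, even though
# one physically lives in HDFS and the other in cloud object storage.
onprem_sales = spark.read.parquet("alluxio://alluxio-master:19998/onprem/sales")
cloud_clicks = spark.read.parquet("alluxio://alluxio-master:19998/cloud/clickstream")

onprem_sales.join(cloud_clicks, "customer_id").show(10)
```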
The general idea is to create a high-speed repository of data that gives analysts the speed and accuracy they demand without giving in to the temptation of data silos. Think of it as a very large stepping stone to the new normal of multi-cloud enterprise computing.
“With Alluxio, we sit in the middle and offer interfaces on both sides; so we can talk to a variety of different storage layers,” Ma said. “We act as a bridging layer, so you can access any of these technologies.” In short, you can have your data cake and eat it too.
Like any quality abstraction layer, solving the data challenge in this manner enables companies to leverage their existing investments. Data centers will have a very long tail, and cloud services will continue to evolve and improve over time. Why not get the best of both worlds?
Eric Kavanagh is CEO of The Bloor Group, a new-media analyst firm focused on enterprise technology. A career journalist with more than two decades of experience in print, broadcast and Internet media, he also hosts DM Radio, and the ongoing Webcast series for the Global Association of Risk Professionals (GARP).