Around and around and around it goes … where it stops, nobody knows! At least not until now.
“It,” of course, is data, and its mechanism of transport has long been the tried-and-relatively-true practice of extract, transform, load, aka ETL. That’s now finally changing.
Granted, there have been other ways of moving data: Change data capture (CDC), one of the leanest methods, has been around for decades and remains a very viable option; the old File Transfer Protocol (FTP) can’t be overlooked; nor can the seriously old-fashioned forklifting of DVDs.
Data virtualization 1.0 brought a novel approach as well, one that leveraged a fairly sophisticated system of strategic caching: high-value queries were preprocessed, and certain VIP users benefited from a combination of pre-aggregation and stored result sets.
During the rise of the open-source Hadoop movement about a decade ago, some other curious innovations took place, notably Apache Sqoop, a command-line tool for transferring data between relational databases and Hadoop. Sqoop proved very effective at pulling data from relational sources and dropping it into the Hadoop Distributed File System (HDFS), though that paradigm has since faded somewhat.
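For readers who never crossed paths with Sqoop, the pattern looked roughly like the sketch below. It simply drives the real sqoop import command from Python; the JDBC connection string, credentials file, table and HDFS target directory are all hypothetical stand-ins, and the machine running it is assumed to have Sqoop and a Hadoop client installed.

```python
# A minimal sketch of the classic Sqoop pattern: pull one relational table into HDFS.
import subprocess

sqoop_cmd = [
    "sqoop", "import",
    "--connect", "jdbc:mysql://dbhost/sales",   # hypothetical source database
    "--username", "etl_user",                   # hypothetical database user
    "--password-file", "/user/etl/.sqoop_pwd",  # keep credentials out of the command line
    "--table", "orders",                        # hypothetical relational table to copy
    "--target-dir", "/data/raw/orders",         # HDFS directory where the files land
    "--num-mappers", "4",                       # parallel map tasks doing the copy
]

# Run the import and raise an error if Sqoop exits with a non-zero status.
subprocess.run(sqoop_cmd, check=True)
```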
But a whole new class of technologies–scalable, dynamic, increasingly driven by artificial intelligence–now threatens the status quo. So significant is this change that we can reasonably anoint a new term in the lexicon of information management: data orchestration.
There are several reasons why this term makes sense. First and foremost, just as an orchestra comprises many different instruments, all woven together harmoniously, today’s data world suddenly boasts many new sources, each with its own frequency, rhythm and nature.
Secondly, the concept of orchestration implies much more than integration, because the former connotes significantly more complexity and richness. That maps nicely to the data industry these days: The shape, size, speed and use of data all vary tremendously.
Thirdly, the category of data orchestration speaks volumes about the growing importance of information strategy, arguably among the most critical success factors for business today. It’s no longer enough to merely integrate, transport or transform data; it must be leveraged strategically.
Down the Batch!
As the mainstay of data movement over the past 30 years, ETL took the lead. Initially, custom code was the way to go, but as Rick Sherman of Athena IT Solutions once noted: “Hand coding works well at first, but once the workloads grow in size, that’s when the problems begin.”
As the information age matured, a handful of vendors addressed this market in a meaningful way, including Informatica in 1993, Ab Initio (a company that openly eschews industry analysts) in 1995, then Informix spin-off Ascential (later bought by IBM) in 2000. That was the heyday of data warehousing, the primary driver for ETL.
Companies realized they could not effectively query their enterprise resource planning (ERP) systems to gauge business trajectory, so the data warehouse was created to enable enterprise-wide analysis.
The more people got access to the warehouse, the more they wanted. This resulted in batch windows stacking up to the ceiling. Batch windows are the time slots within which data engineers (formerly called ETL developers) had to squeeze in specific data ingestions.
Within a short span of years, data warehousing became so popular that a host of boutique ETL vendors cropped up. Then, around the early- to mid-2000s, the data warehouse appliance wave hit the market, with Teradata, Netezza, DATAllegro, Dataupia and others climbing on board.
This was a boon to the ETL business but also to the Data Virtualization 1.0 movement, primarily occupied by Composite Software (bought by Cisco, then spun out, then picked up by TIBCO) and Denodo Technologies. Both remain going concerns in the data world.
Big Data Boom
Then came big data. Vastly larger, much more unwieldy and in many cases faster than traditional data, this new resource upset the apple cart in disruptive ways. As mega-vendors such as Facebook and LinkedIn rolled their own software, the tech world changed dramatically. The proliferation of database technologies, fueled by open-source initiatives, widened the landscape and diversified the topography of data. These included Facebook’s Cassandra, 10gen’s MongoDB and MariaDB (spun out by MySQL founder Monty Widenius the day Oracle bought Sun Microsystems), all of which are now pervasive solutions.
Let’s not forget about the MarTech 7,000. In 2011, it was the MarTech 150. By 2015, it was the MarTech 2,000. It’s now 7,000 companies offering some sort of sales or marketing automation software. All those tools have their own data models and their own APIs. Egad!
Add to the mix the whole world of streaming data. By open-sourcing Kafka to the Apache Software Foundation, LinkedIn let loose the gushing waters of data streams. These high-speed freeways of data largely circumvent traditional data management tooling, which can’t stand the pressure.
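To make that concrete, here is a minimal, hypothetical sketch of the producing side of such a stream, written with the open-source kafka-python client. The broker address, topic name and event payload are illustrative only; the point is that records flow continuously, rather than waiting for a nightly batch window.

```python
# A minimal sketch of publishing events onto a Kafka topic as they happen.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",                       # hypothetical broker address
    value_serializer=lambda v: json.dumps(v).encode("utf-8"), # JSON-encode each event
)

# Each user click becomes an event on the "clickstream" topic the moment it occurs.
producer.send("clickstream", {"user_id": 42, "action": "page_view", "page": "/pricing"})
producer.flush()  # block until the broker has acknowledged the buffered events
```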
Doing the math, we see a vastly different scenario for today’s data, as compared to only a few years ago. Companies have gone from relying on five to 10 source systems for an enterprise data warehouse to now embracing dozens or more systems across various analytical platforms.
Meanwhile, the appetite for insights is greater than ever, as is the desire to dynamically link analytical systems with operational ones. The end result is a tremendous amount of energy focused on the need for … (wait for it!) … meaningful data orchestration.
For performance, governance, quality and a vast array of business needs, data orchestration is taking shape right now out of sheer necessity. The old highways for data have become too clogged and cannot support the necessary traffic. A whole new system is required.
To wit, there are several software companies focused intently on solving this big problem. Here are just a few of the innovative firms that are shaping the data orchestration space. (If you’d like to be included in this list, send an email to info@insideanalysis.com.)
Ascend
Going all the way upstream to solve the data provisioning challenge, Ascend uses AI to dynamically generate data pipelines based upon user behavior. In doing so, the company has cut the required code by an estimated 80% and can very quickly provision not only the pipelines themselves but also the infrastructure necessary to run them. Ascend also enables the querying of the pipelines themselves, thus providing valuable visibility into the movement and transformation of information assets. It’s a clever approach to handling the bulk of an organization’s data provisioning.
Dell Boomi
An early player in the cloud integration business, Boomi was kept as a strategic component of Dell’s software portfolio while other assets were sold off, arguably to help finance the 2016 acquisition of EMC. Boomi takes a comprehensive approach to hybrid cloud integration that includes a Master Data Hub, API management, AI-augmented workflow management, app development and B2B/EDI management. As such, it has become a premier vendor in the large-scale cloud integration space, especially for business-to-business connections.
GeminiData
Call it Data Virtualization 2.0. GeminiData touts a “Zero Copy” approach that grabs data as needed, as opposed to the ETL-oriented data warehouse model. The company’s founders have a history with Splunk and leveraged their experience with the systems monitoring giant by enabling the rapid integration of log data. They also have ties to Zoomdata, which created the “micro-query” concept. By allowing users to mix and match both structured and unstructured data, in a real-time sandbox that can connect to a whole range of different sources and targets, GeminiData has opened the door to a whole new architectural approach for data discovery.
HVR
Hailing from the original real-time database offering, GoldenGate (later acquired by Oracle), the founders of HVR focused on the multisource, multitarget reality of today’s data world. Using log-based change data capture for data replication, the company can solve all manner of data-provisioning challenges. Whether for data lakes or legacy operational systems, HVR gives its users the ability to replicate large data sets very quickly. The company prides itself on the durability and security of its platform, which is designed to serve as a marshaling area for a wide range of enterprise data.
StreamSets
Another innovator in this fast-developing space, StreamSets touts what it calls the industry’s first data operations platform. Recognizing the multifaceted, multivariate nature of modern data, the company built a control panel for data movement. Akin to a sophisticated water management system for a large, high-end spa, this platform builds upon several key design principles: low-code development; minimalist schema design; streaming and batch as first-class citizens; DevOps capabilities for data; awareness of data drift; decoupling of data movement from infrastructure; inspection of data flows; and data “sanitization” upon ingest.
Eric Kavanagh is CEO of The Bloor Group, a new-media analyst firm focused on enterprise technology. A career journalist with more than two decades of experience in print, broadcast and Internet media, he also hosts DM Radio for Information Management Magazine, and the ongoing Webcast series for the Global Association of Risk Professionals (GARP).