In 2013, Judith Hurwitz and other market experts proclaimed the beginning of the Big Data Era. They perceived that “big data enables organizations to store, manage, and manipulate vast amounts of data at the right speed and at the right time to gain the right insights.”
They were candid that Big Data was not a single technology but a heterogeneous set of data management technologies with roots in several previous technology transformations.
The question now is: Where is Big Data today? And what is needed to mature its application?
To be fair, recent analyst surveys have found that big data has not yet led to big business outcomes. Despite all the hype, most corporate employees still do not have easy access to the information they need to get their jobs done. The problem continues to center on getting the right information to the right people at the right time as the number of information sources, uses, and users grows.
Data Warehouses vs. Data Lakes vs. Data Fabric
To house all this data, storage and management systems have sprung up, like the data warehouse, data lake, and data fabric. “Organizations will need some form of all three of these,” says former CIO Tim McBreen. “But a Data Fabric will be required as an umbrella for all data integration, management, and governance across the enterprise at the solution and platform levels. Cohesion across enterprises is a must.”
“It is often not feasible to centralize data,” adds CIO Carrie Shumaker. “Or, the analysis is prototyped using services to access disparate data sources, and then, if it proves fruitful and business needs dictate it, the centralization is done later.”
Hurwitz Analyst Dan Kirsch sees a connection between the data decentralization trend and data fabric. “We’ve seen a data fabric approach growing in popularity because it’s not realistic to have one central repository where all of your data can be up to date, governed, and clean,” he shares. “For this reason, data fabrics need to allow for heterogeneous data locations. I think a data fabric approach helps with the challenge of shared responsibility — each team is responsible for their own data and then connects it versus dumping data into a data lake. AWS may say a Data Lake is the only path for analytics success. And of course, they want organizations to dump all their data into the AWS cloud.”
Former VP for Data and Analytics at Gartner, Nick Heudecker, agrees and argues that all of these trends are important. “Each concept serves different users and use cases,” he points out. “Data warehouses for high performance, repeatable analytics. Data Lakes for question development/experimentation. Data mesh for consumption of distributed data with governance oversight.” So there is no confusion, Gartner considers data fabrics and data meshes to be equivalent concepts.
Centralizing Your Big Data Strategy Around One Platform
The experts leverage dual strategies but stick to a single platform. Former CIO McBreen says that he likes to have “two strategies. One strategy is for production, and one is for analytics. Each has their own core hub platform and support for multiple data repositories. Then there is an ETL platform (real, near, batch) between the 2 core hubs.”
But which vendor provides the bulk of these services? “I haven’t seen any yet that I thought were good enough on their own to be the complete platform,” McBreen laments.
Shumaker concurs, joking, “Do multiple data repositories often include a few spreadsheets?” For this reason, CIO Deb Gildersleeve says, “In a lot of ways it’s less about centralizing data and more about integrating it. How can you get all your data integrated so you can visualize it and connect it to your other systems (whether that be on premises or cloud)?”
“Centralizing all your data creates cost, governance and security headaches,” Kirsch shares. “Data is locked into line-of-business applications, on premises and within cloud ecosystems. Connecting to data where it resides helps to eliminate risk and increase speed to insights.”
“I don’t think this is a single vendor solution story,” Heudecker agrees. “Some provide query capabilities, but the governance story hasn’t been fleshed out by anyone yet. The ‘big’ in big data makes moving things around a challenge. Multiple platforms is the norm. If you’re lucky, you can normalize around tooling and skills.”
A data fabric, therefore, is a data management concept for attaining flexible, reusable and augmented data integration pipelines, services and semantics, in support of various operational and analytics use cases delivered across multiple deployments and orchestration platforms.
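To make the idea of connecting to data where it resides more concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not a description of any vendor's product: the `Source` and `Fabric` classes, the team names, and the sample rows are all hypothetical. Each team keeps ownership of its own source, and the fabric layer only records where the data lives and forwards queries to it.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List

Row = Dict[str, object]


@dataclass
class Source:
    """A data source that stays where it is (hypothetical type)."""
    name: str
    owner: str                           # team responsible for this data
    location: str                        # e.g. "on-premises" or "cloud"
    fetch: Callable[[], Iterable[Row]]   # reads rows from the source in place


class Fabric:
    """Hypothetical integration layer: registers sources, copies nothing."""

    def __init__(self) -> None:
        self._sources: Dict[str, Source] = {}

    def register(self, source: Source) -> None:
        self._sources[source.name] = source

    def catalog(self) -> List[Dict[str, str]]:
        # A shared view of what data exists, who owns it, and where it lives.
        return [
            {"name": s.name, "owner": s.owner, "location": s.location}
            for s in self._sources.values()
        ]

    def query(self, name: str, predicate: Callable[[Row], bool]) -> List[Row]:
        # Rows are read from the owning source at query time.
        return [row for row in self._sources[name].fetch() if predicate(row)]


# Illustrative sample data standing in for two separately owned systems.
orders = [{"order_id": 1, "total": 40.0}, {"order_id": 2, "total": 250.0}]
customers = [{"customer_id": 7, "region": "EMEA"}]

fabric = Fabric()
fabric.register(Source("orders", "sales-ops", "on-premises", lambda: orders))
fabric.register(Source("customers", "marketing", "cloud", lambda: customers))

large_orders = fabric.query("orders", lambda row: row["total"] > 100)
```

In this sketch the catalog is the only centralized piece. Governance metadata such as ownership sits in one place while the data itself does not move, which is the trade-off the experts describe above.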
Ensuring Adherence to Data Governance and Data Privacy Rules
To govern data effectively, businesses must have a clear grasp of what data they have. Organizations need to “understand what types of data are in their data lake or data fabric,” says Kirsch. “If PII is involved in a specific app or new endeavor, businesses need to assign an executive to oversee the appropriate use of personal data. The executive can also help address the question of what’s possible with data versus what’s appropriate.”
Stewards play a vital governance role. So it comes as no surprise that McBreen says it is important to define “stewards whose whole job is to access and manage corrections to information at its initial source. They rotate out of business teams and KPIs are in place. We review monthly and adjust as needed.”
“It’s important to define stewards up front and know how to check in with them along the way,” Gildersleeve states. “Getting stewards’ feedback on UX design is also important.” Shumaker adds that she likes to have “data stewards sign off on the high-level design. Depending on the data type there is mandatory training on access and compliance to get access to any data set, and for more specialized data sets there may be additional training.”
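The training requirement Shumaker describes can be expressed as a simple access rule. The Python sketch below is only one way to model it, and every name in it (`Dataset`, `User`, `missing_training`, `can_access`, and the course names) is a hypothetical chosen for illustration: each data set lists the training it requires, and a user gains access only after completing all of it.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, Set


@dataclass(frozen=True)
class Dataset:
    """A governed data set and the training it requires (hypothetical)."""
    name: str
    required_training: FrozenSet[str]


@dataclass
class User:
    name: str
    completed_training: Set[str] = field(default_factory=set)


def missing_training(user: User, dataset: Dataset) -> Set[str]:
    """Return the courses the user still has to complete for this data set."""
    return set(dataset.required_training) - user.completed_training


def can_access(user: User, dataset: Dataset) -> bool:
    """Grant access only when no required training is outstanding."""
    return not missing_training(user, dataset)


# Baseline training applies to every data set; specialized sets add more.
sales = Dataset("sales_summary", frozenset({"access-and-compliance"}))
patients = Dataset(
    "patient_records",
    frozenset({"access-and-compliance", "pii-handling"}),
)

analyst = User("analyst", {"access-and-compliance"})
```

Here the analyst can read `sales_summary` but is blocked from `patient_records` until the additional course is complete. Keeping the rule as data makes it easy for stewards to review and adjust.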
Impact of Cloud on Big Data Strategy?
“Cloud is becoming another form of compute and storage rather than a separate environment,” Kirsch insists. “Cloud management and visibility are important. Assuming the cloud is a quick way to blow a budget. In many cases there’s no reason to move some apps to the cloud. Being able to do proofs of concept and experimentation instantly on the cloud is huge. Grabbing GPUs, for example, on the cloud versus purchasing physical infrastructure.”
Gildersleeve agrees, saying “cloud allows organizations to try new things as well as add and remove compute power as needed without having to wait for physical work to be done.”
Where Are Data Processes Maturing?
Processes require a foundation of clearly defined terms. For Gildersleeve, “starting in the transactional systems is critical. If the data starts out wrong, a lot of time is spent scrubbing and enhancing that data.” Shumaker agrees and says that “it’s not sexy, but organizations need to agree upon data definitions that are shared and maintained.”
For this reason, Kirsch suggests that it is time to “change data processes by adopting processes like DataOps. These will become important for data-driven organizations. It won’t be overnight. Businesses are still struggling with DevOps. Data Literacy is critical to delivering success as well. Business school students shouldn’t get their MBA without some understanding of data.”
Heudecker doesn’t disagree when he says, “most maturity is needed in areas that facilitate sharing context around data, so things like data literacy. DataOps can help with resiliency, but it’s still an overwhelmingly technical practice.”
Parting Words
Clearly, Big Data lies in what analysts call the “Trough of Disillusionment.” While data-driven companies will be long-term winners, there is work to do.
Winners need to put in place the data governance required to make data both fit for purpose and protected. They also need to improve their data processes. Together, DataOps and data governance can help. To do this, data winners will create what Jeanne Ross and Martin Mocker call “Operational and Digital Backbones.”