The Future of the Modern Data Stack
The modern data stack is a collection of cloud-native, open data platforms and services. It has made a giant leap, but it is still nascent.
So what is the future of the modern data stack?
Seven areas look meaningful and exciting for the future: holistic analytics, a focus on value, multi-cloud virtualization, an open data platform, an open source strategy, speed, and the rise of SQL.
Holistic Data Analytics (HDA)
Data analytics is not just BI. With the current data processing and machine learning capabilities, we can move from BI or traditional analytics to advanced holistic data analytics.
We can integrate business intelligence and intelligent analytics to enable holistic analytics, including descriptive, diagnostic, predictive, and prescriptive analytics.
Can we use past and current data to generate future information without explicit training or serving? ML analytics (predictive and prescriptive) will be the trend for data analytics and platforms.
Focus on Value with Engineering
Unlocking business value from data is always the primary goal of a data platform. The journey from data to business value often involves multiple people or teams. Should we provide different tools for each team? The answer should be no.
There are three practical ways to accelerate data business value:
Close the data flow with an end-to-end loop to maximize business value. Although the industry has been aware of this approach for several years, there is still no straightforward solution. It requires connecting all relevant points and moving high-quality data through every node in the pipeline.
Integrate data stacks and unify data to simplify the process and operation. This can effectively reduce the overall cost and improve productivity.
Provide low-code/no-code tools to democratize data. We moved the data stack and platform to the cloud and offered it as Data as a Service, but for most business users this may not be enough. Low-code/no-code tools can give them a painless, out-of-the-box experience.
Multi-Cloud Virtualization
The cloud has significantly enhanced the scalability and reliability of the modern data stack. But there are, and will continue to be, multiple cloud providers. Unfortunately, they do not have direct connections such as shared domain layers or secure tunnels.
In addition, some data must be stored locally or regionally due to specific regulations, and more and more enterprises are adopting a multi-cloud strategy. Across these large cloud providers, public and private clouds need to be federated efficiently and securely.
This federation will form a connected virtual cloud layer on top of multiple public and private clouds. In effect, it virtualizes clouds that are already virtualized. Two promising directions may exist for this dual virtualization:
Consensus meta platform for multiple clouds. In this approach, data can be shared and used without moving. The meta platform layer contains the consensus functions of governance, observability, and discoverability.
Multi-cloud resource virtualization with unified orchestration. This will be more challenging than managing resources across VPCs, which has become feasible within a single cloud provider. For data storage resources, it may be relatively simple to distribute and retrieve data via a few proxy interfaces. Furthermore, we can run queries on top of them.
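The idea of distributing and retrieving data through a few proxy interfaces, and then querying on top of them, can be sketched as follows. This is a minimal illustration under stated assumptions: `StorageProxy`, `InMemoryCloud`, and `VirtualStore` are hypothetical names, and the in-memory backends stand in for real cloud object stores.

```python
from typing import Callable, Dict, Iterable, List, Protocol


class StorageProxy(Protocol):
    """Hypothetical proxy interface that every cloud backend would expose."""

    def put(self, key: str, record: dict) -> None: ...
    def scan(self) -> Iterable[dict]: ...


class InMemoryCloud:
    """Stand-in for one cloud's object store (illustration only)."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._objects: Dict[str, dict] = {}

    def put(self, key: str, record: dict) -> None:
        self._objects[key] = record

    def scan(self) -> Iterable[dict]:
        return list(self._objects.values())


class VirtualStore:
    """Routes writes by region and queries across all registered clouds."""

    def __init__(self, clouds_by_region: Dict[str, StorageProxy]) -> None:
        self._clouds = clouds_by_region

    def put(self, region: str, key: str, record: dict) -> None:
        # Data is written only to the cloud assigned to its region.
        self._clouds[region].put(key, record)

    def query(self, predicate: Callable[[dict], bool]) -> List[dict]:
        # Federated read: each backend is scanned through its proxy,
        # and the matching records are merged into one result.
        return [r for cloud in self._clouds.values() for r in cloud.scan() if predicate(r)]


store = VirtualStore({"eu": InMemoryCloud("cloud-a"), "us": InMemoryCloud("cloud-b")})
store.put("eu", "order-1", {"id": 1, "amount": 40})
store.put("us", "order-2", {"id": 2, "amount": 75})
large_orders = store.query(lambda r: r["amount"] > 50)
print(large_orders)
```

The caller sees a single store; which provider holds each record is a routing detail behind the proxy.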
Flexible but Cohesive Open Platform
The future of the data stack will be an open platform for easy integration, secure sharing, low latency, high reliability, and consistent governance.
For example, data should flow with no extra effort from source to storage via ETL, through transformation via in-place ELT, and back to operational systems with results via reverse ETL.
The platform should dramatically enhance data quality and engineering productivity through observability and discoverability. Data lineage, semantics, statistics, metrics, and other knowledge can then become first-class citizens.
It can also extend from ML analytics to AI engineering and machine learning, such as knowledge graphs. The result will be a flexible but cohesive open platform.
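The three movements in that flow (ETL, in-place ELT, and reverse ETL) can be sketched with an in-memory SQLite database playing the role of the warehouse. The table names, the sample rows, and the `crm` dictionary are hypothetical stand-ins chosen for illustration.

```python
import sqlite3

# Hypothetical source records and operational system (illustrative only).
source_rows = [("alice", "  42.50"), ("bob", "17.00"), ("alice", "7.50")]
crm = {}  # stand-in for an operational tool that receives results

warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE raw_orders (customer TEXT, amount REAL)")

# ETL: extract from the source, apply a light transform, load into storage.
cleaned = [(name, float(amount.strip())) for name, amount in source_rows]
warehouse.executemany("INSERT INTO raw_orders VALUES (?, ?)", cleaned)

# ELT: transform in place, inside the warehouse, with SQL.
warehouse.execute(
    """
    CREATE TABLE customer_totals AS
    SELECT customer, SUM(amount) AS total
    FROM raw_orders
    GROUP BY customer
    """
)

# Reverse ETL: push the modeled results back to the operational system.
for customer, total in warehouse.execute("SELECT customer, total FROM customer_totals"):
    crm[customer] = total

print(crm)
```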
Open Source Strategy
Big data and data platforms should start from open source and move to the cloud later. Given the nature of open source and the limitations of the cloud, the future of the modern data stack should embrace both cloud and open source for business success and user engagement.
Startups with open source strategies are now highly attractive to venture capital. For example, TDengine from TAOS Data has grown exponentially through open source since its inception. There are many other open source success stories in the modern data stack community, such as Databricks, Starburst, and Dremio.
Speed is King
Over the past few decades, we have significantly improved the performance of data processing and querying. In the modern data stack, speed is still king. It is not just about user experience but also about decision speed and cost. Many of us preferred Spark over Hadoop because of its performance, and it is the same story for Snowflake over Redshift. Given the large volume and complexity of data, a breakthrough in speed will be the next milestone in the modern data stack. For example, Firebolt has risen rapidly on the promise of higher speed.
Rising of SQL
SQL stemmed from data management and databases. Its elegant simplicity and widely used standards make it the most common language in the modern data stack. More and more data services and platforms are starting to support SQL. For example, querying and analyzing streaming data using SQL is not new, and a few startups now use SQL to retrieve instant predictive analytics results. We can expect more SQL in data engineering and more Python in AI engineering.
We can see many opportunities for the future of data platforms. The seven areas above emerge from the core perspectives of business value, data infrastructure, user experience, and team collaboration.
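As a small illustration of SQL applied to streaming-style data, the sketch below groups events into fixed time windows with plain SQL. It is an approximation under stated assumptions: the sensor events and the `WINDOW_SECONDS` value are made up, and SQLite evaluates the query on demand, whereas a real streaming SQL engine would evaluate it continuously.

```python
import sqlite3

# Hypothetical sensor events as (timestamp_seconds, sensor, value).
events = [(1, "s1", 10.0), (4, "s1", 14.0), (12, "s1", 30.0), (15, "s2", 5.0)]

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE events (ts INTEGER, sensor TEXT, value REAL)")

WINDOW_SECONDS = 10  # assumed tumbling-window size


def window_averages():
    """Average value per sensor per tumbling window, expressed in plain SQL."""
    return db.execute(
        """
        SELECT ts / ? AS window_id, sensor, AVG(value) AS avg_value
        FROM events
        GROUP BY window_id, sensor
        ORDER BY window_id, sensor
        """,
        (WINDOW_SECONDS,),
    ).fetchall()


# Simulate the arrival of events, then query the windows seen so far.
for event in events:
    db.execute("INSERT INTO events VALUES (?, ?, ?)", event)

results = window_averages()
print(results)
```

Integer division of the timestamp by the window size assigns each event to its window, so the same `GROUP BY` pattern used for batch data also describes the windowed aggregate.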