More than ever, companies are using data to make business decisions in real time. This demand for ubiquitous, instant access to data makes it imperative for organizations to move beyond legacy architectures that can’t handle their workloads.
Ronald van Loon, an HPE partner, recently spoke with Matt Maccaux, global field CTO of the Ezmeral Enterprise Software BU at Hewlett Packard Enterprise. Matt provided meaningful insights into the challenges of moving to a cloud-native analytics environment, the steps companies can take to make that transition, and some key technology trends.
“It’s not trivial, it is not a simple process because these data-intensive applications don’t tend to work in those cloud-native environments,” Matt says about companies moving their advanced analytics infrastructure to the cloud. The increased need for instant access to data, the high velocity of new information, and the low tolerance for latency have forced companies of all sizes to reevaluate how they build their IT infrastructure.
The Challenges of Supporting Real-Time Analytics
Data volumes have increased exponentially, with more than 90% of the data in the world today having been created in the past two years alone. In 2020, 64.2 zettabytes of data was generated or replicated, and this growth is attributed to the number of people learning, training, interacting, working, and entertaining themselves from their homes. Most companies do not store all of their raw data indefinitely, so how can they analyze it to deliver business insights? Analyzing high-velocity big data streams using traditional data warehousing and analytics tools has proven to be challenging.
To analyze data at the speed of business, companies need real-time analytics solutions that can ingest large volumes of data in motion as it is constantly generated by devices, sensors, applications, and machines. In addition to processing data in real time (also known as “streaming”), the solution must be able to capture and store data when it is not in motion, for analytics on “batch” data.
This presents a significant challenge because most existing data warehousing and business intelligence tools were designed primarily for analysis of historical, stored data, and are typically not optimized for low-latency access to streaming data.
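To make the streaming-plus-batch requirement concrete, here is a minimal sketch using open-source Apache Spark (PySpark) Structured Streaming. The bucket paths and event schema are illustrative assumptions, not a specific product configuration:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import (StructType, StructField, StringType,
                               DoubleType, TimestampType)

spark = SparkSession.builder.appName("stream-and-batch").getOrCreate()

# Illustrative schema for events emitted by devices and sensors.
schema = StructType([
    StructField("device_id", StringType()),
    StructField("reading", DoubleType()),
    StructField("event_time", TimestampType()),
])

# Streaming path: analyze data "in motion" as it lands in a (hypothetical) bucket.
events = spark.readStream.schema(schema).json("s3a://example-bucket/raw-events/")

# Persist the stream so the same data is also available "at rest".
(events.writeStream
       .format("parquet")
       .option("path", "s3a://example-bucket/curated/")
       .option("checkpointLocation", "s3a://example-bucket/checkpoints/")
       .start())

# Batch path (run once data has accumulated): traditional historical analysis.
history = spark.read.parquet("s3a://example-bucket/curated/")
history.groupBy("device_id").avg("reading").show()
```

The point of the sketch is that one platform serves both paths: the stream is analyzed while the data is in motion, and the stored copy remains available for traditional batch analysis.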
Transitioning to a Cloud-Native Environment
The reason it’s particularly challenging for companies to shift from an on-premises environment to a cloud-native environment is scale. The vast majority of companies have invested heavily in on-premises hardware, software and skills over the years, but they must now overhaul their IT infrastructure to deal with workloads that simply could not be handled when those investments were made.
In addition, although today’s data volumes are massive, they will be dwarfed by the data created when the Internet of Things (IoT), 5G and other major technology shifts take hold.
Making Big Changes with Small Steps
As a result, it makes sense to evolve toward an architecture that will support your workloads, whether or not they currently run in the cloud, rather than rip everything out and start from scratch. This is where small steps come into play: start with a data warehouse in the cloud, and then add real-time analytics capabilities on top of it.
Many companies are already making this transition, but they are moving at an agonizingly slow pace because of the massive challenge such a change presents.
Separating Compute and Storage
Separating compute and storage in a cloud environment can produce a cloud-native data analytics platform that performs real-time and near-real-time analysis on both streaming and stored data, while also giving different teams access to their own raw data at any time. The compute, storage, security, and networking functions of the on-premises environment are encapsulated in an elastic container running in the cloud, while an intelligent gateway with built-in algorithms ingests each dataset into the cloud and exposes it to users for analysis.
The combination of a modern data warehouse architecture (either in the cloud or on-premises) and real-time analytics enables low-latency access to your data from nearly any device or location. It also allows you to start analyzing your data in near real-time and store it for future analysis, be it batch or offline analytics.
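As a rough illustration of this decoupling, the following sketch configures a PySpark session whose compute (the executors) is sized independently of the data, which lives in a shared object store. The endpoint, bucket, and table names are assumptions for illustration only:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("decoupled-compute")
         # Compute is sized per job and can be scaled or torn down freely...
         .config("spark.executor.instances", "4")
         .config("spark.executor.memory", "8g")
         # ...because the data lives in a durable, shared object store.
         .config("spark.hadoop.fs.s3a.endpoint", "https://objects.example.com")  # hypothetical endpoint
         .getOrCreate())

# Any team can attach its own compute to the same raw data at any time.
sales = spark.read.parquet("s3a://warehouse/raw/sales/")  # hypothetical bucket
sales.createOrReplaceTempView("sales")
spark.sql("SELECT region, SUM(amount) AS total FROM sales GROUP BY region").show()
```

Because nothing of value is stored on the compute nodes themselves, the cluster can be grown, shrunk, or replaced without touching the data.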
Cloud-native compute containers
Containers are a key part of cloud-native architectures because they enable the rapid deployment of applications without requiring the installation, configuration, and ongoing maintenance of a dedicated operating system for each application.
Deploying containers in production
Once a data analytics workload has been migrated to the cloud, you can start deploying containers for that workload. The container should be tied to your data and placed in such a way that the compute resources are elastic (meaning additional resources can be added or removed) and easily configurable.
In addition, it is recommended to run the compute resources in private containers so that they are protected from other workloads and can be managed as independent services.
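As an illustrative sketch of this pattern using the Docker SDK for Python, note that the image name, network, and limits below are assumptions rather than a prescribed configuration:

```python
import docker  # Docker SDK for Python (docker-py)

client = docker.from_env()

# Run the analytics job as an isolated, independently managed service with
# explicit, adjustable (elastic) resource limits.
container = client.containers.run(
    "example/analytics-job:latest",       # hypothetical image
    detach=True,
    mem_limit="8g",                       # cap memory to protect neighboring workloads
    nano_cpus=4_000_000_000,              # roughly 4 CPUs; adjust as demand changes
    network="analytics-net",              # private network instead of the default bridge
    environment={"DATA_PATH": "s3a://warehouse/raw/sales/"},  # tie the container to its data
)
print(f"Started {container.short_id}")
```

Because the limits are just parameters, resizing a workload becomes a configuration change rather than a reinstallation.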
Managing containers
If you deploy your analytics workloads inside containers, you need to manage them. It is possible to use the same container management tools that are used for managing traditional applications to manage cloud-native assets, but it requires a different way of thinking about how they are deployed and managed.
A major advantage of using containers is that they run in isolation, but this advantage is only fully realized if the containers are managed with granular resource and service-level policies. This requires tighter integration between container management tools and cloud orchestration tools to enable dynamic scaling of compute resources for each workload based on demand.
The ability to reallocate resources from one workload to another as needed is particularly important in a multi-tenant environment, since you will want to prevent collocated workloads from contending for the same resources.
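As one way to express such policies, here is a hedged sketch using the official Kubernetes Python client; the namespace, image, and thresholds are placeholders, not a recommended production setup:

```python
from kubernetes import client, config

config.load_kube_config()

# Granular resource policy: each container declares what it needs (requests)
# and the most it may use (limits), so tenants cannot starve one another.
container = client.V1Container(
    name="analytics",
    image="example/analytics-job:latest",  # hypothetical image
    resources=client.V1ResourceRequirements(
        requests={"cpu": "2", "memory": "4Gi"},
        limits={"cpu": "4", "memory": "8Gi"},
    ),
)

deployment = client.V1Deployment(
    metadata=client.V1ObjectMeta(name="analytics"),
    spec=client.V1DeploymentSpec(
        replicas=1,
        selector=client.V1LabelSelector(match_labels={"app": "analytics"}),
        template=client.V1PodTemplateSpec(
            metadata=client.V1ObjectMeta(labels={"app": "analytics"}),
            spec=client.V1PodSpec(containers=[container]),
        ),
    ),
)
client.AppsV1Api().create_namespaced_deployment(namespace="tenant-a", body=deployment)

# Dynamic scaling: an autoscaler reallocates compute as demand rises and falls.
hpa = client.V2HorizontalPodAutoscaler(
    metadata=client.V1ObjectMeta(name="analytics-hpa"),
    spec=client.V2HorizontalPodAutoscalerSpec(
        scale_target_ref=client.V2CrossVersionObjectReference(
            api_version="apps/v1", kind="Deployment", name="analytics"),
        min_replicas=1,
        max_replicas=10,
        metrics=[client.V2MetricSpec(
            type="Resource",
            resource=client.V2ResourceMetricSource(
                name="cpu",
                target=client.V2MetricTarget(type="Utilization",
                                             average_utilization=70),
            ),
        )],
    ),
)
client.AutoscalingV2Api().create_namespaced_horizontal_pod_autoscaler(
    namespace="tenant-a", body=hpa)
```

The requests/limits pair is the granular resource policy, and the autoscaler is what reallocates compute between workloads as demand shifts.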
Key Technology Trends In Modernizing Data Analytics Environments
To handle data-intensive workloads, companies are turning to open-source runtimes such as Kubernetes and Apache Spark. They are also increasingly using container platforms such as Docker and Kubernetes to remove the friction of packaging applications for deployment. With recent advances in hybrid cloud, object storage, elastic compute, and serverless architectures, customers are now taking advantage of these state-of-the-art technologies to modernize their data analytics environments.
- Deploying cloud-native data warehouses
New tools built to move a company’s on-premises data warehouse to the cloud have accelerated the design, build, and deployment of data warehouses.
- Data analytics on an open platform
For the first time, modernized data analytics architectures can be easily extended and managed in a cloud-native environment. Organizations no longer need to choose between legacy proprietary hardware and software or building their own in-house infrastructure. Providers are also using these technologies to deliver big data solutions that are cloud native by design, meaning they can be deployed on-premises or as a service in public clouds with high security and reliability.
- Hybrid cloud and multi-cloud infrastructure
With the rise of hybrid cloud, companies are deploying workloads both on premises and in public clouds. For example, workloads with strict security requirements, or performance-sensitive workloads that need a customized environment with more processing power, can be deployed to a private cloud. Cloud-native technologies like Kubernetes, Docker, and Apache Spark can help move these workloads between environments, as the sketch below illustrates.
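As a hedged sketch of that portability, the following PySpark snippet submits the same application to a Kubernetes cluster; the API server URL, namespace, and container image are placeholders and would simply point at a different cluster on-premises versus in a public cloud:

```python
from pyspark.sql import SparkSession

# The application code is unchanged whether the Kubernetes cluster runs
# on-premises or in a public cloud; only the master URL and image differ.
spark = (SparkSession.builder
         .appName("spark-on-k8s")
         .master("k8s://https://k8s-api.example.com:6443")  # hypothetical API server
         .config("spark.kubernetes.namespace", "analytics")
         .config("spark.kubernetes.container.image", "example/spark-py:3.5.0")  # hypothetical image
         .config("spark.executor.instances", "4")
         .getOrCreate())

print(spark.read.parquet("s3a://warehouse/raw/sales/").count())  # hypothetical data
```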
Creating a Future-Proof Advanced Analytics Environment
A modern data analytics environment leverages an elastic container running in a private cloud to encapsulate compute, storage, networking and security functions of a data warehouse architecture. This results in more agile development and testing cycles as well as faster time-to-production when compared with traditional approaches.