The unprecedented volume of data workloads has driven a shift in hybrid data architecture. In response, Cloudera has integrated the Iceberg format into its Data Platform and explains why.
Over the past decade, the success of customers deploying large-scale data platforms has acted as a multiplier, prompting demand to bring in even more data, apply more sophisticated analytics, and hire many new data professionals, from business analysts to data scientists. This unprecedented level of data workloads is not without challenges. The data architecture layer is one area where growing datasets have pushed the boundaries of scalability and performance. For this reason, Cloudera has decided to integrate the Iceberg format into its Cloudera Data Platform. Let’s analyze the various elements, paying attention to openness and adherence to market standards.
Apache Iceberg: Create a data lakehouse anywhere
Apache Iceberg was born at Netflix to solve the problems associated with large tables at the petabyte scale. Netflix then donated it to the open-source community in 2018 as a project within the Apache Incubator. Cloudera was instrumental in expanding Apache Iceberg, an industry-standard, high-performance format for massive analytic tables. Those familiar with traditional Structured Query Language (SQL) will immediately recognize the format of Iceberg tables. The open table format allows multiple engines (Hive, Impala, Spark, Trino, Flink and Presto) to work on the same data simultaneously. It also tracks how the dataset evolves and records other changes over time.
Open approach to hybrid data
Iceberg is a core element of the Cloudera Data Platform (CDP), enabling users to build an open data lakehouse architecture that delivers multifunctional analytics on large datasets, both streamed and archived, all in a cloud-native object store that works both on-premises and across multiple clouds. Through various CDP data services, including Cloudera Data Warehousing (CDW), Cloudera Data Engineering (CDE), and Cloudera Machine Learning (CML), users can define and manipulate datasets with SQL commands. Users can also build complex data pipelines using features such as time travel and deploy machine learning (ML) models built on data contained in Iceberg tables.
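To make the idea of time travel concrete, here is a minimal sketch of a snapshot-based table. It is a toy model, not Iceberg's API: `ToyTable`, `Snapshot`, `commit`, and `read` are hypothetical names, and the rows are kept in memory, whereas a real table format tracks data files through metadata. The point is only that every commit produces a new immutable snapshot, so an earlier state remains readable.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Row = Tuple[int, str]


@dataclass(frozen=True)
class Snapshot:
    """Immutable view of the table at the moment of a commit."""
    snapshot_id: int
    rows: Tuple[Row, ...]


@dataclass
class ToyTable:
    """Hypothetical in-memory stand-in for a snapshot-based table."""
    snapshots: List[Snapshot] = field(default_factory=list)

    def commit(self, rows: List[Row]) -> int:
        """Append rows, record a new snapshot, and return its id."""
        current = self.snapshots[-1].rows if self.snapshots else ()
        snapshot_id = len(self.snapshots) + 1
        self.snapshots.append(Snapshot(snapshot_id, current + tuple(rows)))
        return snapshot_id

    def read(self, as_of: Optional[int] = None) -> List[Row]:
        """Read the latest state, or the state as of a given snapshot id."""
        if not self.snapshots:
            return []
        if as_of is None:
            return list(self.snapshots[-1].rows)
        for snap in self.snapshots:
            if snap.snapshot_id == as_of:
                return list(snap.rows)
        raise KeyError(f"unknown snapshot id: {as_of}")


table = ToyTable()
first = table.commit([(1, "click"), (2, "view")])
second = table.commit([(3, "purchase")])

latest_rows = table.read()               # state after both commits
rows_at_first = table.read(as_of=first)  # "time travel" to the first commit
```

Reading with `as_of=first` returns only the two rows from the first commit, while the default read sees all three. A pipeline can use the same mechanism to reproduce a report or retrain a model against the exact data it saw earlier.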
The benefits for productivity
By contributing to the open-source community, Cloudera has extended Iceberg support to Hive and Impala. It has thus built an all-in-one analytics data architecture that can handle large-scale data engineering, BI, fast querying, and ML workloads. Cloudera has integrated Iceberg into CDP’s Shared Data Experience (SDX) layer to accelerate the productivity and performance benefits of the open table format. Additionally, Iceberg’s native integration benefits from various enterprise-grade features of SDX, such as data lineage, auditing, and security. Cloudera ensures that organizations can build an open lakehouse anywhere, on any public cloud or on-premises. Furthermore, the open approach ensures the freedom to choose the preferred analysis tool without any lock-in.
Apache Ranger: Policy management for the entire hybrid environment
Apache Ranger is a software framework that enables, monitors and manages comprehensive data security in the CDP platform. It is the tool for creating and managing access policies for the data and services of the CDP stack. Security administrators can define security policies at the database, table, column and file level and administer permissions for specific groups or individuals. Ranger handles the entire process of user authentication and access rights to data resources. For example, a particular user might be allowed to create a policy and view reports, but not to edit users and groups.
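The policy model described above (resources at database, table and column level, granted to groups or individuals) can be sketched in a few lines. This is an illustrative toy, not Ranger's API or policy format: `Policy`, `is_allowed`, the `"*"` wildcard and all the sample names are assumptions made for the example, and the evaluator simply denies anything no policy allows.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class Policy:
    """Hypothetical access policy: who may do what on which resource."""
    database: str
    table: str                 # "*" matches any table
    column: str                # "*" matches any column
    groups: FrozenSet[str]     # groups the policy applies to
    users: FrozenSet[str]      # individual users the policy applies to
    accesses: FrozenSet[str]   # allowed access types, e.g. "select"


def _matches(pattern: str, value: str) -> bool:
    return pattern == "*" or pattern == value


def is_allowed(policies: List[Policy], user: str, user_groups: FrozenSet[str],
               database: str, table: str, column: str, access: str) -> bool:
    """Return True if any policy grants the access; deny by default."""
    for p in policies:
        resource_ok = (p.database == database
                       and _matches(p.table, table)
                       and _matches(p.column, column))
        subject_ok = user in p.users or bool(p.groups & user_groups)
        if resource_ok and subject_ok and access in p.accesses:
            return True
    return False


policies = [
    # Analysts may read a single column of one table.
    Policy("sales", "orders", "amount",
           frozenset({"analysts"}), frozenset(), frozenset({"select"})),
    # One named user may read and update everything in the database.
    Policy("sales", "*", "*",
           frozenset(), frozenset({"alice"}), frozenset({"select", "update"})),
]

analyst_groups = frozenset({"analysts"})
bob_reads_amount = is_allowed(policies, "bob", analyst_groups,
                              "sales", "orders", "amount", "select")
bob_reads_email = is_allowed(policies, "bob", analyst_groups,
                             "sales", "orders", "customer_email", "select")
alice_updates_email = is_allowed(policies, "alice", frozenset(),
                                 "sales", "orders", "customer_email", "update")
```

Here the analyst can read `amount` but not `customer_email`, while the named user can update either column. Deny-by-default keeps the sketch short; a production policy engine typically adds explicit deny rules, exceptions and auditing.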
Apache Atlas: Metadata Management and Governance
Apache Atlas is a metadata management and governance system used to help find, organize, and manage data assets. In essence, it works like a traffic cop within a data architecture. By creating metadata representations of objects and operations within the data lake, Atlas helps users understand why models produce specific results, all the way back to the starting data source. Using the collected metadata, Atlas creates relationships between data assets.
When Atlas receives query information, it annotates the query input and output and generates a path map that tracks data usage and transformation over time. This visualization of data transformations enables governance teams to quickly identify a data source and understand the impact of data and schema changes.
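As a rough illustration of how query inputs and outputs can be turned into a lineage map, the sketch below records each query as edges in a graph and then walks the graph in either direction. It is not the Atlas data model or API: `LineageGraph`, `record_query`, `sources_of`, `impacted_by` and the dataset names are all hypothetical.

```python
from collections import defaultdict
from typing import DefaultDict, Iterable, Mapping, Set


class LineageGraph:
    """Hypothetical lineage store built from query inputs and outputs."""

    def __init__(self) -> None:
        self._upstream: DefaultDict[str, Set[str]] = defaultdict(set)
        self._downstream: DefaultDict[str, Set[str]] = defaultdict(set)

    def record_query(self, inputs: Iterable[str], output: str) -> None:
        """Register that a query read `inputs` and wrote `output`."""
        for src in inputs:
            self._upstream[output].add(src)
            self._downstream[src].add(output)

    @staticmethod
    def _walk(start: str, edges: Mapping[str, Set[str]]) -> Set[str]:
        """Collect every asset reachable from `start` along `edges`."""
        seen: Set[str] = set()
        stack = [start]
        while stack:
            node = stack.pop()
            for nxt in edges.get(node, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen

    def sources_of(self, asset: str) -> Set[str]:
        """All assets that feed into `asset`, directly or indirectly."""
        return self._walk(asset, self._upstream)

    def impacted_by(self, asset: str) -> Set[str]:
        """All assets that would be affected by a change to `asset`."""
        return self._walk(asset, self._downstream)


lineage = LineageGraph()
lineage.record_query(["raw.events"], "staging.sessions")
lineage.record_query(["staging.sessions"], "mart.daily_report")
lineage.record_query(["raw.users", "staging.sessions"], "mart.user_profile")

report_sources = lineage.sources_of("mart.daily_report")
events_impact = lineage.impacted_by("raw.events")
```

`sources_of` answers the question of where a result came from, and `impacted_by` answers which downstream assets a schema or data change would touch. These are the two questions the paragraph above attributes to governance teams.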
Apache Ozone: Open source answer for high-density on-premises storage
Separating compute and data resources in the cloud offers many benefits to a CDP implementation. It provides more options for allocating compute and storage resources, and allows you to shut down server clusters to avoid unnecessary compute expense while leaving data available to other applications. Additionally, resource-intensive workloads can be isolated on dedicated compute clusters, separate from other workloads.
To make these benefits consistent everywhere, even on premises, CDP Private Cloud, the on-premises version of CDP, uses Apache Ozone to separate storage from compute. Apache Ozone is a high-performance, scalable, and distributed on-premises object store that supports the same interaction model as AWS S3, Microsoft Azure Data Lake Storage (ADLS), or Google Cloud Storage (GCS).
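One practical consequence of sharing the S3 interaction model is that the same client code can target either store, with only the endpoint changing. The sketch below assumes the third-party `boto3` library and an S3-compatible endpoint exposed by the on-premises store; the host name, port, and credentials are placeholders invented for the example, not values from any real deployment.

```python
from typing import Any, Dict, Optional


def s3_client_kwargs(access_key: str, secret_key: str,
                     endpoint_url: Optional[str] = None) -> Dict[str, Any]:
    """Build keyword arguments for boto3.client("s3", ...).

    With endpoint_url=None the client talks to AWS S3; with a URL it
    talks to whatever S3-compatible service listens at that address.
    """
    return {
        "endpoint_url": endpoint_url,
        "aws_access_key_id": access_key,
        "aws_secret_access_key": secret_key,
    }


def make_client(kwargs: Dict[str, Any]):
    """Create the client; boto3 is imported lazily because it is third-party."""
    import boto3
    return boto3.client("s3", **kwargs)


# Placeholder endpoint for an on-premises S3-compatible gateway (assumption).
on_prem_kwargs = s3_client_kwargs(
    "EXAMPLE_KEY", "EXAMPLE_SECRET",
    endpoint_url="http://ozone-s3.example.internal:9878",
)
# Same call shape for the public cloud: no endpoint override.
cloud_kwargs = s3_client_kwargs("EXAMPLE_KEY", "EXAMPLE_SECRET")

# Usage, once boto3 is installed and the endpoint is reachable:
#   client = make_client(on_prem_kwargs)
#   client.list_buckets()
```

Keeping the endpoint as the only difference means application code written against one store does not need to be rewritten when a workload moves between on-premises and cloud environments.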
Cloudera has always focused on the industrialization of open-source data management and analytical innovation. It is a strategy in which we strongly believe, since companies weigh the economic aspect carefully, and they rarely reward closed or proprietary platforms or those built by a single vendor without an extensive ecosystem. Cloudera is one of 20 vendors in the crowded cloud database management systems (CDMS) market.
Choosing a vendor to meet your specific needs can seem like a daunting task. But opting for a provider that has an open approach is crucial. Regardless of the type of business implementation, data comes from many different sources and must work with the source and target systems in a much more open way. Any software used must be built with this in mind.
Cloudera: open approach to hybrid data
In this context, Cloudera intends to leverage multiple open-source systems to provide hybrid multi-cloud solutions and the greatest choice to customers, so that they can always stay one step ahead in terms of innovation and interoperability. Large enterprises with significant amounts of data see Cloudera as the right company to manage end-to-end data on premises or in the public cloud, or even to collect data from a SaaS application. Cloudera is bringing it all together under one roof, with a single solution for data management at scale.