Cloud data leader Apache Iceberg

The advent of the cloud has opened the door to new analytics use cases built on data lakes, data meshes, and other modern architectures, because data teams can ingest vast amounts of data and store it at a reasonable cost. At very large scale, however, general-purpose data storage runs into difficulties and limitations in accessing, managing, and using that data.

A typical blob storage system in the cloud does not hold the information needed to show how files relate to one another or which table they belong to, which makes the query engine's job much harder. Files alone also make it difficult to change a table's schema or to “time travel” across a table. Each query engine must maintain its own view of how to query the files. This is where a data architecture that seemed easy to implement at first glance turns out to be harder than expected.

In this situation, applying a table format to the data is very useful. A table format explicitly defines the table, its metadata, and the files that make up the table. Rather than applying a schema when the data is read, the client already knows the schema before the query runs. Table metadata can also be stored at a finer granularity. Applying a table format to your data therefore has several advantages:

Faster performance through improved filtering or partitioning

Easy schema evolution

Ability to “time travel” across tables to view data as of a specific point in time

ACID compliance for tables
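
To make the first and third advantages concrete, here is a minimal sketch of the idea behind a table format. It is a toy model, not Iceberg's actual metadata layout or API: the names `DataFile`, `Snapshot`, `ToyTable`, `commit`, and `plan_scan`, and the `s3://bucket/...` paths, are all hypothetical. The table keeps a list of snapshots, each snapshot records the complete set of files that made up the table at that moment, and a scan picks files from metadata alone, without listing or opening anything in storage.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class DataFile:
    path: str        # hypothetical object-store path
    partition: str   # partition value, here an event date
    row_count: int


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    timestamp_ms: int
    files: Tuple[DataFile, ...]  # every file in the table at this point


@dataclass
class ToyTable:
    schema: Dict[str, str]  # known to the client before any query runs
    snapshots: List[Snapshot] = field(default_factory=list)

    def commit(self, timestamp_ms: int, added: List[DataFile]) -> Snapshot:
        """Append files by writing a new snapshot; old snapshots are untouched.

        Assumes commits arrive in increasing timestamp order.
        """
        current = self.snapshots[-1].files if self.snapshots else ()
        snap = Snapshot(len(self.snapshots) + 1, timestamp_ms,
                        current + tuple(added))
        self.snapshots.append(snap)
        return snap

    def snapshot_as_of(self, timestamp_ms: int) -> Optional[Snapshot]:
        """Return the latest snapshot committed at or before the timestamp."""
        older = [s for s in self.snapshots if s.timestamp_ms <= timestamp_ms]
        return older[-1] if older else None

    def plan_scan(self, partition: Optional[str] = None,
                  as_of_ms: Optional[int] = None) -> List[str]:
        """Choose the files to read using only table metadata."""
        if as_of_ms is None:
            snap = self.snapshots[-1] if self.snapshots else None
        else:
            snap = self.snapshot_as_of(as_of_ms)
        if snap is None:
            return []
        return [f.path for f in snap.files
                if partition is None or f.partition == partition]


table = ToyTable(schema={"event_date": "date", "user_id": "long"})
table.commit(1000, [DataFile("s3://bucket/t/a.parquet", "2023-01-01", 10)])
table.commit(2000, [DataFile("s3://bucket/t/b.parquet", "2023-01-02", 20)])

latest = table.plan_scan()                            # both files
pruned = table.plan_scan(partition="2023-01-02")      # only b.parquet
time_travel = table.plan_scan(as_of_ms=1500)          # only a.parquet
```

Because every commit writes a new snapshot instead of modifying an old one, a reader that asks for an earlier timestamp simply gets the earlier file list, and a partition filter removes files before any data is read.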

Iceberg's strengths

Choosing a data format is an important decision, because the format enables some features and limits others. Apache Iceberg, an open table format, deserves attention. It has built a strong base of support over the past few years. Apache Iceberg was first developed at Netflix, open-sourced as an Apache Incubator project in 2018, and graduated from the incubator program in 2020.

Iceberg was created from scratch to address several problems, including scale, usability, and performance, that arose when working with very large data sets in Apache Hive. As a Netflix engineer said at the time, a table format for very large data sets should behave reliably and predictably, like SQL, 'without unexpected problems'. There are many options, but Iceberg stands out from other open table formats for five reasons:

A clean break with the past: History can have a large impact on how a table format works today. Some table formats evolved out of older technologies, while others broke with the past and started anew. Iceberg belongs to the latter group. It was built from scratch to address the shortcomings of Apache Hive, avoiding some of the undesirable behavior that has plagued data lakes in the past. A good example is how it handles schema changes, such as renaming columns.

This also means that Iceberg does not have to work out how to break further away from those older tools without causing problems in production data applications. Over time, other table formats will likely catch up, but for now Iceberg is focused on delivering new features rather than looking back to fix old problems.
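
The column-rename case mentioned above can be sketched as follows. Iceberg's specification identifies each column by a unique field ID rather than by name or position, so a rename is a metadata-only change. The code below is a simplified illustration of that idea, not Iceberg's implementation: `Column`, `ToySchema`, `rename`, and `read_row` are hypothetical names, and a stored row is modeled as a plain dict keyed by field ID.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class Column:
    field_id: int  # stable identifier, never reused or changed
    name: str      # display name, free to change


class ToySchema:
    def __init__(self, names: List[str]) -> None:
        # Assign IDs once, when the columns are created.
        self.columns = [Column(i + 1, n) for i, n in enumerate(names)]

    def rename(self, old: str, new: str) -> None:
        """Rename a column by changing only its name in the metadata."""
        for col in self.columns:
            if col.name == old:
                col.name = new
                return
        raise KeyError(old)

    def read_row(self, stored: Dict[int, Any]) -> Dict[str, Any]:
        """Resolve a stored row (keyed by field ID) against current names."""
        return {col.name: stored.get(col.field_id) for col in self.columns}


schema = ToySchema(["id", "usr_name"])

# A row written before the rename; data files are keyed by field ID.
old_file_row = {1: 42, 2: "ada"}

schema.rename("usr_name", "user_name")
row = schema.read_row(old_file_row)
```

Here the row written before the rename is still read correctly under the new name, and no data file had to be rewritten. A scheme that matches columns by name would instead return a missing value for `user_name` when reading the old file.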

Engine- and file-format-agnostic: Iceberg decouples the processing engine from the table format, providing greater flexibility and choice. Engineers can pick the best tool for the task at hand instead of being tied to a single processing engine. This choice matters for two reasons. First, the engines a business uses to process data can change over time. For example, many companies have moved from Hadoop to Spark or Trino. Second, large enterprises commonly use several technologies, and having a choice lets them use different tools interchangeably.

Iceberg also supports several file formats, including Apache Parquet, Apache Avro, and Apache ORC. This gives some flexibility today, and it also leaves room to adopt new file formats that may emerge in the future.

A well-maintained open source project: The Iceberg project is governed by the Apache Software Foundation, so it follows several important principles of the Apache Way, including earned authority and consensus decision-making. Not every project that calls itself "open source" works this way. Apache Iceberg makes its project management public, so you can see who is running the project. Other table formats do not go that far and do not disclose who holds decision-making authority. A table format is a foundational choice in a data architecture, so choosing a project that is truly open and collaborative can significantly lower the risk of unintended dependencies.

Collaboration is a source of new ideas and help: There are several signs that the collaborative community around Apache Iceberg benefits users and improves the project's long-term prospects. For users, both the Slack channel and the GitHub repository are highly active in supporting new ideas and existing features. Importantly, participation comes from across the industry, not from a single group or from Iceberg's original creators alone.

This high level of collaboration also benefits the technology itself. The project increasingly attracts proposals that approach different use cases from different perspectives. New projects and ideas such as Project Nessie, the Puffin spec, and the open Metadata API have also grown out of this project.

Includes features that are paid in other table formats: Unlike some other table format projects, Iceberg has had performance-focused features built in from the start, which benefits users in several ways. First, users often assume that a project with public code includes its performance features, but in practice those features are sometimes missing or exist only as vague promises for the future. Second, if you want to move your workloads (which should be easy if you use a table format), you are much less likely to run into large differences between Iceberg implementations. Third, once you start using open source Iceberg, you are unlikely to find a feature you need hidden behind a paywall. The line between what is open and what is not is also not something that shifts over time.

As a public project from the start, Iceberg was created to solve practical problems, not to serve a business use case. This is a small but important difference. Vendors that support Iceberg while selling paid products, such as Snowflake, AWS, Apple, Cloudera, and Google Cloud, can compete on how well they implement the Iceberg specification, but the Iceberg project itself was created to be independent of any particular company's business.

Snowflake and Iceberg

Snowflake initially built its own table format to deliver all of its new features. However, companies moving to cloud data platforms have different needs and different timelines. Some companies have regulatory requirements that restrict where data is stored, while others have existing investments that need to be protected.

Supporting external table formats such as Iceberg allows businesses to use all of their data within Snowflake, even if some of that data needs to reside elsewhere. For this reason, Snowflake added Iceberg support as an additional table option earlier this year, and more recently introduced a new Snowflake table type called Iceberg Tables.

Getting Started with Apache Iceberg

There are many good resources within the Apache Iceberg community to help you learn more about the project and get involved in open source work.

The Iceberg Getting Started Guide provides examples for getting started with pure open source Iceberg and Apache Spark.

You can join and take part in Iceberg's active community, for example through the public Slack channel.

To make a change to Iceberg or propose a new idea, follow the Contribution Guide and open a pull request. The community reviews these requests regularly.

Snowflake users can get started with Snowflake's Iceberg support in private preview now. Contact your Snowflake account team to learn about the features or to sign up.

Iceberg Tables: You can try Snowflake's new table type, which is based entirely on Iceberg and Parquet in external storage but offers performance and benefits similar to those of Snowflake tables.

External tables for Iceberg: Snowflake external tables make it easy to connect from Snowflake to existing Iceberg tables.





