Thursday, December 19, 2024

How Open Will Snowflake Go at Data Cloud Summit?

Snowflake is holding its Data Cloud Summit 24 conference next week, and the company is expected to make a slew of announcements, which you will be able to find on these Datanami pages. But among the most closely watched questions is how far Snowflake will go in embracing the Apache Iceberg table format and opening itself up to outside query engines. And is it possible that Snowflake could try to “out open” its rival Databricks, whose conference is the following week?

Snowflake has evolved considerably since it burst onto the scene a handful of years ago as a cloud data warehouse. Founded by engineers experienced in developing analytics databases, the company delivered a top-flight data warehouse in the cloud with full separation of compute and storage, which was truly novel at the time. Companies frustrated with Hadoop flocked to Snowflake, where they found a much more welcoming and friendly user experience.

But while Snowflake lacked the technical complexity of Hadoop, it also lacked the openness of Hadoop. That was a tradeoff that many customers were willing to make back in 2018, when frustration with Hadoop was nearing its peak. But customers perhaps aren’t so willing to make that tradeoff in 2024, particularly as the high cost of cloud computing has become an issue with many CFOs.

Snowflake executives initially boasted about the increased revenue that its lock-in created, but the company soon found that customers were genuinely concerned about being locked in to a proprietary format. It took concrete steps to address the lock-in in February 2022, when it announced support for Apache Iceberg, although Snowflake has yet to make Iceberg support generally available.

The battle for lakehouse market dominance is being waged atop the open table formats

Iceberg, of course, is the open table format developed at Netflix to address data correctness concerns when accessing data stored in Parquet files using multiple query engines, including Hive and Presto. The metadata managed by table formats like Iceberg provides ACID transactionality to data interactions, alleviating concerns that queries would return incorrect data.

Iceberg wasn’t the first table format; that honor goes to Apache Hudi, which engineers developed at Uber to manage data in their Hadoop stack. Meanwhile, the folks at Databricks created their own table format called Delta in 2017 and named the data platform it created the data lakehouse.

Two years into the Iceberg experiment, Snowflake has some choices to make. While it allows customers to store data in externally managed Iceberg tables, it doesn’t offer many of the data management capabilities that you would expect from a full-fledged data lakehouse, i.e. things like table partitioning, data compaction, and cleanup. Will the company announce these during Data Cloud Summit this week?

If Snowflake goes all-in on Iceberg, it would help differentiate it from Databricks, which has put its chips on Delta (although it announced some capabilities to support Iceberg and Hudi with its Universal Format unveiled last June). The community appears to be coalescing around Iceberg from a popularity standpoint, which could bolster Snowflake’s position as a top-tier Iceberg-based data lakehouse.

Another question is whether Snowflake will allow external SQL engines to query Iceberg data that it manages. Snowflake’s proprietary SQL engine is highly performant when run on data stored in the original Snowflake table format, and the company has benchmark results to back that up. But Snowflake doesn’t provide a lot of options when it comes to querying data with other engines.

Snowflake offers the Snowpark API, which lets you express queries using Python, Java, and Scala, but this is designed more for data engineering and building machine learning models than for SQL query processing. It also offers an Apache Spark connector that lets you read from and write to Snowflake using Spark 3.2 through 3.4 (it also offers an Apache Kafka connector). But what customers may really want is the ability to run another SQL query engine against their data.

Snowflake returns to San Francisco for Data Cloud Summit 24

One person who will be closely watching Snowflake next week is Justin Borgman, the CEO and founder of Starburst, which does a lot of work developing the Trino query engine that forked from Presto and runs its own lakehouse offering that supports Iceberg and Trino. Borgman notes that many of the first workloads run by Netflix after creating Iceberg used Presto.

“We feel like it kind of resets the playing field,” Borgman says of the impact of Iceberg. “It’s almost like the battlefield moves from this very traditional ‘get data ingested into your proprietary database and then you have your customer locked in,’ to more of a free for all where the data is unlocked, which is undoubtedly best for customers. And then we’ll battle it out at the query engine layer, the execution layer rather than at the storage layer. And I think that’s just a really interesting development.”

Borgman is understandably eager to be able to get into Snowflake’s huge 9,800-strong customer base via Iceberg. He claims benchmark tests show Trino outperforming Snowflake’s query engine on Iceberg tables while being about one-third of the cost. Starburst has a number of large customers, such as Lyft, LinkedIn, and Netflix, using the combination of Trino and Iceberg, he says.

Snowflake could go all-in on Iceberg and open itself up to external query engines, but it could still exercise some control over customer workloads through other means. For instance, it could require that customers access data through its data catalog, Borgman predicts.

“They may try to lock you in on a few peripheral features,” he tells Datanami. “But I think at the end of the day, customers won’t tolerate that. I think Iceberg is undoubtedly in their best interests, and I think that they’re going to just start moving tons of data into Iceberg format from where they have the opportunity to choose a different query engine.”

Snowflake CEO Sridhar Ramaswamy must balance the company’s openness with growth

If Snowflake does go fully open and allows external query engines such as Trino, Presto, Dremio, and even Spark SQL to access data that it manages for customers in Iceberg tables, Snowflake customers likely won’t move all their data to Iceberg at once, says Borgman, who was a 2023 Datanami Person to Watch. They’ll likely move their lowest SLA (service level agreement) queries into Iceberg first, while keeping the more important data in Snowflake’s native format and using Snowflake’s native query engine, which is faster but more expensive.

That sets up an interesting dynamic where Snowflake could potentially be hurting its ability to generate revenues while giving customers what they want, which is more openness. But on the flip side, customers may actually reward Snowflake by moving more data into Iceberg and letting Snowflake manage it for them. That could generate greater revenues for the Bozeman, Montana company, although probably not at the same per-customer rate as if they kept all the data locked into a proprietary format. That’s a rate-of-growth factor that new Snowflake CEO Sridhar Ramaswamy will have to account for.

When Ramaswamy replaced Frank Slootman in February, it was expected that the company would shift some focus to AI, where it was seen as trailing its rival Databricks. The company’s April launch of its Arctic large language models (LLMs) shows the company is able to move quickly on that front, but its core competitive advantage over Databricks remains with SQL-based analytics and data warehousing workloads.

The changing nature of data warehousing presents both challenges and opportunities for Snowflake. “Basically, the data warehouse now is fully decoupled,” Borgman says. “Snowflake talked a lot about that 12 to 15 years ago, whenever they first came out, of separation of storage and compute. But the key was it was always their storage and their compute.”

Whatever Snowflake chooses to do next week, the cloud giants will undoubtedly be watching. So far, they haven’t really picked sides in the open table format war that’s being waged between Iceberg and Delta, with Hudi in a distant third (although the Apache XTable format developed by Hudi-backer Onehouse threatens to make it all moot). If the market solidifies behind Iceberg, AWS, Microsoft Azure, and Google Cloud could try to cut out the middleman by offering their own soup-to-nuts data lakehouses.

Borgman says this feels like a replay of the mid-2010s, when Teradata’s large data warehousing installed base was slowly eaten into by Hadoop, but with one big difference.

“I think you’re going to see a similar type of model play out where customers are motivated to try to reduce their data warehouse costs and using more of this lake model,” says Borgman, who was the CEO of Hadoop software vendor Hadapt when it was acquired by Teradata in 2014. “But I think one thing that’s different this time around is that the engines themselves, like Trino and [Presto], have improved dramatically over what Hive or Impala was back then.”

Related Items:

Snowflake Looks to AI to Bolster Growth

Onehouse Breaks Data Catalog Lock-In with More Openness

Teradata Acquires Revelytix, Hadapt

 
