
How Yelp modernized its data infrastructure with a streaming lakehouse on AWS


This is a guest post by Umesh Dangat, Senior Principal Engineer for Distributed Services and Systems at Yelp, and Toby Cole, Principal Engineer for Data Processing at Yelp, in partnership with AWS.

Yelp processes huge amounts of user data every day: over 300 million business reviews, 100,000 photo uploads, and countless check-ins. Maintaining sub-minute data freshness at this volume presented a significant challenge for our Data Processing team. Our homegrown data pipeline, built in 2015 using then-modern streaming technologies, scaled effectively for many years. As our business and data needs evolved, we began to encounter new challenges in managing observability and governance across an increasingly complex data ecosystem, prompting the need for a more modern approach. This affected our outage incidents, making it harder both to assess impact and to restore service. At the same time, our streaming framework strained under using Kafka for both data streaming and permanent data storage. In addition, our connectors to analytical data stores experienced latencies exceeding 18 hours.

This came to a head when our efforts to comply with General Data Protection Regulation (GDPR) requirements revealed gaps in our infrastructure that would require us to clean up our data, while simultaneously maintaining operational reliability and reducing data processing times. Something had to change.

In this post, we share how we modernized our data infrastructure by embracing a streaming lakehouse architecture, achieving real-time processing capabilities at a fraction of the cost while reducing operational complexity. With this modernization effort, we reduced analytics data latencies from 18 hours to mere minutes, while also removing the need to use Kafka as permanent storage for our change log streams.

The problem: Why we needed change

We began this transformation by initiating a migration from self-managed Apache Kafka to Amazon Managed Streaming for Apache Kafka (Amazon MSK), which significantly reduced our operational overhead and enhanced security. Amazon MSK's express brokers also provided better elasticity for our Apache Kafka clusters. While these improvements were a promising start, we recognized the need for a more fundamental architectural change.

Legacy architecture pain points

Let's examine the specific challenges and limitations of our previous architecture that prompted us to seek a modern solution.

The following diagram depicts Yelp's original data architecture.

Kafka topics proliferated throughout our infrastructure, creating long processing chains. As a result, every hop added latency, operational overhead, and storage costs. The system's reliance on Kafka for both ingestion and storage created a fundamental bottleneck: Kafka's architecture, optimized for high-throughput messaging, wasn't designed for long-term storage or for handling complex query patterns.

Another challenge was our custom "Yelp CDC" format, a proprietary change data capture language that was powerful and tailored to our needs. However, as our team grew and our use cases expanded, it introduced complexity and a steeper learning curve for new engineers. It also made integrations with off-the-shelf systems more complicated and maintenance intensive.

The cost and latency trade-off

The classic trade-off between real-time processing and cost efficiency had us caught in an expensive bind. Real-time streaming systems demand significant resources to maintain state inside compute engines like Apache Flink, keep multiple copies of data across Kafka clusters, and run always-on processing jobs. Our infrastructure costs were rising, driven largely by:

  • Long Kafka chains: Data often traversed 4-5 Kafka topics before reaching its destination, and each topic was replicated for reliability
  • Duplicate data storage: The same data existed in multiple formats across different systems: raw in Kafka, processed in intermediate topics, and final forms in data warehouses and in Flink RocksDB for join-like use cases
  • Complex custom tooling maintenance: The proprietary nature of our tools meant engineering resources were focused on maintenance rather than building new capabilities

Meanwhile, our business requirements became more demanding. Teams at Yelp needed faster insights, near-real-time results, and the ability to run complex historical analyses quickly. This pushed us to shape our new architecture to improve stream discovery and metadata visibility, provide more flexible transformation tooling, and simplify operational workflows with faster recovery times.

Understanding the streamhouse concept

To understand how we solved our data infrastructure challenges, it's important to first grasp the concept of a streamhouse and how it differs from traditional architectures.

Evolution of data architecture

To understand why a streaming lakehouse, or streamhouse, was the answer to our challenges, it's helpful to trace the evolution of data architectures. The journey from data warehouses to modern streaming systems reveals why each generation solved certain problems while creating new ones.

Data warehouses like Amazon Redshift and Snowflake brought structure and reliability to analytics, but their batch-oriented nature meant accepting hours or days of latency. Data lakes emerged to handle the volume and variety of big data, using low-cost object storage like Amazon S3, but often became "data swamps" without proper governance. The lakehouse architecture, pioneered by technologies like Apache Iceberg and Delta Lake, promised to combine the best of both: the structure of warehouses with the flexibility and economics of lakes.

But even lakehouses were designed with batch processing in mind. While they added streaming capabilities, these were often bolted on rather than fundamental to the architecture. What we needed was something different: a reimagining that treated streaming as a first-class citizen while maintaining lakehouse economics.

What makes a streamhouse different

A streamhouse, as we define it, is "a stream processing framework with a storage layer that leverages a table format, making intermediate streaming data directly queryable." This seemingly simple definition represents a fundamental shift in how we think about data processing.

Traditional streaming systems maintain dynamic tables, much like materialized views in databases, but these aren't directly queryable. You can only consume them as streams, limiting their utility for ad hoc analysis or debugging. Lakehouses, conversely, excel at queries but struggle with low-latency updates and complex streaming operations like out-of-order event handling or partial updates.

The streamhouse bridges this gap by:

  • Treating batch as a special case of streaming, rather than a separate paradigm
  • Making data, including intermediate processing results, queryable via SQL
  • Providing streaming-native features like database change data capture (CDC) and temporal joins (see the sketch after this list)
  • Leveraging cost-effective object storage while maintaining minute-level latencies
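
To make this concrete, here is a minimal Flink SQL sketch of the kind of query a streamhouse enables. The table and column names are hypothetical, and it assumes review_events carries a processing-time attribute (proc_time) and businesses is a primary key table in the streamhouse catalog:

    -- Enrich a stream of review events with the latest business record via a
    -- temporal (lookup) join, rather than materializing the join state in Kafka.
    SELECT
        r.review_id,
        r.business_id,
        b.name,
        b.city
    FROM review_events AS r
    LEFT JOIN businesses FOR SYSTEM_TIME AS OF r.proc_time AS b
        ON r.business_id = b.business_id;

Because the same businesses table sits on object storage in a table format, it also remains queryable in batch mode with an ordinary SELECT, which is what makes intermediate results directly inspectable.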

Core capabilities we needed

Our requirements for a streaming lakehouse were shaped by years of operating at scale:

Real-time processing with minute-level latency: While sub-second latency wasn't necessary for most use cases, our previous hours-long delays weren't acceptable. The sweet spot was processing latencies measured in minutes: fast enough for real-time decision-making but relaxed enough to leverage cost-effective storage.

Efficient CDC handling: With numerous MySQL databases powering our applications, the ability to efficiently capture and process database changes was essential. The solution needed to handle both initial snapshots and ongoing changes seamlessly, without manual intervention or downtime.

Cost-effective scaling: The architecture had to break the linear relationship between data volume and cost. This meant leveraging tiered storage, with hot data on fast storage and cold data on low-cost object storage, all while maintaining query performance.

Built-in data management: Schema evolution, data lineage, time travel queries, and data quality controls needed to be first-class features, not afterthoughts. Our experience maintaining our custom Schematizer taught us that these capabilities were essential for operating at scale.

The solution architecture

Our modernized data infrastructure combines several key technologies into a cohesive streamhouse architecture that addresses our core requirements while maintaining operational efficiency.

Our technology stack selection

We carefully selected and integrated several proven technologies to build our streamhouse solution. The following diagram depicts Yelp's new data architecture.

After extensive evaluation, we assembled a modern streaming lakehouse stack, the streamhouse, built on proven open source technologies:

Amazon MSK continues to deliver existing streams from source applications and services, just as before.

Apache Flink on Amazon EKS serves as our compute engine, a natural choice given our existing expertise and investment in Flink-based processing. Its powerful stream processing capabilities, exactly-once semantics, and mature framework made it ideal for the computational layer.

Apache Paimon emerged as the key innovation, providing the streaming lakehouse storage layer. Born from the Flink community's FLIP-188 proposal for built-in dynamic table storage, Paimon was designed from the ground up for streaming workloads. Its LSM-tree-based architecture provided the high-speed ingestion capabilities we needed.

Amazon S3 serves as our streamhouse storage layer, offering highly scalable capacity at a fraction of the cost. The shift from compute-coupled storage (Kafka brokers) to object storage represented a fundamental architectural change that unlocked significant cost savings.

Flink CDC connectors replaced our custom CDC implementations, providing battle-tested integrations with databases like MySQL. These connectors handle the complexity of initial snapshots, incremental updates, and schema changes automatically.
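
As a rough illustration of how such a connector is declared in Flink SQL (a sketch with placeholder hostnames, credentials, and table names, assuming the flink-connector-mysql-cdc artifact is available on the job classpath):

    -- A MySQL CDC source: the connector takes an initial snapshot of the table
    -- and then switches to reading the binlog for ongoing changes.
    CREATE TABLE business_cdc (
        business_id BIGINT,
        name        STRING,
        city        STRING,
        PRIMARY KEY (business_id) NOT ENFORCED
    ) WITH (
        'connector'     = 'mysql-cdc',
        'hostname'      = 'mysql.example.internal',
        'port'          = '3306',
        'username'      = 'cdc_reader',
        'password'      = '******',
        'database-name' = 'yelp_app',
        'table-name'    = 'business'
    );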

Architectural transformation

The transformation from our legacy architecture to the streamhouse model involved three key architectural shifts:

1. Decoupling ingestion from storage

In our old world, Kafka handled both data ingestion and storage, creating an expensive coupling. Every byte ingested had to be stored on Kafka brokers with replication for reliability. Our new architecture separated these concerns: Flink CDC handles ingestion by writing directly to Paimon tables backed by S3. This separation reduced our storage costs by over 80% and improved reliability through the 11 nines of durability of S3.
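
A minimal sketch of this pattern in Flink SQL, reusing the hypothetical business_cdc source from the earlier example and a placeholder S3 bucket:

    -- Register a Paimon catalog whose warehouse lives on S3.
    CREATE CATALOG streamhouse WITH (
        'type'      = 'paimon',
        'warehouse' = 's3://example-bucket/warehouse'
    );

    CREATE DATABASE IF NOT EXISTS streamhouse.biz;

    -- Create the target table and continuously mirror the CDC stream into it;
    -- no intermediate Kafka topic is needed for storage.
    CREATE TABLE IF NOT EXISTS streamhouse.biz.business (
        business_id BIGINT,
        name        STRING,
        city        STRING,
        PRIMARY KEY (business_id) NOT ENFORCED
    );

    INSERT INTO streamhouse.biz.business
    SELECT business_id, name, city FROM business_cdc;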

2. Unified data format

The migration from our proprietary CDC format to the industry-standard Debezium format was more than a technical change. It reflected a broader move toward community-supported standards. We built a Data Format Converter that bridged the gap, allowing legacy streams to continue functioning while new streams leveraged standard formats. This approach facilitated backward compatibility while paving the way for future simplification.

3. Streamhouse tables

Perhaps the most radical change was replacing some of our Kafka topics with Paimon tables. These weren't just storage locations; they were dynamic, versioned, queryable entities that supported:

  • Time travel queries within the table's snapshot retention window (see the sketch after this list)
  • Automatic schema evolution without downtime
  • SQL-based access for both streaming and batch workloads
  • Built-in compaction and optimization
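
As an illustration (table name and snapshot ID are hypothetical), a time travel read against a Paimon table can be expressed with scan options in Flink SQL batch mode:

    SET 'execution.runtime-mode' = 'batch';

    -- Read the table as of an earlier snapshot ID...
    SELECT * FROM streamhouse.biz.business /*+ OPTIONS('scan.snapshot-id' = '42') */;

    -- ...or as of a point in time (epoch milliseconds), within the retention window.
    SELECT * FROM streamhouse.biz.business
        /*+ OPTIONS('scan.timestamp-millis' = '1735689600000') */;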

Key design decisions

Several key design decisions shaped our implementation:

SQL as the primary interface: Rather than requiring developers to write Java or Scala code for every transformation, SQL became our lingua franca. This democratized access to streaming data, allowing analysts and data scientists to work with real-time data using familiar tools.
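
For example, a continuously updating aggregate that previously needed a bespoke Flink job can be expressed as a single statement (names are hypothetical; the sink is assumed to be a Paimon primary key table so the updating result can be stored):

    -- A streaming job defined entirely in SQL: maintain per-business review counts.
    INSERT INTO business_review_counts
    SELECT
        business_id,
        COUNT(*) AS review_count
    FROM review_events
    GROUP BY business_id;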

Separation of compute and storage: By decoupling these layers, we can scale them independently. A spike in processing needs no longer means provisioning more storage, and historical data can be kept indefinitely without impacting compute costs.

Embracing open source standards: The shift from home-grown formats and tools to community-supported projects reduced our maintenance burden and accelerated feature development. When issues arose, our engineers could leverage community knowledge rather than debugging in isolation.

Implementation journey

Our transition to the new streamhouse architecture followed a carefully planned path, encompassing prototype development, phased migration, and systematic validation of each component.

Migration strategy

Our migration to the streamhouse architecture required careful planning and execution. The strategy had to balance the need for transformation with the reality of maintaining critical production systems.

1. Prototype development

Our journey began with building foundational components:

  • Pure Java client library: Removing Scala dependencies was crucial for broader adoption. Our new library removed reliance on Yelp-specific configurations, allowing it to run in many environments.
  • Data Format Converter: This bridge component translated between our proprietary CDC format and the standard Debezium format, ensuring existing consumers could continue working during the migration.
  • Paimon ingestor: A Flink job that ingests data from Kafka sources into Paimon tables, handling schema evolution automatically (a simplified sketch follows this list).
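
A simplified sketch of that ingestion pattern in Flink SQL, with placeholder topic, broker, and table names, assuming the upstream topic already carries Debezium-formatted JSON:

    -- Read standardized change events from Kafka...
    CREATE TEMPORARY TABLE user_changes_kafka (
        user_id BIGINT,
        email   STRING
    ) WITH (
        'connector'                    = 'kafka',
        'topic'                        = 'mysql.yelp_app.user',
        'properties.bootstrap.servers' = 'msk-broker-1:9092',
        'scan.startup.mode'            = 'earliest-offset',
        'format'                       = 'debezium-json'
    );

    -- ...and continuously write them into a Paimon table backed by S3.
    INSERT INTO streamhouse.biz.user_changes
    SELECT user_id, email FROM user_changes_kafka;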

2. Phased rollout approach

Rather than attempting a "big bang" migration, we adopted a per-use case approach, moving a vertical slice of data rather than the entire system at once. Our phased rollout followed these steps:

  • Select a representative, real-world use case that provides broad coverage of the existing feature set.
    • In our case, this included data sourced from both databases and event streams, with writes going to Cassandra and Nrtsearch
  • Re-implement the use case on the new stack in a development environment using sample data to test the logic
  • Shadow-launch the new stack in production to test it at scale
    • This was a critical step for us, as we had to iterate through various configuration tweaks before the system could reliably sustain our production traffic.
  • Verify the new production deployment against the legacy system's output
  • Switch live traffic to the new system only after both the Yelp Platform team and data owners are confident in its performance and reliability
  • Decommission the legacy system for that use case once the migration is complete

This phased approach allowed our team to build confidence, identify issues early, and refine our processes before touching business-critical systems in production.

Technical challenges we overcame

The migration surfaced several technical challenges that required innovative solutions:

System integration: We developed comprehensive monitoring to track end-to-end latencies and built automated alerting to detect any degradation in performance.

Performance tuning: Initial write performance to Paimon tables was suboptimal for our higher-throughput streams. After careful analysis, we identified that Paimon was re-reading manifest files from S3 on every commit. To alleviate this, we enabled Paimon's sink writer coordinator cache setting, which is disabled by default. This massively reduced the number of S3 calls during commits. We also found that write parallelism in Paimon is limited by the number of "buckets" within a partition. Selecting a bucket count that lets you scale horizontally without spreading your data too thinly is important for balancing write performance against query performance.
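
To illustrate the bucket trade-off, here is a hedged Paimon table definition (names and values are illustrative, not our production settings); the bucket count is fixed in the table options and caps write parallelism per partition:

    CREATE TABLE review_events_paimon (
        review_id   BIGINT,
        business_id BIGINT,
        created_at  TIMESTAMP(3),
        dt          STRING,
        PRIMARY KEY (dt, review_id) NOT ENFORCED
    ) PARTITIONED BY (dt) WITH (
        -- At most 8 parallel writers per partition; each bucket is also a unit of
        -- compaction and scanning, so more buckets means smaller files to read.
        'bucket' = '8'
    );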

Data validation: Validating data consistency between our legacy Yelp CDC streams and the new Debezium-based format presented notable challenges. During the parallel run phase, we implemented comprehensive validation frameworks to ensure the Data Format Converter accurately transformed messages while maintaining data integrity, ordering guarantees, and schema compatibility across both systems.

Data migration complexity: For consistency, we developed custom tooling to verify ordering guarantees and ran the old and new systems in parallel. We chose Spark as the framework for implementing our validations because every data source and sink in our framework has mature connectors, and Spark is a well-supported system at Yelp.
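
A rough sketch of the style of check such a framework can run, expressed in Spark SQL (table names and the checksum choice are illustrative): compare per-partition row counts and a lightweight checksum between the legacy output and the new table, and surface only the mismatches.

    SELECT
        COALESCE(l.dt, n.dt) AS dt,
        l.row_count          AS legacy_rows,
        n.row_count          AS new_rows
    FROM (
        SELECT dt, COUNT(*) AS row_count, SUM(hash(review_id)) AS checksum
        FROM legacy_reviews GROUP BY dt
    ) l
    FULL OUTER JOIN (
        SELECT dt, COUNT(*) AS row_count, SUM(hash(review_id)) AS checksum
        FROM streamhouse_reviews GROUP BY dt
    ) n ON l.dt = n.dt
    WHERE l.dt IS NULL OR n.dt IS NULL
       OR l.row_count <> n.row_count
       OR l.checksum  <> n.checksum;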

Practical wins we achieved

Our implementation delivered transformative results:

Simplified streaming stack: By replacing multiple custom components with standardized tools, we shed years of technical debt in a single migration. We reduced our complexity and thereby simplified our entire streaming architecture, leading to higher reliability and less maintenance overhead. Our Schematizer, encryption layer, and custom CDC format were all replaced by built-in features from Paimon and standard Kafka, along with IAM controls across S3 and MSK.

Fine-grained access management: Moving our analytical use cases to read through Iceberg unlocked a major win for us: the ability to enable AWS Lake Formation on our data lake. Previously, our access management relied on large, complex S3 bucket policy documents that were approaching their size limits. By moving to Lake Formation, we could build an access request lifecycle into our in-house Access Hub to automate access granting and revocation.

Built-in data management features: Capabilities that would have required months of custom development came out of the box, such as automatic schema evolution, time travel queries, and incremental snapshots for efficient processing.

Potential for reduced operational costs: We anticipate that transitioning from Kafka storage to S3 in a streamhouse architecture will significantly reduce storage costs. Avoiding long Kafka chains will also simplify data pipelines and reduce compute costs.

Enhanced troubleshooting capabilities: The streamhouse architecture promises built-in observability features that will make debugging easier. Rather than having to manually look through event streams for problematic records, which can be time-consuming and complicated for multi-stream pipelines, engineers can now query live data directly from tables using standard SQL.
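
For example, instead of replaying a Kafka topic to hunt for a bad record, an engineer can run an ad hoc query like the following against an intermediate table (names are illustrative):

    SET 'execution.runtime-mode' = 'batch';

    -- Inspect the most recent events for one business without touching Kafka.
    SELECT *
    FROM review_events_paimon
    WHERE business_id = 12345
    ORDER BY created_at DESC
    LIMIT 20;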

Lessons learned and best practices

Throughout this transformation, we gained valuable insights about both technical implementation and organizational change management that may benefit others undertaking similar modernization efforts.

Technical insights

Our journey revealed several important technical lessons:

Battle-tested open source wins: Choosing Apache Paimon and Flink CDC over custom solutions proved wise. The community support, continuous improvements, and shared knowledge base accelerated our development and reduced risk.

SQL interfaces democratize access: Making streaming data accessible via SQL transformed who can work with real-time data. Engineers and analysts familiar with SQL can now understand how streaming pipelines work. The barrier to entry has been significantly lowered, as engineers no longer need to know Flink-specific APIs to create a streaming application.

Separation of storage and compute is fundamental: This architectural principle unlocked cost savings and operational flexibility that wouldn't have been possible otherwise. Our teams can now optimize storage and compute independently based on their specific needs.

Organizational learnings

The human side of the transformation was equally important:

Phased migration reduces risk: Our gradual approach allowed teams to build confidence and expertise while maintaining business continuity. Each successful phase created momentum for the next. Building trust in newer systems helps gain velocity in the later stages of a migration.

Backward compatibility enables progress: By maintaining compatibility layers, our teams could migrate at their own pace without forcing synchronized changes across the organization.

Investment in learning pays dividends: Giving our teams space to learn new technologies like Paimon and streaming SQL had some opportunity cost, but it paid off through increased productivity and reduced operational burden.

Conclusion

Our transformation to a streaming lakehouse architecture (streamhouse) has revolutionized Yelp's data infrastructure, delivering impressive results across multiple dimensions. By implementing Apache Paimon with AWS services like Amazon S3 and Amazon MSK, we reduced our analytics data latencies from 18 hours to just minutes while cutting storage costs by 80%. The migration also simplified our architecture by replacing multiple custom components with standardized tools, significantly reducing maintenance overhead and improving reliability.

Key achievements include the successful implementation of real-time processing capabilities, streamlined CDC handling, and enhanced data management features like automatic schema evolution and time travel queries. The shift to SQL-based interfaces has democratized access to streaming data, while the separation of compute and storage has given us unprecedented flexibility in resource optimization. These improvements have transformed not just our technology stack, but also how our teams work with data.

For organizations facing similar challenges with data processing latency, operational costs, and infrastructure complexity, we encourage you to explore the streamhouse approach. Start by evaluating your current architecture against modern streaming solutions, particularly those leveraging cloud services and open source technologies like Apache Paimon. Be sure to follow AWS security best practices when implementing your solution. Visit the Apache Paimon website or the AWS documentation to learn more about implementing these solutions in your environment.


About the authors

Umesh Dangat


Umesh is a Senior Principal Engineer for Distributed Services and Systems at Yelp, where he architects and leads the evolution of Yelp's high-performance distributed systems spanning streaming, storage, and real-time data retrieval. He oversees the core infrastructure powering Yelp's search, ranking, and data platforms, driving engineering efficiency by improving platform scalability, reliability, and alignment with business needs. Umesh is also an active open source contributor to projects such as Elasticsearch, NrtSearch, Apache Paimon, and Apache Flink CDC.

Toby Cole


Toby is a Principal Engineer for Data Processing at Yelp and has been working with distributed systems for the past 18 years. He has led groundbreaking efforts to containerize datastores like Cassandra and is currently building Yelp's next generation of streaming infrastructure. You can often find him petting cats and taking apart electrical devices for no apparent reason.

Ali Alemi


Ali is a Principal Streaming Solutions Architect at AWS. Ali advises AWS customers on architectural best practices and helps them design real-time analytics data systems that are reliable, secure, efficient, and cost-effective. Prior to joining AWS, Ali supported several public sector customers and AWS consulting partners in their application modernization journeys and migrations to the cloud.

Bryan Spaulding


Bryan is a Senior Solutions Architect at AWS. Bryan works with AdTech customers to advise on their technology strategy, apply best-practice AWS architecture patterns, and champion their interests within AWS. Prior to joining AWS, Bryan served in technology leadership roles at various Media & Entertainment and EdTech startups, where he was also an AWS customer himself, and early in his career he was a consultant at several professional services firms.
