While real-time data streaming with gigabytes-per-second throughput is the direction in which all our workloads are headed, at LiveRamp we currently still manage a large number of batch processing systems built on Apache Hadoop and Apache Spark™. That’s because the AdTech ecosystem, in which a large portion of our business operates, has traditionally been oriented around encrypted batch files that systems process one record at a time.
However, the industry is changing rapidly. Our customers increasingly need real-time solutions for time-sensitive problems like cart abandonment and ad suppression. They also need to prepare for industry headwinds such as Google Chrome’s phasing out of third-party cookies in 2024, which will drive even greater demand for optimized identifiers.
So, we embarked on a journey to modernize our data infrastructure from batch systems to streaming data systems. We knew it was going to be a gradual process, and we started with a newer use case, our pixel traffic application — a system that helps customers understand users’ web and mobile traffic trends. We also recognized that we would need to select the streaming data platform that would drive our real-time transformation. In this post, I’ll walk you through what we did and how it’s working out for us so far.
Data collaboration platform LiveRamp helps companies build enduring brand and business value with technology solutions for customer intelligence, identity enhancement, cross-screen measurement, media networks, and more.
I first joined LiveRamp 15 years ago as an intern. Today, I am responsible for LiveRamp’s data platform architecture, with particular emphasis on ingestion and activation — in other words, all the data coming into and out of LiveRamp. This includes customer data that is processed, masked, and made available for hundreds of downstream systems.
Part of the value LiveRamp adds is connecting consumer data with durable and privacy-conscious identifiers. There is a lot of complexity in this. You might have offline attributes like names, addresses, and phone numbers, and these can change over time. The same is true for online identifiers like IP addresses and device IDs.
LiveRamp replaces these changing and inconsistent attribute sets with durable, secure and pseudonymized identifiers that are sustainable against evolving regulations and privacy policies.
This meant that the right streaming data platform would not only drive our real-time transformation but also allow us to retain data sovereignty and the privacy-first architecture we pride ourselves on.
Going from batch to streaming with Redpanda
LiveRamp developed its own batch systems, but in our migration to real-time data pipelines, we wanted to enable faster development and more efficient collaboration with partners. This led us to seek a partner with a robust streaming data ecosystem around its APIs.
Redpanda fulfilled this need with a Kafka API-compatible platform offering everything needed to stream data — brokers, HTTP proxy, Schema Registry, Raft consensus, and cluster balancing. And, because Redpanda consumes about one-third of the compute resources of comparable platforms due to its lean design, it’s much more cost-efficient.
In addition to simplicity and cost-efficiency, Redpanda brought other benefits to the table including:
Platform neutrality: Everyone in our industry is trying to minimize data movement, so we need to be able to deploy solutions in all the major clouds to be where our customers are. Redpanda offers platform neutrality and deployment flexibility—we can run it self-managed in VMs or containers on our cloud of choice, or we can use it as a fully managed cloud service with Redpanda Cloud.
Data privacy: Keeping data within our network boundaries is a key requirement, but we also wanted a fully managed experience. Redpanda’s Bring Your Own Cloud (BYOC) model helps us do that. With Redpanda BYOC, LiveRamp’s clusters remain in our own Virtual Private Cloud (VPC), while Redpanda manages the provisioning, monitoring, and maintenance via their secure agent. Data never leaves our VPC.
Performance: Our industry is extremely latency-sensitive, so we wanted a solution with performance at its core. Redpanda is built in C++ using the Seastar framework, with a thread-per-core implementation that wrings optimal performance out of modern hardware. It minimizes thread switching, bypasses the Linux page cache, maximizes parallel processing, and uses direct memory access to make asynchronous disk IO more efficient. Our benchmarking tested Redpanda with two billion messages loaded in 50 minutes at a rate of 750,000 unique messages per second on a six-node, n1-standard-16 cluster. We were able to achieve a cluster-wide read throughput of 2.9 GBps and an average write throughput of 1.2 GBps, well over our requirement of 386 MBps. We were blown away by the results.
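As a back-of-envelope check, the headroom those measurements give us over the 386 MBps requirement works out as follows (treating 1 GBps as 1000 MBps):

```go
package main

import "fmt"

// headroom returns how many times a measured throughput (in GBps) exceeds a
// required throughput (in MBps), using 1 GBps = 1000 MBps.
func headroom(measuredGBps, requiredMBps float64) float64 {
	return measuredGBps * 1000 / requiredMBps
}

func main() {
	// Figures from our six-node n1-standard-16 benchmark vs. the 386 MBps requirement.
	fmt.Printf("read headroom: %.1fx, write headroom: %.1fx\n",
		headroom(2.9, 386), headroom(1.2, 386))
	// prints: read headroom: 7.5x, write headroom: 3.1x
}
```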
Partnership: With our steep data privacy and performance requirements, we needed a provider who was also a close partner. The Redpanda team was willing to work closely with us, and as a result, a lot of our feedback and workarounds made it into the product itself. For example, we built a Terraform wrapper layer to automate the deployment of Redpanda in our environment, and our work was later incorporated by the Redpanda team into its own BYOC deployment architecture.
How it’s going: less complexity, lower costs, happier engineers
We started with a system that collects pixel traffic for mobile applications. When that implementation proved successful, we adopted it for our application monitoring tooling and saw improved reliability. We are now expanding our use of Redpanda across the organization to gradually support all incoming and outgoing data across the more than five hundred different platform integrations we manage.
Redpanda currently sits on the edges of LiveRamp. The producers read from files and put messages on Redpanda, and consumers read the data and make outbound calls. Consumers can be simple Go or Java processes that massage data for specific platforms like Facebook. This way, we store data from the edges and use it where it makes sense. We also push data from Redpanda to our internal warehouse built on SingleStore, where it becomes available for analytical and measurement use cases.
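To make the consumer side concrete, here is a minimal, self-contained sketch of the massage step. The record shape, field names (`device_id`, `event`, `external_id`, `event_name`), and mapping are all hypothetical, not LiveRamp's actual schema; in production the record would arrive from a Redpanda topic via a Kafka client, and the reshaped payload would be POSTed to the target platform's API.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Record mirrors the key/value shape of a message consumed from a topic.
type Record struct {
	Key   string
	Value []byte
}

// massageForPlatform reshapes an inbound pixel event into the payload a
// downstream platform expects. The field names are illustrative only.
func massageForPlatform(r Record) (map[string]string, error) {
	var in map[string]string
	if err := json.Unmarshal(r.Value, &in); err != nil {
		return nil, err
	}
	return map[string]string{
		"external_id": in["device_id"], // platform-specific field mapping
		"event_name":  in["event"],
	}, nil
}

func main() {
	rec := Record{Key: "abc", Value: []byte(`{"device_id":"d-123","event":"page_view"}`)}
	out, err := massageForPlatform(rec)
	if err != nil {
		panic(err)
	}
	fmt.Println(out["external_id"], out["event_name"])
	// prints: d-123 page_view
}
```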
This event-driven design eliminates the complexity of our legacy batch system, which depended on an elaborate architecture for parallel processing. As a result, we’ve boosted engineer productivity and made our simplified codebase easier to maintain. We’ve also significantly lowered our infrastructure costs and reduced our carbon footprint thanks to Redpanda’s hardware-efficient design.
What’s next: a world with Wasm, simpler streaming, and no cookies
If there’s a constant in our industry, it’s change. It’s been widely reported that Chrome is phasing out third-party cookies in 2024. LiveRamp has been preparing for this future for more than five years by helping publishers, marketers, and the ecosystem at large transition to addressable audiences without relying on third-party cookies or mobile identifiers. Redpanda will remain a key partner for safely syncing data as LiveRamp advances its offline reference graph, which has thousands of data sources and over a billion syncs a day.
As streaming data continues to become the heart of our infrastructure, we’re looking to further simplify operations using WebAssembly (Wasm). With Wasm transformations directly in Redpanda, we’ll be able to read data, prepare messages, and make API calls without the “data ping pong.”
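The appeal of running transforms broker-side is that a simple reshaping step no longer requires a round trip out to a separate consumer and back onto a topic. The sketch below is not Redpanda's transform SDK — it is just a self-contained example of the kind of pure, per-record function (here, redacting a hypothetical `email` field) that lends itself to being compiled to Wasm and run where the data already lives.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// redact is the kind of pure, per-record function suited to broker-side
// execution: it takes a record value in and returns the transformed value,
// with no network hop to an external process. The "email" field is
// illustrative only.
func redact(value []byte) ([]byte, error) {
	var m map[string]any
	if err := json.Unmarshal(value, &m); err != nil {
		return nil, err
	}
	if _, ok := m["email"]; ok {
		m["email"] = "REDACTED"
	}
	// json.Marshal emits map keys in sorted order, so output is deterministic.
	return json.Marshal(m)
}

func main() {
	out, err := redact([]byte(`{"email":"a@b.com","id":"1"}`))
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
	// prints: {"email":"REDACTED","id":"1"}
}
```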
In the meantime, we’re excited to continue on our batch-to-streaming journey and help pioneer the real-time evolution of the AdTech industry.
- Abhishek Jain, Enterprise Platform Architect at LiveRamp
Abhishek Jain is an accomplished Enterprise Platform Architect at LiveRamp with a proven track record of designing and implementing robust, scalable, and innovative software solutions. With more than 15 years of experience, he combines deep technical expertise with a keen understanding of business needs and has played a pivotal role in the growth of LiveRamp from a startup to a profitable, global enterprise.