Scaling the predictive analytics pipeline at Seventh Sense with Redpanda

How Seventh Sense evolved its event capture and storage system to a streaming data system with Redpanda.

Erik LaBianca

March 25, 2022


Introduction

When our business at Seventh Sense grew to processing hundreds of millions of records, we looked at switching our event capture and storage systems to a streaming-first model. In this post, I’ll discuss our business needs, the transformation of our architecture, and how we came to choose Redpanda over other options.

What is Seventh Sense?

Seventh Sense is a predictive analytics platform for email marketers that makes it easy to segment and optimize marketing audiences using artificial intelligence. Seventh Sense integrates with popular CRM and marketing automation systems such as HubSpot through their native APIs, gathers customer engagement data, and analyzes it with AI and machine learning to generate data-backed email marketing recommendations and insights.

Our customers range from Fortune 50 companies to small mom-and-pop businesses. Customers find that our system helps them improve their email engagement rates, open rates, and deliverability, and get more value from the effort they put into email content.

Overview of our event-driven architecture

The Seventh Sense platform is built using Scala and TypeScript. Customer engagement events are captured via API integrations and delivered to Redpanda as the system of record. From Redpanda, ClickHouse’s built-in Kafka client consumes the streams and makes them available for analytics queries. In parallel, Apache Spark, which runs our AI and machine learning algorithms, consumes the same engagement data.

In the application itself, we use Redpanda for event streaming via a library called ZIO Kafka. We use Postgres as our control plane storage, and DynamoDB and Elasticsearch for higher-volume query tasks. We’re hoping to push many of these lookups to Redpanda in the future as topic indexing capabilities evolve.

As we continue to evolve our product, Seventh Sense is moving to a model where very little information is stored in a SQL database. We’re focusing more on large quantities of non-transactional data, so we want to store that data in systems designed to handle large loads, using a streaming API instead of writing SQL queries.

In the process, we're using that stream as the system of record. Rather than adding audit and logging later, we build the log first and then use the databases as a materialized view of it. This model has proven effective because it decouples the problem of capturing data at scale from the problem of querying and delivering data to the end user.
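To make the log-first idea concrete, here is a minimal in-memory TypeScript sketch. It is not our production code and it does not talk to Redpanda: `EventLog`, `ContactStatsView`, and the event fields are hypothetical names chosen for illustration. It only shows the shape of the pattern, in which writes go to an append-only log and a queryable view is derived from that log and can be rebuilt from it at any time.

```typescript
// Hypothetical event shape; the field names are illustrative only.
type EngagementEvent = {
  contactId: string;
  kind: "open" | "click";
  at: number; // epoch milliseconds
};

// Stand-in for a topic: append-only, entries are never updated or deleted.
class EventLog {
  private readonly entries: EngagementEvent[] = [];

  // Append an event and return its offset in the log.
  append(entry: EngagementEvent): number {
    this.entries.push(entry);
    return this.entries.length - 1;
  }

  // Read every entry at or after the given offset.
  readFrom(offset: number): readonly EngagementEvent[] {
    return this.entries.slice(offset);
  }
}

type ContactStats = { opens: number; clicks: number; lastSeen: number };

// Stand-in for a database table: a derived view that tracks its own offset.
class ContactStatsView {
  private readonly stats = new Map<string, ContactStats>();
  private nextOffset = 0;

  // Apply any log entries this view has not processed yet.
  catchUp(source: EventLog): void {
    for (const entry of source.readFrom(this.nextOffset)) {
      const current =
        this.stats.get(entry.contactId) ?? { opens: 0, clicks: 0, lastSeen: 0 };
      if (entry.kind === "open") current.opens += 1;
      else current.clicks += 1;
      current.lastSeen = Math.max(current.lastSeen, entry.at);
      this.stats.set(entry.contactId, current);
      this.nextOffset += 1;
    }
  }

  get(contactId: string): ContactStats | undefined {
    return this.stats.get(contactId);
  }
}

const eventLog = new EventLog();
eventLog.append({ contactId: "c1", kind: "open", at: 1000 });
eventLog.append({ contactId: "c1", kind: "click", at: 2000 });
eventLog.append({ contactId: "c2", kind: "open", at: 1500 });

const statsView = new ContactStatsView();
statsView.catchUp(eventLog);

// A second view built later from the same log ends up with the same contents.
const rebuiltView = new ContactStatsView();
rebuiltView.catchUp(eventLog);
```

Because the view holds nothing that is not already in the log, losing or reshaping the view is a rebuild rather than a data-loss event, which is what lets capture and querying scale separately.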

The challenge: scaling to handle hundreds of millions of record updates

Seventh Sense started out as a bootstrapped startup focusing purely on sales use cases, involving only a single mailbox’s worth of data. As the company pivoted into the marketing automation space, we went from processing hundreds of thousands of records to millions, and then hundreds of millions, of records.

After considering several streaming data platforms and databases available at the time, we decided instead to write our own custom engine, which stored and managed ordered files on disk and relied on a database for some locking. That got us through a couple of years. However, at scale, managing a storage system and dealing with distributed systems, locking, and all of the other problems of writing a real storage engine became a challenge. As we looked for ways to handle this high data volume, we realized that we needed a streaming data solution.

Why we chose Redpanda for our streaming data platform

When Redpanda came on the market, we liked that it worked well in a Docker container, was lightweight, had a fully capable local development platform, and performed well in the cloud.

One capability that first attracted us to Redpanda was its full compatibility with the standard Kafka APIs. This allowed us to seamlessly integrate it with our applications that already supported Kafka, and ensured that high-quality client libraries were available.

The operational simplicity of Redpanda was another aspect we liked. Many of the applications that we build use more than just the Kafka API. We liked that Redpanda had a schema registry, a built-in HTTP API, and single-binary deployment. It was simple to put Redpanda on a development machine, put it into a production cluster, spin up multiple nodes, and so on.

Redpanda’s performance capabilities were another aspect that caught our attention. We ran a benchmark connecting Redpanda to our Databricks Spark cluster, and the performance difference between Redpanda and Kafka was orders of magnitude. We were able to completely saturate the Spark side at a hundred-to-one ratio of Spark nodes to Redpanda nodes.

Benefits of Redpanda to our business

There are several notable returns on investment we’ve seen from Redpanda. First, it improved our business by being an effective developer tool. We can send projects to our development team and they're able to use Redpanda right on their laptops. Our devs are able to write integration tests and efficiently learn how to work with streaming data, which is significant because event streaming is a new technology for our team.

Second, since we’re using Redpanda in the cloud, we don’t have to maintain cluster quorum or manage Kubernetes or persistent volumes, which saves us a lot of time and operational headaches.

Third, Redpanda delivers a lot of performance per dollar. This opens up use cases that simply weren’t feasible for us before. For example, we use Redpanda as a buffer: any information that's either coming from an API or going into a system for querying goes through Redpanda. It provides a high-speed write buffer that keeps our API clients operating at maximum performance, while allowing us to spread out the write load on our query systems so that they can be sized for cost-effective performance. We use this pattern to manage our query system performance, load, and cost, and it wouldn’t be cost effective without Redpanda.
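The buffering pattern can be sketched in a few lines of TypeScript. This is an in-memory illustration under stated assumptions, not our implementation: `WriteBuffer` stands in for a topic, `QueryStore` stands in for a downstream query system, and the batch limit of 10 is an arbitrary number. The point is that the write side never waits on the sink, while the sink drains at a rate it is sized for.

```typescript
// Hypothetical record type; stands in for any API payload.
type BufferedRecord = { id: number; payload: string };

// Stand-in for the streaming buffer: appends are cheap and never wait on the sink.
class WriteBuffer {
  private readonly records: BufferedRecord[] = [];
  private readOffset = 0;

  append(record: BufferedRecord): void {
    this.records.push(record);
  }

  // Hand the consumer at most maxBatch unread records and advance its offset.
  poll(maxBatch: number): BufferedRecord[] {
    const batch = this.records.slice(this.readOffset, this.readOffset + maxBatch);
    this.readOffset += batch.length;
    return batch;
  }

  // Number of records written but not yet consumed.
  get lag(): number {
    return this.records.length - this.readOffset;
  }
}

// Stand-in for a query system sized for steady load rather than for peaks.
class QueryStore {
  readonly batchSizes: number[] = [];

  writeBatch(batch: BufferedRecord[]): void {
    this.batchSizes.push(batch.length);
  }
}

const writeBuffer = new WriteBuffer();
const queryStore = new QueryStore();

// A burst of 25 writes arrives at once; the API side is done immediately.
for (let i = 0; i < 25; i++) {
  writeBuffer.append({ id: i, payload: `event-${i}` });
}

// The sink drains at its own pace: at most 10 records per tick (arbitrary limit).
const maxBatchPerTick = 10;
while (writeBuffer.lag > 0) {
  queryStore.writeBatch(writeBuffer.poll(maxBatchPerTick));
}
// queryStore.batchSizes is now [10, 10, 5]
```

In a real deployment the drain loop would be a consumer polling a topic, and the lag would be consumer lag. The sizing trade-off is the same: the sink only needs to keep up with the average write rate, not the peak.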

Conclusion

For developers and engineers looking to build real-time systems, my advice is to think bigger than you currently do. The first round of your streaming system will likely be focused on collecting data as it arrives, subscribing to the head of the log, writing more to the head of the log, and so on.

However, if you instead think of the log as an immutable record, similar to a database redo log that holds the history of all the data that has gone through the system, it unlocks powerful use cases and saves you from burning down your expensive SQL server or other types of index stores. Those stores can focus exclusively on keeping the current state of the system in memory, rather than also maintaining audit logs and change data capture (CDC) logs. Thinking in this way, you open up development capabilities and performance that you wouldn't have otherwise.
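As a small illustration of what the full history buys you, the TypeScript sketch below replays an immutable list of changes to answer two questions that would otherwise need a separate audit table: what the state looked like at an earlier point, and how a single key changed over time. The record type, the sample data, and the function names `stateAsOf` and `auditTrail` are all hypothetical, and the list is an in-memory stand-in for a topic.

```typescript
// Hypothetical change record: each entry sets one contact's preferred send hour.
type SendHourChanged = { contactId: string; sendHour: number };

// Typed as readonly so the compiler rejects in-place mutation of the log.
const changeLog: readonly SendHourChanged[] = [
  { contactId: "c1", sendHour: 9 },
  { contactId: "c2", sendHour: 14 },
  { contactId: "c1", sendHour: 11 },
  { contactId: "c1", sendHour: 16 },
];

// Rebuild in-memory state by replaying the first `upTo` entries of the log.
function stateAsOf(
  entries: readonly SendHourChanged[],
  upTo: number
): Map<string, number> {
  const state = new Map<string, number>();
  for (const entry of entries.slice(0, upTo)) {
    state.set(entry.contactId, entry.sendHour); // later entries overwrite earlier ones
  }
  return state;
}

// The audit trail for one key is just a filter over the same log.
function auditTrail(
  entries: readonly SendHourChanged[],
  contactId: string
): number[] {
  return entries
    .filter((entry) => entry.contactId === contactId)
    .map((entry) => entry.sendHour);
}

const currentState = stateAsOf(changeLog, changeLog.length); // c1 -> 16, c2 -> 14
const earlierState = stateAsOf(changeLog, 2);                // c1 -> 9,  c2 -> 14
const c1Trail = auditTrail(changeLog, "c1");                 // [9, 11, 16]
```

Subscribing only to the head of the log gives you `currentState`. Keeping the history is what makes `earlierState` and `c1Trail` possible without extra bookkeeping in the database.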

To learn more about Redpanda, check out the platform’s documentation or download the binary. I’d also encourage you to join the Redpanda Slack community, where others are building some really cool things with the streaming-first data stack! If you have questions or are interested in learning more about Seventh Sense, please connect with me.
