Goldsky is a real-time data platform for the Web3 ecosystem. Our platform provides blockchain indexing, subgraphs, and data streaming pipelines that help developers build powerful dApps (decentralized applications). The trouble is that most app developers struggle with the complex technologies that power these streaming data pipelines.
Goldsky strives to democratize developer access to streaming data by providing simple API-driven tools that bring real-time data into applications. Of course, a simplified streaming data engine is fundamental to this mission, which is where Redpanda comes in.
In this post, I’ll give you a peek under the hood of Goldsky’s “streaming-first” architecture, and reveal why we chose Redpanda as the foundation for our data infrastructure—helping us make streaming data accessible for app developers.
Goldsky: a real-time data platform for blockchain
Goldsky’s real-time data platform supports more than 10 of the most popular blockchains as sources, as well as ad-hoc user-defined data sources; and we’re used by some of the biggest protocols and networks to power analytics and APIs.
For those unfamiliar with the Web3 world: our customers essentially want to get specific data out of different blockchains, which is not easy to do. Typically, you have to go through various remote procedure call (RPC) providers and APIs, which is a hard, painful, and slow process, essentially analogous to old-school API crawling.
Goldsky ingests data directly from blockchains, with embedded tools our customers use to build real-time data pipelines that power their apps. We push data directly into our customers’ data stores, allowing them to own the data and configure it for their use cases.
As an example, one of our customers is a popular betting website that tracks real-time betting trends. The bets and auctions it tracks are happening on the blockchain layer to ensure transparency and fairness. We ingest the data from the blockchain, decode it to a useful form, transform it in real time, and sync it directly to the customer’s database. This is then used for its API layer to power things like betting order books and statuses.
Let’s take a closer look at the streaming data architecture that powers Goldsky’s platform.
Goldsky’s streaming-first architecture
There has been a paradigm shift in the industry recently where people are realizing that you can use data platform technology that was previously used by internal analytics teams to power customer-facing features. The sort of data pipelines that previously only supported reporting and dashboards are now supporting web application functionality.
The trouble is that most app developers aren’t deeply familiar with the underlying technology stacks for data pipelines, which typically include complex technologies such as Apache Kafka® and Apache Flink®. And this is where platforms like Goldsky come in, to help democratize developer access to streaming data by providing simple API-driven tools that bring real-time data into applications. These “data-pipelines-as-a-service” allow app developers who don’t know the first thing about managing Kafka to have streaming data at their fingertips.
Streaming data is, of course, fundamental to our stack. In fact, we have a “streaming-first” architecture powered by Redpanda (a simpler, faster, and more durable Kafka-compatible platform) as our primary streaming data solution, with Flink used for stream processing. We use Redpanda’s managed service, Redpanda Cloud, to ensure minimal overhead.
Let’s take a walk through the platform architecture, from sources to sinks.
As mentioned, our sources are blockchains, and we ingest data from them in a few different ways. One way is called direct indexing, which is based on a popular Ethereum ETL (extract, transform, load) project. It’s able to connect directly to blockchain nodes and extract raw, low-level data structures like logs and transactions. Second, we support another popular open-source technology called Subgraphs, which makes it easy to write simple TypeScript applications to capture and process smart contract event telemetry.
Raw blockchain data is hard to interpret: you have to deserialize the payloads, and because the logs have data from all smart contracts, you have to write business logic to filter out what is specific to your actual contracts. So, part of the value we add is processing the data into a format that is directly usable by our customers.
Internally, all data sets in the end are available as topics backed by Avro schemas.
The middle of our stack is where things get interesting. We use Flink SQL for our transformation layer. Our customers can write simple SQL transformations to perform filtering or projections; they also can implement complex joins or aggregations. And, the magic is that the customer doesn’t need to know what’s under the hood at all, they just use our APIs or our GUI!
In Goldsky’s architecture, Redpanda is not just a broker, it’s where we store the blockchain data. It’s our source of truth — the primary database and data lake holding tens of terabytes of data. This is enabled through Redpanda’s S3-compatible Tiered Storage, which provides near-infinite data retention without incurring a ton of cloud infrastructure costs.
We like this approach better than the traditional alternative, which is to set retention limits in the broker and then sink the data to S3 to be read later and combined with real-time topics. Redpanda's Tiered Storage simplifies things and eliminates the usual need for additional resources to maintain archival systems.
This is an easy win for us: if our streaming data solution can also be our storage solution, why not?
We have built-in integrations to support different databases and data stores, wherever customers want to sink their data. The customer can self-service set their sink of choice, like PostgreSQL, S3, or Elasticsearch. We see PostgreSQL frequently, and it’s an extremely good sink because it can easily support both transactional and some analytical use cases. It’s easy to integrate with an existing application. In addition, modern managed PostgreSQL providers make it easy to scale databases for large data sets.
As discussed, Redpanda is foundational for our data infrastructure. Here’s how we landed on it.
When our engineering team was first building Goldsky, they knew they needed a modern streaming data platform that had minimal complexity and was cost-effective, but that could still handle high throughput reliably at scale. Some team members had used Confluent Kafka previously, but the pricing model wasn’t going to work for us because you end up paying for hardware you don’t need.
Redpanda stood out as an alternative to Kafka or Confluent because it achieves the same throughput with less hardware and has a much simpler architecture. It’s a single binary with no JVM dependency, a low memory footprint, and comes with schema registry and HTTP proxy already built-in. Plus, the hardware efficiency and Redpanda’s Tiered Storage capability add up to 3-4x cloud infrastructure savings versus Kafka.
What really made Redpanda an essential part of our stack is its durability. Using Redpanda as our source of truth is only possible because of the data safety it provides through its Jepsen-tested and Raft-native architecture, and its cost-efficient Tiered Storage.
We’re also optimistic about Redpanda’s roadmap, which is very focused on developers, including Data Transforms powered by WebAssembly (Wasm) and Apache Iceberg integration. These innovations will complement our architecture well.
For example, in the future, we will be able to simply point a Spark or Flink job to Redpanda’s Tiered Storage with Iceberg. And, with Wasm data transformations, we will be able to filter relevant blockchain data much more quickly by processing on nodes closer to the storage layer.
Goldsky and the future of Web3
When we started working on Goldsky, we thought that data streaming and stream processing could be useful building blocks. We haven’t realized how much we’ll actually rely on them.
Data streaming concepts work extremely well for blockchain data—we’re able to solve challenging problems, such as blockchain reorgs, by modeling them as well-known stream-processing problems like retractions. Building on this further lets us support even more advanced use cases, like enriching on-chain data with off-chain data, reliably calculating Top-N aggregations, and combining data from multiple blockchains together.
- Yaroslav Tkachenko, Principal Software Engineer, Goldsky
For more stories about how Redpanda has helped companies tap into simple, powerful, and cost-efficient data streaming, check out our customers page and browse the Redpanda Blog for examples and tutorials. You can also try Redpanda to see it in action for yourself! If you have questions, ask away in the Redpanda Community on Slack.
Let's keep in touch
Subscribe and never miss another blog post, announcement, or community event. We hate spam and will never sell your contact information.