Take a stroll through streaming data and real-time infrastructure developments

Storage and processing of operational and analytics data have shifted from batch to real-time streaming. In this talk, we look at how far streaming data has come, what we’re working on, and what lies ahead.

Featuring:
Alex Gallego, founder and CEO, Redpanda
Bart Farrell, host of Real-time with Redpanda

What we cover in this talk

  • Rap and intro (bear with us)
  • What's an event?
  • Compute and storage
  • What are batches?
  • What makes streaming a “superset of batch”?
  • Organizing a timeline by buckets
  • ML and streaming data
  • Static and dynamic models
  • Why is the log a popular abstraction?
  • Durability and replayability
  • Data loss
  • Evolution of streaming data
  • Streaming data stakeholders
  • Redpanda and the developer experience
  • The future of streaming data

Full transcript below

Bart Farrell (0:00)
Yo, what time is it? Oh, my grind is live events making sense, batches don't pay the rent, 'cause I'm with Alex, moves fast like Morales, talent weighs heavy, tough like callous, in balance — I'm deep in real-time streaming data.

So how do you know, with streaming open source, high quality? You know what's the best standard, we're going live, time to thrive, because it's real time with Redpanda. Welcome to the first episode of Real-Time with Redpanda.

Now it is my pleasure. As I mentioned: Alex Gallego, founder and CEO of Vectorized, also creator of Redpanda. Alex, if you could turn on your screen right now so we can see you, let me bring you on formally into this very first inaugural episode of Real-Time with Redpanda. Are you there, Alex?

Alexander Gallego (0:33)
Yeah. Hey, Bart, thanks for having me. I always love your intros. They're amazing and super energetic. So glad to be here.

Bart (0:46)
My pleasure. For some of those who might not know you or Redpanda just give us your background and how you got into streaming.

Alexander (0:59)
Yeah, I've been doing this for about 12 years. It's kind of insane that it's one of those problems where you get out of college, and you start working on it. And I still feel there are probably 10 different angles that haven't been solved in real time streaming.

So it's exciting. I think there are a lot of changes in hardware and software primitives that we get to build on. I was the original dev of Redpanda, and I really came to this as an engineer. Prior to this, I was the CTO of Concord, which was sold to Akamai, and then I was at Akamai as a principal engineer.

So I wrote the initial sketch of the storage engine and the replication layer, and it grew from there. This really came about for engineers, by an engineer. So I'm excited to be here and share more with the world about what real time is.

Bart (1:43)
Very good. As someone who's not terribly technical, I’d appreciate it if you could walk us through some of the basics we're talking about in the title of this first episode – from batch to real time.

You'll be drawing for us, which is good. If we start from the beginning, there's this thing about events — but not like a soccer game or a wedding. Start with the basic building blocks: what's an event? I'm gonna monitor the chat. Remember, folks, get your questions out in the open — Alex is here to answer them, so let's get the interaction going.

Alexander (2:56)
So let's talk about the difference between an event and data. Say you have a yellow T-shirt. That's just data. What makes it an event is when you contextualize the data.

And the contextualization could be that I bought it at target.com at 7pm Pacific Time. That's really what makes it an event. There are key properties here — there are multiple types of timestamps in event streaming, and we don't need to get into all of that.
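As a sketch of that distinction (the field names here are illustrative, not from any particular system), an event is just the raw datum plus its context:

```python
from dataclasses import dataclass
from datetime import datetime, timezone

# Raw data: a bare fact with no context.
data = {"item": "yellow t-shirt"}

# An event: the same data contextualized with where and when it happened.
@dataclass
class Event:
    item: str
    source: str          # where it happened, e.g. "target.com"
    timestamp: datetime  # when it happened -- timestamps are critical in streaming

event = Event(
    item=data["item"],
    source="target.com",
    timestamp=datetime(2021, 6, 1, 19, 0, tzinfo=timezone.utc),  # 7pm
)
print(event.source, event.timestamp.isoformat())
```

The data alone can't answer questions like "how many shirts did we sell at 7pm?"; the contextualized event can.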

But timestamps are critical, because event streaming is a set of technologies used to help you deal with never-ending events. Classically speaking, people associate event streaming with time sensitivity in the data — for example, a stock trade. If you submit an order to buy Apple stock today, you expect the trade to execute soon, not next year. There's time sensitivity to that.

And so that's an event. Event streaming is a set of technologies that help you reason about never-ending streams. There are two parts to this: a compute side and a storage side.

When you chain the two together, you have something like fraud detection or Uber Eats — end-to-end use cases that leverage storage and compute. Let’s give examples. On the compute side you have things like Apache Flink and Spark Streaming, and Materialize is another one.

On the storage side you have Redpanda, Pulsar, Kafka — and there's a ton more. For people new to event streaming: you take data, you contextualize it, and then you need to process it. The way you process it is often storage and compute chained together, for use cases like fraud detection. Does that make sense?

Bart (4:56)
Yeah. It’s important people realize these are living, breathing things. Depending on the use case, it's not this sci-fi technology. Most people think Netflix when we mention streaming. But we talk about it in contrast to batches. How do we get into batches?

Alexander (5:40)
Let's say you're a programmer. Often, when you have events, the simplest possible thing is to put those events into a file. Your job as an engineer is to say, “Hey, give me a report of how many yellow T-shirts we sold that day”. So you put those events into a file and you have a cron job, which is a timer.

We'll talk about why this is important and how streaming becomes a superset of batch with this primitive — the timer callback. When the timer fires, you run your program and give it the input file, file.txt.

The program processes the entire file and then probably publishes the results to a database of some sort. This is classical batch processing. It's exactly what the English word means: a batch is just a group of things.
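A minimal sketch of that cron-triggered batch job — the file name and record format are made up for illustration:

```python
import json
import pathlib
import tempfile

# Pretend the day's events were appended to a file, one JSON record per line.
events = [
    {"item": "yellow t-shirt"},
    {"item": "blue jeans"},
    {"item": "yellow t-shirt"},
]
path = pathlib.Path(tempfile.mkdtemp()) / "file.txt"
path.write_text("\n".join(json.dumps(e) for e in events))

def batch_report(input_file):
    """The batch job: read the whole file in one discrete run,
    compute a report, and return it for publishing to a database."""
    sold = 0
    for line in input_file.read_text().splitlines():
        if json.loads(line)["item"] == "yellow t-shirt":
            sold += 1
    return {"yellow_tshirts_sold": sold}

report = batch_report(path)
print(report)
```

One timer tick, one file, one discrete group of things processed — that discreteness is the batch primitive the rest of the discussion builds on.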

Batch processing is when you process a group of things at a time, but very discretized. I think this is the major primitive of how we start to think about real time and explain to the world why streaming is a superset of that.

Feel free to stop me at any point if you or the audience have questions. So let's continue with the example. The simplest thing as an engineer is to map, filter, or do some other kind of batch processing over the data.

In a nutshell, the timer callback is what makes streaming a superset of batch. It's commonly known as punctuation. The idea is that if you're a credit card processing company, Uber Eats, Twitter, or any other use case with real-time interactivity, you want to make progress. And you need punctuation because streams are never-ending.

To make progress, streaming punctuates the data, and punctuation is the discretization of events. It says a lot of things. Let's walk through what it means. So you have a timeline that never ends. As long as your business is running, time isn't going to stop. So how do you make progress when the stream never ends? Well, I mentioned a Cron job.

Cron is the real primitive that streaming borrows — Google's frameworks did this, and Concord had a similar approach. You have a specific timer callback. By the way, in almost all streaming frameworks this is already abstracted away.

You just express your computational pipeline as a directed acyclic graph, and the framework understands punctuation. Things like Materialize do smarter things, but these are the primitives.

So let's say somebody asks me, "What's my budget?" or "How much money am I making?" The way to make sense of the world is to discretize. To do this you discretize time into buckets.

And now it's starting to look like batch. The simplest thing that could work is: I emit an event to the database every hour.

When my users come in and ask, "What's the balance in my bank account?", I can give them the right answer. So let's say this is one of our buckets. This punctuation of your event stream allows you to go back to the last hour and replay it. It goes to your bucket and reprocesses the same events, in the same way as the original events.

In that example it's a one-hour bucket. I'm using one hour as this huge time-window horizon to highlight a point; in a real stream this is often milliseconds — 10 milliseconds, 40 milliseconds, something like that.
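Discretizing a never-ending timeline into buckets can be sketched as a toy tumbling window — this is an illustration of the idea, not any framework's API:

```python
from collections import defaultdict

# (timestamp_in_seconds, amount) pairs arriving on a never-ending stream.
stream = [(30, 5), (3600, 7), (3700, 2), (7300, 1)]

WINDOW = 3600  # one-hour buckets; real systems often punctuate every few ms

buckets = defaultdict(int)
for ts, amount in stream:
    buckets[ts // WINDOW] += amount  # discretize time into bucket indices

# Each completed bucket can be emitted to a database, and later replayed
# by reprocessing the same events that fell into it.
print(dict(buckets))
```

Shrinking `WINDOW` is how you move from "hourly batch" toward millisecond-scale real time without changing the shape of the computation.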

For human visible things real time is anything less than 150 milliseconds. Basically the blink of an eye. It's roughly when people start to detect if something is real time or not. Computers are super fast so 150 milliseconds is actually quite a large batch, and you can reduce the time.

Bart (13:46)
Can we ask a question about upstream sources? If the majority of your data from upstream sources is received in batch files, does it still make sense to implement a streaming platform and migrate from batch ETL to streaming ETL?

Alexander (14:03)
Yeah. Let's just use fraud detection: those systems consume both static files — regularly uploaded files on S3, CSVs, credit reports — and a bunch of other things. If Alex closes his Visa credit card and you see a future transaction, that may happen overnight, or over a very long time-horizon window.

But what people often find useful in practice is the merging and coalescing of multiple data sources — static files merging with more continuous streams of data.

So, about punctuations: what's cool about streaming is that they're arbitrary. You can base them on some count, or on a time window divorced from the wall clock. Related to the number of events: let's say every 1,000 events, I'm going to emit a record to the database.

Then you can start to mix both, because they're both punctuating the time horizon — you're saying “emit an event here and there”. And there's a newer body of literature on data structures that let you merge partial results down into one.

You can make this as granular as you want. The point is that the punctuations of your real-time streams are arbitrary, and the programmer gets to control them. That's merging batches, static files, and real-time streams. Any other questions from the audience?
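A count-based punctuation like the every-1,000-events example can be sketched like this (window shrunk to 3 events for brevity; pure illustration, not a framework API):

```python
def punctuate_by_count(stream, n):
    """Yield a batch of events every n events -- the punctuation is
    defined by a count, divorced from the wall clock."""
    batch = []
    for event in stream:
        batch.append(event)
        if len(batch) == n:
            yield list(batch)  # emit a record downstream
            batch.clear()
    if batch:                  # leftover partial window at shutdown
        yield list(batch)

emitted = list(punctuate_by_count(range(7), 3))
print(emitted)
```

Swap the `len(batch) == n` condition for a time check and you have the hourly bucket from before — the programmer controls the punctuation.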

Bart (16:00)
No. But one general thing is understanding the use case – what’s batch as opposed to streaming. We're talking about making quick decisions, like in fraud detection, or knowing exactly where our food is with Uber Eats.

Perhaps there are cases where it's not so necessary to have the information at that speed. But we're seeing these cases adding up. Is that what we can say is generally driving the need for streaming technologies forward?

Alexander (16:43)
Yeah. I think a classic example is machine learning workflows. Machine learning has a fantastic merge of largely static definition files — like publishing coefficients into a database — and real-time consumption of data. Remember, we said streaming is a superset of batch.

It boils down to the keystone principle of being able to punctuate your time streams and merge data effectively. A machine learning model for credit card fraud detection is a good example. For fraud detection, you'll consume credit scores every month, and you may have like 90 batch jobs that check the balance.

Real time would be your point of sale. You have real-time queries from your point of sale and need to merge these more static definitions into real-time streams. It's useful to think of streaming as a computational framework, rather than necessarily having to process data close to the originating time of the event.

Time is important because it helps humans understand how to slice and dice data. It's easy if I ask how many burritos did you sell last week. Or how many laptops did Apple sell two days ago? It's easy to understand.

But windowing and punctuation are arbitrary. So if you think of it more as a computational framework where you emit events, then it's very flexible for merging your historical and real-time data.

Alexander (19:00)
In fact, we do this with a thing called shadow indexing. Your client application doesn't have to change. You use the same Kafka driver to consume both your historical data and your real-time data.

The programmatic interface exposed to the programmer is much simpler. Here's why. But before we get into that, I want to explain one thing. Typically this is drawn as a single timeline, but in reality it becomes multiple streams of data. At some point you have to define your data modeling technique.

It’s just an additional tool to help developers reason about their data. Let's talk about what a static model and a dynamic model look like in practice.

Let's say you have this static data that becomes independent streams. So you would have a credit score stream and a balance stream. Then you would have your point-of-sale stream.

They're all published into Redpanda. But this is really an architectural pattern — you can replace Redpanda with other implementations. We think we have a pretty good one, but the modeling technique is the salient point.

They become streams, which means the application interacting with these data streams doesn't know, and shouldn't care, whether the data is real time or not. And this is the architectural advantage behind some people moving to the log as an abstraction mechanism for their service communication.

So assume you have some data sources and some services. In the absence of a system like Redpanda — and you can substitute anything for it — each service, assuming all services have to consume from the same three data sources, communicates with all three directly. And every time you add another service, that service will also communicate with all three.

Alexander (21:49)
Compare that picture with what happens when you introduce a log, and you abstract time and the interface into a common interface — the log. You reduce the connection complexity from n squared to linear, because services no longer connect to every data source.

They all connect to Redpanda, as opposed to every service consuming from every data source. So the service consuming static and dynamic data doesn't really care whether the data is static, or whether it arrives continuously, or hourly, or every Monday. It doesn't matter.
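The n-squared-versus-linear arithmetic can be checked directly — a pure illustration of the connection counts, not code that talks to any broker:

```python
def connections_without_log(services, sources):
    # Every service talks to every data source directly: services x sources links.
    return services * sources

def connections_with_log(services, sources):
    # Every service and every source connects once, to the log: services + sources links.
    return services + sources

for n in (3, 10, 50):
    print(f"{n} services, {n} sources: "
          f"direct={connections_without_log(n, n)}, "
          f"via log={connections_with_log(n, n)}")
```

With 3 services and 3 sources the difference is small (9 vs 6), but at 50 of each it's 2,500 direct links versus 100 — which is why the log pays off as systems grow.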

The point is that you define a contract on your services, saying “I'm gonna consume from this API”. This idea of the data mesh effectively gives you things like auditing, access control, and load balancing. We can talk about which properties of the log are useful in a second.

So you're moving from something like this to something like this. It's much simpler and much easier to reason about. You're assuming this broker is resilient, it has partitions, and it has replayability.

It has auditing, it has ACLs. The most important thing is that it gives you a modeling technique as a programmer. Maybe I went too deep into that one.

Bart (23:41)
I don't think you did. But you did mention replayability, the concept of durability as well, could you touch on the issue of data loss? We're talking about streaming data, so what happens if I lose my data? Can I recover it?

Alexander (24:03)
Yes. Good question. People are migrating to the log because there's this implicit understanding by the programmer of the benefits it gives them — but those benefits aren't often spelled out.

Let's talk about why a programmer would care about this. You're not a distributed systems engineer, you're an application programmer. You care about your app sign up. So specifically in code you care about things like circuit breaking.

What circuit breaking means is: if I can't connect to this broker, fail fast on the client side — and Kafka has some mechanisms for this. You also care about backoffs, durability, access control, auditing, high availability, and data loss. So why does an application developer want to use the log as an abstraction in their application? A log gives you often-understated concepts: auditing and ACLs.

Access control lists say only these services are allowed to consume this data. And when we extend the log with WebAssembly, you’re able to go even further: this service is allowed to consume this data in this particular shape.

Generally speaking, for most log implementations, access control is a really big property, and a ton of them offer it. But I think what writing events to the log really gives you is auditability.

Your current state of the world is the left fold of all of the previous events that happened in your application. A left fold is just a functional programming concept: if you start from zero and consume every event, you should end up in a predictable, deterministic state.
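That left fold is literally a fold/reduce over the event log. A toy bank-balance version of the idea (event shapes are made up for illustration):

```python
from functools import reduce

# The log: an ordered, immutable sequence of events.
log = [
    {"type": "deposit",  "amount": 100},
    {"type": "withdraw", "amount": 30},
    {"type": "deposit",  "amount": 5},
]

def apply(state, event):
    """One deterministic state transition per event."""
    if event["type"] == "deposit":
        return state + event["amount"]
    return state - event["amount"]

# Start from zero and consume every event: the current state is the left fold.
balance = reduce(apply, log, 0)
print(balance)  # replaying the same log always yields the same state
```

Because the log is immutable and the transition function is deterministic, replaying from zero always reconstructs the same state — that is what makes "how did we get here?" answerable.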

That’s so powerful for debugging: how did we get here? Let's talk about that abstraction for a second and why people care about being able to reproduce state. The concept of your state being a left fold is commonly known as a command database.

If you build compilers or databases, at some point you realize most problems end up looking like a database. I think of Salesforce as a relatively sophisticated control-plane-level database, and so on.

The log gives you this abstraction, where you think about your application as a set of commands you react to. But you can reconstruct it. You can understand every single transition of your state up until that point in time. That's what auditability gives you.

If you add some tracing metadata on the headers, you can reverse engineer the entire state of your application. This is why people look into it. It's a newer way of thinking about how the log helps developers, and it's rising in the data mesh world.

There's obviously the analytical use case used to ship logs and process data and attach them to computational frameworks like Flink, Spark streaming, TensorFlow or whatever.

The next important thing is durability and replayability. Immutability is hugely important. So when you write events to the log, they're useful, because now you can reprocess the same events with auditing and get the same state.

Alexander (29:35)
If you crash at a particular point in time, Kafka in particular gives you progress tracking: it'll snapshot your progress into another topic in the log.

So if the microservice or application crashes, it can pick up where you left off. Being able to go back in time and replay that log is powerful for disaster recovery use cases and not just for auditing.
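The pick-up-where-you-left-off mechanic can be sketched with a plain list standing in for the durable log and an integer standing in for the committed offset — a toy model of the idea, not real Kafka client code:

```python
log = ["e0", "e1", "e2", "e3", "e4"]  # the durable, replayable log

committed_offset = 0                   # snapshot of consumer progress
processed = []

def consume(upto):
    """Process events up to `upto`, committing progress after each one."""
    global committed_offset
    for offset in range(committed_offset, upto):
        processed.append(log[offset])
        committed_offset = offset + 1  # commit: durable progress marker

consume(3)          # process e0..e2, then the service "crashes"
consume(len(log))   # restart: resumes at the committed offset, no rework
print(processed, committed_offset)
```

Because progress is stored alongside the log, a restarted consumer resumes exactly where it stopped — and rewinding the offset to zero is all it takes to replay history for disaster recovery or auditing.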

It gives primitives to the services connected to the topic: you get circuit breaking and backoffs, and durability means data is safely stored on disk. Now, pair this with infinite retention.

I don't know if my head of product is gonna kill me for this, but we just released tiered storage. We're gonna make an announcement soon.

Combine that with being able to retain all events for an infinite amount of time on S3 — it's like two cents per gigabyte, super cheap, and you can get discounts on top of that — and you start to build your business as a function of the log. Everything else becomes an index on top of the log. Any questions from the audience before I move into immutability?

Bart (31:02)
No, it's good. You mentioned Kafka — what have these technologies gotten right in streaming data, where are we at in improvements, and what should we expect?

Alexander (31:24)
I can talk about the progress of streaming and immutability at the same time. Two things are too important to give up so quickly: one is data safety — no data loss — and the other is availability.

The log abstraction makes it somebody else's problem. With Redpanda, Kafka, Pulsar, or whatever, you extract those properties out of your application and configure the system so that if one of the brokers dies, the system stays up.

For us it’s Raft. There's a mathematical proof, there's a TLA+ proof, and we're working with Jepsen pretty soon to ensure that our implementation of Raft has high fidelity with the actual paper description.

But as an Application Engineer, as an architect, you understand your trade offs between availability and consistency. We opted into a strongly consistent protocol, because it's easy for application developers to understand.

It’s related to data loss. Other systems give you different levels of availability, and it's known per protocol. But hardware is so good right now that, by and large, the default should be that you don't have to pick between latency and data loss.

Of course, there are exceptions to this rule, I understand. But from talking to whatever, 400 or 500 companies over the course of two years: the majority of people are at 100,000 events per second, maybe 300,000 or 400,000 events per second.

They don't have to give up data safety. You don't have to make that really challenging trade which complicates everything else. By default, you should use a strong, consistent setting on your data.

And you should expect that the system — the log abstraction, in this case Redpanda — onboards the complexity of giving you primitives to reason about availability and consistency.

Alexander (34:35)
Let's talk about the evolution of streaming superimposed on the evolution of hardware. It's like lossy compression. It's going to be correct, but there's gonna be missing details because I can't cover 20 years with high fidelity.

You had old-school systems like TIBCO and Solace. Then RabbitMQ came early in the 2000s. And then Kafka came somewhere around 2010 to early 2011. What Kafka got right is that you don't need specialized hardware — you don't have to install an appliance in your data center.

It said: can we take the MapReduce systems thinking — where we program the system to use cheap, effective commodity hardware — and tune it so we can scale out? It wasn't about vertical scaling. It was about scaling out.

Kafka got an API that developers adopted, in part because it historically coincided with the launch of Apache Storm, which was open-sourced by Twitter.

Somewhere around that time a lot of people were thinking about event streaming. That became an architectural blueprint for a bunch of companies and startups.

Now, if you think about real time, you think Kafka. Back then it was ZooKeeper and Kafka 0.8, and it gave programmers an API for communicating with the storage subsystem. Pulsar and Azure added Kafka-compatible APIs, and we’re also leveraging the API.

So you can think of it like the SQL API. SQL became the lingua franca for databases. So that's Kafka. In 2014 Pulsar came and added the disaggregation of compute and storage.

That was the second important concept for event streaming. Event streaming in the classical Kafka implementation is really costly. Let's say you want to store a terabyte of data — that means more Kafka brokers, and in the cloud that's really expensive.

It's very, very expensive to use either EBS volumes, or local NVMe, SSD devices to host multiple terabytes of data. And data movement is increasing. We have customers that store petabytes of data with Redpanda.

When you think about that kind of scale, the actual cost structure moves out of the compute side into the tiered storage part. That’s what Pulsar did right: it disaggregated compute and storage.

Then came Redpanda. What could we learn from these two systems? We learned from Kafka that people liked the API. They love the millions of lines of code they don't have to write, because you can leverage TensorFlow and Spark ML and Flink and ClickHouse and CockroachDB and Materialize, and they all just work.

This huge ecosystem is built around this common API, and that's what people really love. Broadly speaking it's seen as this Lego piece, because there's this huge ecosystem of compatibility around it. So that's what Kafka did. And Pulsar made it cost-effective to store petabytes of data.

Bart (39:08)
So Pulsar made the jump anticipating much higher volumes of data and eliminating the need to have so much infrastructure in the cloud, which then becomes very, very costly as you're adding more volume.

Alexander (39:23)
Right, and the trade-off for both systems was complexity. Kafka in its base form — without adding the HTTP proxy and schema registry, which I think are fundamental pieces of streaming — has two fault domains.

Pulsar came about with three. So the systems are hard to understand. When we started working on this, we asked: how do we make this one binary?

The system could figure out roles dynamically. But from the point of view of a programmer or an architect: why is this thing so hard to use? We learned from Pulsar and Kafka to be fully compatible with the Kafka API.

We don't even develop any drivers. If you're using Sarama or franz-go or librdkafka, or the PHP or Ruby Kafka clients, or Spark Streaming or Flink, they all just work out of the box. The reason is that, to the client, we look exactly like a Kafka cluster.

So we return the same metadata during the client discovery call, so it looks like a Kafka cluster. That’s how SQL works: you submit some syntax and get back the same results, but the backend might be a NoSQL database, or it might give you different consistency guarantees.

And then the second one is that we integrate with your storage. So in your compute, you only have to store whatever your local SSD NVMe devices give you.

And this is probably what we're going to talk about next: a deep dive into tiered storage. The gist is, we had a customer that said, “Can you give us a quote for 10 terabytes of data?” We said, “Okay, here's how much money,” and told them we can transparently tier data between S3 — or any S3-compatible backend — and Redpanda.

And they said, “Hold on a second, can you push this to 12 petabytes of data?” When you do that, the cost of compute gets lost — it's literally lost in the noise versus how much it costs to store 12 petabytes on S3 object storage.
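Rough cost arithmetic makes the point concrete. The prices below are illustrative ballpark assumptions (object storage around the "two cents per gigabyte" mentioned earlier; broker-attached NVMe/EBS-class storage assumed several times higher), not quotes:

```python
GB_PER_PB = 1_000_000  # decimal petabyte, for back-of-envelope math

# Illustrative monthly prices per GB -- assumptions, check current cloud pricing.
S3_PER_GB = 0.02    # object storage, roughly "two cents per gigabyte"
NVME_PER_GB = 0.10  # broker-attached NVMe/EBS-class storage (assumed)

data_pb = 12
s3_cost = data_pb * GB_PER_PB * S3_PER_GB
nvme_cost = data_pb * GB_PER_PB * NVME_PER_GB

print(f"12 PB on object storage: ${s3_cost:,.0f}/month")
print(f"12 PB on broker disks:   ${nvme_cost:,.0f}/month")
```

At this scale, storage dominates the bill either way — which is why tiering cold data to object storage, and leaving only hot data on broker disks, changes the economics.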

Those were the two major takeaways from the architecture of these two systems. We really got to look back at the last 10 years and know where we wanted to be. So we took the best of both systems, focusing on the developer experience and making the developer a first-class citizen. Does that make sense?

Bart (42:20)
It does. So when we're talking about streaming data, are new folks being brought into the fold? We talked about application developers, people working in distributed systems, who are the new players in streaming data?

Alexander (42:44)
I think because of the multiple fault domains, the configuration, the data safety guarantees, and the protocols they chose, the systems that came before were really in this bracket of “for experts, by experts”.

They were for people like me who've spent 12 years in event streaming. You need to go deep and understand the trade-offs: acknowledgements on the client side, what it means to have a ZooKeeper failure.

This is a huge set of complexity. And we inherited some of them because of the Kafka compatibility. But by and large you had to be an expert. You either become an expert beforehand, or during an outage.

I thought we were serving the experts. But the design decision, making it a single binary, ended up placing Redpanda more in the center. It is our job to keep moving towards the novice.

What's interesting is actually we can be more accurate here. About 40% of our users are experts and about 60% are novices.

By focusing on the developer — making them successful locally on their laptop, making it super easy to use and understand, a single binary, a single folder.

We haven't even talked about the schema registry and HTTP proxy, and a bunch of other things that are also embedded in the same binary. It's actually a categorical elimination of complexity.

It’s not just that we eliminated Zookeeper. That's true. But we also eliminated a bunch of other systems. And we learned in practice that the application JavaScript developer, the Python data scientist, they also have the same need to leverage real time. But these technologies have been so difficult to get up and running and use.

A big part of this, and of learning from Kafka and Pulsar, was really about giving them compatibility with the last 10 years of the ecosystem. The fact that you can take TensorFlow today and it just works out of the box with zero configuration changes — it's kind of like magic to see hundreds of thousands of lines of code, actually millions in the case of Spark, just work.

So that's where the market is, and where we're heading: new technologies are really solving the human experience around streaming. There are a few technologies where, if you're willing to put in the time and basically become an expert in that particular technology, you could probably get them to work. But it’s still challenging. I think that's what Redpanda helps with.

Bart (46:51)
I like that. You mentioned Kafka, Storm, etc., and for some engineers with quite a bit of experience, even those changes were challenging. So a lot of people at first glance might think streaming data isn’t for them. There shouldn't have to be such a wide landscape with so many choices — this can actually be simplified, and that makes onboarding easier.

Alexander (47:30)
I want to talk about three specific features where we actually focus on the developer experience. The first one is this binary called RPK. It speaks to what happens when you're trying to run a system in production.

We embedded a ton of knowledge in this little tool called RPK — short for Redpanda Keeper, a cute little name — to run Redpanda. This is our main developer experience tool. It's a single Go binary that's really easy to use.

Redpanda itself is written in C++, but RPK being Go allows us to target host environments like Windows, Mac, and Linux. Redpanda was very much designed to run on Linux operating systems — it works on ARM and x86, but the operating system is Linux. RPK is the tool that handles all of this.

There's a thing called rpk tune for production. It measures the hardware and asks: does my NIC have multi-queue networking capabilities? If it does, turn it on. Make sure you have the max number of aio callbacks (iocbs) on every core, make sure we coalesce the interrupts on every core, make sure we poll the events rather than having the kernel interrupt the rest of the process.

Let's make sure we disable the I/O block scheduler's merging so the Linux block scheduler doesn't touch our data, because we use direct I/O for all files and we understand precisely how to lay the data out on the file system.

Think of it like a one-liner, like when you add a little JavaScript snippet to your personal website because you want to understand where users are coming from. That's really what RPK is. And it's open source.

Alexander (50:19)
The second thing on the developer experience that matters quite a bit is automatic data rebalancing. We thought a lot about this, and both features are in the open source codebase.

The data rebalancing piece is that, over time, your clusters get hot. This happens for lots of reasons — there's a network partition or whatever. Let's say one broker becomes particularly hot. That means it has an imbalance: a lot more network load, CPU load, disk load, or whatever.

To help the average developer, as a cloud-first company, we released data rebalancing for free. We're going to turn it on by default in future releases this month.

Data rebalancing constantly monitors and measures whether there is an imbalance in leadership. If there is, an internal anti-entropy mechanism says, "Why don't you share some load?" and transitions the cluster back into a healthy state. It does this every five or ten seconds.
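That anti-entropy loop can be sketched in a few lines. Everything here is illustrative, not Redpanda's actual algorithm: on each tick, move one partition leadership from the most-loaded node to the least-loaded one until the spread is small.

```python
# Toy anti-entropy leadership balancer (illustrative, not Redpanda's code).
# leaders_by_node maps a node name to the partitions it currently leads.

def rebalance_step(leaders_by_node):
    """Move one partition leadership from the hottest node to the coldest.
    Returns True if a transfer happened, False once the cluster is balanced."""
    hot = max(leaders_by_node, key=lambda n: len(leaders_by_node[n]))
    cold = min(leaders_by_node, key=lambda n: len(leaders_by_node[n]))
    if len(leaders_by_node[hot]) - len(leaders_by_node[cold]) <= 1:
        return False  # spread of at most one leader: good enough
    partition = leaders_by_node[hot].pop()
    leaders_by_node[cold].append(partition)
    return True

# One node is hot (five leaders), the others are nearly idle.
cluster = {"node-a": ["p0", "p1", "p2", "p3", "p4"],
           "node-b": ["p5"],
           "node-c": []}
while rebalance_step(cluster):  # in a real system: run every 5-10 seconds
    pass
print({n: len(ps) for n, ps in cluster.items()})
# {'node-a': 2, 'node-b': 2, 'node-c': 2}
```

A real implementation would weigh network, CPU, and disk load rather than just leader counts, but the shape, measure, detect imbalance, shed load, repeat, is the same.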

So we focus on the developer experience. It's not just talk, it's actually making what other people have made enterprise features open source. We care about protecting our cloud use, and that's really the only restriction we have on the license.

I want to make people successful. Even though getting started is easy, day two operations also have to be easy, and this is a day two operation.

Bart (52:26)
Great points, and it's particularly good to see this included in the open source, making it even more accessible for that 60% of entry-level people as well as folks with years of experience. DevEx is a buzzword and a fun thing to say.

But when you present these concrete examples, I think it really makes for a different story. We are close to the end, but I want to ask where you see this going?

Alexander (53:24)
There's a really fantastic trend of seeing the log as an abstraction mechanism for building new services. I call these control plane databases. We've seen a bunch of examples of this. Materialize is a good example that leverages a durable write-ahead log like Redpanda.

It gives people a computational framework expressed through SQL. There's another database called Memgraph, which leverages the source of truth, in this case Redpanda, to give programmers a graph abstraction, and so on.

This write-ahead log, the log as an abstraction mechanism, comes with specific ordering guarantees. And I think the future is going to continue in this direction: one lens could be SQL, another lens could be a graph database wrapped around the log, another lens is just a key-value store.

And another lens is search. If you think about it as a stack, the log sits at the bottom, above it is the level I call the platform, and then at the top there are applications.

At the application level, you have things like Alpaca, where Alpaca uses both the platform and the log itself in combination to present new semantic APIs to users.

Amazon has famously released Aurora, and there's a bunch of new papers coming out saying this is where the industry is heading, because of the properties we talked about: durability, auditability, immutability, ordering, and so on.

So when you take those primitives, it becomes really powerful and easier to build platforms and applications on top. Now you have a system whose only job is to give you ordering, immutability, and replication, roughly speaking.

So I think this is where the industry is heading in general. And it's really exciting, because I think it changes how people will build applications in the future. The log becomes a critical component of how you reason about resiliency and availability.
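The "lenses over a log" idea above can be sketched in a few lines. The class and function names here are purely illustrative, not any real API: one ordered, immutable sequence of events, with different views derived by replaying it.

```python
# Illustrative sketch of "the log as an abstraction" (hypothetical API).
# One append-only, totally ordered log; different "lenses" derive views
# from it by replay, never by mutating the log itself.

class Log:
    def __init__(self):
        self._entries = []  # append-only; entries are never modified in place

    def append(self, event):
        """Append an event and return its offset (position in total order)."""
        offset = len(self._entries)
        self._entries.append(event)
        return offset

    def replay(self, from_offset=0):
        """Replayability: re-read the log from any offset, in order."""
        return iter(self._entries[from_offset:])

def kv_lens(log):
    """One possible lens: a key-value table derived by replaying the log.
    Later writes win because the log imposes a total order."""
    table = {}
    for key, value in log.replay():
        table[key] = value
    return table

log = Log()
log.append(("user:1", "alice"))
log.append(("user:2", "bob"))
log.append(("user:1", "alicia"))  # an "update" is just a later log entry
print(kv_lens(log))  # {'user:1': 'alicia', 'user:2': 'bob'}
```

A SQL lens, a graph lens, or a search lens would each be another replay of the same entries into a different structure, which is exactly why the log's ordering and immutability guarantees do so much work.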

Bart (56:23)
Very good. You mentioned Alpaca, so I want to drop the link, because next Wednesday there will be a Linux Foundation webinar with Alpaca and Redpanda, talking about how Redpanda was able to help them up their game processing very large volumes of orders.

The other thing you mentioned, shadow indexing, we'll talk about next week in episode two of Real-time with Redpanda. So excited for that. Any resources you might recommend for folks who are curious about shadow indexing before they arrive next week?

Alexander (57:08)
Yes, our head of product is about to release a blog post on the future of shadow indexing. Basically, it's tiered storage plus plus; there's a lot more nuance there. Check out that blog post and come prepared with all of your questions.

If you're familiar with how Confluent and Pulsar do tiered storage, you'll want to understand what happens on failure, how we recover in the case of a Raft crash, and how we deliver on the promise of giving you infinite data retention with an S3 backup.

Bart (57:50)
Fantastic! Alex, it was a pleasure to spend time with you and learn a lot. We're stopping this conversation right here and starting new ones in Slack. It's really easy to jump in, and lots of friendly folks on the Redpanda team are ready to help out and answer any questions you might have.

While we were talking, an amazing artist was capturing all the things we discussed in one drawing, a nice artistic representation of today's session. Thank you very much, Alex.

Keep learning about Redpanda


There's more where that came from

Check out our other tech talks to keep learning