Vector databases vs. knowledge graphs for streaming data applications

Pros, cons, emerging trends, and getting started

Fortune Adekogbe

October 15, 2024

Humans learn to understand the world through relationships between objects. For example, most people will recognize a dog as an animal, or that books go on shelves. This mutual understanding of objects facilitates daily interactions and societal functions. Similarly, computers must grasp relationships between data points for tasks like content retrieval or answering queries. Vector databases and knowledge graphs are commonly used methods for organizing this data.

You can’t talk about vector databases without first understanding vectors. In this context, vectors are numerical arrays that contain fundamental information about specific data points. A vector database stores these vectors with pointers to the raw data they represent, which lets you perform all the typical CRUD operations associated with databases quickly and at scale.

Knowledge graphs, on the other hand, use a graph data structure that emphasizes the relationship between data points. Graphs are typically made up of nodes (entities or objects) and edges (relationships). Knowledge graphs go beyond this by representing data in semantic triples, which include a subject (a node), a predicate (an edge that defines a relationship), and an object (another node). These triples combine to form the knowledge graph.

Vector databases and knowledge graphs let you put this data to work and build systems that carry out tasks like semantic search, recommendation, and retrieval-augmented generation (RAG). This post compares these data layers based on how they structure data, their speed and scalability, the types of data they work best with, how well they integrate with machine learning models, how simple they are to learn and use, and their suitability for streaming data and applications.

Vector databases

A vector database stores mathematical summaries of high-dimensional data, also known as vectors or embeddings. These representations capture the essence of various types of data, such as text, images, audio, or tabular information. By condensing the data into a smaller array, the database maintains enough detail for the computer to comprehend the original information. You can then easily query these vectors, whether you're looking for an exact match or similar data points.

Vector databases are designed to ensure that your queries return the desired results quickly. They also handle increasing scale very well. Some can search millions of vectors with a latency of milliseconds.

This capability relies on the nearest neighbor search technique, which compares a query vector to stored vectors and returns the ones with the highest similarity scores. In vector space, similarity is indicated by the proximity of vectors to each other. While this approach (particularly the k-nearest neighbors algorithm) is popular, it struggles with scalability because an exact search compares the query against every stored vector, so the cost of each query grows with the size of the dataset.
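As a rough illustration of exact nearest neighbor search, here is a minimal Python sketch that scores a query against every stored vector using cosine similarity. The function names and the three-dimensional toy vectors are made up for this example; real embeddings usually have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def k_nearest(query, vectors, k=2):
    """Exact search: score the query against every stored vector."""
    scored = [(cosine_similarity(query, vec), key) for key, vec in vectors.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [key for _, key in scored[:k]]


# Toy "embeddings" with invented values, for illustration only.
vectors = {
    "dog": [0.9, 0.1, 0.0],
    "puppy": [0.8, 0.2, 0.1],
    "bookshelf": [0.0, 0.1, 0.9],
}

nearest = k_nearest([0.85, 0.15, 0.05], vectors, k=2)
print(nearest)
```

Every query touches every vector, which is why this brute-force approach becomes expensive as the collection grows.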

To address this, vector databases employ approximate nearest neighbor search, a modification of the algorithm that accelerates the search process while sacrificing accuracy to some extent. Vector databases implement this in a variety of ways, including graph-based search techniques like hierarchical navigable small world (HNSW), hashing, and quantization. This makes them suitable for tasks such as RAG with large language models (LLMs), anomaly detection, and similarity search.
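To make the hashing idea concrete, here is a small sketch of one well-known approach, random hyperplane locality-sensitive hashing. It is an illustration under simplifying assumptions (a single hash table, toy vectors, hypothetical function names) and not how any particular vector database implements its index.

```python
import random
from collections import defaultdict


def make_hyperplanes(num_planes, dim, seed=0):
    """Draw random hyperplanes; each one contributes one bit to the hash."""
    rng = random.Random(seed)
    return [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_planes)]


def signature(vector, hyperplanes):
    """Record which side of each hyperplane the vector falls on."""
    bits = []
    for plane in hyperplanes:
        dot = sum(p * v for p, v in zip(plane, vector))
        bits.append(1 if dot >= 0 else 0)
    return tuple(bits)


def build_index(vectors, hyperplanes):
    """Group vector keys into buckets keyed by their signature."""
    buckets = defaultdict(list)
    for key, vec in vectors.items():
        buckets[signature(vec, hyperplanes)].append(key)
    return buckets


def candidates(query, buckets, hyperplanes):
    """Return only the keys that share the query's bucket."""
    return buckets.get(signature(query, hyperplanes), [])


# Toy vectors with invented values.
vectors = {
    "dog": [0.9, 0.1, 0.0],
    "puppy": [0.8, 0.2, 0.1],
    "bookshelf": [0.0, 0.1, 0.9],
}
planes = make_hyperplanes(num_planes=4, dim=3)
index = build_index(vectors, planes)
matches = candidates([0.9, 0.1, 0.0], index, planes)
```

Vectors pointing in similar directions tend to land in the same bucket, so a query only needs to be compared against a small candidate set. The trade-off is the accuracy loss described above: a true neighbor that falls just across a hyperplane ends up in a different bucket and is missed.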

Advantages of vector databases

Some of the advantages of using a vector database are listed below:

‍Optimized for similarity search: Vector databases store raw data as vectors or embeddings, representing high-dimensional data in a memory-efficient format. These embeddings can be created using machine learning models that learn the most efficient way to distill highly dimensional data into an array that retains the original data's context and its relationship to other similar data. This makes them superior to knowledge graphs when searching for similar items in millions of vectors, especially when other types of relationships are less important.
‍High speed and scalability: Vector databases are designed to be extremely fast due to their use of approximate nearest neighbor search, which accelerates traditional neighbor search algorithms through a variety of techniques. Even if your data grows to millions of vectors, you can retrieve search results in milliseconds and ensure your users have the best possible experience. This makes vector databases the better choice when speed and scalability are critical. A vector database can also be updated much faster than a knowledge graph.
‍Better support for different data types: Vector databases can work with structured and unstructured data. Whether your data is in text, images, audio, or tabular format, modern machine learning techniques can generate embeddings and save them in your vector database. As a result, they are more versatile, particularly in handling unstructured data, than knowledge graphs.
‍More cost-effective: Vector databases cost less to update when compared to knowledge graphs because they are so quick. They are also generally the more cost-effective option regarding data size because they grow linearly, unlike a knowledge graph, which grows much faster.
Easier to learn and use: Vector databases are easier to learn and use than knowledge graphs because they involve familiar concepts. Data is stored in arrays, and popular distance metrics return similar arrays based on a query. This process is optimized with a modified nearest-neighbor search. In contrast, knowledge graphs introduce a more complex data structure and a dedicated query language.
Versatile in streaming applications: Vector databases are well-suited for applications that require continuous or batch data processing. In this regard, they are more adaptable than knowledge graphs, which are typically updated in batches.

Disadvantages of vector databases

Some disadvantages of using a vector database are listed below:

Less accurate: While vector databases are excellent at quickly identifying similar data points, they sacrifice some accuracy. They may miss some of the correct results or return more or fewer matches than are actually relevant. The accuracy of vector databases also decreases as the vectors' dimensions increase, a phenomenon known as the curse of dimensionality. Vector databases actively address this challenge by employing techniques like dimensionality reduction and specialized indexing structures. However, these only slow the rate at which the accuracy problem grows with increasing dimensions rather than eliminating it. This makes knowledge graphs the preferred option in terms of accuracy.
Less rich representation: Vector databases can only relate data points based on numeric similarity. This limits their precision and specificity compared to the rich relationships that knowledge graphs capture between data points.
Provides less context to LLMs: Vector databases can store factual knowledge that is supplied to large language models to help them respond to prompts correctly and reduce the likelihood of hallucinations. However, they only provide the models with information on similarity. Knowledge graphs are superior in this regard because they contain a lot more contextual information, allowing language models to generate factual and consistent results.
‍Results are less interpretable: Since vectors are typically generated using black box models, vector databases cannot help you interpret your results by explaining why two data points are similar. This interpretative aspect is easier to achieve with knowledge graphs due to their emphasis on rich relationships between entities.
Incompatibility with complex queries: Vector databases can only identify similarities between data points, so their query capabilities are limited. They cannot handle complex and nuanced queries about semantic relationships and properties, which knowledge graphs are designed for.

Using vector databases with streaming data applications and Redpanda

Vector databases can be integrated with data streams from a variety of sources. When data is ingested from these sources, a message can be published via Redpanda, a streaming data platform that can be used as a drop-in replacement for Apache Kafka®. You can then consume these messages via a service that generates vectors from raw data using an appropriate machine learning model and then returns a message indicating that a vector has been produced.

Rough architecture diagram for vector database

This message can then be consumed by a vector database, such as Pinecone, and the vector stored. Any service requiring access to the database can then perform queries. Thus, you can use Redpanda to integrate a vector database with a continuous stream of data.
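The flow described above can be sketched in a few lines of Python. Everything here is a stand-in chosen so the example runs on its own: a plain list plays the role of a Redpanda topic, a dictionary plays the role of the vector database, and `embed` is a hypothetical placeholder for a real embedding model. The sketch also collapses the embedding service and the storage step into one function for brevity.

```python
import json


def embed(text, dim=4):
    """Hypothetical placeholder for an embedding model.

    It folds character codes into a small fixed-size vector so the
    example runs without any machine learning dependencies.
    """
    vec = [0.0] * dim
    for i, ch in enumerate(text):
        vec[i % dim] += ord(ch) / 1000.0
    return vec


def process_stream(messages, vector_store):
    """Consume raw-data messages, embed and store each one, and yield
    a follow-up event saying the vector has been stored."""
    for raw in messages:
        record = json.loads(raw)
        vector_store[record["id"]] = {
            "vector": embed(record["text"]),
            "source": record["text"],  # pointer back to the raw data
        }
        yield json.dumps({"id": record["id"], "status": "vector_stored"})


# Made-up payloads standing in for messages read from a topic.
incoming = [
    json.dumps({"id": "doc-1", "text": "dogs are animals"}),
    json.dumps({"id": "doc-2", "text": "books go on shelves"}),
]

store = {}
events = list(process_stream(incoming, store))
```

In a real deployment, the list would be replaced by a Kafka-compatible consumer reading from Redpanda, and the dictionary by calls to your vector database's client library.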

Knowledge graphs

A knowledge graph represents a collection of real-world entities and their relationships. The graph organizes data into subject-predicate-object (SPO) triples, like "Victoria-plays-the clarinet." These triples contain context about aspects of the data, and multiple triples may be required to describe a single data point. Knowledge graphs are typically created using data from various sources, fed into a schema that normalizes them to the SPO format.

Machine learning techniques (such as natural language processing) can aid in this process by extracting entities and their relationships from data and incorporating them into the graph through a process known as semantic enrichment.

Knowledge graphs are "directed" graphs because the SPO structure is only meaningful in one direction. The subject does not have the same relationship to the object as the reverse. A statement like "Christopher wears shoes," in which the three words are the subject, predicate, and object in that order, has no meaning when read backward: "shoes wears Christopher." The directional nature of knowledge graphs is useful since it allows you to track the relationship between items and understand the nature of that connection. In a graph containing information about Christopher's wardrobe, you can precisely retrieve all the items he wears by following the "wears" relationship between Christopher and his wardrobe items.
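A minimal Python sketch can show how directed SPO triples behave. The `TripleStore` class is a hypothetical in-memory structure written for this example, not the API of any graph database, and the "jacket" and "bicycle" triples are invented to round out the data.

```python
from collections import defaultdict


class TripleStore:
    """A tiny in-memory store of directed subject-predicate-object triples."""

    def __init__(self):
        # Edges are indexed by subject only, so they can be followed
        # from subject to object but not the other way around.
        self._by_subject = defaultdict(list)

    def add(self, subject, predicate, obj):
        self._by_subject[subject].append((predicate, obj))

    def objects(self, subject, predicate):
        """Follow one kind of edge outward from a subject."""
        return [o for p, o in self._by_subject.get(subject, []) if p == predicate]


graph = TripleStore()
graph.add("Christopher", "wears", "shoes")
graph.add("Christopher", "wears", "jacket")
graph.add("Christopher", "owns", "bicycle")
graph.add("Victoria", "plays", "the clarinet")

worn = graph.objects("Christopher", "wears")  # follows the edge direction
reverse = graph.objects("shoes", "wears")     # nothing points this way
print(worn, reverse)
```

Following "wears" from Christopher returns exactly the items he wears, while asking the same question from "shoes" returns nothing, which mirrors the one-way meaning of the triple.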

Aside from the relationship in the data, knowledge graphs formalize the relationship between entities using well-defined ontologies. For example, if a graph connects users to phone models, there would be triples connecting those models to brands, brands to the company that created them, companies to industries, and so on. Search engines such as Google and Bing have used this data representation technique to create graphs that connect the entire public internet, making searching more efficient and useful. Facebook and eBay have also used knowledge graphs to build user and product recommendation systems. They are also increasingly being used to augment LLMs in solving RAG problems.
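To illustrate how an ontology chains triples together, the sketch below walks from a user to an industry through phone model, brand, and company. All entity and predicate names are invented for this example, and the breadth-first search is just one simple way to traverse such a graph.

```python
from collections import deque

# Hypothetical triples linking a user to an industry.
triples = [
    ("alice", "uses", "Phone X1"),
    ("Phone X1", "is_model_of", "BrandX"),
    ("BrandX", "is_owned_by", "ExampleCorp"),
    ("ExampleCorp", "operates_in", "consumer electronics"),
]


def find_path(triples, start, goal):
    """Breadth-first search along edge direction.

    Returns the list of triples that link start to goal, or None
    when no directed path exists.
    """
    outgoing = {}
    for s, p, o in triples:
        outgoing.setdefault(s, []).append((p, o))

    queue = deque([(start, [])])
    visited = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for predicate, nxt in outgoing.get(node, []):
            if nxt not in visited:
                visited.add(nxt)
                queue.append((nxt, path + [(node, predicate, nxt)]))
    return None


path = find_path(triples, "alice", "consumer electronics")
for subject, predicate, obj in path:
    print(subject, predicate, obj)
```

The returned path doubles as an explanation: each triple in it is a concrete reason why the user is connected to the industry.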

Advantages of knowledge graphs

Some of the important advantages of using a knowledge graph include:

‍Rich data representation: Knowledge graphs represent data in relationships between subjects and objects, both entities. This relationship extends to both items' ontologies, connecting them to other entities within the same category. This rich representation of relationships gives knowledge graphs an advantage over vector databases, which only store a measure of similarity between data points without any additional context.
‍More accurate results: Knowledge graphs are more accurate than vector databases. This is due to their precise and rich relationships, which increase the likelihood that when given a query, they will return the exact number of results required to answer it. So, a knowledge graph is better if accuracy is your top priority.
‍Better context for LLMs: Because of the rich data representations provided by knowledge graphs, they can represent a factual source of information accessible to a large language model. The model can easily generate precise, factual, and logical responses because it is provided with data points and their relationships on multiple levels. Thus, knowledge graphs are more useful for extending the capabilities of large language models than vector databases.
‍Interpretability of results: Because of the volume of relationships stored in knowledge graphs, you can find the reason for an answer by traversing the entities and relationships that link a query data point to it in the database. This makes knowledge graph query results easier to interpret than vector database results. It's also easier to correct errors if they occur.
‍Ability to handle more complex queries: Knowledge graphs allow for more complex and nuanced queries than vector databases. Knowledge graphs provide additional context, allowing you to ask more interesting questions about relationships or structures.

Disadvantages of knowledge graphs

Some disadvantages of using a knowledge graph include:

Lower speed and scalability: Knowledge graphs require more time to create and update. This is due to the number of entities and relationships that may need to be extracted from the data and accurately inserted into the graph. This makes them slower than vector databases, which are designed to operate much faster. This speed issue worsens with scale, making vector databases the better option.

Limited support for unstructured data types: Knowledge graphs are also limited in terms of the types of data they can process. They work best with structured data, as the relationships between entities are easier to determine. Unlike vector databases, they fall short with unstructured data and may require additional machine learning operations to extract entities and relationships from them to form the required knowledge structure.

Higher operational costs: Knowledge graphs tend to incur significantly higher costs due to the rapid expansion of graph size as more data points are added, as well as the ongoing expenses associated with updating them in terms of time and resources. This makes vector databases the preferred option when storage and maintenance costs need to be minimized.

Steeper learning curve: Knowledge graphs are an interesting but complex data structure to understand and implement. To create a useful knowledge graph, you must first understand your data thoroughly. They also require you to learn a query language to interact with them. This leads to a steeper learning curve for knowledge graphs compared to vector databases, making them less usable, especially for beginners.

Limited streaming applications: Knowledge graphs are also limited in handling data streams in real time. They are typically designed to process data and update the graph in batches, as opposed to vector databases, which can be used for real-time streaming or batch processing, depending on your needs.

Using knowledge graphs with streaming data applications and Redpanda

Knowledge graphs, such as ones built with Neo4j, can be configured to accept data from multiple sources and convert it into SPO triples. However, due to the slow insertion time of graphs, data must be aggregated over a fixed time window or number of samples before being processed in batches. With Redpanda, ingested data accumulates as messages on a topic, and a consumer can then process those messages once a batch is complete.
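The batching step can be sketched as a small Python generator. This is an illustration only: a list stands in for messages consumed from a Redpanda topic, and `max_size` and `max_wait_seconds` are hypothetical parameter names for the sample count and time window mentioned above.

```python
import time


def batch_messages(messages, max_size=3, max_wait_seconds=5.0, clock=time.monotonic):
    """Group incoming messages into batches.

    A batch is emitted when it reaches max_size or when
    max_wait_seconds have passed since the batch was started.
    Simplification: the time window is only checked when a new
    message arrives.
    """
    batch = []
    started = clock()
    for message in messages:
        batch.append(message)
        if len(batch) >= max_size or clock() - started >= max_wait_seconds:
            yield batch
            batch = []
            started = clock()
    if batch:
        yield batch  # flush whatever is left at the end of the stream


# Made-up message IDs standing in for consumed records.
batches = list(batch_messages(["m1", "m2", "m3", "m4", "m5", "m6", "m7"], max_size=3))
print(batches)
```

Each emitted batch would then be handed to the service that extracts triples and updates the graph in a single operation.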

Rough architecture diagram for knowledge graph

This processing may require an additional service that employs machine learning models to extract entities from unstructured data to populate the graph. The knowledge graph is then updated, and a message can be generated using Redpanda, after which your system can proceed with other tasks. (Psst! Check out Redpanda’s Engineering Corner on YouTube. You might be interested in their demo featuring Neo4j for advanced graph analysis.)

TL;DR

The table below summarizes how vector databases and knowledge graphs compare. A ✅ marks the option this post considers stronger on each axis:

Axis of Comparison	Vector Database	Knowledge Graph
Data representation	❌	✅
Result accuracy	❌	✅
Versatility with data	✅	❌
Query complexity	❌	✅
Speed and scalability	✅	❌
Operational cost	✅	❌
Knowledge base for LLMs	❌	✅
Result interpretability	❌	✅
Learning curve and usability	✅	❌
Versatility in streaming applications	✅	❌

Conclusion

This post should help you decide which data layer suits your specific use case and requirements. For example, if you need to stream data in your application, vector databases are a better fit.

Whether you choose a vector database or a knowledge graph, you can use Redpanda to manage your data streams. To take it for a free spin, sign up for Redpanda Serverless and follow along with the documentation.

Want to keep learning? Check out our post on vector databases vs. graph databases.

‍
