
AI-powered semantic search using pgvector and embeddings

Level of difficulty Medium
Reading time 9 min

Introduction

In the age of information, the ability to accurately and quickly retrieve data relevant to a user's query is paramount. Traditional search methodologies, which rely on keyword matching, often fall short when it comes to understanding the context and nuances of user queries. Semantic search, which seeks to improve search accuracy by understanding the searcher's intent and the contextual meaning of terms, has emerged as a solution to these limitations. However, implementing semantic search can be complex, involving advanced algorithms and understanding of natural language processing (NLP).

Existing solutions such as Elasticsearch and Solr have been at the forefront of tackling these challenges, providing platforms that support more nuanced search capabilities. These tools use a combination of inverted indices and text analysis techniques to improve search outcomes. Yet, the advent of machine learning and vector search technologies opens up new avenues for enhancing semantic search, with solutions like OpenAI's Embeddings API and the pgvector extension for PostgreSQL leading the charge.

Understanding pgvector and Vectors

At the heart of modern semantic search lies the concept of vectorization, where text is converted into numerical vectors that represent the semantic meaning of words or phrases. This approach allows for the comparison of textual information based on its content and context rather than mere textual similarity.

https://weaviate.io/blog/distance-metrics-in-vector-search

pgvector is an extension for PostgreSQL that stores high-dimensional vectors and searches them efficiently. It adds a vector column type, distance operators (L2 distance, inner product, and cosine distance), and indexes for nearest-neighbor search. This article uses cosine similarity, which measures the cosine of the angle between two vectors. The metric ranges from -1 to 1 and indicates how similar two vectors are in direction, with 1 meaning identical direction and -1 meaning opposite directions.

Cosine Similarity for Semantic Search

Cosine similarity measures how similar two vectors are, irrespective of their magnitude. It is calculated as the cosine of the angle between the two vectors in a multi-dimensional space. The result is a value between -1 and 1, where 1 means the vectors point in the same direction, 0 indicates orthogonality (no similarity), and -1 means they point in diametrically opposite directions (complete dissimilarity).

In semantic search or document similarity, each document or piece of text is represented as a vector in a high-dimensional space. Cosine similarity then quantifies how similar two documents are in content and meaning, based on the angle between their vectors rather than the vectors' length. This makes it especially useful for comparing documents of different lengths in a normalized manner: the direction carries the semantic meaning, while the magnitude is ignored.
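To make the definition concrete, here is a minimal sketch of the calculation in plain JavaScript. The function names cosineSimilarity and cosineDistance are illustrative and not part of any library; the only outside fact assumed is that pgvector's <=> operator returns the cosine distance, which is 1 minus the cosine similarity.

```javascript
// Cosine similarity of two equal-length numeric vectors:
// dot(a, b) / (|a| * |b|).
function cosineSimilarity(a, b) {
  if (a.length !== b.length) {
    throw new Error('Vectors must have the same number of dimensions');
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) {
    throw new Error('Cosine similarity is undefined for a zero vector');
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Cosine distance, the quantity pgvector's <=> operator returns.
function cosineDistance(a, b) {
  return 1 - cosineSimilarity(a, b);
}

// Same direction, different magnitude: similarity is (approximately) 1.
console.log(cosineSimilarity([1, 2], [2, 4]));
// Orthogonal vectors: similarity is 0.
console.log(cosineSimilarity([1, 0], [0, 1]));
// Opposite directions: similarity is (approximately) -1.
console.log(cosineSimilarity([1, 2], [-1, -2]));
```

Because of floating-point rounding, the first and last results may differ from 1 and -1 in the last decimal places, so compare with a small tolerance rather than with strict equality.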

https://www.learndatasci.com/glossary/cosine-similarity/

OpenAI Embeddings API

Embeddings convert words, sentences, or documents into vectors of real numbers, capturing their semantic properties in a dense space with relatively few dimensions (hundreds or thousands, rather than one dimension per vocabulary entry). This representation lets machine learning models process text by its meaning and context, enabling tasks like text classification and semantic search. Unlike one-hot encoding, which creates sparse, high-dimensional vectors without semantic information, embeddings provide a compact and semantically rich representation, which makes them effective for a wide range of natural language processing tasks.
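The difference from one-hot encoding can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: the three-word vocabulary, the helper names, and especially the two-dimensional "embeddings", which are hand-picked numbers rather than output of a real model.

```javascript
// Toy vocabulary; a real one would contain tens of thousands of entries.
const vocabulary = ['python', 'javascript', 'park'];

// One-hot encoding: a vector as long as the vocabulary with a single 1.
function oneHot(word) {
  const index = vocabulary.indexOf(word);
  if (index === -1) throw new Error(`Unknown word: ${word}`);
  return vocabulary.map((_, i) => (i === index ? 1 : 0));
}

// Cosine similarity of two equal-length vectors.
function cosine(a, b) {
  const dot = a.reduce((sum, x, i) => sum + x * b[i], 0);
  const norm = (v) => Math.sqrt(v.reduce((sum, x) => sum + x * x, 0));
  return dot / (norm(a) * norm(b));
}

// Hand-picked 2-dimensional vectors, NOT real model output.
const toyEmbeddings = {
  python: [0.9, 0.1],
  javascript: [0.8, 0.2],
  park: [0.1, 0.9],
};

// Any two different one-hot vectors are orthogonal, so their similarity
// is always 0, no matter how related the words are.
console.log(cosine(oneHot('python'), oneHot('javascript')));

// Dense vectors can place related words close together...
console.log(cosine(toyEmbeddings.python, toyEmbeddings.javascript));
// ...and unrelated words further apart.
console.log(cosine(toyEmbeddings.python, toyEmbeddings.park));
```

With one-hot vectors every pair of distinct words looks equally unrelated, while the dense vectors give the two programming languages a much higher similarity than the language and the park.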

OpenAI's Embeddings API, with its newest models text-embedding-3-small and text-embedding-3-large, represents a significant advancement in making semantic search accessible and accurate. It lets developers convert text into high-quality semantic vectors using state-of-the-art language models, and it requires minimal NLP knowledge to integrate into applications.

Implementing Semantic Search with OpenAI Embeddings API and pgvector

To leverage the power of semantic search in your applications, you can use a combination of OpenAI's Embeddings API for generating text embeddings and pgvector for storing and querying these vectors in PostgreSQL. Here's a simplified solution example:

First, run the PostgreSQL container with pgvector:

docker run -e POSTGRES_PASSWORD=postgres -p 5432:5432 ankane/pgvector

Then, let's connect to the database and create a table with a vector column for the embeddings:

-- Enable the pgvector extension.
CREATE EXTENSION IF NOT EXISTS vector;

-- Create a table with a 1536-dimensional vector column.
-- The text-embedding-3-small model produces 1536-dimensional embeddings.
CREATE TABLE IF NOT EXISTS articles (
  id SERIAL PRIMARY KEY,
  embedding vector(1536),
  content TEXT
);

The next step is to generate embeddings for several sample articles and store them in the database. I'll use a Node.js implementation, but it could easily be replaced with a Python one, for example. For demo purposes, I asked ChatGPT to generate 3 sample articles about programming and 2 more on other topics:

import OpenAI from 'openai';
import pg from 'pg';

const openai = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY
});
const client = new pg.Pool({
  connectionString: 'postgres://postgres:postgres@localhost:5432/postgres'
});

const articles = [
// Article 1: Understanding the Basics of Python Programming
"Python is a highly versatile and widely used programming language, known for its ease of learning and flexibility in application development. As a high-level, interpreted language, Python enables developers to express concepts in fewer lines of code than would be possible in languages such as C++ or Java. Its straightforward syntax emphasizes readability, making it an ideal choice for beginners in the programming world. Additionally, Python supports multiple programming paradigms, including procedural, object-oriented, and functional programming.\n\nThe language comes with a comprehensive standard library that includes support for various tasks such as file I/O, internet protocols, and web services. Python's extensive suite of modules and packages enhances its scalability and utility in developing complex applications. From web development to data analysis, machine learning, and scientific computing, Python's libraries such as Flask, Django, Pandas, NumPy, and TensorFlow make it an indispensable tool for programmers and researchers alike.\n\nFurthermore, Python's community support is unparalleled. The availability of vast resources, tutorials, and forums facilitates learning and problem-solving. For developers looking to dive into new projects or troubleshoot existing code, Python's community provides a supportive environment to share knowledge and find solutions. As a result, Python continues to be a popular choice for both novice and experienced programmers, cementing its place as a cornerstone in the landscape of software development.",
// Article 2: Exploring the World of JavaScript for Web Development
"JavaScript stands as the backbone of modern web development, powering the dynamic behavior on the majority of websites. It is an essential language for front-end development, allowing developers to create interactive web pages that respond to user input. Unlike traditional web design that relies on static HTML and CSS, JavaScript introduces functionality and interactivity, making web experiences more engaging and user-friendly.\n\nWith the advent of Node.js, JavaScript has also become a significant player in server-side programming, enabling developers to use a single language across the entire web development stack. This has simplified the development process and opened up new possibilities for full-stack development. Libraries and frameworks like React, Angular, and Vue have further elevated JavaScript's status by providing powerful tools for building complex and responsive user interfaces.\n\nThe ecosystem surrounding JavaScript is constantly evolving, with new tools and frameworks emerging to address the challenges of modern web development. Whether it's through improving application performance, enhancing security, or offering better development workflows, the JavaScript community is at the forefront of web innovation. For aspiring web developers, mastering JavaScript is a crucial step towards building sophisticated web applications and pursuing a successful career in tech.",
// Article 3: The Importance of Version Control Systems in Software Development
"Version control systems (VCS) are fundamental tools in the realm of software development, providing teams with the ability to manage changes to source code over time. Systems like Git have become industry standards, enabling developers to collaborate more efficiently on projects of any scale. By tracking every modification made to the codebase, a VCS allows teams to revert to previous versions, compare changes, and identify when and by whom a particular change was made.\n\nThe use of a VCS facilitates a collaborative and organized development process, especially in large teams where multiple developers work on the same codebase simultaneously. It supports branching and merging strategies, allowing for parallel development streams without interference. This means that features, fixes, or experiments can be developed in isolation and then integrated into the main project at a suitable time.\n\nMoreover, version control is critical for continuous integration/continuous deployment (CI/CD) pipelines, automating the testing and deployment of software. This not only speeds up the development cycle but also helps maintain high-quality standards. Whether for open-source projects or enterprise applications, adopting a robust VCS is indispensable for modern software development practices, ensuring code integrity and facilitating team collaboration.",
// Article 4: The Evolution of Digital Marketing Strategies
"Digital marketing has undergone a remarkable transformation over the past decade, adapting to changing consumer behaviors and technological advancements. In the early days, digital marketing was predominantly focused on email campaigns and basic online advertising. Today, it encompasses a wide range of tactics, including search engine optimization (SEO), content marketing, social media advertising, and influencer partnerships. This evolution reflects the shift towards a more integrated and user-centric approach, aiming to engage customers across multiple digital channels.\n\nThe rise of social media platforms and mobile technology has significantly influenced digital marketing strategies. Marketers now prioritize content that is not only relevant and engaging but also optimized for mobile devices. Video content, in particular, has seen explosive growth due to its effectiveness in capturing audience attention. Moreover, data analytics play a crucial role in shaping marketing strategies, allowing businesses to understand consumer preferences and behavior in real-time.\n\nAs digital landscapes continue to evolve, marketers must stay ahead of the curve by embracing new technologies and platforms. Artificial intelligence (AI) and machine learning are becoming increasingly important, offering innovative ways to personalize marketing efforts and enhance customer experiences. The future of digital marketing lies in its ability to adapt to technological changes and consumer trends, emphasizing the importance of agility and innovation in achieving success.",
// Article 5: The Impact of Urban Green Spaces on Public Health
"Urban green spaces, such as parks, gardens, and river walkways, play a crucial role in enhancing the quality of life in cities. These areas provide a sanctuary from the hustle and bustle of urban life, offering residents opportunities for recreation, relaxation, and interaction with nature. Studies have shown that access to green spaces significantly contributes to physical and mental health, reducing stress levels, encouraging physical activity, and improving overall well-being.\n\nThe design and maintenance of urban green spaces are vital considerations for city planners and environmentalists. Well-designed green spaces not only support biodiversity and help mitigate the effects of urban heat islands but also promote social cohesion by serving as communal areas for social interaction and community events. The inclusion of green spaces in urban planning is increasingly recognized as a critical element for sustainable and livable cities.\n\nAs urban populations continue to grow, the challenge of integrating nature into cityscapes becomes more pressing. Innovative solutions, such as vertical gardens, green roofs, and urban forestry initiatives, are being explored to expand greenery in densely populated areas. The future of urban development lies in creating a harmonious balance between built environments and natural spaces, highlighting the importance of green spaces in building healthy, resilient, and inclusive communities."
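];

async function main() {
  for (const article of articles) {
    // Generate an embedding for the full article text.
    const embedding = await openai.embeddings.create({
      model: "text-embedding-3-small",
      input: article,
      encoding_format: "float",
    });
    // pgvector accepts a vector as a '[x1,x2,...]' string, which is
    // what JSON.stringify produces for an array of numbers.
    await client.query(
      'INSERT INTO articles (content, embedding) VALUES ($1, $2)',
      [article, JSON.stringify(embedding.data[0].embedding)]
    );
  }
  // Close the pool so the script exits once all rows are inserted.
  await client.end();
}

main().catch((err) => {
  console.error(err);
  process.exit(1);
});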
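The insert passes JSON.stringify(...) of the embedding as a query parameter because pgvector reads a vector from its text form, a bracketed, comma-separated list of numbers. The sketch below wraps that step in a small helper that also checks the embedding before it reaches the database. The helper name toVectorLiteral and its expectedDimensions parameter are hypothetical, introduced here only for illustration.

```javascript
// Hypothetical helper: serialize an embedding into the text form that
// pgvector accepts, e.g. [0.25, -0.5, 1] -> '[0.25,-0.5,1]'.
function toVectorLiteral(embedding, expectedDimensions) {
  if (!Array.isArray(embedding) || embedding.length !== expectedDimensions) {
    throw new Error(`Expected an array of ${expectedDimensions} numbers`);
  }
  // JSON.stringify would silently turn NaN or Infinity into null,
  // so reject such values explicitly.
  if (!embedding.every((x) => typeof x === 'number' && Number.isFinite(x))) {
    throw new Error('Embedding contains a non-finite or non-numeric value');
  }
  return JSON.stringify(embedding);
}

console.log(toVectorLiteral([0.25, -0.5, 1], 3)); // prints [0.25,-0.5,1]
```

In the script above, the second query parameter could then be written as toVectorLiteral(embedding.data[0].embedding, 1536), so that a mismatch with the vector(1536) column is reported by the application instead of by a failed INSERT.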

After the script runs, the articles table contains 5 rows.

Everything is now ready to search for similar articles. Let's generate an embedding for the search input (for demo purposes, I'll use the title of the first article) and query the database, comparing it against the vector column:

import OpenAI from 'openai';
import pg from 'pg';

const openai = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY
});
const client = new pg.Pool({
  connectionString: 'postgres://postgres:postgres@localhost:5432/postgres'
});

async function main() {
  // Embed the search input with the same model used for the articles.
  const embedding = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: "Understanding the Basics of Python Programming",
    encoding_format: "float",
  });
  // <=> is pgvector's cosine distance operator, so 1 - distance is the
  // cosine similarity; the most similar articles come first.
  const result = await client.query(
    `SELECT
       SUBSTRING(content, 1, 50) AS preview,
       1 - (embedding <=> $1) AS similarity
     FROM articles
     ORDER BY similarity DESC`,
    [JSON.stringify(embedding.data[0].embedding)]
  );
  console.log(result.rows);
  await client.end();
}

main().catch((err) => {
  console.error(err);
  process.exit(1);
});

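Running the script prints the beginning of each article together with its score, most similar first. The score, computed as 1 - (embedding <=> $1), is the cosine similarity: the closer it is to zero, the less similar the article is to the query. By setting a minimum value for this score, you can easily cut off irrelevant matches.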
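One way to apply such a cut-off is directly in the SQL query. The sketch below builds a parameterized query for the articles table used in this article. The helper name buildSearchQuery, the default threshold of 0.3, and the default limit of 5 are assumptions chosen for illustration; a suitable threshold depends on the data and the embedding model and should be tuned on real queries.

```javascript
// Hypothetical helper: build a parameterized query that returns only the
// articles whose cosine similarity to the query vector is at least
// minSimilarity, most similar first.
function buildSearchQuery(queryEmbedding, { minSimilarity = 0.3, limit = 5 } = {}) {
  if (minSimilarity < -1 || minSimilarity > 1) {
    throw new Error('minSimilarity must be between -1 and 1');
  }
  return {
    text: `
      SELECT id,
             SUBSTRING(content, 1, 50) AS preview,
             1 - (embedding <=> $1::vector) AS similarity
      FROM articles
      WHERE 1 - (embedding <=> $1::vector) >= $2
      ORDER BY embedding <=> $1::vector
      LIMIT $3`,
    // The vector is passed in pgvector's text form, '[x1,x2,...]'.
    values: [JSON.stringify(queryEmbedding), minSimilarity, limit],
  };
}

// Toy 2-dimensional vector for demonstration only; real queries would
// pass the 1536-dimensional embedding returned by the API.
const query = buildSearchQuery([0.1, 0.2], { minSimilarity: 0.5, limit: 3 });
console.log(query.values);
```

The returned object has the text and values shape that node-postgres accepts as a query config, so in the search script it could be used as client.query(buildSearchQuery(embedding.data[0].embedding)). Ordering by the distance in ascending order is equivalent to ordering by the similarity in descending order.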

Conclusion

The combination of OpenAI's Embeddings API and pgvector for PostgreSQL offers a powerful and efficient solution for implementing semantic search. This approach leverages the latest advancements in machine learning and database technology to provide accurate, context-aware search results. While this solution is highly effective, it's also worth exploring open-source alternatives for embeddings which can offer similar capabilities without the reliance on proprietary APIs. Ultimately, the choice of tools will depend on specific project requirements, including the need for accuracy, scalability, and cost considerations.
