
Organizations use retrieval-augmented generation (RAG) to incorporate current, domain-specific data into language model-based applications without extensive fine-tuning.
This article outlines and defines various practices used across the RAG pipeline—full-text search, vector search, chunking, hybrid search, query rewriting, and re-ranking.
Full-text search is the process of searching the entire document or dataset, rather than just indexing and searching specific fields or metadata. This type of search is typically used to retrieve the most relevant chunks of text from the underlying dataset or knowledge base. These retrieved chunks are then used to augment the input to the language model, providing context and information to improve the quality of the generated response.
Full-text search is often combined with other search techniques, such as vector search or hybrid search, to leverage the strengths of multiple approaches.
The purpose of full-text search in a RAG pipeline is to find the passages that best match the words in a user's query so they can be passed to the language model as grounding context. Implementing full-text search typically involves tokenizing the content, building an inverted index over the resulting terms, and scoring matches with a ranking function such as BM25; a minimal example follows.
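As a minimal sketch of keyword retrieval, the example below uses the open-source rank_bm25 package (an assumed choice; any BM25 implementation plays the same role) to score a small corpus against a query:

```python
# Minimal full-text (keyword) retrieval sketch using BM25 scoring.
# Assumes the third-party rank_bm25 package: pip install rank_bm25
from rank_bm25 import BM25Okapi

corpus = [
    "Azure AI Search supports full-text and vector queries.",
    "Chunking splits long documents into smaller passages.",
    "Reciprocal rank fusion merges results from parallel queries.",
]

# Tokenize naively on whitespace; production analyzers also apply
# lowercasing, stemming, and stop-word removal.
tokenized = [doc.lower().split() for doc in corpus]
bm25 = BM25Okapi(tokenized)

query = "how does full-text search work"
scores = bm25.get_scores(query.lower().split())

# Pair each passage with its BM25 score and keep the best matches.
ranked = sorted(zip(scores, corpus), reverse=True)
for score, doc in ranked[:2]:
    print(f"{score:.2f}  {doc}")
```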
Vector search retrieves stored matching information based on conceptual similarity, or the underlying meaning of sentences, rather than exact keyword matches. In vector search, machine learning models generate numeric representations of data, including text and images. Because the content is numeric rather than plain text, matching is based on the vectors most similar to the query vector, enabling matches on semantically related text, content across languages, and even mixed content types such as images and text.
With the rise of generative AI applications, and of interfaces built around dialogue and question-and-answer formats, vector search and vector databases have seen a dramatic rise in adoption. Embeddings are a specific type of vector representation created by natural language machine learning models trained to identify patterns and relationships between words.
Processing a vector search involves three steps: an embedding model converts the content (at indexing time) and the query (at search time) into vectors; the query vector is compared against the stored vectors using a similarity metric such as cosine similarity; and the closest matches are returned as results.
Things to consider when implementing vector search include the choice of embedding model, the similarity metric, the index type (exhaustive versus approximate nearest neighbor), and the storage and compute cost of the vectors; a minimal sketch follows.
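As a minimal sketch of the three steps, the example below indexes a few chunks and ranks them by cosine similarity to a query. The embed() function here is a toy stand-in for a real embedding model (for example, an Azure OpenAI embeddings deployment):

```python
# Minimal vector search sketch. embed() is a toy stand-in for a real
# embedding model; replace it to get meaningful semantic similarity.
import numpy as np

def embed(text: str, dim: int = 64) -> np.ndarray:
    """Toy bag-of-words hashing 'embedding' (consistent within one run)."""
    vec = np.zeros(dim)
    for word in text.lower().split():
        vec[hash(word) % dim] += 1.0
    return vec

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

# Step 1: embed each chunk once at indexing time.
chunks = [
    "Vector search matches on meaning rather than exact keywords.",
    "Chunking keeps inputs under the embedding model's token limit.",
]
index = [(chunk, embed(chunk)) for chunk in chunks]

# Steps 2 and 3: embed the query, compare, and return the closest match.
query_vec = embed("search by semantic meaning")
results = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
print(results[0][0])  # most similar chunk
```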
Chunking is the process of dividing large documents and text files into smaller parts to stay under the maximum token input limits for embedding models. Partitioning your content into chunks ensures that your data can be processed by the embedding models and that you don’t lose information due to truncation.
For example, the maximum length of input text for the Azure OpenAI Service text-embedding-ada-002 model is 8,191 tokens. Given that each token is around four characters of text for common OpenAI models, this maximum limit is equivalent to around 6,000 words of text. If you’re using these models to generate embeddings, it’s critical that the input text stays below the limit.
Documents are divided into smaller segments depending on the token limits of the embedding model, the natural structure of the content (sentences, paragraphs, sections), and the granularity the application needs at retrieval time.
When implementing chunking, it’s important to consider factors such as chunk size, the overlap between adjacent chunks, whether boundaries respect sentence and paragraph breaks, and the metadata carried with each chunk; a fixed-size chunker with overlap is sketched below.
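As a minimal sketch, assuming the tiktoken package (the tokenizer used for counting tokens with OpenAI models), the function below splits text into fixed-size token windows, repeating a few tokens between adjacent chunks so context isn’t lost at boundaries:

```python
# Fixed-size chunking with overlap, measured in tokens.
# Assumes the tiktoken package: pip install tiktoken
import tiktoken

def chunk_text(text: str, max_tokens: int = 500, overlap: int = 50) -> list[str]:
    """Split text into chunks of at most max_tokens tokens, with
    `overlap` tokens repeated between adjacent chunks."""
    enc = tiktoken.get_encoding("cl100k_base")  # encoding used by ada-002
    tokens = enc.encode(text)
    chunks = []
    step = max_tokens - overlap
    for start in range(0, len(tokens), step):
        window = tokens[start:start + max_tokens]
        chunks.append(enc.decode(window))
        if start + max_tokens >= len(tokens):
            break
    return chunks
```

Keeping max_tokens well under the embedding model’s limit (8,191 tokens for text-embedding-ada-002) leaves headroom and tends to produce more focused, retrievable chunks.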
Hybrid search combines keyword search and vector search results and fuses them together using a scoring algorithm. A common model is reciprocal rank fusion (RRF). When two or more queries are executed in parallel, RRF evaluates the search scores to produce a unified result set.
For generative AI applications and scenarios, hybrid search often refers to the ability to search both full text and vector data.
The process of hybrid search involves executing the keyword query and the vector query in parallel over the same index, scoring each result set independently, and fusing the two ranked lists (for example, with RRF) into a single unified result set.
When implementing hybrid search, consider how the two result sets are weighted relative to each other, the latency cost of running queries in parallel, and whether your search service supports both query types over the same index; a minimal RRF implementation is sketched below.
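As a minimal sketch of reciprocal rank fusion: a document’s fused score is the sum of 1/(k + rank) over every ranked list it appears in, where k is a smoothing constant (60 is the commonly cited default). Documents retrieved by both queries naturally rise to the top:

```python
# Reciprocal rank fusion (RRF): fuse ranked lists from parallel queries.
from collections import defaultdict

def rrf(ranked_lists: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Each inner list is doc IDs ordered best-first. A document's fused
    score is the sum over lists of 1 / (k + rank), with rank starting at 1."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

keyword_hits = ["doc_a", "doc_b", "doc_c"]   # from the full-text query
vector_hits = ["doc_b", "doc_d", "doc_a"]    # from the vector query
print(rrf([keyword_hits, vector_hits]))
# doc_b and doc_a score highest because both queries retrieved them
```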
Query rewriting is an important technique used in RAG to enhance the quality and relevance of the information retrieved by modifying and augmenting a provided user query. Query rewriting creates variations of the same query that are shared with the retriever simultaneously, alongside the original query. This helps remediate poorly phrased questions and casts a broader net for the type of knowledge collected for a single query.
In RAG systems, rewriting helps improve recall and better capture user intent. It’s performed pre-retrieval, before the information retrieval step.
Query rewriting can be approached in several ways: paraphrasing the query into clearer variants, decomposing a complex question into focused sub-queries, or generating a hypothetical answer and embedding it for retrieval (the HyDE technique).
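As a minimal sketch of the paraphrasing approach, the function below asks a model for rephrased variants and sends them to the retriever alongside the original query. The complete() function is a placeholder for any chat-model call (for example, a deployment in Azure AI Foundry), and the prompt is illustrative rather than a prescribed template:

```python
# Query rewriting sketch: generate variants of a user query for retrieval.
def complete(prompt: str) -> str:
    """Placeholder for a real chat-model call. This stub returns canned
    rewrites so the sketch runs end to end; wire up your model here."""
    return (
        "what techniques does RAG use for retrieval\n"
        "how does retrieval augmented generation find documents\n"
        "RAG knowledge retrieval methods"
    )

def rewrite_query(user_query: str, n_variants: int = 3) -> list[str]:
    """Return the original query plus n rephrased variants, all of which
    are sent to the retriever in parallel."""
    prompt = (
        f"Rewrite the search query below in {n_variants} different ways, "
        "fixing unclear phrasing and adding likely synonyms. "
        "Return one rewrite per line.\n\n"
        f"Query: {user_query}"
    )
    variants = [line.strip() for line in complete(prompt).splitlines() if line.strip()]
    return [user_query] + variants[:n_variants]

print(rewrite_query("how do RAG retrievers work"))
```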
Re-ranking, or L2 ranking, uses the context or semantic meaning of a query to compute a new relevance score over pre-ranked results. Post-retrieval, the retrieval system passes its search results to a ranking machine-learning model that scores each document (or text chunk) by relevance. Then only a limited, defined number of top results (the top 50, top 10, or top 3) are shared with the LLM, as in the sketch below.
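As a minimal sketch, assuming the open-source sentence-transformers package (managed rankers, such as the semantic ranker in Azure AI Search, play the same role), a cross-encoder scores each query-document pair and only the top results are kept:

```python
# Re-ranking sketch with a cross-encoder.
# Assumes sentence-transformers: pip install sentence-transformers
from sentence_transformers import CrossEncoder

# Example open model; any cross-encoder trained for relevance works.
model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[str], top_k: int = 3) -> list[str]:
    """Score each (query, chunk) pair jointly and keep the top_k chunks
    to pass to the LLM as context."""
    scores = model.predict([(query, chunk) for chunk in candidates])
    ranked = sorted(zip(scores, candidates), reverse=True)
    return [chunk for _, chunk in ranked[:top_k]]
```

Unlike the first-stage retriever, the cross-encoder reads the query and each candidate together, which is slower but typically more accurate, which is why it’s applied only to a small pre-ranked set.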
RAG systems employ various techniques to enhance knowledge retrieval and improve the quality of generated responses. These techniques work together to provide language models with highly relevant context for generating accurate and informative responses.
To get started, use the following resources to build a RAG application with Azure AI Foundry and to pair it with agents built using Microsoft Copilot Studio.
Organizations across industries are leveraging Azure AI Foundry and Microsoft Copilot Studio capabilities to drive growth, increase productivity, and create value-added experiences.
We’re committed to helping organizations use and build AI that is trustworthy, meaning it is secure, private, and safe. We bring best practices and learnings from decades of researching and building AI products at scale to provide industry-leading commitments and capabilities that span our three pillars of security, privacy, and safety. Trustworthy AI is only possible when you combine our commitments, such as our Secure Future Initiative and our Responsible AI principles, with our product capabilities to unlock AI transformation with confidence.
Azure remains steadfast in its commitment to Trustworthy AI, with security, privacy, and safety as priorities. Check out the 2024 Responsible AI Transparency Report.