
Organizations use retrieval-augmented generation (RAG) to incorporate current, domain-specific data into language model-based applications without extensive fine-tuning.
This article outlines and defines various practices used across the RAG pipeline—full-text search, vector search, chunking, hybrid search, query rewriting, and re-ranking.
Full-text search is the process of searching the entire document or dataset, rather than just indexing and searching specific fields or metadata. This type of search is typically used to retrieve the most relevant chunks of text from the underlying dataset or knowledge base. These retrieved chunks are then used to augment the input to the language model, providing context and information to improve the quality of the generated response.
Full-text search is often combined with other search techniques, such as vector search or hybrid search, to leverage the strengths of multiple approaches.
The purpose of full-text search in a RAG pipeline is to find the passages that best match the words in a user's query so they can be passed to the language model as grounding context. Implementing full-text search typically involves tokenizing the content, building an inverted index over the resulting terms, and scoring matches with a ranking function such as BM25; a minimal example follows.
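As a minimal sketch of keyword retrieval, the example below uses the open-source rank_bm25 package (an assumed choice; any BM25 implementation plays the same role) to score a small corpus against a query:

```python
# Minimal full-text (keyword) retrieval sketch using BM25 scoring.
# Assumes the third-party rank_bm25 package: pip install rank_bm25
from rank_bm25 import BM25Okapi

corpus = [
    "Azure AI Search supports full-text and vector queries.",
    "Chunking splits long documents into smaller passages.",
    "Reciprocal rank fusion merges results from parallel queries.",
]

# Tokenize naively on whitespace; production analyzers also apply
# lowercasing, stemming, and stop-word removal.
tokenized = [doc.lower().split() for doc in corpus]
bm25 = BM25Okapi(tokenized)

query = "how does full-text search work"
scores = bm25.get_scores(query.lower().split())

# Pair each passage with its BM25 score and keep the best matches.
ranked = sorted(zip(scores, corpus), reverse=True)
for score, doc in ranked[:2]:
    print(f"{score:.2f}  {doc}")
```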
Vector search retrieves stored matching information based on conceptual similarity, or the underlying meaning of sentences, rather than exact keyword matches. In vector search, machine learning models generate numeric representations of data, including text and images. Because the content is numeric rather than plain text, matching is based on the vectors most similar to the query vector, enabling matches on semantically related text, content across languages, and even mixed content types such as images and text.
With the rise of generative AI applications, and of interfaces built around dialogue and question-and-answer formats, vector search and vector databases have seen a dramatic rise in adoption. Embeddings are a specific type of vector representation created by natural language machine learning models trained to identify patterns and relationships between words.
Processing a vector search involves three steps: an embedding model converts the content (at indexing time) and the query (at search time) into vectors; the query vector is compared against the stored vectors using a similarity metric such as cosine similarity; and the closest matches are returned as results.
Things to consider when implementing vector search include the choice of embedding model, the similarity metric, the index type (exhaustive versus approximate nearest neighbor), and the storage and compute cost of the vectors; a minimal sketch follows.
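As a minimal sketch of the three steps, the example below indexes a few chunks and ranks them by cosine similarity to a query. The embed() function here is a toy stand-in for a real embedding model (for example, an Azure OpenAI embeddings deployment):

```python
# Minimal vector search sketch. embed() is a toy stand-in for a real
# embedding model; replace it to get meaningful semantic similarity.
import numpy as np

def embed(text: str, dim: int = 64) -> np.ndarray:
    """Toy bag-of-words hashing 'embedding' (consistent within one run)."""
    vec = np.zeros(dim)
    for word in text.lower().split():
        vec[hash(word) % dim] += 1.0
    return vec

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

# Step 1: embed each chunk once at indexing time.
chunks = [
    "Vector search matches on meaning rather than exact keywords.",
    "Chunking keeps inputs under the embedding model's token limit.",
]
index = [(chunk, embed(chunk)) for chunk in chunks]

# Steps 2 and 3: embed the query, compare, and return the closest match.
query_vec = embed("search by semantic meaning")
results = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
print(results[0][0])  # most similar chunk
```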
Chunking is the process of dividing large documents and text files into smaller parts to stay under the maximum token input limits for embedding models. Partitioning your content into chunks ensures that your data can be processed by the embedding models and that you don’t lose information due to truncation.
For example, the maximum length of input text for the Azure OpenAI Service text-embedding-ada-002 model is 8,191 tokens. Given that each token is around four characters of text for common OpenAI models, this maximum limit is equivalent to around 6,000 words of text. If you’re using these models to generate embeddings, it’s critical that the input text stays below the limit.
Documents are divided into smaller segments depending on the token limits of the embedding model, the natural structure of the content (sentences, paragraphs, sections), and the granularity the application needs at retrieval time.
When implementing chunking, it’s important to consider factors such as chunk size, the overlap between adjacent chunks, whether boundaries respect sentence and paragraph breaks, and the metadata carried with each chunk; a fixed-size chunker with overlap is sketched below.
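As a minimal sketch, assuming the tiktoken package (the tokenizer used for counting tokens with OpenAI models), the function below splits text into fixed-size token windows, repeating a few tokens between adjacent chunks so context isn’t lost at boundaries:

```python
# Fixed-size chunking with overlap, measured in tokens.
# Assumes the tiktoken package: pip install tiktoken
import tiktoken

def chunk_text(text: str, max_tokens: int = 500, overlap: int = 50) -> list[str]:
    """Split text into chunks of at most max_tokens tokens, with
    `overlap` tokens repeated between adjacent chunks."""
    enc = tiktoken.get_encoding("cl100k_base")  # encoding used by ada-002
    tokens = enc.encode(text)
    chunks = []
    step = max_tokens - overlap
    for start in range(0, len(tokens), step):
        window = tokens[start:start + max_tokens]
        chunks.append(enc.decode(window))
        if start + max_tokens >= len(tokens):
            break
    return chunks
```

Keeping max_tokens well under the embedding model’s limit (8,191 tokens for text-embedding-ada-002) leaves headroom and tends to produce more focused, retrievable chunks.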
Hybrid search combines keyword search and vector search results and fuses them together using a scoring algorithm. A common model is reciprocal rank fusion (RRF). When two or more queries are executed in parallel, RRF evaluates the search scores to produce a unified result set.
For generative AI applications and scenarios, hybrid search often refers to the ability to search both full text and vector data.
The process of hybrid search involves executing the keyword query and the vector query in parallel over the same index, scoring each result set independently, and fusing the two ranked lists (for example, with RRF) into a single unified result set.
When implementing hybrid search, consider how the two result sets are weighted relative to each other, the latency cost of running queries in parallel, and whether your search service supports both query types over the same index; a minimal RRF implementation is sketched below.
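As a minimal sketch of reciprocal rank fusion: a document’s fused score is the sum of 1/(k + rank) over every ranked list it appears in, where k is a smoothing constant (60 is the commonly cited default). Documents retrieved by both queries naturally rise to the top:

```python
# Reciprocal rank fusion (RRF): fuse ranked lists from parallel queries.
from collections import defaultdict

def rrf(ranked_lists: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Each inner list is doc IDs ordered best-first. A document's fused
    score is the sum over lists of 1 / (k + rank), with rank starting at 1."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

keyword_hits = ["doc_a", "doc_b", "doc_c"]   # from the full-text query
vector_hits = ["doc_b", "doc_d", "doc_a"]    # from the vector query
print(rrf([keyword_hits, vector_hits]))
# doc_b and doc_a score highest because both queries retrieved them
```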
Query rewriting is an important technique used in RAG to enhance the quality and relevance of the information retrieved by modifying and augmenting a provided user query. Query rewriting creates variations of the same query that are shared with the retriever simultaneously, alongside the original query. This helps remediate poorly phrased questions and casts a broader net for the type of knowledge collected for a single query.
In RAG systems, rewriting helps improve recall and better capture user intent. It’s performed pre-retrieval, before the information retrieval step.
Query rewriting can be approached in several ways: paraphrasing the query into clearer variants, decomposing a complex question into focused sub-queries, or generating a hypothetical answer and embedding it for retrieval (the HyDE technique).
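As a minimal sketch of the paraphrasing approach, the function below asks a model for rephrased variants and sends them to the retriever alongside the original query. The complete() function is a placeholder for any chat-model call (for example, a deployment in Azure AI Foundry), and the prompt is illustrative rather than a prescribed template:

```python
# Query rewriting sketch: generate variants of a user query for retrieval.
def complete(prompt: str) -> str:
    """Placeholder for a real chat-model call. This stub returns canned
    rewrites so the sketch runs end to end; wire up your model here."""
    return (
        "what techniques does RAG use for retrieval\n"
        "how does retrieval augmented generation find documents\n"
        "RAG knowledge retrieval methods"
    )

def rewrite_query(user_query: str, n_variants: int = 3) -> list[str]:
    """Return the original query plus n rephrased variants, all of which
    are sent to the retriever in parallel."""
    prompt = (
        f"Rewrite the search query below in {n_variants} different ways, "
        "fixing unclear phrasing and adding likely synonyms. "
        "Return one rewrite per line.\n\n"
        f"Query: {user_query}"
    )
    variants = [line.strip() for line in complete(prompt).splitlines() if line.strip()]
    return [user_query] + variants[:n_variants]

print(rewrite_query("how do RAG retrievers work"))
```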
Re-ranking, or L2 ranking, uses the context or semantic meaning of a query to compute a new relevance score over pre-ranked results. Post-retrieval, the retrieval system passes its search results to a ranking machine-learning model that scores each document (or text chunk) by relevance. Then only a limited, defined number of top results (the top 50, top 10, or top 3) are shared with the LLM, as in the sketch below.
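As a minimal sketch, assuming the open-source sentence-transformers package (managed rankers, such as the semantic ranker in Azure AI Search, play the same role), a cross-encoder scores each query-document pair and only the top results are kept:

```python
# Re-ranking sketch with a cross-encoder.
# Assumes sentence-transformers: pip install sentence-transformers
from sentence_transformers import CrossEncoder

# Example open model; any cross-encoder trained for relevance works.
model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[str], top_k: int = 3) -> list[str]:
    """Score each (query, chunk) pair jointly and keep the top_k chunks
    to pass to the LLM as context."""
    scores = model.predict([(query, chunk) for chunk in candidates])
    ranked = sorted(zip(scores, candidates), reverse=True)
    return [chunk for _, chunk in ranked[:top_k]]
```

Unlike the first-stage retriever, the cross-encoder reads the query and each candidate together, which is slower but typically more accurate, which is why it’s applied only to a small pre-ranked set.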
RAG systems employ various techniques to enhance knowledge retrieval and improve the quality of generated responses. These techniques work together to provide language models with highly relevant context for generating accurate and informative responses.
To get started, use the following resources to build a RAG application with Azure AI Foundry and to pair it with agents built using Microsoft Copilot Studio.
Organizations across industries are leveraging Azure AI Foundry and Microsoft Copilot Studio capabilities to drive growth, increase productivity, and create value-added experiences.
We’re committed to helping organizations use and build AI that is trustworthy, meaning it is secure, private, and safe. We bring best practices and learnings from decades of researching and building AI products at scale to provide industry-leading commitments and capabilities that span our three pillars of security, privacy, and safety. Trustworthy AI is only possible when you combine our commitments, such as our Secure Future Initiative and our Responsible AI principles, with our product capabilities to unlock AI transformation with confidence.
Azure remains steadfast in its commitment to Trustworthy AI, with security, privacy, and safety as priorities. Check out the 2024 Responsible AI Transparency Report.