Key Components of Vectorization
- Ingestion: The vectorization process begins with data ingestion, where content from various sources is collected, processed, and transformed into embeddings. Ingested content may include PDFs, web pages, structured data feeds, or other relevant documents, allowing the framework to handle both static and dynamic information.
- Chunking: For lengthy documents such as PDFs, the framework breaks the text into manageable segments. Each chunk is vectorized independently, so agents can retrieve the specific sections of a document most relevant to a query, making responses more targeted and precise.
- External Data Sources: In addition to static data, the RAG framework can ingest information from real-time external sources, such as RSS feeds or APIs, enabling it to keep content up-to-date and ensure users receive the most current and relevant information.
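The chunking step above can be sketched as a fixed-size character window with overlap; real pipelines often split on sentence or token boundaries instead, but the shape is the same. The function name and parameters here are illustrative, not from the framework itself:

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into overlapping windows of at most chunk_size characters.

    The overlap keeps passages that straddle a chunk boundary retrievable
    from at least one chunk.
    """
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
        if start + chunk_size >= len(text):
            break
    return chunks
```

Each returned chunk would then be passed to the embedding model and stored in the vector database alongside a pointer back to its source document.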
Vectorization Models and Techniques
To generate high-quality embeddings and perform accurate retrieval, the RAG framework uses a combination of advanced models and techniques, each optimizing a different aspect of the vectorization process:
- OpenAI Embeddings: OpenAI models provide embeddings that capture semantic meaning across a wide range of topics, enabling the RAG framework to understand and respond to complex queries effectively.
- Bi-Encoder Models: Bi-Encoder models generate embeddings for both queries and documents independently, enabling efficient similarity matching. This setup is particularly useful for large datasets, as it allows for fast retrieval by matching query vectors with document vectors in the vector database.
- BM25 and Term Frequency (TF): Traditional retrieval algorithms like BM25 and Term Frequency (TF) help rank documents based on word frequency and relevance. These algorithms provide an initial set of document matches at a lexical level, which are then refined using vector similarity.
- Reranker: After initial retrieval, the reranker model re-evaluates results for improved relevance. By using additional context from the query, the reranker adjusts the ranking of documents, enhancing specificity in the retrieved information.
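Because a Bi-Encoder embeds queries and documents independently, retrieval reduces to comparing precomputed vectors, typically with cosine similarity. A minimal sketch, assuming the embeddings have already been computed (the toy 2-d vectors below stand in for real model output):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 = identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve(query_vec: list[float], doc_vecs: list[list[float]], top_k: int = 2) -> list[int]:
    """Return the indices of the top_k documents most similar to the query."""
    scored = sorted(
        ((cosine_similarity(query_vec, v), i) for i, v in enumerate(doc_vecs)),
        reverse=True,
    )
    return [i for _, i in scored[:top_k]]
```

In production, the exhaustive scan shown here is replaced by the vector database's approximate nearest-neighbor index, but the similarity measure is the same.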
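The lexical side of retrieval can be made concrete with the standard Okapi BM25 formula, which rewards term frequency while discounting common terms and overly long documents. A minimal sketch over pre-tokenized documents, using the textbook parameter defaults:

```python
import math

def bm25_scores(query_terms: list[str], docs: list[list[str]],
                k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score each tokenized document against the query with Okapi BM25."""
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n  # average document length
    scores = [0.0] * n
    for term in query_terms:
        df = sum(1 for d in docs if term in d)  # document frequency
        if df == 0:
            continue
        # Inverse document frequency: rare terms contribute more.
        idf = math.log(1 + (n - df + 0.5) / (df + 0.5))
        for i, d in enumerate(docs):
            tf = d.count(term)  # term frequency in this document
            # Saturating tf, normalized by document length.
            denom = tf + k1 * (1 - b + b * len(d) / avgdl)
            scores[i] += idf * tf * (k1 + 1) / denom
    return scores
```

These lexical scores give the initial candidate set; vector similarity then refines it, as the section describes.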
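The reranking stage only needs to re-score the small candidate set from initial retrieval, so it can afford a more expensive model that reads query and document together. In this sketch, `score_fn` is a hypothetical stand-in for such a cross-encoder; any callable taking a query and a document works:

```python
from typing import Callable

def rerank(query: str, candidates: list[str],
           score_fn: Callable[[str, str], float], top_k: int = 3) -> list[str]:
    """Re-order an initial candidate list by a more expensive relevance score.

    Unlike first-stage retrieval, score_fn sees the query and each candidate
    jointly, so it can use context that independent embeddings miss.
    """
    rescored = sorted(candidates, key=lambda doc: score_fn(query, doc), reverse=True)
    return rescored[:top_k]
```

A toy scorer such as word overlap already demonstrates the flow; swapping in a trained cross-encoder changes only `score_fn`, not the pipeline.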