Introduction to Data Ingestion in Kosmoy Studio
Learn how to use Data Ingestion to prepare and process your unstructured data for use in AI applications, including vectorization for RAG.
Working with Data Ingestion
The Data Ingestion section of Kosmoy Studio provides the tools you need to transform your unstructured data into a format suitable for use in your AI applications. Currently, the focus is on vectorization, a crucial step in building Retrieval-Augmented Generation (RAG) systems.
Key Features:
- Vector Pipelines: Create and manage pipelines that transform unstructured data (PDFs and Office files) from your object stores into vector embeddings stored in your vector databases.
- Support for RAG: Prepare your data for use in RAG applications, enabling your AI Assistants to retrieve relevant information from your knowledge base and generate more accurate and context-aware responses.
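To make the retrieval idea concrete, here is a minimal sketch of how a RAG application might rank stored chunks against a query vector using cosine similarity. It is illustrative only and does not represent how Kosmoy Studio or any particular vector database implements search. The function names and the hand-written three-dimensional vectors are assumptions made for the example.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vector: list[float],
          stored: list[tuple[str, list[float]]],
          k: int = 2) -> list[tuple[float, str]]:
    """Return the k stored chunks most similar to the query, best first."""
    scored = [(cosine_similarity(query_vector, vec), text) for text, vec in stored]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Hypothetical chunk texts with made-up vectors (real embeddings have
# hundreds or thousands of dimensions).
stored_chunks = [
    ("refund policy", [1.0, 0.0, 0.0]),
    ("shipping times", [0.0, 1.0, 0.0]),
    ("refund form", [0.9, 0.1, 0.0]),
]

results = top_k([1.0, 0.0, 0.0], stored_chunks, k=2)
```

In a RAG application, the text of the top-ranked chunks would then be passed to the language model as context for generating its answer.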
Current Capabilities:
At present, Data Ingestion in Kosmoy Studio is centered around the creation and management of Vector Pipelines. These pipelines perform the following actions:
- Source Data: Read PDF and Office files from a specified folder within a registered Object Store.
- Chunking: Divide the documents into smaller, manageable chunks based on a defined strategy.
- Vectorization: Generate vector embeddings for each chunk using a selected embeddings model.
- Storage: Store the generated vectors in a target collection within a registered Vector Database.
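The four actions above can be sketched in plain Python. This is a conceptual illustration under stated assumptions, not Kosmoy Studio's API or implementation: the `documents` dictionary stands in for text already extracted from files in an object store, `toy_embed` is a hash-based placeholder for a real embeddings model, and `InMemoryCollection` stands in for a collection in a vector database. All names are hypothetical.

```python
import hashlib
import math
from dataclasses import dataclass, field


def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Fixed-size character chunking with overlap (one possible strategy)."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
        start += step
    return chunks


def toy_embed(chunk: str, dim: int = 16) -> list[float]:
    """Placeholder embedding: hash each token into a bucket, then L2-normalize.

    A real pipeline would call an embeddings model here instead.
    """
    vec = [0.0] * dim
    for token in chunk.lower().split():
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        vec[digest[0] % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


@dataclass
class InMemoryCollection:
    """Stand-in for a target collection in a vector database."""
    records: list[dict] = field(default_factory=list)

    def add(self, doc_id: str, chunk_index: int, text: str, vector: list[float]) -> None:
        self.records.append(
            {"doc_id": doc_id, "chunk_index": chunk_index, "text": text, "vector": vector}
        )


def run_pipeline(documents: dict[str, str], collection: InMemoryCollection,
                 chunk_size: int = 200, overlap: int = 50) -> int:
    """Chunk, embed, and store every document; return the number of vectors stored."""
    stored = 0
    for doc_id, text in documents.items():        # source data
        for index, chunk in enumerate(chunk_text(text, chunk_size, overlap)):  # chunking
            vector = toy_embed(chunk)             # vectorization
            collection.add(doc_id, index, chunk, vector)  # storage
            stored += 1
    return stored


# Hypothetical input: text extracted from one file.
documents = {"handbook.pdf": "x" * 500}
collection = InMemoryCollection()
stored_count = run_pipeline(documents, collection, chunk_size=200, overlap=50)
```

The overlap between consecutive chunks is a common design choice: it reduces the chance that a sentence split across a chunk boundary loses its context in both chunks.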
Future Enhancements:
The Data Ingestion capabilities of Kosmoy Studio will continue to evolve, with plans to support additional data types, ingestion methods, and data processing techniques.
This section covers:
- What Vector Pipelines are and their role in RAG.
- How to create and configure Vector Pipelines.
- How to manage existing Vector Pipelines.
With Data Ingestion, you can prepare your unstructured data for AI applications that rely on techniques such as Retrieval-Augmented Generation.