Vector Pipelines
Learn how to create, manage, run, and monitor Vector Pipelines to transform unstructured data into vector embeddings for use in RAG applications.
Working with Vector Pipelines
Vector Pipelines, found in the Data Ingestion section of Kosmoy Studio, prepare unstructured data for use in Retrieval Augmented Generation (RAG) and other AI applications. They automate the transformation of documents (PDFs and Office files) into vector embeddings, which your AI Assistants can then query efficiently.
Creating a Vector Pipeline
Here’s a step-by-step guide on how to create a Vector Pipeline:
- Navigate to Vector Pipelines:
  - Click on the “Data Management” icon in the left-hand navigation bar.
  - Select “Data Ingestion” and then “Vector Pipelines”.
- Add a New Vector Pipeline: Click the “+ ADD” button located in the upper right corner of the Vector Pipelines section.
- Select Source Object Store and Folder:
  - Object Store: Choose a pre-registered Object Store from the dropdown menu. This is where your source documents (PDFs and Office files) are located.
  - Folder: Select the specific Folder within the Object Store that contains the documents you want to vectorize.
- Configure Chunking Strategy: Define how the documents should be divided into smaller chunks. Choose the appropriate settings for chunk size and overlap based on your specific needs and the nature of your documents.
- Select Embeddings Model: Choose a pre-registered embeddings model from the dropdown menu. This model will be used to generate vector representations of the text chunks.
- Select Target Vector Database and Collection:
  - Vector Database: Choose a pre-registered Vector Database where the generated embeddings will be stored.
  - Collection: Select the target Collection within the Vector Database.
- Click “Next”: Proceed to the next step.
- Name and Describe the Pipeline: Give your Vector Pipeline a unique name and an optional description.
- Click “Review”: Review the pipeline configuration.
- Click “Create”: Create the Vector Pipeline.
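To make the chunk size and overlap settings from the chunking step more concrete, here is a minimal sketch of fixed-size chunking with overlap. It is an illustration only: the function name `chunk_text` and the choice to measure size in characters are assumptions, and the chunking strategies that Kosmoy Studio actually offers may work differently (for example, by tokens or sentences).

```python
def chunk_text(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Split text into chunks of up to `chunk_size` characters.

    Consecutive chunks share `overlap` characters, so content that falls
    on a chunk boundary still appears intact in at least one chunk.
    Hypothetical helper, not part of any Kosmoy API.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be at least 0 and less than chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


print(chunk_text("abcdefghij", chunk_size=4, overlap=1))
# → ['abcd', 'defg', 'ghij']
```

A larger overlap preserves more context across chunk boundaries, at the cost of producing more chunks and therefore more embeddings to store.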
Vector Pipeline Cards
The Vector Pipelines section displays each created pipeline as a card. Each card shows:
- Pipeline Name: The name you assigned to the pipeline.
- Description: The description you provided for the pipeline.
- Status: Indicates the current status of the pipeline (e.g., Created, Running, Completed, Failed).
Clicking a card opens the Pipeline Details Page, where you can run, monitor, and stop the pipeline.
Vector Pipeline Editing and Deletion Restrictions
You cannot edit or delete a Vector Pipeline while it is in a “Running” status.
- Editing: If you attempt to edit a pipeline that is currently running, a warning banner will be displayed at the top of the screen, preventing any modifications.
- Deleting: If you attempt to delete a pipeline that is currently running, a modal will appear, preventing the deletion and explaining that the pipeline is in a “Running” status.
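The restriction above amounts to a simple status check. The sketch below models it with a hypothetical `PipelineStatus` enum and `can_modify` helper. The status names come from the examples listed on the pipeline cards, which may not be the complete set, and this is not how Kosmoy Studio implements the rule internally.

```python
from enum import Enum


class PipelineStatus(Enum):
    # Example statuses shown on pipeline cards; the real list may be longer.
    CREATED = "Created"
    RUNNING = "Running"
    COMPLETED = "Completed"
    FAILED = "Failed"


def can_modify(status: PipelineStatus) -> bool:
    """Return True if a pipeline in this status may be edited or deleted."""
    return status is not PipelineStatus.RUNNING


for status in PipelineStatus:
    print(f"{status.value}: edit/delete allowed = {can_modify(status)}")
```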
Updating a Vector Pipeline
You can update the configuration of a Vector Pipeline only if it is not running.
- Navigate to Vector Pipelines: Go to “Data Management” > “Data Ingestion” > “Vector Pipelines”.
- Locate the Vector Pipeline Card: Find the card for the pipeline you want to update.
- Click the Edit (Pencil) Icon: This will open the update dialog.
- Modify Pipeline Parameters: Update the pipeline’s configuration as needed. You can change the source, chunking strategy, embeddings model, target database/collection, name and description.
- Click “Save”: Save the changes.
Removing a Vector Pipeline
You can remove a registered Vector Pipeline if it’s no longer needed. However, you cannot delete a pipeline that is currently in a “Running” status.
- Navigate to Vector Pipelines: Go to “Data Management” > “Data Ingestion” > “Vector Pipelines”.
- Locate the Vector Pipeline Card: Find the card for the pipeline you want to remove.
- Click the Delete (Trash Bin) Icon: This will trigger a confirmation prompt.
  If the pipeline is currently running, a modal appears that prevents the deletion and explains that the pipeline is in a “Running” status.
- Confirm Deletion: Confirm that you want to delete the pipeline.
Warning: Deleting a Vector Pipeline is a permanent action and cannot be undone.
Running and Monitoring a Vector Pipeline
To run, monitor, and stop a Vector Pipeline:
- Navigate to Vector Pipelines: Go to “Data Management” > “Data Ingestion” > “Vector Pipelines”.
- Click on the Pipeline Card: Click on the card of the Vector Pipeline you want to manage. This will take you to the Pipeline Details Page.
Pipeline Details Page
The Pipeline Details Page provides detailed information about a specific Vector Pipeline and allows you to manage its execution.
Top Section:
- Pipeline Name: The name of the selected pipeline.
- Description: The description of the selected pipeline.
Main Action Buttons (Top Right):
- “Run” Button: Click this button to start a new run of the pipeline.
- “Back” Button: Click this button to return to the Vector Pipelines collection page.
Runs List:
This section displays a list of all runs for the selected pipeline.
- Refresh Button: Click this button to refresh the list of runs and see the latest status.
- “STOP RUN” Button: Click this button to stop a pipeline run that is in progress. This button is active only while a run is running.
Run Information:
Each run in the list displays the following information:
- Run Status: The status of the run (e.g., Running, Completed, Failed).
- Start Time: The date and time the run started.
- End Time: The date and time the run finished (if applicable).
- Duration: The total duration of the run.
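As a rough illustration of how these fields relate, the sketch below models a run as a small record and derives the duration from the start and end times. The `Run` class and its field names are hypothetical, and treating the duration of an unfinished run as the time elapsed so far is an assumption, not documented behavior.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Run:
    status: str
    start_time: datetime
    end_time: Optional[datetime] = None  # None while the run is in progress

    def duration(self, now: Optional[datetime] = None) -> timedelta:
        """End time minus start time; for an unfinished run, time elapsed so far."""
        end = self.end_time or now or datetime.now()
        return end - self.start_time


finished = Run("Completed", datetime(2024, 1, 1, 12, 0), datetime(2024, 1, 1, 12, 5))
print(finished.duration())  # → 0:05:00
```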
Using Vector Pipelines
When you run a Vector Pipeline, it processes the documents in the specified source folder, generates embeddings, and stores them in the target Vector Database collection. This process typically runs in the background.
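The three stages described here (read documents, generate embeddings, store them) can be sketched in a few lines. Everything in this example is a stand-in: `toy_embed` is a placeholder for a real embeddings model, the returned list plays the role of a Vector Database collection, and the documents are plain strings rather than parsed PDFs or Office files. It shows the shape of the flow, not Kosmoy Studio's implementation.

```python
from typing import Callable

Vector = list[float]


def toy_embed(chunk: str) -> Vector:
    # Placeholder for a real embeddings model: [character count, vowel count].
    vowels = sum(c in "aeiou" for c in chunk.lower())
    return [float(len(chunk)), float(vowels)]


def run_pipeline(
    documents: dict[str, str],
    chunk_size: int,
    embed: Callable[[str], Vector],
) -> list[dict]:
    """Chunk each document, embed each chunk, and collect the records."""
    collection = []  # stands in for the target Vector Database collection
    for name, text in documents.items():
        for start in range(0, len(text), chunk_size):
            chunk = text[start:start + chunk_size]
            collection.append({"source": name, "text": chunk, "vector": embed(chunk)})
    return collection


records = run_pipeline({"a.txt": "hello world"}, chunk_size=5, embed=toy_embed)
print(len(records))          # → 3
print(records[0]["vector"])  # → [5.0, 2.0]
```

Keeping the source name and chunk text alongside each vector is what lets a retrieval step return readable passages, not just numeric matches.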
The generated vector embeddings can then be used in Data Channels to enable Retrieval Augmented Generation (RAG) capabilities in your AI Assistants.