Importing Vector Data

Importing vector data is the process of ingesting data into the CSS vector database. When writing vector data to a vector index, you need to specify the vector field (for example, my_vector) and the corresponding data format. The CSS vector database supports two common formats: floating-point arrays and Base64.

Choose a format based on the characteristics of your data, and choose an appropriate import method.
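
As a rough illustration of the two formats, the following Python sketch builds the same document body once with a floating-point array and once with a Base64 string. The byte layout used here (little-endian 32-bit floats) is an assumption and is not stated in this section; confirm the expected encoding for your cluster before relying on it. The helper names are hypothetical.

```python
import base64
import struct


def encode_vector_base64(vector):
    """Pack floats as little-endian float32 (assumed layout) and Base64-encode them."""
    packed = struct.pack(f"<{len(vector)}f", *vector)
    return base64.b64encode(packed).decode("ascii")


def decode_vector_base64(encoded):
    """Reverse of encode_vector_base64, useful for checking a round trip."""
    raw = base64.b64decode(encoded)
    return list(struct.unpack(f"<{len(raw) // 4}f", raw))


vector = [1.0, 2.0, 0.5]

# Same vector, two representations of the my_vector field.
doc_float_array = {"my_vector": vector}
doc_base64 = {"my_vector": encode_vector_base64(vector)}
```

The floating-point array is easier to read and debug, while the Base64 string is more compact for high-dimensional vectors.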

Constraints

Importing a Single Record

Bulk Import

For details about how to use the Bulk API, see Bulk API.
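
As a minimal sketch of what a bulk request body looks like, the following Python function assembles the newline-delimited JSON that the Bulk API accepts: one action line followed by one document line per record, with a trailing newline. The index name my_index and the function name are hypothetical; the vector field my_vector follows the example used earlier in this section.

```python
import json


def build_bulk_body(index_name, docs):
    """Return an NDJSON bulk body: an 'index' action line before each document."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index_name}}))
        lines.append(json.dumps(doc))
    # The Bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"


bulk_body = build_bulk_body(
    "my_index",  # hypothetical index name
    [{"my_vector": [1.0, 2.0]}, {"my_vector": [3.0, 4.0]}],
)
```

The resulting string can be sent as the body of a POST request to the _bulk endpoint using any HTTP client.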

(Optional) Post-processing after Data Ingestion: Offline Index Building

  • Build the index offline through the API only when real-time data is not critical and the cluster version is OpenSearch 2.19.0.
  • If lazy_indexing is enabled, offline index building must be performed after data ingestion. Otherwise, standard vector queries fail with error code 500 and the error message "Load native index failed exception." To avoid this problem, perform offline index building before running vector queries.

OpenSearch uses a model similar to an LSM (Log-Structured Merge) tree to accelerate write operations. As data is continuously written and updated, numerous small index segments are generated and later merged by a background task to improve query performance. Because vector indexing is computationally intensive, frequent index merging during vector data writes consumes significant CPU resources. Therefore, where real-time data is not crucial, it is advisable to set lazy_indexing to true for vector fields. This allows the final vector index to be built through a non-real-time API after all data has been written. This approach significantly reduces the number of index merges, improving overall write and merge performance.

Offline index building consists of two steps:

  1. Merge index segments.
  2. Create the final vector index based on the final index segments.

The API used for offline index building is as follows:

POST _vector/indexing/{index_name}
{
  "field": "{field_name}"
}

Here, {index_name} is the name of the index in which the vector index is to be built, and {field_name} is the name of the vector field for which lazy_indexing has been set to true.
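
For scripted workflows, the request above can be assembled programmatically. The following Python sketch only builds the method, path, and JSON body; it does not send anything. The function name and the index name my_index are hypothetical, and the field name my_vector follows the example used earlier in this section.

```python
import json


def build_offline_indexing_request(index_name, field_name):
    """Assemble the offline index building request for one lazy_indexing field."""
    return {
        "method": "POST",
        "path": f"/_vector/indexing/{index_name}",
        "body": json.dumps({"field": field_name}),
    }


request = build_offline_indexing_request("my_index", "my_vector")
```

Run this only after all data has been written, and issue one request for each vector field that has lazy_indexing set to true.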