Importing vector data means writing it into the CSS vector database. When you write vector data to a vector index, specify the vector field (for example, my_vector) and supply its value in a supported format. The CSS vector database supports two formats: floating-point arrays and Base64-encoded strings.
Choose the format that suits your data, and choose a matching import method: write documents one at a time, or write them in batches with the Bulk API.
Writing a single document, floating-point array format:
POST my_index/_doc
{
  "my_vector": [1.0, 2.0]
}
Writing a single document, Base64 format:
POST my_index/_doc
{
  "my_vector": "AACAPwAAAEA="
}
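The Base64 sample value is consistent with the vector's components packed as little-endian 32-bit floats and then Base64-encoded. The following is a minimal Python sketch of that encoding. The byte layout is an assumption inferred from the sample values on this page, so confirm it for your cluster before relying on it. The helper names encode_vector and decode_vector are hypothetical.

```python
import base64
import struct


def encode_vector(values):
    """Pack floats as little-endian float32 (assumed layout) and Base64-encode."""
    raw = struct.pack(f"<{len(values)}f", *values)
    return base64.b64encode(raw).decode("ascii")


def decode_vector(text):
    """Reverse of encode_vector: Base64 string back to a list of floats."""
    raw = base64.b64decode(text)
    count = len(raw) // 4  # 4 bytes per float32 component
    return list(struct.unpack(f"<{count}f", raw))


print(encode_vector([1.0, 2.0]))  # AACAPwAAAEA=
```

Base64 is typically more compact than a JSON array of decimal numbers for high-dimensional vectors, at the cost of readability.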
Writing documents in batches, floating-point array format:
POST my_index/_bulk
{"index": {}}
{"my_vector": [1.0, 2.0], "my_label": "red"}
{"index": {}}
{"my_vector": [2.0, 2.0], "my_label": "green"}
{"index": {}}
{"my_vector": [2.0, 3.0], "my_label": "red"}
Writing documents in batches, Base64 format:
POST my_index/_bulk
{"index":{}}
{"my_vector":"AACAPwAAAEA=", "my_label": "red"}
{"index":{}}
{"my_vector":"AAAAQAAAAEA=", "my_label": "green"}
{"index":{}}
{"my_vector":"AAAAQAAAQEA=", "my_label": "red"}
For details about how to use the Bulk API, see Bulk API.
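A bulk request body is newline-delimited JSON: each document line is preceded by an action line, and the body must end with a newline. The following is a minimal Python sketch that assembles such a body from a list of documents. It only builds the string and does not send it. The function name build_bulk_body is hypothetical.

```python
import json


def build_bulk_body(docs):
    """Return an NDJSON bulk body: one {"index": {}} action line per document."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {}}))  # action line
        lines.append(json.dumps(doc))            # document source line
    # The Bulk API expects the body to be terminated by a newline.
    return "\n".join(lines) + "\n"


docs = [
    {"my_vector": [1.0, 2.0], "my_label": "red"},
    {"my_vector": [2.0, 2.0], "my_label": "green"},
    {"my_vector": [2.0, 3.0], "my_label": "red"},
]
body = build_bulk_body(docs)
print(body, end="")
```

The resulting string can be sent as the body of POST my_index/_bulk with any HTTP client, using the content type application/x-ndjson.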
OpenSearch uses an LSM (log-structured merge) tree-like model to accelerate write operations. As data is continuously written and updated, many small index segments are generated, and a background task later merges them to improve query performance. Because vector indexing is computationally intensive, frequent segment merging while vector data is being written consumes significant CPU resources. Where newly written data does not need to be searchable in real time, it is therefore advisable to set lazy_indexing to true for vector fields. The final vector index can then be built through a non-real-time API after all data has been written. This approach greatly reduces the number of index merges, which improves both write performance and merge performance.
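Offline index building consists of two steps:
1. Set lazy_indexing to true for the vector field, and write all the data.
2. Call the offline index building API to create the final vector index.
The API used for offline index building is as follows: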
POST _vector/indexing/{index_name}
{
  "field": "{field_name}"
}
Here, {index_name} is the name of the index for which the vector index is to be built, and {field_name} is the name of the vector field for which lazy_indexing has been set to true.
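The following is a minimal Python sketch that fills in the two placeholders of the offline index building request. It only constructs the method, path, and JSON body shown above and does not send anything. The function name build_offline_indexing_request and the sample values my_index and my_vector are illustrative assumptions.

```python
import json
from urllib.parse import quote


def build_offline_indexing_request(index_name, field_name):
    """Return (method, path, body) for the offline index building call."""
    # Percent-encode the index name so it is safe to place in a URL path.
    path = f"/_vector/indexing/{quote(index_name, safe='')}"
    body = json.dumps({"field": field_name})
    return "POST", path, body


method, path, body = build_offline_indexing_request("my_index", "my_vector")
print(method, path)
print(body)
```

Issue this request only after all data has been written, because the purpose of lazy_indexing is to build the vector index once rather than during every merge.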