forked from docs/doc-exports
Reviewed-by: Pruthi, Vineet <vineet.pruthi@t-systems.com> Co-authored-by: zhengxiu <zhengxiu@huawei.com> Co-committed-by: zhengxiu <zhengxiu@huawei.com>
<a name="EN-US_TOPIC_0000002333390490"></a>
<h1 class="topictitle1">Importing Vector Data</h1>
<div id="body0000002333390490"><p id="EN-US_TOPIC_0000002333390490__p015512713407">Importing vector data means ingesting data into the CSS vector database. When you write vector data to a vector index, specify the vector field (for example, <strong id="EN-US_TOPIC_0000002333390490__b788950244113121">my_vector</strong>) and provide the vector in a supported data format. The CSS vector database supports two common formats: floating-point arrays and Base64-encoded strings.</p>
<ul id="EN-US_TOPIC_0000002333390490__ul12570134934217"><li id="EN-US_TOPIC_0000002333390490__li1857084917423">Floating-point array: The vector is sent directly as a human-readable array.</li><li id="EN-US_TOPIC_0000002333390490__li11570124910425">Base64: The vector is encoded (in little-endian byte order) as a character string. This reduces network transmission overhead and handles high-dimensional or binary vectors more efficiently.</li></ul>
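<p>The Base64 format can be produced on the client side with a few lines of code. The following is a minimal sketch, assuming each vector component is a 32-bit float (the example string <strong>AACAPwAAAEA=</strong> used in this topic for [1.0, 2.0] is consistent with that assumption). The function name <strong>encode_vector_base64</strong> is hypothetical and is not part of any CSS API.</p>

```python
import base64
import struct


def encode_vector_base64(vector):
    """Pack floats as little-endian 32-bit values, then Base64-encode them.

    Assumption: the vector field stores 32-bit floats.
    """
    # "<" forces little-endian byte order; "f" is a 32-bit float.
    packed = struct.pack(f"<{len(vector)}f", *vector)
    return base64.b64encode(packed).decode("ascii")


encoded = encode_vector_base64([1.0, 2.0])
print(encoded)  # AACAPwAAAEA=
```

<p>Using an explicit little-endian format character avoids depending on the byte order of the client machine.</p>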
<p id="EN-US_TOPIC_0000002333390490__p77872510388">Choose a format based on the characteristics of your data, and then choose an appropriate import method:</p>
<ul id="EN-US_TOPIC_0000002333390490__ul11613101718129"><li id="EN-US_TOPIC_0000002333390490__li136131817181217">Single-record import: Use for small-scale applications or testing.</li><li id="EN-US_TOPIC_0000002333390490__li15613191771215">Bulk import: Use for large-scale applications. Multiple write requests are merged into one request to reduce network overhead.</li></ul>
<div class="section" id="EN-US_TOPIC_0000002333390490__section6903614172518"><h4 class="sectiontitle">Constraints</h4><ul id="EN-US_TOPIC_0000002333390490__ul187041447152514"><li id="EN-US_TOPIC_0000002333390490__li107049478251">Ensure that the vector field name and vector dimension of each record are consistent with those defined for the index.</li><li id="EN-US_TOPIC_0000002333390490__li8704164792512">Base64 encoding must use the little-endian byte order. Otherwise, parsing errors may occur.</li><li id="EN-US_TOPIC_0000002333390490__li17043472253">For bulk imports, you are advised to submit 100 to 1,000 records per request. This balances throughput and memory overhead.</li></ul>
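<p>The dimension and batch-size constraints above can be enforced on the client before any request is sent. The following is a minimal sketch for records that use the floating-point array format. The helper names <strong>check_dimension</strong> and <strong>split_into_batches</strong>, and the default batch size of 500, are illustrative assumptions chosen from within the recommended range.</p>

```python
def check_dimension(records, field, expected_dim):
    """Raise ValueError if any record's vector length differs from expected_dim."""
    for position, record in enumerate(records):
        actual = len(record[field])
        if actual != expected_dim:
            raise ValueError(
                f"record {position}: expected {expected_dim} dimensions, got {actual}"
            )


def split_into_batches(records, batch_size=500):
    """Yield consecutive slices of at most batch_size records."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(records), batch_size):
        yield records[start:start + batch_size]


# 1200 two-dimensional sample records (made-up data for illustration).
records = [{"my_vector": [float(i), float(i + 1)]} for i in range(1200)]
check_dimension(records, "my_vector", expected_dim=2)
batch_sizes = [len(batch) for batch in split_into_batches(records)]
print(batch_sizes)  # [500, 500, 200]
```

<p>Each batch would then be submitted as one bulk request.</p>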
</div>
<div class="section" id="EN-US_TOPIC_0000002333390490__section112382921217"><h4 class="sectiontitle">Importing a Single Record</h4><ul id="EN-US_TOPIC_0000002333390490__en-us_topic_0000002353678449_ul1447115404176"><li id="EN-US_TOPIC_0000002333390490__en-us_topic_0000002353678449_li154711140131720">Floating-point array<pre class="screen" id="EN-US_TOPIC_0000002333390490__en-us_topic_0000002353678449_screen187218394184">POST <i><span class="varname" id="EN-US_TOPIC_0000002333390490__en-us_topic_0000002353678449_varname122034371812">my_index</span></i>/_doc
{
  "my_vector": [1.0, 2.0]
}</pre>
</li><li id="EN-US_TOPIC_0000002333390490__en-us_topic_0000002353678449_li1747144071720">Base64<pre class="screen" id="EN-US_TOPIC_0000002333390490__en-us_topic_0000002353678449_screen5659104710189">POST <i><span class="varname" id="EN-US_TOPIC_0000002333390490__en-us_topic_0000002353678449_varname1197011507187">my_index</span></i>/_doc
{
  "my_vector": "AACAPwAAAEA="
}</pre>
</li></ul>
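<p>The two request bodies above carry the same vector. Decoding the Base64 string is a quick way to check an encoded value before importing it. The following is a minimal sketch, assuming 32-bit little-endian floats. The function name <strong>decode_vector_base64</strong> is hypothetical.</p>

```python
import base64
import struct


def decode_vector_base64(encoded):
    """Decode a Base64 string into a list of floats.

    Assumption: the payload is a sequence of little-endian 32-bit floats.
    """
    raw = base64.b64decode(encoded)
    if len(raw) % 4 != 0:
        raise ValueError("payload length is not a multiple of 4 bytes")
    # One "f" (32-bit float) per 4 bytes, little-endian ("<").
    return list(struct.unpack(f"<{len(raw) // 4}f", raw))


print(decode_vector_base64("AACAPwAAAEA="))  # [1.0, 2.0]
```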
</div>
<div class="section" id="EN-US_TOPIC_0000002333390490__section1759833513124"><h4 class="sectiontitle">Bulk Import</h4><ul id="EN-US_TOPIC_0000002333390490__en-us_topic_0000002353678449_ul11956134511195"><li id="EN-US_TOPIC_0000002333390490__en-us_topic_0000002353678449_li169561945101915">Floating-point array<pre class="screen" id="EN-US_TOPIC_0000002333390490__en-us_topic_0000002353678449_screen1786216364208">POST <i><span class="varname" id="EN-US_TOPIC_0000002333390490__en-us_topic_0000002353678449_varname119601143172014">my_index</span></i>/_bulk
{"index": {}}
{"my_vector": [1.0, 2.0], "my_label": "red"}
{"index": {}}
{"my_vector": [2.0, 2.0], "my_label": "green"}
{"index": {}}
{"my_vector": [2.0, 3.0], "my_label": "red"}</pre>
</li><li id="EN-US_TOPIC_0000002333390490__en-us_topic_0000002353678449_li6956134511920">Base64<pre class="screen" id="EN-US_TOPIC_0000002333390490__en-us_topic_0000002353678449_screen6310134384817">POST <i><span class="varname" id="EN-US_TOPIC_0000002333390490__en-us_topic_0000002353678449_varname632816518231">my_index</span></i>/_bulk
{"index": {}}
{"my_vector": "AACAPwAAAEA=", "my_label": "red"}
{"index": {}}
{"my_vector": "AAAAQAAAAEA=", "my_label": "green"}
{"index": {}}
{"my_vector": "AAAAQAAAQEA=", "my_label": "red"}</pre>
</li></ul>
<p id="EN-US_TOPIC_0000002333390490__en-us_topic_0000002353678449_p73291054143716">For details about how to use the Bulk API, see <a href="https://www.elastic.co/guide/en/elasticsearch/reference/7.7/docs-bulk.html" target="_blank" rel="noopener noreferrer">Bulk API</a>.</p>
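<p>A bulk request body is newline-delimited JSON: one action line followed by one document line per record, with a newline after the last line. The following is a minimal sketch of building such a body from a list of records. The function name <strong>build_bulk_body</strong> is hypothetical, and sending the request is left out.</p>

```python
import json


def build_bulk_body(records):
    """Return a newline-delimited JSON body with one index action per record."""
    lines = []
    for record in records:
        lines.append(json.dumps({"index": {}}))  # action line
        lines.append(json.dumps(record))         # document line
    # The Bulk API expects the body to end with a newline.
    return "\n".join(lines) + "\n"


body = build_bulk_body([
    {"my_vector": [1.0, 2.0], "my_label": "red"},
    {"my_vector": [2.0, 2.0], "my_label": "green"},
])
print(body, end="")
```

<p>The same function works for the Base64 format if <strong>my_vector</strong> holds the encoded string instead of an array.</p>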
</div>
<div class="section" id="EN-US_TOPIC_0000002333390490__section164656265118"><a name="EN-US_TOPIC_0000002333390490__section164656265118"></a><a name="section164656265118"></a><h4 class="sectiontitle">(Optional) Post-processing after Data Ingestion: Offline Index Building</h4><div class="warning" id="EN-US_TOPIC_0000002333390490__note9426454103716"><span class="warningtitle"><img src="public_sys-resources/warning_3.0-en-us.png"> </span><div class="warningbody"><ul id="EN-US_TOPIC_0000002333390490__ul1286142123818"><li id="EN-US_TOPIC_0000002333390490__li8286725386">Use the offline index building API only when real-time data is not required or crucial and the cluster version is OpenSearch 2.19.0.</li><li id="EN-US_TOPIC_0000002333390490__li1075867174411">If lazy_indexing is enabled, you must perform offline index building after data ingestion. Otherwise, standard vector queries return error code 500 with the error message "Load native index failed exception." To avoid this error, perform offline index building before running vector queries.</li></ul>
</div></div>
<p id="EN-US_TOPIC_0000002333390490__p1529111251215">OpenSearch uses an LSM (Log-Structured Merge) tree-like model to accelerate write operations. As data is continuously written and updated, numerous small index segments are generated and later merged by a background task to improve query performance. Because vector indexing is computationally intensive, frequent index merging while vector data is being written consumes significant CPU resources. Therefore, where real-time data is not crucial, you are advised to set <strong id="EN-US_TOPIC_0000002333390490__b114183182411171">lazy_indexing</strong> to <strong id="EN-US_TOPIC_0000002333390490__b174512243911171">true</strong> for vector fields. The final vector index can then be created through a non-real-time API after all data has been written. This approach significantly reduces the number of index merges, which improves overall write and index merging performance.</p>
<p id="EN-US_TOPIC_0000002333390490__p031561711139">Offline index building consists of two steps:</p>
<ol id="EN-US_TOPIC_0000002333390490__ol478183191311"><li id="EN-US_TOPIC_0000002333390490__li87811731131318">Merge index segments.</li><li id="EN-US_TOPIC_0000002333390490__li478115315135">Create the final vector index based on the final index segments.</li></ol>
<p id="EN-US_TOPIC_0000002333390490__p1367412481662">The API used for offline index building is as follows:</p>
<pre class="screen" id="EN-US_TOPIC_0000002333390490__screen1367414483612">POST _vector/indexing/{index_name}
{
  "field": "{field_name}"
}</pre>
<p id="EN-US_TOPIC_0000002333390490__p96742481061">In this request, {index_name} is the name of the index for which the vector index is built, and {field_name} is the name of the vector field for which <strong id="EN-US_TOPIC_0000002333390490__b192479830911171">lazy_indexing</strong> has been set to <strong id="EN-US_TOPIC_0000002333390490__b212160817111171">true</strong>.</p>
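<p>When the request is issued from a script rather than a console, it can be assembled as shown in the following minimal sketch. The base URL <strong>http://localhost:9200</strong>, the index name <strong>my_index</strong>, and the function name <strong>build_offline_indexing_request</strong> are assumptions for illustration. Authentication and TLS settings depend on the cluster and are omitted, and the request is only built here, not sent.</p>

```python
import json
import urllib.request


def build_offline_indexing_request(base_url, index_name, field_name):
    """Build (but do not send) the POST request for offline index building."""
    url = f"{base_url.rstrip('/')}/_vector/indexing/{index_name}"
    payload = json.dumps({"field": field_name}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


# Hypothetical cluster address, index name, and vector field name.
request = build_offline_indexing_request(
    "http://localhost:9200", "my_index", "my_vector"
)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
# To send it against a reachable cluster: urllib.request.urlopen(request)
```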
</div>
</div>
<div>
<div class="familylinks">
<div class="parentlink"><strong>Parent topic:</strong> <a href="css_01_0101.html">Configuring Vector Search for OpenSearch Clusters</a></div>
</div>
</div>