doc-exports/docs/dws/dev/dws_06_0094.html
Lu, Huayi e6fa411af0 DWS DEV 830.201 version
Reviewed-by: Pruthi, Vineet <vineet.pruthi@t-systems.com>
Co-authored-by: Lu, Huayi <luhuayi@huawei.com>
Co-committed-by: Lu, Huayi <luhuayi@huawei.com>
2024-05-16 07:24:04 +00:00


<a name="EN-US_TOPIC_0000001188270518"></a>
<h1 class="topictitle1">Ranking Search Results</h1>
<div id="body8662426"><p id="EN-US_TOPIC_0000001188270518__en-us_topic_0059777759_p79693118219">Ranking attempts to measure how relevant documents are to a particular query, so that when there are many matches the most relevant ones can be shown first. <span id="EN-US_TOPIC_0000001188270518__text230338983">GaussDB(DWS)</span> provides two predefined ranking functions, <strong id="EN-US_TOPIC_0000001188270518__b478792515444">ts_rank</strong> and <strong id="EN-US_TOPIC_0000001188270518__b343032824417">ts_rank_cd</strong>, which take into account lexical, proximity, and structural information; that is, they consider how often the query terms appear in the document, how close together the terms are in the document, and how important the part of the document where they occur is. However, the concept of relevancy is vague and application-specific. Different applications might require additional information for ranking, for example, document modification time. The built-in ranking functions are only examples. You can write your own ranking functions or combine their results with additional factors to fit your specific needs.</p>
<p id="EN-US_TOPIC_0000001188270518__aff4887867c934ef296c788a9b38fa4dc">The two ranking functions currently available are:</p>
<div class="codecoloring" codetype="Sql" id="EN-US_TOPIC_0000001188270518__s17e3e3c2993f4e4a861d82c43526820f"><div class="highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span class="normal">1</span></pre></div></td><td class="code"><div><pre><span></span><span class="n">ts_rank</span><span class="p">([</span><span class="w"> </span><span class="n">weights</span><span class="w"> </span><span class="n">float4</span><span class="p">[],</span><span class="w"> </span><span class="p">]</span><span class="w"> </span><span class="n">vector</span><span class="w"> </span><span class="n">tsvector</span><span class="p">,</span><span class="w"> </span><span class="n">query</span><span class="w"> </span><span class="n">tsquery</span><span class="w"> </span><span class="p">[,</span><span class="w"> </span><span class="n">normalization</span><span class="w"> </span><span class="nb">integer</span><span class="w"> </span><span class="p">])</span><span class="w"> </span><span class="k">returns</span><span class="w"> </span><span class="n">float4</span>
</pre></div></td></tr></table></div>
</div>
<p id="EN-US_TOPIC_0000001188270518__a8643d2c58f9b4add8e2d74376a53ffe5">Ranks vectors based on the frequency of their matching lexemes.</p>
<div class="codecoloring" codetype="Sql" id="EN-US_TOPIC_0000001188270518__s754b49569ef84135872e704d6be47080"><div class="highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span class="normal">1</span></pre></div></td><td class="code"><div><pre><span></span><span class="n">ts_rank_cd</span><span class="p">([</span><span class="w"> </span><span class="n">weights</span><span class="w"> </span><span class="n">float4</span><span class="p">[],</span><span class="w"> </span><span class="p">]</span><span class="w"> </span><span class="n">vector</span><span class="w"> </span><span class="n">tsvector</span><span class="p">,</span><span class="w"> </span><span class="n">query</span><span class="w"> </span><span class="n">tsquery</span><span class="w"> </span><span class="p">[,</span><span class="w"> </span><span class="n">normalization</span><span class="w"> </span><span class="nb">integer</span><span class="w"> </span><span class="p">])</span><span class="w"> </span><span class="k">returns</span><span class="w"> </span><span class="n">float4</span>
</pre></div></td></tr></table></div>
</div>
<p id="EN-US_TOPIC_0000001188270518__aa54a18cf703e41ffbf0178045a2ad508">This function requires positional information in its input. Therefore, it does not work on "stripped" <strong id="EN-US_TOPIC_0000001188270518__b84235270612650">tsvector</strong> values: for such values it always returns zero.</p>
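<p>As a minimal sketch of this behavior, the following query ranks the same text with and without positional information. It assumes that the <strong>strip</strong> function and the <strong>english</strong> text search configuration are available in your cluster; the sample text is purely illustrative. The second column is expected to be zero, because <strong>strip</strong> removes the positions that <strong>ts_rank_cd</strong> relies on.</p>
<pre class="screen">-- Illustrative sketch: the sample text and the 'english' configuration are assumptions.
-- with_positions ranks a normal tsvector; stripped ranks one without positions.
SELECT ts_rank_cd(to_tsvector('english', 'the science of computing'),
                  to_tsquery('english', 'science')) AS with_positions,
       ts_rank_cd(strip(to_tsvector('english', 'the science of computing')),
                  to_tsquery('english', 'science')) AS stripped;</pre>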
<p id="EN-US_TOPIC_0000001188270518__ae3cf983c396d4f51b2151202e10da455">For both these functions, the optional <strong id="EN-US_TOPIC_0000001188270518__b8423527061273">weights</strong> argument offers the ability to weigh word instances more or less heavily depending on how they are labeled. The weight arrays specify how heavily to weigh each category of word, in the order:</p>
<pre class="screen" id="EN-US_TOPIC_0000001188270518__s390228f1201b4555b57b4cef799f82c9">{D-weight, C-weight, B-weight, A-weight}</pre>
<p id="EN-US_TOPIC_0000001188270518__afc56f59af66948aa95e6808c464b4106">If no <strong id="EN-US_TOPIC_0000001188270518__b84235270612734">weights</strong> are provided, then these defaults are used: {0.1, 0.2, 0.4, 1.0}</p>
<p id="EN-US_TOPIC_0000001188270518__a8392e99e97e74f82b6dc2ffba72a9bf2">Typically weights are used to mark words from special areas of the document, like the title or an initial abstract, so they can be treated with more or less importance than words in the document body.</p>
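<p>The following sketch shows one way the <strong>weights</strong> argument might be used. It assumes that the <strong>setweight</strong> function and the <strong>english</strong> text search configuration are available; the title and body strings are hypothetical. The title is labeled A and the body is labeled D, so with the array shown (the same values as the defaults) a match in the title counts more than a match in the body.</p>
<pre class="screen">-- Illustrative sketch: sample strings and the 'english' configuration are assumptions.
-- The array order is {D-weight, C-weight, B-weight, A-weight}.
SELECT ts_rank('{0.1, 0.2, 0.4, 1.0}'::float4[],
               setweight(to_tsvector('english', 'Medical science'), 'A') ||
               setweight(to_tsvector('english', 'An overview of recent research'), 'D'),
               to_tsquery('english', 'science')) AS rank;</pre>
<p>Changing the last array element, for example to 0.5, would reduce the contribution of words labeled A without affecting the other labels.</p>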
<p id="EN-US_TOPIC_0000001188270518__a91a8710bf7c04e639055f9ae385c17e9">Since a longer document has a greater chance of containing a query term, it is reasonable to take document size into account. For example, a hundred-word document with five instances of a search word is probably more relevant than a thousand-word document with five instances. Both ranking functions take an integer <strong id="EN-US_TOPIC_0000001188270518__b84235270612916">normalization</strong> option that specifies whether and how a document's length should impact its rank. The integer option controls several behaviors, so it is a bit mask: you can combine one or more behaviors with a vertical bar (<strong id="EN-US_TOPIC_0000001188270518__b1567672574614">|</strong>), for example, <strong id="EN-US_TOPIC_0000001188270518__b12676202594613">2|4</strong>.</p>
<ul id="EN-US_TOPIC_0000001188270518__u030390dedb2345ae8a99cbf1386c5130"><li id="EN-US_TOPIC_0000001188270518__l11ccb9fd13604a9cb95fe9d40aba5201">0 (the default) ignores the document length</li><li id="EN-US_TOPIC_0000001188270518__le70680ffe6054ac59bfa1798b94f7e05">1 divides the rank by (1 + the logarithm of the document length)</li><li id="EN-US_TOPIC_0000001188270518__ldb26969c19964d9d890c2c57f5d8a229">2 divides the rank by the document length</li><li id="EN-US_TOPIC_0000001188270518__laaedcd2448d44bcb9c30464560fb667e">4 divides the rank by the mean harmonic distance between extents (this is implemented only by <strong>ts_rank_cd</strong>)</li><li id="EN-US_TOPIC_0000001188270518__lf5fc1613f6454f6a8e188e26aab44ce3">8 divides the rank by the number of unique words in the document</li><li id="EN-US_TOPIC_0000001188270518__l0c326266f0ed4dbc8bc15206fe78814c">16 divides the rank by (1 + the logarithm of the number of unique words in the document)</li><li id="EN-US_TOPIC_0000001188270518__l2130887e16d145e8aff6aa4a45c6b4c2">32 divides the rank by (itself + 1)</li></ul>
<p id="EN-US_TOPIC_0000001188270518__a45fc284b343e4bf7bfc9ddaeb94c08c3">If more than one flag bit is specified, the transformations are applied in the order listed.</p>
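<p>As a rough sketch of how the bit mask can be passed, the following query ranks one hypothetical document with no normalization, with option 1, and with options 2 and 8 combined. The sample text, the column aliases, and the <strong>english</strong> text search configuration are assumptions made for illustration.</p>
<pre class="screen">-- Illustrative sketch: the document text and the 'english' configuration are assumptions.
-- 2|8 is evaluated as the integer 10, which sets both flag bits.
SELECT ts_rank_cd(v, q, 0)   AS no_normalization,
       ts_rank_cd(v, q, 1)   AS by_log_length,
       ts_rank_cd(v, q, 2|8) AS by_length_and_unique_words
FROM (SELECT to_tsvector('english', 'science and more science in a short document') AS v,
             to_tsquery('english', 'science') AS q) AS t;</pre>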
<p id="EN-US_TOPIC_0000001188270518__a6a25572ae270403fa5ba01ec5a98d058">It is important to note that the ranking functions do not use any global information, so it is impossible to produce a fair normalization to 1% or 100% as sometimes desired. Normalization option 32 <strong id="EN-US_TOPIC_0000001188270518__b17511545135111">(rank/(rank+1))</strong> can be used to scale all ranks into the range zero to one. This is just a cosmetic change, and it will not affect the ordering of the search results.</p>
<p id="EN-US_TOPIC_0000001188270518__aab2de272d75d479f98bd5adf4b29a9bc">Here is an example that selects only the ten highest-ranked matches:</p>
<p id="EN-US_TOPIC_0000001188270518__p83801046152515">Run the following statements in a database that uses the UTF-8 or GBK encoding:</p>
<div class="codecoloring" codetype="Sql" id="EN-US_TOPIC_0000001188270518__s9cddbe3b65254feca6f754373b812624"><div class="highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span class="normal"> 1</span>
<span class="normal"> 2</span>
<span class="normal"> 3</span>
<span class="normal"> 4</span>
<span class="normal"> 5</span>
<span class="normal"> 6</span>
<span class="normal"> 7</span>
<span class="normal"> 8</span>
<span class="normal"> 9</span>
<span class="normal">10</span>
<span class="normal">11</span>
<span class="normal">12</span></pre></div></td><td class="code"><div><pre><span></span><span class="k">SELECT</span><span class="w"> </span><span class="n">id</span><span class="p">,</span><span class="w"> </span><span class="n">title</span><span class="p">,</span><span class="w"> </span><span class="n">ts_rank_cd</span><span class="p">(</span><span class="n">to_tsvector</span><span class="p">(</span><span class="n">body</span><span class="p">),</span><span class="w"> </span><span class="n">query</span><span class="p">)</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">rank</span><span class="w"> </span>
<span class="k">FROM</span><span class="w"> </span><span class="n">tsearch</span><span class="p">.</span><span class="n">pgweb</span><span class="p">,</span><span class="w"> </span><span class="n">to_tsquery</span><span class="p">(</span><span class="s1">'science'</span><span class="p">)</span><span class="w"> </span><span class="n">query</span><span class="w"> </span>
<span class="k">WHERE</span><span class="w"> </span><span class="n">query</span><span class="w"> </span><span class="o">@@</span><span class="w"> </span><span class="n">to_tsvector</span><span class="p">(</span><span class="n">body</span><span class="p">)</span><span class="w"> </span>
<span class="k">ORDER</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="n">rank</span><span class="w"> </span><span class="k">DESC</span><span class="w"> </span>
<span class="k">LIMIT</span><span class="w"> </span><span class="mi">10</span><span class="p">;</span>
<span class="w"> </span><span class="n">id</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="n">title</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="n">rank</span>
<span class="c1">----+------------------+------</span>
<span class="w"> </span><span class="mi">7</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="n">Medical</span><span class="w"> </span><span class="n">science</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="p">.</span><span class="mi">2</span>
<span class="w"> </span><span class="mi">3</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="n">Computer</span><span class="w"> </span><span class="n">science</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="p">.</span><span class="mi">1</span>
<span class="w"> </span><span class="mi">2</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="n">Mathematics</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="p">.</span><span class="mi">1</span>
<span class="w"> </span><span class="mi">5</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="n">Geography</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="p">.</span><span class="mi">1</span>
<span class="p">(</span><span class="mi">4</span><span class="w"> </span><span class="k">rows</span><span class="p">)</span>
</pre></div></td></tr></table></div>
</div>
<p id="EN-US_TOPIC_0000001188270518__a083bf085035541528c7060ef04e7f11b">This is the same example using normalized ranking:</p>
<div class="codecoloring" codetype="Sql" id="EN-US_TOPIC_0000001188270518__scc1b1acc40c34d1dbb1784fe2165f97f"><div class="highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span class="normal"> 1</span>
<span class="normal"> 2</span>
<span class="normal"> 3</span>
<span class="normal"> 4</span>
<span class="normal"> 5</span>
<span class="normal"> 6</span>
<span class="normal"> 7</span>
<span class="normal"> 8</span>
<span class="normal"> 9</span>
<span class="normal">10</span>
<span class="normal">11</span>
<span class="normal">12</span></pre></div></td><td class="code"><div><pre><span></span><span class="k">SELECT</span><span class="w"> </span><span class="n">id</span><span class="p">,</span><span class="w"> </span><span class="n">title</span><span class="p">,</span><span class="w"> </span><span class="n">ts_rank_cd</span><span class="p">(</span><span class="n">to_tsvector</span><span class="p">(</span><span class="n">body</span><span class="p">),</span><span class="w"> </span><span class="n">query</span><span class="p">,</span><span class="w"> </span><span class="mi">32</span><span class="w"> </span><span class="cm">/* rank/(rank+1) */</span><span class="w"> </span><span class="p">)</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">rank</span><span class="w"> </span>
<span class="k">FROM</span><span class="w"> </span><span class="n">tsearch</span><span class="p">.</span><span class="n">pgweb</span><span class="p">,</span><span class="w"> </span><span class="n">to_tsquery</span><span class="p">(</span><span class="s1">'science'</span><span class="p">)</span><span class="w"> </span><span class="n">query</span><span class="w"> </span>
<span class="k">WHERE</span><span class="w"> </span><span class="n">query</span><span class="w"> </span><span class="o">@@</span><span class="w"> </span><span class="n">to_tsvector</span><span class="p">(</span><span class="n">body</span><span class="p">)</span><span class="w"> </span>
<span class="k">ORDER</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="n">rank</span><span class="w"> </span><span class="k">DESC</span><span class="w"> </span>
<span class="k">LIMIT</span><span class="w"> </span><span class="mi">10</span><span class="p">;</span>
<span class="w"> </span><span class="n">id</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="n">title</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="n">rank</span>
<span class="c1">----+------------------+----------</span>
<span class="w"> </span><span class="mi">7</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="n">Medical</span><span class="w"> </span><span class="n">science</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="p">.</span><span class="mi">166667</span>
<span class="w"> </span><span class="mi">3</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="n">Computer</span><span class="w"> </span><span class="n">science</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="p">.</span><span class="mi">0909091</span>
<span class="w"> </span><span class="mi">2</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="n">Mathematics</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="p">.</span><span class="mi">0909091</span>
<span class="w"> </span><span class="mi">5</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="n">Geography</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="p">.</span><span class="mi">0909091</span>
<span class="p">(</span><span class="mi">4</span><span class="w"> </span><span class="k">rows</span><span class="p">)</span>
</pre></div></td></tr></table></div>
</div>
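<p>The normalized values can be checked by hand against the unnormalized ranks from the first example: option 32 computes rank/(rank+1), so .2 becomes approximately .166667 and .1 becomes approximately .0909091. The following statement is only an arithmetic check on those two constants and does not read any table.</p>
<pre class="screen">-- Arithmetic check of rank/(rank+1) for the ranks .2 and .1 shown earlier.
SELECT 0.2 / (0.2 + 1) AS medical_science,
       0.1 / (0.1 + 1) AS other_rows;</pre>
<p>Because rank/(rank+1) increases as the rank increases, the rows keep the same relative order after normalization.</p>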
<p id="EN-US_TOPIC_0000001188270518__ace6879f01fc541529a1ed77faa20348e">Ranking can be expensive since it requires consulting the <strong id="EN-US_TOPIC_0000001188270518__b84235270615813">tsvector</strong> of each matching document, which can be I/O bound and therefore slow. Unfortunately, it is almost impossible to avoid since practical queries often result in large numbers of matches.</p>
</div>
<div>
<div class="familylinks">
<div class="parentlink"><strong>Parent topic:</strong> <a href="dws_06_0091.html">Controlling Text Search</a></div>
</div>
</div>