Reviewed-by: Pruthi, Vineet <vineet.pruthi@t-systems.com> Co-authored-by: Lu, Huayi <luhuayi@huawei.com> Co-committed-by: Lu, Huayi <luhuayi@huawei.com>
<a name="EN-US_TOPIC_0000001233628593"></a>
<h1 class="topictitle1">Stop Words</h1>
<div id="body8662426"><p id="EN-US_TOPIC_0000001233628593__en-us_topic_0059778544_p86851123411">Stop words are words that are so common that they appear in almost every document and therefore have no discrimination value, so full-text search can ignore them. Each type of dictionary treats stop words differently. For example, <strong id="EN-US_TOPIC_0000001233628593__b1081661124919">Ispell</strong> dictionaries first normalize a word and then check the result against the list of stop words, while <strong id="EN-US_TOPIC_0000001233628593__b081815115498">Snowball</strong> dictionaries check the list of stop words first.</p>
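<p>One way to observe how a single dictionary handles a stop word is to test the dictionary directly. The following is a minimal sketch that assumes the <strong>ts_lexize</strong> function and the built-in <strong>english_stem</strong> Snowball dictionary are available, as they are in PostgreSQL; neither name is taken from this topic.</p>
<div class="codecoloring" codetype="Sql"><div class="highlight"><pre>-- Assumption: ts_lexize and the english_stem dictionary exist in this database.
SELECT ts_lexize('english_stem', 'the');
 ts_lexize
-----------
 {}

SELECT ts_lexize('english_stem', 'stars');
 ts_lexize
-----------
 {star}
</pre></div></div>
<p>An empty array means that the dictionary recognized the word as a stop word. A non-empty array contains the normalized lexeme.</p>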
<p id="EN-US_TOPIC_0000001233628593__a28d2d72038fc4781802bc7e756a26064">For example, every English text contains words such as <strong id="EN-US_TOPIC_0000001233628593__b81889344121756">a</strong> and <strong id="EN-US_TOPIC_0000001233628593__b2621137921756">the</strong>, so storing them in an index serves no purpose. However, stop words are still counted when positions are assigned in a <strong id="EN-US_TOPIC_0000001233628593__b842352706144919">tsvector</strong>, so they affect the positions of the remaining words, which in turn affects ranking.</p>
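<p>As a baseline, the same sentence can be parsed without any stop-word filtering. This sketch assumes that the built-in <strong>simple</strong> text search configuration (not mentioned in this topic) is available and has no stop-word list configured, which is the default in PostgreSQL.</p>
<div class="codecoloring" codetype="Sql"><div class="highlight"><pre>-- Assumption: the simple configuration exists and removes no stop words.
SELECT to_tsvector('simple','in the list of stop words');
                    to_tsvector
---------------------------------------------------
 'in':1 'list':3 'of':4 'stop':5 'the':2 'words':6
</pre></div></div>
<p>Every word keeps its own position, from 1 to 6. With the <strong>english</strong> configuration, the same sentence yields the following result:</p>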
<div class="codecoloring" codetype="Sql" id="EN-US_TOPIC_0000001233628593__s698e816ffc534f5299816a16e1db55f7"><div class="highlight"><table class="highlighttable"><tr><td class="code"><div><pre><span></span><span class="k">SELECT</span><span class="w"> </span><span class="n">to_tsvector</span><span class="p">(</span><span class="s1">'english'</span><span class="p">,</span><span class="s1">'in the list of stop words'</span><span class="p">);</span>
<span class="w"> </span><span class="n">to_tsvector</span>
<span class="c1">----------------------------</span>
<span class="w"> </span><span class="s1">'list'</span><span class="p">:</span><span class="mi">3</span><span class="w"> </span><span class="s1">'stop'</span><span class="p">:</span><span class="mi">5</span><span class="w"> </span><span class="s1">'word'</span><span class="p">:</span><span class="mi">6</span>
</pre></div></td></tr></table></div>
</div>
<p id="EN-US_TOPIC_0000001233628593__a17318900a29d4d9fbc0b842a11e65c20">Positions 1, 2, and 4 are missing because they belong to the stop words <strong>in</strong>, <strong>the</strong>, and <strong>of</strong>. Ranks calculated for documents with and without stop words are quite different:</p>
<div class="codecoloring" codetype="Sql" id="EN-US_TOPIC_0000001233628593__s47258dfe2aa541c0a058b23dbafc89d0"><div class="highlight"><table class="highlighttable"><tr><td class="code"><div><pre><span></span><span class="k">SELECT</span><span class="w"> </span><span class="n">ts_rank_cd</span><span class="w"> </span><span class="p">(</span><span class="n">to_tsvector</span><span class="p">(</span><span class="s1">'english'</span><span class="p">,</span><span class="s1">'in the list of stop words'</span><span class="p">),</span><span class="w"> </span><span class="n">to_tsquery</span><span class="p">(</span><span class="s1">'list &amp; stop'</span><span class="p">));</span>
<span class="w"> </span><span class="n">ts_rank_cd</span>
<span class="c1">------------</span>
<span class="w"> </span><span class="p">.</span><span class="mi">05</span>

<span class="k">SELECT</span><span class="w"> </span><span class="n">ts_rank_cd</span><span class="w"> </span><span class="p">(</span><span class="n">to_tsvector</span><span class="p">(</span><span class="s1">'english'</span><span class="p">,</span><span class="s1">'list stop words'</span><span class="p">),</span><span class="w"> </span><span class="n">to_tsquery</span><span class="p">(</span><span class="s1">'list &amp; stop'</span><span class="p">));</span>
<span class="w"> </span><span class="n">ts_rank_cd</span>
<span class="c1">------------</span>
<span class="w"> </span><span class="p">.</span><span class="mi">1</span>
</pre></div></td></tr></table></div>
</div>
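<p>The two ranks can be related to the positions stored in each vector. As a minimal check that reuses only the <strong>to_tsvector</strong> call shown above, print the vector of the second document:</p>
<div class="codecoloring" codetype="Sql"><div class="highlight"><pre>-- The document without stop words: positions are consecutive.
SELECT to_tsvector('english','list stop words');
        to_tsvector
----------------------------
 'list':1 'stop':2 'word':3
</pre></div></div>
<p>Here <strong>list</strong> and <strong>stop</strong> are adjacent (positions 1 and 2), whereas in the document that contains stop words they are at positions 3 and 5, separated by the position of <strong>of</strong>. <strong>ts_rank_cd</strong> is a cover density ranking, so it is reasonable to expect a higher rank when the query terms are closer together, which matches the results above.</p>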
</div>
<div>
<div class="familylinks">
<div class="parentlink"><strong>Parent topic:</strong> <a href="dws_06_0102.html">Dictionaries</a></div>
</div>
</div>