Yang, Tong 6182f91ba8 MRS component operation guide_normal 2.0.38.SP20 version
Reviewed-by: Hasko, Vladimir <vladimir.hasko@t-systems.com>
Co-authored-by: Yang, Tong <yangtong2@huawei.com>
Co-committed-by: Yang, Tong <yangtong2@huawei.com>
2022-12-09 14:55:21 +00:00


<a name="mrs_01_1999"></a>
<h1 class="topictitle1">Optimizing SQL Query of Data of Multiple Sources</h1>
<div id="body1595920219238"><div class="section" id="mrs_01_1999__s3b34da8437914dbfba759958bf67b545"><h4 class="sectiontitle">Scenario</h4><p id="mrs_01_1999__a8647c2d5485d4b56adea5c11d9afa14d">This section describes how to enable or disable query optimization for complex cross-source SQL queries.</p>
</div>
<div class="section" id="mrs_01_1999__sc9af3f18a8944f8393089526600550c3"><h4 class="sectiontitle">Procedure</h4><ul id="mrs_01_1999__uf271b959dae74bc3a23a5897351f21e8"><li id="mrs_01_1999__l54e876862ec347c8a417d438e01e4dbc">(Optional) Prepare for connecting to the MPPDB data source.<p id="mrs_01_1999__a33b63fd08ee9437cab2be835005bdff9"><a name="mrs_01_1999__l54e876862ec347c8a417d438e01e4dbc"></a><a name="l54e876862ec347c8a417d438e01e4dbc"></a>If the data source to be connected is MPPDB, a class name conflict occurs because the MPPDB Driver file <span class="filepath" id="mrs_01_1999__filepath84968048293650"><b>gsjdbc4.jar</b></span> and the Spark JAR package <span class="filepath" id="mrs_01_1999__filepath55292564593650"><b>gsjdbc4-VXXXRXXXCXXSPCXXX.jar</b></span> contain classes with the same names. Therefore, before connecting to the MPPDB data source, perform the following steps:</p>
<ol id="mrs_01_1999__od5f1ba66ed004d86a843a9a5d2bd349e"><li id="mrs_01_1999__l96683a0d853e4b72a1bd39c5c4955481">Move <span class="filepath" id="mrs_01_1999__f0f90fae630e8483298ddcf583942c146"><b>gsjdbc4-VXXXRXXXCXXSPCXXX.jar</b></span> from Spark. Spark running does not depend on this JAR file. Therefore, moving this JAR file to another directory (for example, the <strong id="mrs_01_1999__b1721314717111">/tmp</strong> directory) will not affect Spark running.<ol type="a" id="mrs_01_1999__o0260de1e3d704b76a802620e8c30dad0"><li id="mrs_01_1999__l875b013f2679490cac6515b9b9d6150f">Log in to the Spark server and move <span class="filepath" id="mrs_01_1999__fbcab862e5037466eb646878459e5a8f5"><b>gsjdbc4-VXXXRXXXCXXSPCXXX.jar</b></span> from the <strong id="mrs_01_1999__b88619573493650">${BIGDATA_HOME}/FusionInsight_Spark2x_<span id="mrs_01_1999__text10434120105616">8.1.0.1</span>/install/FusionInsight-Spark2x-<span id="mrs_01_1999__text5563355171417">3.1.1</span>/spark/jars</strong> directory.</li><li id="mrs_01_1999__la940c3e5cfd24365934fcd6f27da7dc0">Log in to the Spark client host and move <span class="filepath" id="mrs_01_1999__f589bf21aaf694a32bedcaa7ed01793e1"><b>gsjdbc4-VXXXRXXXCXXSPCXXX.jar</b></span> from the <strong id="mrs_01_1999__b202664598193650">/opt/client/Spark2x/spark/jars</strong> directory.</li></ol>
</li><li id="mrs_01_1999__l1470dcea59604849bd6e333a3266882d">Obtain the MPPDB Driver file <span class="filepath" id="mrs_01_1999__f364cec8f76764368952d3316d402a6f9"><b>gsjdbc4.jar</b></span> from the MPPDB installation package and upload the file to the following directories:<div class="note" id="mrs_01_1999__note17589145575918"><img src="public_sys-resources/note_3.0-en-us.png"><span class="notetitle"> </span><div class="notebody"><p id="mrs_01_1999__p959085585912">Obtain <span class="filepath" id="mrs_01_1999__filepath225916906"><b>gsjdbc4.jar</b></span> from <strong id="mrs_01_1999__b206251061615">FusionInsight_MPPDB\software\components\package\FusionInsight-MPPDB-</strong><em id="mrs_01_1999__i17600175411016">xxx</em><strong id="mrs_01_1999__b0820111617166">\package\Gauss-MPPDB-ALL-PACKAGES\GaussDB-</strong><em id="mrs_01_1999__i166461014119">xxx</em><strong id="mrs_01_1999__b10935103111618">-REDHAT-</strong><em id="mrs_01_1999__i57934211111">xxx</em><strong id="mrs_01_1999__b915873918166">-Jdbc\jdbc</strong>, the directory where the MPPDB installation package is stored.</p>
</div></div>
<ul id="mrs_01_1999__u96be9c19a2be40fbadc23d4dd34e6f02"><li id="mrs_01_1999__l05307d5c9aa946f391cd23d27f6f37dc"><strong id="mrs_01_1999__b129818814493650">${BIGDATA_HOME}/FusionInsight_Spark2x_<span id="mrs_01_1999__text6751302567">8.1.0.1</span>/install/FusionInsight-Spark2x-<span id="mrs_01_1999__text54621546181617">3.1.1</span>/spark/jars</strong> on the Spark server.</li><li id="mrs_01_1999__l9234d6c005b349c9bed3ead6e8041056"><strong id="mrs_01_1999__b124558113410">/opt/client/Spark2x/spark/jars</strong> on the Spark client.</li></ul>
</li><li id="mrs_01_1999__l878f9b389dce47aeb02d10cd0ff3f964">Update the <strong id="mrs_01_1999__b185671257893650">/user/spark2x/jars/<span id="mrs_01_1999__text188711438125614">8.1.0.1</span>/spark-archive-2x.zip</strong> package stored in the HDFS.<div class="note" id="mrs_01_1999__note377119319188"><img src="public_sys-resources/note_3.0-en-us.png"><span class="notetitle"> </span><div class="notebody"><p id="mrs_01_1999__p137721039181"><span id="mrs_01_1999__text782645293718">The version 8.1.0.1 is used as an example. Replace it with the actual version number.</span></p>
</div></div>
<ol type="a" id="mrs_01_1999__oe822cb33265f4c79a5837df7d7a2c6f0"><li id="mrs_01_1999__l218a7b95286c41cc8545c6ca6168eec6">Log in to the node where the client is installed as a client installation user. Run the following command to switch to the client installation directory, for example, <strong id="mrs_01_1999__b108350907293650">/opt/client</strong>:<p class="litext" id="mrs_01_1999__ae41fc07a760a42a98559c2131368c11d"><strong id="mrs_01_1999__ab0fb585918c940bebcd2862499c26076">cd /opt/client</strong></p>
</li><li id="mrs_01_1999__l547d95320d6847f78b570778b18ccaa8">Run the following command to configure environment variables:<p class="litext" id="mrs_01_1999__a88f84e3012de4d1281f9d1c47e00c38b"><a name="mrs_01_1999__l547d95320d6847f78b570778b18ccaa8"></a><a name="l547d95320d6847f78b570778b18ccaa8"></a><strong id="mrs_01_1999__a55333720b0dd43cc83b1a83080b91131">source bigdata_env</strong></p>
</li><li id="mrs_01_1999__l3e932a5d19ce4a99a0bfbf2793c04c4d">If the cluster is in security mode, run the following command to get authenticated:<p id="mrs_01_1999__a535310a79f6c4f1f9d06523fbf505d79"><a name="mrs_01_1999__l3e932a5d19ce4a99a0bfbf2793c04c4d"></a><a name="l3e932a5d19ce4a99a0bfbf2793c04c4d"></a><strong id="mrs_01_1999__b90186036993650">kinit</strong> <em id="mrs_01_1999__i42342522593650">Component service user</em></p>
</li><li id="mrs_01_1999__l6c09f918927449d187269865f4e6fab1">Run the following commands to create the temporary file <strong id="mrs_01_1999__b38654588793650">./tmp</strong>, obtain <strong id="mrs_01_1999__b22557328893650">spark-archive-2x.zip</strong> from HDFS, and decompress it to the <strong id="mrs_01_1999__b8955943893650">tmp</strong> directory:<p id="mrs_01_1999__ac6971af938474ee48ce2fc573b4c93d0"><b><span class="cmdname" id="mrs_01_1999__cmdname3240020134020">mkdir tmp</span></b></p>
<p id="mrs_01_1999__a627f4dc6eb814dea9f4713578b5dfa72"><b><span class="cmdname" id="mrs_01_1999__cmdname679862994110">hdfs dfs -get</span></b> /user/spark2x/jars/<span id="mrs_01_1999__text16871347165611">8.1.0.1</span>/spark-archive-2x.zip <b><span class="cmdname" id="mrs_01_1999__cmdname0370203714113">./</span></b></p>
<p id="mrs_01_1999__a74bb3439d6a74dc1b7a9473f66daf317"><b><span class="cmdname" id="mrs_01_1999__cmdname8159204344113">unzip spark-archive-2x.zip -d ./tmp</span></b></p>
</li><li id="mrs_01_1999__lc86ecd4e025841f89961e2125940d316"><a name="mrs_01_1999__lc86ecd4e025841f89961e2125940d316"></a><a name="lc86ecd4e025841f89961e2125940d316"></a>Switch to the <strong id="mrs_01_1999__b127301950693650">tmp</strong> directory, delete the <strong id="mrs_01_1999__b79280227693650">gsjdbc4-VXXXRXXXCXXSPCXXX.jar</strong> file, upload the MPPDB Driver file <strong id="mrs_01_1999__b172928990493650">gsjdbc4.jar</strong> to the <strong id="mrs_01_1999__b44413959193650">tmp</strong> directory, and run the following command to compress the file again:<p id="mrs_01_1999__a9145014b057945219d5ecfbac8b73dde"><b><span class="cmdname" id="mrs_01_1999__cmdname20895148124117">zip -r spark-archive-2x.zip *.jar</span></b></p>
</li><li id="mrs_01_1999__l674eea8efb2840148140df76f0686333">Delete <strong id="mrs_01_1999__b5800085893650">spark-archive-2x.zip</strong> from the HDFS and update the <strong id="mrs_01_1999__b90964784993650">spark-archive-2x.zip</strong> package generated in <a href="#mrs_01_1999__lc86ecd4e025841f89961e2125940d316">3.e</a> to the <strong id="mrs_01_1999__b184063678893650">/user/spark2x/jars/<span id="mrs_01_1999__text25850726145748">8.1.0.1</span>/</strong> directory in the HDFS.<p id="mrs_01_1999__a9c3292ab55b042ca8fc07703c07ba865"><b><span class="cmdname" id="mrs_01_1999__cmdname278416541418">hdfs dfs -rm</span></b> /user/spark2x/jars/<span id="mrs_01_1999__text830203165712">8.1.0.1</span>/spark-archive-2x.zip</p>
<p id="mrs_01_1999__afb239f4ba4d3421eb939d53b31c6882e"><b><span class="cmdname" id="mrs_01_1999__cmdname194661959114114">hdfs dfs -put</span></b> ./spark-archive-2x.zip /user/spark2x/jars/<span id="mrs_01_1999__text1563931165714">8.1.0.1</span></p>
</li></ol>
</li><li id="mrs_01_1999__l32bb4bfb52e0477295662d9dc43a2648">Restart the Spark service. After the Spark service is restarted, restart the Spark client.</li></ol>
</li></ul>
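<p>The preceding JAR replacement procedure can be summarized as the following shell sketch. The client path <strong>/opt/client</strong>, the version <strong>8.1.0.1</strong>, the service user, and the location of the obtained <strong>gsjdbc4.jar</strong> are examples only; replace them with the actual values in your environment.</p>
<pre>cd /opt/client
source bigdata_env
kinit sparkuser                        # security mode only; sparkuser is an example service user
mkdir tmp
hdfs dfs -get /user/spark2x/jars/8.1.0.1/spark-archive-2x.zip ./
unzip spark-archive-2x.zip -d ./tmp
cd tmp
rm gsjdbc4-VXXXRXXXCXXSPCXXX.jar
cp /path/to/mppdb/gsjdbc4.jar ./       # hypothetical path; use the driver obtained from the MPPDB package
zip -r spark-archive-2x.zip *.jar
hdfs dfs -rm /user/spark2x/jars/8.1.0.1/spark-archive-2x.zip
hdfs dfs -put ./spark-archive-2x.zip /user/spark2x/jars/8.1.0.1</pre>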
<ul id="mrs_01_1999__u3176297f411a4950a3cf75f20a6a6a5f"><li id="mrs_01_1999__l212ee87e232440bbb52e4507e6729f91">Enable the optimization function.<p id="mrs_01_1999__a1e25e5f44bd746c38e006684b8e7f977"><a name="mrs_01_1999__l212ee87e232440bbb52e4507e6729f91"></a><a name="l212ee87e232440bbb52e4507e6729f91"></a>For all modules that support query pushdown, you can run the <strong id="mrs_01_1999__b16625341593650">SET</strong> command on the <strong id="mrs_01_1999__b123093016593650">spark-beeline</strong> client to enable the cross-source query optimization function. By default, the function is disabled.</p>
<p id="mrs_01_1999__adb3833cb4e094d51aa1c1fbd1e69a046">Pushdown can be configured at three levels: globally, per data source, and per table. The commands are as follows:</p>
<ul id="mrs_01_1999__ueb5021eeacb04e2281bb25a5445858d0"><li id="mrs_01_1999__le8caa3540f8a4929a6902a28353a6706">Global (valid for all data sources):<p id="mrs_01_1999__a8a040701e9934dc49084133dde1564ec"><a name="mrs_01_1999__le8caa3540f8a4929a6902a28353a6706"></a><a name="le8caa3540f8a4929a6902a28353a6706"></a><b><span class="cmdname" id="mrs_01_1999__cmdname59541569426">SET spark.sql.datasource.jdbc = project,aggregate,orderby-limit</span></b></p>
</li><li id="mrs_01_1999__l0a0beaedcd2649dfbcea82dea67c68f6">Data sources:<p id="mrs_01_1999__adf2ca1b2f9a64a759b70d3809643a2ec"><a name="mrs_01_1999__l0a0beaedcd2649dfbcea82dea67c68f6"></a><a name="l0a0beaedcd2649dfbcea82dea67c68f6"></a><b><span class="cmdname" id="mrs_01_1999__cmdname1870619115425">SET spark.sql.datasource.${url} = project,aggregate,orderby-limit</span></b></p>
</li><li id="mrs_01_1999__lda98158e90ce47949de1f647d78dc67c">Tables:<p id="mrs_01_1999__aed534409ad294d48ae1f146985f31d60"><a name="mrs_01_1999__lda98158e90ce47949de1f647d78dc67c"></a><a name="lda98158e90ce47949de1f647d78dc67c"></a><b><span class="cmdname" id="mrs_01_1999__cmdname35461161427">SET spark.sql.datasource.${url}.${table} = project,aggregate,orderby-limit</span></b></p>
</li></ul>
<p id="mrs_01_1999__aaaf1559a949e47da9a2482a90b21b0ac">When you run the <strong id="mrs_01_1999__b39006887593650">SET</strong> command to configure the preceding parameters, you can specify multiple pushdown modules, separated by commas. The following table lists the parameter value for each pushdown module.</p>
<div class="tablenoborder"><table cellpadding="4" cellspacing="0" summary="" id="mrs_01_1999__td3ea289f70034d62b484044590a8e6af" frame="border" border="1" rules="all"><caption><b>Table 1 </b>Parameters of modules</caption><thead align="left"><tr id="mrs_01_1999__rb91586d16fbf4ff69e3e636630fec8a1"><th align="left" class="cellrowborder" valign="top" width="50%" id="mcps1.3.2.3.1.5.2.3.1.1"><p id="mrs_01_1999__a6f02923d1e4d4c118a891ccb879d04a6">Module</p>
</th>
<th align="left" class="cellrowborder" valign="top" width="50%" id="mcps1.3.2.3.1.5.2.3.1.2"><p id="mrs_01_1999__a577cd71042084e50b0244fed2e24f4b7">Parameter Value in the SET Command</p>
</th>
</tr>
</thead>
<tbody><tr id="mrs_01_1999__r144b9bb22b6c4b16ab9ad6dc5645caf1"><td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.2.3.1.5.2.3.1.1 "><p id="mrs_01_1999__aa86485feffac436f9abbb9609735c60c">project</p>
</td>
<td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.2.3.1.5.2.3.1.2 "><p id="mrs_01_1999__af89fc9cff9ea435482a84c7272b69990">project</p>
</td>
</tr>
<tr id="mrs_01_1999__r46ef3b8fdf704aaeb90c12558d0ecc2f"><td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.2.3.1.5.2.3.1.1 "><p id="mrs_01_1999__a04bd55af6e5541279f5b1aba4d01ffea">aggregate</p>
</td>
<td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.2.3.1.5.2.3.1.2 "><p id="mrs_01_1999__abf78fd69cfdb4cbfb821455616cf3933">aggregate</p>
</td>
</tr>
<tr id="mrs_01_1999__r9377b0fce013400b97edff76c7972163"><td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.2.3.1.5.2.3.1.1 "><p id="mrs_01_1999__aa04d4e437d164ee39cb6664ed62e73f8">order by, limit over project or aggregate</p>
</td>
<td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.2.3.1.5.2.3.1.2 "><p id="mrs_01_1999__a6a8a25e0203f4e02a3ba20daf6fd60ba">orderby-limit</p>
</td>
</tr>
</tbody>
</table>
</div>
<p id="mrs_01_1999__a5eb8ba86999b4567bc196ce918e18180">The following statement creates an external table that connects to a MySQL data source:</p>
<p id="mrs_01_1999__a99847fde0d214228bac49f669b4b1dfd"><b><span class="cmdname" id="mrs_01_1999__cmdname569932313422">create table if not exists pdmysql using org.apache.spark.sql.jdbc options(driver "com.mysql.jdbc.Driver", url "jdbc:mysql://ip2:3306/test", user "hive", password "<span id="mrs_01_1999__text234311514396"><em id="mrs_01_1999__i10120152143915">xxx</em></span>", dbtable "mysqldata");</span></b></p>
<p id="mrs_01_1999__a82fcfa66af1048e58bd1ad41b5f2df00">In the preceding statement:</p>
<ul id="mrs_01_1999__u75b41512c20d4d5aac7282aee4dca70d"><li id="mrs_01_1999__l26e826256fda4c70a538c6ef0f62c1a6">${url} = jdbc:mysql://ip2:3306/test</li><li id="mrs_01_1999__lc5a7a93e63b74d6a97fe2a1081dd1ac7">${table} = mysqldata</li></ul>
<div class="note" id="mrs_01_1999__n7d14e6aba9eb41779fe297ebe03568a8"><img src="public_sys-resources/note_3.0-en-us.png"><span class="notetitle"> </span><div class="notebody"><ul id="mrs_01_1999__u1aa0ebb6b0bc4807b8245c1bc4acb807"><li id="mrs_01_1999__l99d1b9dc8f6a4538a0a112c7358d028e">The value on the right of the equal sign (=) is the comma-separated list of operators for which pushdown is enabled.</li><li id="mrs_01_1999__lb7853dd2db4c44d0b692415415df166a">Priority: table &gt; data source &gt; global. If the table-level switch is set, the data source and global switches are ignored for that table. If a data source switch is set, the global switch is ignored for that data source.</li><li id="mrs_01_1999__l988045bd12c84aeea35078d2bc7290e4">The URL cannot contain an equal sign (=). Equal signs (=) in the SET clause are automatically deleted.</li><li id="mrs_01_1999__l01d1f237f307455dac07dd5d70ae510c">The results of multiple SET operations with different keys do not overwrite each other.</li></ul>
</div></div>
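<p>For example, based on the <strong>pdmysql</strong> table definition above, the following setting enables only projection and aggregate pushdown for the <strong>mysqldata</strong> table (a sketch using the example URL; because the table-level switch has the highest priority, it overrides any data source or global setting for this table):</p>
<pre>SET spark.sql.datasource.jdbc:mysql://ip2:3306/test.mysqldata = project,aggregate</pre>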
</li><li id="mrs_01_1999__l4a971896bcce4b2a8c43d73f469b1111">Add functions that support query pushdown.<p id="mrs_01_1999__a9eda48283a3d4288a343a7c3b1143712"><a name="mrs_01_1999__l4a971896bcce4b2a8c43d73f469b1111"></a><a name="l4a971896bcce4b2a8c43d73f469b1111"></a>In addition to pushdown of mathematical, time, and string functions such as abs(), month(), and length(), you can run the <strong id="mrs_01_1999__b123993021493650">SET</strong> command to add functions that a data source supports pushing down. Run the following command on the spark-beeline client:</p>
<p id="mrs_01_1999__a1bee084d7f104bfe83d69ffe4b5aa98b"><b><span class="cmdname" id="mrs_01_1999__cmdname794922715424">SET spark.sql.datasource.${datasource}.functions = fun1,fun2</span></b></p>
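<p>For example, assuming <strong>${datasource}</strong> is the data source URL as in the earlier examples, the following sketch declares two additional functions as supporting pushdown (the function names <strong>lower</strong> and <strong>upper</strong> are illustrative only):</p>
<pre>SET spark.sql.datasource.jdbc:mysql://ip2:3306/test.functions = lower,upper</pre>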
</li><li id="mrs_01_1999__lfc0f4d41f10a4e5689bc626241cdf803">Reset the configuration set by the <strong id="mrs_01_1999__b7261925693650">SET</strong> command.<p id="mrs_01_1999__acb2a5b5d0e0f4e049f7d1ea18021fe6e">Currently, you can only run the <strong id="mrs_01_1999__b200127127193650">RESET</strong> command on the <strong id="mrs_01_1999__b126499891893650">spark-beeline</strong> client, which clears all configurations made by the <strong id="mrs_01_1999__b196250651693650">SET</strong> command. Exercise caution when performing this operation.</p>
<p id="mrs_01_1999__a6955a385ed61444d9fa6d7fd0617ad91">The <strong id="mrs_01_1999__b156343608993650">SET</strong> command is valid only in the current client session. After the client is shut down, the <strong id="mrs_01_1999__b198743402493650">SET</strong> configuration no longer takes effect.</p>
<p id="mrs_01_1999__lfc0f4d41f10a4e5689bc626241cdf803p1">Alternatively, change the value of <strong id="mrs_01_1999__b115300128893650">spark.sql.locale.support</strong> in the <strong id="mrs_01_1999__b186737875393650">spark-defaults.conf</strong> file to <strong id="mrs_01_1999__b63696920093650">true</strong>.</p>
</li></ul>
</div>
<div class="section" id="mrs_01_1999__s65834aceacaa41c1b64d11515e04a9e7"><h4 class="sectiontitle">Precautions</h4><p id="mrs_01_1999__p96231951198">Only MySQL, MPPDB, Hive, Oracle, and PostgreSQL data sources are supported.</p>
</div>
</div>
<div>
<div class="familylinks">
<div class="parentlink"><strong>Parent topic:</strong> <a href="mrs_01_1985.html">Spark SQL and DataFrame Tuning</a></div>
</div>
</div>