doc-exports/docs/dli/dev/dli_09_0078.html
Hasko, Vladimir cfc48b3aed dli_dev_0104_version
Reviewed-by: Pruthi, Vineet <vineet.pruthi@t-systems.com>
Co-authored-by: Hasko, Vladimir <vladimir.hasko@t-systems.com>
Co-committed-by: Hasko, Vladimir <vladimir.hasko@t-systems.com>
2024-05-06 09:14:57 +00:00


<a name="dli_09_0078"></a><a name="dli_09_0078"></a>
<h1 class="topictitle1">PySpark Example Code</h1>
<div id="body8662426"><div class="section" id="dli_09_0078__section69541420195910"><h4 class="sectiontitle">Development Description</h4><p id="dli_09_0078__en-us_topic_0197738130_p8060118">CloudTable HBase and MRS HBase can be connected to DLI as data sources.</p>
<ul id="dli_09_0078__ul1782753175910"><li id="dli_09_0078__li58225318597">Prerequisites<p id="dli_09_0078__p146019217010"><a name="dli_09_0078__li58225318597"></a><a name="li58225318597"></a>A datasource connection has been created on the DLI management console. </p>
<div class="note" id="dli_09_0078__note1358715714155"><img src="public_sys-resources/note_3.0-en-us.png"><span class="notetitle"> </span><div class="notebody"><p id="dli_09_0078__p1858718570154">Hard-coded or plaintext passwords pose significant security risks. To ensure security, encrypt your passwords, store them in configuration files or environment variables, and decrypt them when needed.</p>
</div></div>
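<p>As one way to follow this advice, the sketch below reads connection settings from environment variables instead of embedding them in the script. The helper <strong>read_setting</strong> and the variable names <strong>DLI_HBASE_ZKHOST</strong> and <strong>DLI_HBASE_PRINCIPAL</strong> are hypothetical and used for illustration only. Decryption is not shown because it depends on the encryption tool you use.</p>

```python
import os


def read_setting(name, default=None):
    """Return the value of environment variable `name`, or `default` if it is unset."""
    value = os.environ.get(name, default)
    if value is None:
        raise KeyError("Required environment variable is not set: " + name)
    return value


# Hypothetical variable names; the defaults are the sample values used on this page.
zk_host = read_setting("DLI_HBASE_ZKHOST", "192.168.0.189:2181")
principal = read_setting("DLI_HBASE_PRINCIPAL", "krbtest")
print(zk_host, principal)
```

<p>The returned values can then be placed in the <strong>OPTIONS</strong> clause of the CREATE TABLE statement instead of literal strings.</p>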
</li><li id="dli_09_0078__li168339435014">Code implementation<ol id="dli_09_0078__en-us_topic_0197738130_ol12123050181818"><li id="dli_09_0078__en-us_topic_0197738130_li1612316509182">Import dependency packages.<div class="codecoloring" codetype="Python" id="dli_09_0078__en-us_topic_0197738130_screen1856216330202"><div class="highlight"><table class="highlighttable"><tr><td class="code"><div><pre><span></span><span class="kn">from</span> <span class="nn">__future__</span> <span class="kn">import</span> <span class="n">print_function</span>
<span class="kn">from</span> <span class="nn">pyspark.sql.types</span> <span class="kn">import</span> <span class="n">StructType</span><span class="p">,</span> <span class="n">StructField</span><span class="p">,</span> <span class="n">IntegerType</span><span class="p">,</span> <span class="n">StringType</span><span class="p">,</span> <span class="n">BooleanType</span><span class="p">,</span> <span class="n">ShortType</span><span class="p">,</span> <span class="n">LongType</span><span class="p">,</span> <span class="n">FloatType</span><span class="p">,</span> <span class="n">DoubleType</span>
<span class="kn">from</span> <span class="nn">pyspark.sql</span> <span class="kn">import</span> <span class="n">SparkSession</span>
</pre></div></td></tr></table></div>
</div>
</li><li id="dli_09_0078__en-us_topic_0197738130_li11272141817195">Create a session.<div class="codecoloring" codetype="Python" id="dli_09_0078__en-us_topic_0197738130_screen2658132002217"><div class="highlight"><table class="highlighttable"><tr><td class="code"><div><pre><span></span><span class="n">sparkSession</span> <span class="o">=</span> <span class="n">SparkSession</span><span class="o">.</span><span class="n">builder</span><span class="o">.</span><span class="n">appName</span><span class="p">(</span><span class="s2">&quot;datasource-hbase&quot;</span><span class="p">)</span><span class="o">.</span><span class="n">getOrCreate</span><span class="p">()</span>
</pre></div></td></tr></table></div>
</div>
</li></ol>
</li><li id="dli_09_0078__li288734724719">Connecting to data sources through SQL APIs<ol id="dli_09_0078__ol11417145402410"><li id="dli_09_0078__li194171854122419">Create a table to connect to an HBase data source.<ul id="dli_09_0078__ul16163164615243"><li id="dli_09_0078__li91606460247">Use the following sample code if Kerberos authentication <strong id="dli_09_0078__b328011219414">is disabled</strong> for the interconnected HBase cluster:<pre class="screen" id="dli_09_0078__screen2160646102412">sparkSession.sql(
"CREATE TABLE testhbase(id STRING, location STRING, city STRING) using hbase OPTIONS (\
'ZKHost' = '192.168.0.189:2181',\
'TableName' = 'hbtest',\
'RowKey' = 'id:5',\
'Cols' = 'location:info.location,city:detail.city')")</pre>
</li><li id="dli_09_0078__li141631146172418">Use the following sample code if Kerberos authentication <strong id="dli_09_0078__b114371511419">is enabled</strong> for the interconnected HBase cluster:<pre class="screen" id="dli_09_0078__screen18161194622419">sparkSession.sql(
"CREATE TABLE testhbase(id STRING, location STRING, city STRING) using hbase OPTIONS (\
'ZKHost' = '192.168.0.189:2181',\
'TableName' = 'hbtest',\
'RowKey' = 'id:5',\
'Cols' = 'location:info.location,city:detail.city',\
<strong id="dli_09_0078__b51605466247">'krb5conf' = './krb5.conf',\</strong>
<strong id="dli_09_0078__b7160104612247"> 'keytab'='./user.keytab',\</strong>
<strong id="dli_09_0078__b9160246182417"> 'principal' ='krbtest')")</strong></pre>
<div class="p" id="dli_09_0078__p616224672418">If Kerberos authentication is enabled, set the three additional parameters listed in <a href="#dli_09_0078__table8162174602419">Table 1</a>.
<div class="tablenoborder"><a name="dli_09_0078__table8162174602419"></a><a name="table8162174602419"></a><table cellpadding="4" cellspacing="0" summary="" id="dli_09_0078__table8162174602419" frame="border" border="1" rules="all"><caption><b>Table 1 </b>Kerberos authentication parameters</caption><thead align="left"><tr id="dli_09_0078__row21611246102419"><th align="left" class="cellrowborder" valign="top" width="30.349999999999998%" id="mcps1.3.1.3.3.1.1.1.2.3.2.2.3.1.1"><p id="dli_09_0078__p8161246152415">Parameter and Value</p>
</th>
<th align="left" class="cellrowborder" valign="top" width="69.65%" id="mcps1.3.1.3.3.1.1.1.2.3.2.2.3.1.2"><p id="dli_09_0078__p11611046142412">Description</p>
</th>
</tr>
</thead>
<tbody><tr id="dli_09_0078__row161611846112414"><td class="cellrowborder" valign="top" width="30.349999999999998%" headers="mcps1.3.1.3.3.1.1.1.2.3.2.2.3.1.1 "><p id="dli_09_0078__p19161194672412">'krb5conf' = './krb5.conf'</p>
</td>
<td class="cellrowborder" valign="top" width="69.65%" headers="mcps1.3.1.3.3.1.1.1.2.3.2.2.3.1.2 "><p id="dli_09_0078__p4161446172410">Path of the <strong id="dli_09_0078__b1187824092411">krb5.conf</strong> file.</p>
</td>
</tr>
<tr id="dli_09_0078__row17161184613247"><td class="cellrowborder" valign="top" width="30.349999999999998%" headers="mcps1.3.1.3.3.1.1.1.2.3.2.2.3.1.1 "><p id="dli_09_0078__p11610469246">'keytab'='./user.keytab'</p>
</td>
<td class="cellrowborder" valign="top" width="69.65%" headers="mcps1.3.1.3.3.1.1.1.2.3.2.2.3.1.2 "><p id="dli_09_0078__p516144615248">Path of the <strong id="dli_09_0078__b18200174402415">keytab</strong> file.</p>
</td>
</tr>
<tr id="dli_09_0078__row121621246162414"><td class="cellrowborder" valign="top" width="30.349999999999998%" headers="mcps1.3.1.3.3.1.1.1.2.3.2.2.3.1.1 "><p id="dli_09_0078__p7162134652410">'principal' ='krbtest'</p>
</td>
<td class="cellrowborder" valign="top" width="69.65%" headers="mcps1.3.1.3.3.1.1.1.2.3.2.2.3.1.2 "><p id="dli_09_0078__p1716214610244">Authentication username.</p>
</td>
</tr>
</tbody>
</table>
</div>
</div>
<p id="dli_09_0078__p181623464247">For details about how to obtain the <strong id="dli_09_0078__b1188204284518">krb5.conf</strong> and <strong id="dli_09_0078__b1018811428453">keytab</strong> files, see <a href="dli_09_0196.html#dli_09_0196__section12676527182715">Completing Configurations for Enabling Kerberos Authentication</a>.</p>
<div class="note" id="dli_09_0078__note316214466242"><img src="public_sys-resources/note_3.0-en-us.png"><span class="notetitle"> </span><div class="notebody"><p id="dli_09_0078__p016254692417">For details about parameters in the table, see <a href="dli_09_0063.html#dli_09_0063__table15979164115531">Table 1</a>.</p>
</div></div>
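<p>The two variants above differ only in the three Kerberos options. As a minimal sketch, the statement can be assembled by a small helper that appends those options when they are supplied. The function <strong>build_hbase_ddl</strong> and its argument names are hypothetical and not part of DLI or PySpark.</p>

```python
def build_hbase_ddl(table, columns, zk_host, hbase_table, row_key, cols, kerberos=None):
    """Build a CREATE TABLE ... using hbase statement.

    kerberos is an optional dict with the keys 'krb5conf', 'keytab', and 'principal'.
    """
    options = [
        ("ZKHost", zk_host),
        ("TableName", hbase_table),
        ("RowKey", row_key),
        ("Cols", cols),
    ]
    if kerberos is not None:
        # Append the three additional parameters described in Table 1.
        for key in ("krb5conf", "keytab", "principal"):
            options.append((key, kerberos[key]))
    option_sql = ", ".join("'{}' = '{}'".format(k, v) for k, v in options)
    return "CREATE TABLE {}({}) using hbase OPTIONS ({})".format(table, columns, option_sql)


ddl = build_hbase_ddl(
    "testhbase",
    "id STRING, location STRING, city STRING",
    "192.168.0.189:2181",
    "hbtest",
    "id:5",
    "location:info.location,city:detail.city",
    kerberos={"krb5conf": "./krb5.conf", "keytab": "./user.keytab", "principal": "krbtest"},
)
print(ddl)
# In a Spark job, the statement would then be run with sparkSession.sql(ddl).
```

<p>Omitting the <strong>kerberos</strong> argument yields the statement for a cluster with Kerberos authentication disabled.</p>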
</li></ul>
</li><li id="dli_09_0078__li1918612165253">Import data to HBase.<pre class="screen" id="dli_09_0078__screen1616314464245">sparkSession.sql("insert into testhbase values('95274','abc','Jinan')")</pre>
</li><li id="dli_09_0078__li115602012517">Read data from HBase.<pre class="screen" id="dli_09_0078__screen919141662511">sparkSession.sql("select * from testhbase").show()</pre>
</li></ol>
</li><li id="dli_09_0078__li1347517221018">Connecting to data sources through DataFrame APIs<ol id="dli_09_0078__en-us_topic_0197738130_ol164851156145419"><li id="dli_09_0078__en-us_topic_0197738130_li17698293198">Create a table to connect to an HBase data source.<div class="codecoloring" codetype="Python" id="dli_09_0078__en-us_topic_0197738130_screen95431138152317"><div class="highlight"><table class="highlighttable"><tr><td class="code"><div><pre><span></span><span class="n">sparkSession</span><span class="o">.</span><span class="n">sql</span><span class="p">(</span>\
<span class="s2">&quot;CREATE TABLE test_hbase(id STRING, location STRING, city STRING, booleanf BOOLEAN, shortf SHORT, intf INT, longf LONG, </span><span class="se">\</span>
<span class="s2">floatf FLOAT, doublef DOUBLE) using hbase OPTIONS (</span><span class="se">\</span>
<span class="s2">'ZKHost' = 'cloudtable-cf82-zk3-pa6HnHpf.cloudtable.com:2181,</span><span class="se">\</span>
<span class="s2">cloudtable-cf82-zk2-weBkIrjI.cloudtable.com:2181,</span><span class="se">\</span>
<span class="s2">cloudtable-cf82-zk1-WY09px9l.cloudtable.com:2181',</span><span class="se">\</span>
<span class="s2">'TableName' = 'table_DupRowkey1',</span><span class="se">\</span>
<span class="s2">'RowKey' = 'id:5,location:6,city:7',</span><span class="se">\</span>
<span class="s2">'Cols' = 'booleanf:CF1.booleanf,shortf:CF1.shortf,intf:CF1.intf,longf:CF1.longf,floatf:CF1.floatf,doublef:CF1.doublef')&quot;</span><span class="p">)</span>
</pre></div></td></tr></table></div>
</div>
<div class="note" id="dli_09_0078__en-us_topic_0197738130_note1376719247267"><img src="public_sys-resources/note_3.0-en-us.png"><span class="notetitle"> </span><div class="notebody"><ul id="dli_09_0078__en-us_topic_0197738130_ul078155017343"><li id="dli_09_0078__en-us_topic_0197738130_li177895083419">For details about the <strong id="dli_09_0078__b245913134615">ZKHost</strong>, <strong id="dli_09_0078__b1459163164612">RowKey</strong>, and <strong id="dli_09_0078__b2459153110466">Cols</strong> parameters, see <a href="dli_09_0063.html#dli_09_0063__table15979164115531">Table 1</a>.</li><li id="dli_09_0078__en-us_topic_0197738130_li1484845219345"><strong id="dli_09_0078__b1525912365814">TableName</strong>: Name of the table in CloudTable. If the table does not exist, the system automatically creates it.</li></ul>
</div></div>
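<p>Judging from the sample values on this page, each <strong>Cols</strong> entry appears to follow the pattern <em>table column</em>:<em>column family</em>.<em>qualifier</em>. The sketch below splits such a string so the mapping can be inspected before the table is created. The format is inferred from the samples only, and the helper <strong>parse_cols</strong> is hypothetical; the referenced parameter table remains the authoritative description.</p>

```python
def parse_cols(cols):
    """Split a Cols string into {table_column: (column_family, qualifier)}."""
    mapping = {}
    for item in cols.split(","):
        item = item.strip()
        if not item:
            continue
        # Assumed entry format: column:family.qualifier
        column, target = item.split(":", 1)
        family, qualifier = target.split(".", 1)
        mapping[column.strip()] = (family.strip(), qualifier.strip())
    return mapping


cols = "booleanf:CF1.booleanf,shortf:CF1.shortf,intf:CF1.intf"
for column, (family, qualifier) in parse_cols(cols).items():
    print(column, family, qualifier)
```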
</li><li id="dli_09_0078__en-us_topic_0197738130_li24856568549">Construct a schema.<div class="codecoloring" codetype="Python" id="dli_09_0078__en-us_topic_0197738130_screen97743142556"><div class="highlight"><table class="highlighttable"><tr><td class="code"><div><pre><span></span><span class="n">schema</span> <span class="o">=</span> <span class="n">StructType</span><span class="p">([</span><span class="n">StructField</span><span class="p">(</span><span class="s2">&quot;id&quot;</span><span class="p">,</span> <span class="n">StringType</span><span class="p">()),</span>\
<span class="n">StructField</span><span class="p">(</span><span class="s2">&quot;location&quot;</span><span class="p">,</span> <span class="n">StringType</span><span class="p">()),</span>\
<span class="n">StructField</span><span class="p">(</span><span class="s2">&quot;city&quot;</span><span class="p">,</span> <span class="n">StringType</span><span class="p">()),</span>\
<span class="n">StructField</span><span class="p">(</span><span class="s2">&quot;booleanf&quot;</span><span class="p">,</span> <span class="n">BooleanType</span><span class="p">()),</span>\
<span class="n">StructField</span><span class="p">(</span><span class="s2">&quot;shortf&quot;</span><span class="p">,</span> <span class="n">ShortType</span><span class="p">()),</span>\
<span class="n">StructField</span><span class="p">(</span><span class="s2">&quot;intf&quot;</span><span class="p">,</span> <span class="n">IntegerType</span><span class="p">()),</span>\
<span class="n">StructField</span><span class="p">(</span><span class="s2">&quot;longf&quot;</span><span class="p">,</span> <span class="n">LongType</span><span class="p">()),</span>\
<span class="n">StructField</span><span class="p">(</span><span class="s2">&quot;floatf&quot;</span><span class="p">,</span> <span class="n">FloatType</span><span class="p">()),</span>\
<span class="n">StructField</span><span class="p">(</span><span class="s2">&quot;doublef&quot;</span><span class="p">,</span> <span class="n">DoubleType</span><span class="p">())])</span>
</pre></div></td></tr></table></div>
</div>
</li><li id="dli_09_0078__en-us_topic_0197738130_li139272576564">Set data.<div class="codecoloring" codetype="Python" id="dli_09_0078__en-us_topic_0197738130_screen92801813185719"><div class="highlight"><table class="highlighttable"><tr><td class="code"><div><pre><span></span><span class="n">dataList</span> <span class="o">=</span> <span class="n">sparkSession</span><span class="o">.</span><span class="n">sparkContext</span><span class="o">.</span><span class="n">parallelize</span><span class="p">([(</span><span class="s2">&quot;11111&quot;</span><span class="p">,</span> <span class="s2">&quot;aaa&quot;</span><span class="p">,</span> <span class="s2">&quot;aaa&quot;</span><span class="p">,</span> <span class="kc">False</span><span class="p">,</span> <span class="mi">4</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">23</span><span class="p">,</span> <span class="mf">2.3</span><span class="p">,</span> <span class="mf">2.34</span><span class="p">)])</span>
</pre></div></td></tr></table></div>
</div>
</li><li id="dli_09_0078__en-us_topic_0197738130_li253713178588">Create a DataFrame.<div class="codecoloring" codetype="Python" id="dli_09_0078__en-us_topic_0197738130_screen473683415815"><div class="highlight"><table class="highlighttable"><tr><td class="code"><div><pre><span></span><span class="n">dataFrame</span> <span class="o">=</span> <span class="n">sparkSession</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">(</span><span class="n">dataList</span><span class="p">,</span> <span class="n">schema</span><span class="p">)</span>
</pre></div></td></tr></table></div>
</div>
</li><li id="dli_09_0078__en-us_topic_0197738130_li21141045409">Import data to HBase.<div class="codecoloring" codetype="Python" id="dli_09_0078__en-us_topic_0197738130_screen13841031018"><div class="highlight"><table class="highlighttable"><tr><td class="code"><div><pre><span></span><span class="n">dataFrame</span><span class="o">.</span><span class="n">write</span><span class="o">.</span><span class="n">insertInto</span><span class="p">(</span><span class="s2">&quot;test_hbase&quot;</span><span class="p">)</span>
</pre></div></td></tr></table></div>
</div>
</li><li id="dli_09_0078__en-us_topic_0197738130_li479317294115">Read data from HBase.<div class="codecoloring" codetype="Python" id="dli_09_0078__en-us_topic_0197738130_screen188506525117"><div class="highlight"><table class="highlighttable"><tr><td class="code"><div><pre><span></span><span class="c1"># Set cross-source connection parameters</span>
<span class="n">TableName</span> <span class="o">=</span> <span class="s2">&quot;table_DupRowkey1&quot;</span>
<span class="n">RowKey</span> <span class="o">=</span> <span class="s2">&quot;id:5,location:6,city:7&quot;</span>
<span class="n">Cols</span> <span class="o">=</span> <span class="s2">&quot;booleanf:CF1.booleanf,shortf:CF1.shortf,intf:CF1.intf,longf:CF1.longf,floatf:CF1.floatf,doublef:CF1.doublef&quot;</span>
<span class="n">ZKHost</span> <span class="o">=</span> <span class="s2">&quot;cloudtable-cf82-zk3-pa6HnHpf.cloudtable.com:2181,cloudtable-cf82-zk2-weBkIrjI.cloudtable.com:2181,</span><span class="se">\</span>
<span class="s2">cloudtable-cf82-zk1-WY09px9l.cloudtable.com:2181&quot;</span>
<span class="c1"># Query the table</span>
<span class="n">jdbcDF</span> <span class="o">=</span> <span class="n">sparkSession</span><span class="o">.</span><span class="n">read</span><span class="o">.</span><span class="n">schema</span><span class="p">(</span><span class="n">schema</span><span class="p">)</span>\
<span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="s2">&quot;hbase&quot;</span><span class="p">)</span>\
<span class="o">.</span><span class="n">option</span><span class="p">(</span><span class="s2">&quot;ZKHost&quot;</span><span class="p">,</span><span class="n">ZKHost</span><span class="p">)</span>\
<span class="o">.</span><span class="n">option</span><span class="p">(</span><span class="s2">&quot;TableName&quot;</span><span class="p">,</span><span class="n">TableName</span><span class="p">)</span>\
<span class="o">.</span><span class="n">option</span><span class="p">(</span><span class="s2">&quot;RowKey&quot;</span><span class="p">,</span><span class="n">RowKey</span><span class="p">)</span>\
<span class="o">.</span><span class="n">option</span><span class="p">(</span><span class="s2">&quot;Cols&quot;</span><span class="p">,</span><span class="n">Cols</span><span class="p">)</span>\
<span class="o">.</span><span class="n">load</span><span class="p">()</span>
<span class="n">jdbcDF</span><span class="o">.</span><span class="n">filter</span><span class="p">(</span><span class="s2">&quot;id = '12333' or id='11111'&quot;</span><span class="p">)</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
</pre></div></td></tr></table></div>
</div>
<div class="note" id="dli_09_0078__en-us_topic_0197738130_note1146792071514"><img src="public_sys-resources/note_3.0-en-us.png"><span class="notetitle"> </span><div class="notebody"><p id="dli_09_0078__en-us_topic_0197738130_p20477420101514">The lengths of the <strong id="dli_09_0078__b11247185615595">id</strong>, <strong id="dli_09_0078__b112471856155919">location</strong>, and <strong id="dli_09_0078__b82471756105914">city</strong> fields are limited. When inserting data, set values that comply with the required lengths. Otherwise, an encoding format error occurs during the query.</p>
</div></div>
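<p>A simple pre-insert check can catch values that are too long before they reach HBase. The sketch below assumes that the number after each colon in <strong>RowKey</strong> (for example, <strong>id:5</strong>) is the maximum length of that field, which is an interpretation of the sample values rather than a documented rule. The helper names are hypothetical.</p>

```python
def parse_row_key(row_key):
    """Turn 'id:5,location:6' into [('id', 5), ('location', 6)]."""
    fields = []
    for item in row_key.split(","):
        name, length = item.split(":")
        fields.append((name.strip(), int(length)))
    return fields


def check_row_key_lengths(row, row_key):
    """Return the names of row-key fields whose values exceed the declared length."""
    too_long = []
    for name, limit in parse_row_key(row_key):
        if len(str(row[name])) > limit:
            too_long.append(name)
    return too_long


row = {"id": "11111", "location": "aaa", "city": "aaa"}
print(check_row_key_lengths(row, "id:5,location:6,city:7"))
```

<p>An empty list means that every row-key field fits within its declared length under this assumption.</p>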
</li></ol>
</li><li id="dli_09_0078__li186809540119">Submitting a Spark job<ol id="dli_09_0078__en-us_topic_0197738130_ol612481914610"><li id="dli_09_0078__li5407152122416">Upload the Python code file to DLI.<p id="dli_09_0078__p4803928112412"><a name="dli_09_0078__li5407152122416"></a><a name="li5407152122416"></a></p>
<p id="dli_09_0078__p985542212244"></p>
</li><li id="dli_09_0078__li20873105417215">(Optional) Add the <strong id="dli_09_0078__b1760394614302">krb5.conf</strong> and <strong id="dli_09_0078__b156038467306">user.keytab</strong> files to other dependency files of the job when creating a Spark job in an MRS cluster with Kerberos authentication enabled. Skip this step if Kerberos authentication is not enabled for the cluster. </li><li id="dli_09_0078__li1721019426248">In the Spark job editor, select the corresponding dependency module and execute the Spark job.<p id="dli_09_0078__p20945447152415"><a name="dli_09_0078__li1721019426248"></a><a name="li1721019426248"></a></p>
<div class="p" id="dli_09_0078__p1668024392411"><div class="note" id="dli_09_0078__en-us_topic_0197738130_note1435543551919"><img src="public_sys-resources/note_3.0-en-us.png"><span class="notetitle"> </span><div class="notebody"><ul id="dli_09_0078__en-us_topic_0197738130_ul17825285811"><li id="dli_09_0078__en-us_topic_0197738142_li58215295819">If the Spark version is 2.3.2 (will be brought offline soon) or 2.4.5, set <strong id="dli_09_0078__b196782011154216">Module</strong> to <strong id="dli_09_0078__b4678611154217">sys.datasource.hbase</strong> when you submit a job.</li><li id="dli_09_0078__li6624653171317">If the Spark version is 3.1.1, you do not need to select a module. Instead, configure <strong id="dli_09_0078__b172041216423">Spark parameters (--conf)</strong> as follows:<p id="dli_09_0078__p1765215102311">spark.driver.extraClassPath=/usr/share/extension/dli/spark-jar/datasource/hbase/*</p>
<p id="dli_09_0078__p1865215532311">spark.executor.extraClassPath=/usr/share/extension/dli/spark-jar/datasource/hbase/*</p>
</li></ul>
</div></div>
</div>
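<p>The version-dependent settings in the note can be summarized as a small lookup. The following sketch only restates the note in code; the function <strong>job_settings</strong> and the dictionary keys <strong>modules</strong> and <strong>conf</strong> are hypothetical names and do not correspond to a DLI API.</p>

```python
HBASE_CLASSPATH = "/usr/share/extension/dli/spark-jar/datasource/hbase/*"


def job_settings(spark_version):
    """Return the module and --conf settings that the note lists for a Spark version."""
    if spark_version in ("2.3.2", "2.4.5"):
        return {"modules": ["sys.datasource.hbase"], "conf": {}}
    if spark_version == "3.1.1":
        return {
            "modules": [],
            "conf": {
                "spark.driver.extraClassPath": HBASE_CLASSPATH,
                "spark.executor.extraClassPath": HBASE_CLASSPATH,
            },
        }
    raise ValueError("Version not covered by this page: " + spark_version)


for key, value in job_settings("3.1.1")["conf"].items():
    print("--conf {}={}".format(key, value))
```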
</li></ol>
</li></ul>
</div>
<div class="section" id="dli_09_0078__section1549419379279"><h4 class="sectiontitle">Complete Example Code</h4><ul id="dli_09_0078__ul10666115713267"><li id="dli_09_0078__li2025720152717">Connecting to MRS HBase through SQL APIs<ul id="dli_09_0078__ul899297152320"><li id="dli_09_0078__li196891851192214">Sample code when Kerberos authentication is <strong id="dli_09_0078__b291223324719">disabled</strong><pre class="screen" id="dli_09_0078__screen1720511363213"># _*_ coding: utf-8 _*_
from __future__ import print_function
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, BooleanType, ShortType, LongType, FloatType, DoubleType
from pyspark.sql import SparkSession
if __name__ == "__main__":
    # Create a SparkSession.
    sparkSession = SparkSession.builder.appName("datasource-hbase").getOrCreate()

    # Create a DLI table associated with the HBase table.
    sparkSession.sql(
        "CREATE TABLE testhbase(id STRING, location STRING, city STRING) using hbase OPTIONS (\
        'ZKHost' = '192.168.0.189:2181',\
        'TableName' = 'hbtest',\
        'RowKey' = 'id:5',\
        'Cols' = 'location:info.location,city:detail.city')")

    # Insert one row and read the table back.
    sparkSession.sql("insert into testhbase values('95274','abc','Jinan')")
    sparkSession.sql("select * from testhbase").show()

    # Close the session.
    sparkSession.stop()</pre>
</li></ul>
<ul id="dli_09_0078__ul745724142820"><li id="dli_09_0078__li1645720413283">Sample code when Kerberos authentication is <strong id="dli_09_0078__b125171841144710">enabled</strong><pre class="screen" id="dli_09_0078__screen202493532327"># _*_ coding: utf-8 _*_
from __future__ import print_function
from pyspark import SparkFiles
from pyspark.sql import SparkSession
import shutil
import time
import os
if __name__ == "__main__":
    # Create a SparkSession.
    sparkSession = SparkSession.builder.appName("Test_HBase_SparkSql_Kerberos").getOrCreate()
    sc = sparkSession.sparkContext
    time.sleep(10)

    # Copy the krb5.conf and user.keytab dependency files to the working directory.
    krb5_startfile = SparkFiles.get("krb5.conf")
    keytab_startfile = SparkFiles.get("user.keytab")
    path_user = os.getcwd()
    krb5_endfile = path_user + "/" + "krb5.conf"
    keytab_endfile = path_user + "/" + "user.keytab"
    shutil.copy(krb5_startfile, krb5_endfile)
    shutil.copy(keytab_startfile, keytab_endfile)
    time.sleep(20)

    # Create a DLI table associated with the HBase table, passing the Kerberos options.
    sparkSession.sql(
        "CREATE TABLE testhbase(id string,booleanf boolean,shortf short,intf int,longf long,floatf float,doublef double) " +
        "using hbase OPTIONS(" +
        "'ZKHost'='10.0.0.146:2181'," +
        "'TableName'='hbtest'," +
        "'RowKey'='id:100'," +
        "'Cols'='booleanf:CF1.booleanf,shortf:CF1.shortf,intf:CF1.intf,longf:CF2.longf,floatf:CF1.floatf,doublef:CF2.doublef'," +
        "'krb5conf'='" + path_user + "/krb5.conf'," +
        "'keytab'='" + path_user + "/user.keytab'," +
        "'principal'='krbtest') ")

    # Sample values only; they must match the seven columns declared above.
    sparkSession.sql("insert into testhbase values('95274',true,1,2,3,4.0,5.0)")
    sparkSession.sql("select * from testhbase").show()

    # Close the session.
    sparkSession.stop()</pre>
</li></ul>
</li><li id="dli_09_0078__li16345354122317">Connecting to HBase through DataFrame APIs<pre class="screen" id="dli_09_0078__screen1754883218332"># _*_ coding: utf-8 _*_
from __future__ import print_function
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, BooleanType, ShortType, LongType, FloatType, DoubleType
from pyspark.sql import SparkSession
if __name__ == "__main__":
    # Create a SparkSession.
    sparkSession = SparkSession.builder.appName("datasource-hbase").getOrCreate()

    # Create a DLI table associated with the CloudTable HBase table.
    sparkSession.sql(
        "CREATE TABLE test_hbase(id STRING, location STRING, city STRING, booleanf BOOLEAN, shortf SHORT, intf INT, longf LONG,floatf FLOAT,doublef DOUBLE) using hbase OPTIONS ( \
        'ZKHost' = 'cloudtable-cf82-zk3-pa6HnHpf.cloudtable.com:2181,cloudtable-cf82-zk2-weBkIrjI.cloudtable.com:2181,cloudtable-cf82-zk1-WY09px9l.cloudtable.com:2181',\
        'TableName' = 'table_DupRowkey1',\
        'RowKey' = 'id:5,location:6,city:7',\
        'Cols' = 'booleanf:CF1.booleanf,shortf:CF1.shortf,intf:CF1.intf,longf:CF1.longf,floatf:CF1.floatf,doublef:CF1.doublef')")

    # Prepare the data to insert.
    dataList = sparkSession.sparkContext.parallelize([("11111", "aaa", "aaa", False, 4, 3, 23, 2.3, 2.34)])

    # Define the schema.
    schema = StructType([StructField("id", StringType()),
                         StructField("location", StringType()),
                         StructField("city", StringType()),
                         StructField("booleanf", BooleanType()),
                         StructField("shortf", ShortType()),
                         StructField("intf", IntegerType()),
                         StructField("longf", LongType()),
                         StructField("floatf", FloatType()),
                         StructField("doublef", DoubleType())])

    # Create a DataFrame from the RDD and schema.
    dataFrame = sparkSession.createDataFrame(dataList, schema)

    # Write data to CloudTable HBase.
    dataFrame.write.insertInto("test_hbase")

    # Set cross-source connection parameters.
    TableName = "table_DupRowkey1"
    RowKey = "id:5,location:6,city:7"
    Cols = "booleanf:CF1.booleanf,shortf:CF1.shortf,intf:CF1.intf,longf:CF1.longf,floatf:CF1.floatf,doublef:CF1.doublef"
    ZKHost = "cloudtable-cf82-zk3-pa6HnHpf.cloudtable.com:2181,cloudtable-cf82-zk2-weBkIrjI.cloudtable.com:2181," \
             "cloudtable-cf82-zk1-WY09px9l.cloudtable.com:2181"

    # Read data from CloudTable HBase.
    jdbcDF = sparkSession.read.schema(schema)\
        .format("hbase")\
        .option("ZKHost", ZKHost)\
        .option("TableName", TableName)\
        .option("RowKey", RowKey)\
        .option("Cols", Cols)\
        .load()
    jdbcDF.filter("id = '12333' or id='11111'").show()

    # Close the session.
    sparkSession.stop()</pre>
</li></ul>
</div>
</div>
<div>
<div class="familylinks">
<div class="parentlink"><strong>Parent topic:</strong> <a href="dli_09_0077.html">Connecting to HBase</a></div>
</div>
</div>