doc-exports/docs/dli/dev/dli_09_0081.html
Hasko, Vladimir cfc48b3aed dli_dev_0104_version
Reviewed-by: Pruthi, Vineet <vineet.pruthi@t-systems.com>
Co-authored-by: Hasko, Vladimir <vladimir.hasko@t-systems.com>
Co-committed-by: Hasko, Vladimir <vladimir.hasko@t-systems.com>
2024-05-06 09:14:57 +00:00


<a name="dli_09_0081"></a>
<h1 class="topictitle1">PySpark Example Code</h1>
<div id="body8662426"><div class="section" id="dli_09_0081__section3685105194914"><h4 class="sectiontitle">Development Description</h4><p id="dli_09_0081__en-us_topic_0197738133_p492312464537">The CloudTable OpenTSDB and MRS OpenTSDB can be connected to DLI as data sources.</p>
<ul id="dli_09_0081__ul62191935508"><li id="dli_09_0081__li221993135018">Prerequisites<p id="dli_09_0081__p246892735015"><a name="dli_09_0081__li221993135018"></a><a name="li221993135018"></a>A datasource connection has been created on the DLI management console. </p>
<div class="note" id="dli_09_0081__note1358715714155"><img src="public_sys-resources/note_3.0-en-us.png"><span class="notetitle"> </span><div class="notebody"><p id="dli_09_0081__p1858718570154">Hard-coded or plaintext passwords pose significant security risks. To ensure security, encrypt your passwords, store them in configuration files or environment variables, and decrypt them when needed.</p>
</div></div>
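<p>The advice in the note above also applies to connection settings such as <strong>Host</strong>. The following is a minimal sketch of reading them from environment variables instead of hard-coding them. It is not part of the DLI API: the helper <strong>load_opentsdb_options</strong> and the variable names <strong>OPENTSDB_HOST</strong>, <strong>OPENTSDB_METRIC</strong>, and <strong>OPENTSDB_TAGS</strong> are hypothetical.</p>

```python
import os


def load_opentsdb_options(env=None):
    """Read OpenTSDB connection options from environment variables.

    The variable names are hypothetical; pick names that fit your deployment.
    """
    env = os.environ if env is None else env
    names = {
        "Host": "OPENTSDB_HOST",
        "metric": "OPENTSDB_METRIC",
        "tags": "OPENTSDB_TAGS",
    }
    # Fail early and name every missing variable in one message.
    missing = [var for var in names.values() if not env.get(var)]
    if missing:
        raise KeyError("missing environment variables: " + ", ".join(sorted(missing)))
    return {option: env[var] for option, var in names.items()}
```

<p>Passing a plain dictionary as <strong>env</strong> makes the helper easy to check without touching the real environment.</p>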
</li><li id="dli_09_0081__li55257218511">Code implementation<ol id="dli_09_0081__en-us_topic_0197738133_ol12123050181818"><li id="dli_09_0081__en-us_topic_0197738133_li1612316509182">Import dependency packages.<div class="codecoloring" codetype="Python" id="dli_09_0081__en-us_topic_0197738133_screen68181719144911"><div class="highlight"><table class="highlighttable"><tr><td class="code"><div><pre><span></span><span class="kn">from</span> <span class="nn">__future__</span> <span class="kn">import</span> <span class="n">print_function</span>
<span class="kn">from</span> <span class="nn">pyspark.sql.types</span> <span class="kn">import</span> <span class="n">StructType</span><span class="p">,</span> <span class="n">StructField</span><span class="p">,</span> <span class="n">StringType</span><span class="p">,</span> <span class="n">LongType</span><span class="p">,</span> <span class="n">DoubleType</span>
<span class="kn">from</span> <span class="nn">pyspark.sql</span> <span class="kn">import</span> <span class="n">SparkSession</span>
</pre></div></td></tr></table></div>
</div>
</li><li id="dli_09_0081__en-us_topic_0197738133_li11272141817195">Create a session.<div class="codecoloring" codetype="Python" id="dli_09_0081__en-us_topic_0197738133_screen2658132002217"><div class="highlight"><table class="highlighttable"><tr><td class="code"><div><pre><span></span><span class="n">sparkSession</span> <span class="o">=</span> <span class="n">SparkSession</span><span class="o">.</span><span class="n">builder</span><span class="o">.</span><span class="n">appName</span><span class="p">(</span><span class="s2">&quot;datasource-opentsdb&quot;</span><span class="p">)</span><span class="o">.</span><span class="n">getOrCreate</span><span class="p">()</span>
</pre></div></td></tr></table></div>
</div>
</li><li id="dli_09_0081__en-us_topic_0197738133_li17698293198">Create a table to connect to an OpenTSDB data source.<div class="codecoloring" codetype="Python" id="dli_09_0081__en-us_topic_0197738133_screen95431138152317"><div class="highlight"><table class="highlighttable"><tr><td class="code"><div><pre><span></span><span class="n">sparkSession</span><span class="o">.</span><span class="n">sql</span><span class="p">(</span><span class="s2">&quot;create table opentsdb_test using opentsdb options(</span><span class="se">\</span>
<span class="s2">'Host'='opentsdb-3xcl8dir15m58z3.cloudtable.com:4242',</span><span class="se">\</span>
<span class="s2">'metric'='ct_opentsdb',</span><span class="se">\</span>
<span class="s2">'tags'='city,location')&quot;</span><span class="p">)</span>
</pre></div></td></tr></table></div>
</div>
<div class="note" id="dli_09_0081__en-us_topic_0197738133_note1376719247267"><img src="public_sys-resources/note_3.0-en-us.png"><span class="notetitle"> </span><div class="notebody"><p id="dli_09_0081__en-us_topic_0197738133_p14768153018616">For details about the <strong id="dli_09_0081__b159035137210">Host</strong>, <strong id="dli_09_0081__b11903111352110">metric</strong>, and <strong id="dli_09_0081__b790420131217">tags</strong> parameters, see <a href="dli_09_0065.html#dli_09_0065__en-us_topic_0190597601_table463015581831">Table 1</a>.</p>
</div></div>
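<p>If the connection values come from configuration rather than literals, the statement above can be assembled by a small helper. This is a sketch under stated assumptions: <strong>build_create_table_sql</strong> is a hypothetical name, and only the <strong>Host</strong>, <strong>metric</strong>, and <strong>tags</strong> options shown in this topic are used.</p>

```python
def build_create_table_sql(table, host, metric, tags):
    """Return a CREATE TABLE statement for an OpenTSDB datasource table."""
    for value in (table, host, metric, tags):
        # Reject quotes so that a value cannot break out of its SQL literal.
        if "'" in value or '"' in value:
            raise ValueError("quotes are not allowed in: %r" % value)
    return (
        "create table {table} using opentsdb options("
        "'Host'='{host}', 'metric'='{metric}', 'tags'='{tags}')"
    ).format(table=table, host=host, metric=metric, tags=tags)
```

<p>The returned string can then be passed to <strong>sparkSession.sql()</strong> in the same way as the literal statement.</p>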
</li></ol>
</li><li id="dli_09_0081__li98591845192715">Connecting to data sources through SQL APIs<ol id="dli_09_0081__ol19158103973018"><li id="dli_09_0081__li12158839163020">Insert data.<pre class="screen" id="dli_09_0081__screen1821218553307">sparkSession.sql("insert into opentsdb_test values('aaa', 'abc', '2021-06-30 18:00:00', 30.0)")</pre>
</li><li id="dli_09_0081__li1687241143016">Query data.<pre class="screen" id="dli_09_0081__screen1913623113213">result = sparkSession.sql("SELECT * FROM opentsdb_test")</pre>
</li></ol>
</li><li id="dli_09_0081__li761708155216">Connecting to data sources through DataFrame APIs<ol id="dli_09_0081__en-us_topic_0197738133_ol62934313101"><li id="dli_09_0081__en-us_topic_0197738133_li4293143141018">Construct a schema.<div class="codecoloring" codetype="Python" id="dli_09_0081__en-us_topic_0197738133_screen6395195210104"><div class="highlight"><table class="highlighttable"><tr><td class="code"><div><pre><span></span><span class="n">schema</span> <span class="o">=</span> <span class="n">StructType</span><span class="p">([</span><span class="n">StructField</span><span class="p">(</span><span class="s2">&quot;location&quot;</span><span class="p">,</span> <span class="n">StringType</span><span class="p">()),</span>\
<span class="n">StructField</span><span class="p">(</span><span class="s2">&quot;name&quot;</span><span class="p">,</span> <span class="n">StringType</span><span class="p">()),</span> \
<span class="n">StructField</span><span class="p">(</span><span class="s2">&quot;timestamp&quot;</span><span class="p">,</span> <span class="n">LongType</span><span class="p">()),</span>\
<span class="n">StructField</span><span class="p">(</span><span class="s2">&quot;value&quot;</span><span class="p">,</span> <span class="n">DoubleType</span><span class="p">())])</span>
</pre></div></td></tr></table></div>
</div>
</li><li id="dli_09_0081__en-us_topic_0197738133_li531012517114">Set data.<div class="codecoloring" codetype="Python" id="dli_09_0081__en-us_topic_0197738133_screen2083911515127"><div class="highlight"><table class="highlighttable"><tr><td class="code"><div><pre><span></span><span class="n">dataList</span> <span class="o">=</span> <span class="n">sparkSession</span><span class="o">.</span><span class="n">sparkContext</span><span class="o">.</span><span class="n">parallelize</span><span class="p">([(</span><span class="s2">&quot;aaa&quot;</span><span class="p">,</span> <span class="s2">&quot;abc&quot;</span><span class="p">,</span> <span class="mi">123456</span><span class="p">,</span> <span class="mf">30.0</span><span class="p">)])</span>
</pre></div></td></tr></table></div>
</div>
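<p>The SQL example inserts the timestamp as a string (<strong>'2021-06-30 18:00:00'</strong>), while the DataFrame schema declares it as <strong>LongType</strong>. Assuming the long value is meant to be a Unix epoch in seconds (an assumption to verify against your OpenTSDB configuration), a conversion could be sketched as follows. The helper <strong>to_epoch_seconds</strong> is hypothetical and interprets its input as UTC.</p>

```python
import calendar
import time


def to_epoch_seconds(text, fmt="%Y-%m-%d %H:%M:%S"):
    """Convert a timestamp string, interpreted as UTC, to epoch seconds."""
    # time.strptime parses the string; calendar.timegm treats the result as UTC.
    return calendar.timegm(time.strptime(text, fmt))
```

<p>The result is a plain integer, so it can be used directly as the third element of each tuple passed to <strong>parallelize()</strong>.</p>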
</li><li id="dli_09_0081__en-us_topic_0197738133_li4741141821116">Create a DataFrame.<div class="codecoloring" codetype="Python" id="dli_09_0081__en-us_topic_0197738133_screen1520319134349"><div class="highlight"><table class="highlighttable"><tr><td class="code"><div><pre><span></span><span class="n">dataFrame</span> <span class="o">=</span> <span class="n">sparkSession</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">(</span><span class="n">dataList</span><span class="p">,</span> <span class="n">schema</span><span class="p">)</span>
</pre></div></td></tr></table></div>
</div>
</li><li id="dli_09_0081__en-us_topic_0197738133_li10107173841110">Import data to OpenTSDB.<div class="codecoloring" codetype="Python" id="dli_09_0081__en-us_topic_0197738133_screen14133320357"><div class="highlight"><table class="highlighttable"><tr><td class="code"><div><pre><span></span><span class="n">dataFrame</span><span class="o">.</span><span class="n">write</span><span class="o">.</span><span class="n">insertInto</span><span class="p">(</span><span class="s2">&quot;opentsdb_test&quot;</span><span class="p">)</span>
</pre></div></td></tr></table></div>
</div>
</li><li id="dli_09_0081__en-us_topic_0197738133_li1956013474119">Read data from OpenTSDB.<div class="codecoloring" codetype="Python" id="dli_09_0081__en-us_topic_0197738133_screen7259610133515"><div class="highlight"><table class="highlighttable"><tr><td class="code"><div><pre><span></span><span class="n">jdbdDF</span> <span class="o">=</span> <span class="n">sparkSession</span><span class="o">.</span><span class="n">read</span>\
<span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="s2">&quot;opentsdb&quot;</span><span class="p">)</span>\
<span class="o">.</span><span class="n">option</span><span class="p">(</span><span class="s2">&quot;Host&quot;</span><span class="p">,</span><span class="s2">&quot;opentsdb-3xcl8dir15m58z3.cloudtable.com:4242&quot;</span><span class="p">)</span>\
<span class="o">.</span><span class="n">option</span><span class="p">(</span><span class="s2">&quot;metric&quot;</span><span class="p">,</span><span class="s2">&quot;ct_opentsdb&quot;</span><span class="p">)</span>\
<span class="o">.</span><span class="n">option</span><span class="p">(</span><span class="s2">&quot;tags&quot;</span><span class="p">,</span><span class="s2">&quot;city,location&quot;</span><span class="p">)</span>\
<span class="o">.</span><span class="n">load</span><span class="p">()</span>
<span class="n">jdbdDF</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
</pre></div></td></tr></table></div>
</div>
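<p>Keeping the <strong>Host</strong>, <strong>metric</strong>, and <strong>tags</strong> values in one dictionary helps ensure that the reader uses the same values as the table definition. The following sketch applies such a dictionary to a reader object. The helper <strong>read_opentsdb</strong> is hypothetical; it relies only on the <strong>format()</strong>, <strong>option()</strong>, and <strong>load()</strong> calls already used above.</p>

```python
def read_opentsdb(reader, options):
    """Apply every option to a DataFrameReader-like object and load the data."""
    reader = reader.format("opentsdb")
    # Sort the keys so that the options are applied in a predictable order.
    for key, value in sorted(options.items()):
        reader = reader.option(key, value)
    return reader.load()
```

<p>In a job, it would be called as <strong>read_opentsdb(sparkSession.read, options)</strong>, where <strong>options</strong> holds the three connection values.</p>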
</li></ol>
</li><li id="dli_09_0081__li1410414355525">Submitting a Spark job<ol id="dli_09_0081__en-us_topic_0197738133_ol612481914610"><li id="dli_09_0081__li525841115179">Upload the Python code file to DLI.<p id="dli_09_0081__p648216175172"><a name="dli_09_0081__li525841115179"></a><a name="li525841115179"></a></p>
<p id="dli_09_0081__p7676212171720"></p>
</li><li id="dli_09_0081__li78195201174">In the Spark job editor, select the corresponding dependency module and execute the Spark job.<p id="dli_09_0081__p5931225131720"><a name="dli_09_0081__li78195201174"></a><a name="li78195201174"></a></p>
<div class="p" id="dli_09_0081__p7319112271716"><div class="note" id="dli_09_0081__en-us_topic_0197738133_note1435543551919"><img src="public_sys-resources/note_3.0-en-us.png"><span class="notetitle"> </span><div class="notebody"><ul id="dli_09_0081__en-us_topic_0197738133_ul17825285811"><li id="dli_09_0081__en-us_topic_0197738142_li58215295819">If the Spark version is 2.3.2 (soon to be taken offline) or 2.4.5, set <strong id="dli_09_0081__b923021854915">Module</strong> to <strong id="dli_09_0081__b3230618174915">sys.datasource.opentsdb</strong> when you submit the job.</li><li id="dli_09_0081__li6624653171317">If the Spark version is 3.1.1, you do not need to select a module. Instead, configure the following <strong id="dli_09_0081__b18248151924914">Spark parameters (--conf)</strong>:<p id="dli_09_0081__p1723617371259">spark.driver.extraClassPath=/usr/share/extension/dli/spark-jar/datasource/opentsdb/*</p>
<p id="dli_09_0081__p6236153714259">spark.executor.extraClassPath=/usr/share/extension/dli/spark-jar/datasource/opentsdb/*</p>
</li></ul>
</div></div>
</div>
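<p>The version rules in the note above can be captured in a small lookup, which is handy when jobs are submitted by a script. This is only a sketch: <strong>submission_settings</strong> and the shape of its return value are hypothetical, and it covers only the three Spark versions named in the note.</p>

```python
OPENTSDB_CLASSPATH = "/usr/share/extension/dli/spark-jar/datasource/opentsdb/*"


def submission_settings(spark_version):
    """Return the module or --conf settings that apply to a Spark version."""
    if spark_version in ("2.3.2", "2.4.5"):
        # Older versions select the dependency module and need no extra conf.
        return {"module": "sys.datasource.opentsdb", "conf": {}}
    if spark_version == "3.1.1":
        # Spark 3.1.1 needs no module, only the two classpath parameters.
        return {
            "module": None,
            "conf": {
                "spark.driver.extraClassPath": OPENTSDB_CLASSPATH,
                "spark.executor.extraClassPath": OPENTSDB_CLASSPATH,
            },
        }
    raise ValueError("no guidance for Spark version: " + spark_version)
```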
</li></ol>
</li></ul>
</div>
<div class="section" id="dli_09_0081__section1783516613536"><h4 class="sectiontitle">Complete Example Code</h4><ul id="dli_09_0081__ul2617145113018"><li id="dli_09_0081__li16176503011">Connecting to MRS OpenTSDB through SQL APIs<pre class="screen" id="dli_09_0081__screen1024318416307"># _*_ coding: utf-8 _*_
from __future__ import print_function
from pyspark.sql.types import StructType, StructField, StringType, LongType, DoubleType
from pyspark.sql import SparkSession
if __name__ == "__main__":
    # Create a SparkSession.
    sparkSession = SparkSession.builder.appName("datasource-opentsdb").getOrCreate()
    # Create a DLI datasource (cross-source) table associated with OpenTSDB.
    sparkSession.sql(
        "create table opentsdb_test using opentsdb options(\
        'Host'='10.0.0.171:4242',\
        'metric'='cts_opentsdb',\
        'tags'='city,location')")
    # Insert one record, then query the table.
    sparkSession.sql("insert into opentsdb_test values('aaa', 'abc', '2021-06-30 18:00:00', 30.0)")
    result = sparkSession.sql("SELECT * FROM opentsdb_test")
    result.show()
    # Close the session.
    sparkSession.stop()</pre>
</li><li id="dli_09_0081__li469501910305">Connecting to OpenTSDB through DataFrame APIs<pre class="screen" id="dli_09_0081__screen1895134416305"># _*_ coding: utf-8 _*_
from __future__ import print_function
from pyspark.sql.types import StructType, StructField, StringType, LongType, DoubleType
from pyspark.sql import SparkSession
if __name__ == "__main__":
    # Create a SparkSession.
    sparkSession = SparkSession.builder.appName("datasource-opentsdb").getOrCreate()
    # Create a DLI datasource (cross-source) table associated with OpenTSDB.
    sparkSession.sql(
        "create table opentsdb_test using opentsdb options(\
        'Host'='opentsdb-3xcl8dir15m58z3.cloudtable.com:4242',\
        'metric'='ct_opentsdb',\
        'tags'='city,location')")
    # Prepare the data for the DataFrame.
    dataList = sparkSession.sparkContext.parallelize([("aaa", "abc", 123456, 30.0)])
    # Define the schema.
    schema = StructType([StructField("location", StringType()),
                         StructField("name", StringType()),
                         StructField("timestamp", LongType()),
                         StructField("value", DoubleType())])
    # Create a DataFrame from the RDD and the schema.
    dataFrame = sparkSession.createDataFrame(dataList, schema)
    # Set the datasource connection parameters. They match the table created above.
    metric = "ct_opentsdb"
    tags = "city,location"
    Host = "opentsdb-3xcl8dir15m58z3.cloudtable.com:4242"
    # Write data to CloudTable OpenTSDB.
    dataFrame.write.insertInto("opentsdb_test")
    # OpenTSDB does not currently implement the CTAS method for saving data, so the save() method cannot be used.
    # dataFrame.write.format("opentsdb").option("Host", Host).option("metric", metric).option("tags", tags).mode("Overwrite").save()
    # Read data from CloudTable OpenTSDB.
    jdbdDF = sparkSession.read\
        .format("opentsdb")\
        .option("Host", Host)\
        .option("metric", metric)\
        .option("tags", tags)\
        .load()
    jdbdDF.show()
    # Close the session.
    sparkSession.stop()</pre>
</li></ul>
</div>
</div>
<div>
<div class="familylinks">
<div class="parentlink"><strong>Parent topic:</strong> <a href="dli_09_0080.html">Connecting to OpenTSDB</a></div>
</div>
</div>