forked from docs/doc-exports
Reviewed-by: Pruthi, Vineet <vineet.pruthi@t-systems.com> Co-authored-by: Hasko, Vladimir <vladimir.hasko@t-systems.com> Co-committed-by: Hasko, Vladimir <vladimir.hasko@t-systems.com>
201 lines
18 KiB
HTML
201 lines
18 KiB
HTML
<a name="dli_09_0117"></a><a name="dli_09_0117"></a>
|
|
|
|
<h1 class="topictitle1">PySpark Example Code</h1>
|
|
<div id="body8662426"><div class="section" id="dli_09_0117__section76842055113920"><h4 class="sectiontitle">Development Description</h4><p id="dli_09_0117__en-us_topic_0204096847_p492312464537">Mongo can be connected only through enhanced datasource connections. </p>
|
|
<div class="note" id="dli_09_0117__note12343132893511"><img src="public_sys-resources/note_3.0-en-us.png"><span class="notetitle"> </span><div class="notebody"><p id="dli_09_0117__p1734422863515">DDS is compatible with the MongoDB protocol.</p>
|
|
</div></div>
|
|
<ul id="dli_09_0117__ul11582025144110"><li id="dli_09_0117__li14116829103617">Prerequisites<p id="dli_09_0117__p6629155314372"><a name="dli_09_0117__li14116829103617"></a><a name="li14116829103617"></a>An enhanced datasource connection has been created on the DLI management console and bound to a queue in packages. </p>
|
|
<div class="note" id="dli_09_0117__note1358715714155"><img src="public_sys-resources/note_3.0-en-us.png"><span class="notetitle"> </span><div class="notebody"><p id="dli_09_0117__p692572617287">Hard-coded or plaintext passwords pose significant security risks. To ensure security, encrypt your passwords, store them in configuration files or environment variables, and decrypt them when needed.</p>
|
|
</div></div>
|
|
</li><li id="dli_09_0117__li827910561142">Connecting to data sources through DataFrame APIs<ol id="dli_09_0117__en-us_topic_0204096847_ol62934313101"><li id="dli_09_0117__en-us_topic_0204096847_li17921229203113">Import dependencies.<pre class="screen" id="dli_09_0117__en-us_topic_0204096847_screen1532017333587">from __future__ import print_function
|
|
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
|
|
from pyspark.sql import SparkSession</pre>
|
|
</li><li id="dli_09_0117__en-us_topic_0204096847_li618044220189">Create a session.<div class="codecoloring" codetype="Python" id="dli_09_0117__en-us_topic_0204096847_screen18318195741816"><div class="highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span class="normal">1</span></pre></div></td><td class="code"><div><pre><span></span><span class="n">sparkSession</span> <span class="o">=</span> <span class="n">SparkSession</span><span class="o">.</span><span class="n">builder</span><span class="o">.</span><span class="n">appName</span><span class="p">(</span><span class="s2">"datasource-mongo"</span><span class="p">)</span><span class="o">.</span><span class="n">getOrCreate</span><span class="p">()</span>
|
|
</pre></div></td></tr></table></div>
|
|
|
|
</div>
|
|
</li><li id="dli_09_0117__en-us_topic_0204096847_li362510127192">Set connection parameters.<div class="codecoloring" codetype="Python" id="dli_09_0117__screen538422515319"><div class="highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span class="normal">1</span>
|
|
<span class="normal">2</span>
|
|
<span class="normal">3</span>
|
|
<span class="normal">4</span>
|
|
<span class="normal">5</span>
|
|
<span class="normal">6</span></pre></div></td><td class="code"><div><pre><span></span><span class="n">url</span> <span class="o">=</span> <span class="s2">"192.168.4.62:8635,192.168.5.134:8635/test?authSource=admin"</span>
|
|
<span class="n">uri</span> <span class="o">=</span> <span class="s2">"mongodb://username:pwd@host:8635/db"</span>
|
|
<span class="n">user</span> <span class="o">=</span> <span class="s2">"rwuser"</span>
|
|
<span class="n">database</span> <span class="o">=</span> <span class="s2">"test"</span>
|
|
<span class="n">collection</span> <span class="o">=</span> <span class="s2">"test"</span>
|
|
<span class="n">password</span> <span class="o">=</span> <span class="s2">"######"</span>
|
|
</pre></div></td></tr></table></div>
|
|
|
|
</div>
|
|
<div class="note" id="dli_09_0117__note5393153410317"><img src="public_sys-resources/note_3.0-en-us.png"><span class="notetitle"> </span><div class="notebody"><p id="dli_09_0117__p139353416314">For details about the parameters, see <a href="dli_09_0114.html#dli_09_0114__en-us_topic_0204096844_table2072415395012">Table 1</a>.</p>
|
|
</div></div>
|
|
</li><li id="dli_09_0117__en-us_topic_0204096847_li4741141821116">Create a DataFrame.<div class="codecoloring" codetype="Python" id="dli_09_0117__en-us_topic_0204096847_screen111295412013"><div class="highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span class="normal">1</span>
|
|
<span class="normal">2</span>
|
|
<span class="normal">3</span>
|
|
<span class="normal">4</span>
|
|
<span class="normal">5</span></pre></div></td><td class="code"><div><pre><span></span><span class="n">dataList</span> <span class="o">=</span> <span class="n">sparkSession</span><span class="o">.</span><span class="n">sparkContext</span><span class="o">.</span><span class="n">parallelize</span><span class="p">([(</span><span class="mi">1</span><span class="p">,</span> <span class="s2">"Katie"</span><span class="p">,</span> <span class="mi">19</span><span class="p">),(</span><span class="mi">2</span><span class="p">,</span><span class="s2">"Tom"</span><span class="p">,</span><span class="mi">20</span><span class="p">)])</span>
|
|
<span class="n">schema</span> <span class="o">=</span> <span class="n">StructType</span><span class="p">([</span><span class="n">StructField</span><span class="p">(</span><span class="s2">"id"</span><span class="p">,</span> <span class="n">IntegerType</span><span class="p">(),</span> <span class="kc">False</span><span class="p">),</span>
|
|
<span class="n">StructField</span><span class="p">(</span><span class="s2">"name"</span><span class="p">,</span> <span class="n">StringType</span><span class="p">(),</span> <span class="kc">False</span><span class="p">),</span>
|
|
<span class="n">StructField</span><span class="p">(</span><span class="s2">"age"</span><span class="p">,</span> <span class="n">IntegerType</span><span class="p">(),</span> <span class="kc">False</span><span class="p">)])</span>
|
|
<span class="n">dataFrame</span> <span class="o">=</span> <span class="n">sparkSession</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">(</span><span class="n">dataList</span><span class="p">,</span> <span class="n">schema</span><span class="p">)</span>
|
|
</pre></div></td></tr></table></div>
|
|
|
|
</div>
|
|
</li><li id="dli_09_0117__en-us_topic_0204096847_li10107173841110">Import data to Mongo.<div class="codecoloring" codetype="Python" id="dli_09_0117__en-us_topic_0204096847_screen10458192419221"><div class="highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span class="normal">1</span>
|
|
<span class="normal">2</span>
|
|
<span class="normal">3</span>
|
|
<span class="normal">4</span>
|
|
<span class="normal">5</span>
|
|
<span class="normal">6</span>
|
|
<span class="normal">7</span>
|
|
<span class="normal">8</span>
|
|
<span class="normal">9</span></pre></div></td><td class="code"><div><pre><span></span><span class="n">dataFrame</span><span class="o">.</span><span class="n">write</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="s2">"mongo"</span><span class="p">)</span>
|
|
<span class="o">.</span><span class="n">option</span><span class="p">(</span><span class="s2">"url"</span><span class="p">,</span> <span class="n">url</span><span class="p">)</span>\
|
|
<span class="o">.</span><span class="n">option</span><span class="p">(</span><span class="s2">"uri"</span><span class="p">,</span> <span class="n">uri</span><span class="p">)</span>\
|
|
<span class="o">.</span><span class="n">option</span><span class="p">(</span><span class="s2">"user"</span><span class="p">,</span><span class="n">user</span><span class="p">)</span>\
|
|
<span class="o">.</span><span class="n">option</span><span class="p">(</span><span class="s2">"password"</span><span class="p">,</span><span class="n">password</span><span class="p">)</span>\
|
|
<span class="o">.</span><span class="n">option</span><span class="p">(</span><span class="s2">"database"</span><span class="p">,</span><span class="n">database</span><span class="p">)</span>\
|
|
<span class="o">.</span><span class="n">option</span><span class="p">(</span><span class="s2">"collection"</span><span class="p">,</span><span class="n">collection</span><span class="p">)</span>\
|
|
<span class="o">.</span><span class="n">mode</span><span class="p">(</span><span class="s2">"Overwrite"</span><span class="p">)</span>\
|
|
<span class="o">.</span><span class="n">save</span><span class="p">()</span>
|
|
</pre></div></td></tr></table></div>
|
|
|
|
</div>
|
|
</li><li id="dli_09_0117__en-us_topic_0204096847_li1956013474119">Read data from Mongo.<div class="codecoloring" codetype="Python" id="dli_09_0117__en-us_topic_0204096847_screen1821452911136"><div class="highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span class="normal"> 1</span>
|
|
<span class="normal"> 2</span>
|
|
<span class="normal"> 3</span>
|
|
<span class="normal"> 4</span>
|
|
<span class="normal"> 5</span>
|
|
<span class="normal"> 6</span>
|
|
<span class="normal"> 7</span>
|
|
<span class="normal"> 8</span>
|
|
<span class="normal"> 9</span>
|
|
<span class="normal">10</span></pre></div></td><td class="code"><div><pre><span></span><span class="n">jdbcDF</span> <span class="o">=</span> <span class="n">sparkSession</span><span class="o">.</span><span class="n">read</span>
|
|
<span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="s2">"mongo"</span><span class="p">)</span>\
|
|
<span class="o">.</span><span class="n">option</span><span class="p">(</span><span class="s2">"url"</span><span class="p">,</span> <span class="n">url</span><span class="p">)</span>\
|
|
<span class="o">.</span><span class="n">option</span><span class="p">(</span><span class="s2">"uri"</span><span class="p">,</span> <span class="n">uri</span><span class="p">)</span>\
|
|
<span class="o">.</span><span class="n">option</span><span class="p">(</span><span class="s2">"user"</span><span class="p">,</span><span class="n">user</span><span class="p">)</span>\
|
|
<span class="o">.</span><span class="n">option</span><span class="p">(</span><span class="s2">"password"</span><span class="p">,</span><span class="n">password</span><span class="p">)</span>\
|
|
<span class="o">.</span><span class="n">option</span><span class="p">(</span><span class="s2">"database"</span><span class="p">,</span><span class="n">database</span><span class="p">)</span>\
|
|
<span class="o">.</span><span class="n">option</span><span class="p">(</span><span class="s2">"collection"</span><span class="p">,</span><span class="n">collection</span><span class="p">)</span>\
|
|
<span class="o">.</span><span class="n">load</span><span class="p">()</span>
|
|
<span class="n">jdbcDF</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
|
|
</pre></div></td></tr></table></div>
|
|
|
|
</div>
|
|
</li><li id="dli_09_0117__en-us_topic_0204096847_li106413175281">View the operation result.<p id="dli_09_0117__en-us_topic_0204096847_p18705122316185"><a name="dli_09_0117__en-us_topic_0204096847_li106413175281"></a><a name="en-us_topic_0204096847_li106413175281"></a><span><img id="dli_09_0117__en-us_topic_0204096847_image12705162371811" src="en-us_image_0223996999.png"></span></p>
|
|
</li></ol>
|
|
</li><li id="dli_09_0117__li17970732101510">Connecting to data sources through SQL APIs<ol id="dli_09_0117__en-us_topic_0200509988_ol11291435145014"><li id="dli_09_0117__en-us_topic_0200509988_li7129135105012">Create a table to connect to a Mongo data source.<pre class="screen" id="dli_09_0117__screen364715734220">sparkSession.sql(
|
|
"create table test_dds(id string, name string, age int) using mongo options(
|
|
'url' = '192.168.4.62:8635,192.168.5.134:8635/test?authSource=admin',
|
|
'uri' = 'mongodb://username:pwd@host:8635/db',
|
|
'database' = 'test',
|
|
'collection' = 'test',
|
|
'user' = 'rwuser',
|
|
'password' = '######')")</pre>
|
|
<div class="note" id="dli_09_0117__note18659131651018"><img src="public_sys-resources/note_3.0-en-us.png"><span class="notetitle"> </span><div class="notebody"><p id="dli_09_0117__p565971621015">For details about the parameters, see <a href="dli_09_0114.html#dli_09_0114__en-us_topic_0204096844_table2072415395012">Table 1</a>.</p>
|
|
</div></div>
|
|
</li><li id="dli_09_0117__en-us_topic_0200509988_li08868382528">Insert data.<div class="codecoloring" codetype="Scala" id="dli_09_0117__en-us_topic_0200509988_screen82431853135210"><div class="highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span class="normal">1</span></pre></div></td><td class="code"><div><pre><span></span><span class="n">sparkSession</span><span class="p">.</span><span class="n">sql</span><span class="p">(</span><span class="s">"insert into test_dds values('3', 'Ann',23)"</span><span class="p">)</span>
|
|
</pre></div></td></tr></table></div>
|
|
|
|
</div>
|
|
</li><li id="dli_09_0117__en-us_topic_0200509988_li1777420614539">Query data.<div class="codecoloring" codetype="Scala" id="dli_09_0117__en-us_topic_0200509988_screen7922121675310"><div class="highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span class="normal">1</span></pre></div></td><td class="code"><div><pre><span></span><span class="n">sparkSession</span><span class="p">.</span><span class="n">sql</span><span class="p">(</span><span class="s">"select * from test_dds"</span><span class="p">).</span><span class="n">show</span><span class="p">()</span>
|
|
</pre></div></td></tr></table></div>
|
|
|
|
</div>
|
|
</li></ol>
|
|
</li><li id="dli_09_0117__li157824229206">Submitting a Spark job<ol id="dli_09_0117__en-us_topic_0204096847_ol612481914610"><li id="dli_09_0117__li392319333220">Upload the Python code file to DLI.<p id="dli_09_0117__p1581308173218"><a name="dli_09_0117__li392319333220"></a><a name="li392319333220"></a></p>
|
|
<p id="dli_09_0117__p166633515323"></p>
|
|
</li><li id="dli_09_0117__li3638911143216">In the Spark job editor, select the corresponding dependency module and execute the Spark job.<p id="dli_09_0117__p7741715153212"><a name="dli_09_0117__li3638911143216"></a><a name="li3638911143216"></a></p>
|
|
<div class="p" id="dli_09_0117__p12371913183214"><div class="note" id="dli_09_0117__en-us_topic_0204096847_note1435543551919"><img src="public_sys-resources/note_3.0-en-us.png"><span class="notetitle"> </span><div class="notebody"><ul id="dli_09_0117__en-us_topic_0204096847_ul17825285811"><li id="dli_09_0117__en-us_topic_0197738142_li58215295819">If the Spark version is 2.3.2 (will be offline soon) or 2.4.5, specify the <strong id="dli_09_0117__b727389259105236">Module</strong> to <strong id="dli_09_0117__b711695263105236">sys.datasource.mongo</strong> when you submit a job. </li><li id="dli_09_0117__li6624653171317">If the Spark version is 3.1.1, you do not need to select a module. Configure <strong id="dli_09_0117__b166606555548">Spark parameters (--conf)</strong>.<p id="dli_09_0117__p1520611118290">spark.driver.extraClassPath=/usr/share/extension/dli/spark-jar/datasource/mongo/*</p>
|
|
<p id="dli_09_0117__p182061411152917">spark.executor.extraClassPath=/usr/share/extension/dli/spark-jar/datasource/mongo/*</p>
|
|
</li></ul>
|
|
</div></div>
|
|
</div>
|
|
</li></ol>
|
|
</li></ul>
|
|
</div>
|
|
<div class="section" id="dli_09_0117__section6977544209"><h4 class="sectiontitle">Complete Example Code</h4><ul id="dli_09_0117__ul1583155162117"><li id="dli_09_0117__li14583205112110">Connecting to data sources through DataFrame APIs<pre class="screen" id="dli_09_0117__en-us_topic_0204096848_screen1457910412188">from __future__ import print_function
|
|
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
|
|
from pyspark.sql import SparkSession
|
|
|
|
if __name__ == "__main__":
|
|
# Create a SparkSession session.
|
|
sparkSession = SparkSession.builder.appName("datasource-mongo").getOrCreate()
|
|
|
|
# Create a DataFrame and initialize the DataFrame data.
|
|
dataList = sparkSession.sparkContext.parallelize([("1", "Katie", 19),("2","Tom",20)])
|
|
|
|
# Setting schema
|
|
schema = StructType([StructField("id", IntegerType(), False),StructField("name", StringType(), False), StructField("age", IntegerType(), False)])
|
|
|
|
# Create a DataFrame from RDD and schema
|
|
dataFrame = sparkSession.createDataFrame(dataList, schema)
|
|
|
|
# Setting connection parameters
|
|
url = "192.168.4.62:8635,192.168.5.134:8635/test?authSource=admin"
|
|
uri = "mongodb://username:pwd@host:8635/db"
|
|
user = "rwuser"
|
|
database = "test"
|
|
collection = "test"
|
|
password = "######"
|
|
|
|
# Write data to the mongodb table
|
|
dataFrame.write.format("mongo")
|
|
.option("url", url)\
|
|
.option("uri", uri)\
|
|
.option("user",user)\
|
|
.option("password",password)\
|
|
.option("database",database)\
|
|
.option("collection",collection)
|
|
.mode("Overwrite").save()
|
|
|
|
# Read data
|
|
jdbcDF = sparkSession.read.format("mongo")
|
|
.option("url", url)\
|
|
.option("uri", uri)\
|
|
.option("user",user)\
|
|
.option("password",password)\
|
|
.option("database",database)\
|
|
.option("collection",collection)\
|
|
.load()
|
|
jdbcDF.show()
|
|
|
|
# close session
|
|
sparkSession.stop()</pre>
|
|
</li><li id="dli_09_0117__li16299112722117">Connecting to data sources through SQL APIs<pre class="screen" id="dli_09_0117__screen549814239433">from __future__ import print_function
|
|
from pyspark.sql import SparkSession
|
|
|
|
if __name__ == "__main__":
|
|
# Create a SparkSession session.
|
|
sparkSession = SparkSession.builder.appName("datasource-mongo").getOrCreate()
|
|
|
|
# Create a data table for DLI - associated mongo
|
|
sparkSession.sql(
|
|
"create table test_dds(id string, name string, age int) using mongo options(\
|
|
'url' = '192.168.4.62:8635,192.168.5.134:8635/test?authSource=admin',\
|
|
'uri' = 'mongodb://username:pwd@host:8635/db',\
|
|
'database' = 'test',\
|
|
'collection' = 'test', \
|
|
'user' = 'rwuser', \
|
|
'password' = '######')")
|
|
|
|
# Insert data into the DLI-table
|
|
sparkSession.sql("insert into test_dds values('3', 'Ann',23)")
|
|
|
|
# Read data from DLI-table
|
|
sparkSession.sql("select * from test_dds").show()
|
|
|
|
# close session
|
|
sparkSession.stop()</pre>
|
|
</li></ul>
|
|
</div>
|
|
</div>
|
|
<div>
|
|
<div class="familylinks">
|
|
<div class="parentlink"><strong>Parent topic:</strong> <a href="dli_09_0113.html">Connecting to Mongo</a></div>
|
|
</div>
|
|
</div>
|
|
|