forked from docs/doc-exports
Reviewed-by: Pruthi, Vineet <vineet.pruthi@t-systems.com> Co-authored-by: chenxiaoxiong <chenxiaoxiong@huawei.com> Co-committed-by: chenxiaoxiong <chenxiaoxiong@huawei.com>
<a name="dataartsstudio_01_0525"></a>
<h1 class="topictitle1">Developing an MRS Spark Python Job</h1>
<div id="body0000001191924299"><p id="dataartsstudio_01_0525__p8060118">This section describes how to develop an MRS Spark Python job on <span id="dataartsstudio_01_0525__en-us_topic_0127305016_text1666316172612">DataArts Factory</span>.</p>
<div class="section" id="dataartsstudio_01_0525__section1015762201619"><h4 class="sectiontitle">Case 1: Using an MRS Spark Python Job to Count the Number of Words</h4><p id="dataartsstudio_01_0525__p1980141611462"><strong id="dataartsstudio_01_0525__b14747150473">Prerequisites</strong></p>
<p id="dataartsstudio_01_0525__p637143319460">You have the permission to access OBS paths.</p>
<p id="dataartsstudio_01_0525__p2517182614161"><strong id="dataartsstudio_01_0525__b1343184710820">Data preparation</strong></p>
</div>
<ul id="dataartsstudio_01_0525__ul23017427610"><li id="dataartsstudio_01_0525__li143017421268">Prepare the script file <strong id="dataartsstudio_01_0525__b149000211241">wordcount.py</strong> with the following content:<pre class="screen" id="dataartsstudio_01_0525__screen514018569136"># -*- coding: utf-8 -*-
import sys
from pyspark import SparkConf, SparkContext


def show(x):
    print(x)


if __name__ == "__main__":
    # The script needs both an input path and an output path.
    if len(sys.argv) &lt; 3:
        print("Usage: wordcount &lt;inputPath&gt; &lt;outputPath&gt;")
        sys.exit(-1)
    # Create SparkConf.
    conf = SparkConf().setAppName("wordcount")
    # Create SparkContext. Pass the conf=conf parameter.
    sc = SparkContext(conf=conf)
    inputPath = sys.argv[1]
    outputPath = sys.argv[2]
    lines = sc.textFile(name=inputPath)
    # Split each line of data by space to obtain words.
    words = lines.flatMap(lambda line: line.split(" "), True)
    # Map each word to a (word, 1) tuple.
    pairWords = words.map(lambda word: (word, 1), True)
    # Sum the counts of each word with reduceByKey.
    result = pairWords.reduceByKey(lambda v1, v2: v1 + v2)
    # Print the result.
    result.foreach(lambda t: show(t))
    # Save the result to a file.
    result.saveAsTextFile(outputPath)
    # Stop SparkContext.
    sc.stop()</pre>
<div class="note" id="dataartsstudio_01_0525__note1675782613610"><img src="public_sys-resources/note_3.0-en-us.png"><span class="notetitle"> </span><div class="notebody"><p id="dataartsstudio_01_0525__p475820261167">The encoding format must be set to UTF-8. Otherwise, an error will occur during script execution.</p>
</div></div>
</li><li id="dataartsstudio_01_0525__li05449715">Prepare the data file <strong id="dataartsstudio_01_0525__b628614482715">in.txt</strong>, which contains some English words.</li></ul>
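<p>Before uploading the files, it can help to sanity-check the expected counts locally. The following is a minimal sketch and not part of the job itself: it mirrors the split, pair, and sum steps of <strong>wordcount.py</strong> in plain Python, without Spark. The function name <strong>count_words</strong> and the sample lines are hypothetical and used only for illustration.</p>

```python
from collections import Counter


def count_words(lines):
    """Mirror the flatMap / map / reduceByKey steps on a plain list of lines."""
    counts = Counter()
    for line in lines:
        # Same split rule as the script: a single space is the only separator,
        # so consecutive spaces produce empty-string "words".
        for word in line.split(" "):
            counts[word] += 1
    return dict(counts)


if __name__ == "__main__":
    # Hypothetical stand-in for the content of in.txt.
    sample = ["hello spark", "hello python"]
    for word, count in sorted(count_words(sample).items()):
        print((word, count))
```

<p>Because the split rule is a single space, a line with two consecutive spaces yields an empty string as a word. The Spark script behaves the same way, so it is worth keeping <strong>in.txt</strong> free of repeated spaces if empty entries are not wanted in the result.</p>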
<p id="dataartsstudio_01_0525__p103193125213"><strong id="dataartsstudio_01_0525__b2020075891118">Procedure</strong></p>
<ol id="dataartsstudio_01_0525__ol8971437141917"><li id="dataartsstudio_01_0525__li10971133721911"><span>Upload the script and data file to the OBS bucket.</span><p><div class="fignone" id="dataartsstudio_01_0525__en-us_topic_0127305016_fig693875618223"><span class="figcap"><b>Figure 1 </b>Uploading files to an OBS bucket</span><br><span><img id="dataartsstudio_01_0525__image1652911491562" src="en-us_image_0000002269198269.png" title="Click to enlarge" class="imgResize"></span></div>
<div class="note" id="dataartsstudio_01_0525__note34461964110"><img src="public_sys-resources/note_3.0-en-us.png"><span class="notetitle"> </span><div class="notebody"><p id="dataartsstudio_01_0525__p2012532714910">In this example, upload <strong id="dataartsstudio_01_0525__b10983127183717">wordcount.py</strong> and <strong id="dataartsstudio_01_0525__b1753202211419">in.txt</strong> to <strong id="dataartsstudio_01_0525__b14899143882616">obs://obs-tongji/python/</strong>.</p>
</div></div>
</p></li><li id="dataartsstudio_01_0525__li11509836152014"><span>Create an empty job named <strong id="dataartsstudio_01_0525__b311712131618">job_MRS_Spark_Python</strong>.</span><p><div class="fignone" id="dataartsstudio_01_0525__fig12954111473116"><span class="figcap"><b>Figure 2 </b>Creating a job</span><br><span><img id="dataartsstudio_01_0525__image17444543415" src="en-us_image_0000002269198253.png" title="Click to enlarge" class="imgResize"></span></div>
</p></li><li id="dataartsstudio_01_0525__li4417847103416"><span>Go to the job development page, drag the <strong id="dataartsstudio_01_0525__b1818018392166">MRS Spark Python</strong> node to the canvas, and click the node to configure its properties.</span><p><div class="fignone" id="dataartsstudio_01_0525__fig129571684325"><span class="figcap"><b>Figure 3 </b>Configuring properties for an MRS Spark Python node</span><br><span><img id="dataartsstudio_01_0525__image19420194412315" src="en-us_image_0000002234078972.png" title="Click to enlarge" class="imgResize"></span></div>
<p id="dataartsstudio_01_0525__p856625694611">Parameter descriptions:</p>
<pre class="screen" id="dataartsstudio_01_0525__screen95612498611">--master
yarn
--deploy-mode
cluster
obs://obs-tongji/python/wordcount.py
obs://obs-tongji/python/in.txt
obs://obs-tongji/python/out</pre>
<p id="dataartsstudio_01_0525__p1757010359476">Specifically:</p>
<p id="dataartsstudio_01_0525__p17570193516475"><strong id="dataartsstudio_01_0525__b124738032711">obs://obs-tongji/python/wordcount.py</strong> is the OBS path of the script file.</p>
<p id="dataartsstudio_01_0525__p1457033515471"><strong id="dataartsstudio_01_0525__b42525207278">obs://obs-tongji/python/in.txt</strong> is the input file passed to <strong id="dataartsstudio_01_0525__b9274162015288">wordcount.py</strong> as its first argument. It contains the words to count.</p>
<p id="dataartsstudio_01_0525__p357012353473"><strong id="dataartsstudio_01_0525__b1233114792815">obs://obs-tongji/python/out</strong> is the output directory, passed to the script as its second argument. The job creates this directory in the OBS bucket automatically. If the <strong id="dataartsstudio_01_0525__b17652455132911">out</strong> directory already exists in the OBS bucket, an error will occur.</p>
</p></li><li id="dataartsstudio_01_0525__li10787161819415"><span>Click <strong id="dataartsstudio_01_0525__b12426104762010">Test</strong> to execute the script job.</span></li><li id="dataartsstudio_01_0525__li5407621112112"><span>After the test is complete, click <strong id="dataartsstudio_01_0525__b19281425161314">Submit</strong>.</span></li><li id="dataartsstudio_01_0525__li10469112515717"><span>Choose <strong id="dataartsstudio_01_0525__b10184184820216">Monitor Job</strong> in the navigation pane and view the job execution result.</span><p><div class="fignone" id="dataartsstudio_01_0525__fig2166132014333"><span class="figcap"><b>Figure 4 </b>Viewing the job execution result</span><br><span><img id="dataartsstudio_01_0525__image39035832710" src="en-us_image_0000002269198261.png" title="Click to enlarge" class="imgResize"></span></div>
<p id="dataartsstudio_01_0525__p148581113111">The job log shows that the job was successfully executed.</p>
<div class="fignone" id="dataartsstudio_01_0525__fig1142161012345"><span class="figcap"><b>Figure 5 </b>Job run logs</span><br><span><img id="dataartsstudio_01_0525__image1499152517817" src="en-us_image_0000002269118173.png" title="Click to enlarge" class="imgResize"></span></div>
<div class="fignone" id="dataartsstudio_01_0525__fig14657143511920"><span class="figcap"><b>Figure 6 </b>Job execution status</span><br><span><img id="dataartsstudio_01_0525__image1993125151015" src="en-us_image_0000002234238812.png" title="Click to enlarge" class="imgResize"></span></div>
</p></li><li id="dataartsstudio_01_0525__li2043504613010"><span>View the returned records in the OBS bucket. (Skip this step if the return function is not configured.)</span><p><div class="fignone" id="dataartsstudio_01_0525__fig6206104393518"><span class="figcap"><b>Figure 7 </b>Viewing the returned records in the OBS bucket</span><br><span><img id="dataartsstudio_01_0525__image6579535971" src="en-us_image_0000002234238820.png" title="Click to enlarge" class="imgResize"></span></div>
</p></li></ol>
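<p>The records written by the job can be read back for a quick check. The sketch below assumes that each line of the output part files is the text form of a (word, count) tuple, for example <strong>('hello', 2)</strong>, which is how the tuples in <strong>result</strong> are expected to be written by <strong>saveAsTextFile</strong>. The function name <strong>parse_result_lines</strong> and the sample lines are hypothetical, and downloading the part files from OBS is not shown.</p>

```python
import ast


def parse_result_lines(lines):
    """Parse lines such as "('hello', 2)" into a {word: count} dictionary."""
    counts = {}
    for raw in lines:
        text = raw.strip()
        if not text:
            continue  # Skip blank lines.
        # literal_eval only accepts Python literals, so it does not run code.
        word, count = ast.literal_eval(text)
        # Add up counts in case the same word appears in several part files.
        counts[word] = counts.get(word, 0) + count
    return counts


if __name__ == "__main__":
    # Hypothetical content of a part file under obs://obs-tongji/python/out.
    sample = ["('hello', 2)", "('spark', 1)"]
    print(parse_result_lines(sample))
```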
<div class="section" id="dataartsstudio_01_0525__section11329152875915"><h4 class="sectiontitle">Case 2: Using an MRS Spark Python Job to Print <strong id="dataartsstudio_01_0525__b15630205511145">hello python</strong></h4><p id="dataartsstudio_01_0525__p136375517475"><strong id="dataartsstudio_01_0525__b4637451124717">Prerequisites</strong></p>
<p id="dataartsstudio_01_0525__p46372051184717">You have the permission to access OBS paths.</p>
<p id="dataartsstudio_01_0525__p142358344593"><strong id="dataartsstudio_01_0525__b042563172319">Data preparation</strong></p>
<p id="dataartsstudio_01_0525__p2333181920317">Prepare the script file <strong id="dataartsstudio_01_0525__b1730053492313">zt_test_sparkPython1.py</strong> with the following content:</p>
<pre class="screen" id="dataartsstudio_01_0525__screen2506152615910">from pyspark import SparkContext, SparkConf

conf = SparkConf().setAppName("master").setMaster("yarn")
sc = SparkContext(conf=conf)
print("hello python")
sc.stop()</pre>
<p id="dataartsstudio_01_0525__p1415571521016"><strong id="dataartsstudio_01_0525__b41679125214">Procedure</strong></p>
<ol id="dataartsstudio_01_0525__ol1528952325412"><li id="dataartsstudio_01_0525__li152891523155410"><span>Upload the script file to an OBS bucket.</span></li><li id="dataartsstudio_01_0525__li1032393614546"><span>Create an empty job.</span></li><li id="dataartsstudio_01_0525__li205021427125516"><span>Go to the job development page, drag the <strong id="dataartsstudio_01_0525__b539216165240">MRS Spark Python</strong> node to the canvas, and click the node to configure its properties.</span><p><p id="dataartsstudio_01_0525__p625266165716">Parameter descriptions:</p>
<pre class="screen" id="dataartsstudio_01_0525__screen35791526205717">--master
yarn
--deploy-mode
cluster
obs://obs-tongji/python/zt_test_sparkPython1.py</pre>
<p id="dataartsstudio_01_0525__p13501193219576"><strong id="dataartsstudio_01_0525__b126112032162420">obs://obs-tongji/python/zt_test_sparkPython1.py</strong> is the OBS path of the script file.</p>
</p></li><li id="dataartsstudio_01_0525__li1216122385816"><span>Click <strong id="dataartsstudio_01_0525__b19276124182415">Test</strong> to execute the script job.</span></li><li id="dataartsstudio_01_0525__li10907164572110"><span>After the test is complete, click <strong id="dataartsstudio_01_0525__b1686919128152">Submit</strong>.</span></li><li id="dataartsstudio_01_0525__li1315555135917"><span>Choose <strong id="dataartsstudio_01_0525__b193934612419">Monitor Job</strong> in the navigation pane and view the job execution result.</span><p><div class="fignone" id="dataartsstudio_01_0525__fig135051165371"><span class="figcap"><b>Figure 8 </b>Viewing the job execution result</span><br><span><img id="dataartsstudio_01_0525__image1852423912614" src="en-us_image_0000002234078988.png" title="Click to enlarge" class="imgResize"></span></div>
</p></li><li id="dataartsstudio_01_0525__li1462614201202"><span>Verify the log.</span><p><p id="dataartsstudio_01_0525__p9354338811">Log in to MRS Manager and check that the log on YARN contains <strong id="dataartsstudio_01_0525__b1391412391115">hello python</strong>.</p>
<div class="fignone" id="dataartsstudio_01_0525__fig1995095444813"><span class="figcap"><b>Figure 9 </b>Viewing logs on YARN</span><br><span><img id="dataartsstudio_01_0525__image1844911133495" src="en-us_image_0000002234078980.png" title="Click to enlarge" class="imgResize"></span></div>
</p></li></ol>
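<p>The parameter lists in both cases share one layout: option and value pairs first, then the script path, then any script arguments. The sketch below separates these three parts. It assumes that every option starting with <strong>--</strong> is followed by exactly one value, which holds for <strong>--master</strong> and <strong>--deploy-mode</strong> but not necessarily for other options. The function name <strong>split_parameters</strong> is hypothetical.</p>

```python
def split_parameters(tokens):
    """Split a parameter list into (options, script_path, script_args).

    Assumption: each option starting with "--" takes exactly one value.
    """
    options = {}
    rest = list(tokens)
    # Consume leading "--option value" pairs.
    while len(rest) >= 2 and rest[0].startswith("--"):
        options[rest[0]] = rest[1]
        rest = rest[2:]
    # The first remaining token must be the script path.
    if not rest or rest[0].startswith("--"):
        raise ValueError("no script path found after the options")
    return options, rest[0], rest[1:]


if __name__ == "__main__":
    params = [
        "--master", "yarn",
        "--deploy-mode", "cluster",
        "obs://obs-tongji/python/wordcount.py",
        "obs://obs-tongji/python/in.txt",
        "obs://obs-tongji/python/out",
    ]
    options, script, args = split_parameters(params)
    print(options)
    print(script)
    print(args)
```

<p>For Case 2 the same function returns an empty argument list, because <strong>zt_test_sparkPython1.py</strong> takes no arguments.</p>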
</div>
</div>
<div>
<div class="familylinks">
<div class="parentlink"><strong>Parent topic:</strong> <a href="dataartsstudio_01_0520.html">Usage Guidance</a></div>
</div>
</div>