
Running Job on Apache Spark2

Upload sherlock.txt from ~/hadoop-admin/data to HDFS (run the command from that directory)

hdfs dfs -put sherlock.txt /user/training/

Open the PySpark shell on YARN

pyspark --master yarn

Create an RDD from the text file and compute the average word length for each first letter

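# Read the file; the relative path resolves against the user's HDFS home directory
avglens = sc.textFile("sherlock.txt")
avglens
# Split each line into words
avglensFM = avglens.flatMap(lambda line: line.split())
avglensFM
# Key each word by its first character, with the word length as the value
avglensMap = avglensFM.map(lambda word: (word[0], len(word)))
avglensMap
# Group the lengths by first character into 2 partitions
avglensGrp = avglensMap.groupByKey(2)
avglensGrp
# Average the lengths per key; kv[0] is the key, kv[1] the grouped lengths.
# Indexing replaces "lambda (k, values)", which is a syntax error in Python 3;
# float() avoids integer division in Python 2.
avglensGMap = avglensGrp.map(lambda kv: (kv[0], sum(kv[1]) / float(len(kv[1]))))
avglensGMap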
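
To check the logic without a cluster, the steps above (flatMap, map, groupByKey, then an average per key) can be sketched in plain Python. This is only an illustration, not part of the original walkthrough: the function name average_length_by_first_letter and the sample lines are assumptions made up for the example, and it runs on a small in-memory list rather than on sherlock.txt.

```python
from collections import defaultdict


def average_length_by_first_letter(lines):
    """Local, Spark-free sketch of the RDD steps (hypothetical helper)."""
    # flatMap step: split every line into words
    words = [word for line in lines for word in line.split()]
    # map step: pair each word's first character with the word's length
    pairs = [(word[0], len(word)) for word in words]
    # groupByKey step: collect all lengths that share a first character
    groups = defaultdict(list)
    for letter, length in pairs:
        groups[letter].append(length)
    # final map step: average the lengths in each group (true division)
    return {
        letter: sum(lengths) / float(len(lengths))
        for letter, lengths in groups.items()
    }


# Made-up sample input, not taken from sherlock.txt
sample_lines = ["the quick brown fox", "to be or not to be"]
averages = average_length_by_first_letter(sample_lines)
print(sorted(averages.items()))
```

In the PySpark shell itself, the transformations are lazy, so evaluating a variable such as avglensGMap only shows the RDD object. An action such as avglensGMap.collect() is what actually submits the job to YARN and returns the (letter, average) pairs.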

Last updated 3 years ago