Processing Big Data with Spark
Apache Spark is a well-established system for processing Big Data. A large part of Spark's appeal is that programmers can create jobs involving multiple MapReduce stages in only a few lines of clearly readable code. Furthermore, Spark boasts better execution times than existing MapReduce implementations such as Hadoop, giving developers a strong incentive to use it. In this lab we will use the Scala programming language to perform various tasks in Spark.
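To make that "multiple MapReduce stages in a few lines" claim concrete, here is a minimal word-count sketch (not part of the lab exercises). It assumes the Spark shell's built-in SparkContext sc and a hypothetical local text file input.txt.

    // Word count: a map stage and a reduce stage chained in a few lines.
    // Assumes the Spark shell's built-in SparkContext `sc` and a
    // hypothetical local text file input.txt.
    val counts = sc.textFile("input.txt")
      .flatMap(line => line.split(" "))  // map stage: split each line into words
      .map(word => (word, 1))            // emit a (word, 1) pair per word
      .reduceByKey(_ + _)                // reduce stage: sum the counts per word
    counts.take(10).foreach(println)     // action: run the job, print 10 results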
At various points in this lab you will be directed to write your own code. You should keep track of your answers where indicated in the provided text file lab05.scala so that you can refer to them later.
Spark has many API calls, and many of them are very powerful. It is highly recommended that you get familiar with the website that lists all the Spark API calls; you can look at the full list there.
In particular, look at the left-hand column of the web page and find the two entries RDD and PairRDDFunctions. These contain by far the most useful Spark functions. The API documentation on the Spark website has very few examples, so although you can see the full list of API calls, you may not know how to use them. Fortunately, Matthias and Zhen have gone over the entire Spark RDD API and have written examples for almost all of the API calls.

You can see the examples here: Spark API Examples
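As a taste of the two interfaces, here is a hedged sketch (not taken from those examples) showing one plain RDD function and one PairRDDFunctions function. It assumes a running Spark shell where sc is defined; the exact output ordering may vary.

    // distinct comes from RDD; reduceByKey comes from PairRDDFunctions,
    // which Spark makes available automatically on RDDs of (key, value) pairs.
    val nums = sc.parallelize(Seq(1, 2, 2, 3, 3, 3))
    nums.distinct().collect()            // e.g. Array(1, 2, 3)

    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
    pairs.reduceByKey(_ + _).collect()   // e.g. Array((a,4), (b,2))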
Just like in the previous lab, there are a lot of code examples already given to you. Please type them all into the Spark shell and see what happens; this is the best way to learn. Don't worry, there are still plenty of exercises left that will require you to apply your knowledge. In fact, there are six exercises for you to do by yourself. Once again, there is a helper video available in ECHO360, so watch that before attempting the lab.

Task A: Creating RDDs and performing operations on them
You like the sound of a faster-and-easier-MapReduce-thingermejigger, or Spark as it's normally referred to. You also know a bit about coding in Scala, so you're ready to jump right in!
1. In Spark, data is stored in RDDs (Resilient Distributed Datasets). You can think of RDDs as immutable Scala collections whose data is spread across multiple cluster nodes. Each RDD is divided into multiple partitions, and each partition is processed by one worker thread. There are two main ways of creating RDDs. The first is to call the parallelize method to create an RDD from a Scala sequence collection; the second is to load data from an input file. We will focus on the first way in this task (see the short sketch after this list) and show you the second way in Task C.
2. Open a terminal, then start a Spark shell.
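As a first taste of the parallelize approach from step 1, here is a minimal sketch of a Spark-shell session. The command to launch the shell (typically spark-shell) and the exact outputs depend on your installation.

    // Inside the Spark shell (started with `spark-shell` from a terminal),
    // a SparkContext is already bound to the name `sc`.
    val rdd = sc.parallelize(1 to 10)   // build an RDD from a Scala range
    rdd.getNumPartitions                // number of partitions the RDD was split into
    rdd.map(_ * 2).collect()            // transformation + action: Array(2, 4, ..., 20)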