Big Data


Objective:

The course begins with an introduction to Big Data, Hadoop, and Hadoop's components. Its two main areas of focus are Apache Spark and Scala. Participants start with the basics of each and then progress to advanced topics. By the end of the course, participants should be able to write Spark and Scala programs that solve complex business problems involving big data. A use case at the end of the course helps put the learning into action.

Learning Outcomes:

1. Learn the basic and advanced features and applications of Hadoop
2. Gain a deeper understanding of Spark and Scala
3. Learn to solve complex analytical and business problems on big data, with implementations in Spark and Scala

Lecture-wise Content (1 hour per lecture):

Session  Topic
1        Introduction to Big Data
2        Big Data analytics in a business context
3        Understanding Hadoop and its components
4        Understanding HDFS and MapReduce
5        Understanding Hadoop 2.x, including YARN
6        Introduction to Spark: Spark Core, Spark SQL, Spark Streaming, MLlib, GraphX, cluster managers
7        Python and Scala shells for Spark, core Spark concepts, standalone applications using Spark
8        RDD basics: creating and working with RDDs, transformations and actions on RDDs
9        Working with pair RDDs: transformations and actions on pair RDDs, advanced topics such as partitioning
10       Working with different data files: text files, JSON, CSV, sequence files, object files. Working with structured data and databases
11       Advanced programming with Spark: accumulators, broadcast variables, piping to external programs, and numeric operations on RDDs
12       Runtime architecture, spark-submit, packaging code and dependencies for execution, cluster managers
13       Administering, tuning, and debugging Spark using SparkConf, working with the web UI, and key performance considerations
14       Working with Spark SQL: data from Hive, Parquet, and other tools, Spark SQL UDFs, Hive UDFs
15       Machine learning with MLlib: data types and algorithms
16       Classification, regression, and clustering
17       Introduction to Scala and its components
18       Working with data in Scala: data types, literals, and variables
19       Expressions and conditionals in Scala: understanding the control structures
20       Working with functions in Scala, part 1
21       Working with functions in Scala, part 2
22       First-class functions in Scala
23       Working with common collections in Scala: lists, sets, and maps
24       Advanced collections: mutable collections, arrays, sequences, streams, and monadic collections
25       Introduction to object-oriented programming in a Big Data context
26       OOP: classes (defining, packaging, methods, modifiers, and sealed classes)
27       OOP: objects, case classes, importing instance variables
28       Advanced topics in Scala: tuple and function value classes, implicit parameters, and implicit classes
29       Case study: putting the learning into action
30       Case study: putting the learning into action