Apache Spark and Scala Certification Training



Introduction to Big Data Hadoop and Spark


•    What is Big Data?
•    Big Data Customer Scenarios
•    Limitations and Solutions of Existing Data Analytics Architecture with Uber Use Case
•    How Does Hadoop Solve the Big Data Problem?
•    What is Hadoop? 
•    Hadoop’s Key Characteristics
•    Hadoop Ecosystem and HDFS
•    Hadoop Core Components
•    Rack Awareness and Block Replication
•    YARN and its Advantage
•    Hadoop Cluster and its Architecture
•    Hadoop: Different Cluster Modes
•    Hadoop Terminal Commands 
•    Big Data Analytics with Batch & Real-time Processing
•    Why is Spark Needed?
•    What is Spark?
•    How Does Spark Differ from Other Frameworks?
•    Spark at Yahoo!


Introduction to Scala for Apache Spark


•    What is Scala? 
•    Why Scala for Spark?
•    Scala in other Frameworks
•    Introduction to Scala REPL
•    Basic Scala Operations
•    Variable Types in Scala
•    Control Structures in Scala 
•    Foreach loop, Functions and Procedures
•    Collections in Scala: Array
•    ArrayBuffer, Map, Tuples, Lists, and more


Functional Programming and OOPs Concepts in Scala


•    Functional Programming
•    Higher Order Functions
•    Anonymous Functions
•    Class in Scala
•    Getters and Setters
•    Custom Getters and Setters
•    Properties with only Getters
•    Auxiliary Constructor and Primary Constructor
•    Singletons
•    Extending a Class 
•    Overriding Methods
•    Traits as Interfaces and Layered Traits


Deep Dive into Apache Spark Framework


•    Spark’s Place in Hadoop Ecosystem
•    Spark Components & its Architecture
•    Spark Deployment Modes
•    Introduction to Spark Shell
•    Writing your first Spark Job Using SBT
•    Submitting Spark Job
•    Spark Web UI
•    Data Ingestion using Sqoop


Playing with Spark RDDs


•    Challenges in Existing Computing Methods
•    Probable Solution & How RDD Solves the Problem
•    What is RDD, Its Operations, Transformations & Actions
•    Data Loading and Saving Through RDDs 
•    Key-Value Pair RDDs
•    Other Pair RDDs, Two Pair RDDs
•    RDD Lineage
•    RDD Persistence
•    WordCount Program Using RDD Concepts
•    RDD Partitioning & How It Helps Achieve Parallelization
•    Passing Functions to Spark


DataFrames and Spark SQL


•    Need for Spark SQL
•    What is Spark SQL? 
•    Spark SQL Architecture
•    SQL Context in Spark SQL
•    User Defined Functions
•    DataFrames & Datasets
•    Interoperating with RDDs
•    JSON and Parquet File Formats
•    Loading Data through Different Sources
•    Spark – Hive Integration


Machine Learning using Spark MLlib


•    Why Machine Learning?
•    What is Machine Learning? 
•    Where is Machine Learning Used?
•    Face Detection: Use Case
•    Different Types of Machine Learning Techniques 
•    Introduction to MLlib
•    Features of MLlib and MLlib Tools
•    Various ML algorithms supported by MLlib

Deep Dive into Spark MLlib

•    Supervised Learning - Linear Regression, Logistic Regression, Decision Tree, Random Forest
•    Unsupervised Learning - K-Means Clustering & How It Works with MLlib
•    Analysis on US Election Data using MLlib (K-Means)


Understanding Apache Kafka and Apache Flume


•    Need for Kafka
•    What is Kafka?
•    Core Concepts of Kafka
•    Kafka Architecture
•    Where is Kafka Used?
•    Understanding the Components of Kafka Cluster
•    Configuring Kafka Cluster
•    Kafka Producer and Consumer Java API
•    Need for Apache Flume
•    What is Apache Flume?
•    Basic Flume Architecture
•    Flume Sources
•    Flume Sinks
•    Flume Channels
•    Flume Configuration
•    Integrating Apache Flume and Apache Kafka


Apache Spark Streaming - Processing Multiple Batches


•    Drawbacks in Existing Computing Methods
•    Why is Streaming Necessary?
•    What is Spark Streaming? 
•    Spark Streaming Features
•    Spark Streaming Workflow 
•    How Uber Uses Streaming Data
•    Streaming Context & DStreams
•    Transformations on DStreams
•    Windowed Operators and Why They Are Useful
•    Important Windowed Operators
•    Slice, Window and ReduceByWindow Operators
•    Stateful Operators


Apache Spark Streaming - Data Sources


•    Apache Spark Streaming: Data Sources
•    Streaming Data Source Overview 
•    Apache Flume and Apache Kafka Data Sources
•    Example: Using a Kafka Direct Data Source
•    Perform Twitter Sentiment Analysis Using Spark Streaming

Complimentary sessions on communication, presentation, and leadership skills.

Benefits from the course

Mode of Teaching

Live Interactive

  • If your company works with the Internet of Things, Spark can support it by running many analytics tasks concurrently. It does this through well-developed machine learning libraries, graph analysis algorithms, and low-latency in-memory data processing.

  • Spark can analyze data transmitted by IoT sensors as continuous, low-latency streams. You can build dashboards that capture and display this data in real time to look for areas of improvement.

  • Spark has dedicated high-level libraries for graph analysis, SQL queries, machine learning, and data streaming, so you can build complex big data analytics workflows with minimal coding.

  • As a data scientist, you can use Scala’s ease of programming and Spark’s framework to prototype solutions that give insight into the analytical model.


  • There are no prerequisites for our Spark and Scala Certification Training. Prior knowledge of Java programming and SQL is helpful but not mandatory.

Course Duration:

30 Hours

Class Hours:

2-hour daytime slots or 3-hour weekend slots (may change)
