Python Spark Certification Training using PySpark

Secure enrollment now


Introduction to Big Data Hadoop and Spark


•    What is Big Data?
•    Big Data Customer Scenarios
•    Limitations and Solutions of Existing Data Analytics Architecture with Uber Use Case

•    How Hadoop Solves the Big Data Problem?
•    What is Hadoop?
•    Hadoop’s Key Characteristics
•    Hadoop Ecosystem and HDFS
•    Hadoop Core Components
•    Rack Awareness and Block Replication
•    YARN and its Advantage
•    Hadoop Cluster and its Architecture
•    Hadoop: Different Cluster Modes
•    Big Data Analytics with Batch & Real-Time Processing
•    Why Spark is Needed?
•    What is Spark?
•    How Spark Differs from its Competitors?
•    Spark at eBay
•    Spark’s Place in Hadoop Ecosystem

Introduction to Python for Apache Spark

•    Overview of Python
•    Different Applications where Python is Used
•    Values, Types, Variables
•    Operands and Expressions
•    Conditional Statements
•    Loops
•    Command Line Arguments
•    Writing to the Screen
•    Python files I/O Functions
•    Numbers
•    Strings and related operations
•    Tuples and related operations
•    Lists and related operations
•    Dictionaries and related operations
•    Sets and related operations
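The core data types listed above can be exercised in a few lines of plain Python; a minimal sketch tying several of these topics together (the variable names are illustrative):

```python
# Values, types, and variables
name = "Spark"                     # str
count = 3                          # int

# Strings and related operations
greeting = f"Hello, {name}!"       # f-string formatting
print(greeting.upper())            # HELLO, SPARK!

# Tuples are immutable; lists are mutable
point = (1, 2)
nums = [4, 1, 3]
nums.sort()                        # in-place sort

# Dictionaries map keys to values
ages = {"alice": 30, "bob": 25}
ages["carol"] = 35

# Sets hold unique elements and support set algebra
evens = {2, 4, 6}
small = {1, 2, 3}
print(evens & small)               # intersection: {2}
```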


Functions, OOPs, and Modules in Python


•    Functions
•    Function Parameters
•    Global Variables
•    Variable Scope and Returning Values
•    Lambda Functions
•    Object-Oriented Concepts
•    Standard Libraries
•    Modules Used in Python
•    The Import Statements
•    Module Search Path
•    Package Installation Ways
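The function, scope, and object-oriented topics in this module can be illustrated with a short sketch (the names `scale`, `bump`, and `Greeter` are made up for the example):

```python
# A function with a default parameter and an explicit return value
def scale(values, factor=2):
    """Return a new list with every element multiplied by factor."""
    return [v * factor for v in values]

# Lambda functions: small anonymous functions, often used with map/filter
squares = list(map(lambda x: x * x, [1, 2, 3]))

# Variable scope: rebinding a module-level name requires `global`
counter = 0
def bump():
    global counter
    counter += 1
bump()

# A minimal class to illustrate object-oriented concepts
class Greeter:
    def __init__(self, name):
        self.name = name
    def greet(self):
        return f"Hello, {self.name}"

print(scale([1, 2, 3]))            # [2, 4, 6]
print(squares)                     # [1, 4, 9]
print(Greeter("Ada").greet())      # Hello, Ada
```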


Deep Dive into Apache Spark Framework


•    Spark Components & its Architecture
•    Spark Deployment Modes
•    Introduction to PySpark Shell
•    Submitting PySpark Job
•    Spark Web UI
•    Writing your first PySpark Job Using Jupyter Notebook
•    Data Ingestion using Sqoop


Playing with Spark RDDs


•    Challenges in Existing Computing Methods
•    Probable Solution & How RDD Solves the Problem
•    What is RDD, Its Operations, Transformations & Actions
•    Data Loading and Saving Through RDDs
•    Key-Value Pair RDDs
•    Other Pair RDDs, Two Pair RDDs
•    RDD Lineage
•    RDD Persistence
•    WordCount Program Using RDD Concepts
•    RDD Partitioning & How it Helps Achieve Parallelization
•    Passing Functions to Spark


DataFrames and Spark SQL


•    Need for Spark SQL
•    What is Spark SQL
•    Spark SQL Architecture
•    SQL Context in Spark SQL
•    Schema RDDs
•    User Defined Functions
•    Data Frames & Datasets
•    Interoperating with RDDs
•    JSON and Parquet File Formats
•    Loading Data through Different Sources
•    Spark-Hive Integration


Machine Learning using Spark MLlib


•    Why Machine Learning
•    What is Machine Learning
•    Where Machine Learning is used
•    Face Detection: USE CASE
•    Different Types of Machine Learning Techniques
•    Introduction to MLlib
•    Features of MLlib and MLlib Tools
•    Various ML algorithms supported by MLlib

Deep Dive into Spark MLlib

•    Supervised Learning: Linear Regression, Logistic Regression, Decision Tree, Random Forest
•    Unsupervised Learning: K-Means Clustering & How It Works with MLlib
•    Analysis of US Election Data using MLlib (K-Means)


Understanding Apache Kafka and Apache Flume


•    Need for Kafka
•    What is Kafka
•    Core Concepts of Kafka
•    Kafka Architecture
•    Where is Kafka Used
•    Understanding the Components of Kafka Cluster
•    Configuring Kafka Cluster
•    Kafka Producer and Consumer Java API
•    Need of Apache Flume
•    What is Apache Flume
•    Basic Flume Architecture
•    Flume Sources
•    Flume Sinks
•    Flume Channels
•    Flume Configuration
•    Integrating Apache Flume and Apache Kafka


Apache Spark Streaming - Processing Multiple Batches


•    Drawbacks in Existing Computing Methods
•    Why Streaming is Necessary
•    What is Spark Streaming
•    Spark Streaming Features
•    Spark Streaming Workflow
•    How Uber Uses Streaming Data
•    Streaming Context & DStreams
•    Transformations on DStreams
•    Windowed Operators and Why They are Useful
•    Important Windowed Operators
•    Slice, Window and ReduceByWindow Operators
•    Stateful Operators


Apache Spark Streaming - Data Sources


•    Apache Spark Streaming: Data Sources
•    Streaming Data Source Overview
•    Apache Flume and Apache Kafka Data Sources
•    Example: Using a Kafka Direct Data Source


Implementing an End-to-End Project

Project 1- Domain: Finance

Project 2- Domain: Media and Entertainment


Spark GraphX (Self-Paced)

•    Introduction to Spark GraphX
•    Information about a Graph
•    GraphX Basic APIs and Operations
•    Spark GraphX Algorithm - PageRank, Personalized PageRank, Triangle Count, Shortest Paths, Connected Components, Strongly Connected Components, Label Propagation


Complimentary sessions on communication, presentation, and leadership skills.

Benefits of the Course

Mode of Teaching

Live Interactive

Spark is one of the fastest-growing and most widely used tools for Big Data & Analytics. It has been adopted by companies across various domains around the globe and therefore offers promising career opportunities. To take advantage of these opportunities, you need structured training aligned with the Cloudera Hadoop and Spark Developer Certification (CCA175) as well as current industry requirements and best practices. Besides a strong theoretical understanding, strong hands-on experience is essential. You will work on various industry-based use cases and projects that incorporate Big Data and Spark tools as part of the solution strategy.

The Course offers:

  • Overview of Big Data & Hadoop including HDFS (Hadoop Distributed File System), YARN (Yet Another Resource Negotiator)

  • Comprehensive knowledge of the various tools that fall within the Spark Ecosystem, such as Spark SQL, Spark MLlib, Sqoop, Kafka, Flume and Spark Streaming

  • The capability to ingest data into HDFS using Sqoop & Flume, and to analyze those large datasets stored in HDFS

  • The power of handling real-time data feeds through a publish-subscribe messaging system like Kafka

  • The exposure to many real-life industry-based projects which will be executed using Edureka’s CloudLab

  • Projects which are diverse in nature, covering the banking, telecommunication, social media, and government domains

  • Rigorous involvement of an SME (Subject Matter Expert) throughout the Spark training to learn industry standards and best practices




Prerequisites:

  • There are no prerequisites for this training course.

  • However, prior knowledge of Python programming and SQL will be helpful, though not at all mandatory.



Course Duration:

36 Hours

Class Hours:

2-hour weekday slots or 3-hour weekend slots (may change)
