

Data Engineering with PySpark
In this course, you'll learn how to use Spark from Python! Spark is a tool for doing parallel computation with large datasets, and it integrates well with Python. PySpark is the Python package that makes the magic happen. You'll use this package to work with real-time data, learn to wrangle it, and build a complete machine learning pipeline. Get ready to put some Spark in your Python code and dive into the world of high-performance machine learning; a small taste of what that looks like is sketched below. This course is uniquely designed around that journey.
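To give a flavor of what you will build, here is a minimal, illustrative PySpark sketch (the file name and column names are hypothetical placeholders):

# Minimal PySpark sketch: start a session, load a CSV and run a simple aggregation.
# The file path and column names are hypothetical, for illustration only.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("intro-example").getOrCreate()

df = spark.read.csv("flights.csv", header=True, inferSchema=True)  # hypothetical dataset
df.printSchema()

# Average departure delay per origin airport (assumed columns)
(df.groupBy("origin")
   .agg(F.avg("dep_delay").alias("avg_dep_delay"))
   .orderBy(F.desc("avg_dep_delay"))
   .show(5))

spark.stop()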

Data Engineering with PySpark live online classes
06 Jul, 2020 | Mon to Fri Batch | Filling Fast | Timing: 09:00 to 13:30 (IST)
18 Jul, 2020 | Weekend Batch | Filling Fast | Timing: 09:00 to 13:00 (IST)
Course Price: 39,000.00
Enroll Now
Can’t find a batch you were looking for? Request a Batch
Curriculum
Download Curriculum
-
Statistics for Data Science
Module 1: Introduction to Statistics
- Descriptive and Inferential Statistics: definitions, terms, types of data
Module 2: Harnessing Data
- Types of data sampling: simple random sampling, stratified and cluster sampling; sampling error
Module 3: Exploratory Analysis
- Mean, Median and Mode, Data variability, Standard deviation, Z-score, Outliers
Module 4: Distributions
- Normal Distribution, Central Limit Theorem, Histogram, Normalization, Normality tests, skewness, Kurtosis.
Module 5: Hypothesis & Computational Techniques
- Hypothesis Testing, Null Hypothesis, P-value, Type I & II errors, parametric testing: t-tests, ANOVA; non-parametric testing (see the SciPy sketch after this section)
Module 6:
- Correlation & Regression
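As a taste of the hypothesis testing covered in Module 5, here is a minimal SciPy sketch; the sample data is made up purely for illustration:

# Two-sample t-test sketch (illustrative data only).
import numpy as np
from scipy import stats

group_a = np.array([12.1, 11.8, 12.4, 12.0, 11.9, 12.3])
group_b = np.array([11.2, 11.5, 11.0, 11.4, 11.3, 11.6])

t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
# At the 5% significance level, reject the null hypothesis of equal means if p < 0.05.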
-
Machine Learning - Basics
Module 1: Machine Learning Introduction
- What is ML? ML vs AI. ML workflow, statistical modeling of ML. Applications of ML
Module 2: Machine Learning Algorithms
- Popular ML algorithms, clustering, classification and regression, supervised vs unsupervised. Choosing an ML algorithm
Module 3:
- Supervised Learning: Simple and multiple linear regression, KNN, and more.
Module 4:
- Linear Regression and Logistic Regression: Theory of linear regression, hands-on with use cases (see the scikit-learn sketch after this section)
Module 5:
- K-Nearest Neighbour (KNN)
Module 6:
- Decision Tree
Module 7:
- Naïve Bayes Classifier
Module 8:
- Unsupervised Learning: K-means clustering.
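A minimal scikit-learn sketch of the supervised-learning workflow from Modules 3 and 4; the data is synthetic and purely illustrative:

# Linear regression on synthetic data (illustrative only).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(200, 1))
y = 3.0 * X.ravel() + 5.0 + rng.normal(0, 1.0, size=200)   # y = 3x + 5 + noise

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
model = LinearRegression().fit(X_train, y_train)
print("R^2 on the test set:", round(r2_score(y_test, model.predict(X_test)), 3))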
-
Machine Learning Expert
Module 1:
- Advanced Machine Learning Concepts: Tuning with hyperparameters (see the scikit-learn sketch after this section). Popular ML algorithms, clustering, classification and regression, supervised vs unsupervised. Choosing an ML algorithm
Module 2:
- Random Forest (Ensemble): Ensemble theory, random forest tuning
Module 3:
- Support Vector Machine (SVM)
Module 4:
- Natural Language Processing (NLP): Text processing with vectorization, sentiment analysis with TextBlob, Twitter sentiment analysis.
Module 5:
- Naïve Bayes Classifier: Naïve Bayes for text classification, news article tagging
Module 6:
- Artificial Neural Network (ANN): Basic ANN for regression and classification
Module 7:
- TensorFlow Overview and Deep Learning Intro: TensorFlow workflow demo and intro to deep learning.
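To illustrate the hyperparameter tuning and random-forest ensembles from Modules 1 and 2, here is a short scikit-learn sketch (it uses a built-in toy dataset, so the numbers are illustrative only):

# Random-forest hyperparameter tuning with grid search (toy dataset, illustrative only).
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)

param_grid = {"n_estimators": [50, 100], "max_depth": [3, 5, None]}
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Best cross-validated accuracy:", round(search.best_score_, 3))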
-
Python for Data Science
Module 1:
- Introduction to Data Science with Python
Module 2:
- Python Basics: Basic syntax and data structures. Installing Python, programming basics, native data types, data objects, math and comparison operators, conditional statements, loops, lists, tuples, sets, dicts, functions
Module 3:
- NumPy Package: Overview, arrays, selecting data, slicing, iterating, manipulation, stacking and splitting arrays, functions
Module 4:
- Pandas Package: Overview, Series and DataFrame, manipulation.
Module 5:
- Python Advanced: Data Munging with Pandas: Histograms, grouping, aggregation, treating missing values, removing duplicates, transforming data (see the Pandas sketch after this section)
Module 6:
- Python Advanced: Visualization with Matplotlib
Module 7:
- Exploratory Data Analysis: Data Cleaning, Data Wrangling
Module 8:
- Exploratory Data Analysis: Case Study
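A small Pandas data-munging sketch in the spirit of Modules 4 and 5; the tiny DataFrame below is made up for illustration:

# Pandas sketch: treat missing values, drop duplicates, group and aggregate.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "city": ["Pune", "Pune", "Mumbai", "Mumbai", "Delhi"],
    "sales": [120.0, np.nan, 95.0, 110.0, 87.0],
})

df["sales"] = df["sales"].fillna(df["sales"].mean())           # treat missing values
df = df.drop_duplicates()                                      # remove duplicates
summary = df.groupby("city")["sales"].agg(["mean", "count"])   # group and aggregate
print(summary)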
-
Time Series Analysis
Module 1:
- What is Time Series?
- Trend, Seasonality, cyclical and random
- White Noise
- Auto Regressive Model (AR)
- Moving Average Model (MA)
- ARMA Model
- Stationarity of Time Series
- ARIMA Model – Prediction Concepts
- ARIMA Model: Hands-on with Python (see the statsmodels sketch after this section)
- Case Study Assignment on ARIMA
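A minimal ARIMA sketch with statsmodels; the series is synthetic, so the forecast is illustrative only:

# ARIMA(1,1,1) fit and 5-step forecast on a synthetic trending series.
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(0)
series = pd.Series(np.cumsum(rng.normal(0.5, 1.0, 120)))   # synthetic non-stationary series

fitted = ARIMA(series, order=(1, 1, 1)).fit()              # AR(1), first differencing, MA(1)
print(fitted.forecast(steps=5))                            # forecast the next 5 points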
-
Deep Learning - CNN Foundation
Module 1: REST API
- API concepts, web servers, URL parameters
Module 2: FLASK Web framework
- Installing Flask, configuration.
Module 3: API in Flask
- API coding in Flask
Module 4: End to End Deployment
- Exporting the trained model, creating an end-to-end API (see the Flask sketch after this section).
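A minimal Flask sketch of serving an exported model behind a REST endpoint; the model file name and the JSON input format are hypothetical:

# Flask prediction API sketch (model.pkl and the JSON schema are assumed for illustration).
import pickle
from flask import Flask, jsonify, request

app = Flask(__name__)

with open("model.pkl", "rb") as f:        # hypothetical exported scikit-learn model
    model = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    features = request.get_json()["features"]      # e.g. [[5.1, 3.5, 1.4, 0.2]]
    prediction = model.predict(features).tolist()
    return jsonify({"prediction": prediction})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)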
-
Data Science & Big Data Analytics Overview
Module 1:
- Introduction to Data Science and Big Data Analytics; roles played by a Data Scientist; technologies for a Data Scientist such as Hadoop, Spark, Scala, Python, R; machine learning and analytics used for analysis; architecture and methodologies used to solve Big Data problems; defining machine learning with an example.
Hadoop Framework
Module 2: HDFS
- What is Big Data? Challenges in processing big data? What technologies support big data? What is Hadoop? Why Hadoop? History of Hadoop, use cases of Hadoop, Hadoop ecosystem.
Module 3: Understanding The Cluster
- Typical workflow, writing files to HDFS, reading files from HDFS, rack awareness, the 5 daemons.
Module 4: MapReduce
- Before MapReduce, MapReduce overview, Job Tracker, Task Tracker, job scheduling, Mapper and Reducer code, configuring the Eclipse development environment, anatomy of a MapReduce job run, job submission, job initialization, task assignment, job completion, job scheduling, job failures, shuffle and sort (a Python word-count sketch follows this section)
Module 5: PIG
- Pig basics; install and configure Pig on a cluster; Pig vs MapReduce and SQL; Pig vs Hive; Pig Latin primitive and complex data types; types of modes: interactive mode, script mode, embedded mode; modes of running Pig; running in the Grunt shell; programming in Eclipse; loading and storing datasets; Filters, Groups, Co-Groups, Foreach, Nested Foreach, Parallel, Distinct, Limit, Sample; different types of joins; debugging commands (Illustrate and Explain); processing log files using regex; working with predefined functions and user-defined functions; how to load and write JSON data using Pig
Module 6: HIVE
- Hive introduction; Hive architecture; different modes to access Hive: command line interface, web interface (HWI), Thrift interface; Hive Metastore; HiveQL; primitive and complex data types; working with partitions; Hive bucketed tables and sampling; external tables; nested queries; multiple inserts; dynamic partitions; different types of joins; ORDER BY, SORT BY, DISTRIBUTE BY, CLUSTER BY; indexes, views; compression on Hive tables and migrating Hive tables; Hive SerDes; processing XML files using regex; processing log files using regex; accessing HBase tables using Hive; Hive UDF, UDAF and UDTF
Module 7: HBASE
- HBase introduction; HBase data model and comparison between RDBMS and NoSQL; HBase architecture: master, HRegionServer, Zookeeper, HRegion, MemStore, HLog, auto-sharding; file storage architecture, HFiles; compaction, decompaction, region splits; HBase operations (DDL and DML) through the shell; HBase installation; internal and external Zookeeper; HBase counters; HBase filters; HBase use cases; installing and configuring HBase on a multi-node cluster; creating a database, developing and running sample applications; accessing data stored in HBase using clients like Java and Python; a MapReduce client to access HBase data; HBase and Hive integration; HBase admin tasks
Module 8: CASSANDRA
- Introduction, installation, database creation, queries and manipulations
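As a small illustration of the MapReduce model from Module 4, a word-count mapper and reducer written in Python for Hadoop Streaming might look like this (a sketch; in practice the two scripts are submitted with the hadoop-streaming jar):

# mapper.py - emit one (word, 1) pair per word read from standard input.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")

# reducer.py - sum the counts for each word (input arrives sorted by key).
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")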
-
Scala Programming for Analytics
Module 1:
- Scala and Java: which to use, when and why; overview of Scala development tools (Eclipse, Scalac, Sbt, Maven, Gradle, REPL, ScalaTest); overview of Scala frameworks
Module 2:
- Scala Syntax Fundamentals: Data types, variables, operators, functions and lambdas, Scala statements / loops / expressions, extending builtins, easy I/O in Scala
Module 3:
- Functional Programming with Scala: What is functional programming? Using "match", case classes, wildcards, case constructors and deep matching, using extractors, pure and first-class functions, anonymous functions, higher-order functions, currying, closures and partials.
Module 4:
- Collections and Generics: Java and Scala collections, mutable and immutable collections, using generic types, lists, tuples and dictionaries, functional programming and collections, map, fold and filter, flattening collections and flatMap, the "for comprehension", pattern matching with Scala
-
Spark Framework for Analytics
Module 1:
- Introduction to Big Data; challenges with old Big Data solutions; batch vs real-time vs in-memory processing; MapReduce and its limitations; Apache Storm and its limitations; the need for a general-purpose solution: Apache Spark. What is Apache Spark? Components of the Spark architecture, Apache Spark design principles, Spark features and characteristics, Apache Spark ecosystem components and their insights
Module 2:
- Setting up the Spark environment: installing and configuring prerequisites, installing Apache Spark in local mode, working with Spark in local mode, troubleshooting problems encountered in Spark; installing Spark in standalone mode, installing Spark in YARN mode, installing & configuring Spark on a real multi-node cluster, playing with Spark in cluster mode, best practices for Spark deployment; playing with the Spark shell, executing Scala and Java statements in the shell
Module 3:
- Understanding the Spark context and driver, reading data from the local filesystem, integrating Spark with HDFS, caching data in memory for further use, distributed persistence, testing and troubleshooting; what is an RDD in Spark, how RDDs make Spark a feature-rich framework, transformations in Apache Spark RDDs, Spark RDD actions and persistence, Spark lazy operations (transformation and caching), fault tolerance in Spark, loading data and creating an RDD in Spark, persisting an RDD in memory or on disk, pair operations and key-value in Spark, Spark integration with Hadoop, Apache Spark practicals and workshops
Module 4:
- The need for stream analytics; comparison with Storm and S4; real-time data processing using Spark Streaming; fault tolerance and check-pointing; stateful stream processing; DStream and window operations; Spark Streaming execution flow; connection to various source systems; performance optimizations in Spark. What is Spark SQL; Apache Spark SQL features and data flow; Spark SQL architecture and components; Hive and Spark SQL together; playing with DataFrames and data states; data loading techniques in Spark; Hive queries through Spark; various Spark SQL DDL and DML operations; performance tuning in Spark
Module 5:
- Why machine learning is needed, what is Spark Machine Learning, various Spark ML libraries, algorithms for clustering, statistical analytics, classification etc.; what is GraphX, the need for different graph processing engines, graph handling using Apache Spark
-
PySpark for Data Engineering, Data Science and Big Data Analytics
Module 1:
- What is PySpark? Installing and configuring PySpark, interactive use of PySpark, standalone programs, PySpark RDD with operations and commands, PySpark MLlib algorithms & parameters, PySpark Profiler – methods and functions, PySpark SparkContext and its parameters (see the RDD sketch after this section).
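A minimal PySpark RDD sketch covering SparkContext, a lazy transformation and an action, as taught in the module above (local mode, illustrative only):

# PySpark RDD sketch: create an RDD, transform it lazily, trigger execution with an action.
from pyspark import SparkContext

sc = SparkContext(master="local[*]", appName="rdd-demo")

numbers = sc.parallelize(range(1, 11))                   # create an RDD
squares = numbers.map(lambda x: x * x)                   # lazy transformation
even_sum = squares.filter(lambda x: x % 2 == 0).sum()    # action triggers execution
print("Sum of even squares:", even_sum)

sc.stop()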
Like the curriculum? Enroll Now
Structure your learning and get a certificate to prove it.
Features
-
Instructor-led Sessions
Duration: 2 Months
Week Day classes (M-F): 40 Sessions
Daily 2 Hours per Session
-
Real-life Case Studies
Live project based on any of the selected use cases, involving the implementation of Data Science.
-
Assignments
Every class will be followed by practical assignments which aggregate to a minimum of 60 hours.
-
Lifetime Access
Lifetime access to the Learning Management System (LMS), which has class presentations, quizzes, installation guides & class recordings.
-
24 x 7 Expert Support
Lifetime access to our 24x7 online support team, who will resolve all your technical queries through a ticket-based tracking system.
-
Certification
Successful completion of the final project will get you certified as a Data Science Professional by GoSkills.
FAQs
-
What if I miss a class? You will never miss a lecture at GoSkill! You can choose either of the two options:
- View the recorded session of the class available in your LMS.
- You can attend the missed session in any other live batch.
-
Will I get placement assistance? - To help you in this endeavor, we have added a resume builder tool in your LMS. Now you will be able to create a winning resume in just 3 easy steps. You will have unlimited access to these templates across different roles and designations. All you need to do is log in to your LMS and click on the "create your resume" option.
-
Can I attend a demo session before enrollment? - We keep the number of participants in a live session limited to maintain quality standards, so, unfortunately, participation in a live class without enrollment is not possible. However, you can go through the sample class recording; it will give you a clear insight into how the classes are conducted, the quality of the instructors, and the level of interaction in a class.
-
Who are the instructors? - All the instructors at GoSkill! are practitioners from the industry with a minimum of 10-12 years of relevant IT experience. They are subject matter experts trained by GoSkill! to provide an awesome learning experience to the participants.
-
What if I have more queries? - Just give us a call at +91 98636 36336 or email us at Marketing@goskills.in