
Graduate from bounded Big Data to unbounded Big Data by learning the secrets of Stream Data Processing and MLOps.

Data Engineering : Advance : DE301

  • DURATION

    4 Months

  • WEEKLY

    45 hours

  • FEE

    USD 5,760 INR 437,760†

About the Course

There are four functional roles in Data Science, namely, Business Analyst, Data Analyst, Machine Learning Engineer and Data Engineer. The DE track targets the Data Engineer role. The Data Engineer collects, transforms, moves, secures and stores data to make Business Analysis, Data Analysis and Machine Learning possible.

Until now, you have been working with bounded, static data. In this third part of your Data Engineering journey you will learn how to work with streaming data. Various tools and strategies will be discussed to handle the challenges of real-time processing of Big Data.
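To give a rough flavor of the shift (illustrative only, not course material): a bounded dataset can be reduced in a single pass over all the data, while an unbounded stream must be aggregated incrementally as each record arrives. A minimal Python sketch:

```python
def batch_average(readings):
    """Batch: the whole bounded dataset is available up front."""
    return sum(readings) / len(readings)

class StreamingAverage:
    """Stream: maintain running state and update it one record at a time."""
    def __init__(self):
        self.count = 0
        self.total = 0.0

    def update(self, value):
        self.count += 1
        self.total += value
        return self.total / self.count  # a result is available after every record

readings = [10, 20, 30, 40]
print(batch_average(readings))   # 25.0

avg = StreamingAverage()
for r in readings:
    current = avg.update(r)      # intermediate averages: 10.0, 15.0, 20.0, 25.0
print(current)                   # 25.0
```

Real streaming systems such as Apache Beam and Spark Streaming add windowing, fault tolerance and distribution on top of this basic idea of incrementally updated state.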

This course will also introduce you to MLOps. You will learn how to create automated machine learning pipelines using containerization and orchestration technologies. Unifying the development and deployment of models ensures that your models work not only in the lab, but also in the wild.

Prerequisites

  • Curiosity
  • Patience
  • Basic arithmetic skills - Brackets, division, multiplication, addition, subtraction
  • Ability to operate a computer, keyboard and mouse
  • Ability to use a web browser to access and use the internet
  • Ability to install software on your computer
  • Data Engineering : Intermediate : DE201

Hardware and Software Requirements

  • A physical computer (not a virtual machine) running Fedora 34 or later, Pop!_OS/Ubuntu 20.04 or later, Windows 10 or later, or macOS 10 or later
  • 16 GB RAM
  • Broadband internet connection faster than 5 Mbps
  • 100 GB free hard disk space; SSD recommended
  • A dedicated graphics card is recommended but not required; cloud resources will be used
  • Access to a credit card to set up a Google Cloud Platform account with billing enabled (new accounts include $300 in free credits)

Learning Objectives

Batch Data Processing
  • Dask
  • MapReduce
  • Dataproc (Managed Hadoop)
  • Spark
Stream Data Processing on Google Cloud Platform (GCP)
  • Introduction To Streaming
  • Challenges Of Streaming
  • Cloud Functions
  • Cloud Pub/Sub
  • Dataflow (Apache Beam) Streaming
  • Spark Streaming
  • Real-Time Dashboards
  • Throughput And Latency
  • BigQuery For Real-Time
  • Dataprep
MLOps Using Containers
  • Docker
  • Kubernetes
  • Kubernetes For Data Science
Datalab And Cloud Source Repository
  • Introducing Datalab
  • Introducing Cloud Source Repositories
  • Creating A Datalab Instance
  • Datalab Persistent Disk And Network
  • Executing Python Code On Cloud Shell
  • Executing Python Code On Datalab
  • Datalab for Teams
  • Working With Cloud Storage
  • Working With BigQuery
  • Committing Notebooks To Cloud Source Repositories
  • Creating A New Cloud Source Repository
  • Deploy An App Engine Application
  • CI/CD Pipeline Using Cloud Build
  • Adding A Repo As A Remote Repository
  • Work With Github
  • Integration With Pub/Sub
  • Teardown

Learning Outcomes

  • Learn to work with Big Data using Dask, an alternative to Spark and Hadoop that scales technologies such as NumPy and Pandas beyond what they can normally handle
  • Learn to think in terms of the map-reduce programming paradigm
  • Learn to work with Managed Hadoop (Dataproc)
  • Learn to use Spark and PySpark to process Big Data
  • Learn how stream data processing is different from batch data processing and the challenges associated with it
  • Learn to use various stream data processing technologies such as Apache Beam and Spark Streaming
  • Learn MLOps using industry standard containerization and orchestration frameworks
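To give a flavor of the map-reduce programming paradigm mentioned above, here is a minimal single-process word count sketched in plain Python (illustrative only; frameworks such as Hadoop MapReduce and Spark distribute the map and reduce phases across many machines):

```python
from collections import Counter
from functools import reduce
from itertools import chain

docs = ["big data", "stream data processing", "big data engineering"]

# Map phase: emit a (word, 1) pair for every word in every document.
mapped = chain.from_iterable(((w, 1) for w in d.split()) for d in docs)

# Reduce phase: sum the counts for each key (word).
def reducer(acc, pair):
    word, count = pair
    acc[word] += count
    return acc

counts = reduce(reducer, mapped, Counter())
print(counts["data"])   # 3
print(counts["big"])    # 2
```

The key idea is that both phases operate on independent key/value pairs, which is what lets a cluster run them in parallel over partitions of the data.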

Fineprint

  • The topics presented are tentative and we reserve the right to add or remove a topic to update or improve the bootcamp, or for technical or time reasons.
  • † 18% Indian taxes extra.
Teacher

Manuj Chandra

Data Science

Related Courses

  • Programming Effectively with Generative AI (Code Based) — 2 days — Data Analytics
  • Data Engineering : Intermediate : DE201 — 4 Months — Data Engineering
  • Data Analysis : Introduction : DA101 — 4 Months — Data Analysis