Module 3 Overview

Welcome to Module 3 | Data Product Development. In this module, we will build on the foundation established in Module 2 | Exploratory Data Analysis, with a focus on applying statistics and AI/ML to large datasets in the cloud.

Specifically, in this module we will focus on developing key skills and toolsets for preparing datasets for analysis and then analyzing them. A key component of this module is the application of parallel computing at each of these steps, from data processing to the development of machine learning models. The module is structured to follow a typical sequence of data product development: cleaning the raw data so that it becomes useful, engineering features to prepare for analysis, then applying the analysis and modeling.

By the end of this module, you will be familiar with and conversant in the following areas:

Principles of data cleaning and transformation.
Applications of parallel computing in the cloud.
Differences between statistical analysis and machine learning.
Applications of statistical analysis and machine learning to earth-system data in the cloud.

Specifically, by the end of the module, you will have accomplished the following:

Cleaned data using parallel processing in a distributed cloud environment.
Prepared data using feature engineering for statistical analysis and machine learning.
Scaled analysis from small dev environments to larger testing and production environments.
Built reproducible data pipelines to automatically create analysis-ready datasets.
Applied statistical models in parallel over large earth-system datasets.
Applied machine learning models at scale to large earth-system datasets.