Introduction
Overview
Earth System Data Science in the Cloud is an intensive training program offered through the North Carolina Institute for Climate Studies (NCICS). The course is designed to provide Earth System scientists with data science paradigms, principles, and practices to scale their research using cloud-native computing. Through an immersive, hands-on, team- and practicum-based approach, participants internalize the strategies and techniques of collaborative, cloud-based data science that make performant scientific analysis possible. Designed for professional researchers, this course develops hard and soft skills through modules focused on foundational data science concepts, data product development, and production data science in the cloud on Amazon Web Services (AWS). In each of these modules, participants progressively build and improve their skills in the areas of reproducible scientific analysis, cloud data workflows, machine learning, visualization, and communication. By the course's end, participants will be proficient in the tools of modern cloud-based data science, be able to work productively in dynamic teams, and be able to rapidly embrace new computing paradigms and technologies.
Curriculum
The Earth System Data Science in the Cloud (ESDS) curriculum is composed of five core modules that follow the life cycle of a typical data science project. Each of these core modules is progressive, building upon skills developed in the previous module. Each module will also have a team capstone project designed to integrate module skills into ESDS deliverables.
Module 1 | Scientific Programming
This module develops the core skills needed to program effectively in modern scientific scripting languages. It is designed for researchers who have extensive technical backgrounds in other programming languages. The module begins with a survey that assesses each individual's general knowledge of the course's materials. This module focuses on syntax, core programming paradigms, and best practices to leverage the strengths of each language. Emphasis is placed on comparing and contrasting Python, R, and JavaScript with other programming languages, like MATLAB and Fortran, in order to leverage the portability of existing skill sets and jump-start each researcher's ability to effectively use Python for Earth System Data Science. This module also introduces key soft skills around teamwork, shared development environments, and collaborative learning.
Module 2 | Exploratory Data Analysis (EDA)
EDA is the beginning stage of any data science project. We cover project formulation, accessing data, data format considerations, connecting data, setting up analytics tools, reproducible environments, working with Git, accessing the cloud, using Jupyter notebooks, understanding data context, exploring the data, generating summary statistics, visualizing the data, and preparing the foundation for subsequent steps.
For the team capstone for this module, participants work on specific thematic areas to define an Earth System research question, identify data needed to answer this question, and present an exploratory data analysis of these data in preparation for answering this research question. During this module, participants begin working intensively in collaborative, flat team environments and are introduced to presentation skills.
Module 3 | Data Product Development
Data product development is where the bulk of the work occurs and where scientific insights are generated. We cover data manipulation principles and best practices for tabular and non-tabular datasets, performance considerations, scientific pipeline development, reproducibility, containerization, scalability, statistical analysis, and machine learning. A focus of this module is cloud-native acceleration of each of these components using AWS and highly parallel, collaborative computing environments.
For the module's team capstone, participants work together using the tools developed in this module to generate the insights needed to address the research question defined in Module 2. This module further develops technical communication skills and relies on integrating work in team environments using collaboration tools.
Module 4 | Production Data Science
Production data science is where data products are tested, evaluated, validated, refined, and polished with an eye towards dissemination, production deployment, and public use. We cover test-driven development, infrastructure-as-code, continuous integration and continuous delivery, event-driven pipelines, and serverless compute technologies. We also continue developing communication skills and soft skills around presentations and collaborative product generation.
For the team capstone for this module, participants develop three Earth System Data Science products using the techniques discussed in this module. These products are: (1) a publication, (2) a data pipeline, and (3) a dataset. While the bulk of the development of these products will have occurred in the previous module, in Production Data Science, participants focus on polishing these products for production.
Module 5 | Product Delivery
This module focuses on synthesizing and polishing the results from the previous modules. Participants finalize the three Earth System Data Science products developed in the previous module and focus on effectively communicating their work. A key skill developed in this module will be how to communicate technical subject matter to non-technical audiences, with this module's capstone project being a highly polished team presentation for a non-technical audience.
Participants
This course is designed to make research scientists better scientists by providing them the tools to scale their analyses as cloud-native data scientists. The course is designed for graduate-level scientists (MS or above) who have extensive domain expertise in an area of Earth System science and a need for scientific computing. Participants are expected to have familiarity with a scientific programming language, have independently developed research projects, and have published in peer-reviewed scientific journals. The course is designed for those research scientists whose primary role is technical research, not administration, management, or communications. Participants are also expected to be interested primarily in applied data science, not basic data science or data analytics.
Basic and Applied Data Science: Basic data science involves algorithm development, research into more performant computing and analytics solutions, and new statistical applications. Basic data science drives applied data science. Applied data science involves applying data science tools, techniques, and technologies to problems in specific domains. Applied data science is the focus of data science at NCICS, where the goal is to apply data science to scientific challenges in the Earth System.
Data Analytics and Applied Data Science: Data analytics is a subset of data science primarily concerned with descriptive work: answering the 'what' instead of the 'why' or the 'how'. Data analytics work is typically done to place observed phenomena in context or to highlight spatio-temporal trends. Data science extends that descriptive work to be perceptive and prescriptive: seeking to understand why an observed phenomenon happened and how it works, and predicting its broader impact in time and space. While much of what is covered in this course is directly applicable to data analytics, the course focuses on the data science side, with an emphasis on understanding and predicting natural phenomena in the Earth System context.
Outcomes
At the end of this course, participants will be experienced at:
Collaborating on technical projects in diverse teams.
Using collaborative version control and project management tools to facilitate teamwork.
Sharing and reusing reproducible environments for analysis and publications.
Leveraging cloud-native tools to scale analytics and machine learning workflows.
Building highly parallelized, efficient cloud-native workflows.
Applying coding and software development best practices.
Communicating technical outcomes to technical and non-technical stakeholders.
Deploying machine learning models to production environments.
The course will specifically equip participants to:
Rapidly embrace new computing paradigms and technologies.
Work effectively in teams in a data science context.
Based on past experience, after this course, participants will be able to improve their current performance at least tenfold and accomplish research goals that were previously impossible.
Course Format
The course is cohort- and module-based. The course is split into week-long intensives with 20 hours per week of class time during the intensives and extra team project time outside of class. During and between the intensives, participants apply the principles and techniques in hands-on exercises and independent research designed to make the lessons second nature. Please see the additional information on this site for more details.