140.636.71
Scalable Computational Bioinformatics

Location

Internet

Term

1st Term

Department

Biostatistics

Credit(s)

Academic Year

2021 - 2022

Instruction Method

Synchronous Online with Some Asynchronous Online

Class Time(s)

M, W, 1:30 - 2:20pm

Lab Times

Friday, 1:30 - 2:20pm (01)

Auditors Allowed

Available to Undergraduate

Grading Restriction

Letter Grade or Pass/Fail

Course Instructor(s)

Benjamin Harvey

Contact Name

Benjamin Harvey

Frequency Schedule

Every Year

Resources

Prerequisite

Students should be comfortable using a command line interface and have previous experience programming in at least one language.

Description

As genomic cohorts continue to grow in size and complexity, many organizations are turning to cloud and high-performance computing (HPC) environments to alleviate computational load. While cloud computing promises elasticity and scalability, traditional bioinformatics tools are designed for single-system, single-node architectures and cannot effectively leverage cloud computing environments at scale and speed. As a result, bioinformatics researchers spend much of their time wrangling data, crafting complex algorithmic techniques, and building single-task pipelines that are slow to run and difficult to optimize. Researchers have now turned to scalable systems like Apache Spark.

Discusses a distributed programming paradigm, high-level APIs, and scalable analytics platforms that simplify implementing algorithms for analyzing large genomic datasets. Discusses tools built on Apache Spark, enabling students to scale to thousands of cores, achieving a balance necessary for processing genomics data. Discusses how to solve some of these problems by bridging bioinformatics, data science, machine learning, and the big data ecosystem. Enables students to combine the statistical methods of bioinformaticians and computational biologists with best practices used by data engineers and data scientists across industry.

Learning Objectives

Upon successfully completing this course, students will be able to:

Develop just enough experience with Python to begin using the Apache Spark programming APIs, including Spark SQL, SparkR, and PySpark
Describe the fundamentals of the Apache Spark framework, including its architecture, the DataFrames API, and SparkR
Describe the process of tuning Spark applications, including best practices and how to avoid common pitfalls in developing Spark applications
Use Delta Lake for ETL processing on data lakes, including using Spark SQL to query a data lake or Spark DataFrames
Implement unified analysis of large-scale genomics data through distributed computation on, and distributed storage of, genotype data, using pre-packaged pipelines, parallelized with Apache Spark, to align reads and to detect and annotate variants in individual samples
Develop code to aggregate and analyze genetic variants using the GATK’s GenotypeGVCFs implemented on Apache Spark, and to extract, transform, and load (ETL) genomic variant data into Spark DataFrames, enabling seamless manipulation, filtering, quality control, and transformation between file formats
Use machine learning fundamentals and data science techniques to analyze healthcare and genetics/genomics datasets, and describe deep learning and how to scale it with Apache Spark

Methods of Assessment

This course is evaluated as follows:

50% Homework
50% Programming project that is presented in class

Special Comments

Lab is required
