140.636.01
Scalable Computational Bioinformatics

Location

East Baltimore

Term

1st Term

Department

Biostatistics

Credit(s)

Academic Year

2022 - 2023

Instruction Method

In-person

Class Time(s)

M, W, F, 1:30 - 2:20pm

Auditors Allowed

Yes, with instructor consent

Available to Undergraduate

Grading Restriction

Letter Grade or Pass/Fail

Course Instructor(s)

Benjamin Harvey

Contact Name

Benjamin Harvey

Frequency Schedule

Every Year

Resources

Prerequisite

Students should have previous programming experience in at least one language and know the basics of coding, such as iteration, recursion, arrays, and matrices. Knowledge of Python is recommended but not required.

Description

As genomic cohorts continue to grow in size and complexity, many organizations are turning to cloud and high-performance computing (HPC) environments to alleviate computational load. While cloud computing promises elasticity and scalability, traditional bioinformatics tools are designed for single-system, single-node architectures and cannot effectively leverage cloud computing environments at scale and speed. As a result, bioinformatics researchers spend much of their time wrangling data, crafting complex algorithmic techniques, and building single-task pipelines that are slow to run and difficult to optimize. Researchers have now turned to scalable systems like Apache Spark.

Discusses a distributed programming paradigm, high-level APIs, and scalable analytics platforms that simplify implementing algorithms for analyzing large genomic datasets. Discusses tools built on Apache Spark that enable students to scale to thousands of cores, achieving a balance necessary for processing genomics data. Discusses how to solve some of these problems by bridging bioinformatics, data science, machine learning, and the big data ecosystem. Enables students to leverage the statistical methods of bioinformaticians and computational biologists in combination with best practices used by data engineers and data scientists across industry.

Learning Objectives

Upon successfully completing this course, students will be able to:

Develop just enough experience with Python to begin using the Apache Spark programming APIs, including Spark SQL, SparkR, and PySpark
Develop experience with Jupyter Notebook, AnVIL and Terra
Describe the Apache Spark architecture, the DataFrames API and SparkR, covering the fundamentals of the Apache Spark framework
Describe processes of tuning Spark applications, developing best practices and avoiding many of the common pitfalls associated with developing Spark applications
Acquire knowledge of fundamental concepts in machine learning: linear regression, logistic regression, cross-validation, random forest, etc.
Develop code to aggregate and analyze genetic variants using the GATK’s GenotypeGVCFs implemented on Apache Spark, and to extract, transform, and load (ETL) genomic variant data into Spark DataFrames, enabling seamless manipulation, filtering, quality control, and transformation between file formats
Use machine learning fundamentals and data science techniques to analyze healthcare and genetics/genomics datasets, and gain an overview of deep learning and how to scale it with Apache Spark

Methods of Assessment

This course is evaluated as follows:

100% Assignments
