Scalable Computational Bioinformatics

East Baltimore
1st term
4 credits
Academic Year:
2022 - 2023
Instruction Method:
Class Times:
  • M W F,  1:30 - 2:20pm
Auditors Allowed:
Yes, with instructor consent
Undergrads Allowed:
Grading Restriction:
Letter Grade or Pass/Fail
Course Instructor:
  • Benjamin Harvey

Students are recommended to have previous programming experience in at least one language and to know the basics of coding, such as iteration, recursion, arrays, and matrices. Knowledge of Python is recommended but not required.


As genomic cohorts continue to grow in size and complexity, many organizations are turning to cloud and high-performance computing (HPC) environments to alleviate computational load. While cloud computing promises elasticity and scalability, traditional bioinformatics tools are designed for single-system, single-node architectures and cannot effectively leverage cloud computing environments at scale. As a result, bioinformatics researchers spend much of their time wrangling data, crafting complex algorithmic techniques, and building single-task pipelines that are slow to run and difficult to optimize. Many researchers have therefore turned to scalable systems such as Apache Spark.

Discusses a distributed programming paradigm, high-level APIs, and scalable analytics platforms that simplify implementing algorithms for analyzing large genomic datasets. Discusses tools built on Apache Spark that enable students to scale to thousands of cores, achieving a balance necessary for processing genomics data. Discusses how to solve some of these problems by bridging bioinformatics, data science, machine learning, and the big data ecosystem. Enables students to combine the statistical methods of bioinformaticians and computational biologists with best practices used by data engineers and data scientists across industry.

Learning Objectives:

Upon successfully completing this course, students will be able to:

  1. Develop just enough experience with Python to begin using the Apache Spark programming APIs, including Spark SQL, SparkR, and PySpark
  2. Develop experience with Jupyter Notebook, AnVIL and Terra
  3. Describe the Apache Spark architecture, the DataFrames API and SparkR, covering the fundamentals of the Apache Spark framework
  4. Describe the process of tuning Spark applications, including best practices and how to avoid many of the common pitfalls associated with developing Spark applications
  5. Acquire knowledge of fundamental concepts in machine learning: linear regression, logistic regression, cross-validation, random forest, etc.
  6. Develop code to aggregate and analyze genetic variants using GATK’s GenotypeGVCFs implemented on Apache Spark, and to extract, transform, and load (ETL) genomic variant data into Spark DataFrames, enabling seamless manipulation, filtering, quality control, and transformation between file formats
  7. Use machine learning fundamentals and data science techniques to analyze healthcare and genetics/genomics datasets, and gain an overview of deep learning and how to scale it with Apache Spark

Methods of Assessment:

This course is evaluated as follows:

  • 100% Assignments

Instructor Consent:

No consent required