tl;dr: There's a lot of public data about biological systems under various perturbations, but these data are noisy. Our lab develops algorithms that integrate these data to model complex biological systems. We do this because we want to understand how biological systems work, so that when we need to intervene to improve health we aren't taking a shot in the dark. We recognize that our lab won't have all the answers, or even all of the questions, so we develop systems that any biologist can use.

Overview

Right now anyone can download genome-wide measurements from more than 2 million assays of diverse physiological conditions. We develop and apply computational methods that analyze these large and heterogeneous data compendia to provide a data-driven lens into pathways, cell lineages, diseases, and other biological systems of interest.

Our projects are grouped into three major areas:

  • new algorithms for noisy data,
  • improving the accessibility of data and methods,
  • and applications that span from basic biology to precision treatments for human diseases.

Members of our lab work on specific research questions related to these topics, and frequently find that their research eventually touches on more than one of these core areas. We have a strong commitment to open and reproducible science: our webservers are freely available and our code is released under open source licenses on GitHub and Bitbucket.

Our work is funded by grants from the National Science Foundation (NSF), the Gordon and Betty Moore Foundation, and the Cystic Fibrosis Foundation. In the past we have received funding from the Norris Cotton Cancer Center and the American Cancer Society.

 
Our recent research, compressed into sketch form by YoSon Park during the 2016 #PennGenRetreat.


Algorithms for noisy public data

The goal of this aim is to develop algorithms that can integrate data across multiple experiments and platforms to capture useful biological patterns. We develop both supervised and unsupervised approaches. We focus on supervised methods when we know precisely what question we’d like to answer and on unsupervised methods for discovery-oriented projects.

We’ve developed supervised methods that address key biological challenges related to genes’ developmental roles, tissue-specific expression, and tissue-specific function. We’ve developed unsupervised methods to identify the key biological patterns in cancer and microbial genomics data.

Many of our methods are based on “deep learning” approaches. These techniques are trendy right now. However, we adapt them because the denoising autoencoder framing works well on the types of data that we deal with, i.e. noisy data. We were the first to apply these methods to genomic data integration, and we have found them to be particularly robust to cross-dataset noise. Our recent work has also demonstrated that these methods can be broadly applied, e.g. to electronic health records.
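To make the denoising autoencoder framing concrete, here is a minimal sketch of the general idea in Python with NumPy. It is not our published software: the function names, the toy binary data, the masking-noise rate, the tied weights, and the plain gradient-descent training loop are all assumptions chosen for illustration. The key point it shows is that the network receives a corrupted copy of each sample but is scored against the clean original, which pushes it to learn patterns that survive noise.

```python
import numpy as np


def corrupt(x, rate, rng):
    """Masking noise: zero out roughly a fraction `rate` of the entries."""
    keep = rng.random(x.shape) >= rate
    return x * keep


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def reconstruct(x, w, b_h, b_o):
    """Encode to the hidden layer, then decode with the transposed weights."""
    h = sigmoid(x @ w + b_h)
    return sigmoid(h @ w.T + b_o)


def train_dae(x, n_hidden=2, rate=0.2, lr=0.5, epochs=500, seed=0):
    """Train a tied-weight denoising autoencoder (illustrative settings)."""
    rng = np.random.default_rng(seed)
    n_samples, n_features = x.shape
    w = rng.normal(0.0, 0.1, (n_features, n_hidden))
    b_h = np.zeros(n_hidden)
    b_o = np.zeros(n_features)
    losses = []
    for _ in range(epochs):
        x_noisy = corrupt(x, rate, rng)        # fresh corruption each epoch
        h = sigmoid(x_noisy @ w + b_h)
        x_hat = sigmoid(h @ w.T + b_o)
        err = x_hat - x                        # compared with the CLEAN input
        losses.append(float(np.mean(err ** 2)))
        # Backpropagate squared error (gradient up to a constant factor).
        d_out = err * x_hat * (1.0 - x_hat) / n_samples
        d_hid = (d_out @ w) * h * (1.0 - h)
        # Tied weights: encoder and decoder contributions are summed.
        w -= lr * (x_noisy.T @ d_hid + d_out.T @ h)
        b_h -= lr * d_hid.sum(axis=0)
        b_o -= lr * d_out.sum(axis=0)
    return w, b_h, b_o, losses


# Toy data (an assumption): two groups of samples with opposite on/off patterns.
x = np.array([[1, 1, 1, 0, 0, 0]] * 20 + [[0, 0, 0, 1, 1, 1]] * 20, dtype=float)
w, b_h, b_o, losses = train_dae(x)

# Reconstruct one sample after hiding some of its measurements.
noisy_sample = corrupt(x[:1], 0.3, np.random.default_rng(1))
print(np.round(reconstruct(noisy_sample, w, b_h, b_o), 2))
print("first loss:", round(losses[0], 3), "last loss:", round(losses[-1], 3))
```

In this sketch the hidden layer is deliberately smaller than the input, so the model has to summarize each sample with a few learned features. Real genomic compendia would call for more hidden units, different noise levels, and an established deep learning library rather than hand-written gradients.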

In summary, we develop robust machine learning based approaches. Our methods perform well in many contexts, and we continue to devise new strategies to tackle pressing biomedical challenges.

Reproducible "big data" analysis for everyone

We don’t feel that it’s enough to develop an algorithm and publish a one-off paper demonstrating “good” performance. We want to see these methods used by the wettest molecular biologists and the driest bioinformaticians.

Our driving principle is that “big data” analysis should be as routinely used in biological labs as common wet-bench techniques such as PCR. But we’re not willing to wait the many years required for molecular biology training programs to refocus on providing their students with these capabilities. Instead, we develop robust webservers that provide molecular labs with these analytical capabilities.

For primarily computational researchers, we make our source code available and develop techniques that make such research routinely reproducible. For example, we have repurposed continuous integration systems for scientific analysis. We use these in our own work and we have described how others can employ them. These methods ensure that analyses can be reproduced from start to finish in a new computing environment with little effort on the part of the researcher.

Applications from basic biology to human disease

The methods that we develop are grounded in biology. We focus relatively little effort on improving traditional performance metrics such as cross-validated AUC and accuracy. Our own work and the work of many others have demonstrated the limits of such evaluations. In short: the biases present in how scientific discovery proceeds render these analyses fraught with peril at best. Instead, we aim to validate findings where the scientific rubber meets the road: in new experiments designed to perturb the system.

We’ve worked on zebrafish development, bacterial responses to the environment, cancer genomics data, hypertension, Alzheimer’s disease, nephrotic syndromes, and many other areas. We choose to focus where we can partner with biologists who will engage in the process of discovery and, when the time comes, test predictions from our methods.