MacArthur Lab of MGH and the Broad Institute of MIT and Harvard

The MacArthur Lab focuses on extracting useful information from the massive datasets generated by large-scale next-generation sequencing projects. Its main areas of work are accurate identification of sequence variants from these datasets, annotation of loss-of-function (LoF) variants, transcriptome sequencing approaches to characterize the impact of DNA sequence variants on human gene function, and genomic approaches to discover disease-causing mutations in patients with severe disease.

Multi-Nucleotide Polymorphisms in ExAC Dataset

The MacArthur Lab has released a variant callset derived from 60,000 human exome sequences on behalf of the Exome Aggregation Consortium (ExAC). This is the largest exome sequencing reference panel to date and has a wide variety of applications.

As a small part of this massive collaborative effort, Beryl Cummings and I are contributing an analysis of multi-nucleotide polymorphisms (MNPs) present in samples from the project. Specifically, we have been investigating combinations of two or three single-nucleotide polymorphisms (SNPs) that, when found on the same haplotype in a given sample, change the interpretation of protein-coding variants. We have found several compelling examples illustrating the importance of using phase information in variant annotation and are now in a position to provide an open-source tool for MNP annotation.
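To illustrate why phase matters, here is a minimal Python sketch (not the lab's MNP annotation tool) that annotates two SNPs in one codon first separately and then together. The reference codon, the two substitutions, and the function names are hypothetical and chosen only for illustration; the codon table is the standard genetic code.

```python
from itertools import product

# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def apply_snps(codon, snps):
    """Return the codon after applying (offset, alt_base) substitutions."""
    bases = list(codon)
    for offset, alt in snps:
        bases[offset] = alt
    return "".join(bases)


def consequence(ref_codon, snps):
    """Classify the protein-level effect of the given SNPs on one codon."""
    ref_aa = CODON_TABLE[ref_codon]
    alt_aa = CODON_TABLE[apply_snps(ref_codon, snps)]
    if alt_aa == ref_aa:
        return "synonymous"
    if alt_aa == "*":
        return "stop_gained"
    if ref_aa == "*":
        return "stop_lost"
    return "missense"


# Hypothetical example: two SNPs in a glutamine codon (CAG).
ref = "CAG"
snp_a = (0, "T")  # alone: CAG -> TAG
snp_b = (1, "G")  # alone: CAG -> CGG

print(consequence(ref, [snp_a]))         # first SNP annotated on its own
print(consequence(ref, [snp_b]))         # second SNP annotated on its own
print(consequence(ref, [snp_a, snp_b]))  # both on the same haplotype: TGG
```

Annotated alone, the first SNP looks like a stop-gained (LoF) variant and the second like an arginine missense change. On the same haplotype they produce TGG (tryptophan), a missense change that neither single-SNP annotation predicts.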

An analysis of the MNPs found through this project will be part of the main ExAC publication and presented at ASHG 2015.

Finding Misannotated Exons in GENCODE V19

The goal of this project was to find exons in the GENCODE V19 gene model that are annotated as protein-coding but may not be, as well as exons that may be annotated with the wrong reading frame.

Using a combination of PhyloCSF (evolutionary conservation metrics across a phylogeny of 29 mammals), constraint scores based on depletion of certain types of variation beyond genome-wide expectation in a large reference population, and RPKM tissue expression values from the GTEx Project, we have identified a subset of GENCODE exons that may not be protein-coding and require further investigation.

We have also developed a simple approach using PhyloCSF to find exons that are likely annotated with the incorrect reading frame and require further investigation.
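The details of this approach are not given here, so the following Python sketch shows only one plausible shape such a check could take, not the method actually used. It assumes that a PhyloCSF-style score is already available for each of the three frames of an exon; the score values, the function name, and the min_margin threshold are all hypothetical.

```python
def flag_frame_mismatch(annotated_frame, frame_scores, min_margin=10.0):
    """Return the better-scoring frame if it beats the annotated one, else None.

    frame_scores maps frame (0, 1, 2) to a precomputed coding-potential score.
    min_margin is a hypothetical threshold: the best frame must outscore the
    annotated frame by at least this much before the exon is flagged.
    """
    best = max(frame_scores, key=frame_scores.get)
    if best == annotated_frame:
        return None
    margin = frame_scores[best] - frame_scores[annotated_frame]
    return best if margin >= min_margin else None


# Hypothetical scores for one exon annotated in frame 0.
scores = {0: -20.0, 1: 35.0, 2: -5.0}
print(flag_frame_mismatch(0, scores))
```

Requiring a margin rather than any difference is a design choice that keeps exons with near-tied frame scores from being flagged on noise alone.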

Collectively, these two sources of misannotation affect how we interpret potential LoF variants during annotation. Our lab is seeking to integrate these metrics into LOFTEE, our VEP plugin for LoF annotation. Results from this project will be incorporated into the LOFTEE publication.

Leiden Muscular Dystrophy Variant Database

I developed a Python API and lightweight scripts to automate extraction and remapping of variants from any Leiden Open Variation Database (LOVD) installation. These databases contain potentially invaluable information for human disease research, but they are fraught with inconsistencies and stored in formats that are unusable for large-scale data analytics.

Technical aspects of this project involved screen-scraping using BeautifulSoup (while managing several LOVD versions with heterogeneous markup), remapping HGVS cDNA variant coordinates to VCF-format genomic coordinates, dealing with highly varied data quality, annotation with the Variant Effect Predictor (VEP), cross-checking against other data sources for QC, and annotating variants with additional data sources.
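As a rough illustration of the remapping step, the Python sketch below converts a simple HGVS cDNA substitution into VCF-style fields. It is a toy under stated assumptions, not the project's code: it handles only substitutions of the form c.N followed by REF>ALT, assumes a plus-strand transcript, and takes the coding portions of the exons as a hypothetical list of 1-based inclusive genomic intervals. All function names and coordinates are invented for the example.

```python
import re

# Matches only simple coding substitutions such as "c.15A>G".
HGVS_SUBSTITUTION = re.compile(r"^c\.(\d+)([ACGT])>([ACGT])$")


def parse_cdna_substitution(hgvs):
    """Split a simple HGVS cDNA substitution into (position, ref, alt)."""
    match = HGVS_SUBSTITUTION.match(hgvs)
    if match is None:
        raise ValueError(f"unsupported HGVS expression: {hgvs!r}")
    pos, ref, alt = match.groups()
    return int(pos), ref, alt


def cdna_to_genomic(cdna_pos, coding_segments):
    """Map a 1-based cDNA position onto plus-strand genomic coordinates.

    coding_segments is an ordered list of (start, end) genomic intervals,
    1-based and inclusive, covering the coding sequence only.
    """
    if cdna_pos < 1:
        raise ValueError("cDNA positions are 1-based")
    remaining = cdna_pos
    for start, end in coding_segments:
        length = end - start + 1
        if remaining <= length:
            return start + remaining - 1
        remaining -= length
    raise ValueError("cDNA position lies beyond the coding sequence")


def to_vcf_fields(chrom, hgvs, coding_segments):
    """Return (CHROM, POS, REF, ALT) for a simple cDNA substitution."""
    pos, ref, alt = parse_cdna_substitution(hgvs)
    return chrom, cdna_to_genomic(pos, coding_segments), ref, alt


# Hypothetical transcript with two 10-base coding segments.
segments = [(100, 109), (200, 209)]
print(to_vcf_fields("chr1", "c.15A>G", segments))
```

A real remapping also has to handle minus-strand transcripts (reverse-complementing the alleles), intronic and UTR offsets, and insertions and deletions, and would validate the reference allele against the genome.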

Extensive project documentation can be found on ReadTheDocs. The relevant scripts and Python packages can be found on GitHub.

All Materials Copyright Andrew John Hill