This is an R markdown document to accompany my blog post on dimensionality reduction for scATAC-seq data. It downloads all the data and generates all the figures for the blog (except for results drawn from other papers). I hope it will serve as a useful resource for the exploration of methods described in the post.

Dependencies

options(stringsAsFactors = FALSE)
library(Matrix)
library(DelayedArray)
library(Seurat)
library(ggplot2)
library(irlba)
library(patchwork)
library(plyr)
library(dplyr)
library(stringr)
library(SnapATAC)
library(GenomicRanges)
library(cisTopic)

The libraries loaded above must all be installed. Note that several PCA steps are involved that can be much faster depending on which version of BLAS your R installation is linked against. For example, I am running this on my Mac, where R uses the Accelerate framework version of BLAS, which is quite fast (if you are on a Mac, you will probably find the same). OpenBLAS and similar libraries are also quite fast, while the reference BLAS that ships with R tends to be slower. I mention this because it can result in fairly large speed differences.
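
If you want a rough sense of how fast your BLAS is, timing a dense matrix multiply is a reasonable proxy (a small illustration of my own; absolute times depend on your hardware):

# Quick BLAS speed check: with an optimized BLAS this multiply should take well
# under a second; the reference BLAS will typically be noticeably slower.
m = matrix(rnorm(2000 * 2000), nrow=2000)
system.time(m %*% m)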

sessionInfo()
## R version 3.5.1 (2018-07-02)
## Platform: x86_64-apple-darwin17.6.0 (64-bit)
## Running under: macOS  10.14
## 
## Matrix products: default
## BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
## LAPACK: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libLAPACK.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] cisTopic_0.2.1       GenomicRanges_1.34.0 GenomeInfoDb_1.16.0 
##  [4] SnapATAC_1.0.0       rhdf5_2.24.0         stringr_1.3.1       
##  [7] dplyr_0.7.8          plyr_1.8.4           patchwork_0.0.1     
## [10] irlba_2.3.3          ggplot2_3.1.1        Seurat_3.0.0.9000   
## [13] DelayedArray_0.6.6   BiocParallel_1.14.2  IRanges_2.16.0      
## [16] S4Vectors_0.20.1     BiocGenerics_0.26.0  matrixStats_0.54.0  
## [19] Matrix_1.2-14       
## 
## loaded via a namespace (and not attached):
##   [1] snow_0.4-2                  backports_1.1.2            
##   [3] igraph_1.2.4.1              lazyeval_0.2.2             
##   [5] sp_1.3-1                    GSEABase_1.42.0            
##   [7] splines_3.5.1               listenv_0.7.0              
##   [9] feather_0.3.3               digest_0.6.18              
##  [11] foreach_1.4.4               htmltools_0.3.6            
##  [13] gdata_2.18.0                lda_1.4.2                  
##  [15] magrittr_1.5                memoise_1.1.0              
##  [17] cluster_2.0.7-1             doParallel_1.0.14          
##  [19] ROCR_1.0-7                  limma_3.36.5               
##  [21] globals_0.12.4              annotate_1.58.0            
##  [23] R.utils_2.8.0               colorspace_1.3-2           
##  [25] blob_1.1.1                  ggrepel_0.8.0              
##  [27] crayon_1.3.4                RCurl_1.95-4.11            
##  [29] jsonlite_1.6                bigmemory.sri_0.1.3        
##  [31] graph_1.58.2                bindr_0.1.1                
##  [33] survival_2.42-6             zoo_1.8-3                  
##  [35] iterators_1.0.10            glue_1.3.0                 
##  [37] gtable_0.3.0                zlibbioc_1.26.0            
##  [39] XVector_0.22.0              Rhdf5lib_1.2.1             
##  [41] future.apply_1.0.1          scales_1.0.0               
##  [43] DBI_1.0.0                   edgeR_3.22.5               
##  [45] bibtex_0.4.2                Rcpp_1.0.1                 
##  [47] metap_1.0                   viridisLite_0.3.0          
##  [49] xtable_1.8-3                reticulate_1.10            
##  [51] bit_1.1-14                  rsvd_1.0.0                 
##  [53] SDMTools_1.1-221            tsne_0.1-3                 
##  [55] htmlwidgets_1.3             httr_1.3.1                 
##  [57] gplots_3.0.1                RColorBrewer_1.1-2         
##  [59] ica_1.0-2                   pkgconfig_2.0.2            
##  [61] XML_3.98-1.16               R.methodsS3_1.7.1          
##  [63] locfit_1.5-9.1              tidyselect_0.2.5           
##  [65] rlang_0.3.4                 later_0.8.0                
##  [67] AnnotationDbi_1.44.0        munsell_0.5.0              
##  [69] tools_3.5.1                 RSQLite_2.1.1              
##  [71] ggridges_0.5.1              evaluate_0.11              
##  [73] yaml_2.2.0                  npsurv_0.4-0               
##  [75] knitr_1.20                  bit64_0.9-7                
##  [77] fitdistrplus_1.0-14         caTools_1.17.1.1           
##  [79] purrr_0.2.5                 RANN_2.6.1                 
##  [81] bindrcpp_0.2.2              pbapply_1.3-4              
##  [83] future_1.10.0               mime_0.6                   
##  [85] R.oo_1.22.0                 compiler_3.5.1             
##  [87] plotly_4.8.0                png_0.1-7                  
##  [89] lsei_1.2-0                  tibble_2.1.1               
##  [91] stringi_1.2.4               lattice_0.20-35            
##  [93] pillar_1.3.1                Rdpack_0.8-0               
##  [95] lmtest_0.9-36               data.table_1.12.2          
##  [97] cowplot_0.9.3               bitops_1.0-6               
##  [99] bigmemory_4.5.33            gbRd_0.4-11                
## [101] raster_2.8-19               httpuv_1.5.1               
## [103] AUCell_1.5.5                R6_2.4.0                   
## [105] promises_1.0.1              KernSmooth_2.23-15         
## [107] RcisTarget_1.3.5            codetools_0.2-15           
## [109] MASS_7.3-50                 gtools_3.8.1               
## [111] assertthat_0.2.1            SummarizedExperiment_1.10.1
## [113] rprojroot_1.3-2             withr_2.1.2                
## [115] GenomeInfoDbData_1.1.0      doSNOW_1.0.16              
## [117] hms_0.4.2                   grid_3.5.1                 
## [119] tidyr_0.8.1                 rmarkdown_1.10             
## [121] Rtsne_0.15                  Biobase_2.40.0             
## [123] shiny_1.3.1

Setting up utility functions

There are a few utility functions that we’ll need later on. Note that the TF-IDF function is longer than it really needs to be because it provides several options and also has an out-of-memory matrix multiplication implementation that it can fall back on.

Note that I use the lsi_workflow function with log_scale_tf=TRUE for the LSI logTF discussed in the blog post. This also seems to work pretty much the same as the version of LSI that 10x uses, which I have tried to replicate in the tenx_lsi_workflow function. Both are maintained here in case you want to compare.

########################################
# Utility functions
########################################
# Loads 10x dataset into a sparse matrix object
# Args:
#   matrix_fn (string): name of mtx or mtx.gz file containing MM formatted matrix
#   peaks_fn (string): name of peaks BED file (matches rows of matrix)
#   barcodes_fn (string): name of file containing cell barcodes (matches columns of matrix)
# Returns:
#   sparse matrix: matrix represented by files above with row and column names set (chr_start_stop format used for rownames)
load_tenx_atac = function(matrix_fn, peaks_fn, barcodes_fn) {
  atac_matrix = readMM(matrix_fn)
  colnames(atac_matrix) = read.delim(barcodes_fn, header=FALSE)$V1
  peaks = read.delim(peaks_fn, header=FALSE)
  peaks = paste(peaks$V1, peaks$V2, peaks$V3, sep = '_')
  rownames(atac_matrix) = peaks
  return(atac_matrix)
}

# Allows filtering of sites measured as non-zero in less than a given number of cells
# Args:
#   bmat (sparse matrix): sparse matrix (binarized)
#   cells (int): filter out sites measured as non-zero in fewer than this number of cells
# Returns:
#   sparse matrix: filtered sparse matrix
filter_features = function(bmat, cells=10) {
  bmat = bmat[Matrix::rowSums(bmat) >= cells, ]
  return(bmat)
}

# Allows filtering of cells with below a given number of non-zero features
# Args:
#   bmat (sparse matrix): sparse matrix (binarized)
#   features_above_zero (int): filter out cells with fewer than this number of features above zero
# Returns:
#   sparse matrix: filtered sparse matrix
filter_cells = function(bmat, features_above_zero=100) {
  bmat = bmat[, Matrix::colSums(bmat > 0) >= features_above_zero]
  return(bmat)
}

# Takes sparse matrix object and downsamples to a given fraction of entries remaining.
# Args:
#   bmat (sparse matrix): sparse matrix to downsample
#   fraction_remaining (float): float (0, 1) that indicates fraction of non-zero entries to retain
#   cells_per_site_min (int): min cells a site must be measured in to retain the site in matrix
#   sites_per_cell_min (int): min sites a cell must have non-zero entries in to retain the cell in matrix
# Returns:
#   sparse matrix: downsampled sparse matrix
downsample_matrix = function(bmat, fraction_remaining=0.5, cells_per_site_min=1, sites_per_cell_min=1) {
  set.seed(2019)
  non_zero_entries = which(bmat@x > 0)
  indices_to_zero = sample(non_zero_entries, size=ceiling(length(non_zero_entries) * (1 - fraction_remaining)))
  bmat@x[indices_to_zero] = 0
  bmat = Matrix::drop0(bmat) # drop the explicit zeros left behind in the sparse representation
  
  # Make sure to get rid of stuff that has gone to ~ 0 after downsampling
  bmat = filter_features(bmat, cells=cells_per_site_min)
  bmat = filter_cells(bmat, features_above_zero=sites_per_cell_min)
  return(bmat)
}
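
Hypothetical usage (bmat here stands in for any binarized peak-by-cell matrix, such as the one loaded later in this document):

# bmat.downsampled = downsample_matrix(bmat, fraction_remaining=0.5,
#                                      cells_per_site_min=10, sites_per_cell_min=100)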

########################################
# Functions for LSI
########################################
# Helper function to do fast version of row scaling of sparse TF matrix by IDF vector.
# Exploits the way that data is stored within sparseMatrix object. Seems to be much more memory efficient than tf * idf and faster than DelayedArray.
# Args:
#   tf (sparse matrix): term frequency matrix
#   idf (vector): IDF vector
# Returns:
#   sparse matrix: TF-IDF matrix
safe_tfidf_multiply = function(tf, idf) {
   tf = t(tf) # transpose so features are columns (dgCMatrix stores entries by column)
   # diff(tf@p) gives the number of non-zero entries in each column, so this scales
   # every non-zero entry of feature j by idf[j] without ever densifying the matrix
   tf@x <- tf@x * rep.int(idf, diff(tf@p))
   tf = t(tf)
   return(tf)
}
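
As a quick sanity check (a toy example of my own, not part of the original workflow), the sparse scaling above should agree with naive dense row scaling:

# Toy check that the sparse trick matches dense row-by-IDF multiplication.
toy_tf = rsparsematrix(5, 4, density=0.5)
toy_idf = runif(5)
stopifnot(isTRUE(all.equal(as.matrix(safe_tfidf_multiply(toy_tf, toy_idf)),
                           as.matrix(toy_tf) * toy_idf)))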

# Perform TF-IDF on binary matrix
# Args:
#   bmat (sparse matrix): sparse matrix (binarized)
#   frequencies (bool): divide bmat by colSums (if FALSE simply use bmat for TF matrix)
#   log_scale_tf (bool): log scale TF matrix if TRUE
#   scale_factor (float): multiply terms in TF matrix by scale_factor prior to log1p. Equivalent to adding small pseudocount but doesn't cast to dense matrix at any point.
# Returns:
#   sparse matrix: TF-IDF matrix
tfidf = function(bmat, frequencies=TRUE, log_scale_tf=TRUE, scale_factor=100000) {
  # Use either raw counts or divide by total counts in each cell
  if (frequencies) {
    # "term frequency" method
    tf = t(t(bmat) / Matrix::colSums(bmat))
  } else {
    # "raw count" method
    tf = bmat
  }
  
  # Either TF method can optionally be log scaled
  if (log_scale_tf) {
    if (frequencies) {
      tf@x = log1p(tf@x * scale_factor)
    } else {
      tf@x = log1p(tf@x)
    }
  }
  
  # IDF w/ "inverse document frequency smooth" method
  idf = log(1 + ncol(bmat) / Matrix::rowSums(bmat))
  
  # TF-IDF
  tf_idf_counts = safe_tfidf_multiply(tf, idf)
  rownames(tf_idf_counts) = rownames(bmat)
  colnames(tf_idf_counts) = colnames(bmat)
  return(tf_idf_counts)
}

# Perform current version of TF-IDF used by 10x on binary matrix
# Args:
#   bmat (sparse matrix): sparse matrix (binarized)
# Returns:
#   sparse matrix: TF-IDF matrix
tenx_tfidf = function(bmat) {
  idf = log(ncol(bmat) + 1) - log(1 + Matrix::rowSums(bmat))
  tf_idf_counts = safe_tfidf_multiply(bmat, idf)
  
  rownames(tf_idf_counts) = rownames(bmat)
  colnames(tf_idf_counts) = colnames(bmat)
  tf_idf_counts = as(tf_idf_counts, "sparseMatrix")
  return(tf_idf_counts)
}
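
The two IDF variants are similar but not identical: with N cells and a feature observed in df of them, tfidf above uses log(1 + N/df), while the 10x-style version uses log((N + 1)/(1 + df)). A quick numeric comparison with toy numbers of my choosing:

# Compare the two IDF formulas at a few document frequencies (N = 1000 cells).
N = 1000
df = c(10, 100, 500)
cbind(smoothed_idf = log(1 + N / df), tenx_idf = log((N + 1) / (1 + df)))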

# Perform fast PCA (irlba) on matrix, retaining observation names
# Args:
#   mat (sparse matrix): matrix to use for PCA (no further scaling or centering done)
#   dims (int): number of PCs to calculate
# Returns:
#   matrix: PCA coordinates for cells (cells x dims)
do_pca = function(mat, dims=50) {
  pca.results = irlba(t(mat), nv=dims)
  # Cell coordinates are U %*% diag(D) from the truncated SVD
  final_result = pca.results$u %*% diag(pca.results$d)
  rownames(final_result) = colnames(mat)
  colnames(final_result) = paste0('PC_', 1:dims)
  return(final_result)
}
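
To see how tfidf and do_pca fit together, here is a purely illustrative run on a small random binary matrix (names are made up for the demo):

# Toy end-to-end check of TF-IDF + PCA; real usage on downloaded data appears below.
toy_bmat = as(matrix(rbinom(200 * 30, 1, 0.3), nrow=200), "sparseMatrix")
rownames(toy_bmat) = paste0('site_', 1:200)
colnames(toy_bmat) = paste0('cell_', 1:30)
toy_pca = do_pca(tfidf(toy_bmat), dims=5)
dim(toy_pca) # 30 cells x 5 PCs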

########################################
# Helper functions for dim reduction
########################################
# Wrapper for performing further dim reduction (tSNE/UMAP) and clustering given PCA space via Seurat.
# Args:
#   atac_matrix (sparse matrix): matrix to store in Seurat object (not used in computations)
#   cell_embeddings (matrix): typically PCA coordinates of cells but could be any set of reduced dim coordinates
#   dims (vector of int): vector of dims to use from cell_embeddings in downstream analysis
#   metadata (dataframe): dataframe of metadata (rownames are cell names) to add to Seurat object
#   reduction (string): reduction to use for downstream steps. Can be 'pca' (cell_embeddings) or 'pca.l2' (L2 normalized cell_embeddings)
# Returns:
#   Seurat object: seurat object
run_dim_reduction = function(atac_matrix, cell_embeddings, dims, metadata=NULL, reduction='pca.l2') {
  if (is.null(metadata)) {
    seurat_obj = Seurat::CreateSeuratObject(atac_matrix)
  } else {
    seurat_obj = Seurat::CreateSeuratObject(atac_matrix, meta.data = metadata)
  }

  seurat_obj[['pca']] = Seurat::CreateDimReducObject(embeddings=cell_embeddings, key='PC_', assay='RNA')
  seurat_obj = seurat_obj %>%
                Seurat::L2Dim(reduction='pca') %>%
                Seurat::RunUMAP(reduction = reduction, dims = dims) %>%
                Seurat::RunTSNE(reduction = reduction, dims = dims) %>%
                Seurat::FindNeighbors(reduction=reduction, nn.eps=0.25, dims=dims)
  return(seurat_obj)
}

# Helper function for plotting Spearman correlations of a given metadata column with all dimensions in a reduced space.
# Args:
#   seurat_obj (seurat object): Seurat object to use
#   reduction (string): name of reduction to use
#   column (string): name of column in metadata to use
# Returns:
#   ggplot object: plot object
plot_pc_correlation = function(seurat_obj, reduction, column='nCount_RNA') {
  coords = Seurat::Embeddings(seurat_obj, reduction=reduction)
  column_value = seurat_obj@meta.data[, column]
  correlations = apply(coords, 2, function(x) {cor(x, column_value, method='spearman')})
  correlations_df = data.frame(correlation=correlations, PC=1:ncol(coords))
  
  plot_obj = ggplot(correlations_df, aes(PC, correlation)) +
    geom_point() +
    theme_classic() +
    geom_hline(yintercept = 0, linetype='dashed', color='red')
    
  return(plot_obj)
}

########################################
# Wrapper functions for workflows
########################################
# Wrapper for full LSI workflow (TF-IDF and PCA + clustering + further dim reduction)
# Args:
#   bmat (sparse matrix): sparse matrix (binarized)
#   dims (vector of int): vector of dims to use from cell_embeddings in downstream analysis
#   metadata: dataframe of metadata (rownames are cell names) to add to Seurat object
#   log_scale_tf (bool): log scale TF matrix if TRUE
#   reduction (string): reduction to use for downstream steps. Can be 'pca' (cell_embeddings) or 'pca.l2' (L2 normalized cell_embeddings)
#   resolution (float): resolution parameter to Seurat Louvain clustering
# Returns:
#   Seurat object: Seurat object. clustering + tSNE + UMAP done on PCA results from TF-IDF matrix.
lsi_workflow = function(bmat, dims, metadata=NULL, log_scale_tf=TRUE, reduction='pca.l2', resolution=0.3) {
  tfidf_mat = tfidf(bmat, log_scale_tf=log_scale_tf)
  pca_mat = do_pca(tfidf_mat, dims=max(dims))
  
  seurat_obj = run_dim_reduction(bmat, pca_mat, dims, metadata) %>%
                Seurat::FindClusters(reduction=reduction, n.start=20, resolution=resolution)
  return(seurat_obj)
}

# Wrapper for 10x version of full LSI workflow (TF-IDF and PCA + clustering + further dim reduction). Only TF-IDF step is modified.
# Args:
#   bmat (sparse matrix): sparse matrix (binarized)
#   dims (vector of int): vector of dims to use from cell_embeddings in downstream analysis
#   metadata: dataframe of metadata (rownames are cell names) to add to Seurat object
#   reduction (string): reduction to use for downstream steps. Can be 'pca' (cell_embeddings) or 'pca.l2' (L2 normalized cell_embeddings)
#   resolution (float): resolution parameter to Seurat Louvain clustering
# Returns:
#   Seurat object: Seurat object. clustering + tSNE + UMAP done on PCA results from TF-IDF matrix.
tenx_lsi_workflow = function(bmat, dims, metadata=NULL, reduction='pca.l2', resolution=0.3) {
  tfidf_mat = tenx_tfidf(bmat)
  pca_mat = do_pca(tfidf_mat, dims=max(dims))
  
  seurat_obj = run_dim_reduction(bmat, pca_mat, dims, metadata) %>%
                Seurat::FindClusters(reduction=reduction, n.start=20, resolution=resolution)
  return(seurat_obj)
}

# Runs cisTopic on binary matrix.
# Args:
#   bmat (sparse matrix): sparse matrix (binarized)
#   topic (vector of int): topic numbers to generate models for
# Returns:
#   cisTopic object: cisTopic object with models generated
cistopic_workflow = function(bmat, topic=c(40, 50)) {
  coords = str_split_fixed(rownames(bmat), '_', 3)
  new_coords = paste0(coords[, 1], ':', coords[, 2], '-', coords[, 3])
  rownames(bmat) = new_coords

  cisTopicObject = cisTopic::createcisTopicObject(bmat, project.name='mouse_atac')
  cisTopicObject = cisTopic::runModels(cisTopicObject, topic, seed=2019, nCores=1, burnin = 250, iterations = 500)
  return(cisTopicObject)
}
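
Hypothetical usage (shown for orientation only; LDA is slow to run, so precomputed cisTopic objects are downloaded in the Data section below rather than regenerated here):

# cistopic_obj = cistopic_workflow(mouse.bmat, topic=c(40, 50))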

# Wrapper for SnapATAC workflow up until PCA step.
# Args:
#   snap_file (string): path to snap file
#   promoter.df (dataframe): dataframe with promoter definitions as shown in SnapATAC tutorials (dataframe not file name; see examples in markdown)
#   blacklist.df (dataframe): dataframe with blacklist region definitions as shown in SnapATAC tutorials (dataframe not file name; see examples in markdown)
#   fragment_number_threshold (int): threshold for number of unique fragments per cell
#   promoter_ratio_range (c(float, float)): vector with lower and upper bound of acceptable fraction of reads in promoter regions for cell filtering as used in SnapATAC tutorial.
#   window_z_range (c(float, float)): vector with lower and upper bound of acceptable window z-scores for non-zero entries for site filtering as used in SnapATAC tutorial.
#   sample_name (string): sample_name provided to SnapATAC
#   pc.num (int): total PCs to compute
# Returns:
#   snap object: SnapATAC object
# Notes:
#   This uses single core because multithreaded implementation interferes with Knitr. In running this code, any do.par=FALSE and num.cores=1 could be changed as needed.
snapatac_workflow = function(snap_file, promoter.df=NULL, blacklist.df=NULL, fragment_number_threshold=500, promoter_ratio_range=c(0.2, 0.8), window_z_range=c(-1.5, 1.5), sample_name='default', pc.num=50) {
  x.sp = createSnap(
    file=snap_file,
    sample=sample_name,
    do.par=FALSE,
    num.cores=1
  )
  
  plotBarcode(
    obj=x.sp, 
    pdf.file.name=NULL, 
    pdf.width=7, 
    pdf.height=7, 
    col="grey",
    border="grey",
    breaks=50
  )
  
  x.sp = filterCells(
    obj=x.sp, 
    subset.names=c("fragment.num", "UMI"),
    low.thresholds=c(fragment_number_threshold, fragment_number_threshold),
    high.thresholds=c(Inf, Inf)
  )
  
  x.sp = addBmatToSnap(x.sp, bin.size=5000, num.cores=1)

  # Optionally filter cells based on ratio of reads in promoters
  if (!is.null(promoter.df)) {
    promoter.gr = GRanges(promoter.df[,1], IRanges(promoter.df[,2], promoter.df[,3]))
    ov = findOverlaps(x.sp@feature, promoter.gr)
    idy = queryHits(ov)
    promoter_ratio = SnapATAC::rowSums(x.sp[,idy, mat="bmat"], mat="bmat") / SnapATAC::rowSums(x.sp, mat="bmat")
    plot(log(SnapATAC::rowSums(x.sp, mat="bmat") + 1, 10), promoter_ratio,
         cex=0.5, col="grey", xlab="log10(count)", ylab="FIP Ratio", ylim=c(0, 1))
    idx = which(promoter_ratio > promoter_ratio_range[1] & promoter_ratio < promoter_ratio_range[2])
    x.sp = x.sp[idx,]
  }
  
  x.sp = makeBinary(x.sp, mat="bmat")
  
  # Filter out non-standard contigs if present
  idy2 = grep("chrM|random", x.sp@feature)
  
  if (!is.null(blacklist.df)) {
    black_list.gr = GRanges(
      blacklist.df[,1],
      IRanges(blacklist.df[,2], blacklist.df[,3])
    )
    idy1 = queryHits(findOverlaps(x.sp@feature, black_list.gr))
  } else {
    # No blacklist provided, so just ignore
    idy1 = c()
  }
  
  idy = unique(c(idy1, idy2))
  # Guard against an empty index vector: x.sp[,-integer(0)] would drop every feature
  if (length(idy) > 0) {
    x.sp = x.sp[,-idy, mat="bmat"]
  }
  
  # Filter based on frequency
  x.sp = filterBins(
    x.sp,
    low.threshold=window_z_range[1],
    high.threshold=window_z_range[2],
    mat="bmat"
  )
  
  plotBinCoverage(
    x.sp,
    pdf.file.name=NULL,
    col="grey",
    border="grey",
    breaks=10,
    xlim=c(-6,6)
  )
  
  x.sp = runJaccard(
    obj = x.sp,
    tmp.folder=tempdir(),
    mat = "bmat",
    max.var=2000,
    ncell.chunk=1000,
    do.par=FALSE,
    num.cores=1,
    seed.use=10
  )

  x.sp = runNormJaccard(
    obj = x.sp,
    tmp.folder=tempdir(),
    ncell.chunk=1000,
    method="normOVE",
    row.center=TRUE,
    row.scale=TRUE,
    low.threshold=-5,
    high.threshold=5,
    do.par=FALSE,
    num.cores=1,
    seed.use=10
  )

  x.sp = runDimReduct(
    x.sp,
    pc.num=pc.num,
    input.mat="jmat",
    method="svd",
    center=TRUE,
    scale=FALSE,
    seed.use=10
  )

  rownames(x.sp@bmat) = x.sp@barcode
  colnames(x.sp@bmat) = as.character(1:ncol(x.sp@bmat))

  return(x.sp)
}

# Reperform Jaccard + PCA on SnapATAC object (used for redoing after modifying matrix)
# Args:
#   snap_obj (snap object): snap object
#   pc.num (int): total PCs to compute
# Returns:
#   snap object: SnapATAC object
# Notes:
#   This uses single core because multithreaded implementation interferes with Knitr. In running this code, any do.par=FALSE and num.cores=1 could be changed as needed.
snapatac_rerun_jaccard = function(snap_obj, pc.num=50) {
  
  snap_obj = runJaccard(
    obj = snap_obj,
    tmp.folder=tempdir(),
    mat = "bmat",
    max.var=2000,
    ncell.chunk=1000,
    do.par=FALSE,
    num.cores=1,
    seed.use=10
  )

  snap_obj = runNormJaccard(
    obj = snap_obj,
    tmp.folder=tempdir(),
    ncell.chunk=1000,
    method="normOVE",
    row.center=TRUE,
    row.scale=TRUE,
    low.threshold=-5,
    high.threshold=5,
    do.par=FALSE,
    num.cores=1,
    seed.use=10
  )

  snap_obj = runDimReduct(
    snap_obj,
    pc.num=pc.num,
    input.mat="jmat",
    method="svd",
    center=TRUE,
    scale=FALSE,
    seed.use=10
  )

  return(snap_obj)
}

# Wrapper for cisTopic workflow (choose from models that have been run + clustering + further dim reduction)
# Args:
#   cistopicObject (cistopic object): cisTopic object with runModels having already been run
#   method (string): method argument to modelMatSelection function in cisTopic
#   reduction (string): reduction to use for downstream steps. Can be 'pca' (cell_embeddings) or 'pca.l2' (L2 normalized cell_embeddings)
#   resolution (float): resolution parameter to Seurat Louvain clustering
# Returns:
#   Seurat object: Seurat object. clustering + tSNE + UMAP done on topic matrix from cisTopic
seurat_workflow_on_cistopic = function(cistopicObject, method='Z-score', reduction='pca', resolution=0.3) {
  cistopicObject = cisTopic::selectModel(cistopicObject)

  cistopicObject.reduced_space = t(cisTopic::modelMatSelection(cistopicObject, target='cell', method=method))
  colnames(cistopicObject.reduced_space) = paste0('PC_', 1:ncol(cistopicObject.reduced_space))
  dimensions = ncol(cistopicObject.reduced_space)
  
  cistopicObject.seurat = run_dim_reduction(cistopicObject@binary.count.matrix, cistopicObject.reduced_space, dims=1:dimensions, reduction='pca')
  
  cistopicObject.seurat = cistopicObject.seurat %>% 
    Seurat::FindNeighbors(reduction=reduction, nn.eps=0.25, dims=1:dimensions) %>%
    Seurat::FindClusters(reduction=reduction, n.start=20, resolution=resolution)
  return(cistopicObject.seurat)
}

# Wrapper for running downstream Seurat workflow (clustering + further dim reduction) on PCA from Jaccard matrix generated by SnapATAC
# Args:
#   snap_obj (snap object): snap object with runDimReduct already run
#   dims (vector of int): vector of dims to use from cell_embeddings in downstream analysis
#   reduction (string): reduction to use for downstream steps. Can be 'pca' (cell_embeddings) or 'pca.l2' (L2 normalized cell_embeddings)
#   resolution (float): resolution parameter to Seurat Louvain clustering
# Returns:
#   Seurat object: Seurat object. clustering + tSNE + UMAP done on PCA matrix from SnapATAC (note PCA is weighted by variance explained)
seurat_workflow_on_jaccard_pca = function(snap_obj, dims, reduction='pca', resolution=0.3) {
  # Weight each PC by its singular value before downstream steps
  pca_results.jaccard = snap_obj@smat@dmat %*% diag(snap_obj@smat@sdev)
  colnames(pca_results.jaccard) = paste0('PC_', 1:ncol(pca_results.jaccard))
  rownames(pca_results.jaccard) = snap_obj@barcode

  seurat_obj.jaccard = run_dim_reduction(t(snap_obj@bmat), pca_results.jaccard, dims, reduction=reduction)
  seurat_obj.jaccard = seurat_obj.jaccard %>%
    Seurat::FindNeighbors(nn.eps=0.25, dims=dims, reduction=reduction) %>%
    Seurat::FindClusters(n.start=20, resolution=resolution, dims=dims, reduction=reduction)
  return(seurat_obj.jaccard)
}

# Helper function to plot embeddings corresponding to same set of cells with one another's cluster assignments for comparison.
# Args:
#   seurat_obj1 (Seurat object): first Seurat object to compare
#   seurat_obj2 (Seurat object): second Seurat object to compare
#   reduction (string): reduction to use for plot (can be tsne or umap)
#   description1 (string): title for seurat_obj1 (used in plot)
#   description2 (string): title for seurat_obj2 (used in plot)
#   cluster_column1 (string): column from metadata of seurat_obj1 to use for coloring plot
#   cluster_column2 (string): column from metadata of seurat_obj2 to use for coloring plot
# Returns:
#   ggplot object: ggplot object where each embedding is plotted colored by its own clusters and then again with the opposite object's clusters assigned for comparison. Four total panels shown.
plot_clustering_comparison = function(seurat_obj1, seurat_obj2, reduction, description1='', description2 = '', cluster_column1='RNA_snn_res.0.3', cluster_column2='RNA_snn_res.0.3') {
  # Clusters as called on each dataset
  seurat_obj1 = SetIdent(seurat_obj1, value=cluster_column1)
  seurat_obj2 = SetIdent(seurat_obj2, value=cluster_column2)
  
  p1 = DimPlot(seurat_obj1, reduction = reduction, pt.size=0.25) +
            ggtitle(description1)
  
  p2 = DimPlot(seurat_obj2, reduction = reduction, pt.size=0.25) +
            ggtitle(description2)
  
  # Now swap the labels to verify they are finding the same groups
  seurat_obj1@meta.data$cluster_seurat_obj2 = seurat_obj2@meta.data[, cluster_column2]
  seurat_obj2@meta.data$cluster_seurat_obj1 = seurat_obj1@meta.data[, cluster_column1]
  
  seurat_obj1 = SetIdent(seurat_obj1, value='cluster_seurat_obj2')
  seurat_obj2 = SetIdent(seurat_obj2, value='cluster_seurat_obj1')
  
  p3 = DimPlot(seurat_obj1, reduction = reduction, pt.size=0.25) +
            ggtitle(paste0(description1, ' (', description2, ' clusters)'))
  
  p4 = DimPlot(seurat_obj2, reduction = reduction, pt.size=0.25) +
            ggtitle(paste0(description2, ' (', description1, ' clusters)'))
  
  (p1 + p3) / (p2 + p4)
}

Data

We’ll need some data files from our study Cusanovich and Hill, et al. [Cell 2018], a public dataset from 10X, and some other auxiliary files. I have also processed some of these datasets with SnapTools and provide .snap files that allow for comparison of the use of peaks and windows as features.

dir.create('data_downloads', showWarnings = FALSE)

# Cusanovich and Hill et al.
download.file("http://krishna.gs.washington.edu/content/members/ajh24/mouse_atlas_data_release/matrices/atac_matrix.binary.qc_filtered.rds", "data_downloads/atac_matrix.binary.qc_filtered.rds")
download.file("http://krishna.gs.washington.edu/content/members/ajh24/mouse_atlas_data_release/metadata/cell_metadata.tissue_freq_filtered.txt", "data_downloads/cell_metadata.tissue_freq_filtered.txt")

# 10X Adult Mouse Brain
download.file("http://cf.10xgenomics.com/samples/cell-atac/1.1.0/atac_v1_adult_brain_fresh_5k/atac_v1_adult_brain_fresh_5k_filtered_peak_bc_matrix.tar.gz", "data_downloads/atac_v1_adult_brain_fresh_5k_filtered_peak_bc_matrix.tar.gz")
untar("data_downloads/atac_v1_adult_brain_fresh_5k_filtered_peak_bc_matrix.tar.gz", exdir = 'data_downloads/')

# CisTopic outputs for both datasets
download.file('https://krishna.gs.washington.edu/content/members/ajh24/2019_04_19_atac_dim_reduction_blogpost/data_release/WholeBrain_PFC.cistopic.rds', 'data_downloads/WholeBrain_PFC.cistopic.rds')
download.file('http://krishna.gs.washington.edu/content/members/ajh24/2019_04_19_atac_dim_reduction_blogpost/data_release/atac_v1_adult_brain_fresh_5k.cistopic.rds', destfile = 'data_downloads/atac_v1_adult_brain_fresh_5k.cistopic.rds')

# *.snap files produced by SnapTools for use with SnapATAC that I've made available
download.file('https://krishna.gs.washington.edu/content/members/ajh24/2019_04_19_atac_dim_reduction_blogpost/data_release/atac_v1_adult_brain_fresh_5k.snap', 'data_downloads/atac_v1_adult_brain_fresh_5k.snap')
download.file('https://krishna.gs.washington.edu/content/members/ajh24/2019_04_19_atac_dim_reduction_blogpost/data_release/WholeBrain_PFC.snap', 'data_downloads/WholeBrain_PFC.snap')

# Auxiliary data files
## Promoter definitions
download.file('http://renlab.sdsc.edu/r3fang/share/Fang_2019/MOs_snATAC/genes/promoter.bed', 'data_downloads/promoters.mm10.bed')
download.file('https://krishna.gs.washington.edu/content/members/ajh24/2019_04_19_atac_dim_reduction_blogpost/data_release/promoters.mm9.bed', 'data_downloads/promoters.mm9.bed')
download.file('http://renlab.sdsc.edu/r3fang/share/Fang_2019/published_scATAC/atac_v1_pbmc_10k_fastqs/promoter.bed', 'data_downloads/promoters.hg19.bed')

## ENCODE blacklist regions
download.file('http://mitra.stanford.edu/kundaje/akundaje/release/blacklists/mm10-mouse/mm10.blacklist.bed.gz', 'data_downloads/encode_blacklist.mm10.bed.gz')
download.file('http://mitra.stanford.edu/kundaje/akundaje/release/blacklists/mm9-mouse/mm9-blacklist.bed.gz', 'data_downloads/encode_blacklist.mm9.bed.gz')
download.file('http://mitra.stanford.edu/kundaje/akundaje/release/blacklists/hg19-human/wgEncodeHg19ConsensusSignalArtifactRegions.bed.gz', 'data_downloads/encode_blacklist.hg19.bed.gz')

Published flavor of LSI vs. using log-scaled TF matrix

As a first example, I’ll use data from whole brain and prefrontal cortex samples published as part of Cusanovich and Hill, et al. [Cell 2018]. I’m choosing this subset of the data because there is a recent preprint that uses a k-mer-based method on this same subset (useful for comparison purposes). It’s worth noting that in the actual paper we used a fairly carefully selected subset of all peaks for dimensionality reduction, which helped us quite a bit in that case. Here I use the much simpler approach of filtering out rare sites, to show that the version of LSI without log scaling of the TF matrix is fairly sensitive to the sites used as input, compared to LSI logTF.

mouse.bmat = readRDS("data_downloads/atac_matrix.binary.qc_filtered.rds")
mouse.metadata = read.delim("data_downloads/cell_metadata.tissue_freq_filtered.txt", sep='\t')
rownames(mouse.metadata) = mouse.metadata$cell

selected_tissues = c("WholeBrain", "PreFrontalCortex") #, "Cerebellum")
selected_cells = subset(mouse.metadata, tissue %in% selected_tissues) %>% rownames()

mouse.metadata = mouse.metadata[selected_cells, ]
mouse.bmat = mouse.bmat[, selected_cells]

# Take top 100000 peaks by frequency.
# Some filtering is important here because the peaks in this study were a merged set of peaks so many would not ever be called 
# or perhaps even observed in more than a few cells within this particular subset (in contrast to datasets composed of individual samples).
# The particular number doesn't impact LSI results much, but smaller makes cisTopic a bit faster.
top_sites = rank(-Matrix::rowSums(mouse.bmat)) <= 100000
mouse.bmat = mouse.bmat[top_sites,]

Show the distribution of TF matrix entries when dividing by total counts. Note that this is not relevant for the 10x method, which uses the raw matrix counts rather than dividing by total counts.

tf = t(t(mouse.bmat) / Matrix::colSums(mouse.bmat))
comparison_df = rbind(data.frame(value=tf@x, method='TF'), data.frame(value=log1p(tf@x * 100000), method='log TF'))

ggplot(comparison_df, aes(value)) +
  geom_histogram(bins = 70, aes(fill=method), color='black') +
  theme_classic() +
  scale_fill_manual(values=c("TF"="red", "log TF"="#d3d3d3")) +
  xlab('value in matrix') +
  facet_wrap(~method, scales='free')

Next, run LSI with and without log scaling the TF matrix in TF-IDF and compare, all with the same peaks/cells as input.

# lsi_workflow and most other similar workflow functions return Seurat objects
mouse.seurat.lsi = lsi_workflow(mouse.bmat, dims=2:50, metadata=mouse.metadata, log_scale_tf=FALSE, reduction='pca')
mouse.seurat.lsi_log = lsi_workflow(mouse.bmat, dims=2:50, metadata=mouse.metadata, log_scale_tf=TRUE, reduction='pca.l2')

pca.l2 here refers to an L2 normalized PCA space, which we find can sometimes (but not always) improve clustering results relative to using the raw PCA space. This is used throughout the post but doesn’t seem to matter much for these datasets.
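
For intuition, L2 normalization simply rescales each cell’s coordinate vector in PC space to unit length. A minimal sketch of the idea (Seurat::L2Dim handles this internally; this is not its actual implementation):

# Conceptual sketch: divide each row (cell) of an embeddings matrix by its L2 norm.
l2_normalize_rows = function(embeddings) {
  embeddings / sqrt(rowSums(embeddings ^ 2))
}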

Note that by specifying dims=2:50 we are not using the first PC in downstream analysis. Here I confirm that the first PC is highly correlated with depth (which is why we drop it, although you could also use something like limma to regress out log10(total counts)). I won’t show this in most cases moving forward. As others have noted, we find that dropping the first PC (so long as it is strongly correlated with read depth) often improves results somewhat, particularly for very sparse datasets.
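
A sketch of the regression alternative mentioned above (assumes the limma package, which is not loaded in this document; pca_mat and depth are placeholders for a cells x PCs embedding and a vector of per-cell total counts):

# Regress log10 depth out of each PC rather than dropping PC 1.
# limma::removeBatchEffect expects features x samples, hence the transposes.
regress_out_depth = function(pca_mat, depth) {
  t(limma::removeBatchEffect(t(pca_mat), covariates = log10(depth)))
}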

p1 = plot_pc_correlation(mouse.seurat.lsi, reduction = 'pca') +
      ggtitle('LSI')

p2 = plot_pc_correlation(mouse.seurat.lsi_log, reduction = 'pca') +
      ggtitle('LSI With Log Scaled TF')

p1 + p2

Compare tSNE plots colored first by each method’s own cluster annotations and then by the cluster annotations derived from the other method, to evaluate concordance. Note that the primary “metrics” used to assess “better” or “equivalent” performance throughout this post are: 1) Clustering results at identical resolution with exactly the same peaks and cells used as input. 2) Visual concordance of the dimensionality reduction/clustering between two or more methods.

While these are not perfect metrics, the differences that I care about here are large enough that I think they are sufficient for the claims I am trying to make. I am fairly confident in the claims given that very similar dimensionality reduction and clustering results are achieved across multiple different methods in many cases. Subtle differences would require more extensive comparison.
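
If you wanted to put a number on cluster concordance rather than judging it visually, an adjusted Rand index between the two sets of cluster assignments is one option (this assumes the mclust package, which is not used elsewhere in this document):

# e.g. mclust::adjustedRandIndex(mouse.seurat.lsi@meta.data$RNA_snn_res.0.3,
#                                mouse.seurat.lsi_log@meta.data$RNA_snn_res.0.3)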

plot_clustering_comparison(mouse.seurat.lsi,
                           mouse.seurat.lsi_log,
                           reduction='tsne',
                           description1='LSI',
                           description2='LSI logTF',
                           cluster_column1='RNA_snn_res.0.3',
                           cluster_column2='RNA_snn_res.0.3')