Setup

library(ithi.utils)
load_base_libs()

library(methods)
library(factoextra)
library(NbClust)
library(pvclust)
library(clValid)

library(ithi.meta)
library(ithi.figures)
library(ithi.utils)
ihc_table_path <- snakemake@input$ihc_table

tils_for_cluster <- snakemake@params$tils_for_cluster
db_path <- snakemake@params$db
annotation_colours <- ithi.figures::get_annotation_colours()
ihc_table <- fread(ihc_table_path)

Summary

Here we’ll assess the robustness of our TIL clustering.

In terms of understanding robustness, there are 3 major aspects:

  • Internal validity
  • External validity
  • Indirect validity

Since we don’t have a gold standard to compare to, external validity is out of the question.

Internal validity refers to measures that measure intra-cluster connectedness/tightness, between-cluster distance, and stability. This includes indices like the Dunn index (Dunn 1974). Stability refers to measures that determine the consistency of clustering based on subsets of data or features. pvclust can be used to assess the stability of clustering by resampling features.

Indirect validity refers to the (e.g. biological) relevance of the resulting clusters. Based on the results of ITH comparison, we have evidence to suggest that the S-TIL/ES-TIL split is biologically relevant.

Analysis

ihc_table_subset <- subset(ihc_table, select = c("condensed_id", tils_for_cluster))
ihc_table_subset_numeric <- ihc_table_subset %>% as.data.frame %>% column_to_rownames(var = "condensed_id") %>% 
    scale

na_rows <- apply(ihc_table_subset_numeric, 1, function(x) any(is.na(x)))
ihc_table_subset_numeric <- ihc_table_subset_numeric[!na_rows, ]

dat_clipped <- ithi.utils::clip_values(ihc_table_subset_numeric, hi = 2, lo = -2)
cluster_res <- factoextra::fviz_nbclust(dat_clipped, FUNcluster = hcut, method = c("wss"), 
    diss = NULL, k.max = 10)

nbclust_res <- NbClust(dat_clipped, distance = "euclidean", min.nc = 2, max.nc = 7, 
    method = "ward.D2", alphaBeale = 0, index = "all")

*** : The Hubert index is a graphical method of determining the number of clusters.
                In the plot of Hubert index, we seek a significant knee that corresponds to a 
                significant increase of the value of the measure i.e the significant peak in Hubert
                index second differences plot. 
 

*** : The D index is a graphical method of determining the number of clusters. 
                In the plot of D index, we seek a significant knee (the significant peak in Dindex
                second differences plot) that corresponds to a significant increase of the value of
                the measure. 
 
******************************************************************* 
* Among all indices:                                                
* 8 proposed 2 as the best number of clusters 
* 8 proposed 3 as the best number of clusters 
* 1 proposed 4 as the best number of clusters 
* 1 proposed 6 as the best number of clusters 
* 5 proposed 7 as the best number of clusters 

                   ***** Conclusion *****                            
 
* According to the majority rule, the best number of clusters is  2 
 
 
******************************************************************* 
fviz_nbclust(nbclust_res)
Among all indices: 
===================
* 2 proposed  0 as the best number of clusters
* 1 proposed  1 as the best number of clusters
* 8 proposed  2 as the best number of clusters
* 8 proposed  3 as the best number of clusters
* 1 proposed  4 as the best number of clusters
* 1 proposed  6 as the best number of clusters
* 5 proposed  7 as the best number of clusters

Conclusion
=========================
* According to the majority rule, the best number of clusters is  2 .

2 and 3 clusters are both equally well supported by this barrage of methods.

The Dunn index, which is very well cited, reports 3 as the optimal number of clusters.

clValid

We’ll look at the Dunn index again with this other package.

clValid2 <- function(obj, nClust, clMethods = "hierarchical", validation = "stability", 
    maxitems = 600, metric = "euclidean", method = "average", neighbSize = 10, 
    annotation = NULL, GOcategory = "all", goTermFreq = 0.05, dropEvidence = NULL, 
    verbose = FALSE, ...) {
    clMethods <- tolower(clMethods)
    clMethods <- match.arg(clMethods, c("hierarchical", "kmeans", "diana", "fanny", 
        "som", "model", "sota", "pam", "clara", "agnes"), several.ok = TRUE)
    if ("som" %in% clMethods) {
        if (!require(kohonen)) {
            stop("package 'kohonen' required for clustering using SOM")
        }
    }
    if ("model" %in% clMethods) {
        if (!require(mclust)) {
            stop("package 'mclust' required for model-based clustering")
        }
    }
    validation <- match.arg(validation, c("stability", "internal", "biological"), 
        several.ok = TRUE)
    metric <- match.arg(metric, c("euclidean", "correlation", "manhattan"))
    method <- match.arg(method, c("ward", "ward.D2", "single", "complete", "average"))
    GOcategory <- match.arg(GOcategory, c("all", "BP", "CC", "MF"))
    switch(class(obj), matrix = mat <- obj, ExpressionSet = mat <- Biobase::exprs(obj), 
        data.frame = {
            if (any(!sapply(obj, class) %in% c("numeric", "integer"))) stop("data frame 'obj' contains non-numeric data")
            mat <- as.matrix(obj)
        }, stop("argument 'obj' must be a matrix, data.frame, or ExpressionSet object"))
    if (nrow(mat) > maxitems) {
        if (interactive()) {
            cat("\nThe number of items to be clustered is larger than 'maxitems'\n")
            cat("The memory and time required may be excessive, do you wish to continue?\n")
            cat("(y to continue, any other character to exit) ")
            ans <- tolower(substr(readLines(n = 1), 1, 1))
            if (ans != "y") {
                stop("Exiting clValid, number of items exceeds 'maxitems'")
            }
        } else {
            stop("The number of items to be clustered is larger than 'maxitems'\n  Either decrease the number of rows (items) or increase 'maxitems'\n")
        }
    }
    if ("clara" %in% clMethods & metric == "correlation") 
        warning("'clara' currently only works with 'euclidean' or 'manhattan' metrics - metric will be changed to 'euclidean'  ")
    if (is.character(annotation) & length(grep(".db", annotation)) == 0) {
        annotation <- paste(annotation, ".db", sep = "")
    }
    if (is.list(annotation)) {
        if (is.null(rownames(mat))) {
            stop("rownames of data must be present to specify biological annotation from file")
        }
        annotation <- annotationListToMatrix(annotation, genenames = rownames(mat))
    }
    if ("biological" %in% validation & is.null(annotation)) {
        stop("annotation must be specified in order to use biological validation")
    }
    if ("biological" %in% validation & is.character(annotation)) {
        if (!require(Biobase) | !require(GO.db) | !require(annotate)) {
            stop("packages 'Biobase', 'GO.db', and 'annotate' required for 2nd type of biological validation \n\nthese can be downloaded from Bioconductor (www.bioconductor.org)")
        }
    }
    if ("biological" %in% validation & is.null(rownames(mat))) {
        stop("rownames of data must be present to use biological validation")
    }
    nClust <- floor(nClust)
    if (any(nClust < 1)) 
        stop("argument 'nClust' must be a positive integer vector")
    if (metric == "correlation") 
        Dist <- as.dist(1 - cor(t(mat), use = "pairwise.complete.obs")) else Dist <- dist(mat, method = metric)
    clusterObjs <- vector("list", length(clMethods))
    names(clusterObjs) <- clMethods
    measures <- c(if ("stability" %in% validation) c("APN", "AD", "ADM", "FOM"), 
        if ("internal" %in% validation) c("Connectivity", "Dunn", "Silhouette"), 
        if ("biological" %in% validation) c("BHI", "BSI"))
    validMeasures <- array(dim = c(length(measures), length(nClust), length(clMethods)))
    dimnames(validMeasures) <- list(measures, nClust, clMethods)
    for (i in 1:length(clMethods)) {
        cvalid <- vClusters(mat, clMethods[i], nClust, validation = validation, 
            Dist = Dist, method = method, metric = metric, annotation = annotation, 
            GOcategory = GOcategory, goTermFreq = goTermFreq, neighbSize = neighbSize, 
            dropEvidence = dropEvidence, verbose = verbose, ...)
        clusterObjs[[i]] <- cvalid$clusterObj
        validMeasures[, , i] <- cvalid$measures
    }
    if (is.null(rownames(mat))) {
        rownames(mat) <- 1:nrow(mat)
        warning("rownames for data not specified, using 1:nrow(data)")
    }
    new("clValid", clusterObjs = clusterObjs, measures = validMeasures, measNames = measures, 
        clMethods = clMethods, labels = rownames(mat), nClust = nClust, validation = validation, 
        metric = metric, method = method, neighbSize = neighbSize, GOcategory = GOcategory, 
        goTermFreq = goTermFreq, annotation = annotation, call = match.call())
}

environment(clValid2) <- asNamespace("clValid")
intern <- clValid2(dat_clipped, nClust = 2:6, clMethods = "hierarchical", validation = "internal", 
    metric = "euclidean", method = "ward.D2")

as.data.frame(intern@measures)["Dunn", ]

3 is the minimum number of clusters with the maximal Dunn index.

stability <- clValid2(dat_clipped, nClust = 2:6, clMethods = "hierarchical", 
    validation = "stability", metric = "euclidean", method = "ward.D2")

pvclust

result <- pvclust(t(dat_clipped), method.dist = "euclidean", method.hclust = "ward.D2", 
    nboot = 100, r = seq(1, 5, 0.1))
Bootstrap (r = 1.0)... Done.
Bootstrap (r = 1.0)... Done.
Bootstrap (r = 1.12)... Done.
Bootstrap (r = 1.25)... Done.
Bootstrap (r = 1.38)... Done.
Bootstrap (r = 1.5)... Done.
Bootstrap (r = 1.5)... Done.
Bootstrap (r = 1.62)... Done.
Bootstrap (r = 1.75)... Done.
Bootstrap (r = 1.88)... Done.
Bootstrap (r = 2)... Done.
Bootstrap (r = 2)... Done.
Bootstrap (r = 2.12)... Done.
Bootstrap (r = 2.25)... Done.
Bootstrap (r = 2.38)... Done.
Bootstrap (r = 2.5)... Done.
Bootstrap (r = 2.5)... Done.
Bootstrap (r = 2.62)... Done.
Bootstrap (r = 2.75)... Done.
Bootstrap (r = 2.88)... Done.
Bootstrap (r = 3)... Done.
Bootstrap (r = 3)... Done.
Bootstrap (r = 3.12)... Done.
Bootstrap (r = 3.25)... Done.
Bootstrap (r = 3.38)... Done.
Bootstrap (r = 3.5)... Done.
Bootstrap (r = 3.5)... Done.
Bootstrap (r = 3.62)... Done.
Bootstrap (r = 3.75)... Done.
Bootstrap (r = 3.88)... Done.
Bootstrap (r = 4)... Done.
Bootstrap (r = 4)... Done.
Bootstrap (r = 4.12)... Done.
Bootstrap (r = 4.25)... Done.
Bootstrap (r = 4.38)... Done.
Bootstrap (r = 4.5)... Done.
Bootstrap (r = 4.5)... Done.
Bootstrap (r = 4.62)... Done.
Bootstrap (r = 4.75)... Done.
Bootstrap (r = 4.88)... Done.
Bootstrap (r = 5)... Done.
plot(result)
pvrect(result, alpha = 0.96)

pvclust seems to have a habit of doing this – it only calls branches that are very long/near the tips as being significant.

So pvclust seems to indicate that all of these clusters are stable.

Caveats:

  • Results change drastically when parameter r is changed, i.e. changing r to seq(0.5, 4.0, 0.1) instead of the default seq(0.5, 1.5, 0.1) gives you significant results for virtually every branching point.
  • The bootstrap method applied by pvclust does not make sense for this type of data (for values of r below 1). pvclust is meant to operate on data like phylogenetic data – where resampling the features of the dataset makes sense (i.e. how bootstrap confidence is defined for phylogenies). On this type of data, pvclust would be most adequately used to compare the significance of FEATURE clustering – where data can be subsampled – rather than OBSERVATION clustering. For example, in this clustergram epithelial CD8+ TIL density contributes greatly to the difference between clusters S-TIL and ES-TIL – excluding it in some resamples does not make sense and is likely the reason for the previous bullet point

The second point isn’t saying that this is invalid – but it isn’t the only way we should assess cluster tightness.

---
title: "TIL cluster robustness"
---
                        ```{r, echo=FALSE, message=FALSE, warning=FALSE}

######## Snakemake header ########
library(methods)
Snakemake <- setClass(
    "Snakemake",
    slots = c(
        input = "list",
        output = "list",
        params = "list",
        wildcards = "list",
        threads = "numeric",
        log = "list",
        resources = "list",
        config = "list",
        rule = "character"
    )
)
snakemake <- Snakemake(
    input = list('notebooks/tilcluster_robustness.Rmd', '/shahlab/alzhang/projects/ITH_Immune/paper/results/tables/run2/ihc_table.tsv', "notebook" = 'notebooks/tilcluster_robustness.Rmd', "ihc_table" = '/shahlab/alzhang/projects/ITH_Immune/paper/results/tables/run2/ihc_table.tsv'),
    output = list('/shahlab/alzhang/projects/ITH_Immune/paper/results/review/notebooks/run2/tilcluster_robustness.nb.html'),
    params = list('tilcluster_robustness_analysis', '/shahlab/alzhang/projects/ITH_Immune/metadata/db/immune_project.sqlite3', c('E_CD8_density', 'E_CD4_density', 'E_CD20_density', 'E_Plasma_density', 'S_CD8_density', 'S_CD4_density', 'S_CD20_density', 'S_Plasma_density'), "name" = 'tilcluster_robustness_analysis', "db" = '/shahlab/alzhang/projects/ITH_Immune/metadata/db/immune_project.sqlite3', "tils_for_cluster" = c('E_CD8_density', 'E_CD4_density', 'E_CD20_density', 'E_Plasma_density', 'S_CD8_density', 'S_CD4_density', 'S_CD20_density', 'S_Plasma_density')),
    wildcards = list(),
    threads = 1,
    log = list('/shahlab/alzhang/clusttmp/paperreview2/notebooks/tilcluster_robustness_analysis.log'),
    resources = list(),
    config = list("bcr_diversity" = '/shahlab/alzhang/pipeline_outputs/ith_immune/mixcr/mixcr_runs/ith_1_2_3/mixcr5/postprocess/IGH/postfilter_diversity_stats/diversity.strict.resampled.txt', "breakpoint_table" = '/shahlab/amcpherson/projects/ith3/ith3/notebooks/bespoke/ith_breakpoints.tsv', "clone_trees" = '/shahlab/alzhang/projects/ITH_Immune/paper/results/tables/run2/clones/tree_data.tsv', "tcr_diversity" = '/shahlab/alzhang/pipeline_outputs/ith_immune/mixcr/mixcr_runs/ith_1_2_3/mixcr5/postprocess/TRB/postfilter_diversity_stats/diversity.strict.resampled.txt', "he_results_dir" = '/shahlab/alzhang/data/ithi/finn_results/he_output_Nov29', "somatic_coding_result_dir" = '/shahlab/alzhang/projects/ITH_Immune/paper/results/tables/run2/somatic_coding_variants', "distance_method" = 'horn', "copynumber_table" = '/shahlab/alzhang/data/ithi/master_copynumber_file.tsv', "ith_stats" = '/shahlab/alzhang/projects/ITH_Immune/paper/results/tables/run2/ith_statistics.tsv', "tilcluster_supervised_ipynb" = '/shahlab/alzhang/projects/ITH_Immune/paper/review/ipy/tilcluster_supervisedmulticlass.ipynb', "epitopes_unique_filtered" = '/shahlab/alzhang/projects/ITH_Immune/paper/results/tables/run2/epitopes_unique_filtered.tsv', "til_clusters_output" = '/shahlab/alzhang/projects/ITH_Immune/paper/results/intermediates/run2/til_clusters_output.txt', "remixt_cellularity_ploidy" = '/shahlab/alzhang/projects/ITH_Immune/paper/results/tables/run2/remixt_cellularity_ploidy.tsv', "table_dir" = '/shahlab/alzhang/projects/ITH_Immune/paper/results/review/tables/run2', "variability_type" = 'stabilize', "ith_icgc_bc" = '/shahlab/alzhang/projects/ITH_Immune/paper/results/tables/run2/ith_icgc_merged_bc.tsv', "array_expression_file" = '/shahlab/alzhang/projects/ITH_Immune/data/expression/array/gene_exprs_rma_batch_corrected.txt', "nanostring_data" = '/shahlab/alzhang/projects/ITH_Immune/results/nanostring_results/ith_full/qc/limma_quantile/normalized_expression_voa_labels_filtered.tsv', "molsubtypes" = '/shahlab/alzhang/projects/ITH_Immune/paper/results/tables/run2/molsubtypes.tsv', "logdir" = '/shahlab/alzhang/clusttmp/paperreview2', "icgc_subtypes" = '/shahlab/alzhang/data/ICGC/icgc_primary_tumour_subtypes.tsv', "total_tiltypes" = c('T_CD8_density', 'T_CD4_density', 'T_CD20_density', 'T_Plasma_density'), "tils_for_variability" = c('T_CD8_density', 'T_CD4_density', 'T_CD20_density', 'T_Plasma_density'), "tumour_purity" = '/shahlab/alzhang/projects/ITH_Immune/paper/results/tables/run2/tumour_purity.tsv', "clola_result_file" = '/shahlab/alzhang/pipeline_outputs/ith_immune/clola/run4/clola_condensed_results/beta/clola_results.tsv', "tils_for_cluster" = c('E_CD8_density', 'E_CD4_density', 'E_CD20_density', 'E_Plasma_density', 'S_CD8_density', 'S_CD4_density', 'S_CD20_density', 'S_Plasma_density'), "clone_branch_lengths" = '/shahlab/alzhang/projects/ITH_Immune/paper/results/tables/run2/clones/branch_data.tsv', "patients_for_clonal" = c(1, 2, 3, 4, 7, 9, 10, 11, 12, 13, 14, 15, 16, 17), "notebook_dir" = '/shahlab/alzhang/projects/ITH_Immune/paper/results/review/notebooks/run2', "ihc_features_output" = '/shahlab/alzhang/projects/ITH_Immune/paper/results/intermediates/run2/ihc_features_output.txt', "ihc_table" = '/shahlab/alzhang/projects/ITH_Immune/paper/results/tables/run2/ihc_table.tsv', "finnhe_pipeline_results_dir" = '/shahlab/alzhang/pipeline_outputs/ith_immune/finnhe/run1', "refseq_gene_file" = '/shahlab/alzhang/data/genome/hg19/refseq_genes.bed', "clone_prevalences" = '/shahlab/alzhang/projects/ITH_Immune/paper/results/tables/run2/clones/clone_data.tsv', "mmctm_final_patient_dir" = '/shahlab/alzhang/projects/ITH_Immune/results/mmctm_results/ith_by-patient_with-ov', "benchmarkdir" = '/shahlab/alzhang/benchmarks/paperreview2', "xcr_table" = '/shahlab/alzhang/projects/ITH_Immune/paper/results/tables/run2/xcr_table.tsv', "ith_stat_types" = c('entropy', 'postprocessed_divergence', 'combined_ith_normalized', 'proportion_subclonal'), "neoediting_outdir" = '/shahlab/alzhang/pipeline_outputs/ith_immune/neoediting/run6', "snv_table" = '/shahlab/amcpherson/projects/ith3/ith3/notebooks/bespoke/ith_snvs.tsv', "rooney_mutsigcv_file" = '/shahlab/alzhang/projects/ITH_Immune/external/other_papers/mmc6.xlsx', "snv_cluster_dir" = '/shahlab/alzhang/projects/ITH_Immune/paper/results/tables/run2/clones/snv_cluster', "image_summary2" = '/shahlab/alzhang/data/ithi/yuan_hecr_image_results_2.csv', "image_summary" = '/shahlab/alzhang/data/ithi/yuan_hecr_image_results.csv', "nanostring_annotations" = '/shahlab/alzhang/projects/ITH_Immune/data/expression/nanostring/pancancer_annotations.tsv', "db" = '/shahlab/alzhang/projects/ITH_Immune/metadata/db/immune_project.sqlite3', "prevalence_threshold" = 0.01, "icgc_specimen" = '/shahlab/alzhang/data/ICGC/specimen.tsv', "known_subtypes_array" = '/shahlab/alzhang/projects/ITH_Immune/data/expression/array/subtypes/known_subtypes.tsv', "all_tiltypes" = c('T_CD8_density', 'T_CD4_density', 'T_CD20_density', 'T_Plasma_density', 'E_CD8_density', 'E_CD4_density', 'E_CD20_density', 'E_Plasma_density', 'S_CD8_density', 'S_CD4_density', 'S_CD20_density', 'S_Plasma_density'), "igpartition_outdir" = '/shahlab/alzhang/pipeline_outputs/ith_immune/igpartition/run22'),
    rule = 'tilcluster_robustness_analysis'
)
######## Original script #########

                        ```


## Setup

```{r global_chunk_options, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, tidy=TRUE, warning=FALSE, message=FALSE, cache=TRUE) #cache=TRUE
```

```{r}
library(ithi.utils)
load_base_libs()

library(methods)
library(factoextra)
library(NbClust)
library(pvclust)
library(clValid)

library(ithi.meta)
library(ithi.figures)
library(ithi.utils)
```

```{r}
ihc_table_path <- snakemake@input$ihc_table

tils_for_cluster <- snakemake@params$tils_for_cluster
db_path <- snakemake@params$db
```

```{r}
annotation_colours <- ithi.figures::get_annotation_colours()
ihc_table <- fread(ihc_table_path)
```


## Summary

Here we'll assess the robustness of our TIL clustering. 

In terms of understanding robustness, there are 3 major aspects:

* Internal validity
* External validity
* Indirect validity

Since we don't have a gold standard to compare to, external validity is out of the question. 

Internal validity refers to measures that measure intra-cluster connectedness/tightness, between-cluster distance, and stability. This includes indices like the Dunn index (Dunn 1974). Stability refers to measures that determine the consistency of clustering based on subsets of data or features. pvclust can be used to assess the stability of clustering by resampling features. 

Indirect validity refers to the (e.g. biological) relevance of the resulting clusters. Based on the results of ITH comparison, we have evidence to suggest that the S-TIL/ES-TIL split is biologically relevant. 

## Analysis

```{r}
ihc_table_subset <- subset(ihc_table, select = c("condensed_id", tils_for_cluster))
ihc_table_subset_numeric <- ihc_table_subset %>% as.data.frame %>% column_to_rownames(var = "condensed_id") %>% scale

na_rows <- apply(ihc_table_subset_numeric, 1, function(x) any(is.na(x)))
ihc_table_subset_numeric <- ihc_table_subset_numeric[!na_rows,]

dat_clipped <- ithi.utils::clip_values(ihc_table_subset_numeric, hi = 2, lo = -2)
```

```{r}
cluster_res <- factoextra::fviz_nbclust(dat_clipped, FUNcluster = hcut, method = c("wss"), diss = NULL, k.max = 10)

nbclust_res <- NbClust(dat_clipped, distance = "euclidean", min.nc = 2, max.nc = 7, method = "ward.D2", alphaBeale = 0, index = "all")

fviz_nbclust(nbclust_res)
```


2 and 3 clusters are both equally well supported by this barrage of methods. 

The Dunn index, which is very well cited, reports 3 as the optimal number of clusters. 

## clValid

We'll look at the Dunn index again with this other package. 


```{r}
clValid2 <- function (obj, nClust, clMethods = "hierarchical", validation = "stability", 
    maxitems = 600, metric = "euclidean", method = "average", 
    neighbSize = 10, annotation = NULL, GOcategory = "all", goTermFreq = 0.05, 
    dropEvidence = NULL, verbose = FALSE, ...) 
{
    clMethods <- tolower(clMethods)
    clMethods <- match.arg(clMethods, c("hierarchical", "kmeans", 
        "diana", "fanny", "som", "model", "sota", "pam", "clara", 
        "agnes"), several.ok = TRUE)
    if ("som" %in% clMethods) {
        if (!require(kohonen)) {
            stop("package 'kohonen' required for clustering using SOM")
        }
    }
    if ("model" %in% clMethods) {
        if (!require(mclust)) {
            stop("package 'mclust' required for model-based clustering")
        }
    }
    validation <- match.arg(validation, c("stability", "internal", 
        "biological"), several.ok = TRUE)
    metric <- match.arg(metric, c("euclidean", "correlation", 
        "manhattan"))
    method <- match.arg(method, c("ward", "ward.D2", "single", "complete", 
        "average"))
    GOcategory <- match.arg(GOcategory, c("all", "BP", "CC", 
        "MF"))
    switch(class(obj), matrix = mat <- obj, ExpressionSet = mat <- Biobase::exprs(obj), 
        data.frame = {
            if (any(!sapply(obj, class) %in% c("numeric", "integer"))) stop("data frame 'obj' contains non-numeric data")
            mat <- as.matrix(obj)
        }, stop("argument 'obj' must be a matrix, data.frame, or ExpressionSet object"))
    if (nrow(mat) > maxitems) {
        if (interactive()) {
            cat("\nThe number of items to be clustered is larger than 'maxitems'\n")
            cat("The memory and time required may be excessive, do you wish to continue?\n")
            cat("(y to continue, any other character to exit) ")
            ans <- tolower(substr(readLines(n = 1), 1, 1))
            if (ans != "y") {
                stop("Exiting clValid, number of items exceeds 'maxitems'")
            }
        }
        else {
            stop("The number of items to be clustered is larger than 'maxitems'\n  Either decrease the number of rows (items) or increase 'maxitems'\n")
        }
    }
    if ("clara" %in% clMethods & metric == "correlation") 
        warning("'clara' currently only works with 'euclidean' or 'manhattan' metrics - metric will be changed to 'euclidean'  ")
    if (is.character(annotation) & length(grep(".db", annotation)) == 
        0) {
        annotation <- paste(annotation, ".db", sep = "")
    }
    if (is.list(annotation)) {
        if (is.null(rownames(mat))) {
            stop("rownames of data must be present to specify biological annotation from file")
        }
        annotation <- annotationListToMatrix(annotation, genenames = rownames(mat))
    }
    if ("biological" %in% validation & is.null(annotation)) {
        stop("annotation must be specified in order to use biological validation")
    }
    if ("biological" %in% validation & is.character(annotation)) {
        if (!require(Biobase) | !require(GO.db) | !require(annotate)) {
            stop("packages 'Biobase', 'GO.db', and 'annotate' required for 2nd type of biological validation \n\nthese can be downloaded from Bioconductor (www.bioconductor.org)")
        }
    }
    if ("biological" %in% validation & is.null(rownames(mat))) {
        stop("rownames of data must be present to use biological validation")
    }
    nClust <- floor(nClust)
    if (any(nClust < 1)) 
        stop("argument 'nClust' must be a positive integer vector")
    if (metric == "correlation") 
        Dist <- as.dist(1 - cor(t(mat), use = "pairwise.complete.obs"))
    else Dist <- dist(mat, method = metric)
    clusterObjs <- vector("list", length(clMethods))
    names(clusterObjs) <- clMethods
    measures <- c(if ("stability" %in% validation) c("APN", "AD", 
        "ADM", "FOM"), if ("internal" %in% validation) c("Connectivity", 
        "Dunn", "Silhouette"), if ("biological" %in% validation) c("BHI", 
        "BSI"))
    validMeasures <- array(dim = c(length(measures), length(nClust), 
        length(clMethods)))
    dimnames(validMeasures) <- list(measures, nClust, clMethods)
    for (i in 1:length(clMethods)) {
        cvalid <- vClusters(mat, clMethods[i], nClust, validation = validation, 
            Dist = Dist, method = method, metric = metric, annotation = annotation, 
            GOcategory = GOcategory, goTermFreq = goTermFreq, 
            neighbSize = neighbSize, dropEvidence = dropEvidence, 
            verbose = verbose, ...)
        clusterObjs[[i]] <- cvalid$clusterObj
        validMeasures[, , i] <- cvalid$measures
    }
    if (is.null(rownames(mat))) {
        rownames(mat) <- 1:nrow(mat)
        warning("rownames for data not specified, using 1:nrow(data)")
    }
    new("clValid", clusterObjs = clusterObjs, measures = validMeasures, 
        measNames = measures, clMethods = clMethods, labels = rownames(mat), 
        nClust = nClust, validation = validation, metric = metric, 
        method = method, neighbSize = neighbSize, GOcategory = GOcategory, 
        goTermFreq = goTermFreq, annotation = annotation, call = match.call())
}

environment(clValid2) <- asNamespace('clValid')
```

```{r}
intern <- clValid2(dat_clipped, nClust = 2:6, clMethods = "hierarchical", validation = "internal", metric = "euclidean", method = "ward.D2")

as.data.frame(intern@measures)['Dunn',]
```

3 is the minimum number of clusters with the maximal Dunn index. 

```{r}
stability <- clValid2(dat_clipped, nClust = 2:6, clMethods = "hierarchical", validation = "stability", metric = "euclidean", method = "ward.D2")
```

## pvclust

```{r}
result <- pvclust(t(dat_clipped), method.dist = "euclidean", method.hclust = "ward.D2", nboot = 100, r = seq(1.0, 5.0, 0.1))
```


```{r}
plot(result)
pvrect(result, alpha = 0.96)
```

pvclust seems to have a habit of doing this -- it only calls branches that are very long/near the tips as being significant. 

So pvclust seems to indicate that all of these clusters are stable. 


Caveats:

* Results change drastically when parameter r is changed, i.e. changing r to seq(0.5, 4.0, 0.1) instead of the default seq(0.5, 1.5, 0.1) gives you significant results for virtually every branching point. 
* The bootstrap method applied by pvclust does not make sense for this type of data (for values of r below 1). pvclust is meant to operate on data like phylogenetic data -- where resampling the features of the dataset makes sense (i.e. how bootstrap confidence is defined for phylogenies). On this type of data, pvclust would be most adequately used to compare the significance of FEATURE clustering -- where data can be subsampled -- rather than OBSERVATION clustering. For example, in this clustergram epithelial CD8+ TIL density contributes greatly to the difference between clusters S-TIL and ES-TIL -- excluding it in some resamples does not make sense and is likely the reason for the previous bullet point

The second point isn't saying that this is invalid -- but it isn't the only way we should assess cluster tightness. 


