---
title: "Using 'DiffXCoExpMat' to create second the order re-wired network from a genomic dataset"
author: "Ruby Sharma, Sajal Kumar and Joe Song"
date: "Created November 9th 2020"

vignette: >
  %\VignetteIndexEntry{Using the 'DiffXCoExpMat' R package to detect second-order differential patterns}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

### Overview
Molecular network re-wiring in cells can activate pluripotency and differentiation. These re-wiring can be a result of genetic or epigentic variations. Due to its crucial role in many biological system fate, we are introducing a systemic pipeline to utilize the 'DiffXCoExpMat' R package which uses a [sharma.song.test()](https://cran.r-project.org/web/packages/DiffXTables/index.html) and create a second-order network-rewired from any genomic dataset.

In this tutorial, we examine a uni-source FANTOM5 cage data of mouse and compare two developing tissues, cerebellum and liver to build a second-order re-wired network. We extracted 132 genes belonging to spliceosome(mmu03040) pathways using the R package ‘KEGGREST’. Out of 132 genes, a total of 112 TSSs had p1 proﬁles available in mouse FANTOM5. We start with preprocessing the data in which we filter out unchanged TSS using MAD (median absolute deviation). We then build a co-expression network using Pearson’s chi-squared test for both cerebellum and liver tissues. 
To capture strong dynamics in at least one condition, we consider co-expression patterns with a Benjamini-Hochberg adjusted P<0.1 and Cramér’s V effect size > 0.8. The unique co-expressed genes pairs will be utilized to create a second-order differential network using Sharma-Song test. The differential co-expression network holds the Benjamini-Hochberg adjusted Sharma-Song P-value and Sharma-Song effect size for each gene pair, representing the degree of second order change observed for that gene pair across conditions.

```{r setup, message=F, warning=F, results = "hide"}
require(stringr)
require(DiffXCoExpNet)

```


### Load Fantom5 TSS dataset of cerebellum and liver
```{r, out.width='60%',fig.asp=0.6}
load(paste0("../Data/F5CbSplTss.RData"))
load(paste0("../Data/F5LivSplTss.RData"))

```

Both cerebellum `CbSplTss` and liver `LivSplTss` data contains 112 TSS as columns. Cerebellum has 36 and liver has 15 developing samples of mouse embryo and neonate as rows. Following is the cerebellum and liver data with 5 rows and 5 columns. Rows represents the samples of each TSS(columns).

***Subset of cerebellum expression profile***

```{r, out.width='60%',fig.asp=0.6}
# First five rows and columns of Cerebellum data 
CbSplTss[1:5,1:5]

```

***Subset of liver expression profile***

```{r, out.width='60%',fig.asp=0.6}
# First five rows and columns of Liver data 
LivSplTss[1:5,1:5]

```
**Collecting TSS names**
```{r, out.width='60%',fig.asp=0.6}
tss_names = colnames(CbSplTss)
```

### Step 1: Data Preprocessing

In order to capture high dynamics differential patterns, we remove TSS whose MAD (median absolute deviation) lie among the bottom 5\%. `FilterGenesByMAD` provides a boolean vector of the same size as the number of TSS in `LivSplTss` or `CbSplTss`, representing which genes should be removed. Here, we evaluate the overall MAD for each gene across both cererbellum and liver datasets, however, one can use `FilterGenesByMAD` function on single expression matrices as well.

```{r, out.width='60%',fig.asp=0.6}
# Collecting MAD(median absolute deviation) scores of each TSSs
rem_tss = FilterGenesByMAD(rbind(LivSplTss,CbSplTss), nthreads = 10)
```

We use `rem_tss` to remove low MAD Tss from both cerebellum and liver datasets and `tss_names`.
```{r, out.width='60%',fig.asp=0.6}
# Remove TSSs that have low MAD(median absolute deviation)
LivSplTss = LivSplTss[ , !rem_tss]
CbSplTss = CbSplTss[ , !rem_tss]
tss_names = tss_names[!rem_tss]
  
```

Even after removing low MAD Tss, it is still possible to encounter zero variance TSS in individual conditions, thus, we remove them as well.
```{r, out.width='60%',fig.asp=0.6}
# Finding variance of each TSS
LivSplTssvar = apply(LivSplTss, 2, var)
CbSplTssvar = apply(CbSplTss, 2, var)
  
# TSSs with zero variance
zero_var_tss = LivSplTssvar == 0 | CbSplTssvar == 0
  
# Removing TSSs having zero variance
LivSplTss = LivSplTss[ , !zero_var_tss]
CbSplTss = CbSplTss[ , !zero_var_tss]
tss_names = tss_names[!zero_var_tss]
  
```
We obtained 104 high dynamic genes after removing low MAD and low variance genes. The selected 104 genes will be used to create co-expression networks for both cerebellum and liver.

### Step 2: Creating co-expression network

Now we will create the co-expression network for cerebellum (condition 1) with 104x104 patterns evaluated using the Pearson’s chi-squared test. Since, we are using a symmetric test to evaluate each pair, the order of parent and child matrices do not matter. Since we want to construct a complete graph of all possible gene pairs, we specify `parent_expr` and `child_expr` to be the same expression matrix `CbSplTss` in `Build_Coexpnet` function. The function also takes two vectors containing the names of child and parent TSS `c_names` and `p_names`, respectively. `nthreads`, is the number of parallel threads to use. We ask each gene to be discretized by means of a marginal discretization strategy, Ckmeans.1d.dp (`method = 'univariate'`).
```{r, out.width='60%',fig.asp=0.6}
# Building co-expression network of Cerebellum tissue
cb_coexpnet = Build_Coexpnet(parent_expr = CbSplTss,
                             child_expr = CbSplTss, 
                             c_names = tss_names,
                             p_names = tss_names,
                             method = 'univariate',
                             nthreads = 10)
```


```{r, out.width='60%',fig.asp=0.6}
# Selecting co-expression pairs with p-value<0.1 and Cramér’s V>0.8 in 
# Cerebellum tissue
filter =  cb_coexpnet$PVALUE < 0.1 &  cb_coexpnet$ESTIMATE > 0.8
cb_coexpnet = cb_coexpnet[filter, ]
  
```
We found 78 significant ( Benjamini-Hochberg adjusted P<0.1 and E>0.8) co-expression patterns in Cerebellum. We retrieve a dataframe containing parent and child column indices along with their matched TSS names, the Pearson’s chi-squared adjusted p-value and Cramér’s V effect size.
```{r, out.width='60%',fig.asp=0.6}
# First five rows of Cerebellum co-expression result
cb_coexpnet[1:5,]
  
```

Similar to cerebellum, we find the co-expression network of liver (condition 2) using a similar configuration. 
```{r, out.width='60%',fig.asp=0.6}
# Building co-expression network of Liver tissue    
liv_coexpnet = Build_Coexpnet(parent_expr = LivSplTss,
                              child_expr = LivSplTss, 
                              c_names = tss_names,
                              p_names = tss_names,
                              method = 'univariate',
                              nthreads = 10)
```


```{r, out.width='60%',fig.asp=0.6}
# Selecting co-expression pairs with p-value<0.1 and Cramér’s V>0.8 in 
# Liver tissue
filter =  liv_coexpnet$PVALUE < 0.1 &  liv_coexpnet$ESTIMATE > 0.8
liv_coexpnet = liv_coexpnet[filter, ]
  
```
We found 612 significant ( Benjamini-Hochberg adjusted P<0.1 and E>0.8) co-expressed patterns in Liver.

### Step 3: Creating a second-order differential network

We find the indices of gene pairs from co-expression networks that are either significant in cerebellum or liver tissues using `GetInteractionIndices` function. We provide a list of co-expression networks across the two conditions as an input to this function.

The retrived gene pairs will be used to create a second-order re-wired network. We prepare an expression matrix `exp_matr` by row-binding cerebellum and liver datasets. We also create a vector `conditions` containing the row-wise condition ids.
```{r, out.width='60%',fig.asp=0.6}
# Obtaining unique significant co-expression pairs from Cerebellum and Liver 
inter_indices = GetInteractionIndices(list(cb_coexpnet, liv_coexpnet))
exp_matr = rbind(CbSplTss, LivSplTss)
conditions = c(rep(1, nrow(CbSplTss)), rep(2, nrow(LivSplTss)))
```

We set the number of experimental conditions to 2.
```{r, out.width='60%',fig.asp=0.6}
# number of conditions
  n_conditions = 2
  
```

We find the second-order differential network using `Build_DiffCoexpnet` that evaluates each gene pair using the Sharma-Song test. We provide `exp_matr` with rows as the combination of samples from both cerebellum and liver tissues, and columns as TSS. We also provide the number of conditions `n_conditions`, a vector of row-wise condition ids `conditions`, TSS names `tss_names` and indices of gene pairs to consider as inputs `inter_indices`. Similar to co-expression network results, We retrieve a dataframe containing parent and child column indices along with their matched TSS names, the Benjamini-Hochberg adjusted Sharma-Song p-value and the Sharma-Song effect size.
```{r, out.width='60%',fig.asp=0.6}
# Building second-order re-wired network between Cerebellum and Liver
cb_liv_diffcoexpnet = Build_DiffCoexpnet(exp_matr = exp_matr,
                                         n_conditions = n_conditions,
                                         conditions = conditions,
                                         indices = inter_indices,
                                         g_names = tss_names,
                                         method = 'univariate',
                                         nthreads = 10)
```


To offer a more conservative list of differential patterns, we threshold the Sharma-Song effect size along with the Sharma-Song p-value. We obtain the threshold for Sharma-Song effect size by simulating the alternate population and selecting the Sharma-Song effect size at 60\%. However, the Sharma-Song effect size is subject to the table size, thus, given the combined total number of samples across all conditions, we determine the maximum allowable contingency table dimension from the data.
```{r, out.width='60%',fig.asp=0.6}
# Maximum sample size among Cerebellum and Liver tissue groups
max_sm = nrow(CbSplTss) + nrow(LivSplTss)

# Maximum number of discretization level 
d = max(2, ceiling((sqrt(max_sm/5)/n_conditions)))
```

We obtain the alternate population using `SharmaSongEsDist`, that takes the number experimental conditions `n_conditions`, a vector of sample size in each condition `N`, number of parallel threads `nthreads` and the maximum dimension size `d`. Next, we calculate the 60\% cut-off.
```{r, out.width='60%',fig.asp=0.6}
#Sharma-Song test effect size distribution for dXd table dimension 
sim_shsong_res = SharmaSongEsDist(N=c(100,100), n_conditions = 2, nthreads = 10, d = d)

#Effect size value at 60% threshold
es_thres = sort(sim_shsong_res[,4])[round(0.6*length(sim_shsong_res[,4]))]
```
The effect size obtained was 0.4565. We found 42 significant (Benjamini-Hochberg adjusted P<0.05 and effect size>0.4565) 2nd-order differential patterns.

```{r, out.width='60%',fig.asp=0.6}
# Selecting significant differential co-expressed patterns p-value<0.05 and effect size >0.456
cb_liv_diffcoexpnet = cb_liv_diffcoexpnet[ cb_liv_diffcoexpnet$PVALUE<0.05 &
                                          cb_liv_diffcoexpnet$ESTIMATE >0.456,]
```


```{r, out.width='60%',fig.asp=0.6}
# First five rows of second-order re-wired network
cb_liv_diffcoexpnet[1:5,]
  
```


### Second-order differential patterns
Following are the four second-order differential plots of TSSs between Cerebellum and Liver of mouse. The red and blue points show the interaction between two TSS. The lines show the loess fit curve to highlight the dynamics between the two TSS. 

```{r plots, echo=FALSE, fig.width= 4.5,fig.height=5}

# Order the second-order differential patterns by high to low effect size
cb_liv_diffcoexpnet = cb_liv_diffcoexpnet[order(cb_liv_diffcoexpnet$ESTIMATE,
                                                decreasing = TRUE) ,]

# Plot first four second-order differential patterns  
grid <- par(par(mfrow=c(2,2)))
PlotDiffCoExpNet(cb_liv_diffcoexpnet[1:4,], exp_matr, n_conditions, conditions, tss_names,    
                 c("firebrick", "dodgerblue"),c("Cerebellum", "Liver") , "Spliceosome")
par(grid)
```

## Spliceosome second-order differential co-expression sub-network

Following is the spliceosome second-order differential co-expression sub-network featuring 200 patterns (edges) across top hub genes in developing mouse cerebellum and liver tissues. The orange edges represent signiﬁcant co-expression patterns among transcripts (aquamarine). The top ﬁve hub genes, ranked by their degree, are represented in gold. Edges representing 2nd-order changes between cerebellum and liver are highlighted in red.

```{r label, out.width = '100%', echo = FALSE}
knitr::include_graphics("../Figures/cereb_liver.pdf")
```

## Conclusions

The step by step process here demonstrate the use of the R package in building second-order re-wired network. Both `Build_Coexpnet` and `Build_DiffCoexpnet` functions can be also be used independently, provided the user has all the required inputs.