Using ‘DiffXCoExpMat’ to create second the order re-wired network from a genomic dataset

Overview

Molecular network re-wiring in cells can activate pluripotency and differentiation. These re-wiring can be a result of genetic or epigentic variations. Due to its crucial role in many biological system fate, we are introducing a systemic pipeline to utilize the ‘DiffXCoExpMat’ R package which uses a sharma.song.test() and create a second-order network-rewired from any genomic dataset.

In this tutorial, we examine a uni-source FANTOM5 cage data of mouse and compare two developing tissues, cerebellum and liver to build a second-order re-wired network. We extracted 132 genes belonging to spliceosome(mmu03040) pathways using the R package ‘KEGGREST’. Out of 132 genes, a total of 112 TSSs had p1 proﬁles available in mouse FANTOM5. We start with preprocessing the data in which we filter out unchanged TSS using MAD (median absolute deviation). We then build a co-expression network using Pearson’s chi-squared test for both cerebellum and liver tissues. To capture strong dynamics in at least one condition, we consider co-expression patterns with a Benjamini-Hochberg adjusted P<0.1 and Cramér’s V effect size > 0.8. The unique co-expressed genes pairs will be utilized to create a second-order differential network using Sharma-Song test. The differential co-expression network holds the Benjamini-Hochberg adjusted Sharma-Song P-value and Sharma-Song effect size for each gene pair, representing the degree of second order change observed for that gene pair across conditions.

require(stringr)
require(DiffXCoExpNet)

Load Fantom5 TSS dataset of cerebellum and liver

load(paste0("../Data/F5CbSplTss.RData"))
load(paste0("../Data/F5LivSplTss.RData"))

Both cerebellum CbSplTss and liver LivSplTss data contains 112 TSS as columns. Cerebellum has 36 and liver has 15 developing samples of mouse embryo and neonate as rows. Following is the cerebellum and liver data with 5 rows and 5 columns. Rows represents the samples of each TSS(columns).

Subset of cerebellum expression profile

# First five rows and columns of Cerebellum data 
CbSplTss[1:5,1:5]

##                 p1@Ddx46 p1@Ddx39b p1@Dhx16 p1@Dhx38 p1@Cdc40
## cerebellum.E11 104.09430  383.4236 31.40001 16.59948 52.74220
## cerebellum.E11 107.90870  345.2648 29.29258 15.29245 37.04650
## cerebellum.E11 103.88211  395.3417 36.51751 14.51628 37.87841
## cerebellum.E12 106.15477  400.9146 30.91886 15.45943 60.80710
## cerebellum.E12  89.02757  382.3849 33.67232 18.36672 46.93717

Subset of liver expression profile

# First five rows and columns of Liver data 
LivSplTss[1:5,1:5]

##            p1@Ddx46 p1@Ddx39b p1@Dhx16 p1@Dhx38 p1@Cdc40
## liver.E12 108.61616  263.2816 53.72412 17.01820 38.20753
## liver.E13 101.22788  272.6273 46.97760 17.29719 41.27739
## liver.E14 105.91005  266.6865 50.81884 17.53924 42.04922
## liver.E15 103.86438  209.1488 57.81513 15.41737 40.36916
## liver.E16  81.75906  240.4536 45.10013 12.05886 47.02955

Collecting TSS names

tss_names = colnames(CbSplTss)

Step 1: Data Preprocessing

In order to capture high dynamics differential patterns, we remove TSS whose MAD (median absolute deviation) lie among the bottom 5%. FilterGenesByMAD provides a boolean vector of the same size as the number of TSS in LivSplTss or CbSplTss, representing which genes should be removed. Here, we evaluate the overall MAD for each gene across both cererbellum and liver datasets, however, one can use FilterGenesByMAD function on single expression matrices as well.

# Collecting MAD(median absolute deviation) scores of each TSSs
rem_tss = FilterGenesByMAD(rbind(LivSplTss,CbSplTss), nthreads = 10)

We use rem_tss to remove low MAD Tss from both cerebellum and liver datasets and tss_names.

# Remove TSSs that have low MAD(median absolute deviation)
LivSplTss = LivSplTss[ , !rem_tss]
CbSplTss = CbSplTss[ , !rem_tss]
tss_names = tss_names[!rem_tss]

Even after removing low MAD Tss, it is still possible to encounter zero variance TSS in individual conditions, thus, we remove them as well.

# Finding variance of each TSS
LivSplTssvar = apply(LivSplTss, 2, var)
CbSplTssvar = apply(CbSplTss, 2, var)
  
# TSSs with zero variance
zero_var_tss = LivSplTssvar == 0 | CbSplTssvar == 0
  
# Removing TSSs having zero variance
LivSplTss = LivSplTss[ , !zero_var_tss]
CbSplTss = CbSplTss[ , !zero_var_tss]
tss_names = tss_names[!zero_var_tss]

We obtained 104 high dynamic genes after removing low MAD and low variance genes. The selected 104 genes will be used to create co-expression networks for both cerebellum and liver.

Step 2: Creating co-expression network

Now we will create the co-expression network for cerebellum (condition 1) with 104x104 patterns evaluated using the Pearson’s chi-squared test. Since, we are using a symmetric test to evaluate each pair, the order of parent and child matrices do not matter. Since we want to construct a complete graph of all possible gene pairs, we specify parent_expr and child_expr to be the same expression matrix CbSplTss in Build_Coexpnet function. The function also takes two vectors containing the names of child and parent TSS c_names and p_names, respectively. nthreads, is the number of parallel threads to use. We ask each gene to be discretized by means of a marginal discretization strategy, Ckmeans.1d.dp (method = 'univariate').

# Building co-expression network of Cerebellum tissue
cb_coexpnet = Build_Coexpnet(parent_expr = CbSplTss,
                             child_expr = CbSplTss, 
                             c_names = tss_names,
                             p_names = tss_names,
                             method = 'univariate',
                             nthreads = 10)

# Selecting co-expression pairs with p-value<0.1 and Cramér’s V>0.8 in 
# Cerebellum tissue
filter =  cb_coexpnet$PVALUE < 0.1 &  cb_coexpnet$ESTIMATE > 0.8
cb_coexpnet = cb_coexpnet[filter, ]

We found 78 significant ( Benjamini-Hochberg adjusted P<0.1 and E>0.8) co-expression patterns in Cerebellum. We retrieve a dataframe containing parent and child column indices along with their matched TSS names, the Pearson’s chi-squared adjusted p-value and Cramér’s V effect size.

# First five rows of Cerebellum co-expression result
cb_coexpnet[1:5,]

##     PID CID    PNAMES   CNAMES       PVALUE  ESTIMATE
## 482  65   4  p1@Hspa8 p1@Cdc40 1.191426e-05 0.9191450
## 846  13   8  p1@Snrpf p1@Dhx15 2.642082e-06 1.0000000
## 858  25   8  p1@Sf3a3 p1@Dhx15 1.191426e-05 0.8980265
## 873  40   8   p1@Lsm2 p1@Dhx15 7.162370e-05 0.8174367
## 886  53   8 p1@Eftud2 p1@Dhx15 7.162370e-05 0.8174367

Similar to cerebellum, we find the co-expression network of liver (condition 2) using a similar configuration.

# Building co-expression network of Liver tissue    
liv_coexpnet = Build_Coexpnet(parent_expr = LivSplTss,
                              child_expr = LivSplTss, 
                              c_names = tss_names,
                              p_names = tss_names,
                              method = 'univariate',
                              nthreads = 10)

# Selecting co-expression pairs with p-value<0.1 and Cramér’s V>0.8 in 
# Liver tissue
filter =  liv_coexpnet$PVALUE < 0.1 &  liv_coexpnet$ESTIMATE > 0.8
liv_coexpnet = liv_coexpnet[filter, ]

We found 612 significant ( Benjamini-Hochberg adjusted P<0.1 and E>0.8) co-expressed patterns in Liver.

Step 3: Creating a second-order differential network

We find the indices of gene pairs from co-expression networks that are either significant in cerebellum or liver tissues using GetInteractionIndices function. We provide a list of co-expression networks across the two conditions as an input to this function.

The retrived gene pairs will be used to create a second-order re-wired network. We prepare an expression matrix exp_matr by row-binding cerebellum and liver datasets. We also create a vector conditions containing the row-wise condition ids.

# Obtaining unique significant co-expression pairs from Cerebellum and Liver 
inter_indices = GetInteractionIndices(list(cb_coexpnet, liv_coexpnet))
exp_matr = rbind(CbSplTss, LivSplTss)
conditions = c(rep(1, nrow(CbSplTss)), rep(2, nrow(LivSplTss)))

We set the number of experimental conditions to 2.

# number of conditions
  n_conditions = 2

We find the second-order differential network using Build_DiffCoexpnet that evaluates each gene pair using the Sharma-Song test. We provide exp_matr with rows as the combination of samples from both cerebellum and liver tissues, and columns as TSS. We also provide the number of conditions n_conditions, a vector of row-wise condition ids conditions, TSS names tss_names and indices of gene pairs to consider as inputs inter_indices. Similar to co-expression network results, We retrieve a dataframe containing parent and child column indices along with their matched TSS names, the Benjamini-Hochberg adjusted Sharma-Song p-value and the Sharma-Song effect size.

# Building second-order re-wired network between Cerebellum and Liver
cb_liv_diffcoexpnet = Build_DiffCoexpnet(exp_matr = exp_matr,
                                         n_conditions = n_conditions,
                                         conditions = conditions,
                                         indices = inter_indices,
                                         g_names = tss_names,
                                         method = 'univariate',
                                         nthreads = 10)

To offer a more conservative list of differential patterns, we threshold the Sharma-Song effect size along with the Sharma-Song p-value. We obtain the threshold for Sharma-Song effect size by simulating the alternate population and selecting the Sharma-Song effect size at 60%. However, the Sharma-Song effect size is subject to the table size, thus, given the combined total number of samples across all conditions, we determine the maximum allowable contingency table dimension from the data.

# Maximum sample size among Cerebellum and Liver tissue groups
max_sm = nrow(CbSplTss) + nrow(LivSplTss)

# Maximum number of discretization level 
d = max(2, ceiling((sqrt(max_sm/5)/n_conditions)))

We obtain the alternate population using SharmaSongEsDist, that takes the number experimental conditions n_conditions, a vector of sample size in each condition N, number of parallel threads nthreads and the maximum dimension size d. Next, we calculate the 60% cut-off.

#Sharma-Song test effect size distribution for dXd table dimension 
sim_shsong_res = SharmaSongEsDist(N=c(100,100), n_conditions = 2, nthreads = 10, d = d)

#Effect size value at 60% threshold
es_thres = sort(sim_shsong_res[,4])[round(0.6*length(sim_shsong_res[,4]))]

The effect size obtained was 0.4565. We found 42 significant (Benjamini-Hochberg adjusted P<0.05 and effect size>0.4565) 2nd-order differential patterns.

# Selecting significant differential co-expressed patterns p-value<0.05 and effect size >0.456
cb_liv_diffcoexpnet = cb_liv_diffcoexpnet[ cb_liv_diffcoexpnet$PVALUE<0.05 &
                                          cb_liv_diffcoexpnet$ESTIMATE >0.456,]

# First five rows of second-order re-wired network
cb_liv_diffcoexpnet[1:5,]

##     PID CID    PNAMES   CNAMES      PVALUE  ESTIMATE
## 1    65   4  p1@Hspa8 p1@Cdc40 0.006761310 0.5029330
## 37   93  25  p1@Pcbp1 p1@Sf3a3 0.007825690 0.4952902
## 77   94  93  p1@Srsf1 p1@Pcbp1 0.007825690 0.4952902
## 123  15   8  p1@Snrpc p1@Dhx15 0.006205210 0.5116309
## 124  19   8 p1@Tcerg1 p1@Dhx15 0.003789085 0.5500304

Second-order differential patterns

Following are the four second-order differential plots of TSSs between Cerebellum and Liver of mouse. The red and blue points show the interaction between two TSS. The lines show the loess fit curve to highlight the dynamics between the two TSS.

Spliceosome second-order differential co-expression sub-network

Following is the spliceosome second-order differential co-expression sub-network featuring 200 patterns (edges) across top hub genes in developing mouse cerebellum and liver tissues. The orange edges represent signiﬁcant co-expression patterns among transcripts (aquamarine). The top ﬁve hub genes, ranked by their degree, are represented in gold. Edges representing 2nd-order changes between cerebellum and liver are highlighted in red.

Conclusions

The step by step process here demonstrate the use of the R package in building second-order re-wired network. Both Build_Coexpnet and Build_DiffCoexpnet functions can be also be used independently, provided the user has all the required inputs.