Molecular network re-wiring in cells can activate pluripotency and differentiation. These re-wiring can be a result of genetic or epigentic variations. Due to its crucial role in many biological system fate, we are introducing a systemic pipeline to utilize the ‘DiffXCoExpMat’ R package which uses a sharma.song.test() and create a second-order network-rewired from any genomic dataset.
In this tutorial, we examine a uni-source FANTOM5 cage data of mouse and compare two developing tissues, cerebellum and liver to build a second-order re-wired network. We extracted 132 genes belonging to spliceosome(mmu03040) pathways using the R package ‘KEGGREST’. Out of 132 genes, a total of 112 TSSs had p1 profiles available in mouse FANTOM5. We start with preprocessing the data in which we filter out unchanged TSS using MAD (median absolute deviation). We then build a co-expression network using Pearson’s chi-squared test for both cerebellum and liver tissues. To capture strong dynamics in at least one condition, we consider co-expression patterns with a Benjamini-Hochberg adjusted P<0.1 and Cramér’s V effect size > 0.8. The unique co-expressed genes pairs will be utilized to create a second-order differential network using Sharma-Song test. The differential co-expression network holds the Benjamini-Hochberg adjusted Sharma-Song P-value and Sharma-Song effect size for each gene pair, representing the degree of second order change observed for that gene pair across conditions.
require(stringr)
require(DiffXCoExpNet)
load(paste0("../Data/F5CbSplTss.RData"))
load(paste0("../Data/F5LivSplTss.RData"))
Both cerebellum CbSplTss
and liver LivSplTss
data contains 112 TSS as columns. Cerebellum has 36 and liver has 15 developing samples of mouse embryo and neonate as rows. Following is the cerebellum and liver data with 5 rows and 5 columns. Rows represents the samples of each TSS(columns).
Subset of cerebellum expression profile
# First five rows and columns of Cerebellum data
CbSplTss[1:5,1:5]
## p1@Ddx46 p1@Ddx39b p1@Dhx16 p1@Dhx38 p1@Cdc40
## cerebellum.E11 104.09430 383.4236 31.40001 16.59948 52.74220
## cerebellum.E11 107.90870 345.2648 29.29258 15.29245 37.04650
## cerebellum.E11 103.88211 395.3417 36.51751 14.51628 37.87841
## cerebellum.E12 106.15477 400.9146 30.91886 15.45943 60.80710
## cerebellum.E12 89.02757 382.3849 33.67232 18.36672 46.93717
Subset of liver expression profile
# First five rows and columns of Liver data
LivSplTss[1:5,1:5]
## p1@Ddx46 p1@Ddx39b p1@Dhx16 p1@Dhx38 p1@Cdc40
## liver.E12 108.61616 263.2816 53.72412 17.01820 38.20753
## liver.E13 101.22788 272.6273 46.97760 17.29719 41.27739
## liver.E14 105.91005 266.6865 50.81884 17.53924 42.04922
## liver.E15 103.86438 209.1488 57.81513 15.41737 40.36916
## liver.E16 81.75906 240.4536 45.10013 12.05886 47.02955
Collecting TSS names
tss_names = colnames(CbSplTss)
In order to capture high dynamics differential patterns, we remove TSS whose MAD (median absolute deviation) lie among the bottom 5%. FilterGenesByMAD
provides a boolean vector of the same size as the number of TSS in LivSplTss
or CbSplTss
, representing which genes should be removed. Here, we evaluate the overall MAD for each gene across both cererbellum and liver datasets, however, one can use FilterGenesByMAD
function on single expression matrices as well.
# Collecting MAD(median absolute deviation) scores of each TSSs
rem_tss = FilterGenesByMAD(rbind(LivSplTss,CbSplTss), nthreads = 10)
We use rem_tss
to remove low MAD Tss from both cerebellum and liver datasets and tss_names
.
# Remove TSSs that have low MAD(median absolute deviation)
LivSplTss = LivSplTss[ , !rem_tss]
CbSplTss = CbSplTss[ , !rem_tss]
tss_names = tss_names[!rem_tss]
Even after removing low MAD Tss, it is still possible to encounter zero variance TSS in individual conditions, thus, we remove them as well.
# Finding variance of each TSS
LivSplTssvar = apply(LivSplTss, 2, var)
CbSplTssvar = apply(CbSplTss, 2, var)
# TSSs with zero variance
zero_var_tss = LivSplTssvar == 0 | CbSplTssvar == 0
# Removing TSSs having zero variance
LivSplTss = LivSplTss[ , !zero_var_tss]
CbSplTss = CbSplTss[ , !zero_var_tss]
tss_names = tss_names[!zero_var_tss]
We obtained 104 high dynamic genes after removing low MAD and low variance genes. The selected 104 genes will be used to create co-expression networks for both cerebellum and liver.
Now we will create the co-expression network for cerebellum (condition 1) with 104x104 patterns evaluated using the Pearson’s chi-squared test. Since, we are using a symmetric test to evaluate each pair, the order of parent and child matrices do not matter. Since we want to construct a complete graph of all possible gene pairs, we specify parent_expr
and child_expr
to be the same expression matrix CbSplTss
in Build_Coexpnet
function. The function also takes two vectors containing the names of child and parent TSS c_names
and p_names
, respectively. nthreads
, is the number of parallel threads to use. We ask each gene to be discretized by means of a marginal discretization strategy, Ckmeans.1d.dp (method = 'univariate'
).
# Building co-expression network of Cerebellum tissue
cb_coexpnet = Build_Coexpnet(parent_expr = CbSplTss,
child_expr = CbSplTss,
c_names = tss_names,
p_names = tss_names,
method = 'univariate',
nthreads = 10)
# Selecting co-expression pairs with p-value<0.1 and Cramér’s V>0.8 in
# Cerebellum tissue
filter = cb_coexpnet$PVALUE < 0.1 & cb_coexpnet$ESTIMATE > 0.8
cb_coexpnet = cb_coexpnet[filter, ]
We found 78 significant ( Benjamini-Hochberg adjusted P<0.1 and E>0.8) co-expression patterns in Cerebellum. We retrieve a dataframe containing parent and child column indices along with their matched TSS names, the Pearson’s chi-squared adjusted p-value and Cramér’s V effect size.
# First five rows of Cerebellum co-expression result
cb_coexpnet[1:5,]
## PID CID PNAMES CNAMES PVALUE ESTIMATE
## 482 65 4 p1@Hspa8 p1@Cdc40 1.191426e-05 0.9191450
## 846 13 8 p1@Snrpf p1@Dhx15 2.642082e-06 1.0000000
## 858 25 8 p1@Sf3a3 p1@Dhx15 1.191426e-05 0.8980265
## 873 40 8 p1@Lsm2 p1@Dhx15 7.162370e-05 0.8174367
## 886 53 8 p1@Eftud2 p1@Dhx15 7.162370e-05 0.8174367
Similar to cerebellum, we find the co-expression network of liver (condition 2) using a similar configuration.
# Building co-expression network of Liver tissue
liv_coexpnet = Build_Coexpnet(parent_expr = LivSplTss,
child_expr = LivSplTss,
c_names = tss_names,
p_names = tss_names,
method = 'univariate',
nthreads = 10)
# Selecting co-expression pairs with p-value<0.1 and Cramér’s V>0.8 in
# Liver tissue
filter = liv_coexpnet$PVALUE < 0.1 & liv_coexpnet$ESTIMATE > 0.8
liv_coexpnet = liv_coexpnet[filter, ]
We found 612 significant ( Benjamini-Hochberg adjusted P<0.1 and E>0.8) co-expressed patterns in Liver.
We find the indices of gene pairs from co-expression networks that are either significant in cerebellum or liver tissues using GetInteractionIndices
function. We provide a list of co-expression networks across the two conditions as an input to this function.
The retrived gene pairs will be used to create a second-order re-wired network. We prepare an expression matrix exp_matr
by row-binding cerebellum and liver datasets. We also create a vector conditions
containing the row-wise condition ids.
# Obtaining unique significant co-expression pairs from Cerebellum and Liver
inter_indices = GetInteractionIndices(list(cb_coexpnet, liv_coexpnet))
exp_matr = rbind(CbSplTss, LivSplTss)
conditions = c(rep(1, nrow(CbSplTss)), rep(2, nrow(LivSplTss)))
We set the number of experimental conditions to 2.
# number of conditions
n_conditions = 2
We find the second-order differential network using Build_DiffCoexpnet
that evaluates each gene pair using the Sharma-Song test. We provide exp_matr
with rows as the combination of samples from both cerebellum and liver tissues, and columns as TSS. We also provide the number of conditions n_conditions
, a vector of row-wise condition ids conditions
, TSS names tss_names
and indices of gene pairs to consider as inputs inter_indices
. Similar to co-expression network results, We retrieve a dataframe containing parent and child column indices along with their matched TSS names, the Benjamini-Hochberg adjusted Sharma-Song p-value and the Sharma-Song effect size.
# Building second-order re-wired network between Cerebellum and Liver
cb_liv_diffcoexpnet = Build_DiffCoexpnet(exp_matr = exp_matr,
n_conditions = n_conditions,
conditions = conditions,
indices = inter_indices,
g_names = tss_names,
method = 'univariate',
nthreads = 10)
To offer a more conservative list of differential patterns, we threshold the Sharma-Song effect size along with the Sharma-Song p-value. We obtain the threshold for Sharma-Song effect size by simulating the alternate population and selecting the Sharma-Song effect size at 60%. However, the Sharma-Song effect size is subject to the table size, thus, given the combined total number of samples across all conditions, we determine the maximum allowable contingency table dimension from the data.
# Maximum sample size among Cerebellum and Liver tissue groups
max_sm = nrow(CbSplTss) + nrow(LivSplTss)
# Maximum number of discretization level
d = max(2, ceiling((sqrt(max_sm/5)/n_conditions)))
We obtain the alternate population using SharmaSongEsDist
, that takes the number experimental conditions n_conditions
, a vector of sample size in each condition N
, number of parallel threads nthreads
and the maximum dimension size d
. Next, we calculate the 60% cut-off.
#Sharma-Song test effect size distribution for dXd table dimension
sim_shsong_res = SharmaSongEsDist(N=c(100,100), n_conditions = 2, nthreads = 10, d = d)
#Effect size value at 60% threshold
es_thres = sort(sim_shsong_res[,4])[round(0.6*length(sim_shsong_res[,4]))]
The effect size obtained was 0.4565. We found 42 significant (Benjamini-Hochberg adjusted P<0.05 and effect size>0.4565) 2nd-order differential patterns.
# Selecting significant differential co-expressed patterns p-value<0.05 and effect size >0.456
cb_liv_diffcoexpnet = cb_liv_diffcoexpnet[ cb_liv_diffcoexpnet$PVALUE<0.05 &
cb_liv_diffcoexpnet$ESTIMATE >0.456,]
# First five rows of second-order re-wired network
cb_liv_diffcoexpnet[1:5,]
## PID CID PNAMES CNAMES PVALUE ESTIMATE
## 1 65 4 p1@Hspa8 p1@Cdc40 0.006761310 0.5029330
## 37 93 25 p1@Pcbp1 p1@Sf3a3 0.007825690 0.4952902
## 77 94 93 p1@Srsf1 p1@Pcbp1 0.007825690 0.4952902
## 123 15 8 p1@Snrpc p1@Dhx15 0.006205210 0.5116309
## 124 19 8 p1@Tcerg1 p1@Dhx15 0.003789085 0.5500304
Following are the four second-order differential plots of TSSs between Cerebellum and Liver of mouse. The red and blue points show the interaction between two TSS. The lines show the loess fit curve to highlight the dynamics between the two TSS.
Following is the spliceosome second-order differential co-expression sub-network featuring 200 patterns (edges) across top hub genes in developing mouse cerebellum and liver tissues. The orange edges represent significant co-expression patterns among transcripts (aquamarine). The top five hub genes, ranked by their degree, are represented in gold. Edges representing 2nd-order changes between cerebellum and liver are highlighted in red.
The step by step process here demonstrate the use of the R package in building second-order re-wired network. Both Build_Coexpnet
and Build_DiffCoexpnet
functions can be also be used independently, provided the user has all the required inputs.