The source for this post is available here
Intro & Motivation
Yes, there is a mantra that causality cannot be inferred from correlation. No doubt, if A correlates with B, it is generally impossible to say whether A or B is the cause. Moreover, there can be yet another latent variable that causes changes in both. In certain cases, however, the directionality can be determined from the mechanism. Say, if A is a mutation and B is mRNA abundance, then only A can cause changes in B, not vice versa. Having said that, let's consider only the general case where we have no mechanistic knowledge.
Though with only two variables, inference of causality without any knowledge about the mechanism is impossible, the situation becomes more interesting with more variables involved. With more variables there are multiple options for which nodes are causal, which are intermediate and which are reactive. Those choices may be non-equivalent with respect to degrees of freedom, or in other words, the number of input nodes. Given Occam's razor, preference should be given to the model with the fewest input nodes that explains the observed variance as effectively as models with more input nodes. This is just the intuition. The actual inference relies on formalisms like cross-validation and information criteria. Do those approaches guarantee that the inferred causal structure is correct? Absolutely not. However, the result will be the best guess given the data.
Toy Example.
Tree structure.
A simple causal structure in which all the changes originate from the root node A. There is only one input node. Yes, the variance can also be explained with causality flowing in reverse, where the input nodes are D, E, F and G; however, that costs an extra 3 degrees of freedom. So, in theory, the algorithm should prefer the structure with only one causal node.
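To make the target explicit, here is one way to encode such a tree with bnlearn. The intermediate nodes B and C are my assumption, since the text only names the root A and the leaves D, E, F and G.

```r
library(bnlearn)

# Assumed ground-truth tree: A is the root, B and C are intermediates,
# D, E, F and G are the leaves.
true.dag <- model2network("[A][B|A][C|A][D|B][E|B][F|C][G|C]")
graphviz.plot(true.dag)  # requires the Rgraphviz package
```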
Synthetic data
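A minimal sketch of how such data could be simulated, assuming the tree above; the sample size, unit effect sizes and noise level are my choices.

```r
set.seed(1)
n <- 1000      # number of observations
noise <- 0.5   # sd of measurement noise, relative to a unit effect

# Each child is its parent plus independent Gaussian noise.
sim <- data.frame(A = rnorm(n))
sim$B <- sim$A + rnorm(n, sd = noise)
sim$C <- sim$A + rnorm(n, sd = noise)
sim$D <- sim$B + rnorm(n, sd = noise)
sim$E <- sim$B + rnorm(n, sd = noise)
sim$F <- sim$C + rnorm(n, sd = noise)
sim$G <- sim$C + rnorm(n, sd = noise)
```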
Inference
R has a number of packages that deal with causal inference: bnlearn, pcalg, deal, catnet, gRbase, gRain, rbmn.
We’ll give a test to a couple of tools.
bnlearn package
The bnlearn R package has a wonderful collection of algorithms and utilities for causal inference. Specifically, the built-in boot.strength function allows one to estimate the uncertainties of the inferred edges and their directionality. In the examples below I do not take advantage of uncertainty quantification. Generally, however, it is useful guidance if you are not sure how believable/reliable the inference results are.
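Though the examples below skip the bootstrapping step, a minimal sketch of boot.strength usage could look like this; the number of replicates and the cluster size are my choices. The cluster is reused by the constraint-based algorithms further down.

```r
library(bnlearn)
library(parallel)

cl <- makeCluster(2)  # worker processes to parallelize the computations

# Bootstrap the structure learning (here with hill climbing) to get
# confidence levels for each edge and its direction.
arc.str <- boot.strength(sim, R = 200, algorithm = "hc", cluster = cl)
head(arc.str[arc.str$strength > 0.85, ])
```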
Markov Blanket, iamb implementation
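A sketch of the Incremental Association Markov Blanket (IAMB) call on the simulated data, reusing the cluster created above:

```r
# Constraint-based learning; the cluster argument parallelizes the
# conditional independence tests.
net.iamb <- iamb(sim, cluster = cl)
graphviz.plot(net.iamb)
```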
Grow-Shrink Algorithm
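Similarly for the Grow-Shrink algorithm:

```r
# Grow-Shrink: the original Markov-blanket-based structure learning algorithm.
net.gs <- gs(sim, cluster = cl)
graphviz.plot(net.gs)
```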
Hill Climbing Optimization
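A sketch of score-based learning with hill climbing; for continuous data bnlearn defaults to a Gaussian BIC score.

```r
# Greedy hill climbing in the space of DAGs.
net.hc <- hc(sim)
graphviz.plot(net.hc)
```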
Hill Climbing Optimization. V2
Exactly the same settings as were used for the inference of a protein signalling network in Causal Protein-Signaling Networks Derived from Multiparameter Single-Cell Data.
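A guess at what those settings could look like in bnlearn: that workflow discretizes the data and scores with a Bayesian Dirichlet equivalent (bde) score. The number of bins and the imaginary sample size (iss) here are my assumptions.

```r
# Discretize the continuous data, then run hill climbing with the bde score.
dsim <- discretize(sim, method = "quantile", breaks = 3)
net.hc2 <- hc(dsim, score = "bde", iss = 10)
graphviz.plot(net.hc2)
```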
Tabu Optimization
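And a tabu search variant, which can escape some of the local optima that trap plain hill climbing:

```r
# Tabu search: hill climbing with a tabu list of recently visited structures.
net.tabu <- tabu(sim)
graphviz.plot(net.tabu)
```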
Hybrid Algorithm, ARACNE followed by tabu optimization
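bnlearn exposes this restrict-then-maximize pattern through rsmax2; a sketch, assuming the ARACNE skeleton is refined by tabu search:

```r
# Hybrid learning: ARACNE restricts the candidate edges, then tabu search
# maximizes the score within that restricted space.
net.hybrid <- rsmax2(sim, restrict = "aracne", maximize = "tabu")
graphviz.plot(net.hybrid)
```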
Closing the cluster
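Assuming the parallel cluster from the bnlearn section is still running:

```r
# Release the worker processes used above.
stopCluster(cl)
```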
pcalg package
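A sketch of the PC algorithm from pcalg on the same simulated data; the significance level alpha is my choice.

```r
library(pcalg)

# Sufficient statistics for the Gaussian conditional independence test:
# the correlation matrix and the sample size.
suff.stat <- list(C = cor(sim), n = nrow(sim))
pc.fit <- pc(suffStat = suff.stat, indepTest = gaussCItest,
             alpha = 0.01, labels = colnames(sim))
plot(pc.fit, main = "PC algorithm, estimated CPDAG")  # needs Rgraphviz
```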
Summary
As evident from the results above, learning the directed acyclic graph structure is not an easy or trivial task. For me it has mostly been a hit-and-miss process. In the tests above there does not appear to be a clear winner among the algorithms. Simulating data with different parameters does, however, result in successful recapitulation of the data structure. For example, setting the measurement noise to a much lower value, 1/10 of the effect, which is perhaps unrealistic for biological data, results in proper tree inference using the optimization-based methods (hc and tabu). I'd note, however, that the main point of this document (at least at this time) is not to provide guidance as to which approach is best, but rather to provide an example of how to use and evaluate the structure learning tools in R.