Probabilistic Inductive Logic Programming
The field of Probabilistic Logic Programming (PLP) has seen significant advances in the last 20 years, with many proposals for languages that combine probability with logic programming. Since the start, the problem of learning probabilistic logic programs has been the focus of much attention, and a special issue of Theory and Practice of Logic Programming on Probability, Logic, and Learning has just appeared online. Learning probabilistic logic programs represents a whole subfield of Inductive Logic Programming (ILP). In Probabilistic ILP (PILP) two problems are considered: learning the parameters of a program given its structure (the rules), and learning the structure and the parameters at the same time.
Usually structure learning systems use parameter learning as a subroutine. In this article we present the field of PILP and discuss the main results.
Since then the field has steadily developed and many proposals for the integration of logic programming and probability have appeared. These proposals can be grouped into two classes: those that use a variant of the distribution semantics and those that follow a Knowledge Base Model Construction (KBMC) approach [27, 1]. The distribution semantics underlies many languages such as Probabilistic Logic Programs , Probabilistic Horn Abduction , Independent Choice Logic , PRISM , pD , Logic Programs with Annotated Disjunctions , P-log , ProbLog  and CP-logic . While the number of languages is large, they all share a common core, so that transformations with linear complexity exist that translate one language into another. Under the distribution semantics, a probabilistic logic program defines a probability distribution over normal logic programs (termed worlds). The probability of a ground query Q is then obtained from the joint distribution of the query and the worlds: it is the sum of the probabilities of the worlds where the query is true.
The languages following a KBMC approach include CLP(BN) , Bayesian Logic Programs  and the Prolog Factor Language . In these languages, a program is a template for generating a ground graphical model, be it a Bayesian network or a Markov network.
Learning probabilistic logic programs has been considered from the start:  already presented an algorithm for learning the parameters of programs under the distribution semantics. Parameter learning was the first problem considered in the domain of learning, and the only one until recently, when works on inducing both the structure (the rules) and the parameters at the same time started to appear. The whole field was called Probabilistic Inductive Logic Programming (PILP) in .
In the following we provide an overview of PILP by concentrating on languages under the distribution semantics.
We illustrate the distribution semantics with ProbLog , the language with the simplest syntax. A ProbLog program consists of a set of rules (certain) and a set of probabilistic facts of the form

pi :: Ai.

where pi ∈ [0,1] and Ai is an atom, meaning that each ground instantiation Aiθ of Ai is true with probability pi and false with probability 1 − pi. From a ProbLog program we obtain normal programs, called worlds, by including the set C of rules and a subset L of the (ground) probabilistic facts: each world is obtained by selecting or rejecting each grounding of each probabilistic fact. The probability of a world is given by the product of a factor pi for each grounding of a probabilistic fact pi :: Ai included in the world and of a factor 1 − pi for each grounding of a probabilistic fact not included in the world. The probability of a ground atom (query) is then the sum of the probabilities of the worlds where the query is true.
Example 1 The following program encodes the fact that a person sneezes if he has the flu and this is the active cause of sneezing, or if he has hay fever and hay fever is the active cause of sneezing:

sneezing(X) :- flu(X), fluSneezing(X).
sneezing(X) :- hayFever(X), hayFeverSneezing(X).
flu(bob).
hayFever(bob).
0.7 :: fluSneezing(X).
0.8 :: hayFeverSneezing(X).

This program has 4 worlds; sneezing(bob) is true in 3 of them and its probability is 0.7 × 0.8 + 0.3 × 0.8 + 0.7 × 0.2 = 0.94.
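The 0.94 figure can be checked by brute force. Below is a minimal Python sketch (not ProbLog itself; the fact names and probabilities are the ones used in the example) that enumerates the worlds and sums the probabilities of those entailing the query:

```python
from itertools import product

# Ground probabilistic facts of Example 1.
facts = {"fluSneezing(bob)": 0.7, "hayFeverSneezing(bob)": 0.8}

# The certain part of the program: bob has the flu and hay fever,
# so sneezing(bob) holds whenever either active cause is selected.
def sneezing_bob(world):
    return world["fluSneezing(bob)"] or world["hayFeverSneezing(bob)"]

def query_probability(query, facts):
    """Sum the probabilities of the worlds where the query is true."""
    total = 0.0
    names = list(facts)
    for values in product([True, False], repeat=len(names)):
        world = dict(zip(names, values))
        # Probability of this world: product of p_i for included facts
        # and 1 - p_i for excluded ones.
        p_world = 1.0
        for name, included in world.items():
            p_world *= facts[name] if included else 1.0 - facts[name]
        if query(world):
            total += p_world
    return total

print(query_probability(sneezing_bob, facts))  # ≈ 0.94
```

Enumerating all worlds is exponential in the number of ground probabilistic facts, which is why real systems resort to knowledge compilation instead.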
The problem of computing the probability of queries is called inference. Solving it by computing all the worlds and then identifying those that entail the query is impractical, as the number of possible worlds is exponential in the number of ground probabilistic facts. Usually inference is performed by resorting to knowledge compilation : the program and the query are compiled into a language that allows an efficient computation of the probability. The first approach for performing inference  required finding a covering set of explanations for the query. An explanation is a minimal set of probabilistic facts that is sufficient for entailing the query, and a covering set of explanations is a set that contains all possible explanations for the query. From the set of explanations, a Boolean formula in Disjunctive Normal Form (DNF) can be obtained, where each probabilistic fact is associated with a Boolean variable, an explanation is the conjunction of the facts it contains, and the whole formula is the disjunction of the formulas for the different explanations.
In Example 1, if we associate the Boolean variable X1 with fluSneezing(bob) and X2 with hayFeverSneezing(bob), the Boolean formula that encodes the set of explanations is X1 ∨ X2.
Computing the probability of a Boolean formula in DNF is an NP-hard problem. With knowledge compilation, we transform the formula so that the computation of the probability is polynomial in the size of the obtained representation. The hardness is transferred to the compilation phase but this approach works well in practice because considerable effort has been devoted to the development of efficient compilers.
A target language for knowledge compilation is that of Binary Decision Diagrams (BDDs). They are rooted graphs with one level for each Boolean variable. A node n in a BDD has two children: one corresponding to the 1 value of the variable associated with the level of n and one corresponding to the 0 value of that variable. The leaves store either 0 or 1.
In a BDD for the function X1 ∨ X2, the root tests X1: its 1-child is the 1 leaf, while its 0-child tests X2, whose 1-child is the 1 leaf and whose 0-child is the 0 leaf. From a BDD we can compute the probability of the query with a dynamic programming algorithm that is linear in the size of the BDD.
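This dynamic programming step can be sketched in a few lines of Python (the BDD for X1 ∨ X2 is hand-built here, with the probabilities of Example 1; real systems obtain it from a BDD package):

```python
# A BDD node is a tuple (variable, 1-child, 0-child); the leaves are
# the Python Booleans True and False.
# Hand-built BDD for X1 or X2: test X1; if it is false, test X2.
bdd = ("X1", True, ("X2", True, False))

# Probabilities of the Boolean variables (Example 1).
prob = {"X1": 0.7, "X2": 0.8}

def bdd_prob(node, cache=None):
    """Probability that the function encoded by the BDD is true.
    Each node is visited once, so the cost is linear in the BDD size."""
    if cache is None:
        cache = {}
    if node is True:
        return 1.0
    if node is False:
        return 0.0
    if id(node) not in cache:
        var, hi, lo = node
        p = prob[var]
        # Weighted average of the children's probabilities.
        cache[id(node)] = p * bdd_prob(hi, cache) + (1 - p) * bdd_prob(lo, cache)
    return cache[id(node)]

print(bdd_prob(bdd))  # ≈ 0.94
```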
Another target language is that of deterministic, decomposable negation normal form (d-DNNF): a d-DNNF is a rooted directed acyclic graph in which each leaf node is labeled with a literal and each internal node is labeled with a conjunction or disjunction. The graphs must moreover satisfy a number of restrictions. d-DNNF has been used in  for performing inference with ProbLog: again, once the d-DNNF has been built, computing the probability is linear in the size of the graph.
The problem that PILP aims at solving can be expressed as follows.

Given:
- a background knowledge as a probabilistic logic program B
- a set of positive and negative examples E+ and E−
- a language bias L

Find:
- a probabilistic logic program P such that the probability of positive examples according to P ∪ B is maximized and the probability of negative examples is minimized.
This problem has two variants: parameter learning and structure learning. In the first, we are given the structure (the rules) of P and we just want to infer the parameters of P, while in the second we want to infer both the structure and the parameters of P.
Parameter learning for languages following the distribution semantics has been performed by using the Expectation Maximization (EM) algorithm or by gradient descent. The EM algorithm is used to estimate the parameters of models containing random variables that are not observed in the data. It consists of a cycle in which the steps of expectation and maximization are repeatedly performed: in the expectation step, the distribution of the hidden variables is computed according to the current values of the parameters, while in the maximization step the new values of the parameters are computed. Examples of EM algorithms are PRISM , LFI-ProbLog  and EMBLEM . The latter two use knowledge compilation for computing the distribution of the hidden variables. RIB  is a system for parameter learning that uses a special EM algorithm based on the Information Bottleneck method, which was shown to avoid some of the local maxima of EM.
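As a toy illustration of the two steps (this is only a sketch, not the algorithm of any of the systems above), consider learning the parameters of two probabilistic facts X1 and X2 from observations of the query q = X1 ∨ X2 alone, so that the truth values of the facts themselves are hidden:

```python
def em(n_true, n_false, iters=200):
    """EM for two probabilistic facts X1, X2 with observed query q = X1 or X2.
    n_true / n_false: counts of examples where q was observed true / false."""
    p1, p2 = 0.5, 0.5  # initial parameters
    n = n_true + n_false
    for _ in range(iters):
        pq = 1 - (1 - p1) * (1 - p2)  # P(q) under the current parameters
        # E-step: expected truth value of each hidden fact per example.
        # If q is false, both facts must be false; if q is true,
        # P(X1 = 1 | q = 1) = p1 / P(q), and symmetrically for X2.
        e1 = n_true * (p1 / pq)  # expected count of examples with X1 = 1
        e2 = n_true * (p2 / pq)
        # M-step: new parameters are the normalized expected counts.
        p1, p2 = e1 / n, e2 / n
    return p1, p2

p1, p2 = em(n_true=9, n_false=1)
print(1 - (1 - p1) * (1 - p2))  # ≈ 0.9, the empirical frequency of q
```

At a fixed point of these updates, P(q) equals the empirical frequency of the query, i.e. the likelihood of the observations is maximized (the individual parameters are not identifiable from q alone, so where p1 and p2 end up depends on the initial values).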
Gradient descent methods compute the gradient of the target function and iteratively modify the parameters by moving in the direction of the gradient. An example of these methods is LeProbLog , which uses a dynamic programming algorithm over BDDs for computing the gradient.
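On the same two-fact query q = X1 ∨ X2, where P(q) = 1 − (1 − p1)(1 − p2), the gradient has the closed form ∂P(q)/∂p1 = 1 − p2. A toy gradient descent (again only a sketch, not LeProbLog itself) can fit the parameters to a target probability by minimizing the squared error:

```python
def fit(target, lr=0.5, iters=2000):
    """Gradient descent on the squared error (P(q) - target)^2
    for the query q = X1 or X2, with P(q) = 1 - (1 - p1)(1 - p2)."""
    p1, p2 = 0.2, 0.3  # initial parameters
    for _ in range(iters):
        pq = 1 - (1 - p1) * (1 - p2)
        err = pq - target
        # Chain rule: d(err^2)/dp1 = 2 * err * dP(q)/dp1,
        # with dP(q)/dp1 = 1 - p2 (and symmetrically for p2).
        g1 = 2 * err * (1 - p2)
        g2 = 2 * err * (1 - p1)
        p1 = min(1.0, max(0.0, p1 - lr * g1))  # keep parameters in [0, 1]
        p2 = min(1.0, max(0.0, p2 - lr * g2))
    return p1, p2

p1, p2 = fit(target=0.94)
print(1 - (1 - p1) * (1 - p2))  # ≈ 0.94
```

LeProbLog computes such gradients for arbitrary queries by a dynamic programming pass over the BDD, rather than from a closed-form expression as in this toy case.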
Structure learning is still relatively unexplored. One of the first works  presented an algorithm for performing theory compression on ProbLog programs: theory compression means removing as many clauses as possible from the theory in order to maximize the probability of the examples; no new clause can be added to the theory.
SEM-CP-logic  learns parameters and structure of ground CP-logic programs.
It performs learning by considering the Bayesian networks equivalent to CP-logic
programs and by applying techniques for learning Bayesian networks.
SLIPCASE  learns Logic Programs with Annotated Disjunctions by iteratively refining theories and learning the parameters of each theory with EMBLEM. This is possible as parameter learning is usually fast. SLIPCOVER  (in the TPLP special issue) is an evolution of SLIPCASE that uses bottom clauses to guide the refinement process, thus reducing the number of revisions and exploring the search space more effectively. Moreover, SLIPCOVER separates the search for promising clauses from the search for theories: the space of clauses is explored with beam search, while the space of theories is searched greedily.
ProbFOIL  combines the rule learner FOIL with ProbLog. Logical rules are learned from probabilistic data, in the sense that both the examples themselves and their classifications can be probabilistic. In this setting the parameters (the probability values) are fixed and the structure (the rules) is to be learned.
While this article does not aim at being a complete account of the activity in the field of PLP, we hope to have given an introduction that highlights the important results already achieved. There are many avenues for future research: improving the efficiency of inference, which is a basic component of learning systems, and developing faster and more accurate learning systems. These challenges must be taken up for PLP systems to increase their adoption.
Elena Bellodi and Fabrizio Riguzzi. Structure learning of probabilistic logic programs by searching the clause space. Theory and Practice of Logic Programming, FirstView, 2014. Also available as CoRR abs/1309.2080.
 Daan Fierens, Guy Van den Broeck, Joris Renkens, Dimitar Sht.
Shterionov, Bernd Gutmann, Ingo Thon, Gerda Janssens, and Luc De
Raedt. Inference and learning in probabilistic logic programs using weighted
boolean formulas. CoRR, abs/1304.6810, 2013.
Bernd Gutmann, Angelika Kimmig, Kristian Kersting, and Luc De Raedt. Parameter learning in probabilistic databases: A least squares approach. In ECML/PKDD 2008, volume 5211 of LNCS, pages 473–488. Springer, 2008.
Joost Vennekens, Marc Denecker, and Maurice Bruynooghe. CP-logic: A language of causal probabilistic events and its relation to logic programming. Theory and Practice of Logic Programming, 9(3):245–308, 2009.