### 11.2.4 Structure Learning

Suppose we have complete data and no hidden variables, but we do not
know the structure of the belief network. This is the setting for
**structure learning** of belief networks.

There are two main approaches to structure learning:

- The first is to use the definition of a belief network in terms of conditional independence. Given a total ordering of the variables, make the parents of a variable *X* be a subset of the predecessors of *X* in the total ordering that renders the other predecessors independent of *X*. This approach has two main challenges: the first is to determine the best total ordering; the second is to find a way to measure independence. It is difficult to determine conditional independence when there is limited data.
- The second method is to have a score for networks, for example, using the MAP model, which takes into account fit to the data and model complexity. Given such a measure, we can search for the structure that maximizes this score.
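To make the first approach concrete, here is a minimal Python sketch of parent selection for one variable given a total ordering. The function names, the list-of-dicts data format, and the use of an estimated conditional mutual information with a fixed threshold as the independence measure are all illustrative assumptions, not from the text:

```python
import itertools
from collections import Counter
from math import log

def cond_mutual_info(data, x, y, zs):
    """Estimate the conditional mutual information I(x; y | zs) from
    counts.  data is a list of dicts mapping variable name to value
    (a hypothetical format chosen for this sketch)."""
    n = len(data)
    joint = Counter((tuple(e[z] for z in zs), e[x], e[y]) for e in data)
    xz = Counter((tuple(e[z] for z in zs), e[x]) for e in data)
    yz = Counter((tuple(e[z] for z in zs), e[y]) for e in data)
    zc = Counter(tuple(e[z] for z in zs) for e in data)
    mi = 0.0
    for (zv, xv, yv), c in joint.items():
        mi += (c / n) * log(c * zc[zv] / (xz[(zv, xv)] * yz[(zv, yv)]))
    return mi

def select_parents(data, x, predecessors, threshold=0.01):
    """Return the smallest subset of the predecessors that (to within
    the threshold) renders the remaining predecessors independent of x."""
    for k in range(len(predecessors) + 1):
        for parents in itertools.combinations(predecessors, k):
            rest = [y for y in predecessors if y not in parents]
            if all(cond_mutual_info(data, x, y, list(parents)) < threshold
                   for y in rest):
                return list(parents)
    return list(predecessors)
```

The threshold illustrates the difficulty noted above: with limited data, the estimated mutual information is rarely exactly zero, so some cutoff (or a proper statistical test) must decide what counts as independent.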

In this section we concentrate on the second method,
often called a **search and score** method.

Assume that the data is a set *E* of examples, where each example has a value for each variable.

The aim of the search and score method is to choose a model that maximizes

P(model|data) ∝ P(data|model) P(model).

The likelihood, *P(data|model)*, is the product of the probabilities of the examples. Using the product decomposition, the probability of each example given the model is the product of the probability of each variable given its parents in the model.
Thus,

P(data|model) P(model) = (∏_{e∈E} P(e|model)) P(model) = (∏_{e∈E} ∏_{Xi} P_{model}^{e}(X_{i}|par(X_{i},model))) P(model),

where *par(X_{i},model)* denotes the parents of *X_{i}* in the model, and *P_{model}^{e}(·)* denotes the probability of example *e* as specified in the model.

This is maximized when its logarithm is maximized. When taking logarithms, products become sums:

log P(data|model) + log P(model) = ∑_{e∈E} ∑_{Xi} log P_{model}^{e}(X_{i}|par(X_{i},model)) + log P(model).
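To make the likelihood term concrete, the following Python sketch computes log P(data|model) for a structure given as a dictionary mapping each variable to its parents. The data format, the function name, and the use of Laplace-smoothed counts to estimate the conditional probabilities are illustrative choices, not prescribed by the text:

```python
from collections import Counter
from math import log

def log_likelihood(data, parents, pseudocount=1.0):
    """Compute log P(data|model): the sum over examples and variables
    of log P(X_i | par(X_i, model)), with the conditional probabilities
    estimated from the same data using Laplace (pseudocount) smoothing
    -- an illustrative modelling choice."""
    total = 0.0
    for x, pars in parents.items():
        k = len({e[x] for e in data})          # number of values of x
        fam = Counter((tuple(e[p] for p in pars), e[x]) for e in data)
        parc = Counter(tuple(e[p] for p in pars) for e in data)
        for e in data:
            pv = tuple(e[p] for p in pars)
            prob = (fam[(pv, e[x])] + pseudocount) / (parc[pv] + k * pseudocount)
            total += log(prob)
    return total
```

Note that the log-likelihood alone always favors more parents; the log P(model) prior term is what penalizes complexity.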

To make this approach feasible, assume that the prior probability of the model decomposes into components for each variable. That is, we assume the probability of the model decomposes into a product of probabilities of local models for each variable. Let *model(X_{i})* be the local model for variable *X_{i}*.

Thus, we want to maximize the following:

∑_{e∈E} ∑_{Xi} log P_{model}^{e}(X_{i}|par(X_{i},model)) + ∑_{Xi} log P(model(X_{i}))
= ∑_{Xi} (∑_{e∈E} log P_{model}^{e}(X_{i}|par(X_{i},model))) + ∑_{Xi} log P(model(X_{i}))
= ∑_{Xi} (∑_{e∈E} log P_{model}^{e}(X_{i}|par(X_{i},model)) + log P(model(X_{i}))).
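This decomposition can be sketched in Python: each variable's local model is scored separately and the scores are summed. The BIC-style complexity penalty is used here as an illustrative stand-in for log P(model(X_{i})), and the Laplace smoothing is likewise an assumption of this sketch:

```python
from collections import Counter
from math import log

def family_score(data, x, pars, pseudocount=1.0):
    """Score one variable's local model: its log-likelihood term plus
    a BIC-style complexity penalty, used here as an illustrative
    stand-in for log P(model(X_i))."""
    n = len(data)
    k = len({e[x] for e in data})          # number of values of x
    fam = Counter((tuple(e[p] for p in pars), e[x]) for e in data)
    parc = Counter(tuple(e[p] for p in pars) for e in data)
    ll = sum(log((fam[(tuple(e[p] for p in pars), e[x])] + pseudocount)
                 / (parc[tuple(e[p] for p in pars)] + k * pseudocount))
             for e in data)
    free_params = (k - 1) * len(parc)      # free parameters in the CPT
    return ll - 0.5 * log(n) * free_params

def network_score(data, parents):
    """The decomposed score: a sum of independent family scores."""
    return sum(family_score(data, x, pars) for x, pars in parents.items())
```

Because the total is a sum of per-variable terms, each family can be scored without reference to the others, which is what makes the optimization below tractable.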

We could optimize this by optimizing each variable separately, except
for the fact that the parent relation is constrained by the acyclic
condition of the belief network. However, given a total ordering
of the variables, we have a classification problem in which we want
to predict the probability of each variable given the predecessors in
the total ordering. To represent *P(X_{i}|par(X_{i},model))* we could use, for example, a decision tree with probabilities at the leaves [as described in Section 7.5.1.1] or learn a squashed linear function. Given the preceding score, we can search
over total orderings of the variables to maximize this score.
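Putting the pieces together, a brute-force search-and-score procedure might look like the following Python sketch. Everything here (the names, the list-of-dicts data format, the BIC-style family score with pseudocount smoothing) is an illustrative assumption; a practical implementation would use greedy or local search rather than exhaustive enumeration over orderings and parent subsets:

```python
import itertools
from collections import Counter
from math import log

def family_score(data, x, pars):
    """BIC-style family score: smoothed log-likelihood of x given its
    parents, minus a complexity penalty (illustrative choices)."""
    n = len(data)
    k = len({e[x] for e in data})
    fam = Counter((tuple(e[p] for p in pars), e[x]) for e in data)
    parc = Counter(tuple(e[p] for p in pars) for e in data)
    ll = sum(log((fam[(tuple(e[p] for p in pars), e[x])] + 1)
                 / (parc[tuple(e[p] for p in pars)] + k))
             for e in data)
    return ll - 0.5 * log(n) * (k - 1) * len(parc)

def best_network(data, variables):
    """Exhaustive search and score: for every total ordering, choose
    the best-scoring predecessor subset for each variable, and keep
    the highest-scoring network found."""
    best_score, best_parents = float('-inf'), None
    for order in itertools.permutations(variables):
        parents, score = {}, 0.0
        for i, x in enumerate(order):
            preds = order[:i]
            cands = [(family_score(data, x, list(c)), list(c))
                     for r in range(len(preds) + 1)
                     for c in itertools.combinations(preds, r)]
            s, p = max(cands, key=lambda t: t[0])
            parents[x] = p
            score += s
        if score > best_score:
            best_score, best_parents = score, parents
    return best_parents
```

Since acyclicity is guaranteed by only allowing parents from a variable's predecessors in the ordering, the inner loop is free to optimize each family independently, exactly as the decomposed score suggests.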