Statistical Science
2004, Vol. 19, No. 1, 140–155
DOI: 10.1214/088342304000000026
© Institute of Mathematical Statistics, 2004
Graphical Models
Michael I. Jordan
Abstract. Statistical applications in fields such as bioinformatics, information retrieval, speech processing, image processing and communications often involve large-scale models in which thousands or millions of random variables are linked in complex ways. Graphical models provide a general methodology for approaching these problems, and indeed many of the models developed by researchers in these applied fields are instances of the general graphical model formalism. We review some of the basic ideas underlying graphical models, including the algorithmic ideas that allow graphical models to be deployed in large-scale data analysis problems. We also present examples of graphical models in bioinformatics, error-control coding and language processing.
Key words and phrases: Probabilistic graphical models, junction tree
algorithm, sum-product algorithm, Markov chain Monte Carlo, variational
inference, bioinformatics, error-control coding.
1. INTRODUCTION

The fields of statistics and computer science have generally followed separate paths for the past several decades, each field providing useful services to the other, but with the core concerns of the two fields rarely appearing to intersect. In recent years, however, it has become increasingly evident that the long-term goals of the two fields are closely aligned. Statisticians are increasingly concerned with the computational aspects, both theoretical and practical, of models and inference procedures. Computer scientists are increasingly concerned with systems that interact with the external world and interpret uncertain data in terms of underlying probabilistic models. One area in which these trends are most evident is that of probabilistic graphical models.

A graphical model is a family of probability distributions defined in terms of a directed or undirected graph. The nodes in the graph are identified with random variables, and joint probability distributions are defined by taking products over functions defined on connected subsets of nodes. By exploiting the graph-theoretic representation, the formalism provides general algorithms for computing marginal and conditional probabilities of interest. Moreover, the formalism provides control over the computational complexity associated with these operations.

The graphical model formalism is agnostic to the distinction between frequentist and Bayesian statistics. However, by providing general machinery for manipulating joint probability distributions and, in particular, by making hierarchical latent variable models easy to represent and manipulate, the formalism has proved to be particularly popular within the Bayesian paradigm. Viewing Bayesian statistics as the systematic application of probability theory to statistics and viewing graphical models as a systematic application of graph-theoretic algorithms to probability theory, it should not be surprising that many authors have viewed graphical models as a general Bayesian "inference engine" (Cowell, Dawid, Lauritzen and Spiegelhalter, 1999).

What is perhaps most distinctive about the graphical model approach is its naturalness in formulating probabilistic models of complex phenomena in applied fields, while maintaining control over the computational cost associated with these models. Accordingly, in this article our principal focus is on the presentation of graphical models that have proved useful in applied domains and on ways in which the formalism encourages the exploration of extensions of classical methods.

Michael I. Jordan is Professor, Computer Science Division and Department of Statistics, University of California, Berkeley, California 94720-3860, USA (e-mail: jordan@stat.berkeley.edu).
Before turning to these examples, however, we begin with an overview of basic concepts.

2. REPRESENTATION

The two most common forms of graphical model are directed graphical models and undirected graphical models, based on directed acyclic graphs and undirected graphs, respectively.

Let us begin with the directed case. Let $G(V,E)$ be a directed acyclic graph, where $V$ are the nodes and $E$ are the edges of the graph. Let $\{X_v : v \in V\}$ be a collection of random variables indexed by the nodes of the graph. To each node $v \in V$, let $\pi_v$ denote the subset of indices of its parents. We allow sets of indices to appear wherever a single index appears; thus $X_{\pi_v}$ denotes the vector of random variables indexed by the parents of $v$. Given a collection of kernels $\{k(x_v \mid x_{\pi_v}) : v \in V\}$ that sum (in the discrete case) or integrate (in the continuous case) to 1 (with respect to $x_v$), we define a joint probability distribution (a probability mass function or probability density as appropriate) as

(1)  $p(x_V) = \prod_{v \in V} k(x_v \mid x_{\pi_v}).$

It is easy to verify that this joint probability distribution has $\{k(x_v \mid x_{\pi_v})\}$ as its conditionals; thus, henceforth, we write $k(x_v \mid x_{\pi_v}) = p(x_v \mid x_{\pi_v})$.

Note that we have made no distinction between data and parameters, and indeed it is natural to include parameters among the nodes in the graph.

A plate is a useful device for capturing replication in graphical models, including the factorial and nested structures that occur in experimental designs. A simple example of a plate is shown in Figure 1, which can be viewed as a graphical model representation of the de Finetti exchangeability theorem.

Directed graphical models are familiar as representations of hierarchical Bayesian models. An example is given in Figure 2.

The graph provides an appealing visual representation of a joint probability distribution, but it also provides a great deal more. First, whatever the functional forms of the kernels $p(x_v \mid x_{\pi_v})$, the factorization in (1) implies a set of conditional independence statements among the variables $X_v$, and the entire set of conditional independence statements can be obtained from a polynomial time reachability algorithm based on the graph (Pearl, 1988). Second, as we discuss in the following section, the graphical structure can be exploited by algorithms for probabilistic inference.

Let us now consider the undirected case. Given an undirected graph $G(V,E)$, we again let $\{X_v : v \in V\}$ be a collection of random variables indexed by the nodes of the graph and let $\mathcal{C}$ denote a collection of cliques of the graph (i.e., fully connected subsets of nodes). Associated with each clique $C \in \mathcal{C}$, let $\psi_C(x_C)$ denote a nonnegative potential function. We define the joint probability $p(x_V)$ by taking the product over these potential functions and normalizing,

(2)  $p(x_V) = \frac{1}{Z} \prod_{C \in \mathcal{C}} \psi_C(x_C),$

where $Z$ is a normalization factor obtained by integrating or summing the product with respect to $x_V$. See Figure 3 for an example of an undirected graphical model.
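The factorization in (1) is easy to check numerically. The sketch below uses a hypothetical three-node chain $X_1 \to X_2 \to X_3$ with binary variables (the kernel tables are invented for illustration): the product of normalized kernels yields a normalized joint, and each kernel is recovered as the corresponding conditional.

```python
import itertools

# Hypothetical chain X1 -> X2 -> X3 over binary variables.
# k1[x1] = p(x1); k2[x1][x2] = p(x2 | x1); k3[x2][x3] = p(x3 | x2).
k1 = [0.6, 0.4]
k2 = [[0.7, 0.3], [0.2, 0.8]]
k3 = [[0.9, 0.1], [0.5, 0.5]]

def joint(x1, x2, x3):
    # Equation (1): the joint is the product of the kernels over the nodes.
    return k1[x1] * k2[x1][x2] * k3[x2][x3]

# The product of normalized kernels sums to 1 over all configurations.
total = sum(joint(*x) for x in itertools.product([0, 1], repeat=3))
print(abs(total - 1.0) < 1e-12)  # True

# The kernel k3 is recovered as the conditional p(x3 | x2).
p_x2_x3 = {(x2, x3): sum(joint(x1, x2, x3) for x1 in [0, 1])
           for x2 in [0, 1] for x3 in [0, 1]}
p_x2 = {x2: p_x2_x3[(x2, 0)] + p_x2_x3[(x2, 1)] for x2 in [0, 1]}
print(abs(p_x2_x3[(1, 1)] / p_x2[1] - k3[1][1]) < 1e-12)  # True
```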
FIG. 1. The diagram in (a) is shorthand for the graphical model in (b). This model asserts that the variables $Z_n$ are conditionally independent and identically distributed given $\theta$, and can be viewed as a graphical model representation of the de Finetti theorem. Note that shading, here and elsewhere in the paper, denotes conditioning.
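The conditionally i.i.d. structure of Figure 1 can be sketched numerically. In the snippet below (the grid of $\theta$ values and the prior are hypothetical), each binary $Z_n$ is Bernoulli($\theta$) given $\theta$; marginalizing out $\theta$ produces a joint over $(Z_1,\ldots,Z_N)$ that is exchangeable, as the de Finetti representation requires.

```python
import itertools

# Hypothetical discrete prior over theta, the success probability of each Z_n.
thetas = [0.2, 0.5, 0.9]
prior = [0.3, 0.4, 0.3]

def p_z(z):
    # Marginal of Z_1..Z_N: mix the conditionally i.i.d. likelihood over theta.
    s = sum(z)
    return sum(p * th ** s * (1 - th) ** (len(z) - s)
               for th, p in zip(thetas, prior))

# Exchangeability: every permutation of a sequence has the same probability.
z = (1, 1, 0, 1, 0)
probs = {p_z(perm) for perm in set(itertools.permutations(z))}
print(max(probs) - min(probs) < 1e-15)  # True

# The mixture over theta is still a normalized joint distribution.
total = sum(p_z(seq) for seq in itertools.product([0, 1], repeat=5))
print(abs(total - 1.0) < 1e-12)  # True
```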
FIG. 2. An example of a hierarchical Bayesian model represented as a directed graphical model. This is the errors-in-covariates logistic regression model of Richardson, Leblond, Jaussent and Green (2002). The core of this model is a logistic regression of $Y_i$ on $X_i$. The covariate $X_i$ is not observed (in general), but noisy measurements $U_i$ of $X_i$ are available, as are additional observed covariates $C_i$. The density model for $X_i$ is taken to be a mixture model, where $K$ is the number of components, $W$ are the mixing proportions, $Z_i$ are the allocations and $\theta$ parameterizes the mixture components.

Undirected graphs are often used in problems in areas such as spatial statistics, statistical natural language processing and communication networks—problems in which there is little causal structure to guide the construction of a directed graph. However, there is no need to restrict undirected models to such domains; in particular, it is possible to include parameters among the nodes of an undirected graph to yield an alternative general tool for Bayesian modeling. It is also possible to work with hybrids that include both directed and undirected edges (Lauritzen, 1996).

In general, directed graphs and undirected graphs make different assertions of conditional independence. Thus, there are families of probability distributions that are captured by a directed graph and are not captured by any undirected graph, and vice versa (Pearl, 1988).

The representations shown in (1) and (2) can be overly coarse for some purposes. In particular, in the undirected formalism the cliques $C$ may be quite large, and it is often useful to consider potential functions that are themselves factorized in ways that need not be equated with conditional independencies. Thus, in general, we consider a set of "factors" $\{f_i(x_{C_i}) : i \in I\}$ for some index set $I$, where $C_i$ is the subset of nodes associated with the $i$th factor. Note in particular that the same subset can be repeated multiple times (i.e., we allow $C_i = C_j$ for $i \neq j$). We define a joint probability by taking the product across these factors:

(3)  $p(x_V) = \frac{1}{Z} \prod_{i \in I} f_i(x_{C_i}).$

As shown in Figure 4, this definition is associated with a graphical representation—the factor graph (Kschischang, Frey and Loeliger, 2001). A factor graph is a bipartite graph in which the random variables are round nodes and the factors appear as square nodes. There is an edge between the factor node $f_i$ and the variable node $X_v$ if and only if $v \in C_i$.

Factor graphs provide a more fine-grained representation of the factors that make up a joint probability and are useful in defining inference algorithms that exploit this structure (the topic of the following section).

FIG. 3. An example of an undirected graphical model. Probability distributions associated with this graph can be factorized as $p(x_V) = \frac{1}{Z}\psi(x_1,x_2)\psi(x_1,x_3)\psi(x_2,x_4)\psi(x_3,x_5)\psi(x_2,x_5,x_6)$.

FIG. 4. An example of a factor graph. Probability distributions associated with this graph can be factorized as $p(x_V) = \frac{1}{Z} f_a(x_1,x_3) f_b(x_3,x_4) f_c(x_2,x_4,x_5) f_d(x_1,x_3)$.
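A factor graph is naturally represented in code as a list of (variable subset, factor) pairs. The sketch below (the factor tables are hypothetical) encodes the factorization in the Figure 4 caption, including the repeated subset $\{1,3\}$ shared by $f_a$ and $f_d$, and computes the normalization factor $Z$ of (3) by brute-force summation over binary variables.

```python
import itertools

# Factors as (variable subset, function) pairs; subsets may repeat (C_i = C_j).
# All tables here are hypothetical nonnegative functions of binary variables.
factors = [
    ((1, 3),    lambda x1, x3: 1.0 + 2.0 * x1 * x3),        # f_a
    ((3, 4),    lambda x3, x4: 2.0 if x3 == x4 else 0.5),   # f_b
    ((2, 4, 5), lambda x2, x4, x5: 1.0 + x2 + x4 + x5),     # f_c
    ((1, 3),    lambda x1, x3: 3.0 - x1),                   # f_d: same subset as f_a
]
variables = [1, 2, 3, 4, 5]

def unnorm(assign):
    # The product in equation (3), before dividing by Z.
    prod = 1.0
    for subset, f in factors:
        prod *= f(*(assign[v] for v in subset))
    return prod

# Z is obtained by summing the product over all joint configurations.
configs = [dict(zip(variables, vals))
           for vals in itertools.product([0, 1], repeat=len(variables))]
Z = sum(unnorm(c) for c in configs)

# Dividing by Z yields a properly normalized joint distribution.
total = sum(unnorm(c) / Z for c in configs)
print(abs(total - 1.0) < 1e-12)  # True
```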
Note also that the factor $f_i(x_{C_i})$ can often be written as $\exp(\theta_i \chi_i(x_{C_i}))$ for a parameter $\theta_i$ and a fixed function $\chi_i$, in which case the representation in (3) is nothing but the canonical form of the exponential family. Thus factor graphs are widely used as graph-theoretic representations of exponential family distributions.

3. ALGORITHMS FOR PROBABILISTIC INFERENCE

The general problem of probabilistic inference is that of computing conditional probabilities $p(x_F \mid x_E)$, where $V = E \cup F \cup H$ for given subsets $E$, $F$ and $H$. In this section we are concerned with algorithms for performing such computations and the role that graphical structure can play in making such computations efficient.

In discussing inference algorithms, it proves useful to treat directed graphs and undirected graphs on an equal footing. This is done by converting the former to the latter. Note in particular that (1) could be treated as a special case of (2) if it were the case that each factor $p(x_i \mid x_{\pi_i})$ were a function on a clique. In general, however, the parents $\pi_i$ of a given node $i$ are not connected and so the set $\pi_i \cup \{i\}$ is not a clique. We can force this set to be a clique by adding (undirected) edges between all of the parents $\pi_i$, essentially constructing a new graph that is a graphical cover of the original graph. If we also convert the directed edges (from parents to children) to undirected edges, the result is an undirected graphical cover—the so-called moral graph—in which all of the arguments of the function $p(x_i \mid x_{\pi_i})$ are contained in a clique. That is, in the moral graph, the factorization in (1) is a special case of (2). Thus we can proceed by working exclusively within the undirected formalism.

It is also useful to note that from a computational point of view the conditioning plays little essential role in the problem. Indeed, to condition on the event $\{X_E = x_E\}$, it suffices to redefine the original clique potentials. Thus, for $i \in E$, we multiply the potential $\psi_C(x_C)$ by the Kronecker delta function $\delta(x_i)$ for any $C$ such that $\{i\} \in C \cap E$. The result is an unnormalized representation of the conditional probability that has the factorized form in (2). Thus, from a computational point of view, it suffices to focus on the problem of marginalization of the general factorized expression in (2). We are interested in controlling the growth in computational complexity of performing such marginalization, as a function of the cardinality of $V$. In the following three sections, we describe the three principal classes of methods that attempt to deal with this computational issue—exact algorithms, sampling algorithms and variational algorithms. Our presentation will be brief; for a fuller presentation, see Cowell et al. (1999) and Jordan (1999).

3.1 Exact Algorithms

We begin with an example. Consider the graphical model shown in Figure 3 and suppose that we wish to compute the marginal probability $p(x_1)$. We obtain this marginal by summing (assuming discrete random variables) over the remaining variables:

(4)  $p(x_1) = \frac{1}{Z} \sum_{x_2} \sum_{x_3} \sum_{x_4} \sum_{x_5} \sum_{x_6} \psi(x_1,x_2)\,\psi(x_1,x_3)\,\psi(x_2,x_4)\,\psi(x_3,x_5)\,\psi(x_2,x_5,x_6).$

Naively, each of these sums is applied to a summand that involves six variables, and thus the computational complexity scales as $r^6$ (assuming for simplicity that each variable has cardinality $r$). We can obtain a smaller complexity, however, by exploiting the distributive law:

(5)
$p(x_1) = \frac{1}{Z} \sum_{x_2} \psi(x_1,x_2) \sum_{x_3} \psi(x_1,x_3) \sum_{x_4} \psi(x_2,x_4) \sum_{x_5} \psi(x_3,x_5) \sum_{x_6} \psi(x_2,x_5,x_6)$
$\phantom{p(x_1)} = \frac{1}{Z} \sum_{x_2} \psi(x_1,x_2) \sum_{x_3} \psi(x_1,x_3) \sum_{x_4} \psi(x_2,x_4) \sum_{x_5} \psi(x_3,x_5)\, m_6(x_2,x_5)$
$\phantom{p(x_1)} = \frac{1}{Z} \sum_{x_2} \psi(x_1,x_2) \sum_{x_3} \psi(x_1,x_3)\, m_5(x_2,x_3) \sum_{x_4} \psi(x_2,x_4)$
$\phantom{p(x_1)} = \frac{1}{Z} \sum_{x_2} \psi(x_1,x_2)\, m_4(x_2) \sum_{x_3} \psi(x_1,x_3)\, m_5(x_2,x_3)$
$\phantom{p(x_1)} = \frac{1}{Z} \sum_{x_2} \psi(x_1,x_2)\, m_4(x_2)\, m_3(x_1,x_2)$
$\phantom{p(x_1)} = \frac{1}{Z}\, m_2(x_1),$

where we have defined intermediate factors $m_i$ in an obvious notation. We obtain the value of $Z$, and
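The distributive-law computation in (5) is exactly the variable elimination procedure, and the savings can be checked directly. The sketch below (hypothetical positive potentials on the cliques of Figure 3, binary variables) computes the unnormalized marginal of $x_1$ both by the naive six-variable sum in (4) and by eliminating $x_6, x_5, x_4, x_3, x_2$ in turn as in (5), and confirms that the two agree.

```python
import itertools

# Hypothetical positive potentials on the cliques of Figure 3 (binary variables).
psi12 = lambda a, b: 1.0 + a + 2 * b
psi13 = lambda a, b: 2.0 - a * b
psi24 = lambda a, b: 1.0 + 3 * a * b
psi35 = lambda a, b: 1.5 + a + b
psi256 = lambda a, b, c: 1.0 + a * b + c

def naive(x1):
    # Equation (4) without 1/Z: sum the full product over x2..x6.
    return sum(psi12(x1, x2) * psi13(x1, x3) * psi24(x2, x4)
               * psi35(x3, x5) * psi256(x2, x5, x6)
               for x2, x3, x4, x5, x6 in itertools.product([0, 1], repeat=5))

def eliminate(x1):
    # Equation (5): push each sum inside the product, creating messages m_i.
    m6 = {(x2, x5): sum(psi256(x2, x5, x6) for x6 in [0, 1])
          for x2 in [0, 1] for x5 in [0, 1]}
    m5 = {(x2, x3): sum(psi35(x3, x5) * m6[(x2, x5)] for x5 in [0, 1])
          for x2 in [0, 1] for x3 in [0, 1]}
    m4 = {x2: sum(psi24(x2, x4) for x4 in [0, 1]) for x2 in [0, 1]}
    m3 = {x2: sum(psi13(x1, x3) * m5[(x2, x3)] for x3 in [0, 1])
          for x2 in [0, 1]}
    return sum(psi12(x1, x2) * m4[x2] * m3[x2] for x2 in [0, 1])

# The eliminated form agrees with the naive sum for every value of x1.
print(all(abs(naive(x1) - eliminate(x1)) < 1e-9 for x1 in [0, 1]))  # True
```

Each intermediate message involves at most three variables at a time, which is the source of the reduction from $r^6$ to a low-order polynomial in $r$.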