Statistical Science
2004, Vol. 19, No. 1, 140–155
DOI 10.1214/088342304000000026
© Institute of Mathematical Statistics, 2004

Graphical Models

Michael I. Jordan

Abstract. Statistical applications in fields such as bioinformatics, information retrieval, speech processing, image processing and communications often involve large-scale models in which thousands or millions of random variables are linked in complex ways. Graphical models provide a general methodology for approaching these problems, and indeed many of the models developed by researchers in these applied fields are instances of the general graphical model formalism. We review some of the basic ideas underlying graphical models, including the algorithmic ideas that allow graphical models to be deployed in large-scale data analysis problems. We also present examples of graphical models in bioinformatics, error-control coding and language processing.

Key words and phrases: Probabilistic graphical models, junction tree algorithm, sum-product algorithm, Markov chain Monte Carlo, variational inference, bioinformatics, error-control coding.

1. INTRODUCTION

The fields of statistics and computer science have generally followed separate paths for the past several decades, each field providing useful services to the other, but with the core concerns of the two fields rarely appearing to intersect. In recent years, however, it has become increasingly evident that the long-term goals of the two fields are closely aligned. Statisticians are increasingly concerned with the computational aspects, both theoretical and practical, of models and inference procedures. Computer scientists are increasingly concerned with systems that interact with the external world and interpret uncertain data in terms of underlying probabilistic models. One area in which these trends are most evident is that of probabilistic graphical models.

A graphical model is a family of probability distributions defined in terms of a directed or undirected graph. The nodes in the graph are identified with random variables, and joint probability distributions are defined by taking products over functions defined on connected subsets of nodes. By exploiting the graph-theoretic representation, the formalism provides general algorithms for computing marginal and conditional probabilities of interest. Moreover, the formalism provides control over the computational complexity associated with these operations.

The graphical model formalism is agnostic to the distinction between frequentist and Bayesian statistics. However, by providing general machinery for manipulating joint probability distributions and, in particular, by making hierarchical latent variable models easy to represent and manipulate, the formalism has proved to be particularly popular within the Bayesian paradigm. Viewing Bayesian statistics as the systematic application of probability theory to statistics, and viewing graphical models as a systematic application of graph-theoretic algorithms to probability theory, it should not be surprising that many authors have viewed graphical models as a general Bayesian "inference engine" (Cowell, Dawid, Lauritzen and Spiegelhalter, 1999).

What is perhaps most distinctive about the graphical model approach is its naturalness in formulating probabilistic models of complex phenomena in applied fields, while maintaining control over the computational cost associated with these models. Accordingly, in this article our principal focus is on the presentation of graphical models that have proved useful in applied domains and on ways in which the formalism encourages the exploration of extensions of classical methods.

Michael I. Jordan is Professor, Computer Science Division and Department of Statistics, University of California, Berkeley, California 94720-3860, USA (e-mail: jordan@stat.berkeley.edu).
Before turning to these examples, however, we begin with an overview of basic concepts.

2. REPRESENTATION

The two most common forms of graphical model are directed graphical models and undirected graphical models, based on directed acyclic graphs and undirected graphs, respectively.

Let us begin with the directed case. Let G(V, E) be a directed acyclic graph, where V are the nodes and E are the edges of the graph. Let {X_v : v ∈ V} be a collection of random variables indexed by the nodes of the graph. To each node v ∈ V, let π_v denote the subset of indices of its parents. We allow sets of indices to appear wherever a single index appears; thus X_{π_v} denotes the vector of random variables indexed by the parents of v. Given a collection of kernels {k(x_v | x_{π_v}) : v ∈ V} that sum (in the discrete case) or integrate (in the continuous case) to 1 (with respect to x_v), we define a joint probability distribution (a probability mass function or probability density as appropriate) as

(1)  p(x_V) = ∏_{v∈V} k(x_v | x_{π_v}).

It is easy to verify that this joint probability distribution has {k(x_v | x_{π_v})} as its conditionals; thus, henceforth, we write k(x_v | x_{π_v}) = p(x_v | x_{π_v}).

Note that we have made no distinction between data and parameters, and indeed it is natural to include parameters among the nodes in the graph.

A plate is a useful device for capturing replication in graphical models, including the factorial and nested structures that occur in experimental designs. A simple example of a plate is shown in Figure 1, which can be viewed as a graphical model representation of the de Finetti exchangeability theorem.

FIG. 1. The diagram in (a) is shorthand for the graphical model in (b). This model asserts that the variables Z_n are conditionally independent and identically distributed given θ, and can be viewed as a graphical model representation of the de Finetti theorem. Note that shading, here and elsewhere in the paper, denotes conditioning.

Directed graphical models are familiar as representations of hierarchical Bayesian models. An example is given in Figure 2.

FIG. 2. An example of a hierarchical Bayesian model represented as a directed graphical model. This is the errors-in-covariates logistic regression model of Richardson, Leblond, Jaussent and Green (2002). The core of this model is a logistic regression of Y_i on X_i. The covariate X_i is not observed (in general), but noisy measurements U_i of X_i are available, as are additional observed covariates C_i. The density model for X_i is taken to be a mixture model, where K is the number of components, W are the mixing proportions, Z_i are the allocations and θ parameterizes the mixture components.

The graph provides an appealing visual representation of a joint probability distribution, but it also provides a great deal more. First, whatever the functional forms of the kernels p(x_v | x_{π_v}), the factorization in (1) implies a set of conditional independence statements among the variables X_v, and the entire set of conditional independence statements can be obtained from a polynomial time reachability algorithm based on the graph (Pearl, 1988). Second, as we discuss in the following section, the graphical structure can be exploited by algorithms for probabilistic inference.

Let us now consider the undirected case. Given an undirected graph G(V, E), we again let {X_v : v ∈ V} be a collection of random variables indexed by the nodes of the graph and let C denote a collection of cliques of the graph (i.e., fully connected subsets of nodes). Associated with each clique C ∈ C, let ψ_C(x_C) denote a nonnegative potential function. We define the joint probability p(x_V) by taking the product over these potential functions and normalizing,

(2)  p(x_V) = (1/Z) ∏_{C∈C} ψ_C(x_C),

where Z is a normalization factor obtained by integrating or summing the product with respect to x_V. See Figure 3 for an example of an undirected graphical model.

FIG. 3. An example of an undirected graphical model. Probability distributions associated with this graph can be factorized as p(x_V) = (1/Z) ψ(x_1, x_2) ψ(x_1, x_3) ψ(x_2, x_4) ψ(x_3, x_5) ψ(x_2, x_5, x_6).

Undirected graphs are often used in problems in areas such as spatial statistics, statistical natural language processing and communication networks—problems in which there is little causal structure to guide the construction of a directed graph. However, there is no need to restrict undirected models to such domains; in particular, it is possible to include parameters among the nodes of an undirected graph to yield an alternative general tool for Bayesian modeling. It is also possible to work with hybrids that include both directed and undirected edges (Lauritzen, 1996).

In general, directed graphs and undirected graphs make different assertions of conditional independence. Thus, there are families of probability distributions that are captured by a directed graph and are not captured by any undirected graph, and vice versa (Pearl, 1988).

The representations shown in (1) and (2) can be overly coarse for some purposes. In particular, in the undirected formalism the cliques C may be quite large, and it is often useful to consider potential functions that are themselves factorized in ways that need not be equated with conditional independencies. Thus, in general, we consider a set of "factors" {f_i(x_{C_i}) : i ∈ I} for some index set I, where C_i is the subset of nodes associated with the ith factor. Note in particular that the same subset can be repeated multiple times (i.e., we allow C_i = C_j for i ≠ j). We define a joint probability by taking the product across these factors:

(3)  p(x_V) = (1/Z) ∏_{i∈I} f_i(x_{C_i}).

As shown in Figure 4, this definition is associated with a graphical representation—the factor graph (Kschischang, Frey and Loeliger, 2001). A factor graph is a bipartite graph in which the random variables are round nodes and the factors appear as square nodes. There is an edge between the factor node f_i and the variable node X_v if and only if v ∈ C_i.

FIG. 4. An example of a factor graph. Probability distributions associated with this graph can be factorized as p(x_V) = (1/Z) f_a(x_1, x_3) f_b(x_3, x_4) f_c(x_2, x_4, x_5) f_d(x_1, x_3).

Factor graphs provide a more fine-grained representation of the factors that make up a joint probability and are useful in defining inference algorithms that exploit this structure (the topic of the following section).

Note also that the factor f_i(x_{C_i}) can often be written as exp(θ_i χ_i(x_{C_i})) for a parameter θ_i and a fixed function χ_i, in which case the representation in (3) is nothing but the canonical form of the exponential family. Thus factor graphs are widely used as graph-theoretic representations of exponential family distributions.

3. ALGORITHMS FOR PROBABILISTIC INFERENCE

The general problem of probabilistic inference is that of computing conditional probabilities p(x_F | x_E), where V = E ∪ F ∪ H for given subsets E, F and H. In this section we are concerned with algorithms for performing such computations and the role that graphical structure can play in making such computations efficient.

In discussing inference algorithms, it proves useful to treat directed graphs and undirected graphs on an equal footing. This is done by converting the former to the latter. Note in particular that (1) could be treated as a special case of (2) if it were the case that each factor p(x_i | x_{π_i}) were a function on a clique. In general, however, the parents π_i of a given node i are not connected and so the set π_i ∪ {i} is not a clique. We can force this set to be a clique by adding (undirected) edges between all of the parents π_i, essentially constructing a new graph that is a graphical cover of the original graph. If we also convert the directed edges (from parents to children) to undirected edges, the result is an undirected graphical cover—the so-called moral graph—in which all of the arguments of the function p(x_i | x_{π_i}) are contained in a clique. That is, in the moral graph, the factorization in (1) is a special case of (2). Thus we can proceed by working exclusively within the undirected formalism.

It is also useful to note that from a computational point of view the conditioning plays little essential role in the problem. Indeed, to condition on the event {X_E = x̄_E}, it suffices to redefine the original clique potentials. Thus, for i ∈ E, we multiply the potential ψ_C(x_C) by the Kronecker delta function δ(x_i, x̄_i) for any C such that i ∈ C ∩ E. The result is an unnormalized representation of the conditional probability that has the factorized form in (2). Thus, from a computational point of view, it suffices to focus on the problem of marginalization of the general factorized expression in (2). We are interested in controlling the growth in computational complexity of performing such marginalization, as a function of the cardinality of V. In the following three sections, we describe the three principal classes of methods that attempt to deal with this computational issue—exact algorithms, sampling algorithms and variational algorithms. Our presentation will be brief; for a fuller presentation, see Cowell et al. (1999) and Jordan (1999).

3.1 Exact Algorithms

We begin with an example. Consider the graphical model shown in Figure 3 and suppose that we wish to compute the marginal probability p(x_1). We obtain this marginal by summing (assuming discrete random variables) over the remaining variables:

(4)  p(x_1) = (1/Z) Σ_{x_2} Σ_{x_3} Σ_{x_4} Σ_{x_5} Σ_{x_6} ψ(x_1, x_2) ψ(x_1, x_3) ψ(x_2, x_4) ψ(x_3, x_5) ψ(x_2, x_5, x_6).

Naively, each of these sums is applied to a summand that involves six variables, and thus the computational complexity scales as r^6 (assuming for simplicity that each variable has cardinality r). We can obtain a smaller complexity, however, by exploiting the distributive law:

(5)  p(x_1) = (1/Z) Σ_{x_2} ψ(x_1, x_2) ψ(x_1, x_3) ψ(x_2, x_4) summed over x_3, x_4, with the inner sums
            = (1/Z) Σ_{x_2} Σ_{x_3} Σ_{x_4} ψ(x_1, x_2) ψ(x_1, x_3) ψ(x_2, x_4) Σ_{x_5} ψ(x_3, x_5) Σ_{x_6} ψ(x_2, x_5, x_6)
            = (1/Z) Σ_{x_2} Σ_{x_3} Σ_{x_4} ψ(x_1, x_2) ψ(x_1, x_3) ψ(x_2, x_4) Σ_{x_5} ψ(x_3, x_5) m_6(x_2, x_5)
            = (1/Z) Σ_{x_2} ψ(x_1, x_2) Σ_{x_3} ψ(x_1, x_3) m_5(x_2, x_3) Σ_{x_4} ψ(x_2, x_4)
            = (1/Z) Σ_{x_2} ψ(x_1, x_2) m_4(x_2) Σ_{x_3} ψ(x_1, x_3) m_5(x_2, x_3)
            = (1/Z) Σ_{x_2} ψ(x_1, x_2) m_4(x_2) m_3(x_1, x_2)
            = (1/Z) m_2(x_1),

where we have defined intermediate factors m_i in an obvious notation. We obtain the value of Z, and
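As a concrete numerical instance of the directed factorization in (1), the following sketch builds the joint distribution of a hypothetical three-node binary chain X_1 → X_2 → X_3 (the chain and its conditional tables are invented for illustration; they do not come from the paper) and checks that a product of normalized kernels is itself normalized.

```python
import itertools

# Hypothetical 3-node DAG: X1 -> X2 -> X3, each variable binary.
# Parent sets: pi_1 = {}, pi_2 = {1}, pi_3 = {2}.
p1 = {0: 0.6, 1: 0.4}                      # p(x1)
p2 = {(0, 0): 0.7, (1, 0): 0.3,            # p(x2 | x1): keys are (x2, x1)
      (0, 1): 0.2, (1, 1): 0.8}
p3 = {(0, 0): 0.9, (1, 0): 0.1,            # p(x3 | x2): keys are (x3, x2)
      (0, 1): 0.5, (1, 1): 0.5}

def joint(x1, x2, x3):
    """Equation (1): the joint is the product of the local conditionals."""
    return p1[x1] * p2[(x2, x1)] * p3[(x3, x2)]

# Because each kernel sums to 1 over its first argument, the product
# is automatically a normalized joint distribution.
total = sum(joint(*xs) for xs in itertools.product([0, 1], repeat=3))
print(round(total, 10))  # 1.0
```

The same pattern extends to any DAG: one table per node, indexed by the node's value and its parents' values.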
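The elimination steps in (5) can be checked numerically. The sketch below instantiates the clique structure of the Figure 3 model with arbitrary random potential tables (only the clique structure comes from the paper; the numbers are invented), computes the unnormalized marginal both by the naive O(r^6) sum in (4) and by the O(r^3) elimination order x_6, x_5, x_4, x_3, x_2 in (5), and verifies that the two agree.

```python
import itertools
import random

random.seed(0)
r = 3  # cardinality of each variable
vals = range(r)

# Arbitrary positive potentials on the cliques of the Figure 3 model.
def rand_table(k):
    return {xs: random.uniform(0.1, 1.0) for xs in itertools.product(vals, repeat=k)}

psi12, psi13, psi24, psi35 = (rand_table(2) for _ in range(4))
psi256 = rand_table(3)

def product_of_potentials(x):
    x1, x2, x3, x4, x5, x6 = x
    return (psi12[(x1, x2)] * psi13[(x1, x3)] * psi24[(x2, x4)]
            * psi35[(x3, x5)] * psi256[(x2, x5, x6)])

# Brute force: equation (4), summing the full product over x2..x6.
brute = [sum(product_of_potentials((x1,) + rest)
             for rest in itertools.product(vals, repeat=5))
         for x1 in vals]

# Elimination: equation (5), pushing each sum inside the product.
m6 = {(x2, x5): sum(psi256[(x2, x5, x6)] for x6 in vals)
      for x2 in vals for x5 in vals}
m5 = {(x2, x3): sum(psi35[(x3, x5)] * m6[(x2, x5)] for x5 in vals)
      for x2 in vals for x3 in vals}
m4 = {x2: sum(psi24[(x2, x4)] for x4 in vals) for x2 in vals}
m3 = {(x1, x2): sum(psi13[(x1, x3)] * m5[(x2, x3)] for x3 in vals)
      for x1 in vals for x2 in vals}
m2 = {x1: sum(psi12[(x1, x2)] * m4[x2] * m3[(x1, x2)] for x2 in vals)
      for x1 in vals}

# Summing m2 over x1 sums the product over all of x_V, which is Z itself,
# so normalizing m2 yields the marginal p(x1).
Z = sum(m2.values())
marginal = [m2[x1] / Z for x1 in vals]
```

Each intermediate table m_i involves at most three variables at a time, which is the source of the complexity reduction the text describes.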
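The delta-function trick for absorbing evidence can also be sketched on a toy model: a hypothetical three-node undirected chain X_1 - X_2 - X_3 with random potentials (the model and numbers are illustrative; only the trick itself comes from the text). Multiplying one potential by a Kronecker delta on the observed value and then marginalizing recovers the conditional distribution, up to normalization.

```python
import itertools
import random

random.seed(1)
r = 3
vals = range(r)

# A small undirected chain X1 - X2 - X3 with arbitrary random potentials.
psi12 = {xs: random.uniform(0.1, 1.0) for xs in itertools.product(vals, repeat=2)}
psi23 = {xs: random.uniform(0.1, 1.0) for xs in itertools.product(vals, repeat=2)}

x3_obs = 2  # the observed value of X3

# Absorb the evidence: multiply psi23 by the Kronecker delta on x3.
psi23_e = {(x2, x3): psi23[(x2, x3)] * (1.0 if x3 == x3_obs else 0.0)
           for x2 in vals for x3 in vals}

# Marginalizing the evidence-modified model gives an unnormalized
# representation of p(x1, x3_obs); normalizing over x1 gives p(x1 | x3_obs).
unnorm = [sum(psi12[(x1, x2)] * psi23_e[(x2, x3)]
              for x2 in vals for x3 in vals) for x1 in vals]
cond = [u / sum(unnorm) for u in unnorm]

# Check against the definition p(x1 | x3) = p(x1, x3) / p(x3).
joint = {xs: psi12[xs[:2]] * psi23[xs[1:]]
         for xs in itertools.product(vals, repeat=3)}
direct = [sum(v for xs, v in joint.items() if xs[0] == x1 and xs[2] == x3_obs)
          for x1 in vals]
direct = [d / sum(direct) for d in direct]
print(all(abs(a - b) < 1e-9 for a, b in zip(cond, direct)))  # True
```

This is why the text reduces conditioning to marginalization: once the deltas are folded into the potentials, the evidence-modified model has exactly the factorized form in (2).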