Statistical Science
2004, Vol. 19, No. 1, 140–155
DOI 10.1214/088342304000000026
© Institute of Mathematical Statistics, 2004

Graphical Models

Michael I. Jordan

Abstract. Statistical applications in fields such as bioinformatics, information retrieval, speech processing, image processing and communications often involve large-scale models in which thousands or millions of random variables are linked in complex ways. Graphical models provide a general methodology for approaching these problems, and indeed many of the models developed by researchers in these applied fields are instances of the general graphical model formalism. We review some of the basic ideas underlying graphical models, including the algorithmic ideas that allow graphical models to be deployed in large-scale data analysis problems. We also present examples of graphical models in bioinformatics, error-control coding and language processing.

Key words and phrases: Probabilistic graphical models, junction tree algorithm, sum-product algorithm, Markov chain Monte Carlo, variational inference, bioinformatics, error-control coding.

1. INTRODUCTION

The fields of statistics and computer science have generally followed separate paths for the past several decades, each field providing useful services to the other, but with the core concerns of the two fields rarely appearing to intersect. In recent years, however, it has become increasingly evident that the long-term goals of the two fields are closely aligned. Statisticians are increasingly concerned with the computational aspects, both theoretical and practical, of models and inference procedures. Computer scientists are increasingly concerned with systems that interact with the external world and interpret uncertain data in terms of underlying probabilistic models. One area in which these trends are most evident is that of probabilistic graphical models.

A graphical model is a family of probability distributions defined in terms of a directed or undirected graph. The nodes in the graph are identified with random variables, and joint probability distributions are defined by taking products over functions defined on connected subsets of nodes. By exploiting the graph-theoretic representation, the formalism provides general algorithms for computing marginal and conditional probabilities of interest. Moreover, the formalism provides control over the computational complexity associated with these operations.

The graphical model formalism is agnostic to the distinction between frequentist and Bayesian statistics. However, by providing general machinery for manipulating joint probability distributions and, in particular, by making hierarchical latent variable models easy to represent and manipulate, the formalism has proved to be particularly popular within the Bayesian paradigm. Viewing Bayesian statistics as the systematic application of probability theory to statistics, and viewing graphical models as a systematic application of graph-theoretic algorithms to probability theory, it should not be surprising that many authors have viewed graphical models as a general Bayesian "inference engine" (Cowell, Dawid, Lauritzen and Spiegelhalter, 1999).

What is perhaps most distinctive about the graphical model approach is its naturalness in formulating probabilistic models of complex phenomena in applied fields, while maintaining control over the computational cost associated with these models. Accordingly, in this article our principal focus is on the presentation of graphical models that have proved useful in applied domains and on ways in which the formalism encourages the exploration of extensions of classical methods.

Michael I. Jordan is Professor, Computer Science Division and Department of Statistics, University of California, Berkeley, California 94720-3860, USA (e-mail: jordan@stat.berkeley.edu).
Before turning to these examples, however, we begin with an overview of basic concepts.

2. REPRESENTATION

The two most common forms of graphical model are directed graphical models and undirected graphical models, based on directed acyclic graphs and undirected graphs, respectively.

Let us begin with the directed case. Let G(V, E) be a directed acyclic graph, where V are the nodes and E are the edges of the graph. Let {X_v : v ∈ V} be a collection of random variables indexed by the nodes of the graph. To each node v ∈ V, let π_v denote the subset of indices of its parents. We allow sets of indices to appear wherever a single index appears; thus X_{π_v} denotes the vector of random variables indexed by the parents of v. Given a collection of kernels {k(x_v | x_{π_v}) : v ∈ V} that sum (in the discrete case) or integrate (in the continuous case) to 1 (with respect to x_v), we define a joint probability distribution (a probability mass function or probability density as appropriate) as

(1)  p(x_V) = ∏_{v∈V} k(x_v | x_{π_v}).

It is easy to verify that this joint probability distribution has {k(x_v | x_{π_v})} as its conditionals; thus, henceforth, we write k(x_v | x_{π_v}) = p(x_v | x_{π_v}).

Note that we have made no distinction between data and parameters, and indeed it is natural to include parameters among the nodes in the graph.

A plate is a useful device for capturing replication in graphical models, including the factorial and nested structures that occur in experimental designs. A simple example of a plate is shown in Figure 1, which can be viewed as a graphical model representation of the de Finetti exchangeability theorem.

FIG. 1. The diagram in (a) is shorthand for the graphical model in (b). This model asserts that the variables Z_n are conditionally independent and identically distributed given θ, and can be viewed as a graphical model representation of the de Finetti theorem. Note that shading, here and elsewhere in the paper, denotes conditioning.

Directed graphical models are familiar as representations of hierarchical Bayesian models. An example is given in Figure 2.

FIG. 2. An example of a hierarchical Bayesian model represented as a directed graphical model. This is the errors-in-covariates logistic regression model of Richardson, Leblond, Jaussent and Green (2002). The core of this model is a logistic regression of Y_i on X_i. The covariate X_i is not observed (in general), but noisy measurements U_i of X_i are available, as are additional observed covariates C_i. The density model for X_i is taken to be a mixture model, where K is the number of components, W are the mixing proportions, Z_i are the allocations and θ parameterizes the mixture components.

The graph provides an appealing visual representation of a joint probability distribution, but it also provides a great deal more. First, whatever the functional forms of the kernels p(x_v | x_{π_v}), the factorization in (1) implies a set of conditional independence statements among the variables X_v, and the entire set of conditional independence statements can be obtained from a polynomial time reachability algorithm based on the graph (Pearl, 1988). Second, as we discuss in the following section, the graphical structure can be exploited by algorithms for probabilistic inference.

Let us now consider the undirected case. Given an undirected graph G(V, E), we again let {X_v : v ∈ V} be a collection of random variables indexed by the nodes of the graph and let C denote a collection of cliques of the graph (i.e., fully connected subsets of nodes). Associated with each clique C ∈ C, let ψ_C(x_C) denote a nonnegative potential function. We define the joint probability p(x_V) by taking the product over these potential functions and normalizing,

(2)  p(x_V) = (1/Z) ∏_{C∈C} ψ_C(x_C),

where Z is a normalization factor obtained by integrating or summing the product with respect to x_V. See Figure 3 for an example of an undirected graphical model.

FIG. 3. An example of an undirected graphical model. Probability distributions associated with this graph can be factorized as p(x_V) = (1/Z) ψ(x_1, x_2) ψ(x_1, x_3) ψ(x_2, x_4) ψ(x_3, x_5) ψ(x_2, x_5, x_6).

Undirected graphs are often used in problems in areas such as spatial statistics, statistical natural language processing and communication networks—problems in which there is little causal structure to guide the construction of a directed graph. However, there is no need to restrict undirected models to such domains; in particular, it is possible to include parameters among the nodes of an undirected graph to yield an alternative general tool for Bayesian modeling. It is also possible to work with hybrids that include both directed and undirected edges (Lauritzen, 1996).

In general, directed graphs and undirected graphs make different assertions of conditional independence. Thus, there are families of probability distributions that are captured by a directed graph and are not captured by any undirected graph, and vice versa (Pearl, 1988).

The representations shown in (1) and (2) can be overly coarse for some purposes. In particular, in the undirected formalism the cliques C may be quite large, and it is often useful to consider potential functions that are themselves factorized in ways that need not be equated with conditional independencies. Thus, in general, we consider a set of "factors" {f_i(x_{C_i}) : i ∈ I} for some index set I, where C_i is the subset of nodes associated with the ith factor. Note in particular that the same subset can be repeated multiple times (i.e., we allow C_i = C_j for i ≠ j). We define a joint probability by taking the product across these factors:

(3)  p(x_V) = (1/Z) ∏_{i∈I} f_i(x_{C_i}).

As shown in Figure 4, this definition is associated with a graphical representation—the factor graph (Kschischang, Frey and Loeliger, 2001). A factor graph is a bipartite graph in which the random variables are round nodes and the factors appear as square nodes. There is an edge between the factor node f_i and the variable node X_v if and only if v ∈ C_i.

FIG. 4. An example of a factor graph. Probability distributions associated with this graph can be factorized as p(x_V) = (1/Z) f_a(x_1, x_3) f_b(x_3, x_4) f_c(x_2, x_4, x_5) f_d(x_1, x_3).

Factor graphs provide a more fine-grained representation of the factors that make up a joint probability and are useful in defining inference algorithms that exploit this structure (the topic of the following section).

Note also that the factor f_i(x_{C_i}) can often be written as exp(θ_i χ_i(x_{C_i})) for a parameter θ_i and a fixed function χ_i, in which case the representation in (3) is nothing but the canonical form of the exponential family. Thus factor graphs are widely used as graph-theoretic representations of exponential family distributions.

3. ALGORITHMS FOR PROBABILISTIC INFERENCE

The general problem of probabilistic inference is that of computing conditional probabilities p(x_F | x_E), where V = E ∪ F ∪ H for given subsets E, F and H. In this section we are concerned with algorithms for performing such computations and the role that graphical structure can play in making such computations efficient.

In discussing inference algorithms, it proves useful to treat directed graphs and undirected graphs on an equal footing. This is done by converting the former to the latter. Note in particular that (1) could be treated as a special case of (2) if it were the case that each factor p(x_i | x_{π_i}) were a function on a clique. In general, however, the parents π_i of a given node i are not connected and so the set π_i ∪ {i} is not a clique. We can force this set to be a clique by adding (undirected) edges between all of the parents π_i, essentially constructing a new graph that is a graphical cover of the original graph. If we also convert the directed edges (from parents to children) to undirected edges, the result is an undirected graphical cover—the so-called moral graph—in which all of the arguments of the function p(x_i | x_{π_i}) are contained in a clique. That is, in the moral graph, the factorization in (1) is a special case of (2). Thus we can proceed by working exclusively within the undirected formalism.

It is also useful to note that from a computational point of view the conditioning plays little essential role in the problem. Indeed, to condition on the event {X_E = x̄_E}, it suffices to redefine the original clique potentials. Thus, for i ∈ E, we multiply the potential ψ_C(x_C) by the Kronecker delta function δ(x_i, x̄_i) for any C such that i ∈ C ∩ E. The result is an unnormalized representation of the conditional probability that has the factorized form in (2). Thus, from a computational point of view, it suffices to focus on the problem of marginalization of the general factorized expression in (2). We are interested in controlling the growth in computational complexity of performing such marginalization, as a function of the cardinality of V. In the following three sections, we describe the three principal classes of methods that attempt to deal with this computational issue—exact algorithms, sampling algorithms and variational algorithms. Our presentation will be brief; for a fuller presentation, see Cowell et al. (1999) and Jordan (1999).

3.1 Exact Algorithms

We begin with an example. Consider the graphical model shown in Figure 3 and suppose that we wish to compute the marginal probability p(x_1). We obtain this marginal by summing (assuming discrete random variables) over the remaining variables:

(4)  p(x_1) = (1/Z) Σ_{x_2} Σ_{x_3} Σ_{x_4} Σ_{x_5} Σ_{x_6} ψ(x_1, x_2) ψ(x_1, x_3) ψ(x_2, x_4) ψ(x_3, x_5) ψ(x_2, x_5, x_6).

Naively, each of these sums is applied to a summand that involves six variables, and thus the computational complexity scales as r^6 (assuming for simplicity that each variable has cardinality r). We can obtain a smaller complexity, however, by exploiting the distributive law:

(5)  p(x_1) = (1/Z) Σ_{x_2} ψ(x_1, x_2) ψ(x_1, x_3) ψ(x_2, x_4) summed over x_3, x_4, with the inner sums
            = (1/Z) Σ_{x_2} Σ_{x_3} Σ_{x_4} ψ(x_1, x_2) ψ(x_1, x_3) ψ(x_2, x_4) Σ_{x_5} ψ(x_3, x_5) Σ_{x_6} ψ(x_2, x_5, x_6)
            = (1/Z) Σ_{x_2} Σ_{x_3} Σ_{x_4} ψ(x_1, x_2) ψ(x_1, x_3) ψ(x_2, x_4) Σ_{x_5} ψ(x_3, x_5) m_6(x_2, x_5)
            = (1/Z) Σ_{x_2} ψ(x_1, x_2) Σ_{x_3} ψ(x_1, x_3) m_5(x_2, x_3) Σ_{x_4} ψ(x_2, x_4)
            = (1/Z) Σ_{x_2} ψ(x_1, x_2) m_4(x_2) Σ_{x_3} ψ(x_1, x_3) m_5(x_2, x_3)
            = (1/Z) Σ_{x_2} ψ(x_1, x_2) m_4(x_2) m_3(x_1, x_2)
            = (1/Z) m_2(x_1),

where we have defined intermediate factors m_i in an obvious notation. We obtain the value of Z, and
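As a concrete numerical instance of the directed factorization in (1), the following sketch builds the joint distribution of a hypothetical three-node binary chain X_1 → X_2 → X_3 (the chain and its conditional tables are invented for illustration; they do not come from the paper) and checks that a product of normalized kernels is itself normalized.

```python
import itertools

# Hypothetical 3-node DAG: X1 -> X2 -> X3, each variable binary.
# Parent sets: pi_1 = {}, pi_2 = {1}, pi_3 = {2}.
p1 = {0: 0.6, 1: 0.4}                      # p(x1)
p2 = {(0, 0): 0.7, (1, 0): 0.3,            # p(x2 | x1): keys are (x2, x1)
      (0, 1): 0.2, (1, 1): 0.8}
p3 = {(0, 0): 0.9, (1, 0): 0.1,            # p(x3 | x2): keys are (x3, x2)
      (0, 1): 0.5, (1, 1): 0.5}

def joint(x1, x2, x3):
    """Equation (1): the joint is the product of the local conditionals."""
    return p1[x1] * p2[(x2, x1)] * p3[(x3, x2)]

# Because each kernel sums to 1 over its first argument, the product
# is automatically a normalized joint distribution.
total = sum(joint(*xs) for xs in itertools.product([0, 1], repeat=3))
print(round(total, 10))  # 1.0
```

The same pattern extends to any DAG: one table per node, indexed by the node's value and its parents' values.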
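The elimination steps in (5) can be checked numerically. The sketch below instantiates the clique structure of the Figure 3 model with arbitrary random potential tables (only the clique structure comes from the paper; the numbers are invented), computes the unnormalized marginal both by the naive O(r^6) sum in (4) and by the O(r^3) elimination order x_6, x_5, x_4, x_3, x_2 in (5), and verifies that the two agree.

```python
import itertools
import random

random.seed(0)
r = 3  # cardinality of each variable
vals = range(r)

# Arbitrary positive potentials on the cliques of the Figure 3 model.
def rand_table(k):
    return {xs: random.uniform(0.1, 1.0) for xs in itertools.product(vals, repeat=k)}

psi12, psi13, psi24, psi35 = (rand_table(2) for _ in range(4))
psi256 = rand_table(3)

def product_of_potentials(x):
    x1, x2, x3, x4, x5, x6 = x
    return (psi12[(x1, x2)] * psi13[(x1, x3)] * psi24[(x2, x4)]
            * psi35[(x3, x5)] * psi256[(x2, x5, x6)])

# Brute force: equation (4), summing the full product over x2..x6.
brute = [sum(product_of_potentials((x1,) + rest)
             for rest in itertools.product(vals, repeat=5))
         for x1 in vals]

# Elimination: equation (5), pushing each sum inside the product.
m6 = {(x2, x5): sum(psi256[(x2, x5, x6)] for x6 in vals)
      for x2 in vals for x5 in vals}
m5 = {(x2, x3): sum(psi35[(x3, x5)] * m6[(x2, x5)] for x5 in vals)
      for x2 in vals for x3 in vals}
m4 = {x2: sum(psi24[(x2, x4)] for x4 in vals) for x2 in vals}
m3 = {(x1, x2): sum(psi13[(x1, x3)] * m5[(x2, x3)] for x3 in vals)
      for x1 in vals for x2 in vals}
m2 = {x1: sum(psi12[(x1, x2)] * m4[x2] * m3[(x1, x2)] for x2 in vals)
      for x1 in vals}

# Summing m2 over x1 sums the product over all of x_V, which is Z itself,
# so normalizing m2 yields the marginal p(x1).
Z = sum(m2.values())
marginal = [m2[x1] / Z for x1 in vals]
```

Each intermediate table m_i involves at most three variables at a time, which is the source of the complexity reduction the text describes.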
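The delta-function trick for absorbing evidence can also be sketched on a toy model: a hypothetical three-node undirected chain X_1 - X_2 - X_3 with random potentials (the model and numbers are illustrative; only the trick itself comes from the text). Multiplying one potential by a Kronecker delta on the observed value and then marginalizing recovers the conditional distribution, up to normalization.

```python
import itertools
import random

random.seed(1)
r = 3
vals = range(r)

# A small undirected chain X1 - X2 - X3 with arbitrary random potentials.
psi12 = {xs: random.uniform(0.1, 1.0) for xs in itertools.product(vals, repeat=2)}
psi23 = {xs: random.uniform(0.1, 1.0) for xs in itertools.product(vals, repeat=2)}

x3_obs = 2  # the observed value of X3

# Absorb the evidence: multiply psi23 by the Kronecker delta on x3.
psi23_e = {(x2, x3): psi23[(x2, x3)] * (1.0 if x3 == x3_obs else 0.0)
           for x2 in vals for x3 in vals}

# Marginalizing the evidence-modified model gives an unnormalized
# representation of p(x1, x3_obs); normalizing over x1 gives p(x1 | x3_obs).
unnorm = [sum(psi12[(x1, x2)] * psi23_e[(x2, x3)]
              for x2 in vals for x3 in vals) for x1 in vals]
cond = [u / sum(unnorm) for u in unnorm]

# Check against the definition p(x1 | x3) = p(x1, x3) / p(x3).
joint = {xs: psi12[xs[:2]] * psi23[xs[1:]]
         for xs in itertools.product(vals, repeat=3)}
direct = [sum(v for xs, v in joint.items() if xs[0] == x1 and xs[2] == x3_obs)
          for x1 in vals]
direct = [d / sum(direct) for d in direct]
print(all(abs(a - b) < 1e-9 for a, b in zip(cond, direct)))  # True
```

This is why the text reduces conditioning to marginalization: once the deltas are folded into the potentials, the evidence-modified model has exactly the factorized form in (2).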