Probabilistic Programming Practical
Frank Wood, Brooks Paige
{fwood,brooks}@robots.ox.ac.uk
MLSS 2015

Setup

Java installation (Java version > 1.5)
Mac and Windows: download and run the installer from https://www.java.com/en/download/manual.jsp
Linux:
  # Debian/Ubuntu
  sudo apt-get install default-jre
  # Fedora
  sudo yum install java-1.7.0-openjdk

Leiningen installation (Leiningen version > 2.0)
  # Download lein to ~/bin
  mkdir ~/bin
  cd ~/bin
  wget http://git.io/XyijMQ -O lein
  # Make executable
  chmod a+x ~/bin/lein
  # Add ~/bin to path
  echo 'export PATH="$HOME/bin:$PATH"' >> ~/.bashrc
  # Run lein
  lein
Further details: http://leiningen.org/

Practical Materials
  • https://bitbucket.org/probprog/mlss2015/get/master.zip
  • cd mlss2015
  • lein gorilla
  • open the url

Schedule
  • 15:35 - 16:05 Intro / Hello World!
  • 16:05 - 16:30 Gaussian (you code)
  • 16:30 - 16:40 Discuss / intro to physics problem
  • 16:40 - 16:55 Physics (you code)
  • 16:55 - 17:00 Share / discuss solutions
  • 17:00 - 17:20 Inference explanation
  • 17:20 - 17:45 Poisson (you code)
  • 17:45 - 17:50 Inference Q/A
  • 17:50 - 18:05 Coordination (you code)

What is probabilistic programming?

An emerging field at the intersection of
  • ML: Algorithms & Applications
  • STATS: Inference & Theory
  • PL: Compilers, Semantics, Analysis

Conceptualization
  • CS view: parameters -> program -> output.
  • Probabilistic programming / statistics view: given a program and observations y, infer the latent parameters x via p(x | y) \propto p(y | x) p(x).

Operative Definition
"Probabilistic programs are usual functional or imperative programs with two added constructs: (1) the ability to draw values at random from distributions, and (2) the ability to condition values of variables in a program via observations."
Gordon et al., 2014

What are the goals of probabilistic programming?

Simplify machine learning...

[Figure: flowchart of the current workflow: identify and formalize the problem and gather data; design a model (read papers, do math) unless an existing model suffices; choose an approximate inference algorithm; derive updates and code the inference algorithm, search for a usable implementation, or, for simple models, implement with a high-level modeling tool; test, scale, deploy, iterating whenever statistical or computational performance is poor. The legend colors each edge by the skills required to traverse it: PhD-level machine learning or statistics, PhD-level machine learning or computer science, or non-specialist.]

...to this:

[Figure: the same workflow with probabilistic programming: the model is designed by writing a probabilistic program and inference comes with the system, leaving debugging, testing, profiling, scaling, and deployment; every edge is traversable by a non-specialist.]

Automate Inference
[Figure: system stack: the probabilistic programming language acts as a representation / abstraction layer between models / stochastic simulators above and the inference engine(s) below.]
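Before turning to example models, here is a minimal sketch of the two constructs in the definition above, written in the Clojure/Anglican style used throughout this practical (the sample/observe forms and the beta and flip distribution constructors mirror the examples later in these slides; the specific toy program is an illustrative assumption, not part of the practical materials):

  ;; (1) draw a value at random from a distribution;
  ;; (2) condition a variable of the program via an observation.
  (let [bias (sample (beta 1 1))]   ; random draw: unknown coin bias
    (observe (flip bias) true)      ; condition: the coin came up heads
    bias)                           ; quantity whose posterior we want

Run forward, this is just a simulator; once the observe is taken into account, inference recovers the posterior distribution over bias.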
Models / Stochastic Simulators: three examples follow.

[Excerpt from Caron et al., shown as an example of a nontrivial stochastic model:]

This lack of consistency is shared by other models based on the Pólya urn construction (Zhu et al., 2005; Ahmed and Xing, 2008; Blei and Frazier, 2011). Blei and Frazier (2011) provide a detailed discussion on this issue and describe cases where one should or should not bother about it. It is possible to define a slightly modified version of our model that is consistent under marginalisation, at the expense of an additional set of latent variables. This is described in Appendix C.

3.2 Stationary Models for Cluster Locations

To ensure we obtain a first-order stationary Pitman-Yor process mixture model, we also need to satisfy (B). This can be easily achieved if, for k \in I(m_t),

  U_{k,t} \sim P(\cdot | U_{k,t-1})  if k \in I(m_{t-1}),
  U_{k,t} \sim H                     otherwise,

where H is the invariant distribution of the Markov transition kernel P(\cdot | \cdot). In the time series literature, many approaches are available to build such transition kernels based on copulas (Joe, 1997) or Gibbs sampling techniques (Pitt and Walker, 2005). Combining the stationary Pitman-Yor and cluster locations models, we can summarize the full model by the Bayesian network in Figure 1. It can also be summarized using a Chinese restaurant metaphor (see Figure 2).

Figure 1: A representation of the time-varying Pitman-Yor process mixture as a directed graphical model, representing conditional independencies between variables. All assignment variables and observations at time t are denoted c_t and z_t, respectively.

3.3 Properties of the Models

Under the uniform deletion model, the number A_t = \sum_i m_{i,t-1} of alive allocation variables at time t can be written as A_t = \sum_{j=1}^{t-1} \sum_{k=1}^{n} X_{j,k} [...]

[Figure: directed graphical model with nodes c_t, \pi_m, H_y, \theta_m and r_t, s_t, y_t for t = 0 ... T.]

Gaussian Mixture Model

[Figure: from left to right, graphical models for a finite Gaussian mixture model (GMM), a Bayesian GMM, and an infinite GMM.]

  c_i | \vec{\pi} \sim Discrete(\vec{\pi})
  \vec{y}_i | c_i = k; \Theta \sim Gaussian(\cdot | \theta_k)
  \vec{\pi} | \alpha \sim Dirichlet(\cdot | \alpha/K, ..., \alpha/K)
  \Theta \sim G_0

Latent Dirichlet Allocation

[Figure 1: graphical model for LDA, with hyperparameters \alpha and \beta, topic mixtures \theta_d for d = 1 ... D, topics \phi_k for k = 1 ... K, and assignments z_i^d and words w_i^d for i = 1 ... N_d.]

LDA is a hierarchical model used to model text documents. Each document is modeled as a mixture of topics. Each topic is defined as a distribution over the words in the vocabulary. Here, we will denote by K the number of topics in the model. We use D to indicate the number of documents, M to denote the number of words in the vocabulary, and N_d to denote the number of words in document d. We will assume that the words have been translated to the set of integers {1, ..., M} through the use of a static dictionary. This is for convenience only and the integer mapping will contain no semantic information.

The generative model for the D documents can be thought of as sequentially drawing a topic mixture \theta_d for each document independently from a Dir_K(\alpha \vec{1}) distribution, where Dir_K(\vec{\gamma}) is a Dirichlet distribution over the K-dimensional simplex with parameters [\gamma_1, \gamma_2, ..., \gamma_K]. Each of the K topics {\phi_k}_{k=1}^K is drawn independently from Dir_M(\beta \vec{1}). Then, for each of the i = 1 ... N_d words in document d, an assignment variable z_i^d is drawn from Mult(\theta_d). Conditional on the assignment variable z_i^d, word i in document d, denoted w_i^d, is drawn independently from Mult(\phi_{z_i^d}). The graphical model for the process can be seen in Figure 1. The model is parameterized by the vector-valued parameters {\theta_d}_{d=1}^D and {\phi_k}_{k=1}^K, the assignments {z_i^d}_{d=1,...,D, i=1,...,N_d}, and the scalar positive parameters \alpha and \beta. The model is formally written as:

  \theta_d \sim Dir_K(\alpha \vec{1})
  \phi_k \sim Dir_M(\beta \vec{1})
  z_i^d \sim Discrete(\theta_d)
  w_i^d \sim Discrete(\phi_{z_i^d})

Hello World!

First Exercise: Gaussian Unknown Mean
Learning objectives:
  1. Clojure
  2. Gorilla REPL
  3. Anglican
  4. Automatic inference over generative models expressed as programs via query
Resources:
  • https://clojuredocs.org/
  • https://bitbucket.org/probprog/anglican/
  • http://www.robots.ox.ac.uk/~fwood/anglican/index.html
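One possible shape of a solution, to give the flavour of what the exercise asks for: the defquery/doquery forms and the normal distribution constructor follow the Anglican documentation linked above, while the prior parameters, data, and namespace layout here are illustrative assumptions, not the reference solution.

  (ns gaussian-practical
    (:require [anglican.core :refer [doquery]])
    (:use [anglican runtime emit]))             ; runtime: distributions, emit: defquery

  ;; Unknown mean mu with a Gaussian prior and known observation noise;
  ;; two observations kept explicit for simplicity.
  (defquery gaussian-mean [data]
    (let [mu    (sample (normal 1.0 2.24))      ; prior roughly Normal(1, sqrt 5) (illustrative)
          sigma 1.41]                           ; known std, roughly sqrt 2 (illustrative)
      (observe (normal mu sigma) (first data))  ; condition on each data point
      (observe (normal mu sigma) (second data))
      mu))

  ;; Lazy sequence of posterior samples of mu via lightweight Metropolis-Hastings:
  (take 1000 (doquery :lmh gaussian-mean [[9.0 8.0]]))

Each element of the returned sequence is one posterior sample state; see the Anglican documentation above for how to extract the value and its weight, and plot a histogram of the sampled means in your Gorilla REPL worksheet as a first check.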
Simulation

Second Exercise
Learning objectives:
  1. Develop experience thinking about expressing problems as inference over program executions
  2. Understand how to perform inference over a complex deterministic generative process, here a 2D-physics simulator

Second Exercise
Use inference to solve a mechanism design optimization task:
  • get all balls safely in the bin

Inference

Trace Probability
  • observe data points y_1, y_2, ...
  • internal random choices x_1, x_2, ... (grouped as x_{1,1}, x_{1,2}, x_{1,3}, x_{2,1}, x_{2,2}, etc. between observes)
  • simulate from f by running the program forward
  • weight traces by the observe likelihoods g

  p(y_{1:N}, x_{1:N}) = \prod_{n=1}^{N} g(y_n | x_{1:n}) f(x_n | x_{1:n-1})

where f(x_n | x_{1:n-1}) is the distribution of the random choices made between observations and g(y_n | x_{1:n}) is the likelihood of observation y_n.

Trace
[Figure: tree of possible execution traces of the program below, branching on x_{1,2} = 0, 1, 2 and then, when x_{1,2} \ne 1, on x_{2,1} = 7 or 9 and on the Poisson draw x_{2,2} = 0, 1, ...]

  (let [x-1-1 3
        x-1-2 (sample (discrete (range x-1-1)))]
    (if (not= x-1-2 1)
      (let [x-2-1 (+ x-1-2 7)]
        (sample (poisson x-2-1)))))

Observe
Adding an observe statement reweights the traces by the likelihood of the observed value:

  (let [x-1-1 3
        x-1-2 (sample (discrete (range x-1-1)))]
    (if (not= x-1-2 1)
      (let [x-2-1 (+ x-1-2 7)]
        (sample (poisson x-2-1))
        ;; condition: x-2-1 is observed to be (very nearly) 7
        (observe (gaussian x-2-1 0.0001) 7))))

"Single Site" MCMC = LMH
  • The posterior distribution over execution traces is proportional to the trace score with the observed values plugged in.
  • Metropolis-Hastings acceptance rule.
"Lightweight Implementations of Probabilistic Programming Languages Via Transformational Compilation" [Wingate, Stuhlmüller et al., 2011]

Need:
  • a proposal
Have:
  • likelihoods (via observe statement restrictions)
  • the prior (the sequence of ERP returns; scored in the interpreter)

  min( 1, [ p(y | x') p(x') q(x | x') ] / [ p(y | x) p(x) q(x' | x) ] )

Example program (assume / observe / predict directives):

  [assume poi-1 (beta 7 4)]
  [assume poi-2 (+ 1 (poisson 8))]
  [assume generative-model-part-1
    (lambda (a b)
      (if (flip a)
        (+ (poisson b) 7)
        (normal a b)))]
  [assume generative-model-part-2 (gamma poi-1 poi-2)]
  [observe (normal (generative-model-part-1 poi-1 poi-2) 1) 18]
  [observe (normal 0 generative-model-part-2) 6]
  [predict (list poi-1 poi-2)]

One execution trace of this program is scored, choice by choice, as

  x_{1,1} \sim Beta(7, 4)
  x_{1,2} | x_{1,1} = 0.4 \sim Poisson(8)
  x_{1,3} | x_{1,2} = 6, x_{1,1} = 0.4 \sim Binomial(0.4)
  x_{1,4} | x_{1,3} = true, x_{1,2} = 6, x_{1,1} = 0.4 \sim Poisson(4)
  y_1 = 18 | x_{1,4} = 7, x_{1,3} = true, x_{1,2} = 6, x_{1,1} = 0.4 \sim Normal(7, 1)
  x_{2,1} | ... \sim Gamma(0.4, 6)
  y_2 = 6 | x_{2,1} = 6.92, ... \sim Normal(0, 6.92)

and the posterior over traces is proportional to this score with the observed values plugged in:

  p(x | y) \propto \tilde{p}(y = observes, x)
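The unnormalized trace score \tilde{p}(y, x) is nothing more than the product of the prior probabilities of all sampled values and the likelihoods of all observed values. A minimal bookkeeping sketch in plain Clojure (the map layout and the numbers are illustrative assumptions, not Anglican's internal trace representation):

  ;; A recorded trace: the log prior probability of every sample statement
  ;; and the log likelihood of every observe statement (made-up numbers).
  (def example-trace
    {:sample-log-probs  [-0.92 -2.26 -0.51]   ; log f(x_n | x_{1:n-1}) terms
     :observe-log-probs [-1.42 -3.05]})       ; log g(y_n | x_{1:n}) terms

  (defn log-trace-score
    "log p~(y, x): the sum of all log prior terms and log observe terms."
    [{:keys [sample-log-probs observe-log-probs]}]
    (+ (reduce + 0.0 sample-log-probs)
       (reduce + 0.0 observe-log-probs)))

  (log-trace-score example-trace)   ; => -8.16

LMH, described next, works directly on this kind of representation: change one sampled value, rerun the rest of the program, and compare the old and new scores.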
[Excerpt from a manuscript under review by AISTATS 2014 (Wood, van de Meent, and Mansinghka, "A New Approach to Probabilistic Programming Inference"):]

Note that there are more efficient PMCMC algorithms for probabilistic programming inference. In particular, there is no reason to fork unless an observe has just been interpreted. Alg. 1 is presented in this form for expositional purposes.

3.2 Random Database

We refer to the MH approach to sampling over the space of all traces proposed in [13] as "random database" (RDB). A RDB sampler is a MH sampler where a single variable drawn in the course of a particular interpretation of a probabilistic program is modified via a standard MH proposal, and this modification is accepted by comparing the value of the joint distribution of old and new program traces. For completeness we review RDB here, noting a subtle correction to the acceptance ratio proposed in the original reference which is proper for a larger family of models.

The RDB sampler employs a data structure that holds all random variables x associated with an execution trace, along with the parameters and log probability of each draw. Note that interpretation of a program is deterministic conditioned on x. A new proposal trace is initialized by picking a single variable x_{m,j} from the |x| random draws, and resampling its value using a reversible kernel \kappa(x'_{m,j} | x_{m,j}). Starting from this initialization, the program is rerun to generate a new set of variables x' that correspond to a new valid execution trace. In each instance where the random procedure type remains the same, we reuse the existing value from the set x, rescoring its log probability conditioned on the preceding variables where necessary. When the random procedure type has changed, or a new random variable is encountered, its value is sampled in the usual manner. Finally, we compute the probability p(y | x') by rescoring each observe as needed, and accept with probability

  min( 1, [ p(y | x') p(x') q(x | x') ] / [ p(y | x) p(x) q(x' | x) ] ).   (3)

In order to calculate the ratio of the proposal probabilities q(x' | x) and q(x | x'), we need to account for the variables that were resampled in the course of constructing the proposal, as well as the fact that the sets x' and x may have different cardinalities |x'| and |x|. We will use the (slightly inaccurate) notation x' \setminus x to refer to the set of variables that were resampled, and let x' \cap x represent the set of variables common to both execution traces. The proposal probability is now given by

  q(x' | x) = [ \kappa(x'_{m,j} | x_{m,j}) / |x| ] * [ p(x' \setminus x | x' \cap x) / p(x'_{m,j} | x' \cap x) ].   (4)

In our implementation, the initialization x'_{m,j} is simply resampled conditioned on the preceding variables, such that \kappa(x'_{m,j} | x_{m,j}) = p(x'_{m,j} | x' \cap x). The reverse proposal density can now be expressed in a similar fashion in terms of x \setminus x' and x \cap x' = x' \cap x, allowing the full acceptance probability to be written as

  [ p(y | x') p(x') |x| p(x \setminus x' | x \cap x') ] / [ p(y | x) p(x) |x'| p(x' \setminus x | x' \cap x) ].   (5)
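A toy sketch of the random-database bookkeeping just described, in plain Clojure: one entry per random choice, keyed by an address in the trace, storing the ERP type, parameters, value, and log probability; on re-execution an entry is reused and rescored when the ERP type is unchanged, and resampled otherwise. The map layout, address scheme, and score/draw functions are illustrative assumptions, not the representation used by Anglican or any other system.

  ;; Example database for part of a trace (illustrative values).
  (def rdb
    {[:x 1 1] {:type :beta    :params [7 4] :value 0.4 :log-prob -0.30}
     [:x 1 2] {:type :poisson :params [8]   :value 6   :log-prob -2.10}})

  (defn reuse-or-resample
    "When re-execution reaches `addr` with ERP `type` and `params`:
     reuse the stored value and rescore it if the type matches,
     otherwise draw a fresh value. `score-fn` returns the log probability
     of a value under the ERP; `draw-fn` samples from it."
    [rdb addr type params score-fn draw-fn]
    (let [entry (get rdb addr)]
      (if (and entry (= (:type entry) type))
        {:value (:value entry)
         :log-prob (score-fn params (:value entry))
         :resampled? false}
        (let [v (draw-fn params)]
          {:value v :log-prob (score-fn params v) :resampled? true}))))

  ;; Rescore the reused Poisson draw under changed parameters, e.g. rate 9
  ;; (toy Poisson log-pmf and a stand-in sampler):
  (reuse-or-resample rdb [:x 1 2] :poisson [9]
                     (fn [[rate] k]
                       (- (* k (Math/log rate)) rate
                          (reduce + 0.0 (map #(Math/log %) (range 1 (inc k))))))
                     (fn [[rate]] 4))

Tracking which entries were reused and which were freshly drawn is exactly the information needed for the proposal ratio in equations (4) and (5).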
4 Testing

Programming probabilistic program interpreters is a non-trivial software development effort, involving both the correct implementation of an interpreter and the correct implementation of a general purpose sampler. The methodology we employ to ensure correctness of both involves three levels of testing: 1) unit tests, 2) measure tests, and 3) conditional measure tests.

4.1 Unit and Measure Tests

In the context of probabilistic programming, unit testing includes verifying that the interpreter correctly interprets a comprehensive set of small deterministic programs. Measure testing involves interpreting short revealing programs consisting of assume and predict statements (producing a sequence of ancestral, unconditioned samples, i.e. no observe's). Interpreter output is tested relative to ground truth, where ground truth is computed via exhaustive enumeration, analytic derivation, or some combination, and always in a different, well-tested, independent computational system like Matlab. Various comparisons of the empirical distribution constructed from the accumulating stream of output predict's and ground truth are computed: Kullback-Leibler (KL) divergences for discrete sample spaces and Kolmogorov-Smirnov (KS) test statistics for continuous sample spaces. While it is possible to construct distribution equality hypothesis tests for some combinations of test statistic and program, we generally are content to accept interpreters for which there is clear evidence of convergence towards zero of all test statistics for all measure tests. Anglican passed all unit and measure tests.

4.2 Conditional Measure Tests

Measure tests involving conditioning provide additional information beyond that provided by measure and unit tests. Conditioning involves endowing programs with observe statements which constrain or weight the set of possible execution traces. Interpreting observe statements engages the full inference machinery. Conditional measure test performance is measured in the same way as measure test performance.

LMH Proposal [Wingate, Stuhlmüller et al., 2011]
Reading the proposal probability (4) term by term:
  • \kappa(x'_{m,j} | x_{m,j}): resampling a single stochastic procedure (SP) output
  • |x|: the number of SP's in the original trace
  • p(x'_{m,j} | x' \cap x): the probability of the new SP return value (sample) given the trace prefix
  • p(x' \setminus x | x' \cap x): the probability of the new part of the proposed execution trace
LMH Implementation
"Single site update" = sample from the prior = run the program forward. Reading the simplified MH acceptance ratio (5) term by term:
  • |x|: the number of SP applications in the original trace
  • p(x \setminus x' | x \cap x'): the probability of regenerating the current trace continuation given the proposal trace beginning
  • |x'|: the number of SP applications in the new trace
  • p(x' \setminus x | x' \cap x): the probability of generating the proposal trace continuation given the current trace beginning
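A compact sketch of the acceptance computation in equation (5), in log space and plain Clojure. The per-trace summary fields (log likelihood, log prior, number of SP applications, and the log probability of the freshly generated portion of each trace) are assumed to be produced by the interpreter as described above; the field names are illustrative.

  (defn lmh-log-accept-ratio
    "Log of the RDB/LMH acceptance ratio in equation (5):
     p(y|x') p(x') |x| p(x\\x' | x ∩ x')  over  p(y|x) p(x) |x'| p(x'\\x | x' ∩ x)."
    [{:keys [log-lik-old log-prior-old n-old log-regen-old     ; current trace x
             log-lik-new log-prior-new n-new log-regen-new]}]  ; proposed trace x'
    (- (+ log-lik-new log-prior-new (Math/log n-old) log-regen-old)
       (+ log-lik-old log-prior-old (Math/log n-new) log-regen-new)))

  (defn accept-proposal?
    "Accept the proposed trace with probability min(1, exp(log ratio))."
    [summary]
    (< (Math/log (rand)) (lmh-log-accept-ratio summary)))

Here log-regen-old stands for log p(x \setminus x' | x \cap x'), the probability of regenerating the current trace's continuation from the proposal's beginning, and log-regen-new for log p(x' \setminus x | x' \cap x), its counterpart for the proposed trace.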
Sequential Monte Carlo

SMC targets the posterior over execution traces,

  p(x_{1:N} | y_{1:N}) \propto \tilde{p}(y_{1:N}, x_{1:N}) \equiv \prod_{n=1}^{N} g(y_n | x_{1:n}) f(x_n | x_{1:n-1}),

with a weighted set of particles. Given a sample-based approximation to p(x_{1:n-1} | y_{1:n-1}), we can use importance sampling to generate samples from p(x_{1:n} | y_{1:n}) by proposing from

  q(x_{1:n} | y_{1:n}) = f(x_n | x_{1:n-1}) p(x_{1:n-1} | y_{1:n-1}),

i.e. by simply running the program forward,

  x_n^l \sim f(x_n | x_{1:n-1}^{a_{n-1}^l}),

and noting the identity

  p(x_{1:n} | y_{1:n}) / q(x_{1:n} | y_{1:n}) = [ g(y_n | x_{1:n}) f(x_n | x_{1:n-1}) p(x_{1:n-1} | y_{1:n-1}) ] / [ f(x_n | x_{1:n-1}) p(x_{1:n-1} | y_{1:n-1}) ] = g(y_n | x_{1:n}),

so that each particle is weighted by its observe likelihood,

  \tilde{w}_n^l = g(y_n | x_{1:n}^l),   w_n^l = \tilde{w}_n^l / \sum_{l=1}^{L} \tilde{w}_n^l.

[Excerpt from Paige and Wood, "A Compilation Target for Probabilistic Programming Languages", ICML 2014:]

In this manner, any model with a generative process that can be described in arbitrary C code can be represented in this sequential form in the space of program execution traces. Each observe statement takes as its argument ln g(y_n | x_{1:n}). Each quantity of interest in a predict statement corresponds to some deterministic function h(·) of all random choices x_{1:N} made during the execution of the program. Given a set of S posterior samples {x_{1:N}^{(s)}}, we can approximate the posterior distribution of the predict value as

  h(x_{1:N}) \approx (1/S) \sum_{s=1}^{S} h(x_{1:N}^{(s)}).   (10)

3.2. Sequential Monte Carlo

Forward simulation-based algorithms are a natural fit for probabilistic programs: run the program and report executions that match the data. Sequential Monte Carlo (SMC, sequential importance resampling) forms the basic building block of other, more complex particle-based methods, and can itself be used as a simple approach to probabilistic programming inference. SMC approximates a target density p(x_{1:N} | y_{1:N}) as a weighted set of L realized trajectories x_{1:N}^l such that

  p(x_{1:N} | y_{1:N}) \approx \sum_{l=1}^{L} w_N^l \delta_{x_{1:N}^l}(x_{1:N}).   (11)

For most probabilistic programs of interest, it will be intractable to sample from p(x_{1:N} | y_{1:N}) directly. Instead, noting that (for n > 1) we have the recursive identity

  p(x_{1:n} | y_{1:n}) = p(x_{1:n-1} | y_{1:n-1}) g(y_n | x_{1:n}) f(x_n | x_{1:n-1}),   (12)

we sample from p(x_{1:N} | y_{1:N}) by iteratively sampling from each p(x_{1:n} | y_{1:n}), in turn, from 1 through N. At each n, we construct an importance sampling distribution by proposing from some distribution q(x_n | x_{1:n-1}, y_{1:n}); in probabilistic programs we find it convenient to propose directly from the executions of the program, i.e. each sequence of random variates x_n is jointly sampled from the program execution state dynamics

  x_n^l \sim f(x_n | x_{1:n-1}^{a_{n-1}^l}),   (13)

where a_{n-1}^l is an "ancestor index," the particle index 1, ..., L of the parent (at time n-1) of x_n^l. The unnormalized particle importance weights at each observation y_n are simply the observe data likelihood

  \tilde{w}_n^l = g(y_n | x_{1:n}^l).   (14)
Algorithm 1 Parallel SMC program execution
  Assume: N observations, L particles
  launch L copies of the program                        (parallel)
  for n = 1 ... N do
    wait until all L reach observe y_n                  (barrier)
    update unnormalized weights \tilde{w}_n^{1:L}       (serial)
    if ESS < \tau then
      sample number of offspring O_n^{1:L}              (serial)
      set weight \tilde{w}_n^{1:L} = 1                  (serial)
      for l = 1 ... L do
        fork or exit                                    (parallel)
      end for
    else
      set all number of offspring O_n^l = 1             (serial)
    end if
    continue program execution                          (parallel)
  end for
  wait until L program traces terminate                 (barrier)
  predict from L samples from \hat{p}(x_{1:N}^{1:L} | y_{1:N})   (serial)

These unnormalized weights can be normalized as

  w_n^l = \tilde{w}_n^l / \sum_{l=1}^{L} \tilde{w}_n^l.   (15)

After each step n, we now have a weighted set of execution traces which approximate p(x_{1:n} | y_{1:n}). As the program continues, traces which do not correspond well with the data will have weights which become negligibly small, leading in the worst case to all weight concentrated on a single execution trace. To counteract this deficiency, we resample from our current set of L execution traces after each observation y_n, according to their weights w_n^l. This is achieved by sampling a count O_n^l for the number of "offspring" of a given execution trace l to be included at time n+1. Any sampling scheme must ensure E[O_n^l] = w_n^l. Sampling offspring counts O_n^l is equivalent to sampling ancestor indices a_n^l. Program execution traces with no offspring are killed; program execution traces with more than one offspring are forked multiple times. After resampling, all weights w_n^l = 1. We only resample if the effective sample size

  ESS \approx 1 / \sum_l (w_n^l)^2   (16)

is less than some threshold value \tau; we choose \tau = L/2.

In probabilistic C, each observe statement forms a barrier: parallel execution cannot continue until all particles have arrived at the observe and have reported their current unnormalized weight.
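A stand-alone Clojure sketch of the effective-sample-size test and a simple multinomial resampling of ancestor indices, to make the resampling step concrete (illustrative only; practical systems typically use lower-variance schemes such as systematic or residual resampling):

  (defn normalize
    "Normalize a vector of unnormalized weights."
    [ws]
    (let [total (reduce + ws)]
      (mapv #(/ % total) ws)))

  (defn ess
    "Effective sample size 1 / sum_l (w^l)^2 of normalized weights."
    [ws]
    (/ 1.0 (reduce + (map #(* % %) ws))))

  (defn multinomial-resample
    "Draw L ancestor indices with probability proportional to the weights,
     so that offspring counts have the required expectation."
    [ws]
    (let [cdf (vec (reductions + ws))
          L   (count ws)]
      (vec (repeatedly L
                       #(let [u (rand)]
                          (count (take-while (fn [c] (< c u)) cdf)))))))

  ;; Resample only when ESS < tau = L/2, as in Algorithm 1:
  (let [ws (normalize [0.05 2.0 0.7 0.01])]
    (if (< (ess ws) (/ (count ws) 2))
      (multinomial-resample ws)
      (vec (range (count ws)))))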
Algorithm 1 Particle Gibbs Prob. Prog. Inference
  L: number of particles, S: number of sweeps
  {\tilde{w}_N^{(l)}, x_{1:N}^{(l)}} <- run SMC
  for s < S do
    {·, x*_{1:N}} <- r(1, {1/L, x_{1:N}^{(l)}})           ; retain one particle
    {·, x_0^{(l)} = {}} <- initialize L-1 interpreters
    for d in ordered lines of program do
      for l < L-1 do
        \bar{x}_{1:(n-1)}^{(l)} <- fork(x_{1:(n-1)}^{(l)})
      end for
      if directive(d) == assume then
        for l < L-1 do
          \bar{x}_{1:n}^{(l)} <- interpret(d, \bar{x}_{1:(n-1)}^{(l)})
        end for
        {x_{1:n}^{(l)}} <- {\bar{x}_{1:n}^{(l)}} \cup x*_{1:n}
      else if directive(d) == predict then
        for l < L-1 do
          interpret(d, \bar{x}_{1:(n-1)}^{(l)})
        end for
        interpret(d, x*_{1:(n-1)})
      else if directive(d) == observe then
        for l < L-1 do
          {\bar{w}_n^{(l)}, \bar{x}_{1:n}^{(l)}} <- interpret(d, \bar{x}_{1:(n-1)}^{(l)})
        end for
        T <- r(L-1, {\bar{w}_n^{(l)}, \bar{x}_{1:n}^{(l)}} \cup {\tilde{w}*_{1:n}, x*_{1:n}})
        {\tilde{w}_n^{(l)}, x_{1:n}^{(l)}} <- T \cup {\tilde{w}*_n, x*_{1:n}}
      end if
    end for
  end for

  p(x_{1:N} | y_{1:N}) \approx \sum_{l=1}^{L} w_N^l \delta_{x_{1:N}^l}(x_{1:N})

  p(x_{1:n} | y_{1:n}) = g(y_n | x_{1:n}) f(x_n | x_{1:n-1}) p(x_{1:n-1} | y_{1:n-1})

  p(x_{1:n} | y_{1:n}) = \sum_{l=1}^{L} w_{n-1}^l g(y_n | x_{1:n}) f(x_n | x_{1:n-1}^l) p(x_{1:n-1}^l | y_{1:n-1})

SMC, iteratively: simulate, weight, resample.
[Figure: particles (rows) advancing between observes (columns n = 1, 2, ...); the program is run forward until the next observe (this forward run is the proposal) and the weight of each particle is its observation likelihood.]

SMC for Probabilistic Programming
  • Fischer, Kiselyov, and Shan. "Purely functional lazy non-deterministic programming." ACM SIGPLAN 2009.
  • Wood, van de Meent, and Mansinghka. "A New Approach to Probabilistic Programming Inference." AISTATS 2014.
  • Paige and Wood. "A Compilation Target for Probabilistic Programming Languages." ICML 2014.
SMC Methods Only Require
[Figure: a sequence of environments, one per observe, with parallel executions between them.]
  • Initialization: sample from p(x_1)
  • Forward simulation: sample from f(x_n | x_{1:n-1})
  • Observation likelihood computation: pointwise evaluation of g(y_n | x_{1:n}), up to normalization

The key identities, in terms of these three operations:

  E_{p(x_{1:n} | y_{1:n})}[h(x_{1:n})] \approx (1/L) \sum_{l=1}^{L} h(x_{1:n}^l) w_n^l,   x_{1:n}^l \sim q(x_{1:n} | y_{1:n}),   w_n^l = p(x_{1:n}^l | y_{1:n}) / q(x_{1:n}^l | y_{1:n})

  p(x_{1:n-1} | y_{1:n-1}) \approx \sum_{l=1}^{L} w_{n-1}^l \delta_{x_{1:n-1}^l}(x_{1:n-1})

  p(x_{1:n} | y_{1:n}) = g(y_n | x_{1:n}) f(x_n | x_{1:n-1}) p(x_{1:n-1} | y_{1:n-1})

  q(x_{1:n} | y_{1:n}) = f(x_n | x_{1:n-1}) p(x_{1:n-1} | y_{1:n-1})

  p(x_{1:n} | y_{1:n}) \approx \sum_{l=1}^{L} g(y_n | x_{1:n}^l) w_{n-1}^l \delta_{x_{1:n}^l}(x_{1:n}),   x_{1:n}^l = (x_n^l, x_{1:n-1}^{a_{n-1}^l}),   x_n^l \sim f

For a state-space model with parameters \theta,

  p(y_{1:N}, x_{0:N}, \theta) = p(\theta) p(x_0 | \theta) \prod_{n=1}^{N} p(y_n | x_n, \theta) p(x_n | x_{n-1}, \theta),

the marginal Metropolis-Hastings acceptance ratio

  min( 1, [ p(y_{1:N} | \theta') p(\theta') q(\theta | \theta') ] / [ p(y_{1:N} | \theta) p(\theta) q(\theta' | \theta) ] )

can be replaced by the pseudo-marginal version

  min( 1, [ \hat{Z}' p(\theta') q(\theta | \theta') ] / [ \hat{Z} p(\theta) q(\theta' | \theta) ] ),   \hat{Z} \equiv p(y_{1:N} | \theta) \approx \prod_{n=1}^{N} [ (1/L) \sum_{l=1}^{L} \tilde{w}_n^l ],

where \hat{Z} is the SMC estimate of the marginal likelihood.
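The estimate \hat{Z} is cheap to accumulate during an SMC sweep: it is the product, over observes, of the average unnormalized weight. A small pure-Clojure sketch (the weight collections here are made-up placeholders):

  (defn log-marginal-likelihood-estimate
    "log Z-hat = sum over observes of log( (1/L) * sum of unnormalized weights ).
     `weights-per-observe` is a sequence of L-vectors of unnormalized weights,
     one vector per observe statement."
    [weights-per-observe]
    (reduce + 0.0
            (map (fn [ws] (Math/log (/ (reduce + ws) (count ws))))
                 weights-per-observe)))

  ;; Example with N = 2 observes and L = 4 particles:
  (log-marginal-likelihood-estimate [[0.2 0.5 0.1 0.4] [0.3 0.3 0.6 0.2]])

This is the quantity that enters the PIMH / particle marginal Metropolis-Hastings acceptance ratio above.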
SMC for probabilistic programming, intuitively: run, wait, fork.
[Figure: threads running in parallel, with observe statements acting as delimiters between continuations; the SMC inner loop repeats this pattern at every observe n.]

  • Sequential Monte Carlo is now a building block for other inference techniques
  • Particle MCMC [Andrieu, Doucet, Holenstein 2010], applied to probabilistic programs [Wood, van de Meent, Mansinghka 2014]:
    - PIMH: "particle independent Metropolis-Hastings"
    - iCSMC: "iterated conditional SMC", run over sweeps s = 1, 2, 3, ...

Third Exercise: Poisson
Learning objectives:
  1. Understand traces
  2. Understand on which variables inference algorithms operate
  3. Develop intuitions about how LMH and SMC inference variants work in probabilistic programming systems
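For objective 3 it helps to run one small query under both back ends and watch how the answers evolve. A sketch, reusing the namespace setup from the Gaussian sketch earlier: the defquery/doquery forms and the :lmh and :smc algorithm keywords follow the Anglican documentation, while the query itself, the data, and the :number-of-particles value are illustrative assumptions:

  ;; A toy Poisson-rate model: Gamma prior on the rate, one observed count.
  (defquery poisson-rate [k]
    (let [rate (sample (gamma 1 1))]
      (observe (poisson rate) k)
      rate))

  ;; Lightweight Metropolis-Hastings: single-site updates on the trace.
  (def lmh-samples (take 5000 (doquery :lmh poisson-rate [7])))

  ;; Sequential Monte Carlo: particles forked or killed at each observe.
  (def smc-samples (take 5000 (doquery :smc poisson-rate [7]
                                       :number-of-particles 100)))

Comparing how quickly each sequence settles on the posterior over the rate gives a feel for the two inference strategies.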
Fourth Exercise: Coordination
Learning objectives:
  1. Learn how to write non-trivial models, including mutually recursive queries
Resources:
  • http://forestdb.org/
  • http://www.robots.ox.ac.uk/~fwood/anglican/examples/index.html

Systems
[Figure: timeline, roughly 1990-2010, of probabilistic programming systems from the PL, ML/STATS, and AI communities: HANSAI, IBAL, Figaro, WinBUGS, BUGS, JAGS, STAN, LibBi, Venture, Anglican, Church, webChurch, Probabilistic-C, infer.NET, Blog, Factorie, Prism, Prolog, KMP, Problog, Simula, and probabilistic variants of ML, Haskell, Scheme, ...; annotations mark systems restricted to discrete random variables only or to bounded recursion.]

Opportunities
  • Parallelism: "Asynchronous Anytime Sequential Monte Carlo" [Paige, Wood, Doucet, Teh; NIPS 2014]
  • Backwards passing: "Particle Gibbs with Ancestor Sampling for Probabilistic Programs" [van de Meent, Yang, Mansinghka, Wood; AISTATS 2015]
  • Search: "Maximum a Posteriori Estimation by Search in Probabilistic Models" [Tolpin, Wood; SOCS 2015]
  • Adaptation: "Output-Sensitive Adaptive Metropolis-Hastings for Probabilistic Programs" [Tolpin, van de Meent, Paige, Wood; ECML 2015]
  • Novel proposals: "Neural Adaptive Inference for Probabilistic Programming" [Paige, Wood; in submission]

Bubble Up Inference
[Figure: the probabilistic programming system stack: a probabilistic programming language supporting models, which in turn support applications; improvements to inference in the language layer bubble up to everything built above it.]