Probabilistic Programming Practical 
Frank Wood, Brooks Paige 
{fwood,brooks}@robots.ox.ac.uk 
MLSS 2015
Setup
Java
Installation
Mac and Windows:
Download and run the installer from
https://www.java.com/en/download/manual.jsp
Linux:
# Debian/Ubuntu
sudo apt-get install default-jre
# Fedora
sudo yum install java-1.7.0-openjdk
Java (> v. 1.5)
Leiningen
Installation
# Download lein to ~/bin
mkdir ~/bin
cd ~/bin
wget http://git.io/XyijMQ
# Make executable
chmod a+x ~/bin/lein
# Add ~/bin to path
echo 'export PATH="$HOME/bin:$PATH"' >> ~/.bashrc
# Run lein
lein
Further details: http://leiningen.org/
Leiningen (v. > 2.0)
Practical Materials
• https://bitbucket.org/probprog/mlss2015/get/
master.zip 
• cd mlss2015 
• lein gorilla 
• open the url 
5
Schedule
• 15:35 - 16:05 Intro/Hello World! 
• 16:05 - 16:30 Gaussian  (you code) 
• 16:30 - 16:40 Discuss / intro to physics problem 
• 16:40 - 16:55 Physics  (you code) 
• 16:55 - 17:00 Share / discuss solutions 
• 17:00 - 17:20 Inference explanation 
• 17:20 - 17:45 Poisson (you code) 
• 17:45 - 17:50 Inference Q/A 
• 17:50 - 18:05 Coordination (you code)
What is probabilistic 
programming?
An Emerging Field
ML: 
Algorithms &
Applications
STATS: 
Inference &
Theory
PL: 
Compilers,
Semantics,
Analysis
Probabilistic
Programming
Conceptualization
CS: Parameters → Program → Output
Probabilistic Programming / Statistics: Parameters p(x) → Program p(y|x) → Observations y, and inference recovers p(x|y) ∝ p(y|x)p(x)
Operative Definition
“Probabilistic programs are usual functional or 
imperative programs with two added constructs:  
(1) the ability to draw values at random from 
distributions, and  
(2) the ability to condition values of variables in a 
program via observations.”   
— Gordon et al., 2014
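The two added constructs can be sketched in plain Python (a hedged illustration, not Anglican): `sample` draws a value at random from a distribution, and `observe` conditions the program by weighting its trace. The model, names, and densities below are assumptions chosen for illustration.

```python
import math
import random

def run_trace(rng):
    """One forward execution of a tiny probabilistic program."""
    logw = 0.0
    x = rng.gauss(0.0, 1.0)          # construct (1), sample: x ~ Normal(0, 1)
    y = 2.0                          # an observed data point
    logw += -0.5 * (y - x) ** 2      # construct (2), observe: log Normal(y; x, 1) up to a constant
    return x, logw

rng = random.Random(0)
pairs = [run_trace(rng) for _ in range(5000)]
total = sum(math.exp(lw) for _, lw in pairs)
post_mean = sum(x * math.exp(lw) for x, lw in pairs) / total
# Conjugacy gives posterior Normal(1, 1/2), so the weighted estimate should land near 1.0
print(0.8 < post_mean < 1.2)
```

Here conditioning is realized by likelihood weighting; real systems (Anglican included) swap in more sophisticated inference behind the same two constructs.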
What are the goals of 
probabilistic 
programming?
Simplify Machine Learning…
[Flowchart: the traditional machine-learning workflow. Start → Identify And Formalize Problem, Gather Data → Existing Model Sufficient? → Design Model = Read Papers, Do Math → Simple Model? → Implement Using High Level Modeling Tool (if the tool supports required features) or Choose Approximate Inference Algorithm → Search For Useable Implementation → Exists And Can Use? → if not, and Feasible, Derive Updates And Code Inference Algorithm → Test → Performs Well Statistically? → Performs Well Computationally? → Scale → Deploy → End. Legend: edge color indicates the skills required to traverse the edge — PhD-level machine learning or statistics; PhD-level machine learning or computer science; non-specialist.]
To This
[Flowchart: the same workflow with probabilistic programming. Start → Identify And Formalize Problem, Gather Data → Existing Model Sufficient? → Design Model = Write Probabilistic Program → Automate Inference → Debug, Test, Profile → Performs Well? → Performs Well Computationally? → Scale → Deploy → End. Legend: edge color indicates the skills required to traverse the edge — every edge is now traversable by a non-specialist.]
Programming Language Representation / Abstraction Layer
Inference Engine(s)
Models / Stochastic Simulators
CARON ET AL.
This lack of consistency is shared by other models based on the Pólya urn construction (Zhu
et al., 2005; Ahmed and Xing, 2008; Blei and Frazier, 2011). Blei and Frazier (2011) provide a
detailed discussion on this issue and describe cases where one should or should not bother about it.
It is possible to define a slightly modified version of our model that is consistent under marginal-
isation, at the expense of an additional set of latent variables. This is described in Appendix C.
3.2 Stationary Models for Cluster Locations
To ensure we obtain a first-order stationary Pitman-Yor process mixture model, we also need to
satisfy (B). This can be easily achieved if, for k ∈ I(m_t),

U_{k,t} ∼ p(·|U_{k,t−1}) if k ∈ I(m_{t−1}), and U_{k,t} ∼ H otherwise,

where H is the invariant distribution of the Markov transition kernel p(·|·). In the time series
literature, many approaches are available to build such transition kernels based on copulas (Joe,
1997) or Gibbs sampling techniques (Pitt and Walker, 2005).
Combining the stationary Pitman-Yor and cluster locations models, we can summarize the full
model by the following Bayesian network in Figure 1. It can also be summarized using a Chinese
restaurant metaphor (see Figure 2).
Figure 1: A representation of the time-varying Pitman-Yor process mixture as a directed graphi-
cal model, representing conditional independencies between variables. All assignment
variables and observations at time t are denoted ct and zt, respectively.
3.3 Properties of the Models
Under the uniform deletion model, the number A_t = Σ_i m_{i,t−1} of alive allocation variables at time
t can be written as

A_t = Σ_{j=1}^{t−1} Σ_{k=1}^{n} X_{j,k}
Gaussian Mixture Model
Figure : From left to right: graphical models for a finite Gaussian mixture model
(GMM), a Bayesian GMM, and an infinite GMM
c_i | π⃗ ∼ Discrete(π⃗)
y⃗_i | c_i = k; Θ ∼ Gaussian(·|θ_k)
π⃗ | α ∼ Dirichlet(·| α/K, . . . , α/K)
Θ ∼ G_0
Wood (University of Oxford) Unsupervised Machine Learning January, 2014 16 / 19
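The finite Bayesian GMM above can be simulated by ancestral sampling. The sketch below follows the slide's equations, with two illustrative assumptions not taken from the slide: unit-variance components, and G0 = Gaussian(0, 10) over component means.

```python
import random

def sample_gmm(n, k, alpha=1.0, seed=0):
    """Ancestral sample from a finite Bayesian GMM with k components."""
    rng = random.Random(seed)
    # pi | alpha ~ Dirichlet(alpha/K, ..., alpha/K), via normalized Gamma draws
    gammas = [rng.gammavariate(alpha / k, 1.0) for _ in range(k)]
    pi = [g / sum(gammas) for g in gammas]
    # theta_k ~ G0; here G0 = Gaussian(0, 10) over component means (an assumption)
    theta = [rng.gauss(0.0, 10.0) for _ in range(k)]
    data = []
    for _ in range(n):
        c = rng.choices(range(k), weights=pi)[0]   # c_i | pi ~ Discrete(pi)
        data.append((c, rng.gauss(theta[c], 1.0))) # y_i | c_i = k ~ Gaussian(theta_k, 1)
    return pi, theta, data

pi, theta, data = sample_gmm(5, 3)
print(len(data), abs(sum(pi) - 1.0) < 1e-9)
```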
Latent Dirichlet Allocation
Figure 1. Graphical model for LDA model
Lecture LDA
LDA is a hierarchical model used to model text documents. Each document is modeled as
a mixture of topics. Each topic is defined as a distribution over the words in the vocabulary.
Here, we will denote by K the number of topics in the model. We use D to indicate the
number of documents, M to denote the number of words in the vocabulary, and N_d to
denote the number of words in document d. We will assume that the words have been
translated to the set of integers {1, . . . , M} through the use of a static dictionary. This is
for convenience only and the integer mapping will contain no semantic information. The
generative model for the D documents can be thought of as sequentially drawing a topic
mixture θ_d for each document independently from a Dir_K(α1⃗) distribution, where Dir_K(β⃗)
is a Dirichlet distribution over the K-dimensional simplex with parameters [β_1, β_2, . . . , β_K].
Each of the K topics {φ_k}_{k=1}^{K} is drawn independently from Dir_M(β1⃗). Then, for each of the
i = 1 . . . N_d words in document d, an assignment variable z_i^d is drawn from Discrete(θ_d).
Conditional on the assignment variable z_i^d, word i in document d, denoted w_i^d, is drawn
independently from Discrete(φ_{z_i^d}). The graphical model for the process can be seen in Figure 1.
The model is parameterized by the vector-valued parameters {θ_d}_{d=1}^{D} and {φ_k}_{k=1}^{K}, the
assignment variables {z_i^d}_{d=1,...,D, i=1,...,N_d}, and the scalar positive parameters α and β. The model
is formally written as:

θ_d ∼ Dir_K(α1⃗)
φ_k ∼ Dir_M(β1⃗)
z_i^d ∼ Discrete(θ_d)
w_i^d ∼ Discrete(φ_{z_i^d})
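The LDA generative story above can likewise be sketched as an ancestral sampler. The defaults below (K = 2 topics, vocabulary size M = 5, symmetric α and β) are illustrative assumptions, not values from the lecture.

```python
import random

def sample_lda(doc_lengths, K=2, M=5, alpha=1.0, beta=1.0, seed=1):
    """Ancestral sample of documents from the LDA generative model."""
    rng = random.Random(seed)
    def dirichlet(dim, conc):
        g = [rng.gammavariate(conc, 1.0) for _ in range(dim)]
        s = sum(g)
        return [x / s for x in g]
    phi = [dirichlet(M, beta) for _ in range(K)]       # phi_k ~ Dir_M(beta 1)
    docs = []
    for n_d in doc_lengths:
        theta_d = dirichlet(K, alpha)                  # theta_d ~ Dir_K(alpha 1)
        words = []
        for _ in range(n_d):
            z = rng.choices(range(K), weights=theta_d)[0]  # z ~ Discrete(theta_d)
            w = rng.choices(range(M), weights=phi[z])[0]   # w ~ Discrete(phi_z)
            words.append(w)
        docs.append(words)
    return docs

docs = sample_lda([4, 6])
print([len(d) for d in docs])
```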
Hello World!
First Exercise
Gaussian Unknown Mean 
• Learning objectives 
1. Clojure 
2. Gorilla REPL 
3. Anglican 
4. Automatic inference over generative models 
expressed as programs via query 
• Resources 
• https://clojuredocs.org/ 
• https://bitbucket.org/probprog/anglican/ 
• http://www.robots.ox.ac.uk/~fwood/anglican/index.html
Simulation
Second Exercise
Learning objectives 
1. Develop experience thinking about expressing problems 
as inference over program executions 
2. Understand how to perform inference over a complex 
deterministic generative process, here a 2D-physics 
simulator 
Second Exercise
Use inference to solve a 
mechanism design 
optimization task: 
• get all balls safely in bin
Inference
Trace Probability
• observed data points y
• internal random choices x
• simulate from f(x_n|x_{1:n−1}) by running the program forward
• weight traces by observes
p(y_{1:N}, x_{1:N}) = ∏_{n=1}^{N} g(y_n|x_{1:n}) f(x_n|x_{1:n−1})
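Under this factorization, the log-probability of one trace is a sum of f terms (random choices) and g terms (observes). A sketch, assuming Gaussian random-walk dynamics for f and Gaussian observation densities for g (both choices are illustrative, not from the slide):

```python
import math

def log_gauss(x, mu, sigma):
    """Log density of Normal(mu, sigma^2) at x."""
    return -0.5 * math.log(2 * math.pi * sigma**2) - (x - mu)**2 / (2 * sigma**2)

def trace_log_prob(xs, ys):
    """Score a trace: sum of f(x_n|x_{1:n-1}) and g(y_n|x_{1:n}) log terms."""
    lp = 0.0
    prev = 0.0
    for x, y in zip(xs, ys):
        lp += log_gauss(x, prev, 1.0)   # f(x_n | x_{1:n-1}): random-walk transition
        lp += log_gauss(y, x, 1.0)      # g(y_n | x_{1:n}): observation density
        prev = x
    return lp

lp = trace_log_prob([0.5, 1.0], [0.4, 1.2])
print(lp < 0.0)
```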
[Graphical model: states x_1 → x_2 → x_3 with transition density f(x_n|x_{1:n−1}), and observations y_1, y_2, y_3 with observation density g(y_n|x_{1:n}).]
Trace
x1,1 = 3
x1,2 = 0
x1,2 = 1
x1,2 = 2
(let [x-1-1 3
      x-1-2 (sample (discrete (range x-1-1)))] 
  (if (not= x-1-2 1) 
    (let [x-2-1 (+ x-1-2 7)]
      (sample (poisson x-2-1)))))
x2,1 = 7
x2,1 = 9
x2,2 = 0
x2,2 = 1
. . .
Observe
x1,1 = 3
x1,2 = 0
x1,2 = 1
x1,2 = 2
(let [x-1-1 3
      x-1-2 (sample (discrete (range x-1-1)))]
  (if (not= x-1-2 1)
    (let [x-2-1 (+ x-1-2 7)
          x-2-2 (sample (poisson x-2-1))]
      (observe (gaussian x-2-2 0.0001) 7))))
x2,1 = 7
x2,1 = 9
x2,2 = 0
x2,2 = 1
. . .
“Single Site” MCMC = LMH
Posterior distribution of execution traces is proportional to trace score with 
observed values plugged in
Metropolis-Hastings acceptance rule  
Lightweight Implementations of Probabilistic Programming Languages Via Transformational Compilation [Wingate, Stuhlmüller et al, 2011]
▪ Need
▪ Proposal
▪ Have
▪ Likelihoods (via observe statement restrictions)
▪ Prior (sequence of ERP returns, scored in the interpreter)
min( 1, [p(y|x′) p(x′) q(x|x′)] / [p(y|x) p(x) q(x′|x)] )
[assume poi-1 (beta 7 4)]
[assume poi-2 (+ 1 (poisson 8))]
[assume generative-model-part-1 (lambda (a b)
(if (flip a)
(+ (poisson b) 7)
(normal a b)))]
[assume generative-model-part-2 (gamma poi-1 poi-2)]
[observe (normal (generative-model-part-1 poi-1 poi-2) 1) 18]
[observe (normal 0 generative-model-part-2) 6]
[predict (list poi-1 poi-2 )]
x_{1,1} ∼ Beta(7, 4)
x_{1,2} | x_{1,1} = 0.4 ∼ Poisson(8)
x_{1,3} | x_{1,2} = 6, x_{1,1} = 0.4 ∼ Bernoulli(0.4)
x_{1,4} | x_{1,3} = true, x_{1,2} = 6, x_{1,1} = 0.4 ∼ Poisson(4)
y_1 = 18 | x_{1,4} = 7, x_{1,3} = true, x_{1,2} = 6, x_{1,1} = 0.4 ∼ Normal(7, 1)
x_{2,1} | . . . ∼ Gamma(0.4, 6)
y_2 = 6 | x_{2,1} = 6.92, . . . ∼ Normal(0, 6.92)

p(x|y) ∝ p̃(y = observes, x)
Manuscript under review by AISTATS 2014
Note that there are more efficient PMCMC algorithms
for probabilistic programming inference. In particular,
there is no reason to fork unless an observe has just
been interpreted. Alg. 1 is presented in this form for
expositional purposes.
3.2 Random Database
We refer to the MH approach to sampling over
the space of all traces proposed in [13] as “random
database” (RDB). A RDB sampler is a MH sampler
where a single variable drawn in the course of a partic-
ular interpretation of a probabilistic program is modi-
fied via a standard MH proposal, and this modification
is accepted by comparing the value of the joint distri-
bution of old and new program traces. For complete-
ness we review RDB here, noting a subtle correction to
the acceptance ratio proposed in the original reference
which is proper for a larger family of models.
The RDB sampler employs a data structure that holds
all random variables x associated with an execution
trace, along with the parameters and log probability
of each draw. Note that interpretation of a program
is deterministic conditioned on x. A new proposal
trace is initialized by picking a single variable x_{m,j}
from the |x| random draws, and resampling its value
using a reversible kernel κ(x′_{m,j} | x_{m,j}). Starting from
this initialization, the program is rerun to generate a
new set of variables x0 that correspond to a new valid
execution trace. In each instance where the random
procedure type remains the same, we reuse the exist-
ing value from the set x, rescoring its log probability
conditioned on the preceding variables where neces-
sary. When the random procedure type has changed,
or a new random variable is encountered, its value is
sampled in the usual manner. Finally, we compute
the probability p(y|x′) by rescoring each observe as
needed, and accept with probability

min( 1, [p(y|x′) p(x′) q(x|x′)] / [p(y|x) p(x) q(x′|x)] ).   (3)
In order to calculate the ratio of the proposal probabilities q(x′|x) and q(x|x′), we need to account for the
variables that were resampled in the course of constructing the proposal, as well as the fact that the sets
x′ and x may have different cardinalities |x′| and |x|.
We will use the (slightly inaccurate) notation x′\x to
refer to the set of variables that were resampled, and
let x′ ∩ x represent the set of variables common to
both execution traces. The proposal probability is now
given by

q(x′|x) = [κ(x′_{m,j} | x_{m,j}) / |x|] · [p(x′\x | x′ ∩ x) / p(x′_{m,j} | x′ ∩ x)].   (4)

In our implementation, the initialization x′_{m,j} is simply resampled conditioned on the preceding variables,
such that κ(x′_{m,j} | x_{m,j}) = p(x′_{m,j} | x′ ∩ x). The reverse
proposal density can now be expressed in a similar
fashion in terms of x\x′ and x ∩ x′ = x′ ∩ x, allowing
the full acceptance probability to be written as

[p(y|x′) p(x′) |x| p(x\x′ | x ∩ x′)] / [p(y|x) p(x) |x′| p(x′\x | x′ ∩ x)].   (5)
4 Testing
Programming probabilistic program interpreters is a
non-trivial software development effort, involving both
the correct implementation of an interpreter and the
correct implementation of a general purpose sampler.
The methodology we employ to ensure correctness of
both involves three levels of testing: 1) unit tests, 2)
measure tests, and 3) conditional measure tests.
4.1 Unit and Measure Tests
In the context of probabilistic programming, unit testing includes verifying that the interpreter correctly
interprets a comprehensive set of small deterministic
programs. Measure testing involves interpreting short
revealing programs consisting of assume and predict
statements (producing a sequence of ancestral, unconditioned samples, i.e. no observes). Interpreter output is tested relative to ground truth, where ground
truth is computed via exhaustive enumeration, analytic derivation, or some combination, and always in
a different, well-tested independent computational system like Matlab. Various comparisons of the empirical
distribution constructed from the accumulating stream
of output predicts and ground truth are computed:
Kullback-Leibler (KL) divergences for discrete sample
spaces and Kolmogorov-Smirnov (KS) test statistics
for continuous sample spaces. While it is possible
to construct distribution equality hypothesis tests for
some combinations of test statistic and program, we
are generally content to accept interpreters for which
there is clear evidence of convergence towards zero of
all test statistics for all measure tests. Anglican passed
all unit and measure tests.
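A discrete-space measure test of this kind can be sketched by comparing the empirical distribution of a sample stream against enumerated ground truth with a KL statistic. The ground-truth distribution and tolerance below are illustrative assumptions.

```python
import math
import random

def kl(p, q):
    """KL divergence KL(p || q) over a shared discrete support."""
    return sum(p[k] * math.log(p[k] / q[k]) for k in p if p[k] > 0)

def empirical(samples, support):
    """Empirical distribution (tiny additive smoothing keeps q strictly positive)."""
    n = len(samples)
    return {k: (samples.count(k) + 1e-9) / n for k in support}

# Ground truth, e.g. from exhaustive enumeration of a small program
truth = {0: 0.25, 1: 0.5, 2: 0.25}
rng = random.Random(0)
samples = [rng.choices([0, 1, 2], weights=[1, 2, 1])[0] for _ in range(20000)]

# With many samples the statistic should be close to zero
kl_large = kl(truth, empirical(samples, truth))
print(kl_large < 0.01)
```

In practice one monitors this statistic over the accumulating stream of predicts and accepts the interpreter when it clearly converges toward zero.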
4.2 Conditional Measure Tests
Measure tests involving conditioning provide addi-
tional information beyond that provided by measure
and unit tests. Conditioning involves endowing pro-
grams with observe statements which constrain or
weight the set of possible execution traces. Interpret-
ing observe statements engages the full inference ma-
chinery. Conditional measure test performance is mea-
sured in the same way as measure test performance.
LMH Proposal
Single stochastic procedure (SP) output
Number of SPs in original trace
Probability of new SP return value (sample) given trace prefix
Probability of new part of proposed execution trace
[Wingate, Stuhlmüller et al, 2011]
LMH Implementation
"Single site update" = sample from the prior = run program forward
Simplified MH acceptance ratio
Number of SP applications in original trace
Number of SP applications in new trace
Probability of regenerating current trace continuation given proposal trace beginning
Probability of generating proposal trace continuation given current trace beginning
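For a program with a single random choice, the "run program forward" proposal makes the prior and proposal terms cancel, leaving exactly the likelihood ratio. A sketch on an assumed Gaussian-unknown-mean model (x ∼ Normal(0, 1), observe y = 2 ∼ Normal(x, 1)), not a model from the paper:

```python
import math
import random

def loglik(x, y=2.0):
    """log g(y | x) for Normal(x, 1), up to a constant."""
    return -0.5 * (y - x) ** 2

rng = random.Random(0)
x = rng.gauss(0.0, 1.0)
chain = []
for _ in range(20000):
    x_prop = rng.gauss(0.0, 1.0)           # propose by re-running the program forward
    log_a = loglik(x_prop) - loglik(x)     # prior/proposal terms cancel: likelihood ratio
    if math.log(rng.random()) < log_a:
        x = x_prop
    chain.append(x)

# Conjugacy gives posterior Normal(1, 1/2); the chain mean should be near 1.0
post_mean = sum(chain[5000:]) / len(chain[5000:])
print(0.8 < post_mean < 1.2)
```

With more than one random choice, the trace database additionally reuses and rescores unchanged draws, as in the RDB description above.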
Introduction: Sequential Monte Carlo

Sequential Monte Carlo targets p(x_{1:n}|y_{1:n}) with a weighted set of particles. Noting the identity below, we can use importance sampling to generate samples from p(x_{1:n}|y_{1:n}) given a sample-based approximation to p(x_{1:n−1}|y_{1:n−1}).
q(x_{1:n}|y_{1:n}) = f(x_n|x_{1:n−1}) p(x_{1:n−1}|y_{1:n−1})

x^ℓ_n ∼ f(x_n | x^{a^ℓ_{n−1}}_{1:n−1})

w̃^ℓ_n = g(y_n | x^ℓ_{1:n}),    w^ℓ_n = w̃^ℓ_n / Σ_{ℓ=1}^{L} w̃^ℓ_n

p(x_{1:n}|y_{1:n}) / q(x_{1:n}|y_{1:n}) = [g(y_n|x_{1:n}) f(x_n|x_{1:n−1}) p(x_{1:n−1}|y_{1:n−1})] / [f(x_n|x_{1:n−1}) p(x_{1:n−1}|y_{1:n−1})] = g(y_n|x_{1:n})

p(x_{1:N}|y_{1:N}) ∝ p̃(y_{1:N}, x_{1:N}) ≡ ∏_{n=1}^{N} g(y_n|x_{1:n}) f(x_n|x_{1:n−1})
A Compilation Target for Probabilistic Programming Languages
In this manner, any model with a generative process that
can be described in arbitrary C code can be represented
in this sequential form in the space of program execution
traces.
Each observe statement takes as its argument
ln g(y_n|x_{1:n}). Each quantity of interest in a predict
statement corresponds to some deterministic function h(·)
of all random choices x_{1:N} made during the execution of
the program. Given a set of S posterior samples {x^{(s)}_{1:N}},
we can approximate the posterior expectation of the
predict value as

E[h(x_{1:N})] ≈ (1/S) Σ_{s=1}^{S} h(x^{(s)}_{1:N}).   (10)
3.2. Sequential Monte Carlo
Forward simulation-based algorithms are a natural fit for
probabilistic programs: run the program and report executions that match the data. Sequential Monte Carlo (SMC,
sequential importance resampling) forms the basic building
block of other, more complex particle-based methods, and
can itself be used as a simple approach to probabilistic programming inference. SMC approximates a target density
p(x_{1:N}|y_{1:N}) as a weighted set of L realized trajectories
x^ℓ_{1:N} such that

p(x_{1:N}|y_{1:N}) ≈ Σ_{ℓ=1}^{L} w^ℓ_N δ_{x^ℓ_{1:N}}(x_{1:N}).   (11)
For most probabilistic programs of interest, it will be intractable to sample from p(x_{1:N}|y_{1:N}) directly. Instead,
noting that (for n > 1) we have the recursive identity

p(x_{1:n}|y_{1:n}) = p(x_{1:n−1}|y_{1:n−1}) g(y_n|x_{1:n}) f(x_n|x_{1:n−1}),   (12)

we sample from p(x_{1:N}|y_{1:N}) by iteratively sampling from
each p(x_{1:n}|y_{1:n}), in turn, from 1 through N. At each
n, we construct an importance sampling distribution by
proposing from some distribution q(x_n|x_{1:n−1}, y_{1:n}); in
probabilistic programs we find it convenient to propose
directly from the executions of the program, i.e. each sequence of random variates x_n is jointly sampled from the
program execution state dynamics

x^ℓ_n ∼ f(x_n | x^{a^ℓ_{n−1}}_{1:n−1})   (13)

where a^ℓ_{n−1} is an "ancestor index," the particle index
1, . . . , L of the parent (at time n−1) of x^ℓ_n. The unnormalized particle importance weights at each observation y_n are
simply the observe data likelihood

w̃^ℓ_n = g(y_n | x^ℓ_{1:n}).   (14)
Algorithm 1 Parallel SMC program execution
Assume: N observations, L particles
launch L copies of the program (parallel)
for n = 1 . . . N do
    wait until all L reach observe y_n (barrier)
    update unnormalized weights w̃^{1:L}_n (serial)
    if ESS < τ then
        sample number of offspring O^{1:L}_n (serial)
        set weight w̃^{1:L}_n = 1 (serial)
        for ℓ = 1 . . . L do
            fork or exit (parallel)
        end for
    else
        set all number of offspring O^ℓ_n = 1 (serial)
    end if
    continue program execution (parallel)
end for
wait until L program traces terminate (barrier)
predict from L samples from p̂(x^{1:L}_{1:N}|y_{1:N}) (serial)
which can be normalized as

w^ℓ_n = w̃^ℓ_n / Σ_{ℓ=1}^{L} w̃^ℓ_n.   (15)
After each step n, we now have a weighted set of execution traces which approximate p(x_{1:n}|y_{1:n}). As the program continues, traces which do not correspond well with
the data will have weights which become negligibly small,
leading in the worst case to all weight concentrated on a
single execution trace. To counteract this deficiency, we
resample from our current set of L execution traces after
each observation y_n, according to their weights w^ℓ_n. This
is achieved by sampling a count O^ℓ_n for the number of "offspring" of a given execution trace ℓ to be included at time
n + 1. Any sampling scheme must ensure E[O^ℓ_n] = w^ℓ_n.
Sampling offspring counts O^ℓ_n is equivalent to sampling
ancestor indices a^ℓ_n. Program execution traces with no offspring are killed; program execution traces with more than
one offspring are forked multiple times. After resampling,
all weights w^ℓ_n = 1.
We only resample if the effective sample size

ESS ≈ 1 / Σ_ℓ (w^ℓ_n)²   (16)

is less than some threshold value τ; we choose τ = L/2.
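The simulate/weight/resample loop with ESS-triggered resampling can be sketched as follows; the Gaussian random-walk state and Gaussian observation densities are assumptions made for illustration.

```python
import math
import random

rng = random.Random(0)
L = 100                     # number of particles
ys = [0.5, 1.0, 1.5]        # observations
particles = [0.0] * L
logw = [0.0] * L
for y in ys:
    # simulate: run each particle forward under f
    particles = [rng.gauss(x, 1.0) for x in particles]
    # weight: accumulate log g(y | x), up to a constant
    logw = [lw - 0.5 * (y - x) ** 2 for lw, x in zip(logw, particles)]
    # normalize weights stably and compute ESS
    m = max(logw)
    w = [math.exp(lw - m) for lw in logw]
    s = sum(w)
    wn = [wi / s for wi in w]
    ess = 1.0 / sum(wi ** 2 for wi in wn)
    if ess < L / 2:         # resample when ESS drops below tau = L/2
        particles = rng.choices(particles, weights=wn, k=L)
        logw = [0.0] * L
print(len(particles) == L and ess > 0)
```

Multinomial resampling is used here for brevity; any offspring-count scheme with the correct expectation works.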
In probabilistic C, each observe statement forms a bar-
rier: parallel execution cannot continue until all particles
have arrived at the observe and have reported their cur-
rent unnormalized weight. As execution traces arrive at the
observe barrier, they take the number of particles which
have already reached the current observe as a (temporary)
Algorithm 1 Particle Gibbs Prob. Prog. Inference
L ← number of particles
S ← number of sweeps
{w̃^(ℓ)_N, x^(ℓ)_{1:N}} ← Run SMC
for s < S do
    {·, x*_{1:N}} ← r(1, {1/L, x^(ℓ)_{1:N}})
    {·, x^(ℓ)_0 = ∅} ← initialize L−1 interpreters
    for d ∈ ordered lines of program do
        for ℓ < L−1 do
            x̄^(ℓ)_{1:(n−1)} ← fork(x^(ℓ)_{1:(n−1)})
        end for
        if directive(d) == assume then
            for ℓ < L−1 do
                x̄^(ℓ)_{1:n} ← interpret(d, x̄^(ℓ)_{1:(n−1)})
            end for
            {x^(ℓ)_{1:n}} ← {x̄^(ℓ)_{1:n}} ∪ x*_{1:n}
        else if directive(d) == predict then
            for ℓ < L−1 do
                interpret(d, x̄^(ℓ)_{1:(n−1)})
            end for
            interpret(d, x*_{1:(n−1)})
        else if directive(d) == observe then
            for ℓ < L−1 do
                {w̄^(ℓ)_n, x̄^(ℓ)_{1:n}} ← interpret(d, x̄^(ℓ)_{1:(n−1)})
            end for
            T ← r(L−1, {w̄^(ℓ)_n, x̄^(ℓ)_{1:n}} ∪ {w̃*_{1:n}, x*_{1:n}})
            {w̃^(ℓ)_n, x^(ℓ)_{1:n}} ← T ∪ {w̃*_n, x*_{1:n}}
        end if
    end for
end for

p(x_{1:N}|y_{1:N}) ≈ Σ_{ℓ=1}^{L} w^ℓ_N δ_{x^ℓ_{1:N}}(x_{1:N})

p(x_{1:n}|y_{1:n}) = g(y_n|x_{1:n}) f(x_n|x_{1:n−1}) p(x_{1:n−1}|y_{1:n−1})

p(x_{1:n}|y_{1:n}) ≈ Σ_{ℓ=1}^{L} w^ℓ_{n−1} g(y_n|x_{1:n}) f(x_n|x^ℓ_{1:n−1}) p(x^ℓ_{1:n−1}|y_{1:n−1})
SMC
Iteratively,
- simulate
- weight
- resample

Each particle runs the program forward until the next observe; the weight of each particle is the observation likelihood.
E_{p(x_{1:n}|y_{1:n})}[h(x_{1:n})] ≈ (1/L) Σ_{ℓ=1}^L h(x_{1:n}^ℓ) w_n^ℓ,   x_{1:n}^ℓ ∼ q(x_{1:n}|y_{1:n}),   w_n^ℓ = p(x_{1:n}^ℓ|y_{1:n}) / q(x_{1:n}^ℓ|y_{1:n})

p(x_{1:n−1}|y_{1:n−1}) ≈ Σ_{ℓ=1}^L w_{n−1}^ℓ δ_{x_{1:n−1}^ℓ}(x_{1:n−1})

p(x_{1:n}|y_{1:n}) = g(y_n|x_{1:n}) f(x_n|x_{1:n−1}) p(x_{1:n−1}|y_{1:n−1})

q(x_{1:n}|y_{1:n}) = f(x_n|x_{1:n−1}) p(x_{1:n−1}|y_{1:n−1})

p(x_{1:n}|y_{1:n}) ≈ Σ_{ℓ=1}^L g(y_n|x_{1:n}^ℓ) δ_{x_{1:n}^ℓ}(x_{1:n}),   x_{1:n}^ℓ = x_n^ℓ x_{1:n−1}^{a_{n−1}^ℓ},   x_n^ℓ ∼ f

p(x_{1:n}|y_{1:n}) ≈ Σ_{ℓ=1}^L w_{n−1}^ℓ g(y_n|x_{1:n}^ℓ) δ_{x_{1:n}^ℓ}(x_{1:n}),   x_{1:n}^ℓ = x_n^ℓ x_{1:n−1}^{a_{n−1}^ℓ},   x_n^ℓ ∼ f

p(y_{1:N}, x_{0:N}, θ) = p(θ) p(x_0|θ) Π_{n=1}^N p(y_n|x_n, θ) p(x_n|x_{n−1}, θ)

min( 1, [p(y_{1:N}|θ′) p(θ′) q(θ|θ′)] / [p(y_{1:N}|θ) p(θ) q(θ′|θ)] )

min( 1, [Ẑ′ p(θ′) q(θ|θ′)] / [Ẑ p(θ) q(θ′|θ)] )

Ẑ ≡ p(y_{1:N}|θ) ≈ Π_{n=1}^N [ (1/L) Σ_{ℓ=1}^L w̃_n^ℓ ]

p(x_1),   f(x_n|x_{1:n−1}),   g(y_n|x_{1:n})
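Concretely, the recursion above is the bootstrap filter: propose x_n from f, weight by g, resample. A minimal Python sketch (`f0`, `f`, `g` are hypothetical model callbacks, not part of any particular system's API):

```python
import random

def smc(f0, f, g, ys, L):
    """Bootstrap SMC: extend each trajectory by sampling from the
    transition f, weight by the observation likelihood g, then
    multinomially resample after every observation."""
    xs = [[f0()] for _ in range(L)]
    for n, y in enumerate(ys):
        if n > 0:
            for x in xs:
                x.append(f(x[-1]))
        w = [g(y, x[-1]) for x in xs]          # unnormalized weights
        tot = sum(w)
        w = [wl / tot for wl in w]             # normalize
        ancestors = random.choices(range(L), weights=w, k=L)
        xs = [xs[a][:] for a in ancestors]     # resample trajectories
    return xs, [1.0 / L] * L                   # uniform weights after resampling
```

Since the proposal is exactly the model's forward dynamics, the importance weight reduces to the observation likelihood g, matching the equations above.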
SMC for Probabilistic Programming
Fischer, Kiselyov, and Shan “Purely functional lazy non-deterministic programming” ICFP 2009 
W., van de Meent, and Mansinghka “A New Approach to Probabilistic Programming Inference” AISTATS 2014 
Paige and W. “A Compilation Target for Probabilistic Programming Languages” ICML 2014 
[Figure: parallel program executions over a sequence of environments.]
SMC Methods Only Require
• Initialization: p(x_1) (sample)
• Forward simulation: f(x_n | x_{1:n−1}) (sample)
• Observation likelihood computation: g(y_n | x_{1:n}) (pointwise evaluation up to normalization)
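Put differently, inference never needs the density of f, only the ability to sample from it, and never needs to sample from g, only to evaluate it pointwise. A probabilistic program supplies exactly this interface; as a hypothetical sketch in Python (names illustrative, not Anglican's actual API):

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class SMCModel:
    """The three operations SMC requires of a model or program."""
    sample_initial: Callable[[], float]                 # x_1 ~ p(x_1)
    sample_transition: Callable[[List[float]], float]   # x_n ~ f(x_n | x_{1:n-1})
    observe_likelihood: Callable[[float, List[float]], float]  # g(y_n | x_{1:n}), up to a constant
```

Any generative process expressible through these three callables, however complex its control flow, can be handled by the same SMC machinery.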
SMC for Probabilistic Programming
Paige and W. “A Compilation Target for Probabilistic Programming Languages” ICML 2014 
In this manner, any model with a generative process that
can be described in arbitrary C code can be represented
in this sequential form in the space of program execution
traces.
Each observe statement takes as its argument
ln g(yn|x1:n). Each quantity of interest in a predict
statement corresponds to some deterministic function h(·)
of all random choices x1:N made during the execution of
the program. Given a set of S posterior samples {x(s)1:N},
we can approximate the posterior distribution of the
predict value as
h(x_{1:N}) ≈ (1/S) Σ_{s=1}^S h(x_{1:N}^{(s)}).   (10)
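Equation (10) is just a Monte Carlo average over posterior traces; e.g., a one-line Python helper:

```python
def predict_mean(h, samples):
    """Approximate the posterior expectation of a predict quantity h
    by averaging h over S sampled execution traces."""
    return sum(h(x) for x in samples) / len(samples)
```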
3.2. Sequential Monte Carlo
Forward simulation-based algorithms are a natural fit for
probabilistic programs: run the program and report execu-
tions that match the data. Sequential Monte Carlo (SMC,
sequential importance resampling) forms the basic building
block of other, more complex particle-based methods, and
can itself be used as a simple approach to probabilistic pro-
gramming inference. SMC approximates a target density
p(x1:N |y1:N ) as a weighted set of L realized trajectories
x_{1:N}^ℓ such that
p(x_{1:N} | y_{1:N}) ≈ Σ_{ℓ=1}^L w_N^ℓ δ_{x_{1:N}^ℓ}(x_{1:N}).   (11)
For most probabilistic programs of interest, it will be in-
tractable to sample from p(x1:N |y1:N ) directly. Instead,
noting that (for n > 1) we have the recursive identity
p(x_{1:n} | y_{1:n}) = p(x_{1:n−1} | y_{1:n−1}) g(y_n | x_{1:n}) f(x_n | x_{1:n−1}),   (12)
we sample from p(x1:N |y1:N ) by iteratively sampling from
each p(x1:n|y1:n), in turn, from 1 through N . At each
n, we construct an importance sampling distribution by
proposing from some distribution q(x_n | x_{1:n−1}, y_{1:n}); in
probabilistic programs we find it convenient to propose
directly from the executions of the program, i.e. each se-
quence of random variates xn is jointly sampled from the
program execution state dynamics
x_n^ℓ ∼ f(x_n | x_{1:n−1}^{a_{n−1}^ℓ})   (13)
where a_{n−1}^ℓ is an “ancestor index,” the particle index 1, . . . , L of the parent (at time n − 1) of x_n^ℓ. The unnormalized particle importance weights at each observation y_n are simply the observe data likelihood
w̃_n^ℓ = g(y_n | x_{1:n}^ℓ),   (14)
Algorithm 1 Parallel SMC program execution
Assume: N observations, L particles
launch L copies of the program (parallel)
for n = 1 . . . N do
  wait until all L reach observe y_n (barrier)
  update unnormalized weights w̃_n^{1:L} (serial)
  if ESS < τ then
    sample numbers of offspring O_n^{1:L} (serial)
    set weights w̃_n^{1:L} = 1 (serial)
    for ℓ = 1 . . . L do
      fork or exit (parallel)
    end for
  else
    set all numbers of offspring O_n^ℓ = 1 (serial)
  end if
  continue program execution (parallel)
end for
wait until L program traces terminate (barrier)
predict from L samples from p̂(x_{1:N}^{1:L} | y_{1:N}) (serial)
which can be normalized as
w_n^ℓ = w̃_n^ℓ / Σ_{ℓ′=1}^L w̃_n^{ℓ′}.   (15)
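Because observe statements report ln g(·), the normalization in (15) is in practice carried out on log weights with the log-sum-exp trick to avoid underflow; a small sketch:

```python
import math

def normalize_log_weights(log_w):
    """Exponentiate and normalize log weights stably (as in eq. 15)
    by subtracting the maximum log weight before exponentiating."""
    m = max(log_w)
    w = [math.exp(lw - m) for lw in log_w]
    tot = sum(w)
    return [wl / tot for wl in w]
```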
After each step n, we now have a weighted set of execu-
tion traces which approximate p(x1:n|y1:n). As the pro-
gram continues, traces which do not correspond well with
the data will have weights which become negligibly small,
leading in the worst case to all weight concentrated on a
single execution trace. To counteract this deficiency, we
resample from our current set of L execution traces after
each observation yn, according to their weights w`n. This
is achieved by sampling a count O_n^ℓ for the number of “offspring” of a given execution trace ℓ to be included at time n + 1. Any sampling scheme must ensure E[O_n^ℓ] = Lw_n^ℓ. Sampling offspring counts O_n^ℓ is equivalent to sampling ancestor indices a_n^ℓ. Program execution traces with no offspring are killed; program execution traces with more than one offspring are forked multiple times. After resampling, all weights w_n^ℓ = 1.
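One scheme meeting the unbiasedness requirement on offspring counts is multinomial resampling, sketched below under the convention that the weights are normalized and L total offspring are drawn (so each count has expectation Lw_n^ℓ):

```python
import random

def offspring_counts(w):
    """Multinomial resampling: draw len(w) ancestor indices in
    proportion to the normalized weights and count offspring
    per execution trace."""
    L = len(w)
    counts = [0] * L
    for a in random.choices(range(L), weights=w, k=L):
        counts[a] += 1
    return counts
```

Lower-variance schemes (residual, systematic) satisfy the same expectation constraint; multinomial is simply the shortest to state.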
We only resample if the effective sample size
ESS ≈ 1 / Σ_ℓ (w_n^ℓ)²   (16)
is less than some threshold value τ; we choose τ = L/2.
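For reference, (16) and the resampling trigger amount to a few lines of Python:

```python
def effective_sample_size(w):
    """ESS = 1 / sum of squared normalized weights: equals L for
    uniform weights, 1 when all mass sits on a single particle."""
    return 1.0 / sum(wl * wl for wl in w)

def should_resample(w, tau=None):
    """Resample only when the ESS drops below tau (default L/2)."""
    if tau is None:
        tau = len(w) / 2.0
    return effective_sample_size(w) < tau
```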
In Probabilistic C, each observe statement forms a barrier: parallel execution cannot continue until all particles
have arrived at the observe and have reported their cur-
rent unnormalized weight. As execution traces arrive at the
observe barrier, they take the number of particles which
have already reached the current observe as a (temporary)
Intuitively,
- run
- wait
- fork
SMC for Probabilistic Programming
[Figure: L parallel threads, one per particle; each observe acts as a delimiter at which continuations are captured, driving the SMC inner loop.]
• Sequential Monte Carlo is 
now a building block for 
other inference 
techniques 
• Particle MCMC 
- PIMH : “particle 
independent Metropolis-
Hastings” 
- iCSMC : “iterated 
conditional SMC”
[Figure: successive sweeps s = 1, 2, 3 of iterated conditional SMC.]
[Andrieu, Doucet, Holenstein 2010] 
[W., van de Meent, Mansinghka 2014]
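PIMH in particular is simple to state: propose an entirely fresh SMC run and accept or reject using the ratio of marginal-likelihood estimates Ẑ′/Ẑ from the equations above. An illustrative Python sketch, where `smc_run` (returning a sampled trace and its log Ẑ) is a hypothetical stand-in for a full SMC implementation:

```python
import math
import random

def pimh(smc_run, n_iters):
    """Particle independent Metropolis-Hastings: each iteration runs
    SMC afresh and accepts with probability min(1, Z_prop / Z_curr)."""
    trace, log_z = smc_run()
    chain = [trace]
    for _ in range(n_iters):
        trace_prop, log_z_prop = smc_run()
        if math.log(random.random()) < log_z_prop - log_z:
            trace, log_z = trace_prop, log_z_prop  # accept
        chain.append(trace)
    return chain
```

Because the proposal is independent of the current state, the acceptance ratio collapses to the ratio of the two Ẑ estimates, with all other terms cancelling.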
Third Exercise
Poisson 
• Learning objectives 
1. Understand trace 
2. Understand on which variables inference algorithms 
operate 
3. Develop intuitions about how LMH and SMC 
inference variants work in probabilistic programming 
systems
34
Fourth Exercise
Coordination 
• Learning objectives 
1. Learn how to write non-trivial models including 
mutually recursive queries 
• Resources 
• http://forestdb.org/ 
• http://www.robots.ox.ac.uk/~fwood/anglican/examples/
index.html
35
Systems
[Timeline figure, roughly 1990–2010, of systems from the PL, ML/STATS, and AI communities: Simula, Prolog, Prism, KMP, BUGS, WinBUGS, IBAL, Problog, JAGS, Church, HANSEI, infer.NET, Figaro, Factorie, BLOG, webChurch, STAN, LibBi, Venture, Anglican, Probabilistic-C, Probabilistic-ML/Haskell/Scheme, ...; annotations: “Discrete RVs only,” “Bounded recursion.”]
Opportunities
• Parallelism  
“Asynchronous Anytime Sequential Monte Carlo” [Paige, W., Doucet, Teh NIPS 2014] 
• Backwards passing  
“Particle Gibbs with Ancestor Sampling for Probabilistic Programs” [van de Meent, Yang, Mansinghka, W. 
AISTATS 2015] 
• Search  
“Maximum a Posteriori Estimation by Search in Probabilistic Models” [Tolpin, W., SOCS, 2015] 
• Adaptation  
“Output-Sensitive Adaptive Metropolis-Hastings for Probabilistic Programs” [Tolpin, van de Meent, Paige, W.; ECML 2015] 
• Novel proposals 
“Neural Adaptive Inference for Probabilistic Programming” [Paige, W.; in submission]
Bubble Up
[Stack figure: Applications on top of Models, on top of the Probabilistic Programming Language; the language together with its Inference back end constitutes the Probabilistic Programming System.]