Miscellaneous Ops¶
The pyro.ops
module implements tensor utilities
that are mostly independent of the rest of Pyro.
Utilities for HMC¶
-
class
DualAveraging
(prox_center=0, t0=10, kappa=0.75, gamma=0.05)[source]¶ Bases:
object
Dual Averaging is a scheme to solve convex optimization problems. It belongs to a class of subgradient methods which uses subgradients to update parameters (in primal space) of a model. Under some conditions, the averages of generated parameters during the scheme are guaranteed to converge to an optimal value. However, a counter-intuitive aspect of traditional subgradient methods is “new subgradients enter the model with decreasing weights” (see \([1]\)). Dual Averaging scheme solves that phenomenon by updating parameters using weights equally for subgradients (which lie in a dual space), hence we have the name “dual averaging”.
This class implements a dual averaging scheme which is adapted for Markov chain Monte Carlo (MCMC) algorithms. To be more precise, we will replace subgradients by some statistics calculated during an MCMC trajectory. In addition, introducing some free parameters such as
t0
andkappa
is helpful and still guarantees the convergence of the scheme.References
[1] Primal-dual subgradient methods for convex problems, Yurii Nesterov
[2] The No-U-turn sampler: adaptively setting path lengths in Hamiltonian Monte Carlo, Matthew D. Hoffman, Andrew Gelman
Parameters: - prox_center (float) – A “prox-center” parameter introduced in \([1]\) which pulls the primal sequence towards it.
- t0 (float) – A free parameter introduced in \([2]\) that stabilizes the initial steps of the scheme.
- kappa (float) – A free parameter introduced in \([2]\)
that controls the weights of steps of the scheme.
For a small
kappa
, the scheme will quickly forget states from early steps. This should be a number in \((0.5, 1]\). - gamma (float) – A free parameter which controls the speed of the convergence of the scheme.
-
velocity_verlet
(z, r, potential_fn, inverse_mass_matrix, step_size, num_steps=1, z_grads=None)[source]¶ Second order symplectic integrator that uses the velocity verlet algorithm.
Parameters: - z (dict) – dictionary of sample site names and their current values
(type
Tensor
). - r (dict) – dictionary of sample site names and corresponding momenta
(type
Tensor
). - potential_fn (callable) – function that returns potential energy given z
for each sample site. The negative gradient of the function with respect
to
z
determines the rate of change of the corresponding sites’ momentar
. - inverse_mass_matrix (torch.Tensor) – a tensor \(M^{-1}\) which is used to calculate kinetic energy: \(E_{kinetic} = \frac{1}{2}z^T M^{-1} z\). Here \(M\) can be a 1D tensor (diagonal matrix) or a 2D tensor (dense matrix).
- step_size (float) – step size for each time step iteration.
- num_steps (int) – number of discrete time steps over which to integrate.
- z_grads (torch.Tensor) – optional gradients of potential energy at current
z
.
Return tuple (z_next, r_next, z_grads, potential_energy): next position and momenta, together with the potential energy and its gradient w.r.t.
z_next
.- z (dict) – dictionary of sample site names and their current values
(type
Newton Optimizers¶
-
newton_step
(loss, x, trust_radius=None)[source]¶ Performs a Newton update step to minimize loss on a batch of variables, optionally constraining to a trust region [1].
This is especially usful because the final solution of newton iteration is differentiable wrt the inputs, even when all but the final
x
is detached, due to this method’s quadratic convergence [2].loss
must be twice-differentiable as a function ofx
. Ifloss
is2+d
-times differentiable, then the return value of this function isd
-times differentiable.When
loss
is interpreted as a negative log probability density, then the return valuesmode,cov
of this function can be used to construct a Laplace approximationMultivariateNormal(mode,cov)
.Warning
Take care to detach the result of this function when used in an optimization loop. If you forget to detach the result of this function during optimization, then backprop will propagate through the entire iteration process, and worse will compute two extra derivatives for each step.
Example use inside a loop:
x = torch.zeros(1000, 2) # arbitrary initial value for step in range(100): x = x.detach() # block gradients through previous steps x.requires_grad = True # ensure loss is differentiable wrt x loss = my_loss_function(x) x = newton_step(loss, x, trust_radius=1.0) # the final x is still differentiable
- [1] Yuan, Ya-xiang. Iciam. Vol. 99. 2000.
- “A review of trust region algorithms for optimization.” ftp://ftp.cc.ac.cn/pub/yyx/papers/p995.pdf
- [2] Christianson, Bruce. Optimization Methods and Software 3.4 (1994)
- “Reverse accumulation and attractive fixed points.” http://uhra.herts.ac.uk/bitstream/handle/2299/4338/903839.pdf
Parameters: - loss (torch.Tensor) – A scalar function of
x
to be minimized. - x (torch.Tensor) – A dependent variable of shape
(N, D)
whereN
is the batch size andD
is a small number. - trust_radius (float) – An optional trust region trust_radius. The
updated value
mode
of this function will be withintrust_radius
of the inputx
.
Returns: A pair
(mode, cov)
wheremode
is an updated tensor of the same shape as the original valuex
, andcov
is an esitmate of the covariance DxD matrix withcov.shape == x.shape[:-1] + (D,D)
.Return type:
-
newton_step_1d
(loss, x, trust_radius=None)[source]¶ Performs a Newton update step to minimize loss on a batch of 1-dimensional variables, optionally regularizing to constrain to a trust region.
See
newton_step()
for details.Parameters: - loss (torch.Tensor) – A scalar function of
x
to be minimized. - x (torch.Tensor) – A dependent variable with rightmost size of 1.
- trust_radius (float) – An optional trust region trust_radius. The
updated value
mode
of this function will be withintrust_radius
of the inputx
.
Returns: A pair
(mode, cov)
wheremode
is an updated tensor of the same shape as the original valuex
, andcov
is an esitmate of the covariance 1x1 matrix withcov.shape == x.shape[:-1] + (1,1)
.Return type: - loss (torch.Tensor) – A scalar function of
-
newton_step_2d
(loss, x, trust_radius=None)[source]¶ Performs a Newton update step to minimize loss on a batch of 2-dimensional variables, optionally regularizing to constrain to a trust region.
See
newton_step()
for details.Parameters: - loss (torch.Tensor) – A scalar function of
x
to be minimized. - x (torch.Tensor) – A dependent variable with rightmost size of 2.
- trust_radius (float) – An optional trust region trust_radius. The
updated value
mode
of this function will be withintrust_radius
of the inputx
.
Returns: A pair
(mode, cov)
wheremode
is an updated tensor of the same shape as the original valuex
, andcov
is an esitmate of the covariance 2x2 matrix withcov.shape == x.shape[:-1] + (2,2)
.Return type: - loss (torch.Tensor) – A scalar function of
-
newton_step_3d
(loss, x, trust_radius=None)[source]¶ Performs a Newton update step to minimize loss on a batch of 3-dimensional variables, optionally regularizing to constrain to a trust region.
See
newton_step()
for details.Parameters: - loss (torch.Tensor) – A scalar function of
x
to be minimized. - x (torch.Tensor) – A dependent variable with rightmost size of 2.
- trust_radius (float) – An optional trust region trust_radius. The
updated value
mode
of this function will be withintrust_radius
of the inputx
.
Returns: A pair
(mode, cov)
wheremode
is an updated tensor of the same shape as the original valuex
, andcov
is an esitmate of the covariance 3x3 matrix withcov.shape == x.shape[:-1] + (3,3)
.Return type: - loss (torch.Tensor) – A scalar function of
Tensor Contraction¶
-
contract_expression
(equation, *shapes, **kwargs)[source]¶ Wrapper around
opt_einsum.contract_expression()
that optionally uses Pyro’s cheap optimizer and optionally caches contraction paths.Parameters: cache_path (bool) – whether to cache the contraction path. Defaults to True.
-
contract
(equation, *operands, **kwargs)[source]¶ Wrapper around
opt_einsum.contract()
that optionally uses Pyro’s cheap optimizer and optionally caches contraction paths.Parameters: cache_path (bool) – whether to cache the contraction path. Defaults to True.
-
ubersum
(equation, *operands, **kwargs)[source]¶ Generalized batched sum-product algorithm via tensor message passing.
This generalizes
contract()
in two ways:- Multiple outputs are allowed, and intermediate results can be shared.
- Inputs and outputs can be batched along symbols given in
batch_dims
; reductions alongbatch_dims
are product reductions.
The best way to understand this function is to try the examples below, which show how
ubersum()
calls can be implemented as multiple calls tocontract()
(which is generally more expensive).To illustrate multiple outputs, note that the following are equivalent:
z1, z2, z3 = ubersum('ab,bc->a,b,c', x, y) # multiple outputs backend = 'pyro.ops.einsum.torch_log' z1 = contract('ab,bc->a', x, y, backend=backend) z2 = contract('ab,bc->b', x, y, backend=backend) z3 = contract('ab,bc->c', x, y, backend=backend)
To illustrate batched inputs, note that the following are equivalent:
assert len(x) == 3 and len(y) == 3 z = ubersum('ab,ai,bi->b', w, x, y, batch_dims='i') z = contract('ab,a,a,a,b,b,b->b', w, *x, *y, backend=backend)
When a sum dimension a always appears with a batch dimension i, then a corresponds to a distinct symbol for each slice of a. Thus the following are equivalent:
assert len(x) == 3 and len(y) == 3 z = ubersum('ai,ai->', x, y, batch_dims='i') z = contract('a,b,c,a,b,c->', *x, *y, backend=backend)
When such a sum dimension appears in the output, it must be accompanied by all of its batch dimensions, e.g. the following are equivalent:
assert len(x) == 3 and len(y) == 3 z = ubersum('abi,abi->bi', x, y, batch_dims='i') z0 = contract('ab,ac,ad,ab,ac,ad->b', *x, *y, backend=backend) z1 = contract('ab,ac,ad,ab,ac,ad->c', *x, *y, backend=backend) z2 = contract('ab,ac,ad,ab,ac,ad->d', *x, *y, backend=backend) z = torch.stack([z0, z1, z2])
Note that each batch slice through the output is multilinear in all batch slices through all inptus, thus e.g. batch matrix multiply would be implemented without
batch_dims
, so the following are all equivalent:xy = ubersum('abc,acd->abd', x, y, batch_dims='') xy = torch.stack([xa.mm(ya) for xa, ya in zip(x, y)]) xy = torch.bmm(x, y)
Among all valid equations, some computations are polynomial in the sizes of the input tensors and other computations are exponential in the sizes of the input tensors. This function raises
NotImplementedError
whenever the computation is exponential.Parameters: - equation (str) – An einsum equation, optionally with multiple outputs.
- operands (torch.Tensor) – A collection of tensors.
- batch_dims (str) – An optional string of batch dims.
- cache (dict) – An optional
shared_intermediates()
cache. - modulo_total (bool) – Optionally allow ubersum to arbitrarily scale each result batch, which can significantly reduce computation. This is safe to set whenever each result batch denotes a nonnormalized probability distribution whose total is not of interest.
Returns: a tuple of tensors of requested shape, one entry per output.
Return type: Raises: - ValueError – if tensor sizes mismatch or an output requests a batched dim without that dim’s batch dims.
- NotImplementedError – if contraction would have cost exponential in the size of any input tensor.