Implementing new Bayesian density estimators
The following page provides a tutorial on how to implement new Bayesian density estimators compatible with the BayesDensity.jl package. Prior to reading this tutorial, one should have familiarized oneself with the general package API, for instance by reading the General API documentation.
In order to be able to follow this tutorial, it is advantageous to have some prior exposure to data-augmentation schemes, Gibbs sampling and mean-field variational inference. A good introduction to all three topics can be found in Bishop (2006).
The focus of the following tutorial is to present how one can implement new Bayesian models in a BayesDensity-compatible way. As a result, the implementation presented here is by no means optimal in terms of computational efficiency or numerical stability for this particular example.
Bayesian inference for Bernstein densities
Throughout this tutorial, we focus on a Bayesian Bernstein-type density estimator for data supported on the unit interval.[1] This section provides the theoretical background for the model we will later implement as an example, and can be skipped by readers who are more interested in the details of the implementation itself.
Given a positive integer $K$, we say that $f$ is a Bernstein density if we can write
\[f(x) = \sum_{k=1}^K \theta_k\, \varphi_k(x), \quad x\in [0, 1],\]
where $\varphi_k(\cdot)$ is the density of the $\mathrm{Beta}(k, K-k+1)$-distribution and $\boldsymbol{\theta} \in \{\boldsymbol{\vartheta}\in [0,1]^K \colon \sum_k \vartheta_k = 1 \}$. The corresponding cumulative distribution function $F$ is then
\[F(x) = \sum_{k=1}^K \theta_k\, \int_{0}^x\varphi_k(t)\, \text{d}t, \quad x\in [0, 1],\]
owing to the linearity of the integral.
Our main motivation for considering the Bernstein model is that under mild regularity conditions on the true density $f_0$, the Bernstein density $f$ can approximate $f_0$ to an arbitrary degree of precision with respect to a suitable metric, such as the total variation distance, provided $K$ is sufficiently large.
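To make this approximation property concrete, the following illustrative, standalone snippet (not part of the package API) uses the classical choice $\theta_k = F_0(k/K) - F_0((k-1)/K)$, under which the Bernstein density is the derivative of the Bernstein polynomial of $F_0$, and checks numerically that the resulting $f$ is close to a smooth $f_0$ for a moderately large $K$:
using Distributions
f0 = Beta(2, 5)                                           # a smooth "true" density on [0, 1]
K = 50
θ = [cdf(f0, k / K) - cdf(f0, (k - 1) / K) for k in 1:K]  # classical Bernstein coefficients
f(x) = sum(θ[k] * pdf(Beta(k, K - k + 1), x) for k in 1:K)
grid = 0:0.01:1
maximum(abs.(f.(grid) .- pdf.(f0, grid)))                 # uniform error over the grid; shrinks as K grows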
For a Bayesian treatment of the Bernstein density model, we impose a $\mathrm{Dirichlet}(\boldsymbol{a})$-prior distribution on $\boldsymbol{\theta}$, where $\boldsymbol{a} = (a,a, \ldots, a)$ for some $a>0$. Given an observed independent and identically distributed sample $\boldsymbol{x} = (x_1, x_2, \ldots, x_n)$, the likelihood of the observed sample under the Bernstein model for $f$ is
\[p(\boldsymbol{x}\,|\, \boldsymbol{\theta}) = \prod_{i=1}^n \sum_{k=1}^K \theta_k\, \varphi_k(x_i).\]
The form taken by the likelihood function above makes Bayesian inference challenging due to the fact that the resulting posterior distribution is analytically intractable. However, by augmenting the data with latent variables $\boldsymbol{z} \in \{1,2,\ldots, K\}^n$, it is possible to perform posterior inference very efficiently through Gibbs sampling or mean-field VI. To this end, we note that an equivalent formulation of the Bernstein density model is
\[\begin{align*} x_i\,|\, \{z_i = k\} &\sim \varphi_k(x_i), &i = 1,\ldots, n,\\ z_i\,|\, \boldsymbol{\theta} &\sim \text{Multinomial}(1, \boldsymbol{\theta}), &i = 1, \ldots, n,\\ \boldsymbol{\theta} &\sim \text{Dirichlet}_K(\boldsymbol{a}). \end{align*}\]
When $\boldsymbol{z}$ is marginalized out, we recover the Bernstein density likelihood, so the two model formulations are equivalent. However, since Bayesian inference based on the augmented posterior $p(\boldsymbol{\theta}, \boldsymbol{z}\,|\, \boldsymbol{x})$ turns out to be much simpler than working with the marginal posterior of $\boldsymbol{\theta}$, we work with the augmented formulation instead.
Under this data-augmentation strategy, multiplying the Dirichlet prior, the multinomial terms $p(\boldsymbol{z}\,|\,\boldsymbol{\theta}) = \prod_{k=1}^K \theta_k^{N_k}$ and the conditional likelihood of $\boldsymbol{x}$ given $\boldsymbol{z}$ shows that the joint posterior of $(\boldsymbol{\theta}, \boldsymbol{z})$ is
\[p(\boldsymbol{\theta}, \boldsymbol{z}\, |\, \boldsymbol{x}) \propto \prod_{k = 1}^K \theta_k^{N_k + a - 1} \prod_{i=1}^n \prod_{k=1}^K \varphi_k(x_i)^{\mathbf{1}_{\{k\}}(z_i)},\]
where $\mathbf{1}_{\{k\}}(\cdot)$ is the indicator function and $N_k = \sum_{i=1}^n \mathbf{1}_{\{k\}}(z_i)$.
Gibbs sampling
To write down a Gibbs sampler for this model, we need to derive the full conditional distributions of $\boldsymbol{\theta}$ and $\boldsymbol{z}$. In this case, direct inspection of the joint posterior shows that
\[p(\boldsymbol{\theta}\, |\, \boldsymbol{z}, \boldsymbol{x}) = \mathrm{Dirichlet}(\boldsymbol{a} + \boldsymbol{N}),\]
where $\boldsymbol{N} = (N_1, N_2, \ldots, N_K)$. The full conditional distribution of $\boldsymbol{z}$ is
\[p(\boldsymbol{z}\, |\, \boldsymbol{\theta}, \boldsymbol{x}) \propto \prod_{i=1}^n \prod_{k=1}^K \big\{\theta_k\,\varphi_k(x_i)\big\}^{\mathbf{1}_{\{k\}}(z_i)}.\]
Hence, we see that $z_1, \ldots, z_n$ are independent given $\boldsymbol{\theta}, \boldsymbol{x}$, with $p(z_i = k \,|\, \boldsymbol{\theta}, \boldsymbol{x}) \propto \theta_k\, \varphi_k(x_i)$.
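As a concrete numerical illustration (with made-up numbers), suppose $K = 3$, $\boldsymbol{\theta} = (0.2, 0.5, 0.3)$ and the basis densities evaluate to $\varphi_1(x_i) = 1.5$, $\varphi_2(x_i) = 1.0$ and $\varphi_3(x_i) = 0.5$ at some observation $x_i$; then
\[p(z_i = 2\,|\,\boldsymbol{\theta}, \boldsymbol{x}) = \frac{0.5 \cdot 1.0}{0.2 \cdot 1.5 + 0.5 \cdot 1.0 + 0.3 \cdot 0.5} = \frac{0.5}{0.95} \approx 0.53,\]
and similarly for $k = 1$ and $k = 3$.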
Variational inference
For the Bernstein density, it is relatively straightforward to implement a mean-field variational inference scheme. Here, we approximate the joint posterior $p(\boldsymbol{\theta}, \boldsymbol{z}\,|\, \boldsymbol{x})$ by a distribution $q$ which satisfies the following independence assumption:
\[q(\boldsymbol{\theta}, \boldsymbol{z}) = q(\boldsymbol{\theta})\, q(\boldsymbol{z}).\]
It can be shown [see e.g. Ormerod and Wand (2010)] that the optimal $q$ densities are given by
\[\begin{aligned} q(\boldsymbol{\theta}) &\propto \exp\big\{\mathbb{E}_{\boldsymbol{z}}\big[\log p(\boldsymbol{x}, \boldsymbol{\theta}, \boldsymbol{z})\big]\big\},\\ q(\boldsymbol{z}) &\propto \exp\big\{\mathbb{E}_{\boldsymbol{\theta}}\big[\log p(\boldsymbol{x}, \boldsymbol{\theta}, \boldsymbol{z})\big]\big\}, \end{aligned}\]
where the expectations are taken with respect to $q(\boldsymbol{z})$ and $q(\boldsymbol{\theta})$, respectively. This result leads to the iterative coordinate ascent variational inference (CAVI) algorithm for finding the optimal $q$ densities, in which we cyclically update $q(\boldsymbol{\theta})$ and $q(\boldsymbol{z})$ until some convergence criterion has been met. An oft-used convergence criterion for this purpose is based on the evidence lower bound (ELBO):
\[\mathrm{ELBO}(q) = \mathbb{E}_{\boldsymbol{\theta}, \boldsymbol{z}}\Big(\log \frac{p(\boldsymbol{x}, \boldsymbol{\theta}, \boldsymbol{z})}{q(\boldsymbol{\theta}, \boldsymbol{z})}\Big).\]
For our particular example it can be shown that the optimal $q$ densities are:
- $q(\boldsymbol{\theta})$ is the $\mathrm{Dirichlet}(\boldsymbol{a} + \boldsymbol{r})$-density, where $r_k = \sum_{i=1}^n q(z_i = k)$.
- $q(\boldsymbol{z}) = \prod_{i=1}^n q(z_i)$, where $q(z_i)$ is the probability mass function of a categorical distribution on $\{1,2,\ldots, K\}$ with $q(z_i = k) \propto \varphi_k(x_i)\, \exp\big\{\psi(a + r_k)\big\}$, where $\psi(\cdot)$ denotes the digamma function.
An expression for the ELBO of this model is as follows:
\[\begin{aligned} \mathrm{ELBO}(q) =& \sum_{i=1}^n \sum_{k=1}^K q(z_i = k) \big\{\log \varphi_k(x_i) - \log q(z_i = k)\big\} \\ &+ \sum_{k=1}^K \big\{\log \Gamma(a + r_k) - \log \Gamma(a)\big\} \\ &- \log \Gamma(aK+n) + \log \Gamma(aK) \end{aligned}\]
Implementation
We start by importing the required packages:
using BayesDensityCore, Distributions, Random, StatsBase
Model struct and pdf
The first step to implementing the Bernstein density model in a BayesDensity-compatible way is to define a model struct which is a subtype of AbstractBayesDensityModel:
struct BernsteinDensity{T<:Real, D<:NamedTuple} <: AbstractBayesDensityModel{T}
data::D # NamedTuple holding data
K::Int # Basis dimension
a::T # Symmetric Dirichlet parameter.
function BernsteinDensity{T}(x::AbstractVector{<:Real}, K::Int; a::Real=1.0) where {T<:Real}
φ_x = Matrix{T}(undef, (length(x), K))
for i in eachindex(x)
for k in 1:K
φ_x[i, k] = pdf(Beta(k, K - k + 1), x[i])
end
end
data = (x = x, n = length(x), φ_x = φ_x)
return new{T, typeof(data)}(data, K, T(a))
end
end
BernsteinDensity(args...; kwargs...) = BernsteinDensity{Float64}(args...; kwargs...) # For convenience
In the above implementation, we store the values of $\varphi_k(x_i)$ for $1 \leq i \leq n$ and $1 \leq k \leq K$, as these values are reused repeatedly in the model-fitting procedures later. We also maintain a copy of the original dataset x in the data field. By default, the original data stored under the field bdm.data.x is used to select a default grid for the first coordinate axis when plotting fitted model objects via the fallback implementation of default_grid_points. If the original data cannot be found under bdm.data.x, then trying to plot a PosteriorSamples or AbstractVIPosterior object without supplying a plotting grid as the second argument will fail, in which case implementing default_grid_points for this model class will resolve the issue.
Returning to our specific example, we note that the Bernstein model is always supported on $[0, 1]$, so it may make more sense to use a default grid spanning this entire interval. A possible implementation of default_grid_points for this purpose is as follows:
BayesDensityCore.default_grid_points(::BernsteinDensity{T}) where {T} = LinRange{T}(0, 1, 2001)
In order to be able to use all the functionality of BayesDensityCore, we also need to implement an equality method for our new type. In this case, this is just a matter of checking that all the fields of two such objects are equal:
Base.:(==)(bd1::BernsteinDensity, bd2::BernsteinDensity) = bd1.data == bd2.data && bd1.K == bd2.K && bd1.a == bd2.a
Next, we implement a method that calculates the pdf of the model when the parameters of the model are given. The pdf method should always receive the model object as the first argument, the parameters as the second argument and the point(s) at which the density should be evaluated as the third. In the implementation presented below, we take in a NamedTuple with a single field named θ which represents the mixture probabilities.
function Distributions.pdf(bdm::BernsteinDensity{T, D}, params::NamedTuple, t::S) where {T<:Real, D, S<:Real}
K = bdm.K
(; θ) = params
f = zero(promote_type(T, S))
for k in 1:K
f += θ[k] * pdf(Beta(k, K - k + 1), t)
end
return f
end
The BayesDensityCore module provides generic fallback methods for the cases where params is given as a Vector of NamedTuples and when t is a vector. However, as noted in the general API documentation, it is recommended that most models provide specialized methods for vectors of parameters and vectors of evaluation points, as it is often possible to implement batch evaluation more efficiently, e.g. by leveraging BLAS calls instead of loops, when the parameters and the evaluation grid are provided in batches.
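As an illustration, a minimal sketch of one such specialized method for a vector of evaluation points is shown below. It assumes that keeping the same (model, parameters, points) argument order is enough for the method to be used in place of the generic fallback; it evaluates all basis densities into a matrix once and then computes the mixture with a single matrix-vector product:
function Distributions.pdf(bdm::BernsteinDensity{T, D}, params::NamedTuple, t::AbstractVector{S}) where {T<:Real, D, S<:Real}
    K = bdm.K
    (; θ) = params
    # Φ[i, k] = φ_k(t_i); building this once lets a BLAS gemv handle the mixture sum below
    Φ = Matrix{promote_type(T, S)}(undef, length(t), K)
    for k in 1:K
        β_k = Beta(k, K - k + 1)
        for i in eachindex(t)
            Φ[i, k] = pdf(β_k, t[i])
        end
    end
    return Φ * θ # f(t_i) = Σ_k θ_k φ_k(t_i) for all evaluation points at once
end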
Next, we need to implement the cdf method. Owing to the nice structure of the cdf $F$ in this example, this is no more complicated than implementing the pdf:
function Distributions.cdf(bdm::BernsteinDensity{T, D}, params::NamedTuple, t::S) where {T<:Real, D, S<:Real}
K = bdm.K
(; θ) = params
f = zero(promote_type(T, S))
for k in 1:K
f += θ[k] * cdf(Beta(k, K - k + 1), t)
end
return f
end
In general, it is good practice to also implement the support and hyperparams methods for new models. Note that for the Bernstein density model, the support is always equal to the unit interval, and the only hyperparameter is the scalar value a (here, we treat K as fixed). Hence, the following provides appropriate implementations of the aforementioned methods:
BayesDensityCore.support(::BernsteinDensity{T, D}) where {T, D} = (T(0.0), T(1.0))
BayesDensityCore.hyperparams(bdm::BernsteinDensity) = (a = bdm.a,)
Gibbs sampler
We now turn our attention to implementing the Gibbs sampler itself. All BayesDensity-compatible Markov chain Monte Carlo samplers should overload the sample method. This method should always take in a random number generator as the first argument, the density model object as the second argument and the total number of samples (including burn-in) as the third argument. In addition, the number of burn-in samples must be provided as a keyword argument.
The sample method should always return an object of type PosteriorSamples in order to be compatible with the rest of the package. The samples generated during the MCMC routine should be stored in a subtype of AbstractVector whose element type is compatible with the signature of the implemented pdf method. Since our implementation of the pdf method takes in a NamedTuple as the parameters argument, we store the generated samples in a vector of NamedTuples in the implementation shown below:
function StatsBase.sample(rng::AbstractRNG, bdm::BernsteinDensity{T, D}, n_samples::Int; n_burnin=min(div(bdm.data.n, 5), 1000), init_params::NamedTuple=(θ = fill(1/bdm.K, bdm.K),)) where {T, D}
(; K, data, a) = bdm
(; x, n, φ_x) = data
a_vec = fill(a, K) # Dirichlet prior parameter
θ = T.(init_params.θ) # Initialize θ (uniform vector by default)
probs = Vector{T}(undef, K) # Vector used to store intermediate calculations of p(zᵢ|θ, x)
# Store samples as a vector of NamedTuples
samples = Vector{NamedTuple{(:θ,), Tuple{Vector{T}}}}(undef, n_samples)
for m in 1:n_samples
N = zeros(Int, K) # N[k] = number of z[i] equal to k.
for i in 1:n
for k in 1:K
probs[k] = θ[k] * φ_x[i, k]
end
probs ./= sum(probs) # Normalize to obtain p(zᵢ = k | θ, x)
N .+= rand(rng, Multinomial(1, probs)) # sample zᵢ ∼ p(zᵢ|θ, x)
end
θ = rand(rng, Dirichlet(a_vec + N)) # sample θ ∼ p(θ|z, x)
samples[m] = (θ = θ,) # store the current value of θ
end
return PosteriorSamples{T}(samples, bdm, n_samples, n_burnin)
end
The above implementation allows the user to supply the initial value of $\boldsymbol{\theta}$ used in the first iteration of the Gibbs sampler via the init_params keyword argument, given as a NamedTuple to match the data structure we use to store the samples.
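For instance, assuming a model object bdm and a random number generator rng have been constructed as in the usage example below, the chain could be started from a draw from the Dirichlet prior rather than from the uniform default:
ps = sample(rng, bdm, 1_000; n_burnin=500, init_params=(θ = rand(rng, Dirichlet(fill(bdm.a, bdm.K))),)) # illustrative: start from a random prior draw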
The convention adopted by the current set of BayesDensity models is that during an MCMC run, only model parameters should be stored, and not auxiliary variables which are introduced solely to facilitate efficient computation. In this case, we therefore do not store the $z_i$ in the PosteriorSamples object returned by this method.
Example usage
Having implemented the model struct and the pdf, cdf and sample methods, we can run the MCMC algorithm and perform posterior inference just as with any of the other density estimators implemented in this package:
d_true = Kumaraswamy(2, 5) # Simulate some data from a density supported on [0, 1]
rng = Xoshiro(1) # for reproducibility
x = rand(rng, d_true, 1000)
K = 20
bdm = BernsteinDensity(x, K) # Create Bernstein density model object (a = 1)
ps = sample(rng, bdm, 1_000; n_burnin=500) # Run MCMC
median(ps, 0.5) # Compute the posterior median of f(0.5)
1.6636159017061116
For instance, we can visualize the posterior fit by plotting the posterior means of $f(t)$ and $F(t)$ along with 95 % pointwise credible bands:
using CairoMakie
t = LinRange(0, 1, 1001) # Grid for plotting
fig = Figure(size=(600, 320))
ax1 = Axis(fig[1,1], xlabel="x", ylabel="Density")
plot!(ax1, ps, t, label="MCMC") # Plot the posterior mean and credible bands:
lines!(ax1, t, pdf(d_true, t), label="Truth", color=:black) # Also plot truth for comparison
ax2 = Axis(fig[1,2], xlabel="x", ylabel="Cumulative distribution")
plot!(ax2, ps, cdf, t, label="MCMC")
lines!(ax2, t, cdf(d_true, t), label="Truth", color=:black)
Legend(fig[1,3], ax1, framevisible=false)
fig
Mean-field variational inference
Before getting started on implementing a variational inference algorithm, we first need to define a new struct that represents the variational posterior distribution. To make the resulting variational posterior compatible with the BayesDensityCore interface, the new struct should be a subtype of AbstractVIPosterior.
Although not strictly required in order to make the variational posterior struct BayesDensity-compatible, it is customary to have the variational posterior distribution store the variational densities as Distributions-objects, in addition to storing the original model object.
The implementation below stores the variational density $q(\boldsymbol{\theta})$, along with the Bernstein-density model object. We also create a default constructor which takes in the vector $\boldsymbol{r}$ resulting from the CAVI algorithm, along with the original model object:
struct BernsteinDensityVIPosterior{T<:Real, D<:Dirichlet{T}, M<:BernsteinDensity} <: AbstractVIPosterior{T}
q_θ::D
model::M
function BernsteinDensityVIPosterior{T}(r::AbstractVector{<:Real}, model::M) where {T<:Real, M<:BernsteinDensity}
a = hyperparams(model).a
K = model.K
q_θ = Dirichlet{T}(fill(a, K) + r)
return new{T, Dirichlet{T}, M}(q_θ, model)
end
end
It is also recommended to implement the model method, so that the user can easily extract the model to which the variational posterior was fitted:
BayesDensityCore.model(vip::BernsteinDensityVIPosterior) = vip.model
Next, we need to implement a method for generating samples from the variational posterior distribution, i.e. sampling from $q(\boldsymbol{\theta})$. This is achieved by implementing the sample method:
function StatsBase.sample(rng::AbstractRNG, vip::BernsteinDensityVIPosterior{T,D, M}, n_samples::Int) where {T, D, M}
q_θ = vip.q_θ
samples = Vector{NamedTuple{(:θ,), Tuple{Vector{T}}}}(undef, n_samples)
for m in 1:n_samples
θ = rand(rng, q_θ)
samples[m] = (θ = θ,)
end
# Note that we return independent samples here, so burn-in is not needed
return PosteriorSamples{T}(samples, model(vip), n_samples, 0)
end
Having implemented a struct for storing the variational posterior, we can now turn our attention to the optimization procedure itself. To start, we implement the ELBO, which we will need to determine convergence later. Note that in the implementation below, we assume that the values of $q(z_i = k)$ are stored in an $n \times K$ matrix $\omega$, so that $\omega_{i,k} = q(z_i = k)$.
function Bernstein_ELBO(model::BernsteinDensity{T, D}, r::AbstractVector{<:Real}, ω::AbstractMatrix{<:Real}) where {T, D}
(; data, K, a) = model
(; x, n, φ_x) = data
logφ_x = log.(φ_x)
ELBO = loggamma(a*K) - loggamma(a*K+n)
ELBO += sum(loggamma.(r .+ a)) - K*loggamma(a)
for k in 1:K
for i in 1:n
ELBO += ω[i,k]*(logφ_x[i,k] - log(ω[i,k]))
end
end
return ELBO
end
Finally, we are ready to implement the optimization procedure itself by overloading varinf. Note that this function should return a 2-tuple, consisting of the fitted variational posterior itself and an object of type VariationalOptimizationResult.
using SpecialFunctions # For the digamma-function
function BayesDensityCore.varinf(model::BernsteinDensity{T, D}; max_iter::Int=1000, rtol::Real=1e-4) where {T, D}
(; data, K, a) = model
(; x, n, φ_x) = data
# Initialize the latent variables ω[i,k] = q(z_i = k) to 1/K:
ω = fill(T(1/K), (n, K))
r = fill(T(n/K), K) # r[k] = Σᵢ ω[i,k] under the uniform initialization (recomputed in the loop below)
# CAVI optimization loop
ELBO_prev = T(-1)
ELBO = Vector{T}(undef, max_iter)
converged = false
iter = 1
while !converged && iter <= max_iter
# Update q(θ)
r = vec(sum(ω, dims=1)) # r[k] = Σᵢ q(zᵢ = k), so that q(θ) is Dirichlet(a .+ r)
# Update q(z)
for i in 1:n
# Compute q(z_i = k) up to proportionality
for k in 1:K
ω[i,k] = φ_x[i,k] * exp(digamma(a + r[k]))
end
# Normalize so that the rows of ω sum to 1:
ω[i,:] = ω[i,:] / sum(ω[i,:])
end
# Check if the procedure has converged:
ELBO[iter] = Bernstein_ELBO(model, r, ω)
# Run at least two iterations
converged = (abs(ELBO_prev - ELBO[iter]) / abs(ELBO_prev) <= rtol) && iter > 1
ELBO_prev = ELBO[iter]
iter += 1
end
# Print a warning if the procedure fails to converge within the maximum number of iterations
converged || @warn "Maximum number of iterations reached."
posterior = BernsteinDensityVIPosterior{T}(r, model)
info = VariationalOptimizationResult{T}(ELBO[1:iter-1], converged, iter-1, rtol, posterior)
return posterior, info
end
Example usage
Having implemented the varinf method, we can now perform variational inference for the BernsteinDensity model just as easily as for any other BayesDensityCore-compatible model. The example below shows how to fit a variational posterior to the Bernstein model for a simulated dataset:
d_true = Kumaraswamy(2, 5) # Simulate some data from a density supported on [0, 1]
rng = Xoshiro(1) # for reproducibility
x = rand(rng, d_true, 1_000)
K = 20
bdm = BernsteinDensity(x, K) # Create Bernstein density model object (a = 1)
vip, info = varinf(bdm) # Compute the variational posterior.
mean(vip, 0.2) # Compute the posterior mean of f(0.2)
1.5810389520385393
For instance, we can visualize the variational posterior fit by displaying the (variational) posterior means of $f(t)$ and $F(t)$ along with 95 % pointwise credible bands:
using CairoMakie
t = LinRange(0, 1, 1001) # Grid for plotting
fig = Figure(size=(600, 320))
ax1 = Axis(fig[1,1], xlabel="x", ylabel="Density")
plot!(ax1, vip, t, label="VI") # Plot the posterior mean and credible bands:
lines!(ax1, t, pdf(d_true, t), label="Truth", color=:black) # Also plot truth for comparison
ax2 = Axis(fig[1,2], xlabel="x", ylabel="Cumulative distribution")
plot!(ax2, vip, cdf, t, label="VI")
lines!(ax2, t, cdf(d_true, t), label="Truth", color=:black)
Legend(fig[1,3], ax1, framevisible=false)
fig
We can also verify that the ELBO has converged:
fig = Figure(size=(400, 350))
ax = Axis(fig[1,1], xlabel="Iteration", ylabel="ELBO")
lines!(ax, info)
fig- 1A Bayesian Bernstein-type estimator, where the number $K$ of mixture components is treated as a further random variable has been proposed by Petrone (1999). A
BayesDensityimplementation of this model is available asRandomBernsteinPoly.