Supported Methods

This page provides background on each histogram method supported through the rule argument. Our presentation is intentionally brief, and we therefore do not cover the theoretical underpinnings of each method in great detail. For further background on automatic histogram procedures and the theory behind them, we recommend the excellent reviews of Birgé and Rozenholc (2006) and Davies et al. (2009).

For ease of exposition, we present all methods in the context of estimating the density of a sample $\boldsymbol{x} = (x_1, x_2, \ldots, x_n)$ on the unit interval; the procedures extend to other compact intervals through a suitable affine transformation. In particular, if a density estimate with support $[a,b]$ is desired, we can map the data to the unit interval via $z_i = (x_i - a)/(b-a)$, apply the methods to the transformed sample, and rescale the resulting density estimate back to $[a,b]$. In cases where the support of the density is unknown, a natural choice is $a = x_{(1)}$ and $b = x_{(n)}$. Cases where only the lower or upper bound is known can be handled similarly. The transformation used to construct the histogram can be controlled through the support keyword, where the default argument support=(-Inf, Inf) uses the order statistics-based approach described above.
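To make the transformation concrete, here is a minimal plain-Julia sketch of the rescaling idea (an illustration only; the names f0 and f are hypothetical and not part of the package API):

```julia
# Map a sample with known support [a, b] to the unit interval and
# rescale a density estimate on [0, 1] back to [a, b].
a, b = 0.0, 10.0
x = [1.2, 3.4, 5.6, 7.8]        # hypothetical sample on [a, b]
z = (x .- a) ./ (b - a)         # transformed sample, lies in [0, 1]

# If f0 is a density estimate on [0, 1], the corresponding estimate
# on [a, b] is f(t) = f0((t - a)/(b - a)) / (b - a).
f0 = u -> 1.0                   # placeholder: uniform density on [0, 1]
f = t -> f0((t - a) / (b - a)) / (b - a)
```

Since the rescaled estimate integrates to one over $[a,b]$ whenever f0 integrates to one over $[0,1]$, the result is again a density.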

Notation

Before we describe the methods included here in more detail, we introduce some notation. We let $\mathcal{I} = (\mathcal{I}_1, \mathcal{I}_2, \ldots, \mathcal{I}_k)$ denote a partition of $[0,1]$ into $k$ intervals and write $|\mathcal{I}_j|$ for the length of interval $\mathcal{I}_j$. The intervals in the partition $\mathcal{I}$ can be either right- or left-closed. Whether a left- or right-closed partition is used to draw the histogram is controlled by the keyword argument closed, with options :left and :right (default). This choice is somewhat arbitrary, but is unlikely to matter much in practical applications.

Based on a partition $\mathcal{I}$, we can write down the corresponding histogram density estimate by

\[\widehat{f}(x) = \sum_{j=1}^k \frac{\widehat{\theta}_j}{|\mathcal{I}_j|}\mathbf{1}_{\mathcal{I}_j}(x), \quad x\in [0,1],\]

where $\mathbf{1}_{\mathcal{I}_j}$ is the indicator function, $\widehat{\theta}_j \geq 0$ for all $j$ and $\sum_{j=1}^k \widehat{\theta}_j = 1$.

For most of the methods considered here, the estimated bin probabilities are the maximum likelihood estimates $\widehat{\theta}_j = N_j/n$, where $N_j = \sum_{i=1}^n \mathbf{1}_{\mathcal{I}_j}(x_i)$ is the number of observations landing in interval $\mathcal{I}_j$. The exceptions to this rule are the two Bayesian approaches, which use the Bayes estimator $\widehat{\theta}_j = (a_j + N_j)/(a+n)$, where $(a_1, \ldots, a_k) \in (0,\infty)^k$ and $a = \sum_{j=1}^k a_j$.
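As a concrete illustration of these two estimators, the following plain-Julia sketch computes the bin counts and both sets of estimated bin probabilities for a regular right-closed partition (a standalone example, independent of the package internals):

```julia
# Bin counts N_j for a regular partition of [0, 1] into k right-closed
# bins, together with the maximum likelihood and Bayes estimates.
x = [0.12, 0.25, 0.37, 0.41, 0.58, 0.62, 0.79, 0.83, 0.91, 0.97]
n, k = length(x), 4
N = zeros(Int, k)
for xi in x
    j = clamp(ceil(Int, xi * k), 1, k)  # bin index for right-closed bins
    N[j] += 1
end
θ_ml = N ./ n                           # maximum likelihood: N_j / n
a = 5.0                                 # Dirichlet concentration, a_j = a/k
θ_bayes = (a / k .+ N) ./ (a + n)       # Bayes: (a_j + N_j) / (a + n)
```

Both sets of estimates sum to one, with the Bayes estimator shrinking the empirical proportions toward the uniform distribution.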

The goal of an automatic histogram procedure is to find a partition $\mathcal{I}$ based on the sample alone which produces a reasonable density estimate. Regular histogram procedures only consider regular partitions, where all intervals in the partition are of equal length, so that one only needs to determine the number $k$ of bins. Irregular histograms allow for partitions with intervals of unequal length, and try to determine both the number of bins and the locations of the cutpoints between the intervals.

A short note on using different rules

To fit a histogram using a specific rule, we call fit(AutomaticHistogram, x, rule), where x is the data vector. For many of the rules discussed below, additional rule-specific keyword arguments can be passed to rule, providing finer control over the chosen method when desired. We also provide sensible default values for these parameters, so that the user may for instance call fit(AutomaticHistogram, x, AIC()) to fit a regular histogram using the AIC criterion without having to pass any keyword arguments explicitly.

Irregular histograms

The following section describes all the irregular histogram rules implemented in AutoHist.jl. In each case, we seek the best partition according to a goodness-of-fit criterion among all interval partitions of the unit interval whose cut points belong to a given discrete mesh $\{\tau_{j}\colon 0\leq j \leq k_n\}$, so that the candidate cut points form a set of cardinality $k_n-1$.
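The mesh itself can be built in several ways, controlled by the grid keyword of the irregular rules below. A rough plain-Julia sketch of the three constructions (illustrative only; the package's exact grids may differ in detail):

```julia
using Statistics  # for quantile

x = sort(rand(100))      # hypothetical sample on [0, 1]
kn = 10                  # number of mesh intervals

# grid = :regular — a fine regular grid of k_n + 1 points
mesh_regular = collect(range(0.0, 1.0, length = kn + 1))

# grid = :quantile — mesh points placed at sample quantiles
mesh_quantile = quantile(x, range(0.0, 1.0, length = kn + 1))

# grid = :data — each unique data point becomes a grid point
mesh_data = unique(x)
```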

Random irregular histogram

AutoHist.RIH — Type
RIH(;
    a::Real,
    logprior::Function=k->0.0,
    grid::Symbol=:regular,
    maxbins::Union{Int, Symbol}=:default,
    alg::AbstractAlgorithm=SegNeig()
)

The random irregular histogram criterion.

Consists of maximizing the marginal log-posterior of the partition $\mathcal{I} = (\mathcal{I}_1, \ldots, \mathcal{I}_k)$,

\[ \sum_{j=1}^k \big\{\log \Gamma(a_j + N_j) - \log \Gamma(a_j) - N_j\log|\mathcal{I}_j|\big\} + \log p_n(k) - \log \binom{k_n-1}{k-1}\]

Here $p_n(k)$ is the prior distribution on the number $k$ of bins, which can be controlled by supplying a function to the logprior keyword argument; the default is $p_n(k) \propto 1$. The weights are $a_j = a/k$ for a scalar $a > 0$ not depending on $k$.

Keyword arguments

  • a: Specifies the Dirichlet concentration parameter in the Bayesian histogram model. Must be a fixed positive number; defaults to a=5.0.
  • logprior: Unnormalized log-prior distribution on the number $k$ of bins. Defaults to a uniform prior, i.e. logprior(k) = 0 for all k.
  • grid: Symbol indicating how the finest possible mesh should be constructed. Options are :data, which uses each unique data point as a grid point, :regular (default) which constructs a fine regular grid, and :quantile which constructs the grid based on the sample quantiles.
  • maxbins: Maximal number of bins for which the above criterion is evaluated. Defaults to maxbins=:default, which sets maxbins to the ceil of min(1000, 4n/log(n)^2) if grid is regular or quantile. Ignored if grid=:data.
  • alg: Algorithm used to fit the model. Currently, only SegNeig is supported for this rule.

Examples

julia> x = (1.0 .- (1.0 .- LinRange(0.0, 1.0, 500)) .^(1/3)).^(1/3);

julia> rule = RIH(a = 5.0, logprior = k-> -log(k), grid = :data);

julia> fit(AutomaticHistogram, x, rule)
AutomaticHistogram
breaks: [0.0, 0.18220071105959446, 0.3587941358096334, 0.8722292888743843, 1.0]
density: [0.11858322346056327, 0.6490600487586273, 1.6066011289666577, 0.30436411114439915]
counts: [10, 57, 414, 19]
type: irregular
closed: right
a: 5.0

References

This approach to irregular histograms first appeared in Simensen et al. (2025).

source

Rozenholc, Mildenberger & Gather penalty A

AutoHist.RMG_penA — Type
RMG_penA(;
    grid::Symbol=:regular,
    maxbins::Union{Int, Symbol}=:default,
    alg::AbstractAlgorithm=SegNeig()
)

Consists of finding the partition $\mathcal{I}$ that maximizes a penalized log-likelihood,

\[ \sum_{j=1}^k N_j \log (N_j/|\mathcal{I}_j|) - \log \binom{k_n-1}{k-1} - k - 2\log(k) - \sqrt{2(k-1)\Big[\log \binom{k_n-1}{k-1}+ 2\log(k)\Big]}.\]

Keyword arguments

  • grid: Symbol indicating how the finest possible mesh should be constructed. Options are :data, which uses each unique data point as a grid point, :regular (default) which constructs a fine regular grid, and :quantile which constructs the grid based on the sample quantiles.
  • maxbins: Maximal number of bins for which the above criterion is evaluated. Defaults to maxbins=:default, which sets maxbins to the ceil of min(1000, 4n/log(n)^2) if grid is regular or quantile. Ignored if grid=:data.
  • alg: Algorithm used to fit the model. Currently, only SegNeig is supported for this rule.

Examples

julia> x = (1.0 .- (1.0 .- LinRange(0.0, 1.0, 500)) .^(1/3)).^(1/3);

julia> rule = RMG_penA(grid = :data);

julia> fit(AutomaticHistogram, x, rule)
AutomaticHistogram
breaks: [0.0, 0.18875598171056715, 0.3644223879547405, 0.8696799410193466, 1.0]
density: [0.116552597701164, 0.6717277510418354, 1.6229346697072502, 0.30693663211078037]
counts: [11, 59, 410, 20]
type: irregular
closed: right
a: NaN

References

This approach was suggested by Rozenholc et al. (2010).

source

Rozenholc, Mildenberger & Gather penalty B

AutoHist.RMG_penB — Type
RMG_penB(;
    grid::Symbol=:regular,
    maxbins::Union{Int, Symbol}=:default,
    alg::AbstractAlgorithm=SegNeig()
)

Consists of finding the partition $\mathcal{I}$ that maximizes a penalized log-likelihood,

\[ \sum_{j=1}^k N_j \log (N_j/|\mathcal{I}_j|) - \log \binom{k_n-1}{k-1} - k - \log^{2.5}(k).\]

Keyword arguments

  • grid: Symbol indicating how the finest possible mesh should be constructed. Options are :data, which uses each unique data point as a grid point, :regular (default) which constructs a fine regular grid, and :quantile which constructs the grid based on the sample quantiles.
  • maxbins: Maximal number of bins for which the above criterion is evaluated. Defaults to maxbins=:default, which sets maxbins to the ceil of min(1000, 4n/log(n)^2) if grid is regular or quantile. Ignored if grid=:data.
  • alg: Algorithm used to fit the model. Currently, only SegNeig is supported for this rule.

Examples

julia> x = (1.0 .- (1.0 .- LinRange(0.0, 1.0, 500)) .^(1/3)).^(1/3);

julia> rule = RMG_penB(grid = :data);

julia> fit(AutomaticHistogram, x, rule)
AutomaticHistogram
breaks: [0.0, 0.1948931612779725, 0.375258352661302, 0.8268306249022703, 0.9222490305512866, 1.0]
density: [0.12314439276691318, 0.7096713008662634, 1.6962954704872724, 0.7545714006671028, 0.12861575966067232]
counts: [12, 64, 383, 36, 5]
type: irregular
closed: right
a: NaN

References

This approach was suggested by Rozenholc et al. (2010).

source

Rozenholc, Mildenberger & Gather penalty R

AutoHist.RMG_penR — Type
RMG_penR(;
    grid::Symbol=:regular,
    maxbins::Union{Int, Symbol}=:default,
    alg::AbstractAlgorithm=SegNeig()
)

Consists of finding the partition $\mathcal{I}$ that maximizes a penalized log-likelihood,

\[ \sum_{j=1}^k \big\{N_j \log (N_j/|\mathcal{I}_j|) - \frac{N_j}{2n}\big\} - \log \binom{k_n-1}{k-1} - \log^{2.5}(k).\]

Keyword arguments

  • grid: Symbol indicating how the finest possible mesh should be constructed. Options are :data, which uses each unique data point as a grid point, :regular (default) which constructs a fine regular grid, and :quantile which constructs the grid based on the sample quantiles.
  • maxbins: Maximal number of bins for which the above criterion is evaluated. Defaults to maxbins=:default, which sets maxbins to the ceil of min(1000, 4n/log(n)^2) if grid is regular or quantile. Ignored if grid=:data.
  • alg: Algorithm used to fit the model. Currently, only SegNeig is supported for this rule.

Examples

julia> x = (1.0 .- (1.0 .- LinRange(0.0, 1.0, 500)) .^(1/3)).^(1/3);

julia> rule = RMG_penR(grid = :data);

julia> fit(AutomaticHistogram, x, rule)
AutomaticHistogram
breaks: [0.0, 0.18875598171056715, 0.3699070396003733, 0.8285645195146814, 0.9222490305512866, 1.0]
density: [0.116552597701164, 0.6845115973621804, 1.6875338000474953, 0.7471886144834441, 0.12861575966067232]
counts: [11, 62, 387, 35, 5]
type: irregular
closed: right
a: NaN

References

This approach was suggested by Rozenholc et al. (2010).

source

Irregular $L_2$ leave-one-out cross-validation (L2CV_I)

AutoHist.L2CV_I — Type
L2CV_I(;
    grid::Symbol=:regular,
    maxbins::Union{Int, Symbol}=:default,
    alg::AbstractAlgorithm=OptPart(),
    use_min_length::Bool=false
)

Consists of finding the partition $\mathcal{I}$ that maximizes an $L_2$ leave-one-out cross-validation criterion,

\[ \frac{n+1}{n}\sum_{j=1}^k \frac{N_j^2}{|\mathcal{I}_j|} - 2\sum_{j=1}^k \frac{N_j}{|\mathcal{I}_j|}.\]

Keyword arguments

  • grid: Symbol indicating how the finest possible mesh should be constructed. Options are :data, which uses each unique data point as a grid point, :regular (default) which constructs a fine regular grid, and :quantile which constructs the grid based on the sample quantiles.
  • maxbins: Maximal number of bins for which the above criterion is evaluated. Defaults to maxbins=:default, which sets maxbins to the ceil of min(1000, 4n/log(n)^2) if grid is regular or quantile. Ignored if grid=:data.
  • alg: Algorithm used to fit the model. Currently, OptPart and SegNeig are supported for this rule, with the former algorithm being the default.
  • use_min_length: Boolean indicating whether or not to impose a restriction on the minimum bin length of the histogram. If set to true, the smallest allowed bin length is set to (maximum(x)-minimum(x))/n*log(n)^(1.5).

Examples

julia> x = (1.0 .- (1.0 .- LinRange(0.0, 1.0, 500)) .^(1/3)).^(1/3);

julia> rule = L2CV_I(grid = :data, use_min_length=true);

julia> fit(AutomaticHistogram, x, rule)
AutomaticHistogram
breaks: [0.0, 0.149647045210915, 0.2499005080461325, 0.3490626376697454, 0.4600140220788484, 0.7765683248449301, 0.8535131937737716, 0.9121099383916996, 0.9560732934980348, 1.0]
density: [0.08018868653963065, 0.3590898407087615, 0.7664216197097746, 1.2798398213438569, 1.8448651460332468, 1.2476465466304287, 0.6826317786220794, 0.2729545998246794, 0.045530388214070294]
counts: [6, 18, 38, 71, 292, 48, 20, 6, 1]
type: irregular
closed: right
a: NaN

References

This approach dates back to Rudemo (1982).

source

Irregular Kullback-Leibler leave-one-out cross-validation (KLCV_I)

AutoHist.KLCV_I — Type
KLCV_I(;
    grid::Symbol=:regular,
    maxbins::Union{Int, Symbol}=:default,
    alg::AbstractAlgorithm=OptPart(),
    use_min_length::Bool=false
)

Consists of finding the partition $\mathcal{I}$ that maximizes a Kullback-Leibler leave-one-out cross-validation criterion,

\[ \sum_{j=1}^k N_j\log(N_j-1) - \sum_{j=1}^k N_j\log |\mathcal{I}_j|,\]

where the maximization is over all partitions with $N_j \geq 2$ for all $j$.

Keyword arguments

  • grid: Symbol indicating how the finest possible mesh should be constructed. Options are :data, which uses each unique data point as a grid point, :regular (default) which constructs a fine regular grid, and :quantile which constructs the grid based on the sample quantiles.
  • maxbins: Maximal number of bins for which the above criterion is evaluated. Defaults to maxbins=:default, which sets maxbins to the ceil of min(1000, 4n/log(n)^2) if grid is regular or quantile. Ignored if grid=:data.
  • alg: Algorithm used to fit the model. Currently, OptPart and SegNeig are supported for this rule, with the former algorithm being the default.
  • use_min_length: Boolean indicating whether or not to impose a restriction on the minimum bin length of the histogram. If set to true, the smallest allowed bin length is set to (maximum(x)-minimum(x))/n*log(n)^(1.5).

Examples

julia> x = (1.0 .- (1.0 .- LinRange(0.0, 1.0, 500)) .^(1/3)).^(1/3);

julia> rule = KLCV_I(grid = :data, use_min_length=true);

julia> fit(AutomaticHistogram, x, rule)
AutomaticHistogram
breaks: [0.0, 0.13888886265725095, 0.23836051747480758, 0.33883651300547, 0.45084951551151237, 0.7900230337213711, 0.8722292888743843, 0.9352920770792058, 1.0]
density: [0.07200001359848368, 0.321699684786505, 0.7165890680628054, 1.2319998295961743, 1.8220762141507212, 1.1191362485586687, 0.5074307830485909, 0.09272434856770642]
counts: [5, 16, 36, 69, 309, 46, 16, 3]
type: irregular
closed: right
a: NaN

References

This approach to irregular histograms was, to the best of our knowledge, first considered in Simensen et al. (2025).

source

Normalized maximum likelihood, irregular (NML_I)

AutoHist.NML_I — Type
NML_I(;
    grid::Symbol=:regular,
    maxbins::Union{Int, Symbol}=:default,
    alg::AbstractAlgorithm=DP()
)

A quick-to-evaluate version of the normalized maximum likelihood criterion.

Consists of finding the partition $\mathcal{I}$ that maximizes a penalized log-likelihood,

\[\begin{aligned} &\sum_{j=1}^k N_j\log \frac{N_j}{|\mathcal{I}_j|} - \frac{k-1}{2}\log(n/2) - \log\frac{\sqrt{\pi}}{\Gamma(k/2)} - n^{-1/2}\frac{\sqrt{2}k\Gamma(k/2)}{3\Gamma(k/2-1/2)} \\ &- n^{-1}\left(\frac{3+k(k-2)(2k+1)}{36} - \frac{\Gamma(k/2)^2 k^2}{9\Gamma(k/2-1/2)^2} \right) - \log \binom{k_n-1}{k-1} \end{aligned}\]

Keyword arguments

  • grid: Symbol indicating how the finest possible mesh should be constructed. Options are :data, which uses each unique data point as a grid point, :regular (default) which constructs a fine regular grid, and :quantile which constructs the grid based on the sample quantiles.
  • maxbins: Maximal number of bins for which the above criterion is evaluated. Defaults to maxbins=:default, which sets maxbins to the ceil of min(1000, 4n/log(n)^2) if grid is regular or quantile. Ignored if grid=:data.
  • alg: Algorithm used to fit the model. Defaults to DP() for this rule.

Examples

julia> x = (1.0 .- (1.0 .- LinRange(0.0, 1.0, 500)) .^(1/3)).^(1/3);

julia> rule = NML_I(grid = :data);

julia> fit(AutomaticHistogram, x, rule)
AutomaticHistogram
breaks: [0.0, 0.18875598171056715, 0.3644223879547405, 0.8696799410193466, 1.0]
density: [0.116552597701164, 0.6717277510418354, 1.6229346697072502, 0.30693663211078037]
counts: [11, 59, 410, 20]
type: irregular
closed: right
a: NaN

References

This is a variant of the criterion first suggested by Kontkanen and Myllymäki (2007).

source

Regular histograms

The following section details how each value of the rule argument selects the number $k$ of bins for a regular histogram automatically, based on a random sample. In the following, $\mathcal{I} = (\mathcal{I}_1, \mathcal{I}_2, \ldots, \mathcal{I}_k)$ is the corresponding partition of $[0,1]$ into $k$ equal-length bins. When the number of bins is chosen by maximizing a criterion, we look for the best regular partition among all those consisting of no more than $k_n$ bins. For rules falling under this umbrella, $k_n$ can be controlled through the maxbins keyword, as detailed below.

Random regular histogram (RRH), Knuth

AutoHist.RRH — Type
RRH(; a::Union{Real, Function}, logprior::Function, maxbins::Union{Int, Symbol}=:default)
Knuth(; maxbins::Union{Int, Symbol}=:default)

The random regular histogram criterion.

The number $k$ of bins is chosen as the maximizer of the marginal log-posterior,

\[ n\log (k) + \sum_{j=1}^k \big\{\log \Gamma(a_j + N_j) - \log \Gamma(a_j)\big\} + \log p_n(k).\]

Here $p_n(k)$ is the prior distribution on the number $k$ of bins, which can be controlled by supplying a function to the logprior keyword argument; the default is $p_n(k) \propto 1$. The weights are $a_j = a/k$ for a scalar $a > 0$, possibly depending on $k$. The value of $a$ can be set by supplying a fixed, positive scalar or a function $a(k)$ to the keyword argument a; the default is a=5.0 for RRH().

The rule Knuth() is a special case of the RRH criterion, which corresponds to the particular choices $a_j = 0.5$ and $p_n(k)\propto 1$.

Keyword arguments

  • a: Specifies the Dirichlet concentration parameter in the Bayesian histogram model. Can either be a fixed positive number or a function computing aₖ for different values of k. Defaults to 5.0 if not supplied.
  • logprior: Unnormalized log-prior distribution on the number $k$ of bins. Defaults to a uniform prior, i.e. logprior(k) = 0 for all k.
  • maxbins: Maximal number of bins for which the above criterion is evaluated. Defaults to maxbins=:default, which sets maxbins to the ceil of min(1000, 4n/log(n)^2).

Examples

julia> x = (1.0 .- (1.0 .- LinRange(0.0, 1.0, 500)) .^(1/3)).^(1/3);

julia> rule = RRH(a = k->0.5*k, logprior = k->0.0);

julia> h = fit(AutomaticHistogram, x, rule)
AutomaticHistogram
breaks: LinRange{Float64}(0.0, 1.0, 11)
density: [0.04950495049504951, 0.2079207920792079, 0.5643564356435643, 1.0, 1.495049504950495, 1.8712871287128714, 1.9702970297029703, 1.6732673267326732, 0.9603960396039604, 0.2079207920792079]
counts: [2, 10, 28, 50, 75, 94, 99, 84, 48, 10]
type: regular
closed: right
a: 5.0

julia> h == fit(AutomaticHistogram, x, Knuth())
true

References

The Knuth criterion for histograms was proposed by Knuth (2019). The random regular histogram criterion is a generalization.

source

AIC

AutoHist.AIC — Type
AIC(; maxbins::Union{Int, Symbol}=:default)

AIC criterion for regular histograms.

The number $k$ of bins is chosen as the maximizer of the penalized log-likelihood,

\[ n\log (k) + \sum_{j=1}^k N_j \log (N_j/n) - k,\]

where $n$ is the sample size.

Keyword arguments

  • maxbins: Maximal number of bins for which the above criterion is evaluated. Defaults to maxbins=:default, which sets maxbins to the ceil of min(1000, 4n/log(n)^2).

Examples

julia> x = (1.0 .- (1.0 .- LinRange(0.0, 1.0, 500)) .^(1/3)).^(1/3);

julia> fit(AutomaticHistogram, x, AIC())
AutomaticHistogram
breaks: LinRange{Float64}(0.0, 1.0, 11)
density: [0.04, 0.2, 0.56, 1.0, 1.5, 1.88, 1.98, 1.68, 0.96, 0.2]
counts: [2, 10, 28, 50, 75, 94, 99, 84, 48, 10]
type: regular
closed: right
a: NaN

References

The AIC criterion for histograms was proposed by Taylor (1987).

source

BIC

AutoHist.BIC — Type
BIC(; maxbins::Union{Int, Symbol}=:default)

BIC criterion for regular histograms.

The number $k$ of bins is chosen as the maximizer of the penalized log-likelihood,

\[ n\log (k) + \sum_{j=1}^k N_j \log (N_j/n) - \frac{k}{2}\log(n),\]

where $n$ is the sample size.

Keyword arguments

  • maxbins: Maximal number of bins for which the above criterion is evaluated. Defaults to maxbins=:default, which sets maxbins to the ceil of min(1000, 4n/log(n)^2).

Examples

julia> x = (1.0 .- (1.0 .- LinRange(0.0, 1.0, 500)) .^(1/3)).^(1/3);

julia> fit(AutomaticHistogram, x, BIC())
AutomaticHistogram
breaks: LinRange{Float64}(0.0, 1.0, 9)
density: [0.048, 0.336, 0.816, 1.44, 1.904, 1.904, 1.264, 0.288]
counts: [3, 21, 51, 90, 119, 119, 79, 18]
type: regular
closed: right
a: NaN
source

Birgé-Rozenholc (BR)

AutoHist.BR — Type
BR(; maxbins::Union{Int, Symbol}=:default)

Birgé-Rozenholc criterion for regular histograms.

The number $k$ of bins is chosen as the maximizer of the penalized log-likelihood,

\[ n\log (k) + \sum_{j=1}^k N_j \log (N_j/n) - k - \log^{2.5} (k),\]

where $n$ is the sample size.

Keyword arguments

  • maxbins: Maximal number of bins for which the above criterion is evaluated. Defaults to maxbins=:default, which sets maxbins to the ceil of min(1000, 4n/log(n)^2).

Examples

julia> x = (1.0 .- (1.0 .- LinRange(0.0, 1.0, 500)) .^(1/3)).^(1/3);

julia> fit(AutomaticHistogram, x, BR())
AutomaticHistogram
breaks: LinRange{Float64}(0.0, 1.0, 11)
density: [0.04, 0.2, 0.56, 1.0, 1.5, 1.88, 1.98, 1.68, 0.96, 0.2]
counts: [2, 10, 28, 50, 75, 94, 99, 84, 48, 10]
type: regular
closed: right
a: NaN

References

This criterion was proposed by Birgé and Rozenholc (2006).

source

Regular $L_2$ leave-one-out cross-validation (L2CV_R)

AutoHist.L2CV_R — Type
L2CV_R(; maxbins::Union{Int, Symbol}=:default)

$L_2$ cross-validation criterion for regular histograms.

The number $k$ of bins is chosen by maximizing a leave-one-out $L_2$ cross-validation criterion,

\[ -2k + k\frac{n+1}{n^2}\sum_{j=1}^k N_j^2,\]

where $n$ is the sample size.

Keyword arguments

  • maxbins: Maximal number of bins for which the above criterion is evaluated. Defaults to maxbins=:default, which sets maxbins to the ceil of min(1000, 4n/log(n)^2).

Examples

julia> x = (1.0 .- (1.0 .- LinRange(0.0, 1.0, 500)) .^(1/3)).^(1/3);

julia> fit(AutomaticHistogram, x, L2CV_R())
AutomaticHistogram
breaks: LinRange{Float64}(0.0, 1.0, 11)
density: [0.04, 0.2, 0.56, 1.0, 1.5, 1.88, 1.98, 1.68, 0.96, 0.2]
counts: [2, 10, 28, 50, 75, 94, 99, 84, 48, 10]
type: regular
closed: right
a: NaN

References

This approach to histogram density estimation was first considered by Rudemo (1982).

source

Regular Kullback-Leibler leave-one-out cross-validation (KLCV_R)

AutoHist.KLCV_R — Type
KLCV_R(; maxbins::Union{Int, Symbol}=:default)

Kullback-Leibler cross-validation criterion for regular histograms.

The number $k$ of bins is chosen by maximizing a leave-one-out Kullback-Leibler cross-validation criterion,

\[ n\log(k) + \sum_{j=1}^k N_j\log (N_j-1),\]

where $n$ is the sample size and the maximization is over all regular partitions with $N_j \geq 2$ for all $j$.

Keyword arguments

  • maxbins: Maximal number of bins for which the above criterion is evaluated. Defaults to maxbins=:default, which sets maxbins to the ceil of min(1000, 4n/log(n)^2).

Examples

julia> x = (1.0 .- (1.0 .- LinRange(0.0, 1.0, 500)) .^(1/3)).^(1/3);

julia> fit(AutomaticHistogram, x, KLCV_R())
AutomaticHistogram
breaks: LinRange{Float64}(0.0, 1.0, 11)
density: [0.04, 0.2, 0.56, 1.0, 1.5, 1.88, 1.98, 1.68, 0.96, 0.2]
counts: [2, 10, 28, 50, 75, 94, 99, 84, 48, 10]
type: regular
closed: right
a: NaN

References

This approach was first studied by Hall (1990).

source

Minimum description length (MDL)

AutoHist.MDL — Type
MDL(; maxbins::Union{Int, Symbol}=:default)

MDL criterion for regular histograms.

The number $k$ of bins is chosen as the minimizer of an encoding length of the data, which is equivalent to maximizing

\[ n\log(k) + \sum_{j=1}^k \big(N_j-\frac{1}{2}\big)\log\big(N_j-\frac{1}{2}\big) - \big(n-\frac{k}{2}\big)\log\big(n-\frac{k}{2}\big) - \frac{k}{2}\log(n),\]

where $n$ is the sample size and the maximization is over all regular partitions with $N_j \geq 1$ for all $j$.

Keyword arguments

  • maxbins: Maximal number of bins for which the above criterion is evaluated. Defaults to maxbins=:default, which sets maxbins to the ceil of min(1000, 4n/log(n)^2).

Examples

julia> x = (1.0 .- (1.0 .- LinRange(0.0, 1.0, 500)) .^(1/3)).^(1/3);

julia> fit(AutomaticHistogram, x, MDL())
AutomaticHistogram
breaks: LinRange{Float64}(0.0, 1.0, 11)
density: [0.04, 0.2, 0.56, 1.0, 1.5, 1.88, 1.98, 1.68, 0.96, 0.2]
counts: [2, 10, 28, 50, 75, 94, 99, 84, 48, 10]
type: regular
closed: right
a: NaN

References

The minimum description length principle was first applied to histogram estimation by Hall and Hannan (1988).

source

Normalized maximum likelihood, regular (NML_R)

AutoHist.NML_R — Type
NML_R(; maxbins::Union{Int, Symbol}=:default)

NML_R criterion for regular histograms.

The number $k$ of bins is chosen by maximizing a penalized likelihood,

\[\begin{aligned} &\sum_{j=1}^k N_j\log \frac{N_j}{|\mathcal{I}_j|} - \frac{k-1}{2}\log(n/2) - \log\frac{\sqrt{\pi}}{\Gamma(k/2)} - n^{-1/2}\frac{\sqrt{2}k\Gamma(k/2)}{3\Gamma(k/2-1/2)} \\ &- n^{-1}\left(\frac{3+k(k-2)(2k+1)}{36} - \frac{\Gamma(k/2)^2 k^2}{9\Gamma(k/2-1/2)^2} \right), \end{aligned}\]

where $n$ is the sample size.

Keyword arguments

  • maxbins: Maximal number of bins for which the above criterion is evaluated. Defaults to maxbins=:default, which sets maxbins to the ceil of min(1000, 4n/log(n)^2).

Examples

julia> x = (1.0 .- (1.0 .- LinRange(0.0, 1.0, 500)) .^(1/3)).^(1/3);

julia> fit(AutomaticHistogram, x, NML_R())
AutomaticHistogram
breaks: LinRange{Float64}(0.0, 1.0, 24)
density: [0.046, 0.0, 0.138, 0.184, 0.368, 0.506, 0.69, 0.874, 1.104, 1.334  …  1.978, 1.978, 1.978, 1.84, 1.61, 1.334, 0.966, 0.644, 0.276, 0.046]
counts: [1, 0, 3, 4, 8, 11, 15, 19, 24, 29  …  43, 43, 43, 40, 35, 29, 21, 14, 6, 1]
type: regular
closed: right
a: NaN

References

This is a regular variant of the normalized maximum likelihood criterion considered by Kontkanen and Myllymäki (2007).

source

Sturges' rule

AutoHist.Sturges — Type
Sturges()

Sturges' rule for regular histograms.

The number $k$ of bins is chosen as

\[ k = \lceil \log_2(n) \rceil + 1,\]

where $n$ is the sample size.

This is the default procedure used by the hist() function in base R.

Examples

julia> x = (1.0 .- (1.0 .- LinRange(0.0, 1.0, 500)) .^(1/3)).^(1/3);

julia> fit(AutomaticHistogram, x, Sturges())
AutomaticHistogram
breaks: LinRange{Float64}(0.0, 1.0, 10)
density: [0.054, 0.252, 0.666, 1.206, 1.71, 1.98, 1.782, 1.098, 0.252]
counts: [3, 14, 37, 67, 95, 110, 99, 61, 14]
type: regular
closed: right
a: NaN

References

This classical rule is due to Sturges (1926).

source

Freedman and Diaconis' rule

AutoHist.FD — Type
FD()

Freedman and Diaconis' rule for regular histograms.

The number $k$ of bins is computed according to the formula

\[ k = \big\lceil\frac{n^{1/3}}{2\text{IQR}(\boldsymbol{x})}\big\rceil,\]

where $\text{IQR}(\boldsymbol{x})$ is the sample interquartile range and $n$ is the sample size.

This is the default procedure used by the histogram() function in Plots.jl.
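The formula is straightforward to evaluate directly. The following sketch (using the standard-library Statistics package) computes $k$ for the sample used in the example below; for this sample the formula yields $k = 15$, in agreement with the 15 bins of the fitted histogram:

```julia
using Statistics  # for quantile

# Sample from the example below
x = (1.0 .- (1.0 .- LinRange(0.0, 1.0, 500)) .^ (1/3)) .^ (1/3)
n = length(x)
iqr = quantile(x, 0.75) - quantile(x, 0.25)  # sample interquartile range
k = ceil(Int, n^(1/3) / (2 * iqr))           # Freedman-Diaconis bin count
```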

Examples

julia> x = (1.0 .- (1.0 .- LinRange(0.0, 1.0, 500)) .^(1/3)).^(1/3);

julia> fit(AutomaticHistogram, x, FD())
AutomaticHistogram
breaks: LinRange{Float64}(0.0, 1.0, 16)
density: [0.03, 0.09, 0.24, 0.48, 0.78, 1.08, 1.44, 1.71, 1.92, 2.01, 1.89, 1.59, 1.08, 0.54, 0.12]
counts: [1, 3, 8, 16, 26, 36, 48, 57, 64, 67, 63, 53, 36, 18, 4]
type: regular
closed: right
a: NaN

References

This rule dates back to Freedman and Diaconis (1982).

source

Scott's rule

AutoHist.Scott — Type
Scott()

Scott's rule for regular histograms.

The number $k$ of bins is computed according to the formula

\[ k = \big\lceil \hat{\sigma}^{-1}(24\sqrt{\pi})^{-1/3}n^{1/3}\big\rceil,\]

where $\hat{\sigma}$ is the sample standard deviation and $n$ is the sample size.
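As with the Freedman-Diaconis rule, the bin count can be computed directly; note that $(24\sqrt{\pi})^{1/3} \approx 3.49$, recovering the familiar constant in Scott's bin-width formula. For the sample used in the example below this gives $k = 13$, matching the fitted histogram:

```julia
using Statistics  # for std

# Sample from the example below
x = (1.0 .- (1.0 .- LinRange(0.0, 1.0, 500)) .^ (1/3)) .^ (1/3)
n = length(x)
σ̂ = std(x)                      # sample standard deviation
k = ceil(Int, n^(1/3) / ((24 * sqrt(pi))^(1/3) * σ̂))
```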

Examples

julia> x = (1.0 .- (1.0 .- LinRange(0.0, 1.0, 500)) .^(1/3)).^(1/3);

julia> fit(AutomaticHistogram, x, Scott())
AutomaticHistogram
breaks: LinRange{Float64}(0.0, 1.0, 14)
density: [0.026, 0.13, 0.338, 0.624, 0.988, 1.378, 1.716, 1.924, 2.002, 1.768, 1.3, 0.676, 0.13]
counts: [1, 5, 13, 24, 38, 53, 66, 74, 77, 68, 50, 26, 5]
type: regular
closed: right
a: NaN

References

This classical rule is due to Scott (1979).

source

Wand's rule

AutoHist.Wand — Type
Wand(; level::Int=2, scalest::Symbol=:minim)

Wand's rule for regular histograms.

A more sophisticated version of Scott's rule, Wand's rule proceeds by determining the bin width $h$ as

\[ h = \Big(\frac{6}{\hat{C}(f_0) n}\Big)^{1/3},\]

where $\hat{C}(f_0)$ is an estimate of the functional $C(f_0) = \int \{f_0'(x)\}^2\, \text{d}x$. The corresponding number of bins is $k = \lceil h^{-1}\rceil$.

Keyword arguments

  • level: Controls the number of stages of functional estimation used to compute $\hat{C}$, and can take the values 0, 1, 2, 3, 4, 5, with the default being level=2. The choice level=0 corresponds to a variation on Scott's rule with a custom scale estimate.
  • scalest: Estimate of the scale parameter. Possible choices are :minim, :stdev and :iqr. The latter two use the sample standard deviation or the sample interquartile range, respectively, to estimate the scale. The default choice :minim uses the minimum of the above estimates.

Examples

julia> x = (1.0 .- (1.0 .- LinRange(0.0, 1.0, 500)) .^(1/3)).^(1/3);

julia> rule = Wand(scalest=:stdev, level=5);

julia> fit(AutomaticHistogram, x, rule)
AutomaticHistogram
breaks: LinRange{Float64}(0.0, 1.0, 13)
density: [0.024, 0.144, 0.408, 0.72, 1.128, 1.536, 1.872, 1.992, 1.848, 1.416, 0.744, 0.168]
counts: [1, 6, 17, 30, 47, 64, 78, 83, 77, 59, 31, 7]
type: regular
closed: right
a: NaN

References

The full details on this method are given in Wand (1997).

source
