Supported Methods
This page provides background on each histogram method supported through the rule argument. Our presentation is deliberately brief, and we do not cover the theoretical underpinnings of each method in great detail. For further background on automatic histogram procedures and the theory behind them, we recommend the excellent reviews in Birgé and Rozenholc (2006) and Davies et al. (2009).
For ease of exposition, we present all methods in the context of estimating the density of a sample $\boldsymbol{x} = (x_1, x_2, \ldots, x_n)$ on the unit interval, but note that the procedures extend to other compact intervals through a suitable affine transformation. In particular, if a density estimate with support $[a,b]$ is desired, we can scale the data to the unit interval through $z_i = (x_i - a)/(b-a)$, apply the methods to this transformed sample, and rescale the resulting density estimate to $[a,b]$. In cases where the support of the density is unknown, a natural choice is $a = x_{(1)}$ and $b = x_{(n)}$. Cases where only the lower or upper bound is known can be handled similarly. The transformation used to construct the histogram can be controlled through the support keyword, where the default argument support=(-Inf, Inf) uses the order statistics-based approach described above.
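To make the rescaling step concrete, the following minimal plain-Julia sketch carries out the affine transformation and the back-transformation of a density estimate; the helper names are illustrative and not part of the AutoHist.jl API.

# Map a sample with support [a, b] to the unit interval via zᵢ = (xᵢ - a)/(b - a).
rescale_to_unit(x, a, b) = (x .- a) ./ (b - a)

# A density estimate on [0, 1] corresponds, on [a, b], to breakpoints
# a + (b - a)τ and heights divided by (b - a), since densities pick up
# the reciprocal of the scale factor under an affine change of variables.
rescale_breaks(breaks, a, b) = a .+ (b - a) .* breaks
rescale_density(density, a, b) = density ./ (b - a)

x = [2.3, 2.9, 3.7, 4.1, 5.0]
a, b = minimum(x), maximum(x)   # order statistics-based choice when the support is unknown
z = rescale_to_unit(x, a, b)    # transformed sample on [0, 1]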
Notation
Before we describe the methods included here in more detail, we introduce some notation. We let $\mathcal{I} = (\mathcal{I}_1, \mathcal{I}_2, \ldots, \mathcal{I}_k)$ denote a partition of $[0,1]$ into $k$ intervals and write $|\mathcal{I}_j|$ for the length of interval $\mathcal{I}_j$. The intervals in the partition $\mathcal{I}$ can be either right- or left-closed. Whether a left- or right-closed partition is used to draw the histogram is controlled by the keyword argument closed, with options :left and :right (default). This choice is somewhat arbitrary, but is unlikely to matter much in practical applications.
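As a usage sketch based on the description above (the exact placement of the closed keyword in the call is assumed here for illustration), a left-closed fit would look like:

julia> fit(AutomaticHistogram, x, AIC(); closed = :left);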
Based on a partition $\mathcal{I}$, we can write the corresponding histogram density estimate as
\[\widehat{f}(x) = \sum_{j=1}^k \frac{\widehat{\theta}_j}{|\mathcal{I}_j|}\mathbf{1}_{\mathcal{I}_j}(x), \quad x\in [0,1],\]
where $\mathbf{1}_{\mathcal{I}_j}$ is the indicator function, $\widehat{\theta}_j \geq 0$ for all $j$ and $\sum_{j=1}^k \widehat{\theta}_j = 1$.
For most of the methods considered here, the estimated bin probabilities are the maximum likelihood estimates $\widehat{\theta}_j = N_j/n$, where $N_j = \sum_{i=1}^n \mathbf{1}_{\mathcal{I}_j}(x_i)$ is the number of observations landing in interval $\mathcal{I}_j$. The exceptions to this rule are the two Bayesian approaches, which use the Bayes estimator $\widehat{\theta}_j = (a_j + N_j)/(a+n)$ for $(a_1, \ldots, a_k) \in (0,\infty)^k$ and $a = \sum_{j=1}^k a_j$ instead.
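As a small worked illustration, the following plain-Julia snippet computes the bin counts, the maximum likelihood estimates and the resulting histogram heights $\widehat{\theta}_j/|\mathcal{I}_j|$ for a fixed right-closed partition, along with the Bayes estimates for a given $a$; this is a standalone sketch, not package code.

# Breakpoints 0 = τ₀ < τ₁ < τ₂ < τ₃ = 1 defining the partition.
breaks = [0.0, 0.2, 0.5, 1.0]
x = [0.05, 0.1, 0.3, 0.4, 0.45, 0.7, 0.9, 0.95]
n, k = length(x), length(breaks) - 1

# Nⱼ = number of observations in the right-closed interval (τⱼ₋₁, τⱼ].
N = [count(xi -> breaks[j] < xi <= breaks[j+1], x) for j in 1:k]   # [2, 3, 3]

probs = N ./ n                     # maximum likelihood estimates θ̂ⱼ = Nⱼ/n
heights = probs ./ diff(breaks)    # histogram heights θ̂ⱼ/|Iⱼ|; integrates to one

# Bayes estimates with concentration parameters aⱼ = a/k, as used by the Bayesian rules below:
a = 5.0
probs_bayes = (a/k .+ N) ./ (a + n)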
The goal of an automatic histogram procedure is to find a partition $\mathcal{I}$ based on the sample alone which produces a reasonable density estimate. Regular histogram procedures only consider regular partitions, where all intervals in the partition are of equal length, so that one only needs to determine the number $k$ of bins. Irregular histograms allow for partitions with intervals of unequal length, and try to determine both the number of bins and the locations of the cutpoints between the intervals.
A short note on using different rules
In order to fit a histogram using a specific rule, we call fit(AutomaticHistogram, x, rule), where x is the data vector. For many of the rules discussed below, the user can specify additional rule-specific keywords to rule, providing finer control over the chosen method when desired. We also provide default values for these parameters, so that the user may for instance call fit(AutomaticHistogram, x, AIC()) to fit a regular histogram using the AIC criterion without explicitly passing any keyword arguments.
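As an illustration of the two calling styles (with x a data vector, as in the examples below):

julia> fit(AutomaticHistogram, x, AIC());                          # all keyword defaults

julia> rule = RIH(a = 5.0, logprior = k -> -log(k), grid = :data);

julia> fit(AutomaticHistogram, x, rule);                           # rule-specific keywords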
Irregular histograms
The following section describes all the irregular histogram rules implemented in AutoHist.jl. In each case, we attempt to find the best partition according to a goodness-of-fit criterion among all interval partitions of the unit interval whose cut points belong to a given discrete mesh $\{\tau_{j}\colon 0\leq j \leq k_n\}$, yielding $k_n-1$ candidate interior cut points.
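As an illustration of the candidate meshes that the grid keyword of the rules below selects between, here is a plain-Julia sketch of the three constructions; it mimics the described behaviour and is not the package's internal implementation.

using Statistics: quantile

z = rand(100)    # sample, already rescaled to [0, 1]
kn = 25          # mesh resolution

mesh_regular  = collect(range(0.0, 1.0, length = kn + 1))      # :regular — fine equidistant grid
mesh_quantile = quantile(z, range(0.0, 1.0, length = kn + 1))  # :quantile — grid at sample quantiles
mesh_data     = sort(unique(z))                                # :data — each unique data point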
Random irregular histogram
AutoHist.RIH — Type

RIH(;
    a::Real,
    logprior::Function=k->0.0,
    grid::Symbol=:regular,
    maxbins::Union{Int, Symbol}=:default,
    alg::AbstractAlgorithm=SegNeig()
)
The random irregular histogram criterion.
Consists of maximizing the marginal log-posterior of the partition $\mathcal{I} = (\mathcal{I}_1, \ldots, \mathcal{I}_k)$,
\[ \sum_{j=1}^k \big\{\log \Gamma(a_j + N_j) - \log \Gamma(a_j) - N_j\log|\mathcal{I}_j|\big\} + \log p_n(k) - \log \binom{k_n-1}{k-1}\]
Here $p_n(k)$ is the prior distribution on the number $k$ of bins, which can be controlled by supplying a function to the logprior keyword argument. The default value is $p_n(k) \propto 1$. The concentration parameters are $a_j = a/k$ for a scalar $a > 0$ not depending on $k$.
Keyword arguments
- a: Specifies the Dirichlet concentration parameter in the Bayesian histogram model. Must be a fixed positive number, and defaults to a=5.0.
- logprior: Unnormalized log-prior distribution on the number $k$ of bins. Defaults to a uniform prior, i.e. logprior(k) = 0 for all k.
- grid: Symbol indicating how the finest possible mesh should be constructed. Options are :data, which uses each unique data point as a grid point, :regular (default), which constructs a fine regular grid, and :quantile, which constructs the grid based on the sample quantiles.
- maxbins: Maximal number of bins for which the above criterion is evaluated. Defaults to maxbins=:default, which sets maxbins to the ceiling of min(1000, 4n/log(n)^2) if grid is :regular or :quantile. Ignored if grid=:data.
- alg: Algorithm used to fit the model. Currently, only SegNeig is supported for this rule.
Examples
julia> x = (1.0 .- (1.0 .- LinRange(0.0, 1.0, 500)) .^(1/3)).^(1/3);
julia> rule = RIH(a = 5.0, logprior = k-> -log(k), grid = :data);
julia> fit(AutomaticHistogram, x, rule)
AutomaticHistogram
breaks: [0.0, 0.18220071105959446, 0.3587941358096334, 0.8722292888743843, 1.0]
density: [0.11858322346056327, 0.6490600487586273, 1.6066011289666577, 0.30436411114439915]
counts: [10, 57, 414, 19]
type: irregular
closed: right
a: 5.0
References
This approach to irregular histograms first appeared in Simensen et al. (2025).
Rozenholc, Mildenberger & Gather penalty A
AutoHist.RMG_penA — Type

RMG_penA(;
    grid::Symbol=:regular,
    maxbins::Union{Int, Symbol}=:default,
    alg::AbstractAlgorithm=SegNeig()
)
Consists of finding the partition $\mathcal{I}$ that maximizes a penalized log-likelihood,
\[ \sum_{j=1}^k N_j \log (N_j/|\mathcal{I}_j|) - \log \binom{k_n-1}{k-1} - k - 2\log(k) - \sqrt{2(k-1)\Big[\log \binom{k_n-1}{k-1}+ 2\log(k)\Big]}.\]
Keyword arguments
- grid: Symbol indicating how the finest possible mesh should be constructed. Options are :data, which uses each unique data point as a grid point, :regular (default), which constructs a fine regular grid, and :quantile, which constructs the grid based on the sample quantiles.
- maxbins: Maximal number of bins for which the above criterion is evaluated. Defaults to maxbins=:default, which sets maxbins to the ceiling of min(1000, 4n/log(n)^2) if grid is :regular or :quantile. Ignored if grid=:data.
- alg: Algorithm used to fit the model. Currently, only SegNeig is supported for this rule.
Examples
julia> x = (1.0 .- (1.0 .- LinRange(0.0, 1.0, 500)) .^(1/3)).^(1/3);
julia> rule = RMG_penA(grid = :data);
julia> fit(AutomaticHistogram, x, rule)
AutomaticHistogram
breaks: [0.0, 0.18875598171056715, 0.3644223879547405, 0.8696799410193466, 1.0]
density: [0.116552597701164, 0.6717277510418354, 1.6229346697072502, 0.30693663211078037]
counts: [11, 59, 410, 20]
type: irregular
closed: right
a: NaN
References
This approach was suggested by Rozenholc et al. (2010).
Rozenholc, Mildenberger & Gather penalty B
AutoHist.RMG_penB — Type

RMG_penB(;
    grid::Symbol=:regular,
    maxbins::Union{Int, Symbol}=:default,
    alg::AbstractAlgorithm=SegNeig()
)
Consists of finding the partition $\mathcal{I}$ that maximizes a penalized log-likelihood,
\[ \sum_{j=1}^k N_j \log (N_j/|\mathcal{I}_j|) - \log \binom{k_n-1}{k-1} - k - \log^{2.5}(k).\]
Keyword arguments
- grid: Symbol indicating how the finest possible mesh should be constructed. Options are :data, which uses each unique data point as a grid point, :regular (default), which constructs a fine regular grid, and :quantile, which constructs the grid based on the sample quantiles.
- maxbins: Maximal number of bins for which the above criterion is evaluated. Defaults to maxbins=:default, which sets maxbins to the ceiling of min(1000, 4n/log(n)^2) if grid is :regular or :quantile. Ignored if grid=:data.
- alg: Algorithm used to fit the model. Currently, only SegNeig is supported for this rule.
Examples
julia> x = (1.0 .- (1.0 .- LinRange(0.0, 1.0, 500)) .^(1/3)).^(1/3);
julia> rule = RMG_penB(grid = :data);
julia> fit(AutomaticHistogram, x, rule)
AutomaticHistogram
breaks: [0.0, 0.1948931612779725, 0.375258352661302, 0.8268306249022703, 0.9222490305512866, 1.0]
density: [0.12314439276691318, 0.7096713008662634, 1.6962954704872724, 0.7545714006671028, 0.12861575966067232]
counts: [12, 64, 383, 36, 5]
type: irregular
closed: right
a: NaN
References
This approach was suggested by Rozenholc et al. (2010).
Rozenholc, Mildenberger & Gather penalty R
AutoHist.RMG_penR — Type

RMG_penR(;
    grid::Symbol=:regular,
    maxbins::Union{Int, Symbol}=:default,
    alg::AbstractAlgorithm=SegNeig()
)
Consists of finding the partition $\mathcal{I}$ that maximizes a penalized log-likelihood,
\[ \sum_{j=1}^k \big\{N_j \log (N_j/|\mathcal{I}_j|) - \frac{N_j}{2n}\big\} - \log \binom{k_n-1}{k-1} - \log^{2.5}(k).\]
Keyword arguments
- grid: Symbol indicating how the finest possible mesh should be constructed. Options are :data, which uses each unique data point as a grid point, :regular (default), which constructs a fine regular grid, and :quantile, which constructs the grid based on the sample quantiles.
- maxbins: Maximal number of bins for which the above criterion is evaluated. Defaults to maxbins=:default, which sets maxbins to the ceiling of min(1000, 4n/log(n)^2) if grid is :regular or :quantile. Ignored if grid=:data.
- alg: Algorithm used to fit the model. Currently, only SegNeig is supported for this rule.
Examples
julia> x = (1.0 .- (1.0 .- LinRange(0.0, 1.0, 500)) .^(1/3)).^(1/3);
julia> rule = RMG_penR(grid = :data);
julia> fit(AutomaticHistogram, x, rule)
AutomaticHistogram
breaks: [0.0, 0.18875598171056715, 0.3699070396003733, 0.8285645195146814, 0.9222490305512866, 1.0]
density: [0.116552597701164, 0.6845115973621804, 1.6875338000474953, 0.7471886144834441, 0.12861575966067232]
counts: [11, 62, 387, 35, 5]
type: irregular
closed: right
a: NaN
References
This approach was suggested by Rozenholc et al. (2010).
Irregular $L_2$ leave-one-out cross-validation (L2CV_I)
AutoHist.L2CV_I — Type

L2CV_I(;
    grid::Symbol=:regular,
    maxbins::Union{Int, Symbol}=:default,
    alg::AbstractAlgorithm=OptPart(),
    use_min_length::Bool=false
)
Consists of finding the partition $\mathcal{I}$ that maximizes an $L_2$ leave-one-out cross-validation criterion,
\[ \frac{n+1}{n}\sum_{j=1}^k \frac{N_j^2}{|\mathcal{I}_j|} - 2\sum_{j=1}^k \frac{N_j}{|\mathcal{I}_j|}.\]
Keyword arguments
- grid: Symbol indicating how the finest possible mesh should be constructed. Options are :data, which uses each unique data point as a grid point, :regular (default), which constructs a fine regular grid, and :quantile, which constructs the grid based on the sample quantiles.
- maxbins: Maximal number of bins for which the above criterion is evaluated. Defaults to maxbins=:default, which sets maxbins to the ceiling of min(1000, 4n/log(n)^2) if grid is :regular or :quantile. Ignored if grid=:data.
- alg: Algorithm used to fit the model. Currently, OptPart and SegNeig are supported for this rule, with the former algorithm being the default.
- use_min_length: Boolean indicating whether or not to impose a restriction on the minimum bin length of the histogram. If set to true, the smallest allowed bin length is set to (maximum(x)-minimum(x))/n*log(n)^(1.5).
Examples
julia> x = (1.0 .- (1.0 .- LinRange(0.0, 1.0, 500)) .^(1/3)).^(1/3);
julia> rule = L2CV_I(grid = :data, use_min_length=true);
julia> fit(AutomaticHistogram, x, rule)
AutomaticHistogram
breaks: [0.0, 0.149647045210915, 0.2499005080461325, 0.3490626376697454, 0.4600140220788484, 0.7765683248449301, 0.8535131937737716, 0.9121099383916996, 0.9560732934980348, 1.0]
density: [0.08018868653963065, 0.3590898407087615, 0.7664216197097746, 1.2798398213438569, 1.8448651460332468, 1.2476465466304287, 0.6826317786220794, 0.2729545998246794, 0.045530388214070294]
counts: [6, 18, 38, 71, 292, 48, 20, 6, 1]
type: irregular
closed: right
a: NaN
References
This approach dates back to Rudemo (1982).
Irregular Kullback-Leibler leave-one-out cross-validation (KLCV_I)
AutoHist.KLCV_I — Type

KLCV_I(;
    grid::Symbol=:regular,
    maxbins::Union{Int, Symbol}=:default,
    alg::AbstractAlgorithm=OptPart(),
    use_min_length::Bool=false
)
Consists of finding the partition $\mathcal{I}$ that maximizes a Kullback-Leibler leave-one-out cross-validation criterion,
\[ \sum_{j=1}^k N_j\log(N_j-1) - \sum_{j=1}^k N_j\log |\mathcal{I}_j|,\]
where the maximization is over all partitions with $N_j \geq 2$ for all $j$.
Keyword arguments
- grid: Symbol indicating how the finest possible mesh should be constructed. Options are :data, which uses each unique data point as a grid point, :regular (default), which constructs a fine regular grid, and :quantile, which constructs the grid based on the sample quantiles.
- maxbins: Maximal number of bins for which the above criterion is evaluated. Defaults to maxbins=:default, which sets maxbins to the ceiling of min(1000, 4n/log(n)^2) if grid is :regular or :quantile. Ignored if grid=:data.
- alg: Algorithm used to fit the model. Currently, OptPart and SegNeig are supported for this rule, with the former algorithm being the default.
- use_min_length: Boolean indicating whether or not to impose a restriction on the minimum bin length of the histogram. If set to true, the smallest allowed bin length is set to (maximum(x)-minimum(x))/n*log(n)^(1.5).
Examples
julia> x = (1.0 .- (1.0 .- LinRange(0.0, 1.0, 500)) .^(1/3)).^(1/3);
julia> rule = KLCV_I(grid = :data, use_min_length=true);
julia> fit(AutomaticHistogram, x, rule)
AutomaticHistogram
breaks: [0.0, 0.13888886265725095, 0.23836051747480758, 0.33883651300547, 0.45084951551151237, 0.7900230337213711, 0.8722292888743843, 0.9352920770792058, 1.0]
density: [0.07200001359848368, 0.321699684786505, 0.7165890680628054, 1.2319998295961743, 1.8220762141507212, 1.1191362485586687, 0.5074307830485909, 0.09272434856770642]
counts: [5, 16, 36, 69, 309, 46, 16, 3]
type: irregular
closed: right
a: NaN
References
This approach to irregular histograms was, to the best of our knowledge, first considered in Simensen et al. (2025).
Normalized maximum likelihood, irregular (NML_I)
AutoHist.NML_I — Type

NML_I(;
    grid::Symbol=:regular,
    maxbins::Union{Int, Symbol}=:default,
    alg::AbstractAlgorithm=DP()
)
A quick-to-evaluate version of the normalized maximum likelihood criterion.
Consists of finding the partition $\mathcal{I}$ that maximizes a penalized log-likelihood,
\[\begin{aligned} &\sum_{j=1}^k N_j\log \frac{N_j}{|\mathcal{I}_j|} - \frac{k-1}{2}\log(n/2) - \log\frac{\sqrt{\pi}}{\Gamma(k/2)} - n^{-1/2}\frac{\sqrt{2}k\Gamma(k/2)}{3\Gamma(k/2-1/2)} \\ &- n^{-1}\left(\frac{3+k(k-2)(2k+1)}{36} - \frac{\Gamma(k/2)^2 k^2}{9\Gamma(k/2-1/2)^2} \right) - \log \binom{k_n-1}{k-1} \end{aligned}\]
Keyword arguments
- grid: Symbol indicating how the finest possible mesh should be constructed. Options are :data, which uses each unique data point as a grid point, :regular (default), which constructs a fine regular grid, and :quantile, which constructs the grid based on the sample quantiles.
- maxbins: Maximal number of bins for which the above criterion is evaluated. Defaults to maxbins=:default, which sets maxbins to the ceiling of min(1000, 4n/log(n)^2) if grid is :regular or :quantile. Ignored if grid=:data.
- alg: Algorithm used to fit the model. Currently, only DP is supported for this rule.
Examples
julia> x = (1.0 .- (1.0 .- LinRange(0.0, 1.0, 500)) .^(1/3)).^(1/3);
julia> rule = NML_I(grid = :data);
julia> fit(AutomaticHistogram, x, rule)
AutomaticHistogram
breaks: [0.0, 0.18875598171056715, 0.3644223879547405, 0.8696799410193466, 1.0]
density: [0.116552597701164, 0.6717277510418354, 1.6229346697072502, 0.30693663211078037]
counts: [11, 59, 410, 20]
type: irregular
closed: right
a: NaN
References
This is a variant of the criterion first suggested by Kontkanen and Myllymäki (2007).
Regular histograms
The following section details how each value of the rule argument automatically selects the number $k$ of bins used to draw a regular histogram based on a random sample. In the following, $\mathcal{I} = (\mathcal{I}_1, \mathcal{I}_2, \ldots, \mathcal{I}_k)$ is the corresponding partition of $[0,1]$ consisting of $k$ equal-length bins. In cases where the number of bins is computed by maximizing an expression, we look for the best partition among all regular partitions consisting of no more than $k_n$ bins. For rules falling under this umbrella, $k_n$ can be controlled through the maxbins keyword, as detailed below.
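For instance, to restrict the search for the AIC criterion to regular partitions with at most 100 bins, one passes the maxbins keyword directly to the rule:

julia> fit(AutomaticHistogram, x, AIC(maxbins = 100));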
Random regular histogram (RRH), Knuth
AutoHist.RRH — Type

RRH(; a::Union{Real, Function}, logprior::Function, maxbins::Union{Int, Symbol}=:default)
Knuth(; maxbins::Union{Int, Symbol}=:default)
The random regular histogram criterion.
The number $k$ of bins is chosen as the maximizer of the marginal log-posterior,
\[ n\log (k) + \sum_{j=1}^k \big\{\log \Gamma(a_j + N_j) - \log \Gamma(a_j)\big\} + \log p_n(k).\]
Here $p_n(k)$ is the prior distribution on the number $k$ of bins, which can be controlled by supplying a function to the logprior keyword argument. The default value is $p_n(k) \propto 1$. The concentration parameters are $a_j = a/k$ for a scalar $a > 0$, possibly depending on $k$. The value of $a$ can be set by supplying a fixed positive scalar or a function $a(k)$ to the keyword argument a. The default value is a=5.0 for RRH().
The rule Knuth() is a special case of the RRH criterion, corresponding to the particular choices $a_j = 0.5$ and $p_n(k)\propto 1$.
Keyword arguments
- a: Specifies the Dirichlet concentration parameter in the Bayesian histogram model. Can either be a fixed positive number or a function computing aₖ for different values of k. Defaults to 5.0 if not supplied.
- logprior: Unnormalized log-prior distribution on the number $k$ of bins. Defaults to a uniform prior, i.e. logprior(k) = 0 for all k.
- maxbins: Maximal number of bins for which the above criterion is evaluated. Defaults to maxbins=:default, which sets maxbins to the ceiling of min(1000, 4n/log(n)^2).
Examples
julia> x = (1.0 .- (1.0 .- LinRange(0.0, 1.0, 500)) .^(1/3)).^(1/3);
julia> rule = RRH(a = k->0.5*k, logprior = k->0.0);
julia> h = fit(AutomaticHistogram, x, rule)
AutomaticHistogram
breaks: LinRange{Float64}(0.0, 1.0, 11)
density: [0.04950495049504951, 0.2079207920792079, 0.5643564356435643, 1.0, 1.495049504950495, 1.8712871287128714, 1.9702970297029703, 1.6732673267326732, 0.9603960396039604, 0.2079207920792079]
counts: [2, 10, 28, 50, 75, 94, 99, 84, 48, 10]
type: regular
closed: right
a: 5.0
julia> h == fit(AutomaticHistogram, x, Knuth())
true
References
The Knuth criterion for histograms was proposed by Knuth (2019). The random regular histogram criterion is a generalization.
AIC
AutoHist.AIC — Type

AIC(; maxbins::Union{Int, Symbol}=:default)
AIC criterion for regular histograms.
The number $k$ of bins is chosen as the maximizer of the penalized log-likelihood,
\[ n\log (k) + \sum_{j=1}^k N_j \log (N_j/n) - k,\]
where $n$ is the sample size.
Keyword arguments
- maxbins: Maximal number of bins for which the above criterion is evaluated. Defaults to maxbins=:default, which sets maxbins to the ceiling of min(1000, 4n/log(n)^2).
Examples
julia> x = (1.0 .- (1.0 .- LinRange(0.0, 1.0, 500)) .^(1/3)).^(1/3);
julia> fit(AutomaticHistogram, x, AIC())
AutomaticHistogram
breaks: LinRange{Float64}(0.0, 1.0, 11)
density: [0.04, 0.2, 0.56, 1.0, 1.5, 1.88, 1.98, 1.68, 0.96, 0.2]
counts: [2, 10, 28, 50, 75, 94, 99, 84, 48, 10]
type: regular
closed: right
a: NaN
References
The AIC criterion for histograms was proposed by Taylor (1987).
BIC
AutoHist.BIC — Type

BIC(; maxbins::Union{Int, Symbol}=:default)
BIC criterion for regular histograms.
The number $k$ of bins is chosen as the maximizer of the penalized log-likelihood,
\[ n\log (k) + \sum_{j=1}^k N_j \log (N_j/n) - \frac{k}{2}\log(n),\]
where $n$ is the sample size.
Keyword arguments
- maxbins: Maximal number of bins for which the above criterion is evaluated. Defaults to maxbins=:default, which sets maxbins to the ceiling of min(1000, 4n/log(n)^2).
Examples
julia> x = (1.0 .- (1.0 .- LinRange(0.0, 1.0, 500)) .^(1/3)).^(1/3);
julia> fit(AutomaticHistogram, x, BIC())
AutomaticHistogram
breaks: LinRange{Float64}(0.0, 1.0, 9)
density: [0.048, 0.336, 0.816, 1.44, 1.904, 1.904, 1.264, 0.288]
counts: [3, 21, 51, 90, 119, 119, 79, 18]
type: regular
closed: right
a: NaN
Birgé-Rozenholc (BR)
AutoHist.BR — Type

BR(; maxbins::Union{Int, Symbol}=:default)
Birgé-Rozenholc criterion for regular histograms.
The number $k$ of bins is chosen as the maximizer of the penalized log-likelihood,
\[ n\log (k) + \sum_{j=1}^k N_j \log (N_j/n) - k - \log^{2.5} (k),\]
where $n$ is the sample size.
Keyword arguments
- maxbins: Maximal number of bins for which the above criterion is evaluated. Defaults to maxbins=:default, which sets maxbins to the ceiling of min(1000, 4n/log(n)^2).
Examples
julia> x = (1.0 .- (1.0 .- LinRange(0.0, 1.0, 500)) .^(1/3)).^(1/3);
julia> fit(AutomaticHistogram, x, BR())
AutomaticHistogram
breaks: LinRange{Float64}(0.0, 1.0, 11)
density: [0.04, 0.2, 0.56, 1.0, 1.5, 1.88, 1.98, 1.68, 0.96, 0.2]
counts: [2, 10, 28, 50, 75, 94, 99, 84, 48, 10]
type: regular
closed: right
a: NaN
References
This criterion was proposed by Birgé and Rozenholc (2006).
Regular $L_2$ leave-one-out cross-validation (L2CV_R)
AutoHist.L2CV_R — Type

L2CV_R(; maxbins::Union{Int, Symbol}=:default)
$L_2$ cross-validation criterion for regular histograms.
The number $k$ of bins is chosen by maximizing a leave-one-out $L_2$ cross-validation criterion,
\[ -2k + k\frac{n+1}{n^2}\sum_{j=1}^k N_j^2,\]
where $n$ is the sample size.
Keyword arguments
- maxbins: Maximal number of bins for which the above criterion is evaluated. Defaults to maxbins=:default, which sets maxbins to the ceiling of min(1000, 4n/log(n)^2).
Examples
julia> x = (1.0 .- (1.0 .- LinRange(0.0, 1.0, 500)) .^(1/3)).^(1/3);
julia> fit(AutomaticHistogram, x, L2CV_R())
AutomaticHistogram
breaks: LinRange{Float64}(0.0, 1.0, 11)
density: [0.04, 0.2, 0.56, 1.0, 1.5, 1.88, 1.98, 1.68, 0.96, 0.2]
counts: [2, 10, 28, 50, 75, 94, 99, 84, 48, 10]
type: regular
closed: right
a: NaN
References
This approach to histogram density estimation was first considered by Rudemo (1982).
Regular Kullback-Leibler leave-one-out cross-validation (KLCV_R)
AutoHist.KLCV_R — Type

KLCV_R(; maxbins::Union{Int, Symbol}=:default)
Kullback-Leibler cross-validation criterion for regular histograms.
The number $k$ of bins is chosen by maximizing a leave-one-out Kullback-Leibler cross-validation criterion,
\[ n\log(k) + \sum_{j=1}^k N_j\log (N_j-1),\]
where $n$ is the sample size and the maximization is over all regular partitions with $N_j \geq 2$ for all $j$.
Keyword arguments
- maxbins: Maximal number of bins for which the above criterion is evaluated. Defaults to maxbins=:default, which sets maxbins to the ceiling of min(1000, 4n/log(n)^2).
Examples
julia> x = (1.0 .- (1.0 .- LinRange(0.0, 1.0, 500)) .^(1/3)).^(1/3);
julia> fit(AutomaticHistogram, x, KLCV_R())
AutomaticHistogram
breaks: LinRange{Float64}(0.0, 1.0, 11)
density: [0.04, 0.2, 0.56, 1.0, 1.5, 1.88, 1.98, 1.68, 0.96, 0.2]
counts: [2, 10, 28, 50, 75, 94, 99, 84, 48, 10]
type: regular
closed: right
a: NaN
References
This approach was first studied by Hall (1990).
Minimum description length (MDL)
AutoHist.MDL — Type

MDL(; maxbins::Union{Int, Symbol}=:default)
MDL criterion for regular histograms.
The number $k$ of bins is chosen as the minimizer of an encoding length of the data, and is equivalent to the maximizer of
\[ n\log(k) + \sum_{j=1}^k \big(N_j-\frac{1}{2}\big)\log\big(N_j-\frac{1}{2}\big) - \big(n-\frac{k}{2}\big)\log\big(n-\frac{k}{2}\big) - \frac{k}{2}\log(n),\]
where $n$ is the sample size and the maximization is over all regular partitions with $N_j \geq 1$ for all $j$.
Keyword arguments
- maxbins: Maximal number of bins for which the above criterion is evaluated. Defaults to maxbins=:default, which sets maxbins to the ceiling of min(1000, 4n/log(n)^2).
Examples
julia> x = (1.0 .- (1.0 .- LinRange(0.0, 1.0, 500)) .^(1/3)).^(1/3);
julia> fit(AutomaticHistogram, x, MDL())
AutomaticHistogram
breaks: LinRange{Float64}(0.0, 1.0, 11)
density: [0.04, 0.2, 0.56, 1.0, 1.5, 1.88, 1.98, 1.68, 0.96, 0.2]
counts: [2, 10, 28, 50, 75, 94, 99, 84, 48, 10]
type: regular
closed: right
a: NaN
References
The minimum description length principle was first applied to histogram estimation by Hall and Hannan (1988).
Normalized maximum likelihood, regular (NML_R)
AutoHist.NML_R — Type

NML_R(; maxbins::Union{Int, Symbol}=:default)
NML_R criterion for regular histograms.
The number $k$ of bins is chosen by maximizing a penalized log-likelihood,
\[\begin{aligned} &\sum_{j=1}^k N_j\log \frac{N_j}{|\mathcal{I}_j|} - \frac{k-1}{2}\log(n/2) - \log\frac{\sqrt{\pi}}{\Gamma(k/2)} - n^{-1/2}\frac{\sqrt{2}k\Gamma(k/2)}{3\Gamma(k/2-1/2)} \\ &- n^{-1}\left(\frac{3+k(k-2)(2k+1)}{36} - \frac{\Gamma(k/2)^2 k^2}{9\Gamma(k/2-1/2)^2} \right), \end{aligned}\]
where $n$ is the sample size.
Keyword arguments
- maxbins: Maximal number of bins for which the above criterion is evaluated. Defaults to maxbins=:default, which sets maxbins to the ceiling of min(1000, 4n/log(n)^2).
Examples
julia> x = (1.0 .- (1.0 .- LinRange(0.0, 1.0, 500)) .^(1/3)).^(1/3);
julia> fit(AutomaticHistogram, x, NML_R())
AutomaticHistogram
breaks: LinRange{Float64}(0.0, 1.0, 24)
density: [0.046, 0.0, 0.138, 0.184, 0.368, 0.506, 0.69, 0.874, 1.104, 1.334 … 1.978, 1.978, 1.978, 1.84, 1.61, 1.334, 0.966, 0.644, 0.276, 0.046]
counts: [1, 0, 3, 4, 8, 11, 15, 19, 24, 29 … 43, 43, 43, 40, 35, 29, 21, 14, 6, 1]
type: regular
closed: right
a: NaN
References
This is a regular variant of the normalized maximum likelihood criterion considered by Kontkanen and Myllymäki (2007).
Sturges' rule
AutoHist.Sturges — Type

Sturges()
Sturges' rule for regular histograms.
The number $k$ of bins is chosen as
\[ k = \lceil \log_2(n) \rceil + 1,\]
where $n$ is the sample size.
This is the default procedure used by the hist() function in base R.
Examples
julia> x = (1.0 .- (1.0 .- LinRange(0.0, 1.0, 500)) .^(1/3)).^(1/3);
julia> fit(AutomaticHistogram, x, Sturges())
AutomaticHistogram
breaks: LinRange{Float64}(0.0, 1.0, 10)
density: [0.054, 0.252, 0.666, 1.206, 1.71, 1.98, 1.782, 1.098, 0.252]
counts: [3, 14, 37, 67, 95, 110, 99, 61, 14]
type: regular
closed: right
a: NaN
References
This classical rule is due to Sturges (1926).
Freedman and Diaconis' rule
AutoHist.FD — Type

FD()
Freedman and Diaconis' rule for regular histograms.
The number $k$ of bins is computed according to the formula
\[ k = \big\lceil\frac{n^{1/3}}{2\text{IQR}(\boldsymbol{x})}\big\rceil,\]
where $\text{IQR}(\boldsymbol{x})$ is the sample interquartile range and $n$ is the sample size.
This is the default procedure used by the histogram() function in Plots.jl.
Examples
julia> x = (1.0 .- (1.0 .- LinRange(0.0, 1.0, 500)) .^(1/3)).^(1/3);
julia> fit(AutomaticHistogram, x, FD())
AutomaticHistogram
breaks: LinRange{Float64}(0.0, 1.0, 16)
density: [0.03, 0.09, 0.24, 0.48, 0.78, 1.08, 1.44, 1.71, 1.92, 2.01, 1.89, 1.59, 1.08, 0.54, 0.12]
counts: [1, 3, 8, 16, 26, 36, 48, 57, 64, 67, 63, 53, 36, 18, 4]
type: regular
closed: right
a: NaN
References
This rule dates back to Freedman and Diaconis (1981).
Scott's rule
AutoHist.Scott — Type

Scott()
Scott's rule for regular histograms.
The number $k$ of bins is computed according to the formula
\[ k = \big\lceil \hat{\sigma}^{-1}(24\sqrt{\pi})^{-1/3}n^{1/3}\big\rceil,\]
where $\hat{\sigma}$ is the sample standard deviation and $n$ is the sample size.
Examples
julia> x = (1.0 .- (1.0 .- LinRange(0.0, 1.0, 500)) .^(1/3)).^(1/3);
julia> fit(AutomaticHistogram, x, Scott())
AutomaticHistogram
breaks: LinRange{Float64}(0.0, 1.0, 14)
density: [0.026, 0.13, 0.338, 0.624, 0.988, 1.378, 1.716, 1.924, 2.002, 1.768, 1.3, 0.676, 0.13]
counts: [1, 5, 13, 24, 38, 53, 66, 74, 77, 68, 50, 26, 5]
type: regular
closed: right
a: NaN
References
This classical rule is due to Scott (1979).
Wand's rule
AutoHist.Wand — Type

Wand(; level::Int=2, scalest::Symbol=:minim)
Wand's rule for regular histograms.
A more sophisticated version of Scott's rule, Wand's rule proceeds by determining the bin width $h$ as
\[ h = \Big(\frac{6}{\hat{C}(f_0) n}\Big)^{1/3},\]
where $\hat{C}(f_0)$ is an estimate of the functional $C(f_0) = \int \{f_0'(x)\}^2\, \text{d}x$. The corresponding number of bins is $k = \lceil h^{-1}\rceil$.
Keyword arguments
- level: Controls the number of stages of functional estimation used to compute $\hat{C}$, and can take values 0, 1, 2, 3, 4, 5, with the default value being level=2. The choice level=0 corresponds to a variation on Scott's rule, with a custom scale estimate.
- scalest: Estimate of the scale parameter. Possible choices are :minim, :stdev and :iqr. The latter two use the sample standard deviation or the sample interquartile range, respectively, to estimate the scale. The default choice :minim uses the minimum of the above estimates.
Examples
julia> x = (1.0 .- (1.0 .- LinRange(0.0, 1.0, 500)) .^(1/3)).^(1/3);
julia> rule = Wand(scalest=:stdev, level=5);
julia> fit(AutomaticHistogram, x, rule)
AutomaticHistogram
breaks: LinRange{Float64}(0.0, 1.0, 13)
density: [0.024, 0.144, 0.408, 0.72, 1.128, 1.536, 1.872, 1.992, 1.848, 1.416, 0.744, 0.168]
counts: [1, 6, 17, 30, 47, 64, 78, 83, 77, 59, 31, 7]
type: regular
closed: right
a: NaN
References
The full details on this method are given in Wand (1997).