Algorithms for constructing irregular histograms

Constructing data-adaptive irregular histograms is in general a difficult problem from a computational perspective, and as a result computing the exact optimal partition is often impractical for larger sample sizes. This package solves the problem of irregular histogram construction via heuristics that combine a greedy search procedure with dynamic programming techniques to quickly compute a nearly optimal partition. Examples showcasing the use of the provided algorithms in toy problems can be found here. Note that the default option for the alg keyword offers a reasonable tradeoff between accuracy and computational efficiency, and in cases where one simply wants to draw an irregular histogram quickly for a given dataset, these default choices will typically yield a reasonable density estimate within a reasonable amount of time. However, if better performance or additional accuracy is desired, then a more fine-tuned approach to selecting the algorithm used and its hyperparameters is needed.

Problem description

All the irregular histogram methods supported by this library are the product of solving an optimization problem of the form

\[ \max_{1\leq k\leq k_n}\max_{\boldsymbol{t}_{0:k}} \Big\{\sum_{j=1}^k \Phi(\tau_{n, t_{j-1}}, \tau_{n, t_{j}}] + \Psi(k)\Big\},\]

where the inner maximum is over integer vectors $\boldsymbol{t}_{0:k}$ satisfying $0 = t_0 < t_1 < \cdots < t_{k-1} < t_k = k_n$ and $\{\tau_{n,j}\colon 0\leq j \leq k_n\}$ are the candidate cutpoints between the partition intervals. While it is generally desirable to use a moderate number of possible cutpoints relative to the number of samples, this comes at a heavy computational cost if an exact solution of the above optimization is desired as the runtime complexity of the algorithm is cubic in the number of candidates. We note that for the special case where $\Psi(k) = 0$ for all $k$, a more efficient algorithm is available, allowing for a quadratic-time solution. In both cases, computing the exact solution becomes unfeasible for larger sample sizes.

To ease the computational burden we adopt a greedy search heuristic to construct a subset of possible cutpoints when the number of candidates is large, and then run the optimization algorithm of choice on this smaller set. To find the optimal partition for a given set of candidate cutpoints, this package includes two solvers based on dynamic programming. The algorithm used can be controlled via the alg keyword passed to the rule argument of fit, see the methods page.

Segment neighbourhood

The choice alg = SegNeig() results in the use of the exact dynamic programming algorithm of Kanazawa (1988) to find the optimal partition. Runtime complexity is cubic in the number of candidate cutpoints.

AutoHist.SegNeig — Type

SegNeig(; greedy::Bool=true, gr_maxbins::Union{Int, Symbol}=:default)

The segment neighbourhood algorithm for constructing an irregular histogram.

Keyword arguments

greedy: Boolean indicating whether or not the greedy cutpoint selection strategy of Rozenholc et al. (2010) should be used to select a smaller number of candidate cutpoints prior to running the dynamic programming algorithm. Defaults to true.
gr_maxbins: Number of candidate cutpoints chosen by the greedy algorithm. Supplying gr_maxbins=:default results in the selection of at most $\max\{ 500, n^{1/3} \}+1$ candidate cutpoints (including edges).

Note

This algorithm can be quite slow for large datasets when the greedy keyword is set to false.

Examples

julia> x = LinRange(eps(), 1.0-eps(), 5000) .^(1.0/4.0);

julia> h = fit(AutomaticHistogram, x, RIH(alg = SegNeig(greedy=true, gr_maxbins=200)))
AutomaticHistogram
breaks: [0.0001220703125, 0.17763663029325183, 0.29718725232110504, 0.4022468898607337, 0.4928155429121377, 0.5797614498414855, 0.6667073567708333, 0.7572760098222373, 0.8405991706295289, 0.9239223314368207, 1.0]
density: [0.006626835974128547, 0.057821970706400425, 0.17596277991076312, 0.36279353706969375, 0.6214544825215076, 0.9730458529384184, 1.4481767793920146, 2.0440057561776532, 2.7513848134529346, 3.5648421829491657]
counts: [5, 34, 92, 164, 270, 423, 656, 852, 1147, 1357]
type: irregular
closed: right
a: 5.0

julia> h = fit(AutomaticHistogram, x, RIH(alg = SegNeig(greedy=false)))
AutomaticHistogram
breaks: [0.0001220703125, 0.17763663029325183, 0.29718725232110504, 0.4022468898607337, 0.4928155429121377, 0.5797614498414855, 0.6667073567708333, 0.7572760098222373, 0.8405991706295289, 0.9202995853147645, 1.0]
density: [0.006626835974128547, 0.057821970706400425, 0.17596277991076312, 0.36279353706969375, 0.6214544825215076, 0.9730458529384184, 1.4481767793920146, 2.0440057561776532, 2.733509595364622, 3.545742066060377]
counts: [5, 34, 92, 164, 270, 423, 656, 852, 1090, 1414]
type: irregular
closed: right
a: 5.0

source

Optimal partitioning

The choice alg = OptPart() results in the use of the exact optimal partitioning algorithm of Jackson et al. (2005) to find the optimal histogram partition. This algorithm solves a less general optimization problem than the SegNeig algorithm. Runtime complexity is quadratic in the number of candidate cutpoints.

AutoHist.OptPart — Type

OptPart(; greedy::Bool=true, gr_maxbins::Union{Int, Symbol}=:default)

The optimal partitioning algorithm for constructing an irregular histogram.

Keyword arguments

greedy: Boolean indicating whether or not the greedy cutpoint selection strategy of Rozenholc et al. (2010) should be used to select a smaller number of candidate cutpoints prior to running the dynamic programming algorithm. Defaults to true.
gr_maxbins: Number of candidate cutpoints chosen by the greedy algorithm. Supplying gr_maxbins=:default results in the selection of at most $\max \{ 3000, n^{1/2} \}+1$ candidate cutpoints (including edges).

Note

This algorithm can be quite slow for large datasets when the greedy keyword is set to false.

Examples

julia> x = LinRange(eps(), 1.0-eps(), 1000) .^(1.0/4.0);

julia> h = fit(AutomaticHistogram, x, L2CV_I(alg = OptPart(greedy=true, gr_maxbins=100)))
AutomaticHistogram
breaks: [0.0001220703125, 0.16676839192708331, 0.2977047874813988, 0.4048345656622024, 0.5119643438430059, 0.6071908133370535, 0.6786106654575893, 0.7619338262648809, 0.8452569870721726, 0.9285801478794643, 1.0]
density: [0.006000732511292883, 0.05346107146424569, 0.17735498311154516, 0.3920478574044684, 0.7035858869490909, 1.0641298986692698, 1.5001831278232214, 2.0762534489073383, 2.7963413502624808, 3.5984392626052997]
counts: [1, 7, 19, 42, 67, 76, 125, 173, 233, 257]
type: irregular
closed: right
a: NaN

julia> h = fit(AutomaticHistogram, x, L2CV_I(alg = OptPart(greedy=false)))
AutomaticHistogram
breaks: [0.0001220703125, 0.16676839192708331, 0.2977047874813988, 0.4048345656622024, 0.5119643438430059, 0.6071908133370535, 0.6786106654575893, 0.7619338262648809, 0.8452569870721726, 0.9285801478794643, 1.0]
density: [0.006000732511292883, 0.05346107146424569, 0.17735498311154516, 0.3920478574044684, 0.7035858869490909, 1.0641298986692698, 1.5001831278232214, 2.0762534489073383, 2.7963413502624808, 3.5984392626052997]
counts: [1, 7, 19, 42, 67, 76, 125, 173, 233, 257]
type: irregular
closed: right
a: NaN

source