Choice of algorithm
In this section, we empirically assess the efficiency of the dynamic programming algorithm provided for irregular histograms, and show how heuristics can be used to speed up the computations.[1]
The cubic-time dynamic programming algorithm
As a toy problem, we consider standard normal random samples of using a data-based grid. In this case, the number of candidate cutpoints are $k_n = n+1$, where $n$ is the sample size. For smaller samples, we can just compute the exact solution using the default dynamic programming algorithm, available as SegNeig
. The code snippet below illustrates how this algorithm can be explicitly specified when calling fit
:
using AutoHist, Distributions, BenchmarkTools
n = 500
@benchmark fit(
AutomaticHistogram,
$rand(Normal(), n),
RIH(
grid = :data,
alg = SegNeig(greedy=false)
)
)
# Output
BenchmarkTools.Trial: 55 samples with 1 evaluation per sample.
Range (min … max): 51.734 ms … 235.103 ms ┊ GC (min … max): 10.63% … 76.02%
Time (median): 57.087 ms ┊ GC (median): 10.08%
Time (mean ± σ): 91.326 ms ± 66.481 ms ┊ GC (mean ± σ): 43.70% ± 26.34%
Benchmarking the above code snippet yields a median runtime of around $57\ \text{ms}$ on my machine. Since dynamic programming is quick for samples of this size, the greedy algorithm will only be used if the number of candidate cutpoints exceeds $501$. Thus, changing greedy
to true
in the above code snippet would produce the same histogram, as there are $501$ possible cutpoints.
Speeding up computations via heuristics
The $\mathcal{O}(k_n^3)$ runtime of dynamic programming means that computing the optimal solution quickly becomes computationally prohibitive, even for moderate samples. As an example, when doubling the number of samples in the above code snippet to $n = 1000$, the median runtime increases to $442\ \text{ms}$, a rougly $8$-fold increase. To ensure that the code retains good performance even for larger samples, we have implemented a greedy search heuristic which selects a subset of the candidate cutpoints, and the dynamic programming algorithm is subsequently run on this smaller set. Adopting the heuristic approach can improve performance considerably, but comes at the cost of no longer being guaranteed to find the optimal solution. To showcase the computational advantages of the heuristic approach, we run a benchmark on a normal sample of size $n = 10^6$.
n = 10^6
@benchmark fit(
AutomaticHistogram,
$rand(Normal(), n),
RIH(
grid = :data,
alg = SegNeig(greedy=true) # NB! greedy=true is the default option
)
)
# Output
BenchmarkTools.Trial: 7 samples with 1 evaluation per sample.
Range (min … max): 683.418 ms … 826.615 ms ┊ GC (min … max): 1.71% … 19.55%
Time (median): 777.266 ms ┊ GC (median): 11.38%
Time (mean ± σ): 765.973 ms ± 48.456 ms ┊ GC (mean ± σ): 10.04% ± 6.39%
As we can see, the median runtime is less than 2 times slower than the mean time it took to compute the exact solution for random samples of size $n = 10^3$.
The number candidate cutpoints constructed by the greedy search heuristic can be controlled through the gr_maxbins
keyword argument, which equals the number of selected gridpoints plus one. By default, the greedy algorithm will produce a subset consisting of $\max\{500, n^{1/3}\}+1$ cutpoints by default (including the edges). For gr_maxbins1 < gr_maxbins2
, the cutpoint subset formed by the greedy algorithm for gr_maxbins1
is a subset of that selected with gr_maxbins2
bins. Thus, increasing the number of candidate cutpoints added this grid will never lead to a worse solution of the original optimization problem. If additional precision is desired in the above example, we can increase gr_maxbins
to $1000$:
n = 10^6
@benchmark fit(
AutomaticHistogram,
$rand(Normal(), n),
RIH(
grid = :data,
alg = SegNeig(greedy=true, gr_maxbins=10^3)
)
)
# Output
BenchmarkTools.Trial: 5 samples with 1 evaluation per sample.
Range (min … max): 1.114 s … 1.339 s ┊ GC (min … max): 5.36% … 17.15%
Time (median): 1.270 s ┊ GC (median): 15.61%
Time (mean ± σ): 1.232 s ± 99.117 ms ┊ GC (mean ± σ): 12.38% ± 6.19%
The above code snippet has a median runtime of about $1.2\ \text{s}$ on my machine.
The quadratic-time dynamic programming algorithm
For the L2CV_I
and KLCV_I
criteria, it becomes possible to compute the exact solution via the quadratic-time dynamic programming algorithm OptPart
instead of the cubic-time SegNeig
algorithm used for the other problems. In practice, this means that computing the exact solution is often feasible even as the number of candidate cutpoints becomes quite large. Both the OptPart
and SegNeig
algorithms can be used to fit L2CV_I
, allowing us to get a direct comparison between the performance of the two algorithms.
n = 10^3
@benchmark fit(
AutomaticHistogram,
$rand(Normal(), n),
L2CV_I(
grid = :data,
alg = OptPart(greedy=false)
)
)
@benchmark fit(
AutomaticHistogram,
$rand(Normal(), n),
L2CV_I(
grid = :data,
alg = SegNeig(greedy=false)
)
)
# Output OptPart
BenchmarkTools.Trial: 1014 samples with 1 evaluation per sample.
Range (min … max): 3.673 ms … 10.196 ms ┊ GC (min … max): 0.00% … 33.51%
Time (median): 4.437 ms ┊ GC (median): 0.00%
Time (mean ± σ): 4.921 ms ± 1.099 ms ┊ GC (mean ± σ): 12.30% ± 15.90%
# Output SegNeig
BenchmarkTools.Trial: 11 samples with 1 evaluation per sample.
Range (min … max): 408.000 ms … 570.588 ms ┊ GC (min … max): 11.73% … 35.49%
Time (median): 421.996 ms ┊ GC (median): 11.52%
Time (mean ± σ): 471.118 ms ± 68.395 ms ┊ GC (mean ± σ): 22.00% ± 10.88%
The mean runtime is around $100$ times faster for the quadratic-time algorithm! Although the speedup from the quadratic-time algorithm is considerable in this case, it is often too slow to be used in practice for larger sample sizes. To speed up the computation for these criteria, it is once again possible to use the greedy search heuristic used in the cubic-time case. Due to the superior runtime complexity of the exact algorithm for cross-validation criteria, the exact solution is by default computed when the number of total candidate cutpoints is less than $3001$, and the greedy search heuristic is used thereafter to build a smaller candidate set as previously. By default gr_maxbins
is set to $\max\{3000, \sqrt{n}\}$, but this can be replaced with a user-defined value if better performance or additional accuracy is desired.
In practice one should of course never use the SegNeig
algorithm for the two cross-validation criteria, as it only leads to a reduction in speed compared to using OptPart
, which is the default.
- 1Note: The benchmarks presented here were performed on a Windows machine with a Intel® Core™ Ultra 5 125U CPU. Results may vary on systems with different hardware configurations.