Non-Parametric Density Estimation: Theory and Applications
In this article, we’ll talk about what density estimation is and the role it plays in statistical analysis. We’ll examine two popular density estimation methods, histograms and kernel density estimators, looking at their theoretical properties as well as how they perform in practice.
Finally, we’ll look at how density estimation may be used as a tool for classification tasks.
Hopefully after reading this article, you leave with an appreciation of density estimation as a fundamental statistical tool, and a solid intuition behind the density estimation approaches we discuss here.
Ideally, this article will also spark an interest in learning more about density estimation and point you towards additional resources to help you dive deeper than what is discussed here!
Contents:
Background Concepts
What is density estimation?
Histograms
Overview
Theoretical Properties
Demonstration of Theoretical Properties
Kernel Density Estimators (KDE)
Naive Density Estimator
KDE: Overview
Kernel and Bandwidth
Density Estimation for Classification
Wrap-up
Sources
Background Concepts
Reviewing or refreshing the following concepts will help you fully appreciate the rest of this article.
Bias and variance: important concepts for discussing the accuracy of the density estimation approaches discussed.
The cumulative distribution function and probability density function.
Parametric vs. non-parametric statistics: knowing the distinction will help you understand the relevance of the density estimation methods discussed.
O notation: used to describe the asymptotic behavior of the bias/variance of the density estimators.
Kernel: kind of important for the kernel density estimator.
What is density estimation?
Density estimation is concerned with reconstructing the probability density function of a random variable, X, given a sample of random variates X1, X2,…, Xn.
Density estimation plays a crucial role in statistical analysis.
It may be used as a standalone method for analyzing the properties of a random variable’s distribution, such as modality, spread, and skew.
Alternatively, density estimation may be used as a means for further statistical analysis, such as classification tasks, goodness-of-fit tests, and anomaly detection, to name a few.
Some of you may recall that the probability distribution of a random variable X can be completely characterized by its cumulative distribution function (CDF), F(⋅).
If X is a discrete random variable, then we can derive its probability mass function (PMF), p(⋅), from its CDF via the following relationship: p(Xi) = F(Xi) − F(Xi-1), where Xi-1 denotes the largest value within the discrete distribution of X that is less than Xi.
If X is continuous, then its probability density function (PDF), p(⋅), may be derived by differentiating its CDF, i.e. F′(⋅) = p(⋅).
Based on this, you may be wondering why we need methods to estimate the probability distribution of X, when we can just exploit the relationships stated above.
Certainly, given a sample of data X1,…, Xn, we may always construct an estimate of its CDF.
If X is discrete, then constructing its PMF is straightforward, as it simply requires counting the frequency of observations for each distinct value that appears in our sample.
However, if X is continuous, estimating its PDF is not so trivial.
Notice that our estimate of the CDF will necessarily be a step function corresponding to a discrete distribution, since we have a finite amount of empirical data.
Since this estimated CDF is piecewise constant, we cannot simply differentiate it to obtain a useful estimate of the PDF.
Thus, this motivates the need for other methods of estimating p(⋅).
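As a quick illustration of why, here is a minimal sketch (base R) showing that the empirical CDF of a continuous sample is a step function:
set.seed(1)
x  <- rnorm(100)   # a continuous sample
Fn <- ecdf(x)      # empirical CDF: a step function that jumps by 1/n at each observation
plot(Fn, main = "Empirical CDF of a continuous sample")
# The slope is zero almost everywhere and undefined at the jumps,
# so differentiating Fn does not produce a useful density estimate.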
To provide some additional motivation behind density estimation, the CDF may be suboptimal to use for analyzing the properties of the probability distribution of X.
For example, consider the following display.
PDF vs. CDF of data following a bimodal distribution.
Certain properties of the distribution of X, such as its bimodal nature, are immediately clear from analyzing its PDF.
However, these properties are harder to notice from analyzing its CDF, due to the cumulative nature of the distribution.
For many folks, the PDF likely provides a more intuitive display of the distribution of X — it is larger at values of X that are more likely to “occur” and smaller for values of X that are less likely.
Broadly speaking, density estimation approaches may be categorized as parametric or non-parametric.
Parametric density estimation assumes X follows some distribution that may be characterized by some parameters (ex: X ∼ N(μ,σ)).
Density estimation in this case involves estimating the relevant parameters for the parametric distribution of X, and then plugging in these parameter estimates to the corresponding density function formula for X.
Non-parametric density estimation makes less rigid assumptions about the distribution of X, and estimates the shape of the density function directly from the empirical data.
As a result, non-parametric density estimates will typically have lower bias and higher variance compared to parametric density estimates.
Non-parametric methods may be desired when the underlying distribution of X is unknown and we’re working with a large amount of empirical data.
For the rest of this article, we’ll focus on analyzing two popular non-parametric methods for density estimation: Histograms and kernel density estimators (KDEs).
We’ll dig into how they work, the benefits and drawbacks of each approach, and how accurately they estimate the true density function of a random variable.
Finally, we’ll examine how density estimation can be applied to classification problems, and how the quality of the density estimator can impact classification performance.
Histograms
Overview
Histograms are a simple non-parametric approach for constructing a density estimate from a sample of data.
Intuitively, this approach involves partitioning the range of our data into distinct equal length bins.
Then, for any given point, assign its density to be equal to the proportion of points that reside within the same bin, normalized by the bin length.
Formally, given a sample of n observations X1, …, Xn, partition the domain into M bins β1, …, βM, each of equal width h, such that βl = (t0 + (l − 1)h, t0 + lh]. For a given point x ∈ βl, where βl denotes the lth bin, the density estimate produced by the histogram will be

$$\hat{p}(x) = \frac{1}{nh} \sum_{i=1}^{n} \mathbb{1}\{X_i \in \beta_l\}$$

Pointwise density estimate of the histogram.
Since the histogram density estimator assigns uniform density to all points within the same bin, the density estimate will be discontinuous at all of its breakpoints where the density estimates differ.
Histogram density estimate for the standard Gaussian.
Uniform densities are assigned to all points within the same bin.
Above, we have the histogram density estimate of the standard Gaussian distribution generated from a sample of 1000 data points.
We see that x = 0 and x = −0.5 lie within the same bin, and thus have identical density estimates.
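To make the definition above concrete, here is a minimal sketch of the pointwise histogram estimator (a hypothetical hist_density() helper, not from the original post), assuming bins of width h anchored at an origin t0:
# Pointwise histogram density estimate: proportion of observations in the bin
# containing x, normalized by the bin width h (bins anchored at t0)
hist_density <- function(x, data, h, t0 = 0) {
  l <- ceiling((x - t0) / h)                         # index of the bin (t0 + (l-1)h, t0 + l*h]
  in_bin <- data > t0 + (l - 1) * h & data <= t0 + l * h
  sum(in_bin) / (length(data) * h)
}

set.seed(1)
z <- rnorm(1000)
hist_density(0, z, h = 0.5)      # x = 0 and x = -0.25 fall in the same bin (-0.5, 0],
hist_density(-0.25, z, h = 0.5)  # so they receive identical density estimates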
Theoretical Properties
Histograms are a simple and intuitive method for density estimation.
They make no assumptions about the underlying distribution of the random variable.
Histogram estimation simply requires tuning the bin width, h, and the point where the histogram bins originate from, t0.
However, we’ll see very soon that the accuracy of the histogram estimator is highly dependent on tuning these parameters appropriately.
As desired, the histogram estimator is a true density function.
It is non-negative over its entire domain.
It integrates to 1.
Integral of the histogram density estimator.
We can evaluate the accuracy of the histogram estimator for estimating the true density, p(⋅), by decomposing its mean squared error into its bias and variance terms.
First, let’s examine its bias at a given point x ∈ (bk-1, bk].

$$E[\hat{p}(x)] = \frac{F(b_k) - F(b_{k-1})}{h}$$

Expected value of the pointwise histogram density estimate.
Let’s take a bit of a leap here.
Using the Taylor series expansion, the fact that the PDF is the derivative of the CDF, and |x − bk-1| ≤ h, we can derive the following:

$$F(b_k) - F(b_{k-1}) = h\,p(x) + O(h^2)$$

Thus, we have

$$E[\hat{p}(x)] = p(x) + O(h),$$

which implies

$$\mathrm{Bias}[\hat{p}(x)] = E[\hat{p}(x)] - p(x) = O(h)$$

Asymptotic bias of the histogram density estimator.
Therefore, the histogram estimator is an unbiased estimator of the true density, p(⋅), as the bin width approaches 0.
Now, let’s analyze the variance of the histogram estimator.
Notice that as h → 0, we have

$$F(b_k) - F(b_{k-1}) \approx h\,p(x)$$

Therefore,

$$\mathrm{Var}[\hat{p}(x)] = \frac{\big(F(b_k) - F(b_{k-1})\big)\big(1 - F(b_k) + F(b_{k-1})\big)}{nh^2} \approx \frac{p(x)}{nh} = O\!\left(\frac{1}{nh}\right)$$

Asymptotic variance of the histogram density estimator.
Now, we’re at a bit of an impasse; we see that as h → 0, the bias of the histogram density estimate decreases, while its variance increases.
We’re typically concerned with the accuracy of the density estimate at large sample sizes (i.e. as n → ∞).
Therefore, to maximize the accuracy of the histogram density estimate, we’ll want to tune h to achieve the following behavior:
Choose h to be small to minimize bias.
As h → 0 and n → ∞, we must have nh → ∞ to minimize variance.
In other words, the large sample size should overpower the small bin width, asymptotically (ex: h ∝ n^(−1/3), the rate that minimizes the histogram’s asymptotic mean squared error, satisfies both conditions).
This bias-variance trade-off is not unexpected:
Small bin widths may capture the density around a particular point with high precision.
However, density estimates may fluctuate due to small random variations across data sets, as fewer points will fall within any given bin.
Large bin widths include more data points when computing the density estimate at a given point, which means density estimates will be more robust to small random variations in the data.
Let’s illustrate this trade-off with some examples.
Demonstration of Theoretical Properties
First, we’ll look at how small bin widths may lead to large variance in the histogram density estimator.
For this example, we’ll draw four samples of 50 random variates, where each sample is drawn from a standard Gaussian distribution.
We’ll set a relatively small bin width (h = 0.2).
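The histogram plots in this section are generated with a plot_histogram() helper that isn’t shown here. Below is a minimal sketch of what such a helper might look like, assuming it wraps base R’s hist() with a fixed bin width and origin (the binwidth, origin, title, and line arguments are inferred from how it’s called; the line argument is ignored in this sketch):
# Hypothetical sketch of the plot_histogram() helper used in the following snippets
plot_histogram <- function(data, binwidth, origin = 0, title = "", line = 0) {
  # Breakpoints anchored at `origin`, extended to cover the observed range
  lo <- origin + floor((min(data) - origin) / binwidth) * binwidth
  hi <- origin + ceiling((max(data) - origin) / binwidth) * binwidth
  # freq = FALSE plots densities (bar areas sum to 1) rather than raw counts
  hist(data, breaks = seq(lo, hi, by = binwidth), freq = FALSE,
       main = title, xlab = "x", ylab = "Density", col = "lightblue")
}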
set.seed(25)
# Standard Gaussian
mu <- 0
sd <- 1
# Parameters for density estimate
n <- 50
h <- 0.2
# Generate 4 samples of standard Gaussian
samples <- replicate(4, rnorm(n, mean = mu, sd = sd), simplify = FALSE)
# Setup 2x2 plot
par(mfrow = c(2, 2), mar = c(4, 4, 3, 1))
# Plot histograms
titles <- paste("Sample", 1:4)
invisible(mapply(plot_histogram, samples, title = titles,
MoreArgs = list(binwidth = h, origin = 0, line = 0)))
Histogram density estimates (h = 0.2) generated from 4 different samples of the standard Gaussian.
Notice the high variability in density estimates across samples.
It’s clear that the histogram density estimates vary quite a bit.
For instance, we see that the pointwise density estimate at x = 0 ranges from approximately 0.2 in Sample 4 to approximately 0.6 in Sample 2.
Additionally, the distribution of the density estimate produced in Sample 1 appears almost bimodal, with peaks around −1 and a little above 0.
Let’s repeat this exercise to demonstrate how large bin widths may result in a density estimate with lower variance, but higher bias.
For this example, let’s draw four samples from a bimodal distribution consisting of a mixture of two Gaussian distributions, N(0, 1) and N(3, 1).
We’ll set a relatively large bin width (h = 2).
set.seed(25)
# Bimodal distribution parameters - mixture of N(0, 1) and N(3, 1)
mu_1 <- 0
sd_1 <- 1
mu_2 <- 3
sd_2 <- 1
# Density estimation parameters
n <- 100
h <- 2
# Generate 4 samples from bimodal distribution
samples <- replicate(4, c(rnorm(n/2, mean = mu_1, sd = sd_1), rnorm(n/2, mean = mu_2, sd = sd_2)), simplify = FALSE)
# Set up 2x2 plotting grid
par(mfrow = c(2, 2), mar = c(4, 4, 3, 1))
# Plot histograms
titles <- paste("Sample", 1:4)
invisible(mapply(plot_histogram, samples, title = titles,
MoreArgs = list(binwidth = h, origin = 0, line = 0)))
Histogram density estimates (h = 2) generated from 4 different samples of a bimodal distribution.
These histograms fail to capture the bimodal nature of the data.
There is still some variation in the density estimates across the four histograms, but they appear stable relative to the density estimates we saw above with smaller bin widths.
For instance, it appears that the pointwise density estimate at x = 0 is approximately 0.15 across all the histograms.
However, it’s clear that these histogram estimators introduce a large amount of bias, as the bimodal distribution of the true density function is masked by the large bin widths.
Additionally, we mentioned previously that the histogram estimator requires tuning the origin point, t0.
Let’s look at an example that illustrates the impact that the choice of t0 can have on the histogram density estimate.
set.seed(123)
# Distribution and density estimation parameters
# Bimodal distribution: mixture of N(0, 1) and N(5, 1)
n <- 50
data <- c(rnorm(n/2, mean = 0, sd = 1), rnorm(n/2, mean = 5, sd = 1))
h <- 3
# Set up plotting grid
par(mfrow = c(1, 2), mar = c(4, 4, 3, 1))
# Same bin width, different origins
plot_histogram(data, binwidth = h, origin = 0, title = paste("Bin width = ", h, ", Origin = 0"))
plot_histogram(data, binwidth = h, origin = 1, title = paste("Bin width = ", h, ", Origin = 1"))
Histogram density estimates of a bimodal distribution with different origin points.
Notice the histogram on the right fails to capture the bimodal nature of the data.
The histogram density estimates above differ in their origin point by a magnitude of 1.
The impact of the different origin point on the resulting histogram density estimates is evident.
The histogram on the left captures the fact that the distribution is bimodal with peaks around 0 and 5.
In contrast, the histogram on the right gives the impression that the density of X follows a unimodal distribution with a single peak around 5.
Histograms are a simple and intuitive approach to density estimation.
However, histograms will always produce density estimates that follow a discrete distribution, and we’ve seen that the resulting density estimate may be highly dependent on an arbitrary choice of the origin point.
Next, we’ll look at an alternative method for density estimation, Kernel Density Estimation, that addresses these shortcomings.
Kernel Density Estimators (KDE)
Naive Density Estimator
We’ll first look at the most basic form of a kernel density estimator, the naive density estimator.
This approach is also known as the “moving histogram”; it is an extension of the traditional histogram density estimator that computes the density at a given point by examining the number of observations that fall within an interval that is centered around that point.
Formally, the pointwise density estimate at x produced by the naive density estimator can be written as follows:

$$\hat{p}(x) = \frac{1}{nh} \sum_{i=1}^{n} \mathbb{1}\left\{ |x - X_i| < \frac{h}{2} \right\} = \frac{1}{nh} \sum_{i=1}^{n} K_0\!\left(\frac{x - X_i}{h}\right)$$

Pointwise density estimate of the Naive Density Estimator.
Its corresponding kernel is defined as follows:

$$K_0(u) = \mathbb{1}\left\{ |u| < \tfrac{1}{2} \right\}$$

Naive Density Estimator kernel function.
Unlike the traditional histogram density estimate, the density estimate produced by the moving histogram does not vary based on the choice of origin point.
In fact, there is no concept of “origin point” in the moving histogram, as the density estimate at x only depends on the points that lie within the neighborhood (x − (h/2), x + (h/2)).
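To make this concrete, here is a minimal sketch of the pointwise naive density estimate (a hypothetical nde() helper, not from the original post):
# Naive (moving-histogram) density estimate at x: fraction of observations
# within (x - h/2, x + h/2), normalized by the window width h
nde <- function(x, data, h) {
  sum(abs(data - x) < h / 2) / (length(data) * h)
}

set.seed(123)
z <- c(rnorm(25, mean = 0), rnorm(25, mean = 5))
nde(0, z, h = 1)   # pointwise estimate near the first mode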
Let’s examine the density estimate produced by the naive density estimator for the same bimodal distribution as we used above for highlighting the histogram’s dependency on origin point.
set.seed(123)
# Density estimate parameters
n <- 50
h <- 1
# Bimodal distribution - mixture of N(0, 1) and N(5, 1)
data <- c(rnorm(n/2, mean = 0, sd = 1), rnorm(n/2, mean = 5, sd = 1))
# Naive Density Estimator: KDE with a rectangular (uniform) kernel
# Note: density() scales kernels so that `bw` is the kernel's standard deviation,
# so bw = h/2 corresponds to a uniform window of half-width sqrt(3) * h / 2
pdf_est <- density(data, kernel = "rectangular", bw = h/2)
# Plot PDF
plot(pdf_est, main = "NDE: Bimodal Gaussian", xlab = "x", ylab = "Density", col = "blue", lwd = 2)
rug(data)
polygon(pdf_est, col = rgb(0, 0, 1, 0.2), border = NA)
grid()
Naive Density Estimate of a bimodal distribution containing a mixture of N(0, 1) and N(5, 1).
Clearly, the density estimate produced by the naive density estimator captures the bimodal distribution much more accurately than the traditional histogram.
Additionally, the density at each point is captured with much finer granularity.
That being said, the density estimate produced by the NDE is still quite “rough”, i.e. it does not have smooth curvature.
This is because each observation is weighted as “all or nothing” when computing the pointwise density estimate, which is apparent from its kernel, K0.
Specifically, all points within the neighborhood (x − (h/2), x + (h/2)) contribute equally to the density estimate, while points outside the interval contribute nothing.
Ideally, when computing the density estimate for x, we would like to weight points according to their distance from x, such that points closer to x have a higher impact on its density estimate and points farther away have a lower impact.
This is essentially what the KDE does: it generalizes the naive density estimator by replacing the uniform density function with an arbitrary density function, the kernel.
Intuitively, you can think of the KDE as a smoothed histogram.
KDE: Overview
The kernel density estimator generated from a sample X1, …, Xn can be defined as follows:

$$\hat{p}(x) = \frac{1}{nh} \sum_{i=1}^{n} K\!\left(\frac{x - X_i}{h}\right)$$

Pointwise density estimate of the KDE.
Below are some popular choices of kernel used in density estimation:
Gaussian: K(u) = (1/√(2π)) e^(−u²/2)
Epanechnikov: K(u) = (3/4)(1 − u²) for |u| ≤ 1, and 0 otherwise
Rectangular (uniform): K(u) = 1/2 for |u| ≤ 1, and 0 otherwise
Triangular: K(u) = 1 − |u| for |u| ≤ 1, and 0 otherwise
These are just several of the more popular kernels that are typically used for density estimation.
For more information about kernel functions, check out Wikipedia.
If you’re seeking some intuition behind what exactly a kernel function is (as I was), check out this Quora thread.
We can see that the KDE is a genuine density function.
It is always non-negative, since K(⋅) is a density function.
It integrates to 1.
Integral of the KDE.
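As a quick sketch of the definition above (a hypothetical kde_at() helper, not from the original post), the pointwise KDE is just the average kernel contribution from each observation, scaled by 1/h, and it lines up closely with what base R’s density() returns at the same point:
# Pointwise KDE: average kernel contribution from each observation, scaled by 1/h
kde_at <- function(x, data, h, K = dnorm) {
  mean(K((x - data) / h)) / h
}

set.seed(123)
z <- rnorm(50)
kde_at(0, z, h = 0.5)
# Compare with base R's density() (Gaussian kernel, bw = kernel standard deviation)
d <- density(z, bw = 0.5, kernel = "gaussian")
approx(d$x, d$y, xout = 0)$y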
Kernel and Bandwidth
In practice, K(⋅) is chosen to be symmetric and unimodal around 0, so that it has zero mean (∫u⋅K(u)du = 0).
Additionally, K(⋅) is typically scaled to have unit variance when used for density estimation (∫u²⋅K(u)du = 1).
This scaling essentially standardizes the impact that the choice of bandwidth, h, has on the KDE, regardless of the kernel being used.
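As a quick numerical check of these conditions for the Gaussian kernel (a minimal sketch using base R’s integrate()):
K <- dnorm                                           # Gaussian kernel
integrate(function(u) K(u), -Inf, Inf)$value         # ~ 1: integrates to one
integrate(function(u) u * K(u), -Inf, Inf)$value     # ~ 0: symmetric about zero
integrate(function(u) u^2 * K(u), -Inf, Inf)$value   # ~ 1: unit variance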
Since the KDE at a given point is the weighted sum of its neighboring points, where the weights are computed by K(⋅), the smoothness of the density estimate is inherited from the smoothness of the kernel function.
Smooth kernel functions will produce smooth KDEs.
We can see that the Gaussian kernel depicted above is infinitely differentiable, so KDEs with the Gaussian kernel will produce density estimates with smooth curvature.
On the other hand, the other kernel functions (Epanechnikov, rectangular, triangular) are not differentiable everywhere (ex: ±1), and in the case of the rectangular and triangular kernels, do not have smooth curvature.
Thus, KDEs using these kernels will produce rougher density estimates.
However, in practice, we’ll see that as long as the kernel function is continuous, the choice of the kernel has relatively little impact on the KDE compared to the choice of bandwidth.
set.seed(123)
# sample from standard Gaussian
x <- rnorm(50)
# kernel/bandwidths for KDEs
kernels <- c("gaussian", "epanechnikov", "rectangular", "triangular")
bandwidths <- c(0.5, 1, 2)
colors_k <- rainbow(length(kernels))
colors_b <- rainbow(length(bandwidths))
plot_kde_comparison <- function(values, label, type = c("kernel", "bandwidth")) {
type <- match.arg(type)
plot(NULL, xlim = range(x) + c(-1, 1), ylim = c(0, 0.5),
xlab = "x", ylab = "Density", main = paste("KDE with Different", label))
for (i in seq_along(values)) {
if (type == "kernel") {
d <- density(x, kernel = values[i])
col <- colors_k[i]
} else {
d <- density(x, bw = values[i], kernel = "gaussian")
col <- colors_b[i]
}
lines(d$x, d$y, col = col, lwd = 2)
}
curve(dnorm(x), add = TRUE, lty = 2, lwd = 2)
legend("topright", legend = c(as.character(values), "True Density"),
col = c(if (type == "kernel") colors_k else colors_b, "black"),
lwd = 2, lty = c(rep(1, length(values)), 2), cex = 0.8)
rug(x)
}
plot_kde_comparison(kernels, "Kernels", type = "kernel")
plot_kde_comparison(bandwidths, "Bandwidths", type = "bandwidth")
We see that the KDEs for the standard Gaussian with various kernels are relatively similar, compared to the KDEs produced with various bandwidths.
Accuracy of the KDE
Let’s examine how accurately the KDE estimates the true density, p(⋅).
As we did with the histogram estimator, we can decompose its mean squared error into its bias and variance terms.
For details behind how to derive these bias and variance terms, check out lecture 6 of these notes.
The bias and variance of the KDE at x can be expressed as follows:

$$\mathrm{Bias}[\hat{p}(x)] = \frac{h^2}{2}\, p''(x) + o(h^2) = O(h^2), \qquad \mathrm{Var}[\hat{p}(x)] = \frac{\sigma_K^2\, p(x)}{nh} + o\!\left(\frac{1}{nh}\right) = O\!\left(\frac{1}{nh}\right)$$

Asymptotic bias and variance of the KDE.
Intuitively, these results give us the following insights:
The effect of K(⋅) on the accuracy of the KDE is primarily captured via the term σ²K = ∫K(u)²du.
The Epanechnikov kernel minimizes this integral, so theoretically it should produce the optimal KDE.
However, we’ve seen that the choice of kernel has little practical impact on the KDE relative to its bandwidth.
Additionally, the Epanechnikov kernel has a bounded support interval ([−1, 1]).
As a result, it may produce rougher density estimates relative to kernels that are nonzero across the entire real line (ex: Gaussian).
Thus, the Gaussian kernel is commonly used in practice.
Recall that the asymptotic bias and variance of the histogram estimator were O(h) and O(1/(nh)), respectively.
Comparing these against KDE tells us that the KDE improves upon the histogram density estimator primarily through decreased asymptotic bias.
This is expected: the kernel smoothly varies the weight of the neighboring points of x when computing the pointwise density at x, instead of assigning uniform density to arbitrary fixed intervals of the domain.
In other words, the KDE imposes a less rigid structure on the density estimate compared to the histogram approach.
For histograms and KDEs, we’ve seen that the bandwidth h can have a significant impact on the accuracy of the density estimate.
Ideally, we would select the h such that the mean squared error of the density estimator is minimized.
However, it turns out that this theoretically optimal h depends on the curvature of the true density p(⋅), which is unknown in practice (otherwise we wouldn’t need density estimation)!
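To make this concrete, minimizing the asymptotic mean integrated squared error (AMISE) with respect to h yields the standard result below, stated under the unit-variance kernel scaling from earlier and using the σ²K = ∫K(u)²du notation from above:

$$h_{\mathrm{AMISE}} = \left( \frac{\sigma_K^2}{n \int p''(x)^2 \, dx} \right)^{1/5}$$

The term ∫p″(x)²dx measures the curvature of the true density, which is exactly the unknown quantity we just mentioned.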
Some popular approaches for bandwidth selection include:
Assuming the true density resembles some reference distribution p0(⋅) (ex: Gaussian), then plugging in the curvature of p0(⋅) to derive the bandwidth.
This approach is simple, but it assumes the distribution of the data, so it may be a poor choice if you’re looking to build density estimates to explore your data.
Non-parametric approaches to bandwidth selection, such as cross-validation and plug-in methods.
The unbiased cross-validation and Sheather-Jones methods are popular bandwidth selectors and typically produce fairly accurate density estimates.
For more information on the impact of bandwidth selection on the KDE, check out this blog post.
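For a quick sense of how much these selectors can disagree, here’s a minimal sketch comparing what base R’s bandwidth functions choose on the same bimodal sample used below:
set.seed(42)
x <- c(rnorm(150, mean = -2), rnorm(150, mean = 2))
c(silverman  = bw.nrd0(x),  # Silverman's rule of thumb (density()'s default)
  normal_ref = bw.nrd(x),   # normal reference ("Scott") rule
  ucv        = bw.ucv(x),   # unbiased cross-validation
  sj         = bw.SJ(x))    # Sheather-Jones plug-in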
set.seed(42)
# Simulate data: a bimodal distribution
x <- c(rnorm(150, mean = -2), rnorm(150, mean = 2))
# Define true density
true_density <- function(x) {
0.5 * dnorm(x, mean = -2, sd = 1) +
0.5 * dnorm(x, mean = 2, sd = 1)
}
# Create plotting range
x_grid <- seq(min(x) - 1, max(x) + 1, length.out = 500)
xlim <- range(x_grid)
ylim <- c(0, max(true_density(x_grid)) * 1.2)
# Base plot
plot(NULL, xlim = xlim, ylim = ylim,
main = "KDE: Various Bandwidth Selection Methods",
xlab = "x", ylab = "Density")
# KDE with different bandwidths
lines(density(x), col = "red", lwd = 2, lty = 2)                       # Silverman (default)
h_scott <- 1.06 * sd(x) * length(x)^(-1/5)                             # normal reference rule
lines(density(x, bw = h_scott), col = "blue", lwd = 2, lty = 3)
lines(density(x, bw = bw.ucv(x)), col = "darkgreen", lwd = 2, lty = 4)
lines(density(x, bw = bw.SJ(x)), col = "purple", lwd = 2, lty = 5)
# True density
lines(x_grid, true_density(x_grid), col = "black", lwd = 2, lty = 1)
# Add legend
legend("topright",
legend = c("Silverman (Default)", "Scott's Rule", "Unbiased CV",
"Sheather-Jones", "True Density"),
col = c("red", "blue", "darkgreen", "purple", "black"),
lty = c(2, 3, 4, 5, 1), lwd = 2, cex = 0.8)
KDEs using various bandwidth selection methods where the underlying data follows a bimodal distribution.
Notice the KDEs using the Sheather-Jones and Unbiased Cross-Validation methods produce density estimates closest to the true density.
Density Estimation for Classification
We’ve discussed a great deal about the underlying theory of histograms and KDE, and we’ve demonstrated how they perform at modeling the true density of some sample data.
Now, we’ll look at how we can apply what we’ve learned about density estimation to a simple classification task.
Say we want to build a classifier from a sample of n observations (x1, y1), …, (xn, yn), where each xi comes from a p-dimensional feature space, X, and each yi is a target label drawn from Y = {1, …, m}.
Intuitively, we want to build a classifier such that each observation x is assigned the class label k that satisfies

$$k = \arg\max_{j \in \{1, \dots, m\}} P(Y = j \mid X = x)$$

The Bayes classifier does precisely that, and computes the conditional probability above using the following equation:

$$P(Y = k \mid X = x) = \frac{\pi_k f_k(x)}{\sum_{l=1}^{m} \pi_l f_l(x)}$$

The Bayes Classifier
This classifier relies on the following:
πk = P(Y = k): the prior probability that an observation (xi, yi) belongs to the kth class (i.e. yi = k).
This can be estimated by simply counting the proportion of points in each class from our sample data.
fk(x) ≡ P(X = x | Y = k): the p-dimensional density function of X for all observations in target class k.
This is harder to estimate: for each of the m target classes, we must determine the shape of the distribution for each dimension of X, and also whether there are any associations between the different dimensions.
The Bayes classifier is optimal if the quantities above can be computed precisely.
However, this is impossible to achieve in practice when working with a finite sample of data.
For more detail behind why the Bayes classifier is optimal, check out this site.
So the question becomes, how can we approximate the Bayes classifier?
One popular method is the Naive Bayes classifier.
Naive Bayes assumes class-conditional independence, which means that for each target class, it reduces the p-dimensional density estimation problem into p separate univariate density estimation tasks.
These univariate densities may be estimated parametrically or non-parametrically.
A typical parametric approach would assume that each dimension of X follows a univariate Gaussian distribution with class-specific mean and a diagonal covariance matrix, whereas a non-parametric approach may model each dimension of X using a histogram or KDE.
The parametric approach to univariate density estimation in Naive Bayes may be useful when we have a small amount of data relative to the size of the feature space, as the bias introduced by the Gaussian assumption may help reduce the variance of the classifier.
However, the Gaussian assumption may not always be appropriate depending on the distribution of data that you’re working with.
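To make the mechanics concrete, here is a minimal from-scratch sketch of the kernel-based variant: for a single observation, each class’s score is its prior multiplied by the product of per-feature KDE densities. The nb_kde_scores() helper and the example point x0 are hypothetical; the actual examples below use the naivebayes package.
# Kernel Naive Bayes score for one observation x0 (a named numeric vector):
# prior * product of univariate KDE densities, computed class by class
nb_kde_scores <- function(x0, data, features, class_col) {
  classes <- levels(data[[class_col]])
  sapply(classes, function(k) {
    sub   <- data[data[[class_col]] == k, ]
    prior <- nrow(sub) / nrow(data)
    dens  <- sapply(features, function(f) {
      d <- density(sub[[f]], bw = "SJ")              # univariate KDE for this feature/class
      approx(d$x, d$y, xout = x0[[f]], rule = 2)$y   # evaluate the KDE at x0
    })
    prior * prod(dens)                               # class-conditional independence
  })
}

x0     <- c(Sepal.Length = 6.0, Petal.Length = 4.5)
scores <- nb_kde_scores(x0, iris, c("Sepal.Length", "Petal.Length"), "Species")
names(which.max(scores))   # class with the highest (unnormalized) posterior score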
Let’s examine how parametric vs. non-parametric density estimates can impact the decision boundary of the Naive Bayes classifier.
We’ll build two classifiers on the Iris dataset: one of them will assume each feature follows a Gaussian distribution, and the other will build kernel density estimates for each feature.
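The snippets below rely on some setup that isn’t shown here: the naivebayes, ggplot2, caret, and tidyr packages, a two-feature subset of the iris data (iris2D), and a train/test split. A minimal sketch of that setup, with an assumed 70/30 split, might look like the following.
library(naivebayes)   # naive_bayes()
library(ggplot2)      # decision-boundary plots
library(caret)        # confusionMatrix()
library(tidyr)        # pivot_longer()

set.seed(1)
# Two-feature version of iris used throughout the classification examples
iris2D <- iris[, c("Sepal.Length", "Petal.Length", "Species")]
# Simple 70/30 train/test split (the original split is not shown, so this is an assumption)
idx   <- sample(seq_len(nrow(iris2D)), size = round(0.7 * nrow(iris2D)))
train <- iris2D[idx, ]
test  <- iris2D[-idx, ]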
# Parametric Naive Bayes
param_nb <- naive_bayes(Species ~ ., data = train)
# Nonparametric Naive Bayes
# KDE with Gaussian kernel and Sheather-Jones bandwidth
nonparam_nb <- naive_bayes(Species ~ ., data = train,
usekernel = TRUE,
kernel="gaussian",
bw="sj") # play with bandwidth to see how it affects the classification boundaries!
# Create grid for plotting decision boundaries
x_seq <- seq(min(iris2D$Sepal.Length), max(iris2D$Sepal.Length), length.out = 200)
y_seq <- seq(min(iris2D$Petal.Length), max(iris2D$Petal.Length), length.out = 200)
grid <- expand.grid(Sepal.Length = x_seq, Petal.Length = y_seq)
# Predict class for each point on grid
grid$param_pred <- predict(param_nb, grid)
grid$nonparam_pred <- predict(nonparam_nb, grid)
# Plot decision boundaries
nb_parametric <- ggplot() +
geom_tile(data = grid, aes(x = Sepal.Length, y = Petal.Length, fill = param_pred), alpha = 0.3) +
geom_point(data = train, aes(x = Sepal.Length, y = Petal.Length, color = Species), size = 2) +
ggtitle("Parametric Naive Bayes Decision Boundary") +
theme_minimal()
nb_nonparametric <- ggplot() +
geom_tile(data = grid, aes(x = Sepal.Length, y = Petal.Length, fill = nonparam_pred), alpha = 0.3) +
geom_point(data = train, aes(x = Sepal.Length, y = Petal.Length, color = Species), size = 2) +
ggtitle("Nonparametric Naive Bayes Decision Boundary") +
theme_minimal()
nb_parametric
nb_nonparametric
Decision boundaries produced by the parametric Naive Bayes classifier.
Decision boundaries produced by the non-parametric Naive Bayes classifier.
Notice the rough decision boundaries relative to that of its parametric counterpart.
# Parametric Naive Bayes prediction on test data
param_pred <- predict(param_nb, newdata = test)
# Non-parametric Naive Bayes prediction on test data
nonparam_pred <- predict(nonparam_nb, newdata = test)
# Create confusion matrices
param_cm <- confusionMatrix(param_pred, test$Species)
nonparam_cm <- confusionMatrix(nonparam_pred, test$Species)
output <- capture.output({
# Print confusion matrices
cat("\n=== Parametric Naive Bayes Metrics ===\n")
print(param_cm$table)
cat("Parametric Naive Bayes Accuracy: ", param_cm$overall['Accuracy'], "\n\n")
cat("=== Non-parametric Naive Bayes Metrics ===\n")
print(nonparam_cm$table)
cat("Nonparametric Naive Bayes Accuracy: ", nonparam_cm$overall['Accuracy'], "\n")
})
cat(paste(output, collapse = "\n"))
Classification performance for both Naive Bayes models.
Non-parametric Naive Bayes achieved slightly better performance on our data.
We see that the non-parametric Naive Bayes classifier achieves slightly better accuracy than its parametric counterpart.
This is because the non-parametric density estimates produce a classifier with a more flexible decision boundary.
As a result, several of the “virginica” observations that were incorrectly classified as “versicolor” by the parametric classifier ended up being classified correctly by the non-parametric model.
That being said, the decision boundaries produced by non-parametric Naive Bayes appear to be rough and disconnected.
Thus, there are some regions of the feature space where the classification boundary may be questionable, and fail to generalize well to new data.
In contrast, the parametric Naive Bayes classifier produces smooth, connected decision boundaries that appear to accurately capture the general pattern of the feature distributions for each species.
This distinction brings up an important point that “more flexible density estimation” does not equate to “better density estimation”, especially when applied to classification.
After all, there’s a reason why Naive Bayes classification is popular.
Although making fewer assumptions about the distribution of your data may seem desirable for producing unbiased density estimates, simplifying assumptions can be effective when there is insufficient empirical data to produce high-quality estimates, or when the parametric assumptions are believed to be mostly accurate.
In the latter case, parametric estimation will introduce little to no bias to the estimator, whereas non-parametric approaches may introduce large amounts of variance.
Indeed, looking at the feature distributions below, the Gaussian assumption of parametric Naive Bayes doesn’t seem inappropriate.
For the most part, the class distributions for petal and sepal length appear to be unimodal and symmetric.
iris_long <- pivot_longer(iris, cols = c(Sepal.Length, Petal.Length), names_to = "Feature", values_to = "Value")
ggplot(iris_long, aes(x = Value, fill = Species)) +
geom_density(alpha = 0.5, bw="sj") +
facet_wrap(~ Feature, scales = "free") +
labs(title = "Distribution of Sepal and Petal Lengths by Species", x = "Length (cm)", y = "Density") +
theme_minimal()
Density distributions for Petal and Sepal length.
The univariate densities appear to be unimodal and symmetric across all species for both features.
Wrap-up
Thanks for reading! We dove into the theory behind the histogram and kernel density estimators and how to apply them in context.
Let’s briefly summarize what we discussed:
Density estimation is a fundamental tool in statistical analysis, useful both for analyzing the distribution of a variable on its own and as an intermediate step in deeper statistical analysis.
Density estimation approaches may be broadly categorized as parametric or non-parametric.
Histograms and KDEs are two popular approaches for non-parametric density estimation.
Histograms produce density estimates by computing the normalized frequency of points within each distinct bin of the data.
KDEs are “smoothed” histograms that estimate the density at a given point by computing a weighted sum over the surrounding points, where closer neighbors receive higher weight.
Non-parametric density estimation can be applied to classification algorithms that require modeling the feature densities for each target class (Bayesian classification).
Classifiers built using non-parametric density estimates may be able to define more flexible decision boundaries at the cost of higher variance.
Check out the sources below if you’re interested in learning more!
The author has created all images in this article.
Sources
Learning Resources:
García-Portugués, E. (2025). Notes for Nonparametric Statistics. Version 6.12.0. ISBN 978-84-09-29537-1. Available at https://bookdown.org/egarpor/NP-UC3M/
García-Portugués, E. (2022). A Short Course on Nonparametric Curve Estimation. Version 2.1.1. Available at https://bookdown.org/egarpor/NP-EAFIT/
UW Stat 425: Introduction to Nonparametric Statistics (Winter 2018)
James et al., An Introduction to Statistical Learning
Hastie et al., The Elements of Statistical Learning
Andrey Akinshin, The importance of kernel density estimation bandwidth
Datasets:
Fisher, R. (1936). Iris [Dataset]. UCI Machine Learning Repository. https://doi.org/10.24432/C56C76 (CC BY 4.0)
Non-Parametric Density Estimation: Theory and Applications
In this article, we’ll talk about what Density Estimation is and the role it plays in statistical analysis.
We’ll analyze two popular density estimation methods, histograms and kernel density estimators, and analyze their theoretical properties as well as how they perform in practice.
Finally, we’ll look at how density estimation may be used as a tool for classification tasks.
Hopefully after reading this article, you leave with an appreciation of density estimation as a fundamental statistical tool, and a solid intuition behind the density estimation approaches we discuss here.
Ideally, this article will also spark an interest in learning more about density estimation and point you towards additional resources to help you dive deeper than what is discussed here!
Contents:
Background Concepts
What is density estimation?
Histograms
Overview
Theoretical Properties
Demonstration of Theoretical Properties
Kernel Density Estimators (KDE)
Naive Density Estimator
KDE: Overview
Kernel and Bandwidth
Density Estimation for Classification
Wrap-up
Sources
Background Concepts
Learning/refreshing on the following concepts will be helpful to fully appreciate the rest of what is discussed in this article.
Bias and variance: important concepts for discussing the accuracy of the density estimation approaches discussed.
The cumulative distribution function and probability density function.
Parametric vs.
non-parametric statistics: knowing the distinction will help understand the relevance of the density estimation methods discussed.
O notation: used to describe the asymptotic behavior of the bias/variance of the density estimators.
Kernel: kind of important for the kernel density estimator.
What is density estimation?
Density estimation is concerned with reconstructing the probability density function of a random variable, X, given a sample of random variates X1, X2,…, Xn.
Density estimation plays a crucial role in statistical analysis.
It may be used as a standalone method for analyzing the properties of a random variable’s distribution, such as modality, spread, and skew.
Alternatively, density estimation may be used as a means for further statistical analysis, such as classification tasks, goodness-of-fit tests, and anomaly detection, to name a few.
Some of you may recall that the probability distribution of a random variable X can be completely characterized by its cumulative distribution function (CDF), F(⋅).
If X is a discrete random variable, then we can derive its probability mass function (PMF), p(⋅), from its CDF via the following relationship: p(Xi) = F(Xi) − F(Xi-1), where Xi-1 denotes the largest value within the discrete distribution of X that is less than Xi.
If X is continuous, then its probability density function (PDF), p(⋅), may be derived by differentiating its CDF i.e.
F′(⋅) = p(⋅).
Based on this, you may be wondering why we need methods to estimate the probability distribution of X, when we can just exploit the relationships stated above.
Certainly, given a sample of data X1,…, Xn, we may always construct an estimate of its CDF.
If X is discrete, then constructing its PMF is straightforward, as it simply requires counting the frequency of observations for each distinct value that appears in our sample.
However, if X is continuous, estimating its PDF is not so trivial.
Notice that our estimate of the CDF, F(⋅), will necessarily follow a discrete distribution, since we have a finite amount of empirical data.
Since F(⋅) is discrete, we cannot simply differentiate it to obtain an estimate of the PDF.
Thus, this motivates the need for other methods of estimating p(⋅).
To provide some additional motivation behind density estimation, the CDF may be suboptimal to use for analyzing the properties of the probability distribution of X.
For example, consider the following display.
PDF vs.
CDF of data following a bimodal distribution.
Certain properties of the distribution of X, such as its bimodal nature, are immediately clear from analyzing its PDF.
However, these properties are harder to notice from analyzing its CDF, due to the cumulative nature of the distribution.
For many folks, the PDF likely provides a more intuitive display of the distribution of X — it is larger at values of X that are more likely to “occur” and smaller for values of X that are less likely.
Broadly speaking, density estimation approaches may be categorized as parametric or non-parametric.
Parametric density estimation assumes X follows some distribution that may be characterized by some parameters (ex: X ∼ N(μ,σ)).
Density estimation in this case involves estimating the relevant parameters for the parametric distribution of X, and then plugging in these parameter estimates to the corresponding density function formula for X.
Non-parametric density estimation makes less rigid assumptions about the distribution of X, and estimates the shape of the density function directly from the empirical data.
As a result, non-parametric density estimates will typically have lower bias and higher variance compared to parametric density estimates.
Non-parametric methods may be desired when the underlying distribution of X is unknown and we’re working with a large amount of empirical data.
For the rest of this article, we’ll focus on analyzing two popular non-parametric methods for density estimation: Histograms and kernel density estimators (KDEs).
We’ll dig into how they work, the benefits and drawbacks of each approach, and how accurately they estimate the true density function of a random variable.
Finally, we’ll examine how density estimation can be applied to classification problems, and how the quality of the density estimator can impact classification performance.
Histograms
Overview
Histograms are a simple non-parametric approach for constructing a density estimate from a sample of data.
Intuitively, this approach involves partitioning the range of our data into distinct equal length bins.
Then, for any given point, assign its density to be equal to the proportion of points that reside within the same bin, normalized by the bin length.
Formally, given a sample of n observations
partition the domain into M bins
such that
For a given point x ∈ βl, where βl denotes the lth bin, the density estimate produced by the histogram will be
Pointwise density estimate of the histogram.
Since the histogram density estimator assigns uniform density to all points within the same bin, the density estimate will be discontinuous at all of its breakpoints where the density estimates differ.
Histogram density estimate for the standard Gaussian.
Uniform densities are assigned to all points within the same bin.
Above, we have the histogram density estimate of the standard Gaussian distribution generated from a sample of 1000 data points.
We see that x = 0 and x = −0.5 lie within the same bin, and thus have identical density estimates.
Theoretical Properties
Histograms are a simple and intuitive method for density estimation.
They make no assumptions about the underlying distribution of the random variable.
Histogram estimation simply requires tuning the bin width, h, and the point where the histogram bins originate from, t0.
However, we’ll see very soon that the accuracy of the histogram estimator is highly dependent on tuning these parameters appropriately.
As desired, the histogram estimator is a true density function.
It is non-negative over its entire domain.
It integrates to 1.
Integral of the histogram density estimator.
We can evaluate the accuracy of the histogram estimator for estimating the true density, p(⋅), by decomposing its mean squared error into its bias and variance terms.
First, lets examine its bias at a given point x ∈ (bk-1, bk].
Expected value of the pointwise histogram density estimate.
Let’s take a bit of a leap here.
Using the Taylor series expansion, the fact that the PDF is the derivative of the CDF, and |x − bk-1| ≤ h, we can derive the following.
Thus, we have
which implies
Asymptotic bias of the histogram density estimator.
Therefore, the histogram estimator is an unbiased estimator of the true density, p(⋅), as the bin width approaches 0.
Now, let’s analyze the variance of the histogram estimator.
Notice that as h → ∞, we have
Therefore,
Asymptotic variance of the histogram density estimator.
Now, we’re at a bit of an impasse; we see that as h → ∞, the bias of the histogram density estimate decreases, while its variance increases.
We’re typically concerned with the accuracy of the density estimate at large sample sizes (i.e.
as n → ∞).
Therefore, to maximize the accuracy of the histogram density estimate, we’ll want to tune h to achieve the following behavior:
Choose h to be small to minimize bias.
As h → 0 and n → ∞, we must have nh → ∞ to minimize variance.
In other words, the large sample size should overpower the small bin width, asymptotically.
This bias-variance trade-off is not unexpected:
Small bin widths may capture the density around a particular point with high precision.
However, density estimates may change from small random variations across data sets as less points will fall within the same bin.
Large bin widths include more data points when computing the density estimate at a given point, which means density estimates will be more robust to small random variations in the data.
Let’s illustrate this trade-off with some examples.
Demonstration of Theoretical Properties
First, we’ll look at how small bin widths may lead to large variance in the histogram density estimator.
For this example, we’ll draw four samples of 50 random variates, where each sample is drawn from a standard Gaussian distribution.
We’ll set a relatively small bin width (h = 0.2).
set.seed(25)
# Standard Gaussian
mu <- 0
sd <- 1
# Parameters for density estimate
n <- 50
h <- 0.2
# Generate 4 samples of standard Gaussian
samples <- replicate(4, rnorm(n, mean = mu, sd = sd), simplify = FALSE)
# Setup 2x2 plot
par(mfrow = c(2, 2), mar = c(4, 4, 3, 1))
# Plot histograms
titles <- paste("Sample", 1:4)
invisible(mapply(plot_histogram, samples, title = titles,
MoreArgs = list(binwidth = h, origin = 0, line = 0)))
Histogram density estimates (h = 0.2) generated from 4 different samples of the standard Gaussian.
Notice the high variability in density estimates across samples.
It’s clear that the histogram density estimates vary quite a bit.
For instance, we see that the pointwise density estimate at x = 0 ranges from approximately 0.2 in Sample 4 to approximately 0.6 in Sample 2.
Additionally, the distribution of the density estimate produced in Sample 1 appears almost bimodal, with peaks around −1 and a little above 0.
Let’s repeat this exercise to demonstrate how large bin widths may result in a density estimate with lower variance, but higher bias.
For this example, let’s draw four samples from a bimodal distribution consisting of a mixture of two Gaussian distributions, N(0, 1) and N(3, 1).
We’ll set a relatively large bin width (h = 2).
set.seed(25)
# Bimodal distribution parameters - mixture of N(0, 1) and N(4, 1)
mu_1 <- 0
sd_1 <- 1
mu_2 <- 3
sd_2 <- 1
# Density estimation parameters
n <- 100
h <- 2
# Generate 4 samples from bimodal distribution
samples <- replicate(4, c(rnorm(n/2, mean = mu_1, sd = sd_1), rnorm(n/2, mean = mu_2, sd = sd_2)), simplify = FALSE)
# Set up 2x2 plotting grid
par(mfrow = c(2, 2), mar = c(4, 4, 3, 1))
# Plot histograms
titles <- paste("Sample", 1:4)
invisible(mapply(plot_histogram, samples, title = titles,
MoreArgs = list(binwidth = h, origin = 0, line = 0)))
Histogram density estimates (h = 2) generated from 4 different samples of a bimodal distribution.
These histograms fail to capture the bimodal nature of the data.
There is still some variation in the density estimates across the four histograms, but they appear stable relative to the density estimates we saw above with smaller bin widths.
For instance, it appears that the pointwise density estimate at x = 0 is approximately 0.15 across all the histograms.
However, it’s clear that these histogram estimators introduce a large amount of bias, as the bimodal distribution of the true density function is masked by the large bin widths.
Additionally, we mentioned previously that the histogram estimator requires tuning the origin point, t0.
Let’s look at an example that illustrates the impact that the choice of t0 can have on the histogram density estimate.
set.seed(123)
# Distribution and density estimation parameters
# Bimodal distribution: mixture of N(0, 1) and N(5, 1)
n <- 50
data <- c(rnorm(n/2, mean = 0, sd = 1), rnorm(n/2, mean = 5, sd = 1))
h <- 3
# Set up plotting grid
par(mfrow = c(1, 2), mar = c(4, 4, 3, 1))
# Same bin width, different origins
plot_histogram(data, binwidth = h, origin = 0, title = paste("Bin width = ", h, ", Origin = 0"))
plot_histogram(data, binwidth = h, origin = 1, title = paste("Bin width = ", h, ", Origin = 1"))
Histogram density estimates of a bimodal distribution with different origin points.
Notice the histogram on the right fails to capture the bimodal nature of the data.
The histogram density estimates above differ in their origin point by a magnitude of 1.
The impact of the different origin point on the resulting histogram density estimates is evident.
The histogram on the left captures the fact that the distribution is bimodal with peaks around 0 and 5.
In contrast, the histogram on the right gives the impression that the density of X follows a unimodal distribution with a single peak around 5.
Histograms are a simple and intuitive approach to density estimation.
However, histograms will always produce density estimates that follow a discrete distribution, and we’ve seen that the resulting density estimate may be highly dependent on an arbitrary choice of the origin point.
Next, we’ll look at an alternative method for density estimation, Kernel Density Estimation, that addresses these shortcomings.
Kernel Density Estimators (KDE)
Naive Density Estimator
We’ll first look at the most basic form of a kernel density estimator, the naive density estimator.
This approach is also known as the “moving histogram”; it is an extension of the traditional histogram density estimator that computes the density at a given point by examining the number of observations that fall within an interval that is centered around that point.
Formally, the pointwise density estimate at x produced by the naive density estimator can be written as follows.
Pointwise density estimate of the Naive Density Estimator.
Its corresponding kernel is defined as follows.
Naive Density Estimator kernel function.
Unlike the traditional histogram density estimate, the density estimate produced by the moving histogram does not vary based on the choice of origin point.
In fact, there is no concept of “origin point” in the moving histogram, as the density estimate at x only depends on the points that lie within the neighborhood (x − (h/2), x + (h/2)).
Let’s examine the density estimate produced by the naive density estimator for the same bimodal distribution as we used above for highlighting the histogram’s dependency on origin point.
set.seed(123)
# Bimodal distribution - mixture of N(0, 1) and N(5, 1)
data <- c(rnorm(n/2, mean = 0, sd = 1), rnorm(n/2, mean = 5, sd = 1))
# Density estimate parameters
n <- 50
h <- 1
# Naive Density Estimator: KDE with rectangular kernel using half the bin width
# Rectangular kernel counts points within (x - h, x + h)
pdf_est <- density(data, kernel = "rectangular", bw = h/2)
# Plot PDF
plot(pdf_est, main = "NDE: Bimodal Gaussian", xlab = "x", ylab = "Density", col = "blue", lwd = 2)
rug(data)
polygon(pdf_est, col = rgb(0, 0, 1, 0.2), border = NA)
grid()
Naive Density Estimate of a bimodal distribution containing a mixture of N(0, 1) and N(5, 1).
Clearly, the density estimate produced by the naive density estimator captures the bimodal distribution much more accurately than the traditional histogram.
Additionally, the density at each point is captured with much finer granularity.
That being said, the density estimate produced by the NDE is still quite “rough” i.e.
the density estimate does not have smooth curvature.
This is because each observation is weighted as “all or nothing” when computing the pointwise density estimate, which is apprent from its kernel, K0.
Specifically, all points within the neighborhood (x − (h/2), x + (h/2)) contribute equally to the density estimate, while points outside the interval contribute nothing.
Ideally, when computing the density estimate for x, we would like to weigh points in proportion to their distance from x, such that the points closer/farther from x have a higher/lower impact on its density estimate, respectively.
This is essentially what the KDE does: it generalizes the naive density estimator by replacing the uniform density function with an arbitrary density function, the kernel.
Intuitively, you can think of the KDE as a smoothed histogram.
KDE: Overview
The kernel density estimator generated from a sample X1,…, Xn, can be defined as follows:
Pointwise density estimate of the KDE.
Below are some popular choices for kernels used in density estimation.
These are just several of the more popular kernels that are typically used for density estimation.
For more information about kernel functions, check out the Wikipedia.
If you’re seeking for some intuition behind what exactly a kernel function is (as I was), check out this quora thread.
We can see that the KDE is a genuine density function.
It is always non-negative, since K(⋅) is a density function.
It integrates to 1.
Integral of the KDE.
Kernel and Bandwidth
In practice, K(⋅) is chosen to be symmetric and unimodal around 0 (∫u⋅K(u)du = 0).
Additionally, K(⋅) is typically scaled to have unit variance when used for density estimation (∫u2⋅K(u)du = 1).
This scaling essentially standardizes the impact that the choice of bandwidth, h, has on the KDE, regardless of the kernel being used.
Since the KDE at a given point is the weighted sum of its neighboring points, where the weights are computed by K(⋅), the smoothness of the density estimate is inherited from the smoothness of the kernel function.
Smooth kernel functions will produce smooth KDEs.
We can see that the Gaussian kernel depicted above is infinitely differentiable, so KDEs with the Gaussian kernel will produce density estimates with smooth curvature.
On the other hand, the other kernel functions (Epanechnikov, rectangular, triangular) are not differentiable everywhere (ex: ±1), and in the case of the rectangular and triangular kernels, do not have smooth curvature.
Thus, KDEs using these kernels will produce rougher density estimates.
However, in practice, we’ll see that as long as the kernel function is continuous, the choice of the kernel has relatively little impact on the KDE compared to the choice of bandwidth.
set.seed(123)
# sample from standard Gaussian
x <- rnorm(50)
# kernel/bandwidths for KDEs
kernels <- c("gaussian", "epanechnikov", "rectangular", "triangular")
bandwidths <- c(0.5, 1, 2)
colors_k <- rainbow(length(kernels))
colors_b <- rainbow(length(bandwidths))
plot_kde_comparison <- function(values, label, type = c("kernel", "bandwidth")) {
type <- match.arg(type)
plot(NULL, xlim = range(x) + c(-1, 1), ylim = c(0, 0.5),
xlab = "x", ylab = "Density", main = paste("KDE with Different", label))
for (i in seq_along(values)) {
if (type == "kernel") {
d <- density(x, kernel = values[i])
col <- colors_k[i]
} else {
d <- density(x, bw = values[i], kernel = "gaussian")
col <- colors_b[i]
}
lines(d$x, d$y, col = col, lwd = 2)
}
curve(dnorm(x), add = TRUE, lty = 2, lwd = 2)
legend("topright", legend = c(as.character(values), "True Density"),
col = c(if (type == "kernel") colors_k else colors_b, "black"),
lwd = 2, lty = c(rep(1, length(values)), 2), cex = 0.8)
rug(x)
}
plot_kde_comparison(kernels, "Kernels", type = "kernel")
plot_kde_comparison(bandwidths, "Bandwidths", type = "bandwidth")
We see that the KDEs for the standard Gaussian with various kernels are relatively similar, compared to the KDEs produced with various bandwidths.
Accuracy of the KDE
Let’s examine how accurately the KDE estimates the true density, p(⋅).
As we did with the histogram estimator, we can decompose its mean squared error into its bias and variance terms.
For details behind how to derive these bias and variance terms, check out lecture 6 of these notes.
The bias and variance of the KDE at x can be expressed as follows.
Asymptotic bias and variance of the KDE.
Intuitively, these results give us the following insights:
The effect of K(⋅) on the accuracy of the KDE is primarily captured via the term σ2K = ∫K(u)2du.
The Epanechnikov kernel minimizes this integral, so theoretically it should produce the optimal KDE.
However, we’ve seen that the choice of kernel has little practical impact on the KDE relative to its bandwidth.
Additionally, the Epanechnikov kernel has a bounded support interval ([−1, 1]).
As a result, it may produce rougher density estimates relative to kernels that are nonzero across the entire real number space (ex: Gaussian).
Thus, the Gaussian kernel is commonly used in practice.
Recall that the asymptotic bias and variance of the histogram estimator as h → ∞ was O(h) and O(1/(nh)), respectively.
Comparing these against KDE tells us that the KDE improves upon the histogram density estimator primarily through decreased asymptotic bias.
This is expected: the kernel smoothly varies the weight of the neighboring points of x when computing the pointwise density at x, instead of assigning uniform density to arbitrary fixed intervals of the domain.
In other words, the KDE imposes a less rigid structure on the density estimate compared to the histogram approach.
For histograms and KDEs, we’ve seen that the bandwidth h can have a significant impact on the accuracy of the density estimate.
Ideally, we would select the h such that the mean squared error of the density estimator is minimized.
However, it turns out that this theoretically optimal h depends on the curvature of the true density p(⋅), which is unknown practice (otherwise we wouldn’t need density estimation)!
Some popular approaches for bandwidth selection include:
Assuming the true density resembles some reference distribution p0(⋅) (ex: Gaussian), then plugging in the curvature of p0(⋅) to derive the bandwidth.
This approach is simple, but it assumes the distribution of the data, so it may be a poor choice if you’re looking to build density estimates to explore your data.
Non-parametric approaches to bandwidth selection, such as cross-validation and plug-in methods.
The unbiased cross-validation and Sheather-Jones methods are popular bandwidth selectors and typically produce fairly accurate density estimates.
For more information on the impact of bandwidth selection on the KDE, check out this blog post.
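The selectors discussed above (normal reference rules, unbiased cross-validation, and Sheather-Jones) are all available in base R's stats package, so it's easy to compare the bandwidths they choose for the same sample (here reusing the standard Gaussian sample x from the kernel comparison):

# Compare the bandwidths chosen by several built-in selectors
c(
  silverman = bw.nrd0(x),  # Silverman's rule of thumb (density()'s default)
  scott     = bw.nrd(x),   # normal reference ("Scott's") rule
  ucv       = bw.ucv(x),   # unbiased (least-squares) cross-validation
  sj        = bw.SJ(x)     # Sheather-Jones plug-in
)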
set.seed(42)
# Simulate data: a bimodal distribution
x <- c(rnorm(150, mean = -2), rnorm(150, mean = 2))
# Define true density
true_density <- function(x) {
  0.5 * dnorm(x, mean = -2, sd = 1) +
    0.5 * dnorm(x, mean = 2, sd = 1)
}
# Create plotting range
x_grid <- seq(min(x) - 1, max(x) + 1, length.out = 500)
xlim <- range(x_grid)
ylim <- c(0, max(true_density(x_grid)) * 1.2)
# Base plot
plot(NULL, xlim = xlim, ylim = ylim,
main = "KDE: Various Bandwidth Selection Methods",
xlab = "x", ylab = "Density")
# KDEs with different bandwidth selectors
lines(density(x), col = "red", lwd = 2, lty = 1)                  # Silverman's rule of thumb (default)
h_scott <- 1.06 * sd(x) * length(x)^(-1/5)                        # normal reference ("Scott's") rule
lines(density(x, bw = h_scott), col = "blue", lwd = 2, lty = 2)
lines(density(x, bw = bw.ucv(x)), col = "darkgreen", lwd = 2, lty = 3)  # unbiased cross-validation
lines(density(x, bw = bw.SJ(x)), col = "purple", lwd = 2, lty = 4)      # Sheather-Jones
# True density
lines(x_grid, true_density(x_grid), col = "black", lwd = 2)
# Add legend (line types match the curves above)
legend("topright",
       legend = c("Silverman (Default)", "Scott's Rule", "Unbiased CV",
                  "Sheather-Jones", "True Density"),
       col = c("red", "blue", "darkgreen", "purple", "black"),
       lty = c(1, 2, 3, 4, 1), lwd = 2, cex = 0.8)
KDEs using various bandwidth selection methods where the underlying data follows a bimodal distribution.
Notice the KDEs using the Sheather-Jones and Unbiased Cross-Validation methods produce density estimates closest to the true density.
Density Estimation for Classification
We’ve discussed a great deal about the underlying theory of histograms and KDE, and we’ve demonstrated how they perform at modeling the true density of some sample data.
Now, we’ll look at how we can apply what we learned about density estimation for a simple classification task.
Specifically, say we want to build a classifier from a sample of n observations (x1, y1),…, (xn, yn), where each xi comes from a p-dimensional feature space X, and each yi is a class label drawn from Y = {1,…, m}.
Intuitively, for each observation, we want our classifier to assign the class label k that satisfies the following.
The Bayes classifier does precisely that, and computes the conditional probability above using the following equation.
The Bayes Classifier
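Written out (a standard statement of the rule, in terms of the quantities defined below), the decision rule and its expansion via Bayes’ theorem are:

$$\hat{y}(x) = \arg\max_{k \in \{1,\dots,m\}} P(Y = k \mid X = x), \qquad P(Y = k \mid X = x) = \frac{\pi_k f_k(x)}{\sum_{l=1}^{m} \pi_l f_l(x)}.$$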
This classifier relies on the following:
πk = P(Y = k): the prior probability that an observation (xi, yi) belongs to the kth class (i.e. yi = k).
This can be estimated by simply counting the proportion of points in each class from our sample data.
fk(x) ≡ P(X = x | Y = k): the p-dimensional density function of X for all observations in target class k.
This is harder to estimate: for each of the m target classes, we must determine the shape of the distribution for each dimension of X, and also whether there are any associations between the different dimensions.
The Bayes classifier is optimal if the quantities above can be computed precisely.
However, this is impossible to achieve in practice when working with a finite sample of data.
For more detail behind why the Bayes classifier is optimal, check out this site.
So the question becomes, how can we approximate the Bayes classifier?
One popular method is the Naive Bayes classifier.
Naive Bayes assumes class-conditional independence, which means that for each target class, it reduces the p-dimensional density estimation problem into p separate univariate density estimation tasks.
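Formally, class-conditional independence means the joint class-conditional density factorizes into a product of univariate marginals:

$$f_k(x) = \prod_{j=1}^{p} f_{kj}(x_j), \qquad k = 1,\dots,m,$$

where f_{kj} denotes the density of the jth feature within class k.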
These univariate densities may be estimated parametrically or non-parametrically.
A typical parametric approach assumes each dimension of X follows a univariate Gaussian distribution with class-specific mean and variance (equivalently, a Gaussian with a diagonal covariance matrix), whereas a non-parametric approach may model each dimension of X using a histogram or KDE.
The parametric approach to univariate density estimation in Naive Bayes may be useful when we have a small amount of data relative to the size of the feature space, as the bias introduced by the Gaussian assumption may help reduce the variance of the classifier.
However, the Gaussian assumption may not always be appropriate depending on the distribution of data that you’re working with.
Let’s examine how parametric vs. non-parametric density estimates can impact the decision boundary of the Naive Bayes classifier.
We’ll build two classifiers on the Iris dataset: one of them will assume each feature follows a Gaussian distribution, and the other will build kernel density estimates for each feature.
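The chunks below assume that the required packages and the train/test objects were created earlier in the post. As a minimal sketch of that setup (the split proportion and random seed are assumptions chosen for illustration; the object names iris2D, train, and test match those used in the code that follows):

library(naivebayes)  # naive_bayes()
library(caret)       # confusionMatrix()
library(ggplot2)     # decision boundary plots
library(tidyr)       # pivot_longer()

set.seed(1)  # assumed seed
# Two-feature subset of iris used for the 2D decision boundaries
iris2D <- iris[, c("Sepal.Length", "Petal.Length", "Species")]
# Assumed 70/30 train/test split
train_idx <- sample(seq_len(nrow(iris2D)), size = round(0.7 * nrow(iris2D)))
train <- iris2D[train_idx, ]
test  <- iris2D[-train_idx, ]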
# Parametric Naive Bayes
param_nb <- naive_bayes(Species ~ ., data = train)
# Nonparametric Naive Bayes
# KDE with Gaussian kernel and Sheather-Jones bandwidth
nonparam_nb <- naive_bayes(Species ~ ., data = train,
                           usekernel = TRUE,
                           kernel = "gaussian",
                           bw = "sj") # play with the bandwidth to see how it affects the classification boundaries!
# Create grid for plotting decision boundaries
x_seq <- seq(min(iris2D$Sepal.Length), max(iris2D$Sepal.Length), length.out = 200)
y_seq <- seq(min(iris2D$Petal.Length), max(iris2D$Petal.Length), length.out = 200)
grid <- expand.grid(Sepal.Length = x_seq, Petal.Length = y_seq)
# Predict class for each point on grid
grid$param_pred <- predict(param_nb, grid)
grid$nonparam_pred <- predict(nonparam_nb, grid)
# Plot decision boundaries
nb_parametric <- ggplot() +
geom_tile(data = grid, aes(x = Sepal.Length, y = Petal.Length, fill = param_pred), alpha = 0.3) +
geom_point(data = train, aes(x = Sepal.Length, y = Petal.Length, color = Species), size = 2) +
ggtitle("Parametric Naive Bayes Decision Boundary") +
theme_minimal()
nb_nonparametric <- ggplot() +
geom_tile(data = grid, aes(x = Sepal.Length, y = Petal.Length, fill = nonparam_pred), alpha = 0.3) +
geom_point(data = train, aes(x = Sepal.Length, y = Petal.Length, color = Species), size = 2) +
ggtitle("Nonparametric Naive Bayes Decision Boundary") +
theme_minimal()
nb_parametric
nb_nonparametric
Decision boundaries produced by the parametric Naive Bayes classifier.
Decision boundaries produced by the non-parametric Naive Bayes classifier.
Notice the rougher decision boundaries relative to those of its parametric counterpart.
# Parametric Naive Bayes prediction on test data
param_pred <- predict(param_nb, newdata = test)
# Non-parametric Naive Bayes prediction on test data
nonparam_pred <- predict(nonparam_nb, newdata = test)
# Create confusion matrices
param_cm <- confusionMatrix(param_pred, test$Species)
nonparam_cm <- confusionMatrix(nonparam_pred, test$Species)
output <- capture.output({
# Print confusion matrices
cat("\n=== Parametric Naive Bayes Metrics ===\n")
print(param_cm$table)
cat("Parametric Naive Bayes Accuracy: ", param_cm$overall['Accuracy'], "\n\n")
cat("=== Non-parametric Naive Bayes Metrics ===\n")
print(nonparam_cm$table)
cat("Nonparametric Naive Bayes Accuracy: ", nonparam_cm$overall['Accuracy'], "\n")
})
cat(paste(output, collapse = "\n"))
Classification performance for both Naive Bayes models.
Non-parametric Naive Bayes achieved slightly better performance on our data.
We see that the non-parametric Naive Bayes classifier achieves slightly better accuracy than its parametric counterpart.
This is because the non-parametric density estimates produce a classifier with a more flexible decision boundary.
As a result, several of the “virginica” observations that were incorrectly classified as “versicolor” by the parametric classifier ended up being classified correctly by the non-parametric model.
That being said, the decision boundaries produced by non-parametric Naive Bayes appear to be rough and disconnected.
Thus, there are some regions of the feature space where the classification boundary may be questionable, and fail to generalize well to new data.
In contrast, the parametric Naive Bayes classifier produces smooth, connected decision boundaries that appear to accurately capture the general pattern of the feature distributions for each species.
This distinction highlights an important point: “more flexible density estimation” does not equate to “better density estimation”, especially when applied to classification.
After all, there’s a reason why Naive Bayes classification is popular.
Although making fewer assumptions about the distribution of your data may seem desirable for producing unbiased density estimates, simplifying assumptions can be effective when there is insufficient empirical data to produce high-quality estimates, or when the parametric assumptions are believed to be mostly accurate.
In the latter case, parametric estimation will introduce little to no bias to the estimator, whereas non-parametric approaches may introduce large amounts of variance.
Indeed, looking at the feature distributions below, the Gaussian assumption of parametric Naive Bayes doesn’t seem inappropriate.
For the most part, the class distributions for petal and sepal length appear to be unimodal and symmetric.
iris_long <- pivot_longer(iris, cols = c(Sepal.Length, Petal.Length), names_to = "Feature", values_to = "Value")
ggplot(iris_long, aes(x = Value, fill = Species)) +
geom_density(alpha = 0.5, bw="sj") +
facet_wrap(~ Feature, scales = "free") +
labs(title = "Distribution of Sepal and Petal Lengths by Species", x = "Length (cm)", y = "Density") +
theme_minimal()
Density distributions for Petal and Sepal length.
The univariate densities appear to be unimodal and symmetric across all species for both features.
Wrap-up
Thanks for reading! We dove into the theory behind the histogram and kernel density estimators and demonstrated how to apply them in practice.
Let’s briefly summarize what we discussed:
Density estimation is a fundamental statistical tool, useful both as a standalone method for analyzing the distribution of a variable and as an intermediate step for deeper statistical analysis.
Density estimation approaches may be broadly categorized as parametric or non-parametric.
Histograms and KDEs are two popular approaches for non-parametric density estimation.
Histograms produce density estimates by computing the normalized frequency of points within each distinct bin of the data.
KDEs act like “smoothed” histograms: they estimate the density at a given point by computing a weighted sum over the surrounding observations, with closer neighbors receiving more weight.
Non-parametric density estimation can be applied to classification algorithms that require modeling the feature densities for each target class (Bayesian classification).
Classifiers built using non-parametric density estimates may be able to define more flexible decision boundaries at the cost of higher variance.
Check out the sources below if you’re interested in learning more!
The author has created all images in this article.
Sources
Learning Resources:
García-Portugués, E. (2025). Notes for Nonparametric Statistics. Version 6.12.0. ISBN 978-84-09-29537-1. Available at https://bookdown.org/egarpor/NP-UC3M/.
García-Portugués, E. (2022). A Short Course on Nonparametric Curve Estimation. Version 2.1.1. Available at https://bookdown.org/egarpor/NP-EAFIT/.
UW Stat 425: Introduction to Nonparametric Statistics (Winter 2018)
James et al., An Introduction to Statistical Learning
Hastie et al., The Elements of Statistical Learning
Andrey Akinshin, The importance of kernel density estimation bandwidth
Datasets:
Fisher, R. (1936). Iris [Dataset]. UCI Machine Learning Repository. https://doi.org/10.24432/C56C76. (CC BY 4.0)