I am having trouble understanding how to implement a Gaussian kernel density estimation of the following dataset in R. I appreciate if you can help me understand the mechanism of how to do it. I am currently trying to get a formula for the bell shaped curves at the bottom of the following picture. As you can see there is one bell shaped curve for each data point. (Note the picture does not represent the data I am using.)
This is my data:
x<-c(4.09, 4.46, 4.61, 4.30, 4.03, 5.22, 4.21, 4.07, 4.02, 4.58, 4.66, 4.05, 4.23, 5.51, 4.03, 4.72, 4.47, 4.50, 5.80, 4.30, 4.09, 4.78, 4.18, 4.45, 4.40, 5.60, 4.37, 4.42, 4.88, 4.20, 4.45, 4.10, 4.43, 4.58, 4.40, 4.38)
(x has 36 elements)
This is the kernel density estimator:
(If you can't see the image, it's from this page http://sfb649.wiwi.hu-berlin.de/fedc_homepage/xplore/tutorials/xlghtmlnode33.html)
is the Gaussian kernel function and h=.1516 is the bandwidth selected by Scott.
So, plugging in we get f hat (x) = 1/(36*.1516) (1/sqrt(2pi))[e^(-1/2 ((4.09-x)/.1516)^2 + e^(-1/2 ((4.46-x)/.1516)^2 + ... + e^(-1/2 ((4.38-x)/.1516)^2]
Ok. So we have a function of x. But how do we get the equation of each of the bell shaped curves in the above diagram? If we plug in, for example, 4.09, into f hat (x) we get a number, not a curve/function/distribution. Can someone help me understand the procedure to find the equation for the bell shaped curve/kernel density estimate?
Here's a function that will return your fhat function given your x
values and h
value
get_fhat <- function(x, h) {
Vectorize(function(z) 1/length(x)/h*sum(dnorm((x-z)/h)))
}
This function returns a function that we can use to get values. We Vectorize
it so we can pass in multiple values at once to the function.
We can get a single value or plot it with
fhat <- get_fhat(x, .1516)
fhat(4.09)
# [1] 0.9121099
curve(fhat, from=min(x), to=max(x))
How do you determine how wide to make the interval? Does it have to do with the bandwidth h?
You could extend the ends till
fhat
falls below some threshold if you like. But since you are summing normal densities they technically have infinite support.