Investigating the Statistical Properties of the Double Kernel Density Estimator
Main Authors: | , |
---|---|
Format: | Dissertation |
Language: | English |
Published: | ProQuest Dissertations & Theses, 01-01-2018 |
Subjects: | |
Summary: | Researchers have recently been motivated to use the double kernel density by the information loss that results from aggregating incidents into geographical areas such as cities or neighborhoods. This aggregation has been deemed necessary both to reduce the effects of variance on the relative errors of the results and to preserve the privacy of individuals who have been stricken with disease. Use of the double kernel density also allows for the linking of location data with other data sources, such as welfare levels, air pollution, and other attributes not available for individuals. As far as we know, the statistical properties of the double kernel density are not yet fully understood. We would like to know whether the estimates of risk or incidence rate obtained using the double kernel density are accurate or misleading under different conditions. Of particular concern are the peaks of the true risk function. Each peak indicates a point around which the risk of contracting a disease is at its highest. While the accuracy of the estimated magnitude of these peaks is important, it is critical that the location estimate be accurate so as not to create any misleading associations with other, nearby points of interest. This study examines some of the statistical properties of the double kernel density. In particular, we empirically analyze how different factors of the population and incidence distributions affect the accuracy of the double kernel density. Theory and method: The type of study we are interested in involves incidents of a chronic disease. The common measure of frequency used in the literature is called the incidence rate. For a given population, time period, and set of incidents, we can compute the overall incidence rate by dividing the total number of incidents by the total population. However, we may wish to compare the incidence rate at different locations within an area. To do this, we compute the incidence rate point-wise.
The basic unit of our study is the experiment, a set of Monte Carlo simulations run with the same initial setup and the same measurements taken. Each experiment is run with a fixed set of parameters. We run a set of experiments, varying one parameter at a time, or in some cases two parameters in tandem, in order to observe the effect of each parameter on the accuracy of the double kernel density. In order to compute the double kernel density, a parameter known as the bandwidth must be set, and the accuracy of the double kernel density is highly dependent upon it. In each experiment, we used two techniques, Silverman's Rule of Thumb and Least Squares Cross-Validation, to select the bandwidth. We compare the results of these two techniques. In order to describe the accuracy of the double kernel density as a method of estimating the true risk function λ, we use several accuracy measures. In particular, for each experiment we measure mean integrated squared error, mean integrated absolute error, supremum error, peak bias, peak drift, centroid bias, and centroid drift. |
---|---|
ISBN: | 9798471190931 |
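
The kind of experiment the summary describes can be illustrated with a minimal one-dimensional sketch. It is not the dissertation's code, and it rests on several assumptions that the record does not confirm: the double kernel density is taken to be the overall incidence rate multiplied by the ratio of a kernel density estimate of the incident locations to a kernel density estimate of the population locations; the kernel is Gaussian; the bandwidth comes only from one common form of Silverman's rule of thumb (least squares cross-validation is not shown); and "peak drift" is read as the distance between the estimated and true peak locations. The function names, the population distribution, and the risk function `true_rate` are all hypothetical.

```python
import numpy as np

def silverman_bandwidth(x):
    """One common form of Silverman's rule of thumb for a 1-D Gaussian kernel."""
    sd = x.std(ddof=1)
    q25, q75 = np.percentile(x, [25, 75])
    spread = min(sd, (q75 - q25) / 1.34)
    return 0.9 * spread * x.size ** (-0.2)

def kde(x, grid, h):
    """Gaussian kernel density estimate of the sample x, evaluated on grid."""
    z = (grid[:, None] - x[None, :]) / h
    return np.exp(-0.5 * z**2).sum(axis=1) / (x.size * h * np.sqrt(2.0 * np.pi))

def double_kernel_rate(cases, population, grid):
    """Assumed estimator: overall rate times the ratio of the two densities."""
    f_cases = kde(cases, grid, silverman_bandwidth(cases))
    f_pop = kde(population, grid, silverman_bandwidth(population))
    return (cases.size / population.size) * f_cases / f_pop

def true_rate(x):
    """Hypothetical true risk function with a single peak at x = 6."""
    return 0.02 + 0.08 * np.exp(-0.5 * ((x - 6.0) / 0.8) ** 2)

def run_replication(rng, grid, n_pop=5000):
    """One Monte Carlo replication: simulate data, estimate, measure errors."""
    population = rng.normal(5.0, 2.0, size=n_pop)
    # Each individual becomes an incident with probability true_rate(location).
    cases = population[rng.random(n_pop) < true_rate(population)]
    estimate = double_kernel_rate(cases, population, grid)
    truth = true_rate(grid)
    err = estimate - truth
    dx = grid[1] - grid[0]
    ise = np.sum(err**2) * dx            # integrated squared error (Riemann sum)
    sup = np.max(np.abs(err))            # supremum error on the grid
    drift = grid[np.argmax(estimate)] - grid[np.argmax(truth)]  # peak location error
    return ise, sup, drift

# One "experiment": 20 replications with a fixed set of parameters.
rng = np.random.default_rng(0)
grid = np.linspace(2.0, 8.0, 121)
results = np.array([run_replication(rng, grid) for _ in range(20)])
mise, mean_sup, mean_drift = results.mean(axis=0)
print(f"MISE={mise:.3e}  mean sup error={mean_sup:.3e}  mean peak drift={mean_drift:.3f}")
```

Averaging the integrated squared error over replications gives the mean integrated squared error for one parameter setting; repeating the run while changing a single parameter, such as `n_pop` or the width of the peak in `true_rate`, mirrors the one-parameter-at-a-time design described in the summary.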