Here we will choose the data to explore
Welcome to the statistical analysis assistant for Small Area Estimation. Small area estimation refers to statistical modelling where the purpose is to find estimates of a function of a variable/quantity for each of a set of groups (areas) which the population as a whole can be subdivided into.
For example, we might be interested in the average salary of individuals in each postal code area, or the proportion who are likely to vote for a particular political party in each constituency. Here the population might be the whole country, which can then be subdivided into postal code areas and constituencies respectively. In the salary example we may want to estimate not only the mean but also other quantities, for example percentiles (e.g. the 90th percentile, the salary above which only 10% of the area earn) and the percentage of the area in poverty (perhaps defined as the proportion of individuals earning below a threshold).
In this SAA we will particularly be concerning ourselves with unit level models. For a unit level model we require two datasets: a sample dataset and a population dataset. The sample dataset contains a sample of individuals from some (but not necessarily all) of the groups in the population. For each individual the variable of interest, which we will call Y (e.g. salary, voting intention), is collected along with a number of other variables, which we will call X, that might be thought to predict the variable of interest (e.g. gender, benefits, family size).
The second dataset, the population dataset, contains records for the WHOLE population, i.e. everybody in all groups. This dataset contains the same predictor variables X but here the variable of interest Y is absent. The rationale for the unit level model is therefore to fit a (multilevel) regression model to Y in the sample dataset to investigate the relationship between Y and X. We then use this model to predict the values of Y for the WHOLE population using the population dataset, and use these predicted values of Y to estimate small area quantities (e.g. means, proportions and percentiles) for each small area.
We will first take a look at the response variable that we wish to estimate at our small areas. On this page we will look at some summary information about this variable and also consider whether it needs transforming. We often transform variables so that we can fit a Normal response model and assume normality for the residuals. So firstly we ask for the name of the response variable and a value for the parameter lambda used in the Box-Cox transformation later.
In the table below we look at how representative the sample is of the population in each small area. The larger the percentage of the population that is in the sample, the more confidence we will have in our small area estimates, and the less we will have to rely on the response-predictor relationships across all areas to produce those estimates.
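As a rough illustration of the percentages in such a table, the sketch below counts individuals per area in each dataset and divides. The function name `sampling_fractions` and the area labels are hypothetical, chosen only for this example.

```python
from collections import Counter

def sampling_fractions(sample_areas, population_areas):
    """Return {area: percentage of that area's population found in the sample}."""
    n_sample = Counter(sample_areas)
    n_pop = Counter(population_areas)
    # Areas that appear in the population but not in the sample get 0.0.
    return {area: 100.0 * n_sample.get(area, 0) / n_pop[area] for area in n_pop}

# Hypothetical area labels, one entry per individual.
population_areas = ["A"] * 200 + ["B"] * 100 + ["C"] * 50
sample_areas = ["A"] * 20 + ["B"] * 5

fractions = sampling_fractions(sample_areas, population_areas)
for area, pct in sorted(fractions.items()):
    print(f"{area}: {pct:.1f}% of the population is in the sample")
```

Area C here is an out-of-sample area: its estimates would rest entirely on the relationships learned from the other areas.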
We next look at the shape of the response variable to see whether it needs transforming. First we look at the response itself as shown in the histogram below.
One way of correcting for skew (to the right) is to use a log-transformation. A log transformation only works for positive values and so we first shift our values so that they are all positive and then perform the transformation. The transformed variable is shown in the histogram below.
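A minimal sketch of the shift-then-log step follows. The size of the shift is an assumption: here values are moved so that the smallest becomes 1 whenever non-positive values are present, and the page itself may use a different rule. The name `shifted_log` is hypothetical.

```python
import math

def shifted_log(values, margin=1.0):
    """Shift values so the minimum equals `margin` if any value is <= 0, then take logs.

    Returns the transformed values together with the shift that was applied.
    """
    lowest = min(values)
    shift = margin - lowest if lowest <= 0 else 0.0
    return [math.log(v + shift) for v in values], shift

# Hypothetical response containing a negative and a zero value.
response = [-3.0, 0.0, 5.0, 40.0]
transformed, shift = shifted_log(response)
print("shift applied:", shift)
print("transformed:", [round(t, 3) for t in transformed])
```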
A more general transformation is the Box-Cox transformation, which transforms the original response y by a power function governed by a parameter \(\lambda\) that must be supplied (as you did at the top of the page). Like the log transformation, the Box-Cox transformation requires the values to be shifted so that they are positive before transforming. For the current value of \(\lambda\) we get the histogram shown below.
Finally, another alternative is the Dual Power transformation, which transforms the original response y by a function that combines a power of y with its reciprocal. Here again \(\lambda\) is a parameter to be input. The histogram for the dual power transformed variable looks as follows:
When you come to fit models for small area estimation you will be allowed to choose between these possible transformations, and so it is worth looking at the shapes of the histograms here. However, it is also worth noting that although the multilevel models used in small area estimation make normality assumptions, it is the residuals from the models rather than the responses that should be normally distributed. That said, skewed responses often lead to non-normal residuals.
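For reference, the standard form of the Box-Cox transformation, written here with a shift \(s\) chosen so that \(y + s > 0\) (the symbol \(s\) is notation assumed for this note), is:

\[
y^{(\lambda)} =
\begin{cases}
\dfrac{(y + s)^{\lambda} - 1}{\lambda}, & \lambda \neq 0, \\[2ex]
\log(y + s), & \lambda = 0.
\end{cases}
\]

Setting \(\lambda = 1\) leaves the shape of the response unchanged, while \(\lambda = 0\) recovers the log transformation.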
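For reference, the dual power transformation is usually written as follows, again with a shift \(s\) chosen so that \(y + s > 0\) (the symbol \(s\) is notation assumed for this note):

\[
y^{(\lambda)} =
\begin{cases}
\dfrac{(y + s)^{\lambda} - (y + s)^{-\lambda}}{2\lambda}, & \lambda > 0, \\[2ex]
\log(y + s), & \lambda = 0.
\end{cases}
\]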
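One way to put a number on what the histograms show is to compare the sample skewness of the response before and after each transformation. The sketch below uses the standard Box-Cox and dual power formulas on strictly positive values (so no shift is needed). The function names and the income values are hypothetical.

```python
import math

def skewness(values):
    """Moment-based skewness: third central moment over the 1.5 power of the second."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

def box_cox(values, lam):
    """Box-Cox transform of strictly positive values; lam = 0 gives the log."""
    if lam == 0:
        return [math.log(v) for v in values]
    return [(v ** lam - 1.0) / lam for v in values]

def dual_power(values, lam):
    """Dual power transform of strictly positive values; lam = 0 gives the log."""
    if lam == 0:
        return [math.log(v) for v in values]
    return [(v ** lam - v ** (-lam)) / (2.0 * lam) for v in values]

# Hypothetical right-skewed, strictly positive response.
incomes = [12, 15, 18, 20, 22, 25, 30, 38, 55, 120]

print("raw skewness:       ", round(skewness(incomes), 3))
print("log skewness:       ", round(skewness(box_cox(incomes, 0)), 3))
print("Box-Cox(0.5):       ", round(skewness(box_cox(incomes, 0.5)), 3))
print("dual power(0.5):    ", round(skewness(dual_power(incomes, 0.5)), 3))
```

A skewness closer to 0 suggests a more symmetric response, although as noted above it is the model residuals that ultimately need checking.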
We will use predictor variables that are present in both the datasets so that we can use the relationship that we find between our response and the predictors in the sample dataset to predict the response for the population dataset. It is therefore good to look at the predictors in more detail before we start modelling. Here you can look at one predictor at a time so in the pull down below choose a predictor to investigate.
We can also superimpose the population and sample in the same plot to investigate how close their distributions are for this variable. This is shown below with the population in blue and the sample in green.
We will be fitting a multilevel model as part of the SAE modelling and so it is also interesting to look at how much of the variability in the predictor variables is due to differences between small areas.
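A simple descriptive way to gauge this is the share of the total sum of squares that lies between area means. Note that this is a stand-in for illustration, not the variance components that a fitted multilevel model would report. The function name and data are hypothetical.

```python
from collections import defaultdict

def between_area_share(values, areas):
    """Proportion of the total sum of squares explained by differences in area means."""
    groups = defaultdict(list)
    for value, area in zip(values, areas):
        groups[area].append(value)
    grand_mean = sum(values) / len(values)
    total_ss = sum((v - grand_mean) ** 2 for v in values)
    between_ss = sum(
        len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups.values()
    )
    return between_ss / total_ss

# Hypothetical predictor (e.g. family size) for six individuals in two areas.
family_size = [2, 3, 2, 5, 6, 5]
area = ["A", "A", "A", "B", "B", "B"]
share = between_area_share(family_size, area)
print(f"{100 * share:.1f}% of the variation lies between areas")
```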
Now that we have looked at the response and predictor variables we will next fit a small area estimation model. Here we fit a multilevel model to the sample dataset and then use the same model to predict the response in the population dataset and thus have predictions for all individuals in the population. From these predictions we can form small area statistics by using the predicted values for individuals in each small area.
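The fit, predict and aggregate steps can be sketched as below. To stay short, this sketch swaps the multilevel model for ordinary least squares with a single predictor and ignores area effects, so it illustrates only the workflow and not the actual estimator. All names and data are hypothetical.

```python
from collections import defaultdict

def fit_ols(x, y):
    """Least-squares intercept and slope for one predictor."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def area_means(pop_x, pop_area, intercept, slope):
    """Predict the response for every population unit, then average within areas."""
    predictions = defaultdict(list)
    for x_value, area in zip(pop_x, pop_area):
        predictions[area].append(intercept + slope * x_value)
    return {area: sum(p) / len(p) for area, p in predictions.items()}

# Step 1: fit the model to the sample, where both X and Y are observed.
sample_x = [1.0, 2.0, 3.0, 4.0]
sample_y = [3.0, 5.0, 7.0, 9.0]
intercept, slope = fit_ols(sample_x, sample_y)

# Step 2: predict Y in the population, where only X is observed.
# Step 3: summarise the predictions within each small area.
pop_x = [1.0, 3.0, 2.0, 6.0]
pop_area = ["A", "A", "B", "B"]
means = area_means(pop_x, pop_area, intercept, slope)
print(means)
```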
In order to fit the model we have to re-enter the response variable and all of the predictors we wish to use in the estimation. We are also given the choice of whether to fit a model to the original response or to use a log or Box-Cox transformation. To start the model running, make these selections from the box below. Note that we are using MCMC estimation, which will give us not only small area estimates but also Bayesian credible intervals. It is, however, a computationally intensive procedure and so this page will take some time to run.
The model being fitted is:
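The exact model depends on the response, predictors and transformation selected above. As a sketch, assuming a two-level random-intercept specification (the usual unit level model), with individuals \(i\) nested in small areas \(j\), it has the general form:

\[
y_{ij} = \mathbf{x}_{ij}^{\top}\boldsymbol{\beta} + u_j + e_{ij},
\qquad u_j \sim N(0, \sigma_u^2),
\qquad e_{ij} \sim N(0, \sigma_e^2).
\]

Here \(y_{ij}\) is the (possibly transformed) response, \(\mathbf{x}_{ij}\) the vector of predictors, \(u_j\) the area level random effect and \(e_{ij}\) the individual level residual. All of this notation is assumed for illustration.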
The estimates for this model are as follows:
Here is a graph of the model residuals
We can also visualise these data graphically and so in the graph below you will see the means plotted with 95% credible intervals to illustrate differences across the small areas. The small areas are listed in the order they appear in the table.
The beauty of using MCMC for small area estimation and the fact that it predicts values for each observation in the population is that we can use these predicted datasets from each small area to look at other statistics for each small area. For example we can sort the predicted data and from this construct quantiles to get an idea of the shape of the distribution in each area.
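The sort-and-read-off step can be sketched as follows for a single set of predicted values. Two assumptions are made here: the five quantiles are taken to be the 10%, 25%, 50%, 75% and 90% points, and quantiles are linearly interpolated. With MCMC the same calculation would be repeated for each iteration's predictions. All names and data are hypothetical.

```python
from collections import defaultdict

def quantile(sorted_values, p):
    """Linearly interpolated quantile of already sorted data, for 0 <= p <= 1."""
    position = p * (len(sorted_values) - 1)
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = position - lower
    return sorted_values[lower] + fraction * (sorted_values[upper] - sorted_values[lower])

def area_quantiles(predictions, areas, probs=(0.1, 0.25, 0.5, 0.75, 0.9)):
    """Group predicted values by area, sort them and compute the requested quantiles."""
    grouped = defaultdict(list)
    for value, area in zip(predictions, areas):
        grouped[area].append(value)
    return {
        area: [quantile(sorted(values), p) for p in probs]
        for area, values in grouped.items()
    }

# Hypothetical predicted responses for two small areas.
predictions = [0.0, 10.0, 20.0, 30.0, 40.0, 5.0, 5.0, 5.0]
areas = ["A"] * 5 + ["B"] * 3
result = area_quantiles(predictions, areas)
for area, values in sorted(result.items()):
    print(area, [round(v, 2) for v in values])
```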
These can be visualised in the plot below where we see different coloured lines for 5 different quantiles in the dataset. Here on the x axis the small areas are sorted in the order they appear in the earlier table.
To see these values in detail the table below gives the values for the same 5 quantiles in tabular form.
There are other interesting statistics that are often used in small area estimation, particularly when looking at income measures, and below we use the same error bar plot format to illustrate each of these in turn.
The Gini index (or coefficient) is perhaps the most commonly used measure of inequality (particularly by economists). It attempts to measure dispersion in a frequency distribution. A value of 0 means all individuals in an area have the same value of the response (often income), whereas a value of 1 represents the extreme case where all of the response (all the income) in an area is concentrated on one individual and all other individuals have response 0. The Gini coefficient can therefore be used to measure relative inequality across a series of small areas.
The formula for the Gini index can be written
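One standard way to write it, assuming the \(N_j\) responses in small area \(j\) are sorted in ascending order as \(y_{(1)j} \le \dots \le y_{(N_j)j}\) (notation assumed for this note), is:

\[
G_j = \frac{2 \sum_{i=1}^{N_j} i \, y_{(i)j}}{N_j \sum_{i=1}^{N_j} y_{(i)j}} - \frac{N_j + 1}{N_j}.
\]

With this form the index is 0 when all values are equal, and reaches \(1 - 1/N_j\), which approaches 1 for large areas, when a single individual holds everything.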
The next index is the HCR, which records the proportion of individuals in an area whose response falls below a threshold (the poverty line). The formula for HCR can be written
Another index similar to the HCR is the Poverty gap index (PGI), which is defined as the average poverty gap in the population as a proportion of the threshold (poverty line). In other words, we use the same threshold as the HCR, but instead of simply noting what proportion are below the threshold we look at how far below it (as a percentage of the threshold) those individuals fall.
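Assuming HCR denotes the head count ratio, that is the proportion of the \(N_j\) individuals in small area \(j\) whose response \(y_{ij}\) lies below a poverty threshold \(z\) (notation assumed for this note), a standard way to write it is:

\[
\mathrm{HCR}_j = \frac{1}{N_j} \sum_{i=1}^{N_j} I(y_{ij} < z),
\]

where \(I(\cdot)\) equals 1 when the condition holds and 0 otherwise.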
The formula for the PGI can be written
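With \(z\) the poverty threshold and \(y_{ij}\) the response of individual \(i\) among the \(N_j\) individuals in small area \(j\) (notation assumed for this note), a standard way to write it is:

\[
\mathrm{PGI}_j = \frac{1}{N_j} \sum_{i=1}^{N_j} \frac{z - y_{ij}}{z} \, I(y_{ij} < z),
\]

where \(I(\cdot)\) equals 1 when the condition holds and 0 otherwise, so individuals at or above the threshold contribute a gap of zero.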
Finally we can summarise all of these indices in tabular form so that it is easier to see values for individual small areas. We do this in the table below.
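To make the three indices concrete, the sketch below computes them for a single area from a list of (predicted) responses, using standard textbook definitions: the rank-based Gini formula, the proportion strictly below the threshold, and the mean relative shortfall. The function names, the data and the threshold are hypothetical, and the page's own implementation may differ in detail (for example in how the threshold is chosen).

```python
def gini(values):
    """Rank-based Gini index of non-negative values with a positive total."""
    ordered = sorted(values)
    n = len(ordered)
    ranked_sum = sum(rank * y for rank, y in enumerate(ordered, start=1))
    return 2.0 * ranked_sum / (n * sum(ordered)) - (n + 1.0) / n

def head_count_ratio(values, threshold):
    """Proportion of values strictly below the threshold."""
    return sum(1 for v in values if v < threshold) / len(values)

def poverty_gap_index(values, threshold):
    """Mean shortfall below the threshold, as a proportion of the threshold."""
    gaps = ((threshold - v) / threshold for v in values if v < threshold)
    return sum(gaps) / len(values)

# Hypothetical incomes for one small area and a hypothetical poverty line.
incomes = [5.0, 10.0, 15.0, 20.0]
poverty_line = 12.0

print("Gini:", round(gini(incomes), 4))
print("HCR: ", head_count_ratio(incomes, poverty_line))
print("PGI: ", round(poverty_gap_index(incomes, poverty_line), 4))
```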
For comparison we can also fit models using interoperability with the R statistical software and the emdi package. The emdi package only fits Normal response models to continuous data but does allow a selection of transformations – identity, log and Box-Cox. Below you are asked again to input options for the model.
emdi produces plots of the fit of the normal response model to the data. Below we first see the individual level (Pearson) residuals from the sample plotted as a density in blue, with a best fitting normal distribution superimposed in black. The extent to which the blue density departs from the black normal curve indicates how good or poor a fit the normal assumption is.
We can look at a similar plot for the
Another way of looking at fit is via quantile quantile plots that compare the actual values of the residuals with those that one would theoretically expect from the underlying assumed distribution, in this case the normal distribution.
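The points of a normal Q-Q plot can be computed as in the sketch below. The plotting positions \((i - 0.5)/n\) are one common convention and are an assumption here, as are the function name and the residual values.

```python
from statistics import NormalDist, mean, pstdev

def qq_points(residuals):
    """Pair each sorted standardised residual with its theoretical normal quantile."""
    n = len(residuals)
    centre, spread = mean(residuals), pstdev(residuals)
    observed = sorted((r - centre) / spread for r in residuals)
    standard_normal = NormalDist()
    theoretical = [standard_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, observed))

# Hypothetical individual level residuals.
residuals = [-1.2, 0.3, 0.8, -0.4, 2.1, -0.9, 0.1]
points = qq_points(residuals)
for theoretical, observed in points:
    print(f"{theoretical:7.3f} {observed:7.3f}")
```

If the normal assumption holds, the pairs lie close to the line where the observed value equals the theoretical value.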
In the plot below we see to the left such a Q-Q plot for the individual level residuals and to the right a similar plot for the
When we use a Box-Cox transform we need to calculate values for the parameter \(\lambda\) that describes the specific transformation. For the MCMC algorithms this forms part of the algorithm, whilst emdi evaluates the likelihood at a grid of values. The plot below shows which values maximise the (log) likelihood.
This can also be found in the emdi output as follows:
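The idea of evaluating the likelihood over a grid of \(\lambda\) values can be sketched as below. This is a deliberately simplified stand-in: it uses the profile log-likelihood of an independent normal model with a constant mean, not the multilevel model that emdi actually fits, and all names and data are hypothetical.

```python
import math

def box_cox(values, lam):
    """Box-Cox transform of strictly positive values; lam = 0 gives the log."""
    if abs(lam) < 1e-12:
        return [math.log(v) for v in values]
    return [(v ** lam - 1.0) / lam for v in values]

def profile_loglik(values, lam):
    """Profile log-likelihood of lam (up to a constant) under a constant-mean normal model."""
    transformed = box_cox(values, lam)
    n = len(transformed)
    centre = sum(transformed) / n
    variance = sum((t - centre) ** 2 for t in transformed) / n
    # The second term is the log Jacobian of the transformation.
    jacobian = (lam - 1.0) * sum(math.log(v) for v in values)
    return -0.5 * n * math.log(variance) + jacobian

def best_lambda(values, grid):
    """Return the grid value with the highest profile log-likelihood."""
    return max(grid, key=lambda lam: profile_loglik(values, lam))

# Hypothetical response that is symmetric on the log scale.
response = [math.exp(x) for x in (-2.0, -1.0, 0.0, 1.0, 2.0)]
grid = [step / 4.0 for step in range(-8, 9)]  # -2.0, -1.75, ..., 2.0

print("lambda maximising the log-likelihood:", best_lambda(response, grid))
```

Because the example data are symmetric on the log scale, the grid search selects a value at or very near 0, which corresponds to the log transformation.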
We can also visualise these data graphically and so in the graph below you will see the means plotted with 95% confidence intervals to illustrate differences across the small areas. The small areas are listed in the order they appear in the table.
As with MCMC, the emdi algorithm for small area estimation also constructs estimates for other statistics for each small area. For example, we can get estimates for quantiles to get an idea of the shape of the distribution in each area.
These can be visualised in the plot below where we see different coloured lines for 5 different quantiles in the dataset. Here on the x axis the small areas are sorted in the order they appear in the earlier table.
To see these values in detail the table below gives the values for the same 5 quantiles in tabular form.
There are other interesting statistics that are often used in small area estimation, particularly when looking at income measures, and below we use the same error bar plot format to illustrate each of these in turn.
The Gini index (or coefficient) is perhaps the most commonly used measure of inequality (particularly by economists). It attempts to measure dispersion in a frequency distribution. A value of 0 means all individuals in an area have the same value of the response (often income), whereas a value of 1 represents the extreme case where all of the response (all the income) in an area is concentrated on one individual and all other individuals have response 0. The Gini coefficient can therefore be used to measure relative inequality across a series of small areas.
The formula for the Gini index can be written
The formula for HCR can be written
Another index similar to the HCR is the Poverty gap index (PGI), which is defined as the average poverty gap in the population as a proportion of the threshold (poverty line). In other words, we use the same threshold as the HCR, but instead of simply noting what proportion are below the threshold we look at how far below it (as a percentage of the threshold) those individuals fall.
The formula for the PGI can be written
Finally we can summarise all of these indices in tabular form so that it is easier to see values for individual small areas. We do this in the table below.