The following paper and code were prepared for the module GIS and Science of the MRes Advanced Spatial Analysis and Visualization of the Centre for Advanced Spatial Analysis (CASA) of UCL.

You can find the code and the data used on my GitHub page: github.com/mandarini

Introduction

The three functions described below work together to extend what can be done with a linear regression model. My goal is to help users experiment with the explanatory variables of a model, so that they can reach conclusions about how the dependent variable would change if one of the explanatory variables were changed as desired.

The need for such a set of functions can be seen in the results. There is a wide range of literature and programming packages that deal with linear regression models. By combining existing functions and methods, I have tried to take this analysis one step further, allowing the user to change one parameter and visualize the results of that choice. I believe that automating this procedure is vital: doing it step by step is time-consuming and unnecessary, since the same steps are performed each time and only the inputs change.

To achieve my goal I created three functions that work together, each using the output of the previous one, to produce maps of the initial situation and the future situation.

Potential Application / An example

Linear regression is a method that attempts to model the relationship between a scalar dependent variable and one or more explanatory variables [1]. This set of functions can be used to explore what effect a change in one explanatory variable has on the dependent variable.

Data Used

To demonstrate the use and capabilities of this set of functions, I am using a data set that contains a range of demographic and related data for each ward in Greater London [2]. I chose this data set because I had already prepared it for use in R in previous work. However, any data set can be used in the same way.

In the example presented here, a model is built to predict the average GCSE scores in London wards as of 2011. The first graph shows two histograms; the model is good, as the two histograms almost coincide (figure 1). One of the four variables kept in the final model is the percentage of unauthorized absence in all schools. We want to see how the GCSE scores would increase if there were fewer absences in schools, so we assume that absences are reduced by 20%. The first map (figure 2) shows the current situation. The second map (figure 3) shows the hypothetical future situation, in which absences are reduced and pupils therefore attend class more consistently. The range in grades is smaller and, even with such a small change, there are more B’s and A’s than before [3]. This knowledge can be used to argue for a stricter policy on unauthorized absences, on the grounds that students could achieve better scores that way.
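The scenario described above can be sketched in a few lines of base R. This is only a minimal illustration, not the code of the three functions themselves: the data are synthetic, the column names (absence, income, gcse) are hypothetical, and "reduced by 20%" is assumed here to mean a relative reduction (each ward's value multiplied by 0.8).

```r
# Synthetic stand-in for the ward data; all column names are hypothetical.
set.seed(42)
n <- 200
wards <- data.frame(
  absence = runif(n, 0.5, 2.5),  # % unauthorized absence (made-up values)
  income  = rnorm(n, 35, 8)      # a second, made-up explanatory variable
)
wards$gcse <- 360 - 15 * wards$absence + 0.8 * wards$income + rnorm(n, 0, 5)

# Fit the linear model on the current situation.
model <- lm(gcse ~ absence + income, data = wards)

# Build the scenario: same wards, absences reduced by 20% (assumed relative).
scenario <- wards
scenario$absence <- scenario$absence * 0.8

# Predict the dependent variable for the current and the scenario data.
current_pred  <- predict(model, newdata = wards)
scenario_pred <- predict(model, newdata = scenario)

# Average predicted change in score per ward.
mean_change <- mean(scenario_pred - current_pred)
```

Because the model is linear, the predicted change in each ward is simply the fitted absence coefficient multiplied by the change in that ward's absence value, so the scenario shifts the predictions without altering the other variables.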

Limitations / Further Work

This set of functions has its own limitations. The first limitation I came across is that I could not find an automated way to perform a backward regression on the whole data set. I believe this is due to the time it would take to check all possible combinations and come up with the best model for a given data set. So I had to limit my function to a choice of up to 6 variables. I could allow more with just a few lines of code, but I figured that, since a human chooses the variables, they will be able to pick the 6 most likely significant variables out of a data set. If that fails, they can run the function again to arrive at a good model.
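For readers who want to experiment with automating the variable selection, one possible starting point is the step() function from base R's stats package, which performs backward elimination by AIC. It is a greedy search, so it does not check all possible combinations and is not guaranteed to find the best model. The sketch below uses synthetic data with hypothetical variable names (x1 to x6, y) and is not part of the functions described here.

```r
# Synthetic data: six candidate variables, only x1 and x2 truly matter.
set.seed(1)
n <- 300
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n),
                x4 = rnorm(n), x5 = rnorm(n), x6 = rnorm(n))
d$y <- 2 + 3 * d$x1 - 2 * d$x2 + rnorm(n)

# Start from the full model with all candidate variables.
full <- lm(y ~ ., data = d)

# Backward elimination by AIC; trace = 0 suppresses the step-by-step log.
reduced <- step(full, direction = "backward", trace = 0)

# Names of the explanatory variables that survived the elimination.
kept <- attr(terms(reduced), "term.labels")
```

The strong predictors should survive the elimination, while weak ones may or may not be dropped, which is why a human check of the resulting model remains useful.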

The second limitation is that I left the width of the histogram bins at the default value, which is range/30 (http://docs.ggplot2.org/current/geom_histogram.html). This is not a good value, as stated in the above citation, but I could not presume a good value in advance. There are functions and methods available that choose the optimum width for bins, but I figured that something like that would just make the function more complicated and would raise its requirements (since additional R packages would need to be loaded). So, since the first plotting of the histograms is just there for a quick check of how good the function is, I left it at the default value. Another problem is that the functions create and store in RAM a number of data sets that are used only once or whose creation could be avoided. They are just created as a bridge to a further goal. So, an improvement to the functions would be to find a way to eliminate all the unnecessary data sets.
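As an illustration of the bin width issue discussed above, the sketch below compares the range/30 default with the Freedman-Diaconis rule (2 * IQR / n^(1/3)), which can be computed in base R without loading any extra package. The rule is just one common choice, not the one used in the functions, and the scores vector holds made-up values.

```r
# Made-up values standing in for the variable being plotted.
set.seed(7)
scores <- rnorm(600, mean = 345, sd = 20)

# The default discussed above: the range split into 30 bins.
default_width <- diff(range(scores)) / 30

# Freedman-Diaconis rule: 2 * IQR / n^(1/3).
fd_width <- 2 * IQR(scores) / length(scores)^(1/3)

# Breaks that cover the full range of the data with the chosen width.
breaks <- seq(min(scores), max(scores) + fd_width, by = fd_width)

# Compute the histogram without drawing it.
h <- hist(scores, breaks = breaks, plot = FALSE)
```

A width computed this way could then be passed to ggplot2 through the binwidth argument of geom_histogram.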

Another limitation is that the preparation of the spatial coordinates (the fortify-ing of the shapefile) for use in ggplot2 must be carried out by the user and is not implemented in any of the functions. This is because the fortify function requires the user to specify the field by which the coordinates are grouped, so automating it in a new function would be pointless: it would not save the user any time.

One last limitation is in the last function, where the maps are plotted. In the colouring of the dots, white represents the mean of the plotted values. It would be more correct, as far as visualization and comparison are concerned, to use the mean of the actual values in both the actual and the predicted maps, so that the user can spot the change more easily. However, there is an issue: if the predicted values are very different from the actual ones, the mean of the actual values will have no meaning in the new map, and all values will be of almost the same colour.

Requirements

1. R project (http://www.r-project.org/)

2. Internet connection (only to download any missing packages)

3. R packages:

- rms (http://cran.r-project.org/web/packages/rms/index.html)
- ggplot2 (http://cran.r-project.org/web/packages/ggplot2/index.html)
- maptools (http://cran.r-project.org/web/packages/maptools/index.html)

References

documentation that helped me understand linear regression modeling

http://en.wikipedia.org/wiki/Linear_regression (reference number 1)

Dunn (1989) Building regression models: the importance of graphics

Jones (1984) Graphical methods for exploring relationships

http://en.wikipedia.org/wiki/Simple_linear_regression

http://en.wikipedia.org/wiki/Linear_model

in class lectures and notes kept in class

documentation that helped me learn how to write functions in R / helped me improve my coding skills

http://www.rexamples.com/4/Reading%20user%20input

http://geo.maua.sp.gov.br/maua/pacotes/rlib/linux/gtools/html/ask.html

documentation that helped me better understand certain R functions

http://stat.ethz.ch/R-manual/R-patched/library/graphics/html/hist.html

http://docs.ggplot2.org/current/geom_histogram.html

http://stackoverflow.com/questions/3541713/how-to-plot-two-histograms-together-in-r

http://rfunction.com/archives/539

http://spatial.ly/2013/12/introduction-spatial-data-ggplot2/

I also had to study the R packages I used

http://cran.r-project.org/web/packages/rms/rms.pdf

http://cran.r-project.org/web/packages/ggplot2/ggplot2.pdf

documentation that helped me understand GCSE capped points

Brief Guide 6: KS4 Capped Points Scores, Lincolnshire School Improvement Service

data downloaded from

http://data.london.gov.uk/datastore/package/ward-profiles-and-atlas