6. Local regression
Regression models are typically “global”. That is, all data are used simultaneously to fit a single model. In some cases it can make sense to fit more flexible “local” models. Such models exist in a general regression framework (e.g. generalized additive models), where “local” refers to the values of the predictor variables. In a spatial context, “local” refers to location. Rather than fitting a single regression model, it is possible to fit several models, one for each of (possibly very many) locations. This technique is sometimes called “geographically weighted regression” (GWR). GWR is a data exploration technique that helps you understand how the importance of different variables changes over space (which may indicate that the model used is misspecified and can be improved).
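To make the idea of a “local” model concrete, here is a minimal sketch in base R. It assumes synthetic data in which the effect of x on y increases from west to east, and it uses a Gaussian distance kernel as weights. The helper local_fit and all column names are made up for illustration; they are not part of any package.

```r
# Synthetic data: the effect of x on y increases from west (px = 0) to east (px = 1).
set.seed(1)
n <- 200
d <- data.frame(px = runif(n), py = runif(n), x = rnorm(n))
d$y <- 1 + (1 + 2 * d$px) * d$x + rnorm(n, sd = 0.1)

# Hypothetical helper: weighted regression centered on location (x0, y0).
# bw is the kernel bandwidth, in the same units as the coordinates.
local_fit <- function(x0, y0, bw, data) {
  dist <- sqrt((data$px - x0)^2 + (data$py - y0)^2)
  data$w <- exp(-0.5 * (dist / bw)^2)  # Gaussian kernel: nearby points count more
  coef(lm(y ~ x, data = data, weights = w))
}

global <- coef(lm(y ~ x, data = d))   # a single slope for the whole area
west <- local_fit(0.1, 0.5, 0.2, d)   # local slope near the western edge
east <- local_fit(0.9, 0.5, 0.2, d)   # local slope near the eastern edge
rbind(global, west, east)
```

The global model reports one average slope, whereas the two local fits should differ, with the eastern slope the larger of the two.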
There are two examples here: first a short example with California precipitation data, and then a more elaborate example with house price data.
Here is an example of GWR with California precipitation data. Get the data (precipitation data and counties).
Compute annual average precipitation
Global regression model
Create objects with a planar crs.
Get the optimal bandwidth
Create a regular set of points to estimate parameters for.
Run the function
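The last three steps (bandwidth selection, a regular set of fit points, and running the function) might look roughly like the sketch below. It assumes the spgwr package is installed and uses synthetic data with planar coordinates instead of the precipitation data; the column names px, py, x and y are made up.

```r
library(spgwr)  # assumed to be installed

# Synthetic observations with planar coordinates (arbitrary units)
set.seed(1)
n <- 150
d <- data.frame(px = runif(n, 0, 100), py = runif(n, 0, 100), x = rnorm(n))
d$y <- 1 + (1 + d$px / 50) * d$x + rnorm(n, sd = 0.2)
xy <- cbind(d$px, d$py)

# Bandwidth chosen by cross-validation
bw <- gwr.sel(y ~ x, data = d, coords = xy)

# Regular grid of locations at which to estimate the coefficients
fitpts <- as.matrix(expand.grid(px = seq(10, 90, 20), py = seq(10, 90, 20)))

# Fit the local models at the grid points only
g <- gwr(y ~ x, data = d, coords = xy, bandwidth = bw, fit.points = fitpts)
head(as.data.frame(g$SDF))
```

The SDF element holds one row of local coefficients per fit point, which is what gets linked back to the raster in the next step.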
Link the results back to the raster
California House Price Data
We will use house prices data from the 1990 census, taken from “Pace, R.K. and R. Barry, 1997. Sparse Spatial Autoregressions. Statistics and Probability Letters 33: 291-297.” You can download the data here
Each record represents a census “blockgroup”. The longitude and latitude of the centroid of each block group are available. We can use these to make a map and to link the data to other spatial data, for example to get the county membership of each block group. To do that, let’s first turn the data into a SpatialPointsDataFrame so that we can find out to which county each point belongs.
Now get the county boundaries and assign the CRS of the counties to the houses data (this is valid because both are in longitude/latitude).
Do a spatial query (points in polygon)
We can summarize the data by county. First combine the extracted county data with the original data.
Compute the population by county
Income is harder because we have the median household income by blockgroup. But it can be approximated by first computing total income by blockgroup, summing that, and dividing that by the total number of households.
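As a small worked example of the two county summaries, here is a sketch with three made-up block groups. The column names (county, population, households, income) are assumptions for illustration and need not match the real data set.

```r
# Hypothetical block-group records
bg <- data.frame(
  county     = c("A", "A", "B"),
  population = c(1000, 3000, 2000),
  households = c(400, 1000, 800),
  income     = c(50000, 30000, 40000)  # median household income
)

# Population by county
pop <- aggregate(population ~ county, data = bg, FUN = sum)

# Approximate total income per block group, sum by county,
# then divide by the total number of households
bg$suminc <- bg$income * bg$households
inc <- aggregate(cbind(suminc, households) ~ county, data = bg, FUN = sum)
inc$income <- inc$suminc / inc$households
inc
```

For county A this gives (50000 × 400 + 30000 × 1000) / 1400 ≈ 35714, which is a household-weighted average and not the plain mean of the two medians (40000).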
Before we make a regression model, let’s first add some new variables that we might use, and then see if we can build a regression model with house price as dependent variable. The authors of the paper used a lot of log transforms, so you can also try that.
Ordinary least squares regression:
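A sketch of what adding derived variables and fitting an ordinary least squares model could look like. The data are synthetic, and the variable names (roomhead, hhsize, and so on) are illustrative assumptions, not necessarily those of the real data.

```r
# Synthetic stand-in for the block-group data
set.seed(42)
n <- 300
hd <- data.frame(
  income     = runif(n, 1, 10),
  households = round(runif(n, 100, 1000)),
  houseAge   = round(runif(n, 1, 50))
)
hd$population <- round(hd$households * runif(n, 2, 4))
hd$totalRooms <- round(hd$households * runif(n, 3, 7))
hd$houseValue <- exp(11 + 0.2 * hd$income + rnorm(n, sd = 0.3))

# Derived per-household variables
hd$roomhead <- hd$totalRooms / hd$households
hd$hhsize   <- hd$population / hd$households

# OLS on the raw scale, and with a log-transformed response
m  <- lm(houseValue ~ income + houseAge + roomhead + hhsize, data = hd)
ml <- lm(log(houseValue) ~ income + houseAge + roomhead + hhsize, data = hd)
c(raw = summary(m)$r.squared, log = summary(ml)$r.squared)
```

Because the synthetic house values were generated on a log scale, the log model is the correctly specified one here; with real data you would compare the two rather than assume either.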
Geographically Weighted Regression
Of course we could make the model more complex, with e.g. squared income, and interactions. But let’s see if we can do Geographically Weighted Regression. One approach could be to use counties.
First I remove records that were outside the county boundaries
Then I write a function to get what I want from the regression (the coefficients in this case)
And now run this for all counties using sapply:
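The pattern of “one regression per county” can be sketched as follows. The data are synthetic, with a different income effect built into each of three made-up counties, and regfun is a stand-in for the kind of function described above.

```r
# Synthetic data with a county-specific income effect
set.seed(7)
d <- data.frame(county = rep(c("A", "B", "C"), each = 60),
                income = runif(180, 1, 10))
slope <- c(A = 1, B = 2, C = 3)
d$houseValue <- 10 + unname(slope[d$county]) * d$income + rnorm(180)

# Return only the coefficients of a model fitted to one subset
regfun <- function(x) {
  coefficients(lm(houseValue ~ income, data = x))
}

# Split the records by county and fit one model per county
byCounty <- split(d, d$county)
res <- sapply(byCounty, regfun)
res
```

sapply returns a matrix with one row per coefficient and one column per county, so res["income", ] is the vector of local income effects.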
Plot of a single coefficient
There clearly is variation in the coefficient (\(\beta\)) for income. How does this look on a map?
First make a data.frame of the results
Fix the counties object. There are too many counties because of the presence of islands. I first aggregate (‘dissolve’ in GIS-speak) the counties such that a single county becomes a single (multi-)polygon.
Now we can merge this SpatialPolygonsDataFrame with the data.frame that holds the regression results.
To show all parameters in a ‘conditioning plot’, we need to first scale the values to get similar ranges.
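Scaling can be done with the base R function scale, which centers each column on zero and divides it by its standard deviation. A minimal sketch with made-up coefficient values:

```r
# Hypothetical regression results: one row per county
res <- data.frame(
  intercept = c(120, 95, 140),
  income    = c(1.2, 2.5, 0.8),
  houseAge  = c(-30, 10, 5)
)

# After scaling, every column has mean 0 and standard deviation 1,
# so the parameters can share one color scale in a multi-panel plot
scaled <- scale(res)
round(scaled, 2)
```

Note that scaled values only show where a coefficient is relatively high or low; the original units are lost.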
Is this just random noise, or is there spatial autocorrelation?
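One common way to answer this question is Moran’s I. The sketch below computes the statistic by hand for five made-up regions in a row, where neighbors are regions that share a border. The helper moran_i is hypothetical and only returns the statistic; a significance test (for example moran.test in the spdep package) would be needed for real data.

```r
# Hypothetical helper: Moran's I for values x and a spatial weights matrix w
moran_i <- function(x, w) {
  z <- x - mean(x)
  n <- length(x)
  (n / sum(w)) * sum(w * outer(z, z)) / sum(z^2)
}

# Five regions in a row; each region neighbors the one next to it
w <- matrix(0, 5, 5)
w[cbind(1:4, 2:5)] <- 1
w <- w + t(w)

trend <- moran_i(c(1, 2, 3, 4, 5), w)      # 0.5: similar values are neighbors
checker <- moran_i(c(1, 5, 1, 5, 1), w)    # -1: neighbors are dissimilar
c(trend = trend, checker = checker)
```

Values near zero suggest spatially random coefficients, positive values suggest that neighboring counties have similar coefficients.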
By grid cell
An alternative approach would be to compute a model for grid cells. Let’s use the ‘Teale Albers’ projection (often used when mapping the entire state of California).
Create a RasterLayer using the extent of the counties, and set an arbitrary resolution of 50 by 50 km cells
Get the xy coordinates for each raster cell:
For each cell, we need to select a number of observations, let’s say within 50 km of the center of each cell (thus the data that are used in different cells overlap). And let’s require at least 50 observations to do a regression.
First transform the houses data to Teale-Albers
Set up a new regression function.
Run the model for all cells if there are at least 50 observations within a radius of 50 km.
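The selection rule (observations within 50 km of a cell center, and at least 50 of them) can be sketched in base R as follows. The observations are synthetic, coordinates are assumed to be planar and in meters, and regfun2 is a hypothetical helper that returns only the income coefficient.

```r
# Synthetic observations on a 200 by 200 km area (coordinates in meters)
set.seed(3)
obs <- data.frame(px = runif(400, 0, 200000), py = runif(400, 0, 200000),
                  income = runif(400, 1, 10))
obs$houseValue <- 10 + 2 * obs$income + rnorm(400)

# Centers of a regular grid of 50 km cells
centers <- expand.grid(px = seq(25000, 175000, 50000),
                       py = seq(25000, 175000, 50000))

# Fit a model with the observations within `radius` of (cx, cy);
# return NA if there are fewer than `minobs` of them
regfun2 <- function(cx, cy, data, radius = 50000, minobs = 50) {
  dist <- sqrt((data$px - cx)^2 + (data$py - cy)^2)
  sel <- data[dist < radius, ]
  if (nrow(sel) < minobs) return(NA_real_)
  unname(coefficients(lm(houseValue ~ income, data = sel))["income"])
}

centers$income <- mapply(regfun2, centers$px, centers$py,
                         MoreArgs = list(data = obs))
head(centers)
```

Because the radius equals the cell size, neighboring cells share observations, which smooths the resulting coefficient surface.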
For each cell get the income coefficient:
Use these values in a RasterLayer
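A sketch of how cell coordinates and cell values fit together, assuming the raster package is installed. The extent and the values are made up; in the real workflow the values would be the income coefficients computed per cell.

```r
library(raster)  # assumed to be installed

# A 200 by 200 km raster with 50 km cells (planar coordinates in meters)
r <- raster(xmn = 0, xmx = 200000, ymn = 0, ymx = 200000, resolution = 50000)

# Cell centers, in cell order (row by row, starting at the top left)
xy <- xyFromCell(r, 1:ncell(r))

# Stand-in values, one per cell and in the same order as xy
vals <- xy[, "x"] / 1000
r <- setValues(r, vals)
r
```

The key point is that values must be supplied in cell order, which is why the coefficients are computed from the output of xyFromCell in the first place.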
So that was a lot of ‘home-brew-GWR’.
Question 1: Can you comment on weaknesses (and perhaps strengths) of the approaches I have shown?
Question 2: Can you do it the easier and more professional way for these data, using the spgwr package?
Now use the spgwr package (and the gwr function) to fit the model. You can do this with all data, as long as you supply a fit.points argument (to avoid estimating a model for each observation point). You can use a raster similar to the one I used above (perhaps disaggregate with a factor 2 first).
This is how you can get the points to use:
Create a RasterLayer with the correct extent
Set to a desired resolution. I choose 25 km
I only want cells inside of CA, so I add some more steps.
Extract the coordinates of the cells that are not NA.
I don’t want the third column
Now specify the model
gwr returns a list-like object that includes (as first element) a SpatialPointsDataFrame that has the model coefficients. Plot these (for example with spplot), and after that, transfer them to a raster object.
To extract the SpatialPointsDataFrame:
To reconnect these values to the raster structure (etc.)
Question 3: Briefly comment on the results and the differences (if any) with the two home-brew examples.
The sample reading passage below is followed by a writing prompt.
Adapted from “A Matter of Degrees”
By Thomas Frank in Harper’s, August 2012
“The world is awash with fake [college] degrees,” says Les Rosen of Employment Screening Resources, a leading background-check outfit. In several [instances], the fakers actually studied at the institutions named on their résumés—they just failed to graduate. Others conjured their accomplishments out of thin air. Still others simply purchased their credentials from unaccredited institutions. All three approaches are undoubtedly on the rise. A consultancy in Wisconsin has for many years maintained a tally of educational whoppers told by the various job applicants it is asked to investigate; the resulting “Liars Index” (a term the consultancy has trademarked) reached its highest level ever in the second half of 2011. Just how widespread is the problem? Rosen estimates that some 40 percent of job applicants misrepresent in some way their educational attainments. And he reminds me that this figure includes only those people “who are so brazen about it that they’ve signed a release and authorization for a background check.” Among those who aren’t checked—who work for companies that don’t hire a professional background screener, or who refuse to sign a release—the fudging is sure to be even more common.
It takes only a few hours researching diploma mills to make you wonder about the swirling tides of fraud that advance and retreat beneath society’s placid, meritocratic surface. And eventually you start wondering about that surface, too, where everything seems to be in its place and everyone has the salary he or she deserves. The diploma mills hold up a mirror to the self-satisfied world of white-collar achievement, and what you see there isn’t pretty. Think about it this way: Who purchases bogus degrees? Judging by how the industry advertises itself, the customers are desperate people whose careers are going nowhere. They know they need a diploma to succeed, but they can hardly afford to borrow fifty grand and waste four years of their lives at Frisbee State; they’ve got jobs, and families, and car payments to make. Someone offers them a college degree in recognition of their actual experience—and not only does it sound attractive, it sounds fair. Who is to say that they are less deserving of life’s good things than someone whose parents paid for him to goof off at a glorified country club two decades ago? And who, really, is to say that they know less than the graduate turned out last month by some adjunct-run, beer-soaked, grade-inflated, but fully accredited debt factory in New England or California?
[T]he sacred Credential signifies less and less each year but costs more and more to obtain. Yet we act as though it represents everything. It’s a million-dollar coin made of pot metal—of course it attracts counterfeiters. And of course its value must be defended by an ever-expanding industry of résumé checkers and diploma-mill hunters. The boundaries are artificial, and that is precisely why they must be regulated so intensely: they are the only thing keeping the bunglers and knaves who rule us in their jobs.
Prompt: After reading the article, “A Matter of Degrees,” write an essay between 500 and 800 words in which you argue whether or not a college degree is merely a “million-dollar coin made of pot metal.” If you agree, support your point with original and compelling arguments and then explain why, nevertheless, you've chosen to attend Cal Poly. If you do not agree, defend your position using compelling counterarguments. Your essay should show an understanding of the article without simply repeating it, and you should incorporate specific details from your own experience and knowledge into your response.