class: center, middle, inverse, title-slide # Geographic Data Science in R ## Katie Jolly ### Slides: katiejolly.io/rnorth-19 ### August 16, 2019 --- ### Framing Geographic Data Science (GDS) <br> <br> As opposed to traditional Geographic Information Science (GIS), GDS is: <br> <br> <ul> <li> Interdisciplinary with Geography, Computing, and Statistics <br><br> <li> Code-based workflow <br><br> <li> Maximally reproducible </ul> .footnote[Source: https://www.robinlovelace.net/2017/05/02/can-geographic-data-save-the-world/] --- class: center, middle ## Geographic data is a large category! --- background-image: url("https://pvsmt99345.i.lithium.com/t5/image/serverpage/image-id/50441iC2825B417745932D?v=1.0") ## Raster data ๐ท Think: aerial imagery, elevation models, remote sensing .footnote[Source: https://community.alteryx.com/t5/Data-Science-Blog/Vector-and-Raster-A-Tale-of-Two-Spatial-Data-Types/ba-p/336141] --- background-image: url("https://pvsmt99345.i.lithium.com/t5/image/serverpage/image-id/49570i26EF3FAEACD21BD4/image-size/medium?v=1.0&px=400") ## Vector data ๐ Think: Census data, points on a map, roads .footnote[Source: https://community.alteryx.com/t5/Data-Science-Blog/Vector-and-Raster-A-Tale-of-Two-Spatial-Data-Types/ba-p/336141] --- class: inverse, center, middle ### Quirks (fun parts) of working with geographic data --- ### Projections and coordinate reference systems How do you **translate** your data from a 3D shape to a 2D map... <img src="https://media.opennews.org/cache/06/37/0637aa2541b31f526ad44f7cb2db7b6c.jpg" style="display: block; margin: auto;" /> --- [West Wing video](https://youtu.be/OH1bZ0F3zVU?t=34) ![](images/ww.PNG) --- ### Two common projections for US data When working with something around the size of a **state** or **smaller**, it can often be a good idea to use state plane or UTM projections. <br> <img src="images/projections.png" width="1200" style="display: block; margin: auto;" /> <br> Many states and cities also have **custom** or **recommended** projections, so it's worth doing some research before picking something! <br> .footnote[Projection resources: [ArcGIS help](http://desktop.arcgis.com/en/arcmap/10.3/guide-books/map-projections/what-are-map-projections.htm), [Geocomputation with R: Reprojecting data](https://geocompr.robinlovelace.net/reproj-geo-data.html)] <br> --- class: inverse, center, middle ## Honeybee Permits in Minneapolis ๐ --- ## I'll talk about... * Reading in spatial data * Reprojecting data * Basic maps * Spatial join * Neighborhood definition * Spatial clustering (Moran's I) * Modifiable areal unit problem --- ## Reading in spatial data ### Shapefiles The most common file format is the **shapefile**, which is actually a collection of files. It's important to keep all of these parts in one directory! <img src="images/honeybee-files.PNG" width="993" style="display: block; margin: auto;" /> But when you read in the data, it only looks like you're using the *.shp* file. ```r library(sf) honeybees <- st_read("data/honeybees/Honey_Bee_Permits_2017.shp") ``` --- ## Reading in spatial data ### APIs If you're getting data from somewhere like an open data portal, using the API endpoint can often be easier. * Easier for people trying to run your file * Easier for your file management ```r honeybees <- st_read("https://opendata.arcgis.com/datasets/f99ce43936d74f718e92a37a560ad875_0.geojson") ``` There are reasons for one way over another, but I prefer APIs when possible. Either way you'll get the same data. --- ## Reprojecting data We should first check the current projection. ```r st_crs(honeybees) ``` ``` ## Coordinate Reference System: ## EPSG: 4326 ## proj4string: "+proj=longlat +datum=WGS84 +no_defs" ``` -- When I look at this, I notice +proj=<mark>longlat</mark>, which is a geographic, not projected, coordinate system. -- I'll use a UTM projection. ```r honeybees_utm <- honeybees %>% st_transform(26915) # UTM 15N zone st_crs(honeybees_utm) ``` ``` ## Coordinate Reference System: ## EPSG: 26915 ## proj4string: "+proj=utm +zone=15 +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs" ``` --- ## Basic maps `sf` objects have a `plot` function. ```r plot(honeybees_utm %>% dplyr::select(HiveType)) ``` <img src="index_files/figure-html/unnamed-chunk-8-1.png" style="display: block; margin: auto;" /> --- ## Basic maps But they also work nicely with `ggplot2` ```r ggplot(honeybees_utm) + geom_sf() ``` <img src="index_files/figure-html/unnamed-chunk-9-1.png" style="display: block; margin: auto;" /> --- ## Basic maps We can use the ggplot layering logic to add contextual data, like Minneapolis neighborhood boundaries. ```r neighborhoods <- st_read("https://opendata.arcgis.com/datasets/055ca54e5fcc47329f081c9ef51d038e_0.geojson") %>% st_transform(26915) ``` <img src="index_files/figure-html/unnamed-chunk-11-1.png" style="display: block; margin: auto;" /> --- ## Basic maps <img src="index_files/figure-html/unnamed-chunk-12-1.png" style="display: block; margin: auto;" /> <br> When I look at this map as a geographer, I look for patterns of clustering or dispersion. I'll now walk through how to quantify that. --- ## Spatial join One way to think about clustering is to ask whether or not the permits are clustered **by neighborhood**. <img src="https://i.stack.imgur.com/CVVSH.png" style="display: block; margin: auto;" /> --- ## Spatial join ```r permits_per_nb <- neighborhoods %>% st_join(honeybees_utm) %>% # which neighborhood is each permit in? group_by(BDNAME) %>% # and when we sum by neighborhood summarise(permits = sum(!is.na(OBJECTID))) # how many permits total? summary(permits_per_nb$permits) ``` ``` ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 0.000 0.000 1.000 1.034 2.000 4.000 ``` <br> On average, there are **1.034** honeybee permits per neighborhood. <br> But is this equally likely everywhere in the city? Or are permits more likely to be in certain areas? --- ## Spatial join <img src="index_files/figure-html/unnamed-chunk-15-1.png" style="display: block; margin: auto;" /> --- ## Neighbor definition <img src="http://geohealthinnovations.org/wp-content/uploads/2013/01/toblerquote.png" style="display: block; margin: auto;" /> --- ## Neighbor definition One common and straightforward way to define neighbors is **queen contiguity**. <img src="https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcRxG5jeEH-7MmKzlzOcooJMOeMAMzZKqrgePBFpNP43w9W8ACq35g" style="display: block; margin: auto;" /> ```r library(spdep) # need to convert to SpatialPolygons object honeybees_sp <- as(permits_per_nb, "Spatial") # define neighbor structure nb_queen <- poly2nb(honeybees_sp, queen = TRUE) # create a matrix of binary spatial weights # (connected or not connected) weights <- nb2listw(nb_queen, style = "B") ``` --- ## Neighbor definiton <img src="index_files/figure-html/unnamed-chunk-19-1.png" style="display: block; margin: auto;" /> On average, each of the neighborhoods in Minneapolis has **5.6** queen's case neighbors. --- ## Spatial clustering <br> <br> <img src="https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcTlm-7-YfREIPTXXfloomGr0jdZk1GhPimm7WH8lZGWwVQIHzDA" width="500px" style="display: block; margin: auto;" /> --- ## Spatial clustering ### Moran's I > The Moranโs I statistic is the correlation coefficient for the relationship between a variable [like honeybee permits] and its surrounding values (Gimond) <img src="https://mgimond.github.io/Spatial/img/MoranI_scatter_plot.png" width="400px" style="display: block; margin: auto;" /> --- ## Spatial clustering ### Moran's I ```r set.seed(123) # 10000 simulations moran.mc(honeybees_sp$permits, weights, nsim=9999) ``` ``` ## ## Monte-Carlo simulation of Moran I ## ## data: honeybees_sp$permits ## weights: weights ## number of simulations + 1: 10000 ## ## statistic = 0.13366, observed rank = 9869, p-value = 0.0131 ## alternative hypothesis: greater ``` --- ## Spatial clustering ### Moran's I statistic = <mark>0.13366</mark>, observed rank = 9869, p-value = <mark>0.0131</mark> Based on this, we reject our null hypothesis and say that the permits are **slightly clustered** across neighborhoods. <img src="index_files/figure-html/unnamed-chunk-23-1.png" width="250px" style="display: block; margin: auto;" /> Does this make sense? --- ## Spatial clustering <img src="index_files/figure-html/unnamed-chunk-24-1.png" style="display: block; margin: auto;" /> --- ## Modifiable areal unit problem Grouping points by an areal unit may distort or exaggerate the actual data pattern! <img src="http://4.bp.blogspot.com/-_pvMVmjCTQU/TsY9QaWcJaI/AAAAAAAAA30/TGhvGkcnMPY/s1600/Penn+State+Map.gif" width="200px" style="display: block; margin: auto;" /> --- ## Modifiable areal unit problem What if we look at honeybee permits by community instead? <img src="index_files/figure-html/unnamed-chunk-27-1.png" style="display: block; margin: auto;" /> Does this look more or less clustered? --- ## Modifiable areal unit problem According to Moran's I, it's spatially random. But we found that it was clustered by neighborhood? <br> <br> ``` ## ## Monte-Carlo simulation of Moran I ## ## data: bg_sp$permits ## weights: weights ## number of simulations + 1: 10000 ## ## statistic = -0.10302, observed rank = 5470, p-value = 0.453 ## alternative hypothesis: greater ``` <br> <br> What could be some consequences of this? --- ## Modifiable areal unit problem Gerrymandering, one of my research areas, is largely an application of the modifiable areal unit problem. How we carve up our space matters! <img src="https://pbs.twimg.com/media/B-8ljgjU0AASq8g.jpg" width="500x" style="display: block; margin: auto;" /> --- class: center, middle ### Thank you for listening and happy to answer any questions! @katiejolly6 katiejolly6@gmail.com katiejolly.io