Geographic Data Science in R

# Geographic Data Science in R
## Katie Jolly
### Slides: katiejolly.io/rnorth-19
### August 16, 2019

---

### Framing Geographic Data Science (GDS)

As opposed to traditional Geographic Information Science (GIS), GDS is:

<ul>
<li> Interdisciplinary with Geography, Computing, and Statistics 
<li> Code-based workflow 
<li> Maximally reproducible
</ul>

---
class: center, middle

## Geographic data is a large category!

---

background-image: url("https://pvsmt99345.i.lithium.com/t5/image/serverpage/image-id/50441iC2825B417745932D?v=1.0")
## Raster data 📷

Think: aerial imagery, elevation models, remote sensing

.footnote[Source: https://community.alteryx.com/t5/Data-Science-Blog/Vector-and-Raster-A-Tale-of-Two-Spatial-Data-Types/ba-p/336141]

---

background-image: url("https://pvsmt99345.i.lithium.com/t5/image/serverpage/image-id/49570i26EF3FAEACD21BD4/image-size/medium?v=1.0&px=400")

## Vector data 📍

Think: Census data, points on a map, roads

.footnote[Source: https://community.alteryx.com/t5/Data-Science-Blog/Vector-and-Raster-A-Tale-of-Two-Spatial-Data-Types/ba-p/336141]

---
class: inverse, center, middle

### Quirks (fun parts) of working with geographic data

---

### Projections and coordinate reference systems

How do you **translate** your data from a 3D shape to a 2D map...

---

[West Wing video](https://youtu.be/OH1bZ0F3zVU?t=34)

![](images/ww.PNG)

---

### Two common projections for US data

When working with something around the size of a **state** or **smaller**, it can often be a good idea to use state plane or UTM projections.

Many states and cities also have **custom** or **recommended** projections, so it's worth doing some research before picking something!

.footnote[Projection resources: [ArcGIS help](http://desktop.arcgis.com/en/arcmap/10.3/guide-books/map-projections/what-are-map-projections.htm), [Geocomputation with R: Reprojecting data](https://geocompr.robinlovelace.net/reproj-geo-data.html)]

---

## Honeybee Permits in Minneapolis 🐝

---

## I'll talk about...

* Reading in spatial data

* Reprojecting data

* Basic maps

* Spatial join

* Neighborhood definition

* Spatial clustering (Moran's I)

* Modifiable areal unit problem

---

## Reading in spatial data

### Shapefiles

The most common file format is the **shapefile**, which is actually a collection of files. It's important to keep all of these parts in one directory!

But when you read in the data, it only looks like you're using the *.shp* file.

```r
library(sf)
honeybees <- st_read("data/honeybees/Honey_Bee_Permits_2017.shp")
```

---

## Reading in spatial data

### APIs

If you're getting data from somewhere like an open data portal, using the API endpoint can often be easier.

* Easier for people trying to run your file

* Easier for your file management

```r
honeybees <- st_read("https://opendata.arcgis.com/datasets/f99ce43936d74f718e92a37a560ad875_0.geojson")
```

There are reasons for one way over another, but I prefer APIs when possible. Either way you'll get the same data.

---

## Reprojecting data

We should first check the current projection.

```r
st_crs(honeybees)
```

```
## Coordinate Reference System:
##   EPSG: 4326 
##   proj4string: "+proj=longlat +datum=WGS84 +no_defs"
```

--
When I look at this, I notice +proj=longlat, which is a geographic, not projected, coordinate system.

I'll use a UTM projection.

```r
honeybees_utm <- honeybees %>%
 st_transform(26915) # UTM 15N zone
st_crs(honeybees_utm)
```

```
## Coordinate Reference System:
##   EPSG: 26915 
##   proj4string: "+proj=utm +zone=15 +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs"
```

---

## Basic maps

`sf` objects have a `plot` function.

```r
plot(honeybees_utm %>% dplyr::select(HiveType))
```

---

## Basic maps

But they also work nicely with `ggplot2`

```r
ggplot(honeybees_utm) +
  geom_sf()
```

---

## Basic maps

We can use the ggplot layering logic to add contextual data, like Minneapolis neighborhood boundaries.

```r
neighborhoods <- st_read("https://opendata.arcgis.com/datasets/055ca54e5fcc47329f081c9ef51d038e_0.geojson") %>%
 st_transform(26915)
```

---

## Basic maps

When I look at this map as a geographer, I look for patterns of clustering or dispersion. I'll now walk through how to quantify that.

---

## Spatial join

One way to think about clustering is to ask whether or not the permits are clustered **by neighborhood**.

---

## Spatial join

```r
permits_per_nb <- neighborhoods %>% 
 st_join(honeybees_utm) %>% # which neighborhood is each permit in?
 group_by(BDNAME) %>% # and when we sum by neighborhood
 summarise(permits = sum(!is.na(OBJECTID))) # how many permits total?

summary(permits_per_nb$permits)
```

```
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   1.000   1.034   2.000   4.000
```

On average, there are **1.034** honeybee permits per neighborhood.

But is this equally likely everywhere in the city? Or are permits more likely to be in certain areas?

---

## Spatial join

---

## Neighbor definition

---

## Neighbor definition

One common and straightforward way to define neighbors is **queen contiguity**.

```r
library(spdep)

# need to convert to SpatialPolygons object
honeybees_sp <- as(permits_per_nb, "Spatial")

# define neighbor structure
nb_queen <- poly2nb(honeybees_sp, queen = TRUE)

# create a matrix of binary spatial weights 
# (connected or not connected)
weights <- nb2listw(nb_queen, style = "B") 
```

---

## Neighbor definiton

On average, each of the neighborhoods in Minneapolis has **5.6** queen's case neighbors.

---

## Spatial clustering

---

## Spatial clustering

### Moran's I

> The Moran’s I statistic is the correlation coefficient for the relationship between a variable [like honeybee permits] and its surrounding values (Gimond)

---

## Spatial clustering

### Moran's I

```r
set.seed(123)

# 10000 simulations
moran.mc(honeybees_sp$permits, weights, nsim=9999) 
```

```
## 
## 	Monte-Carlo simulation of Moran I
## 
## data:  honeybees_sp$permits 
## weights: weights  
## number of simulations + 1: 10000 
## 
## statistic = 0.13366, observed rank = 9869, p-value = 0.0131
## alternative hypothesis: greater
```

---

## Spatial clustering

### Moran's I

statistic = 0.13366, observed rank = 9869, p-value = 0.0131

Based on this, we reject our null hypothesis and say that the permits are **slightly clustered** across neighborhoods.

Does this make sense?

---

## Spatial clustering

---

## Modifiable areal unit problem

Grouping points by an areal unit may distort or exaggerate the actual data pattern!

---

## Modifiable areal unit problem

What if we look at honeybee permits by community instead?

Does this look more or less clustered?

---

## Modifiable areal unit problem

According to Moran's I, it's spatially random. But we found that it was clustered by neighborhood?

```
## 
## 	Monte-Carlo simulation of Moran I
## 
## data:  bg_sp$permits 
## weights: weights  
## number of simulations + 1: 10000 
## 
## statistic = -0.10302, observed rank = 5470, p-value = 0.453
## alternative hypothesis: greater
```

What could be some consequences of this?

---

## Modifiable areal unit problem

Gerrymandering, one of my research areas, is largely an application of the modifiable areal unit problem. How we carve up our space matters!

---

### Thank you for listening and happy to answer any questions!

@katiejolly6

katiejolly6@gmail.com

katiejolly.io