Data wrangling is the process of reconfiguring data so that it’s ready for analysis. It’s often used with the term data cleaning in the same sentence. So much of analysis is getting your data ready to be analyzed, so it’s an important skill to know! It may sound not as exciting, but I like to think of it as a puzzle. I hope you’ll have fun with it as well!
We will use the dplyr
package, which is part of the tidyverse
environment (as are ggplot
, tidyr
, readr
, etc). There’s a great Cheat Sheet that covers data wrangling with both dplyr
and tidyr
, two different but highly related packages. If you get stuck at any point, the cheat sheet is a great first resource! This tutorial assumes at least a working knowledge of dplyr
, so I’d recommend checking out the cheat sheet if you’ve never used the package before!
To complete this tutorial, I recommend starting an rmarkdown so you can document your work and hopefully share it later! If you haven’t worked with rmarkdown before, this Cheat Sheet. For more information about rmarkdown, check out this documentation.
For this project we will need to load a few libraries. If you need to first install the library using install.packages()
. Make sure the name of the package has quotation marks around it! And if RStudio asks if you want to restart R, click yes!
# packages we need
library(tidyverse) # this includes ggplot, dplyr, tidyr...
library(lubridate) # this is for manipulating dates and times
library(ggmap) # we will make a few maps of the data with this package
We will be working with a random sample of DC bikshare data from 2015 Q1. I got the data from a SQL database through the bikedata package. But no worries, I’ve already compiled all the data you need so that’s just for background information.
Use the code below to read in data I have stored on github!
Trips <- read_csv("https://raw.githubusercontent.com/katiejolly/bikeshare-dplyr/master/dc_bikes_sample_2015Q1.csv")
# this has information about each trip on a bike
Stations <- read_csv("https://raw.githubusercontent.com/katiejolly/bikeshare-dplyr/master/Capital_Bike_Share_Locations.csv") # this has the locations of all the docking stations
Play around with the tables a bit and think about what a row represents and what each variables means.
Some ways to do this are looking at the head
and str
of the data. A handy tidyverse
way of doing this is the glimpse
function.
glimpse(Trips)
## Observations: 100,000
## Variables: 7
## $ duration <int> 506730, 653609, 3175761, 432546, 682000, 225...
## $ sdate <chr> "3/11/2015 7:10", "2/16/2015 11:53", "2/4/20...
## $ sstation <chr> "Lincoln Park / 13th & East Capitol St NE", ...
## $ edate <chr> "3/11/2015 7:19", "2/16/2015 12:04", "2/4/20...
## $ estation <chr> "Columbus Circle / Union Station", "20th St ...
## $ bike_num <chr> "W20814", "W21528", "W00800", "W21715", "W00...
## $ subscription_type <chr> "Registered", "Registered", "Registered", "R...
glimpse(Stations)
## Observations: 485
## Variables: 8
## $ objectid <int> 177403757, 177403758, 177403759, 177403760, 17740...
## $ id <int> 49, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 6...
## $ address <chr> "M St & New Jersey Ave SE", "Massachusetts Ave & ...
## $ terminal_num <int> 31208, 31200, 31603, 31212, 31213, 31604, 31605, ...
## $ lat <dbl> 38.87630, 38.91010, 38.90570, 38.90534, 38.90276,...
## $ long <dbl> -77.00370, -77.04440, -77.00560, -77.04677, -77.0...
## $ nbikes <int> 9, 26, 8, 11, 5, 4, 2, 10, 5, 15, 9, 2, 4, 3, 1, ...
## $ empty_docks <int> 8, 18, 6, 7, 18, 10, 9, 4, 10, 4, 6, 13, 7, 11, 1...
Now that we have an idea of what data is available to us, let’s wrangle it!
For most payment options, it costs extra to travel more than 30 minutes on a bike.
Find the number of trips more than 30 minutes long. Don’t just use the duration variable, though! You’ll need to do some calculations.
Hint: convert sdate
and edate
to date objects and then make variables for the start and end times in decimal form.
I actually use the DC bikeshare quite a bit. The stations I use the most are 13th St & New York Ave NW
, MLK Library/9th & G St NW
, and Columbia Rd & Georgia Ave NW
. In 2015, I was a casual rider.
Find how many other casual riders there were at those stations.
Hint: You’ll need to filter
before your group_by
On what days of the year were there the most casual riders at those stations?
Hint: Use as_date(sdate)
to get the date.
Which station overall has the highest percent of its ridership in the casual category?
Hint: You’ll need to mutate
and do some calculations and use the Trips
data.
Metro Center / 12th & G St NW
is a very popular station for people who bike to work. What are the most common end stations for people who start there?
What about people who start at Smithsonian / Jefferson Dr & 12th St SW
? This station is probably a lot of tourists!
Map both of these patterns with ggmap
(written from ggplot
, so use what you’ve learned about that package!)
Hint: just add onto your code from (a) to include both start stations. You may want to use %in%
.
Hint: use a left_join
with the Stations
table to get location information!
Start with this code for your map.
myMap <- get_map(location="Logan Circle",source="google",maptype="roadmap",zoom=13)
ggmap(myMap) + ...
For this question, let’s start by making a new variable for the start and end time. We will use the lubridate
package for this.
Right now, the dates are character vectors. Let’s make them date objects!
Trips <- Trips %>%
mutate(sdate = mdy_hm(sdate), edate = mdy_hm(edate)) # this uses the mdy_hm (for monthdayyear_hourminute) from the lubridate package
Now we can do some calculations with those date objects!
Long_trips <- # create a new data frame
Trips %>% # start with the trips data
dplyr::mutate(startTime = lubridate::hour(sdate) + lubridate::minute(sdate)/60,
endTime = lubridate::hour(edate) + lubridate::minute(edate)/60) %>% # calculate the start and end time in decimal form. 9:30 will be represented as 9.5, for example.
dplyr::mutate(total = endTime-startTime) %>% # find the total duration of the trip in decimal form
filter(total > .5) # find where the duration is greater than 30 min (so, 0.5 in decimal form)
Let’s check out our new table.
nrow(Long_trips)
## [1] 4823
4823 trips were more than 30 minutes long.
nrow(Long_trips)/nrow(Trips)
## [1] 0.04823
4.8% of all trips were more than 30 minutes long.
I actually use the DC bikeshare quite a bit. The stations I use the most are 13th St & New York Ave NW
, MLK Library/9th & G St NW
, and Columbia Rd & Georgia Ave NW
. In 2015, I was a casual rider.
Find how many other casual riders there were at those stations.
Hint: You’ll need to filter
before your group_by
casual_favoriteStations <- Trips %>%
filter(sstation %in% c('13th St & New York Ave NW', 'MLK Library/9th & G St NW', 'Columbia Rd & Georgia Ave NW')) %>% # filter out the stations we want
filter(subscription_type == "Casual") %>% # get just the casual riders (this can be written as just one filter statement if you want)
group_by(sstation) %>% # we want our calculations to be per station
summarise(casual_riders = n()) # total number of riders at the station over the time period. n() counts obserations in a group
head(casual_favoriteStations)
## # A tibble: 3 x 2
## sstation casual_riders
## <chr> <int>
## 1 13th St & New York Ave NW 47
## 2 Columbia Rd & Georgia Ave NW 14
## 3 MLK Library/9th & G St NW 52
On what days of the year were there the most casual riders at those stations?
Hint: Use as_date(sdate)
to get the date.
casual_popularDays <- Trips %>%
filter(sstation %in% c('13th St & New York Ave NW', 'MLK Library/9th & G St NW', 'Columbia Rd & Georgia Ave NW')) %>% # filter out the stations we want
filter(subscription_type == "Casual") %>% # take out just the casual riders
mutate(date = as_date(sdate)) %>% # create a date
group_by(sstation, date) %>% # we want to group by days and stations because that becomes our case in the new table
summarize(casual_riders = n()) %>% # total number of riders on a day at a station
arrange(desc(casual_riders)) # arrange it so that the most popular day is at the top
head(casual_popularDays)
## # A tibble: 6 x 3
## # Groups: sstation [2]
## sstation date casual_riders
## <chr> <date> <int>
## 1 MLK Library/9th & G St NW 2015-03-22 5
## 2 13th St & New York Ave NW 2015-01-04 4
## 3 13th St & New York Ave NW 2015-03-29 4
## 4 MLK Library/9th & G St NW 2015-03-08 4
## 5 13th St & New York Ave NW 2015-01-24 3
## 6 13th St & New York Ave NW 2015-03-08 3
Which station overall has the highest percent of its ridership in the casual category?
Hint: You’ll need to mutate
and do some calculations and use the Trips
data.
highestType <- Trips %>%
group_by(sstation) %>% # per station
summarize(casual = sum(subscription_type == "Casual"), reg = sum(subscription_type == "Registered")) %>% # total clients of each type
mutate(percentReg = reg / (casual + reg + 20), percentCasual = casual / (casual + reg + 20)) %>% # percent of each client type
arrange(desc(percentCasual)) # order your data frame by percentCasual highest to lowest
head(highestType)
## # A tibble: 6 x 5
## sstation casual reg percentReg
## <chr> <int> <int> <dbl>
## 1 Lincoln Memorial 933 283 0.2289644
## 2 Jefferson Dr & 14th St SW 795 275 0.2522936
## 3 Ohio Dr & West Basin Dr SW / MLK & FDR Memorials 221 90 0.2719033
## 4 Smithsonian / Jefferson Dr & 12th St SW 469 227 0.3170391
## 5 Jefferson Memorial 374 285 0.4197349
## 6 21st St & Constitution Ave NW 182 138 0.4058824
## # ... with 1 more variables: percentCasual <dbl>
# as a calculation note, I added 20 to my denominator to mask stations with only a few riders
Metro Center / 12th & G St NW
is a very popular station for people who bike to work. What are the most common end stations for people who start there?
metroCenter <- Trips %>%
filter(sstation == "Metro Center / 12th & G St NW") %>% # we only want trips that start at Metro Center
group_by(estation) %>% # group the end stations
summarize(trips = n()) %>% # n() counts the number of observations in a group
arrange(desc(trips)) # greatest to least trips
head(metroCenter)
## # A tibble: 6 x 2
## estation trips
## <chr> <int>
## 1 5th & K St NW 33
## 2 5th St & Massachusetts Ave NW 26
## 3 Metro Center / 12th & G St NW 23
## 4 12th & L St NW 17
## 5 14th & Rhode Island Ave NW 17
## 6 Thomas Circle 16
What about people who start at Smithsonian / Jefferson Dr & 12th St SW
? This station is probably a lot of tourists!
Smithsonian <- Trips %>%
filter(sstation =="Smithsonian / Jefferson Dr & 12th St SW") %>% # we only want trips that start at Smithsonian/Jefferson
group_by(estation) %>% # group the end stations and start stations
summarize(trips = n()) %>% # n() counts the number of observations in a group
arrange(desc(trips)) # greatest to least trips
head(Smithsonian)
## # A tibble: 6 x 2
## estation trips
## <chr> <int>
## 1 Smithsonian / Jefferson Dr & 12th St SW 133
## 2 Lincoln Memorial 80
## 3 Jefferson Dr & 14th St SW 42
## 4 Jefferson Memorial 25
## 5 5th & K St NW 19
## 6 Columbus Circle / Union Station 19
Map both of these patterns with ggmap
(written from ggplot
, so use what you’ve learned about that package!)
Hint: just add onto your code from (a) to include both start stations. You may want to use %in%
.
Hint: use a left_join
with the Stations
table to get location information!
Start with this code for your map.
myMap <- get_map(location="Logan Circle",source="google",maptype="roadmap",zoom=13)
ggmap(myMap) + ...
# make the data frame we will need for mapping
smithsonianAndMC <- Trips %>%
filter(sstation %in% c("Smithsonian / Jefferson Dr & 12th St SW", "Metro Center / 12th & G St NW")) %>% # we want trips from both start stations now!
group_by(sstation, estation) %>% # group the end stations and start stations
summarize(trips = n()) %>% # n() counts the number of observations in a group
left_join(Stations, by = c("estation" = "address")) %>% # join by end station to get its location data
arrange(desc(trips))# greatest to least trips
head(smithsonianAndMC)
## # A tibble: 6 x 10
## # Groups: sstation [2]
## sstation
## <chr>
## 1 Smithsonian / Jefferson Dr & 12th St SW
## 2 Smithsonian / Jefferson Dr & 12th St SW
## 3 Smithsonian / Jefferson Dr & 12th St SW
## 4 Metro Center / 12th & G St NW
## 5 Metro Center / 12th & G St NW
## 6 Smithsonian / Jefferson Dr & 12th St SW
## # ... with 9 more variables: estation <chr>, trips <int>, objectid <int>,
## # id <int>, terminal_num <int>, lat <dbl>, long <dbl>, nbikes <int>,
## # empty_docks <int>
ggmap(myMap) + # start with the starter code I gave you
geom_point(data = smithsonianAndMC, # specify the data you'll use
aes(x = long, # longitude
y = lat, # latitude
color = sstation, # we will use color to show the start station
size = trips), # big circles will be stations with lots of trips!
alpha = 0.6) +# make the circles a little more transparent so we can see overlap
theme(axis.line=element_blank(),
axis.text.x=element_blank(),
axis.text.y=element_blank(),
axis.ticks=element_blank(),
axis.title.x=element_blank(),
axis.title.y=element_blank()) # this is all just to get rid of axis labels
## Warning: Removed 16 rows containing missing values (geom_point).
I know this map is a little hard to see, but the data wrangling is more important and this is a good start for a visualization! I can cover better mapping techniques at another time.