Setting up our project

Data wrangling is the process of reconfiguring data so that it’s ready for analysis. It’s often used with the term data cleaning in the same sentence. So much of analysis is getting your data ready to be analyzed, so it’s an important skill to know! It may sound not as exciting, but I like to think of it as a puzzle. I hope you’ll have fun with it as well!

We will use the dplyr package, which is part of the tidyverse environment (as are ggplot, tidyr, readr, etc). There’s a great Cheat Sheet that covers data wrangling with both dplyr and tidyr, two different but highly related packages. If you get stuck at any point, the cheat sheet is a great first resource! This tutorial assumes at least a working knowledge of dplyr, so I’d recommend checking out the cheat sheet if you’ve never used the package before!

To complete this tutorial, I recommend starting an rmarkdown so you can document your work and hopefully share it later! If you haven’t worked with rmarkdown before, this Cheat Sheet. For more information about rmarkdown, check out this documentation.

For this project we will need to load a few libraries. If you need to first install the library using install.packages(). Make sure the name of the package has quotation marks around it! And if RStudio asks if you want to restart R, click yes!

# packages we need

library(tidyverse) # this includes ggplot, dplyr, tidyr...
library(lubridate) # this is for manipulating dates and times
library(ggmap) # we will make a few maps of the data with this package

Getting the data

We will be working with a random sample of DC bikshare data from 2015 Q1. I got the data from a SQL database through the bikedata package. But no worries, I’ve already compiled all the data you need so that’s just for background information.

Use the code below to read in data I have stored on github!

Trips <- read_csv("https://raw.githubusercontent.com/katiejolly/bikeshare-dplyr/master/dc_bikes_sample_2015Q1.csv") 
  # this has information about each trip on a bike

Stations <- read_csv("https://raw.githubusercontent.com/katiejolly/bikeshare-dplyr/master/Capital_Bike_Share_Locations.csv") # this has the locations of all the docking stations

Take a look at the data

Play around with the tables a bit and think about what a row represents and what each variables means.

Some ways to do this are looking at the head and str of the data. A handy tidyverse way of doing this is the glimpse function.

glimpse(Trips)

## Observations: 100,000
## Variables: 7
## $ duration          <int> 506730, 653609, 3175761, 432546, 682000, 225...
## $ sdate             <chr> "3/11/2015 7:10", "2/16/2015 11:53", "2/4/20...
## $ sstation          <chr> "Lincoln Park / 13th & East Capitol St NE", ...
## $ edate             <chr> "3/11/2015 7:19", "2/16/2015 12:04", "2/4/20...
## $ estation          <chr> "Columbus Circle / Union Station", "20th St ...
## $ bike_num          <chr> "W20814", "W21528", "W00800", "W21715", "W00...
## $ subscription_type <chr> "Registered", "Registered", "Registered", "R...

glimpse(Stations)

## Observations: 485
## Variables: 8
## $ objectid     <int> 177403757, 177403758, 177403759, 177403760, 17740...
## $ id           <int> 49, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 6...
## $ address      <chr> "M St & New Jersey Ave SE", "Massachusetts Ave & ...
## $ terminal_num <int> 31208, 31200, 31603, 31212, 31213, 31604, 31605, ...
## $ lat          <dbl> 38.87630, 38.91010, 38.90570, 38.90534, 38.90276,...
## $ long         <dbl> -77.00370, -77.04440, -77.00560, -77.04677, -77.0...
## $ nbikes       <int> 9, 26, 8, 11, 5, 4, 2, 10, 5, 15, 9, 2, 4, 3, 1, ...
## $ empty_docks  <int> 8, 18, 6, 7, 18, 10, 9, 4, 10, 4, 6, 13, 7, 11, 1...

Wrangling

Now that we have an idea of what data is available to us, let’s wrangle it!

Question 1

For most payment options, it costs extra to travel more than 30 minutes on a bike.

Find the number of trips more than 30 minutes long. Don’t just use the duration variable, though! You’ll need to do some calculations.

Hint: convert sdate and edate to date objects and then make variables for the start and end times in decimal form.

Question 2

I actually use the DC bikeshare quite a bit. The stations I use the most are 13th St & New York Ave NW, MLK Library/9th & G St NW, and Columbia Rd & Georgia Ave NW. In 2015, I was a casual rider.

Part a

Find how many other casual riders there were at those stations.

Hint: You’ll need to filter before your group_by

Part b

On what days of the year were there the most casual riders at those stations?

Hint: Use as_date(sdate) to get the date.

Part c

Which station overall has the highest percent of its ridership in the casual category?

Hint: You’ll need to mutate and do some calculations and use the Trips data.

Question 3

Part a

Metro Center / 12th & G St NW is a very popular station for people who bike to work. What are the most common end stations for people who start there?

Part b

What about people who start at Smithsonian / Jefferson Dr & 12th St SW? This station is probably a lot of tourists!

Part c

Map both of these patterns with ggmap (written from ggplot, so use what you’ve learned about that package!)

Hint: just add onto your code from (a) to include both start stations. You may want to use %in%.

Hint: use a left_join with the Stations table to get location information!

Start with this code for your map.

myMap <- get_map(location="Logan Circle",source="google",maptype="roadmap",zoom=13)

ggmap(myMap) + ...

Answers

Question 1

For this question, let’s start by making a new variable for the start and end time. We will use the lubridate package for this.

Right now, the dates are character vectors. Let’s make them date objects!

Trips <- Trips %>%
  mutate(sdate = mdy_hm(sdate), edate = mdy_hm(edate)) # this uses the mdy_hm (for monthdayyear_hourminute) from the lubridate package

Now we can do some calculations with those date objects!

Long_trips <- # create a new data frame
  Trips %>% # start with the trips data
  dplyr::mutate(startTime = lubridate::hour(sdate) + lubridate::minute(sdate)/60, 
         endTime = lubridate::hour(edate) + lubridate::minute(edate)/60) %>% # calculate the start and end time in decimal form. 9:30 will be represented as 9.5, for example.
  dplyr::mutate(total = endTime-startTime) %>% # find the total duration of the trip in decimal form 
  filter(total > .5) # find where the duration is greater than 30 min (so, 0.5 in decimal form)

Let’s check out our new table.

nrow(Long_trips)

## [1] 4823

4823 trips were more than 30 minutes long.

nrow(Long_trips)/nrow(Trips)

## [1] 0.04823

4.8% of all trips were more than 30 minutes long.

Question 2

Part a

Find how many other casual riders there were at those stations.

Hint: You’ll need to filter before your group_by

casual_favoriteStations <- Trips %>% 
  filter(sstation %in% c('13th St & New York Ave NW', 'MLK Library/9th & G St NW', 'Columbia Rd & Georgia Ave NW')) %>% # filter out the stations we want
  filter(subscription_type == "Casual") %>% # get just the casual riders (this can be written as just one filter statement if you want)
  group_by(sstation) %>% # we want our calculations to be per station
  summarise(casual_riders = n()) # total number of riders at the station over the time period. n() counts obserations in a group

head(casual_favoriteStations)

## # A tibble: 3 x 2
##                       sstation casual_riders
##                          <chr>         <int>
## 1    13th St & New York Ave NW            47
## 2 Columbia Rd & Georgia Ave NW            14
## 3    MLK Library/9th & G St NW            52

Part b

On what days of the year were there the most casual riders at those stations?

Hint: Use as_date(sdate) to get the date.

casual_popularDays <- Trips %>% 
  filter(sstation %in% c('13th St & New York Ave NW', 'MLK Library/9th & G St NW', 'Columbia Rd & Georgia Ave NW')) %>% # filter out the stations we want
  filter(subscription_type == "Casual") %>% # take out just the casual riders
  mutate(date = as_date(sdate)) %>% # create a date 
  group_by(sstation, date) %>% # we want to group by days and stations because that becomes our case in the new table 
  summarize(casual_riders = n()) %>% # total number of riders on a day at a station
  arrange(desc(casual_riders)) # arrange it so that the most popular day is at the top


head(casual_popularDays)

## # A tibble: 6 x 3
## # Groups:   sstation [2]
##                    sstation       date casual_riders
##                       <chr>     <date>         <int>
## 1 MLK Library/9th & G St NW 2015-03-22             5
## 2 13th St & New York Ave NW 2015-01-04             4
## 3 13th St & New York Ave NW 2015-03-29             4
## 4 MLK Library/9th & G St NW 2015-03-08             4
## 5 13th St & New York Ave NW 2015-01-24             3
## 6 13th St & New York Ave NW 2015-03-08             3

Part c

Which station overall has the highest percent of its ridership in the casual category?

Hint: You’ll need to mutate and do some calculations and use the Trips data.

highestType <-  Trips %>% 
  group_by(sstation) %>% # per station
  summarize(casual = sum(subscription_type == "Casual"), reg = sum(subscription_type == "Registered")) %>% # total clients of each type
  mutate(percentReg = reg / (casual + reg + 20), percentCasual = casual / (casual + reg + 20)) %>%  # percent of each client type 
  arrange(desc(percentCasual)) # order your data frame by percentCasual highest to lowest

head(highestType)

## # A tibble: 6 x 5
##                                           sstation casual   reg percentReg
##                                              <chr>  <int> <int>      <dbl>
## 1                                 Lincoln Memorial    933   283  0.2289644
## 2                        Jefferson Dr & 14th St SW    795   275  0.2522936
## 3 Ohio Dr & West Basin Dr SW / MLK & FDR Memorials    221    90  0.2719033
## 4          Smithsonian / Jefferson Dr & 12th St SW    469   227  0.3170391
## 5                               Jefferson Memorial    374   285  0.4197349
## 6                    21st St & Constitution Ave NW    182   138  0.4058824
## # ... with 1 more variables: percentCasual <dbl>

# as a calculation note, I added 20 to my denominator to mask stations with only a few riders

Question 3

Part a

Metro Center / 12th & G St NW is a very popular station for people who bike to work. What are the most common end stations for people who start there?

metroCenter <- Trips %>%
  filter(sstation == "Metro Center / 12th & G St NW") %>% # we only want trips that start at Metro Center
  group_by(estation) %>% # group the end stations 
  summarize(trips = n()) %>% # n() counts the number of observations in a group
  arrange(desc(trips)) # greatest to least trips

head(metroCenter)

## # A tibble: 6 x 2
##                        estation trips
##                           <chr> <int>
## 1                 5th & K St NW    33
## 2 5th St & Massachusetts Ave NW    26
## 3 Metro Center / 12th & G St NW    23
## 4                12th & L St NW    17
## 5    14th & Rhode Island Ave NW    17
## 6                 Thomas Circle    16

Part b

What about people who start at Smithsonian / Jefferson Dr & 12th St SW? This station is probably a lot of tourists!

Smithsonian <- Trips %>%
  filter(sstation =="Smithsonian / Jefferson Dr & 12th St SW") %>% # we only want trips that start at Smithsonian/Jefferson
  group_by(estation) %>% # group the end stations and start stations
  summarize(trips = n()) %>% # n() counts the number of observations in a group
  arrange(desc(trips)) # greatest to least trips

head(Smithsonian)

## # A tibble: 6 x 2
##                                  estation trips
##                                     <chr> <int>
## 1 Smithsonian / Jefferson Dr & 12th St SW   133
## 2                        Lincoln Memorial    80
## 3               Jefferson Dr & 14th St SW    42
## 4                      Jefferson Memorial    25
## 5                           5th & K St NW    19
## 6         Columbus Circle / Union Station    19

Part c

Map both of these patterns with ggmap (written from ggplot, so use what you’ve learned about that package!)

Hint: just add onto your code from (a) to include both start stations. You may want to use %in%.

Hint: use a left_join with the Stations table to get location information!

Start with this code for your map.

myMap <- get_map(location="Logan Circle",source="google",maptype="roadmap",zoom=13)

ggmap(myMap) + ...

# make the data frame we will need for mapping
smithsonianAndMC <- Trips %>%
  filter(sstation %in% c("Smithsonian / Jefferson Dr & 12th St SW", "Metro Center / 12th & G St NW")) %>% # we want trips from both start stations now!
  group_by(sstation, estation) %>% # group the end stations and start stations
  summarize(trips = n()) %>% # n() counts the number of observations in a group
  left_join(Stations, by = c("estation" = "address")) %>% # join by end station to get its location data
  arrange(desc(trips))# greatest to least trips

head(smithsonianAndMC)

## # A tibble: 6 x 10
## # Groups:   sstation [2]
##                                  sstation
##                                     <chr>
## 1 Smithsonian / Jefferson Dr & 12th St SW
## 2 Smithsonian / Jefferson Dr & 12th St SW
## 3 Smithsonian / Jefferson Dr & 12th St SW
## 4           Metro Center / 12th & G St NW
## 5           Metro Center / 12th & G St NW
## 6 Smithsonian / Jefferson Dr & 12th St SW
## # ... with 9 more variables: estation <chr>, trips <int>, objectid <int>,
## #   id <int>, terminal_num <int>, lat <dbl>, long <dbl>, nbikes <int>,
## #   empty_docks <int>

ggmap(myMap) + # start with the starter code I gave you
  geom_point(data = smithsonianAndMC, # specify the data you'll use
             aes(x = long, # longitude
                 y = lat, # latitude
                 color = sstation, # we will use color to show the start station
                 size = trips), # big circles will be stations with lots of trips!
             alpha = 0.6) +# make the circles a little more transparent so we can see overlap
            theme(axis.line=element_blank(),
                  axis.text.x=element_blank(),
                  axis.text.y=element_blank(),
                  axis.ticks=element_blank(),
                  axis.title.x=element_blank(),
                  axis.title.y=element_blank()) # this is all just to get rid of axis labels

## Warning: Removed 16 rows containing missing values (geom_point).

I know this map is a little hard to see, but the data wrangling is more important and this is a good start for a visualization! I can cover better mapping techniques at another time.

Exploring DC’s bikeshare system: Wrangling with dplyr!

Katie Jolly

November 18, 2017

Setting up our project

Getting the data

Take a look at the data

Wrangling

Question 1

Question 2

Part a

Part b

Part c

Question 3

Part a

Part b

Part c

Answers

Question 1

Question 2

Part a

Part b

Part c

Question 3

Part a

Part b

Part c