Fast data manipulation with dplyr

3/29/2018

Outline

What is dplyr?
Why dplyr?
Several Basic Verbs
Process two or more tables
Commonalities
Additional Resources

What is "dplyr"?

A package for data manipulation and exporlation, written and maintained by Hadley Wickham. It provides some great, easy-to-use functions that are very handy when performing exploratory data analysis and manipulation.
Focussing on only data frames, it is faster and has a more consistent API that is easier to use.

Why "dplyr"?

Fast in calculation : 20X - 100X faster.
Great for data exploration and transformation : same code structure with database (SQL).
Intuitive to write and easy to read : have individual functions that correspond to the most common operations; especially when using the "chaining" syntex

Several basic verbs:

filter(): to select cases based on their values.
arrange(): to reorder the cases.
select() and rename(): to select variables based on their names.
mutate() and transmute(): to add new variables that are functions of existing variables.
summarise(): to condense multiple values to a single value.
sample_n() and sample_frac(): to take random samples.

Dataset

To explore the basic data manipulation verbs of dplyr, we'll use `hflights' dataset, which is from an R package hflights. The dataset contains commercial domestic flights that departed Houston (IAH and HOU) in 2011. The data comes from the Research and Innovation Technology Administration at the Bureau of Transporation statistics. here

Data Structure

## 'data.frame':    227496 obs. of  21 variables:
##  $ Year             : int  2011 2011 2011 2011 2011 2011 2011 2011 2011 2011 ...
##  $ Month            : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ DayofMonth       : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ DayOfWeek        : int  6 7 1 2 3 4 5 6 7 1 ...
##  $ DepTime          : int  1400 1401 1352 1403 1405 1359 1359 1355 1443 1443 ...
##  $ ArrTime          : int  1500 1501 1502 1513 1507 1503 1509 1454 1554 1553 ...
##  $ UniqueCarrier    : chr  "AA" "AA" "AA" "AA" ...
##  $ FlightNum        : int  428 428 428 428 428 428 428 428 428 428 ...
##  $ TailNum          : chr  "N576AA" "N557AA" "N541AA" "N403AA" ...
##  $ ActualElapsedTime: int  60 60 70 70 62 64 70 59 71 70 ...
##  $ AirTime          : int  40 45 48 39 44 45 43 40 41 45 ...
##  $ ArrDelay         : int  -10 -9 -8 3 -3 -7 -1 -16 44 43 ...
##  $ DepDelay         : int  0 1 -8 3 5 -1 -1 -5 43 43 ...
##  $ Origin           : chr  "IAH" "IAH" "IAH" "IAH" ...
##  $ Dest             : chr  "DFW" "DFW" "DFW" "DFW" ...
##  $ Distance         : int  224 224 224 224 224 224 224 224 224 224 ...
##  $ TaxiIn           : int  7 6 5 9 9 6 12 7 8 6 ...
##  $ TaxiOut          : int  13 9 17 22 9 13 15 12 22 19 ...
##  $ Cancelled        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ CancellationCode : chr  "" "" "" "" ...
##  $ Diverted         : int  0 0 0 0 0 0 0 0 0 0 ...

Data Structure

hflights_df <- tbl_df(hflights); hflights_df

## # A tibble: 227,496 x 21
##     Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier
##  * <int> <int>      <int>     <int>   <int>   <int> <chr>        
##  1  2011     1          1         6    1400    1500 AA           
##  2  2011     1          2         7    1401    1501 AA           
##  3  2011     1          3         1    1352    1502 AA           
##  4  2011     1          4         2    1403    1513 AA           
##  5  2011     1          5         3    1405    1507 AA           
##  6  2011     1          6         4    1359    1503 AA           
##  7  2011     1          7         5    1359    1509 AA           
##  8  2011     1          8         6    1355    1454 AA           
##  9  2011     1          9         7    1443    1554 AA           
## 10  2011     1         10         1    1443    1553 AA           
## # ... with 227,486 more rows, and 14 more variables: FlightNum <int>,
## #   TailNum <chr>, ActualElapsedTime <int>, AirTime <int>, ArrDelay <int>,
## #   DepDelay <int>, Origin <chr>, Dest <chr>, Distance <int>,
## #   TaxiIn <int>, TaxiOut <int>, Cancelled <int>, CancellationCode <chr>,
## #   Diverted <int>

Data Structure

print(hflights_df, n = 20)

data.frame(head(hflights_df, 2))

##   Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
## 1 2011     1          1         6    1400    1500            AA       428
## 2 2011     1          2         7    1401    1501            AA       428
##   TailNum ActualElapsedTime AirTime ArrDelay DepDelay Origin Dest Distance
## 1  N576AA                60      40      -10        0    IAH  DFW      224
## 2  N557AA                60      45       -9        1    IAH  DFW      224
##   TaxiIn TaxiOut Cancelled CancellationCode Diverted
## 1      7      13         0                         0
## 2      6       9         0                         0

`filter()`

filter() allows us to select a subset of rows in a data frame. Like all single verbs, the first argument is the data frame. The second and subsequent arguments refer to variables within that data frame, selecting rows where the expression is TRUE.

Similair to base::subset() or [,], we can filter our data by rows evaluated by a certain criteria.

`filter()`

To select all flights on New Years Day, we can say:

filter(hflights_df, Month == 1, DayofMonth == 1)

Compare with base function:

hflights[hflights$Month == 1 & hflights$DayofMonth == 1, ]

`filter()`

In addition to concise code, it also supports any combination of selection on the same object:

filter(hflights_df, Month == 1 | Month == 2)
# or
filter(hflights_df, Month %in% c(1, 2))

# Wrong syntex
# filter(hflights_df, Month == 1 | 2)

`arrange()`

This function works the same as plyr::arrange() and base::order(). Instead of selecting rows, it reorders them. Also note the desc() option, which orders in descending (alphabetical or numerical) order.

Sort the rows by the given column name, also arrange by reverse choronology in year:

hflights_df %>%
    select(UniqueCarrier, DepDelay) %>%
    arrange(desc(DepDelay))

Compare with base function:

hflights[order(desc(hflights$DepDelay)), c("UniqueCarrier", "DepDelay")]

`select()`

We are very familiar with two methods of selecting a set of columns from a larger data frame. In base R, the first is the subset() function, and the other is the rows, columns slicing syntax [,].

The dplyr::select() function is even more straightforward and powerful. It allows you to rapidly zoom in on a useful subset using operations that usually only work on numeric variable positions.

`select()`

We can select three columns by name without use quotations:

select(hflights_df, Year, Month, AirTime) %>% head(1)

## # A tibble: 1 x 3
##    Year Month AirTime
##   <int> <int>   <int>
## 1  2011     1      40

or select all columns between year and day:

select(hflights_df, Year:DayofMonth, contains("Taxi")) %>% head(1)

## # A tibble: 1 x 5
##    Year Month DayofMonth TaxiIn TaxiOut
##   <int> <int>      <int>  <int>   <int>
## 1  2011     1          1      7      13

`select()`

We can use "-" to elimate column names, for example, just keep Year and Cancel information:

select(hflights_df, -(DepTime:TaxiOut), -CancellationCode, -Diverted) %>% head(4)

## # A tibble: 4 x 5
##    Year Month DayofMonth DayOfWeek Cancelled
##   <int> <int>      <int>     <int>     <int>
## 1  2011     1          1         6         0
## 2  2011     1          2         7         0
## 3  2011     1          3         1         0
## 4  2011     1          4         2         0

`select()`

Also similar to R's own subset() function, but dplyr() no longers need to write a long list of c("colname1", "colname2") or which(colname(data) == "colname3"), even does not need to find the column number.

There are a number of helper functions we can use within select(), like starts_with(), ends_with(), matches() and contains(), which can let us quickly match larger blocks of variables that meet some criterion.

`select()`

dplyr::select() has the ability to do renaming of columns with a very straightforward syntax: new_col_name = old_name. Let's select the first three columns we began with and rename them:

select(hflights_df, Yr = Year, Mth = Month, Tm = AirTime) %>% head(2)

## # A tibble: 2 x 3
##      Yr   Mth    Tm
##   <int> <int> <int>
## 1  2011     1    40
## 2  2011     1    45

`select()`

Notice how the new dataframe only contains the three columns we've just created. To keep the remaining columns, there is a subfunction called matches() that does the trick.

# rename three columns and select all
select(hflights_df,Yr = Year,Mth = Month,Tm = AirTime,matches("."))%>%head(2)

## # A tibble: 2 x 21
##      Yr   Mth    Tm DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier
##   <int> <int> <int>      <int>     <int>   <int>   <int> <chr>        
## 1  2011     1    40          1         6    1400    1500 AA           
## 2  2011     1    45          2         7    1401    1501 AA           
## # ... with 13 more variables: FlightNum <int>, TailNum <chr>,
## #   ActualElapsedTime <int>, ArrDelay <int>, DepDelay <int>, Origin <chr>,
## #   Dest <chr>, Distance <int>, TaxiIn <int>, TaxiOut <int>,
## #   Cancelled <int>, CancellationCode <chr>, Diverted <int>

`mutate()`

We can create new columns with mutate(). It has the same effect as plyr::mutate(), and similar to base::transform(), but with the advantage that the newly added column can be manipulated in the same statement:

mutate(hflights_df, 
       gain = ArrDelay - DepDelay, 
       gain_per_hour = gain / (AirTime / 60)
)

Whereas in the base::transform() we need to write like this:

transform(hflights, 
          gain = ArrDelay - DepDelay, 
          gain_per_hour =  (ArrDelay - DepDelay) / (AirTime / 60)
)

`mutate()`

One cool feature of dplyr::mutate() is the ability to refer to columns that you just created.

hflights_df %>% 
  mutate(Distance_km = round(Distance*1.6), AirPoints = Distance_km*2) %>%
  select(Distance_km, AirPoints) %>%
  arrange(desc(AirPoints)) %>%
  tail(5)

## # A tibble: 5 x 2
##   Distance_km AirPoints
##         <dbl>     <dbl>
## 1        203.      406.
## 2        203.      406.
## 3        126.      252.
## 4        126.      252.
## 5        126.      252.

`mutate()`

# nesting method
tail(arrange(select(mutate(hflights_df, 
                           Distance_km = round(Distance*1.6), 
                           AirPoints = Distance_km*2), 
                    Distance_km, AirPoints), 
             desc(AirPoints)), 5)

Chaining

Increases readability significantly when there are many commands
Operator is automatically imported from the magrittr package
Can be used to replace nesting in R commands outside of dplyr

`sample_n()` & `sample_frac()`

The sample function is used to select random rows from a table. The first line of code randomly selects ten rows from the dataset, and the second line of code randomly selects 10% of the original rows from the dataset.

sample_n(hflights_df, size = 10)
sample_frac(hflights_df, size = 0.1)

Use replace = TRUE to perform a bootstrap sample. If needed, you can weight the sample with the weight argument.

Processing two or more tables

When we process data, in addition to processing individual tables, we sometimes need to operate on two or more tables. Another advantage of dplyr() is it incorporates a lot of ideas for data manipulation in SQL databases, which makes it logical and consistent to operate on the data tables (tibbles). If you are accessing the analysis in the database, the steps taken will not be a big change. Using the dplyr package can easily let us use the same function to handle the table in the database.

Processing two or more tables

gene_exp_tidy1

##    GeneId sample_name expression
## 1    gen1     Sample1        1.0
## 2    gen1     Sample2        2.0
## 3    gen1     Sample3        0.3
## 4    gen2     Sample1        4.0
## 5    gen2     Sample2        5.0
## 6    gen2     Sample3        6.0
## 7    gen3     Sample1        7.0
## 8    gen3     Sample2        0.8
## 9    gen3     Sample3        9.0
## 10   gen4     Sample1       10.0
## 11   gen4     Sample2       11.0
## 12   gen4     Sample3       12.0

Processing two or more tables

gene_exp_tidy2

##   GeneId sample_name expression
## 1   gen5     Sample1          2
## 2   gen5     Sample2          3
## 3   gen5     Sample3          4

Processing two or more tables

gene_anno

##   GeneId annotation
## 1   gen1  a related
## 2   gen2  b related
## 3   gen3  c related

gene_anno1

##   Geneid annotation
## 1   gen1  a related
## 2   gen2  b related
## 3   gen3  c related

Processing two or more tables

Combine cases

Use bind_cols(x, y) to paste tables beside each other as they are.

Use bind_rows(x, y) to paste tables below each other as they are.

# base::rbind()
rbind(gene_exp_tidy1,gene_exp_tidy2)

# dplyr::bind_rows()
 bind_rows(gene_exp_tidy1,gene_exp_tidy2)

Processing two or more tables

intersect()

Create a data frame consisting of gene1 and gene5 first, then get intersect of gene_exp_tidy1 and gene_exp_tidy3:

gene_exp_tidy3 <- filter(gene_exp_tidy1, GeneId == "gen1") %>% 
  bind_rows(gene_exp_tidy2)
intersect(gene_exp_tidy1, gene_exp_tidy3)

##   GeneId sample_name expression
## 1   gen1     Sample1        1.0
## 2   gen1     Sample2        2.0
## 3   gen1     Sample3        0.3

Processing two or more tables

union()

Find the union of gene_exp_tidy1 and gene_exp_tidy3 (automatically remove duplicates):

union(gene_exp_tidy1, gene_exp_tidy3)

Find the union of two data frames (without removing duplicates)

union_all(gene_exp_tidy1, gene_exp_tidy3)

Processing two or more tables

Find the difference between two data frames

setdiff(gene_exp_tidy1, gene_exp_tidy3)

##   GeneId sample_name expression
## 1   gen2     Sample1        4.0
## 2   gen2     Sample2        5.0
## 3   gen2     Sample3        6.0
## 4   gen3     Sample1        7.0
## 5   gen3     Sample2        0.8
## 6   gen3     Sample3        9.0
## 7   gen4     Sample1       10.0
## 8   gen4     Sample2       11.0
## 9   gen4     Sample3       12.0

Processing two or more tables

left_join(): Take the first data frame as the standard and add the relevant information of the second data frame into the first one.

left_join(gene_exp_tidy1, gene_anno, by = "GeneId")

What if the two column names are not exactly the same?

left_join(gene_exp_tidy1, gene_anno1, by =c("GeneId" = "Geneid"))

Processing two or more tables

Similiar,

right_join(gene_exp_tidy1, gene_anno1, by =c("GeneId" = "Geneid"))

inner_join(gene_exp_tidy1, gene_anno1, by =c("GeneId" = "Geneid"))

full_join(gene_exp_tidy1, gene_anno1, by =c("GeneId" = "Geneid"))

semi_join(gene_exp_tidy1, gene_anno1, by =c("GeneId" = "Geneid"))

anti_join(gene_exp_tidy1, gene_anno1, by =c("GeneId" = "Geneid"))

Row names

Tidy data does not use rownames, which store a variable outside of the columns. To work with the rownames, first move them into a column.

rownames_to_column(): Move row names into col.

a <- rownames_to_column(gene_exp_tidy2, var = "C") 
head(a, 3)

##   C GeneId sample_name expression
## 1 1   gen5     Sample1          2
## 2 2   gen5     Sample2          3
## 3 3   gen5     Sample3          4

Row names

column_to_rownames(): Move col in row names.

column_to_rownames(a, var = "C") %>% head(3)

##   GeneId sample_name expression
## 1   gen5     Sample1          2
## 2   gen5     Sample2          3
## 3   gen5     Sample3          4

Commonalities

Syntax and function of all these verbs are very similar:

The first argument is a data frame.
The subsequent arguments describe what to do with the data frame. You can refer to columns in the data frame directly without using $.
The result is a new data frame

Together these properties make it easy to chain together multiple simple steps to achieve a complex result.

Outline

What is "dplyr"?

Why "dplyr"?

Several basic verbs:

Dataset

Data Structure

Data Structure

Data Structure

filter()

filter()

filter()

arrange()

select()

select()

select()

select()

select()

select()

mutate()

mutate()

mutate()

sample_n() & sample_frac()

Processing two or more tables

Processing two or more tables

Processing two or more tables

Processing two or more tables

Processing two or more tables

Processing two or more tables

Processing two or more tables

Processing two or more tables

Processing two or more tables

Processing two or more tables

Row names

Row names

Commonalities

Additional Resource

`filter()`

`filter()`

`filter()`

`arrange()`

`select()`

`select()`

`select()`

`select()`

`select()`

`select()`

`mutate()`

`mutate()`

`mutate()`

`sample_n()` & `sample_frac()`