- What is
dplyr
? - Why
dplyr
? - Several Basic Verbs
- Process two or more tables
- Commonalities
- Additional Resources
3/29/2018
dplyr
?dplyr
?A package for data manipulation and exporlation, written and maintained by Hadley Wickham. It provides some great, easy-to-use functions that are very handy when performing exploratory data analysis and manipulation.
Focussing on only data frames, it is faster and has a more consistent API that is easier to use.
Fast in calculation : 20X - 100X faster.
Great for data exploration and transformation : same code structure with database (SQL).
Intuitive to write and easy to read : have individual functions that correspond to the most common operations; especially when using the "chaining" syntex
filter()
: to select cases based on their values.arrange()
: to reorder the cases.select()
and rename()
: to select variables based on their names.mutate()
and transmute()
: to add new variables that are functions of existing variables.summarise()
: to condense multiple values to a single value.sample_n()
and sample_frac()
: to take random samples.To explore the basic data manipulation verbs of dplyr, we'll use `hflights' dataset, which is from an R package hflights. The dataset contains commercial domestic flights that departed Houston (IAH and HOU) in 2011. The data comes from the Research and Innovation Technology Administration at the Bureau of Transporation statistics. here
## 'data.frame': 227496 obs. of 21 variables: ## $ Year : int 2011 2011 2011 2011 2011 2011 2011 2011 2011 2011 ... ## $ Month : int 1 1 1 1 1 1 1 1 1 1 ... ## $ DayofMonth : int 1 2 3 4 5 6 7 8 9 10 ... ## $ DayOfWeek : int 6 7 1 2 3 4 5 6 7 1 ... ## $ DepTime : int 1400 1401 1352 1403 1405 1359 1359 1355 1443 1443 ... ## $ ArrTime : int 1500 1501 1502 1513 1507 1503 1509 1454 1554 1553 ... ## $ UniqueCarrier : chr "AA" "AA" "AA" "AA" ... ## $ FlightNum : int 428 428 428 428 428 428 428 428 428 428 ... ## $ TailNum : chr "N576AA" "N557AA" "N541AA" "N403AA" ... ## $ ActualElapsedTime: int 60 60 70 70 62 64 70 59 71 70 ... ## $ AirTime : int 40 45 48 39 44 45 43 40 41 45 ... ## $ ArrDelay : int -10 -9 -8 3 -3 -7 -1 -16 44 43 ... ## $ DepDelay : int 0 1 -8 3 5 -1 -1 -5 43 43 ... ## $ Origin : chr "IAH" "IAH" "IAH" "IAH" ... ## $ Dest : chr "DFW" "DFW" "DFW" "DFW" ... ## $ Distance : int 224 224 224 224 224 224 224 224 224 224 ... ## $ TaxiIn : int 7 6 5 9 9 6 12 7 8 6 ... ## $ TaxiOut : int 13 9 17 22 9 13 15 12 22 19 ... ## $ Cancelled : int 0 0 0 0 0 0 0 0 0 0 ... ## $ CancellationCode : chr "" "" "" "" ... ## $ Diverted : int 0 0 0 0 0 0 0 0 0 0 ...
hflights_df <- tbl_df(hflights); hflights_df
## # A tibble: 227,496 x 21 ## Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier ## * <int> <int> <int> <int> <int> <int> <chr> ## 1 2011 1 1 6 1400 1500 AA ## 2 2011 1 2 7 1401 1501 AA ## 3 2011 1 3 1 1352 1502 AA ## 4 2011 1 4 2 1403 1513 AA ## 5 2011 1 5 3 1405 1507 AA ## 6 2011 1 6 4 1359 1503 AA ## 7 2011 1 7 5 1359 1509 AA ## 8 2011 1 8 6 1355 1454 AA ## 9 2011 1 9 7 1443 1554 AA ## 10 2011 1 10 1 1443 1553 AA ## # ... with 227,486 more rows, and 14 more variables: FlightNum <int>, ## # TailNum <chr>, ActualElapsedTime <int>, AirTime <int>, ArrDelay <int>, ## # DepDelay <int>, Origin <chr>, Dest <chr>, Distance <int>, ## # TaxiIn <int>, TaxiOut <int>, Cancelled <int>, CancellationCode <chr>, ## # Diverted <int>
print(hflights_df, n = 20)
data.frame(head(hflights_df, 2))
## Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum ## 1 2011 1 1 6 1400 1500 AA 428 ## 2 2011 1 2 7 1401 1501 AA 428 ## TailNum ActualElapsedTime AirTime ArrDelay DepDelay Origin Dest Distance ## 1 N576AA 60 40 -10 0 IAH DFW 224 ## 2 N557AA 60 45 -9 1 IAH DFW 224 ## TaxiIn TaxiOut Cancelled CancellationCode Diverted ## 1 7 13 0 0 ## 2 6 9 0 0
filter()
filter()
allows us to select a subset of rows in a data frame. Like all single verbs, the first argument is the data frame. The second and subsequent arguments refer to variables within that data frame, selecting rows where the expression is TRUE.
Similair to base::subset()
or [,]
, we can filter our data by rows evaluated by a certain criteria.
filter()
filter(hflights_df, Month == 1, DayofMonth == 1)
hflights[hflights$Month == 1 & hflights$DayofMonth == 1, ]
filter()
filter(hflights_df, Month == 1 | Month == 2) # or filter(hflights_df, Month %in% c(1, 2))
# Wrong syntex # filter(hflights_df, Month == 1 | 2)
arrange()
This function works the same as plyr::arrange()
and base::order()
. Instead of selecting rows, it reorders them. Also note the desc()
option, which orders in descending (alphabetical or numerical) order.
hflights_df %>% select(UniqueCarrier, DepDelay) %>% arrange(desc(DepDelay))
hflights[order(desc(hflights$DepDelay)), c("UniqueCarrier", "DepDelay")]
select()
We are very familiar with two methods of selecting a set of columns from a larger data frame. In base R, the first is the subset()
function, and the other is the rows, columns slicing syntax [,]
.
The dplyr::select()
function is even more straightforward and powerful. It allows you to rapidly zoom in on a useful subset using operations that usually only work on numeric variable positions.
select()
We can select three columns by name without use quotations:
select(hflights_df, Year, Month, AirTime) %>% head(1)
## # A tibble: 1 x 3 ## Year Month AirTime ## <int> <int> <int> ## 1 2011 1 40
or select all columns between year and day:
select(hflights_df, Year:DayofMonth, contains("Taxi")) %>% head(1)
## # A tibble: 1 x 5 ## Year Month DayofMonth TaxiIn TaxiOut ## <int> <int> <int> <int> <int> ## 1 2011 1 1 7 13
select()
We can use "-" to elimate column names, for example, just keep Year and Cancel information:
select(hflights_df, -(DepTime:TaxiOut), -CancellationCode, -Diverted) %>% head(4)
## # A tibble: 4 x 5 ## Year Month DayofMonth DayOfWeek Cancelled ## <int> <int> <int> <int> <int> ## 1 2011 1 1 6 0 ## 2 2011 1 2 7 0 ## 3 2011 1 3 1 0 ## 4 2011 1 4 2 0
select()
Also similar to R's own subset()
function, but dplyr()
no longers need to write a long list of c("colname1", "colname2")
or which(colname(data) == "colname3")
, even does not need to find the column number.
There are a number of helper functions we can use within select()
, like starts_with()
, ends_with()
, matches()
and contains()
, which can let us quickly match larger blocks of variables that meet some criterion.
select()
dplyr::select()
has the ability to do renaming of columns with a very straightforward syntax: new_col_name = old_name
. Let's select the first three columns we began with and rename them:
select(hflights_df, Yr = Year, Mth = Month, Tm = AirTime) %>% head(2)
## # A tibble: 2 x 3 ## Yr Mth Tm ## <int> <int> <int> ## 1 2011 1 40 ## 2 2011 1 45
select()
Notice how the new dataframe only contains the three columns we've just created. To keep the remaining columns, there is a subfunction called matches()
that does the trick.
# rename three columns and select all select(hflights_df,Yr = Year,Mth = Month,Tm = AirTime,matches("."))%>%head(2)
## # A tibble: 2 x 21 ## Yr Mth Tm DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier ## <int> <int> <int> <int> <int> <int> <int> <chr> ## 1 2011 1 40 1 6 1400 1500 AA ## 2 2011 1 45 2 7 1401 1501 AA ## # ... with 13 more variables: FlightNum <int>, TailNum <chr>, ## # ActualElapsedTime <int>, ArrDelay <int>, DepDelay <int>, Origin <chr>, ## # Dest <chr>, Distance <int>, TaxiIn <int>, TaxiOut <int>, ## # Cancelled <int>, CancellationCode <chr>, Diverted <int>
mutate()
We can create new columns with mutate()
. It has the same effect as plyr::mutate()
, and similar to base::transform()
, but with the advantage that the newly added column can be manipulated in the same statement:
mutate(hflights_df, gain = ArrDelay - DepDelay, gain_per_hour = gain / (AirTime / 60) )
Whereas in the base::transform()
we need to write like this:
transform(hflights, gain = ArrDelay - DepDelay, gain_per_hour = (ArrDelay - DepDelay) / (AirTime / 60) )
mutate()
One cool feature of dplyr::mutate()
is the ability to refer to columns that you just created.
hflights_df %>% mutate(Distance_km = round(Distance*1.6), AirPoints = Distance_km*2) %>% select(Distance_km, AirPoints) %>% arrange(desc(AirPoints)) %>% tail(5)
## # A tibble: 5 x 2 ## Distance_km AirPoints ## <dbl> <dbl> ## 1 203. 406. ## 2 203. 406. ## 3 126. 252. ## 4 126. 252. ## 5 126. 252.
mutate()
# nesting method tail(arrange(select(mutate(hflights_df, Distance_km = round(Distance*1.6), AirPoints = Distance_km*2), Distance_km, AirPoints), desc(AirPoints)), 5)
Chaining
sample_n()
& sample_frac()
The sample
function is used to select random rows from a table. The first line of code randomly selects ten rows from the dataset, and the second line of code randomly selects 10% of the original rows from the dataset.
sample_n(hflights_df, size = 10) sample_frac(hflights_df, size = 0.1)
Use replace = TRUE
to perform a bootstrap sample. If needed, you can weight the sample with the weight
argument.
When we process data, in addition to processing individual tables, we sometimes need to operate on two or more tables. Another advantage of dplyr()
is it incorporates a lot of ideas for data manipulation in SQL databases, which makes it logical and consistent to operate on the data tables (tibbles). If you are accessing the analysis in the database, the steps taken will not be a big change. Using the dplyr package can easily let us use the same function to handle the table in the database.
gene_exp_tidy1
## GeneId sample_name expression ## 1 gen1 Sample1 1.0 ## 2 gen1 Sample2 2.0 ## 3 gen1 Sample3 0.3 ## 4 gen2 Sample1 4.0 ## 5 gen2 Sample2 5.0 ## 6 gen2 Sample3 6.0 ## 7 gen3 Sample1 7.0 ## 8 gen3 Sample2 0.8 ## 9 gen3 Sample3 9.0 ## 10 gen4 Sample1 10.0 ## 11 gen4 Sample2 11.0 ## 12 gen4 Sample3 12.0
gene_exp_tidy2
## GeneId sample_name expression ## 1 gen5 Sample1 2 ## 2 gen5 Sample2 3 ## 3 gen5 Sample3 4
gene_anno
## GeneId annotation ## 1 gen1 a related ## 2 gen2 b related ## 3 gen3 c related
gene_anno1
## Geneid annotation ## 1 gen1 a related ## 2 gen2 b related ## 3 gen3 c related
Use bind_cols(x, y)
to paste tables beside each other as they are.
Use bind_rows(x, y)
to paste tables below each other as they are.
# base::rbind() rbind(gene_exp_tidy1,gene_exp_tidy2)
# dplyr::bind_rows() bind_rows(gene_exp_tidy1,gene_exp_tidy2)
intersect()
Create a data frame consisting of gene1 and gene5 first, then get intersect of gene_exp_tidy1 and gene_exp_tidy3:
gene_exp_tidy3 <- filter(gene_exp_tidy1, GeneId == "gen1") %>% bind_rows(gene_exp_tidy2) intersect(gene_exp_tidy1, gene_exp_tidy3)
## GeneId sample_name expression ## 1 gen1 Sample1 1.0 ## 2 gen1 Sample2 2.0 ## 3 gen1 Sample3 0.3
union()
Find the union of gene_exp_tidy1 and gene_exp_tidy3 (automatically remove duplicates):
union(gene_exp_tidy1, gene_exp_tidy3)
Find the union of two data frames (without removing duplicates)
union_all(gene_exp_tidy1, gene_exp_tidy3)
Find the difference between two data frames
setdiff(gene_exp_tidy1, gene_exp_tidy3)
## GeneId sample_name expression ## 1 gen2 Sample1 4.0 ## 2 gen2 Sample2 5.0 ## 3 gen2 Sample3 6.0 ## 4 gen3 Sample1 7.0 ## 5 gen3 Sample2 0.8 ## 6 gen3 Sample3 9.0 ## 7 gen4 Sample1 10.0 ## 8 gen4 Sample2 11.0 ## 9 gen4 Sample3 12.0
left_join()
: Take the first data frame as the standard and add the relevant information of the second data frame into the first one.
left_join(gene_exp_tidy1, gene_anno, by = "GeneId")
What if the two column names are not exactly the same?
left_join(gene_exp_tidy1, gene_anno1, by =c("GeneId" = "Geneid"))
Similiar,
right_join(gene_exp_tidy1, gene_anno1, by =c("GeneId" = "Geneid")) inner_join(gene_exp_tidy1, gene_anno1, by =c("GeneId" = "Geneid")) full_join(gene_exp_tidy1, gene_anno1, by =c("GeneId" = "Geneid")) semi_join(gene_exp_tidy1, gene_anno1, by =c("GeneId" = "Geneid")) anti_join(gene_exp_tidy1, gene_anno1, by =c("GeneId" = "Geneid"))
Tidy data does not use rownames, which store a variable outside of the columns. To work with the rownames, first move them into a column.
rownames_to_column()
: Move row names into col.
a <- rownames_to_column(gene_exp_tidy2, var = "C") head(a, 3)
## C GeneId sample_name expression ## 1 1 gen5 Sample1 2 ## 2 2 gen5 Sample2 3 ## 3 3 gen5 Sample3 4
column_to_rownames()
: Move col in row names.
column_to_rownames(a, var = "C") %>% head(3)
## GeneId sample_name expression ## 1 gen5 Sample1 2 ## 2 gen5 Sample2 3 ## 3 gen5 Sample3 4
Syntax and function of all these verbs are very similar:
The first argument is a data frame.
The subsequent arguments describe what to do with the data frame. You can refer to columns in the data frame directly without using $.
The result is a new data frame
Together these properties make it easy to chain together multiple simple steps to achieve a complex result.