I've learned that the do function is used when you want to apply a function to each group.
For example, if I want to pull the top 2 rows from the "A", "C", and "I" categories of the variable Index, the following syntax can be used.
t <- mydata %>% filter(Index %in% c("A", "C", "I")) %>% group_by(Index) %>% do(head(., 2))
I understand that after grouping by Index, the do function is used to compute head(., 2) for each group.
However, on some occasions do is not used at all. For example, to compute the mean of the variable Y2014 grouped by the variable Index, I thought that the following code should be used.
t <- mydata %>% group_by(Index) %>% do(summarise(Mean_2014 = mean(Y2014)))
However, the above syntax returns an error:
Error in mean(Y2014) : object 'Y2014' not found
But if I remove do from the syntax, it returns exactly what I wanted.
t <- mydata %>% group_by(Index) %>% summarise(Mean_2014 = mean(Y2014))
I'm really confused about the usage of the do function in dplyr. It seems inconsistent to me. When should I use do, and when should I not? Why should I use do in the first case and not in the second?
1 Answer
The comments under the question point out that in many cases dplyr or an associated package offers an alternative that avoids do, and the examples in the question are of that sort. However, to answer the question directly rather than via alternatives:
Differences between using do and not using it
Within the context of data frames, the key differences between using do and not using do are:
No automatic insertion of dot. The code within do will not have dot automatically inserted into the first argument. For example, instead of the do(summarise(Mean_2014 = mean(Y2014))) code in the question, one would have to write do(summarise(., Mean_2014 = mean(Y2014))) with a dot, since the dot is not automatically inserted. This is a consequence of do, rather than summarize, being the right-hand-side function of %>%. Although this is important to understand so that we insert dot when needed, if the purpose were simply to avoid automatic insertion of dot into the first argument we could alternatively use braces to get that effect: whatever %>% { myfun(arg1, arg2) } would also not automatically insert dot as the first argument of the myfun call.

Respecting group_by. Only functions specifically written to respect group_by will do so. There are two issues here.
(1) Only functions specifically written to respect group_by will be run once for each group. mutate, summarize and do are examples of functions that run once per group (there are others too).
(2) Even if the function is run once for each group, there is the question of how dot is handled. We focus on two cases (not a complete list): (i) if do is not used, then a dot used within a function call inside an argument expression refers to the entire input, ignoring group_by. Presumably this is a consequence of magrittr's dot substitution rules and magrittr not knowing anything about group_by. On the other hand, (ii) within do, dot always refers to the rows of the current group.

For example, compare the output of these two and note that dot refers to 3 rows in the first case, where do is used, and to all 6 rows in the second, where it is not. This is despite the fact that summarize respects group_by in that it runs once per group.
BOD$g <- c(1, 1, 1, 2, 2, 2)
BOD %>% group_by(g) %>% do(summarize(., nr = nrow(.)))
## # A tibble: 2 x 2
## # Groups:   g [2]
##       g    nr
##   <dbl> <int>
## 1  1.00     3
## 2  2.00     3
BOD %>% group_by(g) %>% summarize(nr = nrow(.))
## # A tibble: 2 x 2
##       g    nr
##   <dbl> <int>
## 1  1.00     6
## 2  2.00     6
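To make the point about dot insertion concrete, here is a minimal sketch that uses base R's sum in place of the placeholder myfun (that substitution is mine, purely for illustration). It contrasts an ordinary right-hand side, where %>% inserts the left-hand side as the first argument, with a braced right-hand side, where nothing is inserted.

```r
library(dplyr)  # re-exports %>% from magrittr

x <- c(1, 2, 3)

# Ordinary call: %>% inserts x as the first argument, i.e. sum(x, 4, 5).
with_insertion <- x %>% sum(4, 5)

# Braced expression: nothing is inserted, so this is just sum(4, 5).
without_insertion <- x %>% { sum(4, 5) }

# Inside braces the dot is still available if written explicitly.
explicit_dot <- x %>% { sum(4, 5, .) }

print(c(with_insertion, without_insertion, explicit_dot))
```

The first and third results agree because the dot was supplied in both, once automatically and once by hand; the second differs because the input was never passed to sum at all.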
See ?do for more information.
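As an illustration of a case where do is doing real work, the sketch below applies a function that knows nothing about group_by (lm) to each group of the built-in BOD data. The grouping column g and the choice of a per-group regression slope are assumptions made only for this example; the expression inside do has to return a data frame.

```r
library(dplyr)

# Work on a copy of the built-in BOD data (columns Time and demand).
bod <- BOD
bod$g <- c(1, 1, 1, 2, 2, 2)  # illustrative grouping column

# lm is not group-aware, so do runs it once per group; within do the dot
# is the rows of the current group only.
slopes <- bod %>%
  group_by(g) %>%
  do(data.frame(slope = coef(lm(demand ~ Time, data = .))[["Time"]]))

print(slopes)
```

The result should have one row per group, with the group label in g and the fitted slope for that group's three rows.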
Code from Question
Now we go through the code in the question. As mydata was never defined in the question, we define it below to facilitate concrete examples.
library(dplyr)
mydata <- data.frame(Index = rep(c("A", "C", "I"), each = 3), Y2014 = 1)
mydata %>% filter(Index %in% c("A", "C", "I")) %>% group_by(Index) %>% do(head(., 2))
## # A tibble: 6 x 2
## # Groups:   Index [3]
##   Index  Y2014
##   <fctr> <dbl>
## 1 A       1.00
## 2 A       1.00
## 3 C       1.00
## 4 C       1.00
## 5 I       1.00
## 6 I       1.00
The code above produces 2 rows for each of the 3 groups, giving 6 rows. Had we omitted do, head would disregard group_by and produce only two rows, with dot being regarded as the entire 9 rows of input rather than one group at a time. (In this particular case dplyr provides its own alternative to head that avoids these problems, but for the sake of illustrating the general point we stick to the code in the question.)
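The difference is easy to check by counting rows. The sketch below rebuilds the same mydata and compares head with and without do. It also includes slice(1:2) as one group-aware alternative to head; picking slice here is my assumption about which alternative is meant, not something stated above.

```r
library(dplyr)

mydata <- data.frame(Index = rep(c("A", "C", "I"), each = 3), Y2014 = 1)
grouped <- mydata %>% group_by(Index)

# head is not group-aware: it takes the first 2 rows of the whole input.
n_head_only <- nrow(grouped %>% head(2))

# Inside do, head runs once per group: 2 rows from each of the 3 groups.
n_head_in_do <- nrow(grouped %>% do(head(., 2)))

# slice respects group_by, so no do is needed.
n_slice <- nrow(grouped %>% slice(1:2))

print(c(n_head_only, n_head_in_do, n_slice))
```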
The following code from the question generates an error because dot insertion is not done within do and so what ought to be the first argument of summarize, i.e. the data frame input, is missing:
mydata %>% group_by(Index) %>% do(summarise(Mean_2014 = mean(Y2014)))
## Error in mean(Y2014) : object 'Y2014' not found
If we remove the do in the above code, as in the last line of code in the question, then it works since the dot insertion is performed. Alternatively, if we add the dot, as in do(summarise(., Mean_2014 = mean(Y2014))), it would also work, although do really seems superfluous in this case: summarize already respects group_by, so there is no need to wrap it in do.
mydata %>% group_by(Index) %>% summarise(Mean_2014 = mean(Y2014))
## # A tibble: 3 x 2
##   Index  Mean_2014
##   <fctr>     <dbl>
## 1 A           1.00
## 2 C           1.00
## 3 I           1.00
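To confirm that the two working forms agree, here is a small sketch that runs both on the same mydata and compares the computed means. The variable names with_do and without_do are introduced only for this comparison.

```r
library(dplyr)

mydata <- data.frame(Index = rep(c("A", "C", "I"), each = 3), Y2014 = 1)

# do with the dot written explicitly as summarise's first argument.
with_do <- mydata %>%
  group_by(Index) %>%
  do(summarise(., Mean_2014 = mean(Y2014)))

# No do: %>% inserts the grouped data into summarise automatically.
without_do <- mydata %>%
  group_by(Index) %>%
  summarise(Mean_2014 = mean(Y2014))

# Both give one mean per Index group.
print(all.equal(with_do$Mean_2014, without_do$Mean_2014))
```

One difference worth noticing when printing the two objects is that the do result is still grouped by Index, as in the BOD output earlier, while summarise drops that grouping level.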