dplyr is my most used R package, so I’m very familiar with the common verbs like group_by, summarise, mutate, join, rename, dplyr::select, etc. These functions have been tremendously valuable to me as an analyst, so I decided to take a closer look at the package to see if there are any useful functions I don’t yet know about!
And here are the ones I’ll definitely be adding to the toolbox:
min_rank, top_n - Previously, when I wanted to extract the top x rows, I just used arrange and filtered on row_number. This worked fine when I didn’t need to handle ties, but dplyr and base R also have ranking functions for when you do. Use min_rank when you want to keep all tied rows, and use the ties.method parameter of base R’s rank function for finer control over how ties are ranked. Similarly, top_n selects rows using min_rank, so tied rows are included.
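To make the tie handling concrete, here is a small sketch. The scores data frame and its column names are made up for illustration.

```r
library(dplyr)

# Hypothetical scores with a tie for second place
scores <- tibble(
  name  = c("a", "b", "c", "d"),
  score = c(90, 85, 85, 70)
)

# min_rank() gives tied values the same (lowest) rank: 1, 2, 2, 4,
# so both rows with a score of 85 are kept
top2_with_ties <- scores %>%
  filter(min_rank(desc(score)) <= 2)

# row_number() breaks ties by position, so exactly two rows come back
top2_strict <- scores %>%
  filter(row_number(desc(score)) <= 2)

# top_n() ranks with min_rank() internally, so it also returns three rows
top2_top_n <- top_n(scores, 2, score)

# Base R equivalent, with the tie rule spelled out through ties.method
base_rank <- rank(-scores$score, ties.method = "min")
```

Swapping ties.method for "first", "max" or "average" changes how the two tied rows are ranked.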
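cume_dist - Now that I think about it, dividing each row’s cumsum by the group’s total sum is quite awkward. cume_dist returns, for each row, the proportion of values less than or equal to that row’s value, so you can quickly grab the rows in the top/bottom x% of some value. Note that this is a share of the rows by rank, not a share of the total sum.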
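A minimal sketch of grabbing the bottom half of each group with cume_dist. The sales data frame is hypothetical.

```r
library(dplyr)

# Hypothetical sales amounts in two regions
sales <- tibble(
  region = c("n", "n", "n", "n", "s", "s"),
  amount = c(10, 20, 30, 40, 5, 15)
)

# Within region "n", cume_dist(amount) is 0.25, 0.50, 0.75, 1.00;
# within region "s" it is 0.5, 1.0
bottom_half <- sales %>%
  group_by(region) %>%
  filter(cume_dist(amount) <= 0.5) %>%
  ungroup()
```

This keeps the rows with amounts 10 and 20 from region "n" and the row with amount 5 from region "s".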
order_by - Related to arrange, but instead of sorting the data frame it lets you evaluate a window function (such as cumsum) in the order of another variable, while the rows themselves stay where they are.
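Here is a short sketch of a running total computed in year order on rows that are deliberately scrambled. The data is invented for the example.

```r
library(dplyr)

# Hypothetical data, intentionally not sorted by year
df <- tibble(
  year  = c(2003, 2001, 2002),
  value = c(30, 10, 20)
)

# cumsum() is evaluated as if the rows were sorted by year,
# but the rows stay in their original positions
running <- df %>%
  mutate(running_total = order_by(year, cumsum(value)))
```

The row for 2003 gets 60, the row for 2001 gets 10, and the row for 2002 gets 30, all without calling arrange.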
cumany, cumall - Again, these functions depend on row order. With cumany, you can grab the first row where a given condition is true along with all the rows after it; with cumall, you keep grabbing rows for as long as they satisfy the condition. They are sort of like a running OR and AND: cumany keeps looking for that first TRUE, whereas cumall gives up after seeing the first FALSE.
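A quick sketch of both, using a made-up series of readings and an arbitrary threshold of 5:

```r
library(dplyr)

# Hypothetical readings, already in the order we care about
readings <- tibble(
  step  = 1:6,
  value = c(1, 2, 7, 3, 8, 2)
)

# cumany(): FALSE until the condition is first TRUE, then TRUE for the rest,
# so this keeps steps 3 through 6
from_first_high <- readings %>%
  filter(cumany(value > 5))

# cumall(): TRUE until the condition first fails, then FALSE for the rest,
# so this keeps steps 1 and 2
while_low <- readings %>%
  filter(cumall(value <= 5))
```

Note that step 4 (value 3) stays in the cumany result even though it fails the condition itself, because a TRUE has already been seen.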
intersect, union, setdiff - I’ve mainly relied on base R’s %in% and which for my set operations, which works but gets a bit wordy, so I’m happy to see substitutes in dplyr (its versions also work on data frames). Also worth mentioning is setequal, which checks whether 2 vectors contain the same set of values (what’s the alternative? `all(x %in% y) && all(y %in% x)`?). Something to note is that setdiff(x, y) is not the same as setdiff(y, x).
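A small sketch of the set operations on vectors and on data frames. The vectors and the id column are arbitrary examples.

```r
library(dplyr)

x <- c(1, 2, 3, 4)
y <- c(3, 4, 5)

both    <- intersect(x, y)  # values in both: 3 4
either  <- union(x, y)      # values in either: 1 2 3 4 5
only_x  <- setdiff(x, y)    # in x but not in y: 1 2
only_y  <- setdiff(y, x)    # in y but not in x: 5
same    <- setequal(x, y)   # FALSE

# A one-directional %in% check is not enough to test set equality:
# every element of c(3, 4) is in y, yet the two sets differ
one_way <- all(c(3, 4) %in% y)   # TRUE
two_way <- setequal(c(3, 4), y)  # FALSE

# With dplyr loaded, the same verbs also compare rows of data frames
a <- tibble(id = 1:3)
b <- tibble(id = 2:4)
rows_only_in_a <- setdiff(a, b)  # the row with id 1
rows_only_in_b <- setdiff(b, a)  # the row with id 4
```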
distinct - Didn’t know this function existed! It avoids the awkward ungroup that my original method needed, which was to group_by and then filter on row_number.
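For comparison, here are the old and new approaches side by side on a hypothetical orders table. The .keep_all = TRUE argument keeps the other columns from the first row of each customer.

```r
library(dplyr)

# Hypothetical orders, several per customer
orders <- tibble(
  customer = c("a", "a", "b", "b", "c"),
  item     = c("pen", "ink", "pen", "pad", "ink")
)

# Original method: group, keep the first row of each group, then ungroup
first_old <- orders %>%
  group_by(customer) %>%
  filter(row_number() == 1) %>%
  ungroup()

# distinct(): one row per customer, keeping the first occurrence
first_new <- orders %>%
  distinct(customer, .keep_all = TRUE)
```

Both return the rows a/pen, b/pen and c/ink.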
do - THIS FUNCTION!!! I’m pretty sure I was just subsetting and writing a lot of duplicate code to deal with each df partition previously. This function allows you to apply actions to each group and outputs the results to a data frame. The grouping remains after this call so you can extract the needed components afterwards. Gamechanger.
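As a rough sketch of the per-group pattern, this fits a separate linear model for each cyl group in the built-in mtcars data and collects one row per group. The choice of model and the slope column name are just for illustration.

```r
library(dplyr)

# Inside do(), `.` stands for the rows of the current group.
# Each call returns a one-row data frame, and do() binds them together.
slopes <- mtcars %>%
  group_by(cyl) %>%
  do(data.frame(slope = coef(lm(mpg ~ wt, data = .))[["wt"]]))
```

The result has one row per cyl value with the grouping column carried along. Newer dplyr releases mark do as superseded, so it is worth checking the documentation for the version you have installed.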
last - A more succinct way of selecting the last element than indexing by length.
case_when - A more succinct way of writing multiple if-else statements (though the syntax is odd).
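To show what the odd syntax looks like, here is a small sketch next to the nested ifelse it replaces. The cut-offs and labels are arbitrary.

```r
library(dplyr)

x <- c(5, 50, 500, NA)

# Conditions are checked top to bottom and the first match wins.
# Handling NA first stops it from falling through to the TRUE catch-all.
size <- case_when(
  is.na(x) ~ NA_character_,
  x < 10   ~ "small",
  x < 100  ~ "medium",
  TRUE     ~ "large"
)

# The nested ifelse version of the same logic
size_nested <- ifelse(x < 10, "small",
                      ifelse(x < 100, "medium", "large"))
```

Each line reads as condition ~ value, and TRUE on the last line acts as the final else.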
Here are the resources I used: