Chapter 5 Introducing the %>% Operator

The pipe operator is an incredibly important component of the modern R workflow.

5.1 What is the %>% Operator

Whenever you see %>% (percent sign, greater than, percent sign) in R code, you should pronounce it “pipe”. The pipe operator is simply ‘Syntactic Sugar’, and it’s the workhorse of the tidyverse and htmlwidgets, which is why it’s so important to master for our uses. But what is Syntactic Sugar? It is syntax that developers add to a language to make code easier for humans to read or write, without adding any new capability.

Typical use cases for Syntactic Sugar are:

- reducing the number of keystrokes needed to write code
- improving the “flow” of writing code, where “flow” simply means the stream of consciousness of the programmer.

We want to minimize the amount we have to think about writing code and instead think about the task we’re trying to achieve. In R, our programming tasks are typically about data manipulation. The ‘pipe’ is excellent Syntactic Sugar for reducing the number of key presses and emphasizing the flow of data in R. So the pipe operator is Syntactic Sugar for chaining operations together.

Now let’s look at a real example in RStudio. Let’s create a simple vector ‘prime’, which stores 1 followed by the odd prime numbers up to 17 (strictly speaking, 1 is not prime and 2 is missing, but these values are enough for the example). If we wanted to calculate the rolling differences between consecutive numbers, we could write this using standard notation as follows:

prime <- c(1, 3, 5, 7, 11, 13, 17)
diff(prime) # calculates the differences
## [1] 2 2 2 4 2 4

Now we get the rolling differences between the numbers in our vector, but we could just as easily write this with a pipe.

library(tidyverse)
prime %>%
  diff()
## [1] 2 2 2 4 2 4

To make the pipe operator available in R, we load the tidyverse at the top of our script file, as this also loads the pipe from magrittr. We can see that the piped version returns exactly the same result as the standard notation.
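The equivalence can also be checked programmatically rather than by eye. This is only a minimal sketch: it assumes the magrittr package is installed and attaches it directly instead of the whole tidyverse, and the names `nested` and `piped` are mine, chosen for illustration.

```r
library(magrittr)  # provides %>% on its own, without the rest of the tidyverse

prime <- c(1, 3, 5, 7, 11, 13, 17)

nested <- diff(prime)        # standard notation
piped  <- prime %>% diff()   # pipe notation: prime becomes the first argument of diff()

identical(nested, piped)     # compares the two results element by element
```

If `identical()` returns TRUE, the pipe has changed only how the code is written, not what it computes.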

Now, if we wanted to calculate the mean difference between the primes, it’s really simple to add another operation to our pipe chain.

prime %>%
  diff() %>%
  mean()
## [1] 2.666667

We get the mean difference between the numbers in prime. If we were to write this in standard notation, we would have to rewrite the code.

mean(diff(prime))
## [1] 2.666667

While this is technically fewer key presses, we have had to rewrite our code, and the steps of the operation are no longer obvious from reading it. In traditional R notation, expressions need to be rewritten for new operations to be added: you need to move to the beginning of the line to add the new function call, and then to the end of the line to add a closing parenthesis. With the pipe operator, you can simply continue the chain without having to interrupt yourself to reorganize the code, because the pipe simply chains operations together. Understanding how the pipe operator works is important for mastering modern R, including the tidyverse and htmlwidgets.
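To make the difference concrete, here is a small sketch that adds one more step, rounding to two decimal places, to both versions. The rounding step is my own addition for illustration and is not part of the original example; the sketch assumes magrittr is installed.

```r
library(magrittr)

prime <- c(1, 3, 5, 7, 11, 13, 17)

# Nested form: the new call has to wrap the whole existing expression,
# so we edit both the start and the end of the line.
nested <- round(mean(diff(prime)), 2)

# Piped form: the new step is simply appended to the end of the chain.
piped <- prime %>%
  diff() %>%
  mean() %>%
  round(2)

identical(nested, piped)  # both forms compute the same value
```

Reading the piped version top to bottom gives the order in which the operations happen, whereas the nested version has to be read from the inside out.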

5.2 Significance of %>%

It’s important to understand the significance of periods, or full stops, in pipe expressions, as they’re used fairly frequently. We’ve established that pipes are very useful syntactic sugar that makes it easy to chain together operations, and that they’re ubiquitous in both the tidyverse and in the construction of htmlwidget visualizations. By default, the pipe inserts its left-hand side as the first argument of the function on its right-hand side, but sometimes you know better than the pipe operator: the left-hand side doesn’t belong in the first argument and needs to be inserted somewhere else. That’s what a period allows you to do. So let’s look at an example of that.

library(praise) # includes random texts
rep(praise(), 10)
##  [1] "You are mathematical!" "You are mathematical!" "You are mathematical!"
##  [4] "You are mathematical!" "You are mathematical!" "You are mathematical!"
##  [7] "You are mathematical!" "You are mathematical!" "You are mathematical!"
## [10] "You are mathematical!"

I used the function rep, which repeats its first argument the number of times specified in its second argument. So I get the same piece of praise 10 times: praise() is called once and its result is copied. What the praise library does is randomly generate a piece of praise, so if I run this again, I’ll get something else.

So, what if I wanted to generate a vector of praise that was as long as the mean of the differences of my vector, prime, just as an example? Well, I could write it like this using base R:

prime <- c(1, 3, 5, 7, 11, 13, 17)
rep(praise(), mean(diff(prime)))
## [1] "You are groovy!" "You are groovy!"

But what if I wanted to write this using pipes? Well, I first pipe my vector into the operation diff to calculate the differences between its elements, then I pipe the result into mean, so I get the mean difference, which is two and two thirds (2.666667), and then I pipe this into the rep function. But the first argument of rep should be the thing I’m repeating, not the number of times I’m repeating it, so if I type the following, I’m going to get an error.

library(tidyverse)
prime %>%
  diff() %>%
  mean() %>%
  rep(praise())
Error in rep(., praise()) : invalid 'times' argument

I get told that the times argument is invalid. You can have a look at the documentation for rep by selecting the name of the function and pressing F1, or running the following code.

?rep

We see the first argument of rep should be x, the thing that should be repeated, and the second argument should be times. The pipe has put the mean difference into x and left praise() as the second argument, so what I’m effectively getting is ‘times = praise()’, a character string where a number of repetitions is expected. We need to use the period to place the left-hand side of the pipe in the appropriate position on the right-hand side. Here the period moves the left-hand side away from the first argument and into the second argument.
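Because praise() is random, the placement is easier to verify with fixed values first. The following is a minimal sketch, assuming magrittr is installed; the string "ha" and the variable names are hypothetical stand-ins of mine, not part of the original example.

```r
library(magrittr)

# Default placement: the left-hand side becomes the FIRST argument,
# so this is rep(3, "ha") and "ha" ends up as the times argument.
default_msg <- tryCatch(
  3 %>% rep("ha"),
  error = function(e) conditionMessage(e)  # capture the error text instead of stopping
)

# Period placement: the left-hand side goes where the period is,
# so this is rep("ha", 3).
placed <- 3 %>% rep("ha", .)

# A fractional count is truncated toward zero by rep, so 2.666667 behaves like 2.
truncated <- 2.666667 %>% rep("ha", .)

default_msg
placed
length(truncated)
```

Applied to the praise example, the same placement looks like this: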

prime %>%
  diff() %>%
  mean() %>%
  rep(praise(), .)
## [1] "You are terrific!" "You are terrific!"

We got a vector of two bits of praise: the period is pulled into the second argument, which is the times argument, and rep truncates the fractional value 2.666667 to 2. If we didn’t want the same praise each time, we could rewrite this as follows with the replicate function, where the argument is no longer times, it’s ‘n’.

prime %>%
  diff() %>%
  mean() %>%
  replicate(praise(), n = .)
## [1] "You are beautiful!" "You are perfect!"
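The reason the output differs is that rep evaluates its first argument once and copies the value, whereas replicate re-evaluates its expression for every element. Here is a small sketch of that difference, with runif(1) standing in for praise() so that no extra package is needed; the seed and the variable names are arbitrary choices of mine, and magrittr is assumed to be installed.

```r
library(magrittr)

set.seed(42)  # arbitrary seed, only so the run is reproducible

# rep: runif(1) is evaluated once, and that single value is copied 3 times
repeated <- 3 %>% rep(runif(1), .)

# replicate: runif(1) is evaluated afresh for each of the 3 elements
replicated <- 3 %>% replicate(runif(1), n = .)

length(unique(repeated))    # number of distinct values from rep
length(unique(replicated))  # number of distinct values from replicate
```

The first count should be 1 and the second 3, which mirrors the identical versus varied praise above.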

With replicate, I get different praise each time. We used a very trivial example to demonstrate the use of periods in pipes, but there’s a very important use case for the period in the tidyverse: extracting data from data frames as vectors.

As an example, take the midwest dataset (shipped with ggplot2, which the tidyverse loads): we select the column state from it, and we ask for the unique values of that column.

midwest %>%
  select(state) %>%
  unique()
## # A tibble: 5 x 1
##   state
##   <chr>
## 1 IL   
## 2 IN   
## 3 MI   
## 4 OH   
## 5 WI

This returns a tibble, or data frame. If I want the result as a vector, I need to use the period to refer to the data frame coming through the pipe and extract its first column with double square brackets, which returns a vector.

midwest %>%
  select(state) %>%
  unique() %>%
  .[[1]]
## [1] "IL" "IN" "MI" "OH" "WI"
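The difference between single and double square brackets can be seen on a much smaller table. This is only a sketch: `states` is a hypothetical stand-in for the state column of midwest, built by hand so that the example needs nothing beyond magrittr.

```r
library(magrittr)

# Hypothetical stand-in for the state column of midwest
states <- data.frame(
  state = c("IL", "IN", "IL", "WI", "IN"),
  stringsAsFactors = FALSE  # keep the column as character on older R versions too
)

# Single brackets keep the data frame structure: a one-column data frame
as_frame <- states %>%
  unique() %>%
  .[1]

# Double brackets extract the column itself: a plain character vector
as_vector <- states %>%
  unique() %>%
  .[[1]]

class(as_frame)
class(as_vector)
```

In both chains the period stands for the de-duplicated data frame arriving from the left-hand side; only the kind of bracket decides whether a data frame or a vector comes out.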

If you want to understand R’s slightly odd indexing behavior more thoroughly, where indexing means extracting components of an object using square brackets, then I thoroughly recommend that you look into Hadley Wickham’s example with pepper shakers.