Chapter 11 Tidy evaluation

If you are using Shiny with the tidyverse, you will almost certainly encounter the challenge of programming with tidy evaluation. Tidy evaluation is a tool that the tidyverse uses to faciliate fluid interactive data exploration, but it comes with a cost: it’s hard to refer to variables indirectly, and hence harder to program with.

This chapter will show you how to use Shiny with the two most common forms of tidy evaluation:

  • Data-masking allows you to refer to variables within a data frame without repeating the name of the data frame. For example, this code can use the x, z, carat, and price variables from the diamonds dataset because filter() and aes() use data-masking.

    diamonds %>% filter(x == z)
    
    ggplot(diamonds, aes(carat, price)) + geom_hex()
  • Tidy-selection allows you to concisely select variables by their position, name, or value. For example, the following code selects all variables from iris that start with “sepal”, and all numeric variables from diamonds25:

    iris %>% select(starts_with("sepal"))
    diamonds %>% select(is.numeric)

In this chapter, you’ll learn how to use data-masking and tidy-selection functions from ggplot2 and dplyr with your Shiny app. To learn more about using tidy evaluation in a function or package, you’ll need to turn to other resources like Using ggplot2 in packages or Programming with dplyr.

library(shiny)

library(dplyr, warn.conflicts = FALSE)
library(ggplot2)

11.1 Data-masking

Data-masking allows you to use variables in the “current” data frame without any extra syntax. It’s used in many dplyr functions like arrange(), filter(), group_by(), mutate(), and summarise(), and in ggplot2’s aes(). Data-masking works by blurring the lines between two meanings of “variable”:

  • Environment variables (env-variables for short) are “programming” variables. They are usually created with <-.

  • Data frame variables (data-variables for short) are “statistical” variables that live inside a data frame. They usually come from data loaded by functions like read.csv(), or are created by manipulating other variables.

To make these terms more concrete, take this piece of code:

df <- data.frame(x = runif(3), y = runif(3))
df$x
#> [1] 0.442 0.248 0.653

It creates a env-variable, df, that contains two data-variables, x and y. Then it extracts the data-variable x out of the df using $.

Data-masking is useful because it lets you use data-variables without any additional syntax. Take this filter() call:

filter(diamonds, x == 0 | y == 0)
#> # A tibble: 8 x 10
#>   carat cut       color clarity depth table price     x     y     z
#>   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
#> 1  1.07 Ideal     F     SI2      61.6    56  4954     0  6.62     0
#> 2  1    Very Good H     VS2      63.3    53  5139     0  0        0
#> 3  1.14 Fair      G     VS1      57.5    67  6381     0  0        0
#> 4  1.56 Ideal     G     VS2      62.2    54 12800     0  0        0
#> 5  1.2  Premium   D     VVS1     62.1    59 15686     0  0        0
#> 6  2.25 Premium   H     SI2      62.8    59 18034     0  0        0
#> 7  0.71 Good      F     SI2      64.1    60  2130     0  0        0
#> 8  0.71 Good      F     SI2      64.1    60  2130     0  0        0

In most (but not all26) base R functions you need to refer to a data-variable with $, leading to code that repeats the name of the data frame many times:

diamonds[diamonds$x == 0 | diamonds$y == 0, ]
#> # A tibble: 8 x 10
#>   carat cut       color clarity depth table price     x     y     z
#>   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
#> 1  1.07 Ideal     F     SI2      61.6    56  4954     0  6.62     0
#> 2  1    Very Good H     VS2      63.3    53  5139     0  0        0
#> 3  1.14 Fair      G     VS1      57.5    67  6381     0  0        0
#> 4  1.56 Ideal     G     VS2      62.2    54 12800     0  0        0
#> 5  1.2  Premium   D     VVS1     62.1    59 15686     0  0        0
#> 6  2.25 Premium   H     SI2      62.8    59 18034     0  0        0
#> 7  0.71 Good      F     SI2      64.1    60  2130     0  0        0
#> 8  0.71 Good      F     SI2      64.1    60  2130     0  0        0

11.1.1 Indirection

The blurring of data-variables and env-variables makes data analysis easier at the cost of making indirection harder. In Shiny27, indirection happens when you have the name of a data-variable stored in an env-variable, and is the key in using data-masking with Shiny apps.

To make the problem concrete, take the following code which uses data-masking to refer to a data-variable (carat) and env-variable (min_carat) in the same way28:

min_carat <- 1
diamonds %>% filter(carat > min_carat)
#> # A tibble: 17,502 x 10
#>    carat cut       color clarity depth table price     x     y     z
#>    <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
#>  1  1.17 Very Good J     I1       60.2    61  2774  6.83  6.9   4.13
#>  2  1.01 Premium   F     I1       61.8    60  2781  6.39  6.36  3.94
#>  3  1.01 Fair      E     I1       64.5    58  2788  6.29  6.21  4.03
#>  4  1.01 Premium   H     SI2      62.7    59  2788  6.31  6.22  3.93
#>  5  1.05 Very Good J     SI2      63.2    56  2789  6.49  6.45  4.09
#>  6  1.05 Fair      J     SI2      65.8    59  2789  6.41  6.27  4.18
#>  7  1.01 Fair      E     SI2      67.4    60  2797  6.19  6.05  4.13
#>  8  1.04 Premium   G     I1       62.2    58  2801  6.46  6.41  4   
#>  9  1.2  Fair      F     I1       64.6    56  2809  6.73  6.66  4.33
#> 10  1.02 Premium   G     I1       60.3    58  2815  6.55  6.5   3.94
#> # … with 17,492 more rows

Imagine we want to extend this code so that it works for any variable and any value. That is straightforward with base R because you can switch from $ to [[:

var <- "carat"
min <- 1
diamonds[diamonds[[var]] > min, ]

Data-masking solves the indirection problem in a similar way, by introducing an object, .data that you can subset using either $ or [[. To get started we can rewrite our previous filter() call to use .data to make it clear that carat is a data-variable29:

diamonds %>% filter(.data$carat > min)

This form isn’t particularly useful (unless you’re writing a package, in which case it eliminates a pesky R CMD check NOTE), but because we have some object to index into, we can switch from $ to [[:

var <- "carat"
diamonds %>% filter(.data[[var]] > min)

Let’s wrap this up into a simple Shiny app. The UI will have a input to select the variable, an input for the minimum value, and a output to show the data, and the server function will wrap the filter() call in a reactive and render the first few rows.

ui <- fluidPage(
  selectInput("var", "Variable", choices = names(diamonds)),
  numericInput("min", "Minimum", value = 1),
  tableOutput("output")
)
server <- function(input, output, session) {
  data <- reactive(filter(diamonds, .data[[input$var]] > input$min))
  output$output <- renderTable(head(data()))
}

Note that if we didn’t use .data, as below, the app will not work correctly, but also won’t generate an error message. That’s because input$var will be a string like "carat" and "carat" > 0 is valid R expression that evaluates to TRUE, selecting all rows.

server <- function(input, output, session) {
  data <- reactive(filter(diamonds, input$var > input$min))
  output$output <- renderTable(head(data()))
}

Now that you’ve seen the basics, we’ll develop a couple of more realistic, but still simple, Shiny apps.

11.1.2 Example: ggplot2

Lets take a look at a more complicated example where we allow the user to create a plot by selecting the variables to appear on the x and y axes:

# NB: needs ggplot >= 3.3.0 to get nice labels

ui <- fluidPage(
  selectInput("x", "X variable", choices = names(iris)),
  selectInput("y", "Y variable", choices = names(iris)),
  plotOutput("plot")
)
server <- function(input, output, session) {
  output$plot <- renderPlot({
    ggplot(iris, aes(.data[[input$x]], .data[[input$y]])) +
      geom_point(position = ggforce::position_auto())
  })
}

Here I’ve used ggforce::position_auto() so that geom_point() works nicely regardless of whether the x and y variables are continuous or discrete. Alternatively, instead of using position_auto(), we could allow the user to pick the geom. The following app uses a switch() statement to genereate a reactive geom that is later added to the plot.

ui <- fluidPage(
  selectInput("x", "X variable", choices = names(iris)),
  selectInput("y", "Y variable", choices = names(iris)),
  selectInput("geom", "geom", c("point", "smooth", "jitter")),
  plotOutput("plot")
)
server <- function(input, output, session) {
  plot_geom <- reactive({
    switch(input$geom,
      point = geom_point(),
      smooth = geom_smooth(se = FALSE),
      jitter = geom_jitter()
    )
  })
  
  output$plot <- renderPlot({
    ggplot(iris, aes(.data[[input$x]], .data[[input$y]])) +
      plot_geom()
  })
}

This is one of the challenges of programming with user selected variables: your code has to become more complicated to handle all the cases the user might generate.

11.1.3 Example: dplyr

The same technique works for dplyr. The following app extends the previous simple example to allow you to choose a variable to filter, a minimum value to select, and a variable to sort by.

ui <- fluidPage(
  selectInput("var", "Select variable", choices = names(mtcars)),
  sliderInput("min", "Minimum value", 0, min = 0, max = 100),
  selectInput("sort", "Sort by", choices = names(mtcars)),
  tableOutput("data")
)
server <- function(input, output, session) {
  observeEvent(input$var, {
    rng <- range(mtcars[[input$var]])
    updateSliderInput(session, "min", value = rng[[1]], min = rng[[1]], max = rng[[2]])
  })
  
  output$data <- renderTable({
    mtcars %>% 
      filter(.data[[input$var]] > input$min) %>% 
      arrange(.data[[input$sort]])
  })
}

Most other problems can be solved by combining .data with your existing programming skills. For example, what if you wanted to conditionally sort in either ascending or descending order?

ui <- fluidPage(
  selectInput("var", "Sort by", choices = names(mtcars)),
  checkboxInput("desc", "Descending order?"),
  tableOutput("data")
)
server <- function(input, output, session) {
  sorted <- reactive({
    if (input$desc) {
      arrange(mtcars, desc(.data[[input$var]]))
    } else {
      arrange(mtcars, .data[[input$var]])
    }
  })
  output$data <- renderTable(sorted())
}

As you provide more control, you’ll find your code gets more and more complicated, and it becomes harder and harder to create a user interface that is both comprehensive and user friendly. This is why I’ve always focussed on code tools for data analysis: creating good UIs is really really hard!

11.1.4 User supplied data

Before we move on to talk about tidy-selection, there’s one last topic we need to discuss: user supplied data. Take the following app: it allows the user to upload a tsv file, then select a variable and filter by it. It will work for the vast majority of inputs that you might try it with:

ui <- fluidPage(
  fileInput("data", "dataset", accept = ".tsv"),
  selectInput("var", "var", character()),
  numericInput("min", "min", 1, min = 0, step = 1),
  tableOutput("output")
)
server <- function(input, output, session) {
  data <- reactive({
    req(input$data)
    vroom::vroom(input$data$datapath)
  })
  observeEvent(data(), {
    updateSelectInput(session, "var", choices = names(data()))
  })
  observeEvent(input$var, {
    val <- data()[[input$var]]
    updateNumericInput(session, "min", value = min(val))
  })
  
  output$output <- renderTable({
    req(input$var)
    
    data() %>% 
      filter(.data[[input$var]] > input$min) %>% 
      arrange(.data[[input$var]]) %>% 
      head(10)
  })
}

There is a subtle problem with the use of filter() here. Let’s pull the call to filter() so we can play around with it directly, outside of the app:

df <- data.frame(x = 1, y = 2)
input <- list(var = "x", min = 0)

df %>% filter(.data[[input$var]] > input$min)
#>   x y
#> 1 1 2

If you experiment with this code, you’ll find that it appears to work just fine for vast majority of data frames. However, there’s a subtle issue: what happens if the data frame contains a variable called input?

df <- data.frame(x = 1, y = 2, input = 3)
df %>% filter(.data[[input$var]] > input$min)
#> Error in input$min: $ operator is invalid for atomic vectors

We get an error message because filter() is attempting to evaluate df$input$min:

df$input$min
#> Error in df$input$min: $ operator is invalid for atomic vectors

This problem is due to the ambiguity of data-variables and env-variables, and because data-masking prefers to use a data-variable if both are available. We can resolve the problem by using .env30 to tell filter() only look for min in the env-variables:

df %>% filter(.data[[input$var]] > .env$input$min)
#>   x y input
#> 1 1 2     3

Note that you only need to worry about this problem when working with user supplied data; when working with your own data, you can ensure the names of your data-variables don’t clash with the names of your env-variables.

11.1.5 Why not use base R?

At this point you might wonder if you’re better off without filter(), and instead convert your code to use the equivalent base R code:

df[df[[input$var]] > input$min, ]
#>   x y input
#> 1 1 2     3

That’s a totally legitimate position, as long as you’re aware of the work that filter() does for you so you can generate the equivalent base R code. In this case:

  • You’ll need drop = FALSE if df contains a single column (otherwise you’ll get a vector instead of a data frame).

  • You’ll need to use which() or similar to drop any missing values.

  • You can’t do group-wise filtering (e.g. filter(df, n() == 1)).

In general, if you’re using dplyr for very simple cases, you might find it easier to use base R functions that don’t use data-masking. However, in my opinion, one of the advantages of the tidyverse is the careful thought that has been applied to edge cases so that functions work more consistently. I don’t want to oversell this, but at the same time, it’s easy to forget the quirks of specific base R functions, and write code that works 95% of the time, but fails in unusual ways the other 5% of the time.

11.2 Tidy-selection

Tidy-selection provides a concise way of selecting columns by position, name, or type. It’s used in dplyr::select() and dplyr::across(), and in many functions from tidyr, like pivot_longer(), pivot_wider(), separate(), extract(), and unite().

11.2.1 Indirection

To refer to variables indirectly use any_of() or all_of()31: both expect an character vector env-variable containing data-variable names. The only difference is what happens if you supply a variable named that doesn’t exist in the input: all_of() will throw an error, while any_of() will silently ignore it.

The following app lets the user select any number of variables using a multi-select input, along with all_of():

ui <- fluidPage(
  selectInput("vars", "Variables", names(mtcars), multiple = TRUE),
  tableOutput("data")
)

server <- function(input, output, session) {
  output$data <- renderTable({
    req(input$vars)
    mtcars %>% select(all_of(input$vars))
  })
}

11.2.2 Tidy-selection and data-masking

Working with multiple variables is trivial when you’re working with a function that uses tidy-selection: you can just pass a character vector of variable names in to any_of() or all_of(). Wouldn’t it be nice if we could do that in data-masking functions too? That’s the idea of the across() function, which was implemented in dplyr 1.0.0. It allows you to access tidy-selection inside of the data-masking functions.

across() is typically used with either one or two arguments. The first argument selects variables, and is useful in functions like group_by() or distinct(). For example, the following app allows you to select any number of variables and count their unique values.

ui <- fluidPage(
  selectInput("vars", "Variables", names(mtcars), multiple = TRUE),
  tableOutput("count")
)

server <- function(input, output, session) {
  output$count <- renderTable({
    req(input$vars)
    
    mtcars %>% 
      group_by(across(all_of(input$vars))) %>% 
      summarise(n = n())
  })
}

The second argument is a function (or list of functions) that’s applied to each selected column. That makes it a good fit for mutate() and summarise() where you typically want to transform each variable in some way. For example, the following code lets the user select any number of grouping variables, and any number of variables to summarise with their means.

ui <- fluidPage(
  selectInput("vars_g", "Group by", names(mtcars), multiple = TRUE),
  selectInput("vars_s", "Summarise", names(mtcars), multiple = TRUE),
  tableOutput("data")
)

server <- function(input, output, session) {
  output$data <- renderTable({
    mtcars %>% 
      group_by(across(all_of(input$vars_g))) %>% 
      summarise(across(all_of(input$vars_s), mean), n = n())
  })
}

If you need your code to work with older version of dplyr, you’ll need a slightly different approach. In older versions of dplyr, every data-masking function is paired with a tidy-selection variant that has the suffix _at. That approach yields the following code for the two server functions above:

server <- function(input, output, session) {
  output$count <- renderTable({
    req(input$vars)
    
    mtcars %>% 
      group_by_at(input$vars) %>% 
      summarise(n = n())
  })
}
server <- function(input, output, session) {
  output$data <- renderTable({
    mtcars %>% 
      group_by_at(input$vars_g) %>% 
      summarise_at(input$vars_s, mean)
  })
}

11.3 parse() + eval()

Before we go, it’s worth a brief comment about paste() + parse() + eval(). If you have no idea what this combination is, you can skip this section, but if you have used it, I’d like to pass on a small note of caution.

It’s a tempting approach because it requires learning very few new ideas. But it has some major downsides: because you are pasting strings together, it’s very easy to accidentally create invalid code, or code that can be abused to do something that you didn’t want. This isn’t super important if its a Shiny app that only you use, but it isn’t a good habit to get into — otherwise it’s very easy to accidentally create a security hole in an app that you share more widely.

(You shouldn’t feel bad if this is the only way you can figure out to solve a problem, but when you have a bit more mental space, I’d recommend spending some time figuring out how to do it without string manipulation. This will help you to become a better R programmer.)


  1. Selecting variables based on their type requires dplyr 1.0.0, which is currently in development.↩︎

  2. dplyr::filter() is inspired by base::subset(). subset() uses data-masking, but not through tidy evaluation, so unfortunately the techniques discussed in this chapter don’t apply to it.↩︎

  3. There’s another form of indirection that happens when you’re write functions which is solved using {{ x }}, called embracing. You can learn more about that in Programming with dplyr.↩︎

  4. Note that an expression like carat > min_carat has to look for three things: carat, min_carat, and >. That’s because R uses the same rules to look for functions and objects. If filter() didn’t also look in the environment, it wouldn’t be able to find any functions.↩︎

  5. .data is paired with .env, which is usually not needed, but we’ll come back to it later in Section 11.1.4.↩︎

  6. You might wonder if the same problem applies to variables called .data and .env. In the unlikely event of having columns with those names you’ll need to refer to them with explicitly .data$.data and .data$.env.↩︎

  7. In older versions of tidyselect and dplyr, you’ll need to use one_of(). It has the same semantics as any_of(), but a less informative name.↩︎