Using non-vectorized functions in dplyr::mutate and how to associate column names for purrr::pmap

The map family of functions in the purrr package is very useful for applying non-vectorized functions in a dplyr::mutate chain (see GitHub - jennybc/row-oriented-workflows: Row-oriented workflows in R with the tidyverse or Beware of Vectorize · Jim Hester's blog).
I run into this need especially when dealing with nested data frames.

One of the drawbacks is that matching column names to input arguments becomes confusing when you want to use more than two columns of your data frame (and therefore the pmap family) for the function of interest. This post first briefly reviews how mutate works in combination with map or map2, then provides two approaches to avoid confusion around name assignments when using pmap.

How mutate works with vectorized functions

In most cases, the operation you want to perform in mutate is vectorized, so there is no need for a map family function. This works because the output of the operation (the new column c in the example below) has the same length as the number of rows in the original data frame, so mutate only needs to append one column to the data frame.

library(tidyverse)

df0 <- tibble(a = 1:3, b = 4:6)

df0 %>% mutate(c = a + b)

Non-vectorized function with one or two input arguments (map or map2)

Imagine that we want to create a new column containing an arithmetic progression in each row [ref (in Japanese)]. Since the seq function is not vectorized, we cannot use it directly in a mutate chain.

df1 <- tibble(a = c(1, 2), b = c(3, 6), c = c(8, 10))

df1 %>% mutate(d = seq(a, b))
# Error in mutate_impl(.data, dots) : Evaluation error: 'from' must be of length 1.

Instead, we can use a map family function here. map family functions take vectors or lists as input and apply the function of interest to each of their elements. Because each column of a data frame in R is a vector (or a list) with one element per row, map works very well in combination with mutate.

In this example, we want to provide two input arguments to the seq function, from and to. map2 is the appropriate function for this.

df2 <- df1 %>% mutate(d = map2(a, b, seq))

as.data.frame(df2)
#  a b  c             d
#1 1 3  8       1, 2, 3
#2 2 6 10 2, 3, 4, 5, 6
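
To see what map2 does on its own, here is a minimal sketch that calls it outside mutate. The vector names from_values and to_values are made up for this illustration; they simply hold the same values as columns a and b of df1.

```r
library(purrr)

# Hypothetical standalone vectors mirroring columns a and b of df1
from_values <- c(1, 2)
to_values <- c(3, 6)

# map2 walks both vectors in parallel and calls
# seq(from_values[i], to_values[i]) for each position i
res <- map2(from_values, to_values, seq)

length(res)   # one list element per input pair
res[[1]]      # 1 2 3
res[[2]]      # 2 3 4 5 6
```

Inside mutate, the same call receives the columns a and b instead of these standalone vectors, and the resulting list becomes the new list column.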

The figure below shows how map2 handles this process in a mutate chain.

[Figure: how map handles this process in a mutate chain]

As with map functions used outside mutate, we can use typed variants such as map_dbl or map_chr (map2_dbl or map2_chr for two inputs) to create columns of double or character type.
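
As a small sketch of the typed variants, the code below summarizes each sequence into a single value per row. The column names d_mean and d_text are chosen only for this illustration.

```r
library(dplyr)
library(purrr)

df1 <- tibble(a = c(1, 2), b = c(3, 6), c = c(8, 10))

df_typed <- df1 %>%
  mutate(
    # map2_dbl returns a plain double vector instead of a list
    d_mean = map2_dbl(a, b, ~ mean(seq(.x, .y))),
    # map2_chr returns a plain character vector
    d_text = map2_chr(a, b, ~ paste(seq(.x, .y), collapse = ", "))
  )

df_typed$d_mean   # 2 4
df_typed$d_text   # "1, 2, 3" "2, 3, 4, 5, 6"
```

The typed variants only work when the function returns exactly one value per row, which is why the sequences are collapsed with mean and paste here.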

If we want to refer to the arguments explicitly, .x and .y can be used. See what happens with this:

df2 <- mutate(df1, d = map2(a, b, ~seq(.y, .x)))

as.data.frame(df2)
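
For reference, here is a short sketch of what the swapped call produces, next to a variant that names the arguments of seq. The object names df_swapped and df_named are made up for this illustration.

```r
library(dplyr)
library(purrr)

df1 <- tibble(a = c(1, 2), b = c(3, 6), c = c(8, 10))

# .x is the first input (a) and .y is the second (b),
# so seq(.y, .x) counts down from b to a
df_swapped <- df1 %>% mutate(d = map2(a, b, ~ seq(.y, .x)))
df_swapped$d[[1]]   # 3 2 1
df_swapped$d[[2]]   # 6 5 4 3 2

# Naming the arguments of seq makes the direction explicit
df_named <- df1 %>% mutate(d = map2(a, b, ~ seq(from = .x, to = .y)))
df_named$d[[1]]     # 1 2 3
```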

Non-vectorized function with three or more input arguments (pmap)

Assigning column names becomes confusing when using three or more columns, because the .x and .y shorthand is no longer available (pmap formulas only offer the positional ..1, ..2, ..3, and so on). Let's take a look at the following example using the rnorm function.

Case example

Generate a list of random numbers for each row with the rnorm function. Each row of the original data frame contains different values of mean, sd, and n.

We will first prepare a data frame with columns corresponding to mean, sd, and n, and apply the rnorm function to each row using pmap. Each element of the new column data contains a vector of random samples *1. This type of structure is called a "nested data frame", and there are many resources on it, such as R for Data Science.

A simple case

If your data frame has exactly the same column names and the same number of columns as the input arguments of the function of interest, a simple syntax like the one below works *2.

df4 <- 
  tribble(~mean, ~sd, ~n,
          1,  0.03, 2,
          10, 0.1,  4,
          5,  0.1,  4)

df4.2 <- 
  df4 %>% 
  mutate(data = pmap(., rnorm)) 

as.data.frame(df4.2)

One caution is that a syntax like the one below doesn't work. Inside the formula, n, mean, and sd refer to the whole columns rather than to the values of the current row, so pmap ends up calling rnorm(df4$n, df4$mean, df4$sd) for every row, and each element of the new column contains three random samples drawn from the same vectors of mean and sd. *3

df4 %>% mutate(data = pmap(., ~rnorm(n, mean, sd))) %>% as.data.frame() # Wrong answer
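
One way to spot this problem is to check the length of each generated vector, which should equal n for that row. The sketch below does that, and also shows a formula version that does work by using purrr's positional placeholders ..1, ..2, ..3. This positional variant is only an illustration, not one of the two approaches recommended in this post, and it assumes the column order mean, sd, n. The object names wrong and fixed are made up.

```r
library(dplyr)
library(purrr)

df4 <- tibble(mean = c(1, 10, 5), sd = c(0.03, 0.1, 0.1), n = c(2, 4, 4))

# n, mean and sd are looked up as whole columns inside the formula,
# so every row gets length(df4$n) = 3 samples
wrong <- df4 %>% mutate(data = pmap(., ~ rnorm(n, mean, sd)))
lengths(wrong$data)   # 3 3 3

# ..1, ..2, ..3 are the per-row values of the 1st, 2nd and 3rd column
fixed <- df4 %>% mutate(data = pmap(., ~ rnorm(n = ..3, mean = ..1, sd = ..2)))
lengths(fixed$data)   # 2 4 4
```

Because ..1, ..2, and ..3 depend on column order, this variant breaks silently if the columns are rearranged, which is one reason to prefer name-based matching.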

Number of columns > Number of input arguments

In most cases, however, you will have more columns than input arguments. pmap complains in this case, saying that you have an unused argument.

df5 <- 
  tribble(~mean, ~sd, ~dummy, ~n,
          1,  0.03, "a", 2,
          10, 0.1,  "b", 4,
          5,  0.1,  "c", 4)

df5 %>% mutate(data = pmap(., rnorm))  # Error
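
If you want to look at the error without stopping a script, a sketch like the one below captures it with tryCatch. The exact wording of the message depends on the purrr version, so it is printed rather than quoted here. The name result is made up for this illustration.

```r
library(dplyr)
library(purrr)

df5 <-
  tribble(~mean, ~sd, ~dummy, ~n,
          1,  0.03, "a", 2,
          10, 0.1,  "b", 4,
          5,  0.1,  "c", 4)

# pmap passes every column as a named argument, so rnorm also receives
# dummy = "a", which it does not accept
result <- tryCatch(pmap(df5, rnorm), error = function(e) e)

inherits(result, "error")   # TRUE
conditionMessage(result)    # message text varies by purrr version
```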

There are two ways to avoid this error.

Make a small list on the fly

The first method is to create a small list that contains only the necessary columns (Ref: Dplyr: Alternatives to rowwise - tidyverse - RStudio Community).

df5.2 <- 
  df5 %>% 
  mutate(data = pmap(list(n=n, mean=mean, sd=sd), rnorm))

as.data.frame(df5.2)

Here, list(n=n, mean=mean, sd=sd) creates a new list of three vectors named n, mean, and sd, which serves the same purpose as the df4 data frame in the example above.

Note that if you don't name the elements of the new list, they are matched to the arguments of rnorm by position. My recommendation is to always name the list elements.

df5 %>% mutate(data = pmap(list(n, mean, sd), rnorm)) # Correct but not recommended
df5 %>% mutate(data = pmap(list(mean, sd, n), rnorm)) # Wrong answer
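
The difference between matching by name and matching by position can be checked with the length of each generated vector. This is a minimal sketch; the object names by_name and by_position are made up for this illustration.

```r
library(dplyr)
library(purrr)

df5 <-
  tribble(~mean, ~sd, ~dummy, ~n,
          1,  0.03, "a", 2,
          10, 0.1,  "b", 4,
          5,  0.1,  "c", 4)

# Named elements are matched to rnorm's arguments by name,
# so the order inside list() does not matter
by_name <- df5 %>% mutate(data = pmap(list(sd = sd, n = n, mean = mean), rnorm))
lengths(by_name$data)       # 2 4 4

# Unnamed elements are matched by position: the mean column is used as n,
# sd as mean, and n as sd
by_position <- df5 %>% mutate(data = pmap(list(mean, sd, n), rnorm))
lengths(by_position$data)   # 1 10 5
```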

Use ... to ignore unused columns

The second method is to absorb unused columns with ... (Ref: Map over multiple inputs simultaneously. — map2 • purrr). A syntax like the one below works because pmap automatically matches the names of the input list to the argument names in function(). In other words, the column names of the data frame must match the argument names in function().

df5.3 <- 
  df5 %>% 
  mutate(data = pmap(., function(n, mean, sd, ...) rnorm(n=n, mean=mean, sd=sd))) 

as.data.frame(df5.3)

The arguments of function() are not automatically matched by name to the arguments of rnorm(). It is recommended to name the arguments explicitly when calling the function of interest (rnorm in this case).

df5 %>% mutate(data = pmap(., function(n, mean, sd, ...) rnorm(n, mean, sd))) # Correct but not recommended
df5 %>% mutate(data = pmap(., function(n, mean, sd, ...) rnorm(mean, sd, n))) # Wrong answer
df5 %>% mutate(data = pmap(., function(mean, sd, n, ...) rnorm(mean, sd, n))) # Wrong answer

A syntax like the one below gives unexpected outputs, as you saw in the df4 example.

df5 %>% mutate(data = pmap(., function(...) rnorm(n=n, mean=mean, sd=sd))) # Wrong answer
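
If you do want a function that takes only ..., one possible workaround is to capture the per-row values with list(...) and pick them out by name. This is a sketch rather than one of the two approaches recommended above; the names row, via_dots, and via_list are made up for this illustration.

```r
library(dplyr)
library(purrr)

df5 <-
  tribble(~mean, ~sd, ~dummy, ~n,
          1,  0.03, "a", 2,
          10, 0.1,  "b", 4,
          5,  0.1,  "c", 4)

set.seed(1)
via_dots <- df5 %>%
  mutate(data = pmap(., function(...) {
    row <- list(...)  # per-row values, named after the columns
    rnorm(n = row[["n"]], mean = row[["mean"]], sd = row[["sd"]])
  }))
lengths(via_dots$data)   # 2 4 4

# Same seed, small-list approach: the draws are identical
set.seed(1)
via_list <- df5 %>% mutate(data = pmap(list(n = n, mean = mean, sd = sd), rnorm))
identical(via_dots$data, via_list$data)   # TRUE
```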

Column names are different from the input argument names

You can use either of the two approaches above.

df6 <- 
  tribble(~mean1, ~sd1, ~dummy, ~n1,
          1,  0.03, "a", 2,
          10, 0.1,  "b", 4,
          5,  0.1,  "c", 4)

df6.2 <-
  df6 %>% mutate(data = pmap(list(mean=mean1, sd=sd1, n=n1), rnorm)) 
as.data.frame(df6.2)

df6.3 <- 
  df6 %>% mutate(data = pmap(., function(n1, mean1, sd1, ...) rnorm(n=n1, mean=mean1, sd=sd1)))
as.data.frame(df6.3)

*1:In the examples below (and above), we additionally use the as.data.frame function to expose the actual values in the vectors

*2:This works even if the order of the columns is different from the order of input arguments

*3:This happens because rnorm is actually vectorized. See ?rnorm: If length(n) > 1, the length is taken to be the number required.