Using purrr's map family functions in dplyr::mutate
The map family of functions in the purrr package is very useful for applying non-vectorized functions in a dplyr::mutate chain (see GitHub - jennybc/row-oriented-workflows: Row-oriented workflows in R with the tidyverse or https://www.jimhester.com/2018/04/12/vectorize/).
I run into this need especially when dealing with nested data frames.
One drawback is that matching column names to input arguments becomes confusing when the function of interest needs more than two columns of your data frame (that is, when you use the pmap family).
This post first briefly reviews how mutate works in combination with map or map2, then provides two approaches to avoid confusion around name assignment when using pmap.
- How mutate works with vectorized functions
- Non-vectorized function with one or two input arguments (map or map2)
- Non-vectorized function with three or more input arguments (pmap)
How mutate works with vectorized functions
In most cases, the operation you want to perform in mutate is vectorized and there is no need to use a map family function.
This works because the output of the operation (column c in the example below) has as many elements as the data frame has rows, so mutate only needs to append one column to the data frame.
library(tidyverse)
df0 <- tibble(a = 1:3, b = 4:6)
df0 %>% mutate(c = a + b)
Non-vectorized function with one or two input arguments (map or map2)
Imagine that we want to create a new column containing an arithmetic progression for each row [ref (in Japanese)].
Since the seq function is not vectorized, we cannot use it directly in a mutate chain.
df1 <- tibble(a = c(1, 2), b = c(3, 6), c = c(8, 10))
df1 %>% mutate(d = seq(a, b))
# Error in mutate_impl(.data, dots) :
#   Evaluation error: 'from' must be of length 1.
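The same failure can be reproduced outside of mutate, which may make it clearer that the problem lies with seq rather than with dplyr. The sketch below is only an illustration; the variable name seq_result is hypothetical, and the exact wording of the message may differ between R versions and locales.

```r
# Passing whole vectors as 'from' and 'to' fails, because seq() expects
# a single start value and a single end value.
# seq_result is a hypothetical name used only for this illustration.
seq_result <- tryCatch(
  seq(c(1, 2), c(3, 6)),
  error = function(e) conditionMessage(e)
)

# On error, seq_result holds the error message as a character string.
print(seq_result)
```

Inside mutate, the columns a and b are passed to seq as whole vectors in exactly this way, which is why the call has to be applied row by row instead.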
Instead, we can use a map family function here.
map family functions take vectors or lists as input and apply the function of interest to each element of the given inputs.
Because each column of a data frame in R is a vector (and the data frame itself is a list of such columns), map works very well in combination with mutate.
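As a minimal sketch of this element-by-element behaviour, the example below calls map2 on two plain vectors outside of any data frame. The names from_values, to_values, and progressions are hypothetical and chosen only for this illustration.

```r
library(purrr)

# Hypothetical inputs: two plain numeric vectors of the same length.
from_values <- c(1, 2)
to_values <- c(3, 6)

# map2() walks over both vectors in parallel and calls
# seq(from_values[i], to_values[i]) for each position i.
progressions <- map2(from_values, to_values, seq)

# The result is a list with one sequence per input pair.
print(progressions)
```

Because the result is a list with one element per input pair, it has the right length to be stored as a list-column by mutate.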
In this example, we want to pass two input arguments to the seq function, from and to.
map2 is the appropriate function for this.
df2 <- df1 %>% mutate(d = map2(a, b, seq))
as.data.frame(df2)
#   a b  c             d
# 1 1 3  8       1, 2, 3
# 2 2 6 10 2, 3, 4, 5, 6
The figure below shows how the map function handles this process in a mutate chain.
As with map outside mutate, we can use map_dbl or map_chr to create columns of type double or character.
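A minimal sketch of the typed variants is shown below, using the two-input versions map2_dbl and map2_chr. The data frame df_typed and the column names mid and label are hypothetical and exist only for this illustration.

```r
library(dplyr)
library(purrr)
library(tibble)

# df_typed, mid and label are hypothetical names for this sketch.
df_typed <- tibble(a = c(1, 2), b = c(3, 6)) %>%
  mutate(
    # map2_dbl() returns a plain double vector instead of a list-column.
    mid = map2_dbl(a, b, ~ mean(seq(.x, .y))),
    # map2_chr() returns a plain character vector.
    label = map2_chr(a, b, ~ paste(.x, .y, sep = "-"))
  )

print(df_typed)
```

The typed variants require that the function returns exactly one value of the requested type per row, so they fit summaries of each row rather than whole sequences.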
If we want to refer to the arguments explicitly, .x and .y can be used.
See what happens with this:
df2 <- mutate(df1, d = map2(a, b, ~seq(.y, .x)))
as.data.frame(df2)
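To check the effect of swapping .x and .y without relying on printed output, a small self-contained sketch is given below. The name df_rev is hypothetical; .x always refers to the first input (a) and .y to the second (b), so seq(.y, .x) counts from b down to a.

```r
library(dplyr)
library(purrr)
library(tibble)

# df_rev is a hypothetical name used only for this sketch.
df_rev <- tibble(a = c(1, 2), b = c(3, 6)) %>%
  mutate(d = map2(a, b, ~ seq(.y, .x)))

# Each element now runs from b down to a.
print(df_rev$d)
```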
Non-vectorized function with three or more input arguments (pmap)
Matching columns to arguments becomes confusing when using three or more columns, because the .x and .y shorthand is no longer available (pmap formulas only offer positional placeholders such as ..1, ..2, ..3).
Let's take a look at the following example using the rnorm function.
Case example
Generate a vector of random numbers for each row with the rnorm function. Each row of the original data frame contains different values of mean, sd, and n.
We will first prepare a data frame with columns corresponding to mean, sd, and n, and then apply the rnorm function to each row using pmap.
Each element of the new column data contains a vector of random samples *1.
This type of structure is called a "nested data frame", and there are many resources on it, such as 25 Many models | R for Data Science.
A simple case
If your data frame has exactly the same column names and number of columns as the input arguments of the function of interest, a simple syntax like the one below works *2.
df4 <- tribble(
  ~mean, ~sd, ~n,
  1, 0.03, 2,
  10, 0.1, 4,
  5, 0.1, 4
)
df4.2 <- df4 %>% mutate(data = pmap(., rnorm))
as.data.frame(df4.2)
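Because pmap matches the columns to the arguments of rnorm by name, the column order should not matter. The sketch below checks this with a hypothetical data frame, df_reordered, whose columns are deliberately shuffled. It inspects only the length of each generated vector, since the values themselves are random.

```r
library(dplyr)
library(purrr)
library(tibble)

# df_reordered is a hypothetical data frame with the same columns as
# the example above, but in a different order.
df_reordered <- tribble(
  ~sd,  ~n, ~mean,
  0.03,  2,     1,
  0.1,   4,    10,
  0.1,   4,     5
)

res_reordered <- df_reordered %>% mutate(data = pmap(., rnorm))

# The number of samples in each row should equal that row's n.
sample_counts <- lengths(res_reordered$data)
print(sample_counts)
```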
One caution: a syntax like the one below does not work as intended.
Inside the formula, n, mean, and sd refer to the whole columns rather than to the values of the current row, so pmap effectively calls rnorm(df4$n, df4$mean, df4$sd) for every row, and each element of the new column contains three random samples drawn from the same vectors of mean and sd *3.
df4 %>%
  mutate(data = pmap(., ~rnorm(n, mean, sd))) %>%
  as.data.frame() # Wrong answer
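The difference can be made visible by counting the samples per row. The sketch below uses a hypothetical copy of the data, df_demo, and compares the formula above with a variant that uses purrr's positional placeholders (..1, ..2, ..3). The positional variant is an alternative not covered elsewhere in this post, and it depends on the column order mean, sd, n.

```r
library(dplyr)
library(purrr)
library(tibble)

# df_demo is a hypothetical copy of the example data.
df_demo <- tribble(
  ~mean, ~sd, ~n,
  1, 0.03, 2,
  10, 0.1, 4,
  5, 0.1, 4
)

# n, mean and sd are looked up as whole columns, so every row gets
# length(n) = 3 samples instead of n samples.
wrong_counts <- df_demo %>%
  mutate(data = pmap(., ~ rnorm(n, mean, sd))) %>%
  pull(data) %>%
  lengths()

# Positional placeholders refer to the current row:
# ..1 is mean, ..2 is sd, ..3 is n (given this column order).
positional_counts <- df_demo %>%
  mutate(data = pmap(., ~ rnorm(n = ..3, mean = ..1, sd = ..2))) %>%
  pull(data) %>%
  lengths()

print(wrong_counts)
print(positional_counts)
```

Positional placeholders avoid the wrong answer but silently break when columns are reordered, which is one reason to prefer name-based matching.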
Number of columns > Number of input arguments
In most cases, however, you will have more columns than the function has input arguments.
pmap complains in this case, saying that there is an unused argument.
df5 <- tribble(
  ~mean, ~sd, ~dummy, ~n,
  1, 0.03, "a", 2,
  10, 0.1, "b", 4,
  5, 0.1, "c", 4
)
df5 %>% mutate(data = pmap(., rnorm)) # Error
There are two ways to avoid this error.
Make a small list on the fly
The first method is to create a small list that contains only the necessary columns (Ref: Dplyr: Alternatives to rowwise - tidyverse - RStudio Community).
df5.2 <- df5 %>%
  mutate(data = pmap(list(n=n, mean=mean, sd=sd), rnorm))
as.data.frame(df5.2)
Here, list(n=n, mean=mean, sd=sd) creates a new list with three vectors named n, mean, and sd, which serves the same purpose as the df4 data frame in the example above.
Note that if you don't name the elements of the new list, they are matched to the input arguments of rnorm by position.
My recommendation is to always name the list elements.
df5 %>% mutate(data = pmap(list(n, mean, sd), rnorm)) # Correct but not recommended
df5 %>% mutate(data = pmap(list(mean, sd, n), rnorm)) # Wrong answer
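The effect of positional matching can be checked by counting the samples per row. In the sketch below, df_cols is a hypothetical copy of the data. With the unnamed list(mean, sd, n), the mean column ends up in the first argument of rnorm, which is n, so the number of samples follows mean instead of n.

```r
library(dplyr)
library(purrr)
library(tibble)

# df_cols is a hypothetical copy of the example data.
df_cols <- tribble(
  ~mean, ~sd, ~dummy, ~n,
  1, 0.03, "a", 2,
  10, 0.1, "b", 4,
  5, 0.1, "c", 4
)

# Named list: elements are matched to rnorm's arguments by name.
named_counts <- df_cols %>%
  mutate(data = pmap(list(n = n, mean = mean, sd = sd), rnorm)) %>%
  pull(data) %>%
  lengths()

# Unnamed list in the wrong order: matched by position, so the mean
# column is used as n, the sd column as mean, and the n column as sd.
unnamed_counts <- df_cols %>%
  mutate(data = pmap(list(mean, sd, n), rnorm)) %>%
  pull(data) %>%
  lengths()

print(named_counts)
print(unnamed_counts)
```

No error or warning is raised in the unnamed case, so the mistake is easy to miss unless the output is inspected.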
Use ... to ignore unused columns
The second method is to absorb unused columns with ...
(Ref: Map over multiple inputs simultaneously. — map2 • purrr).
A syntax like the one below works because pmap automatically matches the names of the input list to the argument names in function().
In other words, the column names of the data frame must match the argument names in function().
df5.3 <- df5 %>%
  mutate(data = pmap(., function(n, mean, sd, ...) rnorm(n=n, mean=mean, sd=sd)))
as.data.frame(df5.3)
The arguments of function() and rnorm() are not automatically matched by name. It is recommended to explicitly name the arguments when calling the function of interest (rnorm in this case).
df5 %>% mutate(data = pmap(., function(n, mean, sd, ...) rnorm(n, mean, sd))) # Correct but not recommended
df5 %>% mutate(data = pmap(., function(n, mean, sd, ...) rnorm(mean, sd, n))) # Wrong answer
df5 %>% mutate(data = pmap(., function(mean, sd, n, ...) rnorm(mean, sd, n))) # Wrong answer
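One way to keep the argument matching readable is to give the wrapper a name and define it once, outside the pipeline. The sketch below is only one possible arrangement; draw_samples and df_extra are hypothetical names introduced for this illustration.

```r
library(dplyr)
library(purrr)
library(tibble)

# df_extra is a hypothetical copy of the example data.
df_extra <- tribble(
  ~mean, ~sd, ~dummy, ~n,
  1, 0.03, "a", 2,
  10, 0.1, "b", 4,
  5, 0.1, "c", 4
)

# draw_samples is a hypothetical wrapper: pmap() matches the columns
# n, mean and sd by name, and ... absorbs every other column (dummy).
# Inside, the arguments are passed to rnorm() explicitly by name.
draw_samples <- function(n, mean, sd, ...) {
  rnorm(n = n, mean = mean, sd = sd)
}

res_extra <- df_extra %>% mutate(data = pmap(., draw_samples))

# Each row should contain n samples.
extra_counts <- lengths(res_extra$data)
print(extra_counts)
```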
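A syntax like the one below gives unexpected output, as you saw in the df4 example, because n, mean, and sd are never bound as arguments of function() and instead refer to the whole columns.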
df5 %>%
  mutate(data = pmap(., function(...) rnorm(n=n, mean=mean, sd=sd))) # Wrong answer
Column names are different from the input argument names
You can use either of the two approaches above.
df6 <- tribble(
  ~mean1, ~sd1, ~dummy, ~n1,
  1, 0.03, "a", 2,
  10, 0.1, "b", 4,
  5, 0.1, "c", 4
)
# Approach 1: make a small named list on the fly
df6.2 <- df6 %>%
  mutate(data = pmap(list(mean=mean1, sd=sd1, n=n1), rnorm))
as.data.frame(df6.2)
# Approach 2: absorb unused columns with ...
df6.3 <- df6 %>%
  mutate(data = pmap(., function(n1, mean1, sd1, ...) rnorm(n=n1, mean=mean1, sd=sd1)))
as.data.frame(df6.3)
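As a rough check that the two approaches are interchangeable, the sketch below resets the random seed before each one and compares the generated samples. The names df_renamed, by_list, and by_dots are hypothetical, and the seed value is arbitrary.

```r
library(dplyr)
library(purrr)
library(tibble)

# df_renamed is a hypothetical copy of the example data.
df_renamed <- tribble(
  ~mean1, ~sd1, ~dummy, ~n1,
  1, 0.03, "a", 2,
  10, 0.1, "b", 4,
  5, 0.1, "c", 4
)

# Approach 1: small named list built on the fly.
set.seed(1)
by_list <- df_renamed %>%
  mutate(data = pmap(list(mean = mean1, sd = sd1, n = n1), rnorm))

# Approach 2: wrapper function that absorbs unused columns with ...
set.seed(1)
by_dots <- df_renamed %>%
  mutate(data = pmap(., function(n1, mean1, sd1, ...) {
    rnorm(n = n1, mean = mean1, sd = sd1)
  }))

# Both approaches call rnorm() row by row with the same arguments, so
# with the same seed the flattened samples can be compared directly.
same_samples <- isTRUE(all.equal(unlist(by_list$data), unlist(by_dots$data)))
print(same_samples)
```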
*1: In the examples below (and above), we additionally use the as.data.frame function to expose the actual numbers in the vectors.
*2: This works even if the order of the columns is different from the order of the input arguments.
*3: This happens because rnorm is actually vectorized. See ?rnorm: "If length(n) > 1, the length is taken to be the number required."