Adding a Progress Bar in R
1. Introduction
I have an R program that reads and revises data with dplyr::mutate(). Normally this is not a problem, but my data frame is very large and the processing logic is fairly complicated, so each run takes 1-2 minutes.
From time to time I cannot tell whether the run is progressing normally or the program is stuck. A progress bar would be useful, like the one readr::read_csv() shows when reading a large file.
Hopefully mutate() will gain an argument like progress_bar=TRUE in the future, but that is not the case today.
Our goal today is to add a progress bar to the code below.
library(tidyverse)
diamonds <- diamonds %>%
  mutate(sumxyz=x+y+z)
# A tibble: 53,940 x 11
   carat cut       color clarity depth table price     x     y     z sumxyz
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>  <dbl>
 1 0.23  Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43   10.4
 2 0.21  Premium   E     SI1      59.8    61   326  3.89  3.84  2.31   10.0
 3 0.23  Good      E     VS1      56.9    65   327  4.05  4.07  2.31   10.4
 4 0.290 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63   11.1
 5 0.31  Good      J     SI2      63.3    58   335  4.34  4.35  2.75   11.4
 6 0.24  Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48   10.4
 7 0.24  Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47   10.4
 8 0.26  Very Good H     SI1      61.9    55   337  4.07  4.11  2.53   10.7
 9 0.22  Fair      E     VS2      65.1    61   337  3.87  3.78  2.49   10.1
10 0.23  Very Good H     VS1      59.4    61   338  4     4.05  2.39   10.4
# ... with 53,930 more rows
2. Preliminary
There was a function called dplyr::progress_estimated() that could add a progress bar, and you may see many articles about it. It has been deprecated, so I will avoid it. Instead we will use the progress package, which dplyr recommends. It can be installed with install.packages("progress"), but most of the time it is already installed as a dependency of other packages. There is also a base R function called txtProgressBar(); it is an older, plainer text-only bar, so I will not use it here.
A progress bar can be created as follows. Before we can actually use it, we still have a few things to prepare.
library(progress)
pb <- progress_bar$new()
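The constructor also accepts options that control how the bar looks. The following is a minimal sketch, assuming the format tokens documented by the progress package (:bar, :percent, :eta); the step count of 100 and the Sys.sleep() delay are arbitrary placeholders standing in for real work.

```r
library(progress)

n_steps <- 100  # arbitrary number of steps for this demo

pb <- progress_bar$new(
  format = "[:bar] :percent eta: :eta",  # layout of the bar
  total = n_steps,                       # number of ticks until the bar is full
  clear = FALSE,                         # keep the bar on screen when finished
  width = 60                             # width of the bar in characters
)

n_done <- 0
for (i in seq_len(n_steps)) {
  pb$tick()          # advance the bar by one step
  Sys.sleep(0.01)    # placeholder for real work
  n_done <- n_done + 1
}
```

The important part is total: the bar can only show a percentage if it knows in advance how many ticks to expect.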
3. Using For loop
progress::progress_bar is designed to work with a for loop. The idea is that the bar grows a little and is redrawn in the console on each iteration.
For loops in R have not been slow for years, so speed is not a concern. However, an explicit for loop does not fit naturally inside mutate().
So we will start with a for loop. Once we have the idea, we will see how to make it work with mutate().
One other note: when we loop over a data frame directly, we iterate over its columns one by one, which prevents us from using several columns at the same time. So instead we will manually loop over the data frame row by row.
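To make the column-versus-row point concrete, here is a small sketch on a toy data frame (df, col_lengths and row_sums are names made up for this illustration, not part of the original code).

```r
df <- data.frame(x = c(1, 2, 3), y = c(10, 20, 30))

# Looping over the data frame itself visits one whole column at a time.
col_lengths <- integer()
for (col in df) {
  col_lengths <- c(col_lengths, length(col))
}
# col_lengths now has one entry per column, each equal to nrow(df)

# Looping over row indices gives access to every column of one row.
row_sums <- numeric(nrow(df))
for (i in seq_len(nrow(df))) {
  row_sums[[i]] <- df$x[[i]] + df$y[[i]]
}
row_sums
```

Only the second loop lets us combine x and y from the same row, which is what the progress bar example needs.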
# Sum x, y and z for a single row of the data frame
my_sum <- function(row) {
  row$x + row$y + row$z
}

pb <- progress_bar$new(total=nrow(diamonds))
sumxyz2 <- vector("double", nrow(diamonds))
for(i in seq_len(nrow(diamonds))) {
  pb$tick()  # advance the bar by one step
  sumxyz2[[i]] <- my_sum(diamonds[i,])
}
diamonds$sumxyz2 <- sumxyz2
diamonds
A beautiful progress bar is shown in the console while our code runs.
4. Using Map
map() is essentially a wrapper around a for loop, so anything we can do in a for loop we can do in map(). Besides, the map_*() family from purrr is implemented in C, so it will be slightly faster than our hand-written loop.
The drawback of the map_*() functions is that they will bite you if you are not familiar with them. They do not show each step as clearly as a for loop does. I am not a big fan of map() myself, so it sometimes takes me a while to figure things out.
Here we will use pmap_*() because we need several columns at the same time; map_*() iterates over a single input only.
One other important thing: when the input list is named, pmap() passes its elements to the mapped function by name. So we will use a named list to organize the input variables. If you pass the whole data frame without selecting and naming the columns beforehand, the function receives arguments it does not accept and the call fails.
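The effect of names on pmap() can be seen in a small sketch. The function f and the input values are made up for illustration; subtraction is used because it makes the argument order visible.

```r
library(purrr)

# Toy function: the result depends on which input becomes x and which becomes y
f <- function(x, y) x - y

# Named list: elements are matched to arguments by name, regardless of order
by_name <- pmap_dbl(list(y = c(1, 2), x = c(10, 20)), f)

# Unnamed list: elements are matched to arguments by position
by_position <- pmap_dbl(list(c(1, 2), c(10, 20)), f)

by_name      # x = 10, 20 and y = 1, 2
by_position  # x = 1, 2 and y = 10, 20
```

Naming the list makes the call independent of column order, which is the safer choice when the inputs come from a wide data frame.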
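my_sum <- function(x, y, z, .pb=NULL) {
  # Tick only when a progress bar was supplied
  if (!is.null(.pb)) .pb$tick()
  x + y + z
}

# Named list so that pmap matches the columns to x, y and z by name
args <- list(x=diamonds$x, y=diamonds$y, z=diamonds$z)
pb <- progress_bar$new(total=nrow(diamonds))
sumxyz3 <- pmap_dbl(args, my_sum, .pb=pb)
diamonds$sumxyz3 <- sumxyz3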
A beautiful progress bar is shown in the console while our code runs.
5. Back to mutate()
Now we have a solution that uses pmap() and manually assigns the output to our data frame. We can let mutate() do the assignment instead. Inside mutate() we still need to pick the columns we want.
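my_sum <- function(x, y, z, .pb=NULL) {
  # Tick only when a progress bar was supplied
  if (!is.null(.pb)) .pb$tick()
  x + y + z
}

pb <- progress_bar$new(total=nrow(diamonds))
diamonds <- diamonds %>%
  mutate(sumxyz4=pmap_dbl(
    list(x, y, z),  # unnamed list: matched to x, y, z by position
    my_sum,
    .pb=pb
  ))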
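If several row-wise functions need a progress bar, adding a .pb argument to each one gets repetitive. One possible variation, sketched here under the assumption that a small wrapper is acceptable, is a helper that adds the tick to any function. The name with_tick and the toy tibble are hypothetical and not part of the original post.

```r
library(dplyr)
library(purrr)
library(progress)

# Hypothetical helper: returns a version of f that ticks pb before each call
with_tick <- function(f, pb) {
  force(f)
  force(pb)
  function(...) {
    pb$tick()
    f(...)
  }
}

# Toy data standing in for a large data frame
df <- tibble(x = c(1, 2, 3), y = c(4, 5, 6), z = c(7, 8, 9))
pb <- progress_bar$new(total = nrow(df))

df <- df %>%
  mutate(sumxyz = pmap_dbl(
    list(x, y, z),
    with_tick(function(x, y, z) x + y + z, pb)
  ))
df
```

The row-wise function stays free of any progress logic, so it can be reused and tested without a bar.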
Source: https://www.cnblogs.com/drvongoosewing/p/14191824.html