Re: [心得] dplyr filter and slice

作者: celestialgod (天)   2016-01-01 22:18:13
※ 引述《memphis (讓你喜歡這世界~)》之銘言:
: 當你有一個 data.frame (如下)
: ID col_a col_b
: 01 01 2
: 01 02 1
: 02 05 3
: 02 NA 4
: 03 NA 2
: 03 NA 3
: ID=c('01','01','02','02','03','03')
: col_a <- c('01','02','05',NA,NA,NA)
: col_b <- c(2,1,3,4,2,3)
: m <- data.frame(ID, col_a, col_b, stringsAsFactors=F)
: ####
: 1. 想要挑每組 col_a 最小
: m %>% group_by(ID) %>% summarize(min_a = min(col_a, na.rm=T))
: ID min_a
: (chr) (chr)
: 1 01 01
: 2 02 05
: 3 03 NA
: Warning message:
: In min(c(NA_character_, NA_character_), na.rm = TRUE) :
: no non-missing arguments, returning NA
: ####
: 2. 想要挑每組 col_a 最小時的 col_b
: m %>% group_by(ID) %>% filter(col_a = min(col_a))
: Error: filter condition does not evaluate to a logical vector.
: m %>% group_by(ID) %>% filter(rank(col_a, ties.method='first')==1)
: ID col_a col_b
: (chr) (chr) (dbl)
: 1 01 01 2
: 2 02 05 3
: 3 03 NA 2
: ####
: 3. 想要挑每組 col_a 最小時的 col_b (較快)
: m %>% group_by(ID) %>% slice(which.min(col_a))
: ID col_a col_b
: (chr) (chr) (dbl)
: 1 01 01 2
: 2 02 05 3
: ####
: 歡迎討論各種例外狀況
: 有些時候取大取小, 只接受數字, 有時候又可以自己轉換
: 有些時候文字可以比大小, 其中有些格子是空格會無法比, 要轉成NA
library(data.table)
library(plyr)
library(dplyr)
library(purrr)
library(magrittr)
library(microbenchmark)
set.seed(10)
N = 1e5
numGroup = 100
maxID = 5000
DT = data.table(group = sample(sprintf('A%02i', 1:numGroup), N, TRUE),
value = sample(1:1000, N, TRUE)) %>% tbl_dt(FALSE) %>%
mutate(ID = sprintf('%02i', sample(1:maxID, N, TRUE)),
group = factor(group, levels = sprintf('A%02i', 1:numGroup))) %>%
select(ID, group, value) %>% distinct(ID, group)
# distinct 是為了避免有同ID下有同樣組別之情形
dplyr_slice = function() DT %>% group_by(ID) %>% slice(which.min(group))
dplyr_filter_rank = function() DT %>% group_by(ID) %>%
filter(rank(group, ties.method='first')==1)
dplyr_filter_row_number = function() DT %>% group_by(ID) %>%
filter(row_number(group) == 1)
dplyr_arrange = function() DT %>% arrange(ID, group, value) %>%
group_by(ID) %>% summarise(group = group[1], value = value[1])
purrr_split_map = function() DT %>% split(.$ID) %>%
map(~ .[which.min(.$group), ]) %>% bind_rows
plyr_ddply = function() ddply(DT, .(ID), function(x) x[which.min(x$group), ])
plyr_ddply_filter = function() ddply(DT, .(ID), function(x){
x %>% filter(row_number(group) == 1)
})
all.equal(dplyr_slice() %>% arrange(ID),
dplyr_filter_rank() %>% arrange(ID)) # TRUE
all.equal(dplyr_slice() %>% arrange(ID),
dplyr_filter_row_number() %>% arrange(ID)) # TRUE
all.equal(dplyr_slice() %>% arrange(ID),
dplyr_arrange() %>% arrange(ID)) # TRUE
all.equal(dplyr_slice() %>% arrange(ID),
purrr_split_map() %>% arrange(ID)) # TRUE
all.equal(dplyr_slice() %>% arrange(ID),
plyr_ddply() %>% arrange(ID)) # TRUE
all.equal(dplyr_slice() %>% arrange(ID),
plyr_ddply_filter() %>% arrange(ID)) # TRUE
microbenchmark(dplyr_slice(), dplyr_filter_rank(),
dplyr_filter_row_number(), dplyr_arrange(), purrr_split_map(),
plyr_ddply(), plyr_ddply_filter(), times = 20L
)
# Unit: milliseconds
# expr min lq mean
# dplyr_slice() 1347.66226 1401.43846 1464.60871
# dplyr_filter_rank() 449.52015 453.61900 471.02026
# dplyr_filter_row_number() 433.67680 468.01415 480.40588
# dplyr_arrange() 41.23724 42.51137 46.32983
# purrr_split_map() 3198.05461 3274.40538 3403.10413
# plyr_ddply() 1344.17546 1397.38130 1455.63575
# plyr_ddply_filter() 3894.99410 4013.59103 4117.50118
# Unit: milliseconds
# expr median uq max neval
# dplyr_slice() 1435.9503 1516.38167 1665.08106 20
# dplyr_filter_rank() 463.6700 480.17833 526.91304 20
# dplyr_filter_row_number() 480.6615 492.46276 531.02535 20
# dplyr_arrange() 46.4951 50.18027 52.71638 20
# purrr_split_map() 3372.4479 3488.43514 3755.94195 20
# plyr_ddply() 1460.6096 1500.80851 1587.33094 20
# plyr_ddply_filter() 4114.9551 4196.85084 4461.57974 20
這樣測試下來,直接排序,再取第一個是最快的方法
rank跟row_number差不多快
slice跟ddply差不多快
作者: k75715 (風可以這麼大的嗎)   2016-01-02 08:49:00
仔細的測試只好給推了!
作者: memphis (讓你喜歡這世界~)   2016-01-03 00:02:00
唔 跟我的經驗不合 我再看看我的 難道是有沒有NA有差?還是文字數字日期會有差..我來跑跑我這邊的資料

Links booklink

Contact Us: admin [ a t ] ucptt.com