r - Using dplyr to make a sample from a data frame
I have a large data frame (150.000.000 rows) with a format like this:
df = data.frame(pnr = rep(500+2*(1:15),each=3), x = runif(3*15))
pnr is the person id and x is some data. I would like to sample 10% of the persons. Is there a fast way to do this in dplyr?
The following is a solution, but it is slow because of the merge statement:
pnrs = as.data.frame(unique(df$pnr))
names(pnrs)[1] = "pnr"
pnrs$s = rbinom(nrow(pnrs), 1, 0.1)  # flag each person with probability 0.1
df = merge(df, pnrs)                 # joins on the shared "pnr" column
df2 = df[df$s == 1, ]
I would suggest the "data.table" package over "dplyr" for this. Here's an example with some big-ish sample data (not much smaller than your own 15 million rows).
I'll show some right and wrong ways to do things :-)
Here's the sample data.
library(data.table)
library(dplyr)
library(microbenchmark)
set.seed(1)
mydf <- dt <- data.frame(person = sample(10000, 1e7, TRUE),
                         value = runif(1e7))
We'll create a "data.table" and set the key to "person". Creating the "data.table" takes no significant time, but setting the key can.
system.time(setDT(dt))
#    user  system elapsed
#   0.001   0.000   0.001

## Setting the key takes some time, but is worth it
system.time(setkey(dt, person))
#    user  system elapsed
#   0.620   0.025   0.646
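To make the effect of setting a key concrete, here is a minimal sketch on a four-row table. The table `toy` and its values are made up for illustration; `key()` and `setkey()` are the data.table functions used above.

```r
library(data.table)

# Hypothetical toy table, deliberately not sorted by person
toy <- data.table(person = c(3L, 1L, 2L, 1L),
                  value  = c(0.3, 0.1, 0.2, 0.4))

key_before <- key(toy)     # NULL: no key has been set yet

setkey(toy, person)        # sorts the rows by person, by reference

key_after <- key(toy)      # the key column name
sorted_persons <- toy$person
```

Because the rows are now sorted on the key column, lookups on `person` can use a binary search instead of scanning every row.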
I can't think of a more efficient way to select your "person" values than the following, so I've removed these steps from the benchmarks, since they are common to all approaches.
## Common to all tests...
a <- unique(mydf$person)
b <- sample(a, ceiling(.1 * length(a)), FALSE)
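Note that drawing a fixed number of ids with `sample` is not quite the same as the `rbinom` flagging in the question. A small sketch on the question's toy data shows the difference; the names `keep_fixed`, `keep_bern` and `sub_fixed` are made up for this illustration.

```r
# Toy data in the question's format: 15 persons, 3 rows each
set.seed(42)
df <- data.frame(pnr = rep(500 + 2 * (1:15), each = 3), x = runif(3 * 15))

ids <- unique(df$pnr)

# Fixed-size sample: always ceiling(10%) of the distinct ids
keep_fixed <- sample(ids, ceiling(0.1 * length(ids)), replace = FALSE)

# Bernoulli sample: each id is kept independently with probability 0.1,
# so the number of sampled persons varies from run to run
keep_bern <- ids[rbinom(length(ids), 1, 0.1) == 1]

# All rows belonging to the sampled persons
sub_fixed <- df[df$pnr %in% keep_fixed, ]
```

Which variant is appropriate depends on whether you need an exact share of the persons or only an expected share of 10%.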
For convenience, the different tests are presented as functions...
## Base R #1
fun1a <- function() {
  mydf[mydf$person %in% b, ]
}

## Base R #2--sometimes using `which` makes things quicker
fun1b <- function() {
  mydf[which(mydf$person %in% b), ]
}

## `filter` from "dplyr"
fun2 <- function() {
  filter(mydf, person %in% b)
}

## The "wrong" way to do it with "data.table"
fun3a <- function() {
  dt[which(person %in% b)]
}

## The "right" (I think) way to do it with "data.table"
fun3b <- function() {
  dt[J(b)]
}
Now, we can benchmark:
## Benchmarking
microbenchmark(fun1a(), fun1b(), fun2(), fun3a(), fun3b(), times = 20)
# Unit: milliseconds
#     expr       min        lq    median        uq       max neval
#  fun1a() 382.37534 394.27968 396.76076 406.92431 494.32220    20
#  fun1b() 401.91530 413.04710 416.38470 425.90150 503.83169    20
#   fun2() 381.78909 394.16716 395.49341 399.01202 417.79044    20
#  fun3a() 387.35363 397.02220 399.18113 406.23515 413.56128    20
#  fun3b()  28.77801  28.91648  29.01535  29.37596  42.34043    20
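Look at the performance we get from using "data.table" the right way: a median of about 29 ms for the keyed lookup, against roughly 400 ms for each of the others. All the other approaches are impressively fast though.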
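Since the question asked about dplyr specifically, here is a hedged sketch of one more dplyr formulation, using `semi_join` with a one-column table of sampled ids. It is not part of the benchmark above, so no claim is made about its speed; the toy data and the names `toy`, `ids`, `via_filter` and `via_semi` are invented for illustration.

```r
library(dplyr)

set.seed(1)
# Small made-up data in the same shape as mydf
toy <- data.frame(person = sample(50, 500, TRUE), value = runif(500))

# The sampled ids as a one-column data frame, so they can be joined on
ids <- data.frame(person = sample(unique(toy$person), 5))

# Same selection expressed two ways
via_filter <- filter(toy, person %in% ids$person)
via_semi   <- semi_join(toy, ids, by = "person")
```

Both calls keep the rows of `toy` whose `person` appears in `ids`, in the original row order, so the results can be compared column by column.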
summary shows that the results are the same. (The row order of the "data.table" solution would be different since it has been sorted.)
summary(fun1a())
#      person         value
#  Min.   :  16   Min.   :0.000002
#  1st Qu.:2424   1st Qu.:0.250988
#  Median :5075   Median :0.500259
#  Mean   :4958   Mean   :0.500349
#  3rd Qu.:7434   3rd Qu.:0.749601
#  Max.   :9973   Max.   :1.000000

summary(fun2())
#      person         value
#  Min.   :  16   Min.   :0.000002
#  1st Qu.:2424   1st Qu.:0.250988
#  Median :5075   Median :0.500259
#  Mean   :4958   Mean   :0.500349
#  3rd Qu.:7434   3rd Qu.:0.749601
#  Max.   :9973   Max.   :1.000000

summary(fun3b())
#      person         value
#  Min.   :  16   Min.   :0.000002
#  1st Qu.:2424   1st Qu.:0.250988
#  Median :5075   Median :0.500259
#  Mean   :4958   Mean   :0.500349
#  3rd Qu.:7434   3rd Qu.:0.749601
#  Max.   :9973   Max.   :1.000000
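Matching summaries only show that the column distributions agree. A stricter check, sketched here on small made-up data, puts both results into the same row order and compares them value by value. The names `toy_df`, `toy_dt`, `res_base` and `res_key` are hypothetical, and `as.data.table` is used instead of `setDT` so that the data frame and the keyed table are separate objects.

```r
library(data.table)

set.seed(1)
toy_df <- data.frame(person = sample(50, 500, TRUE), value = runif(500))
toy_dt <- as.data.table(toy_df)   # as.data.table() makes a copy
setkey(toy_dt, person)

ids <- sample(unique(toy_df$person), 5)

res_base <- toy_df[toy_df$person %in% ids, ]   # base R subsetting
res_key  <- toy_dt[J(ids)]                     # keyed data.table lookup

# Put both results in the same row order before comparing
o_base <- order(res_base$person, res_base$value)
o_key  <- order(res_key$person, res_key$value)

same_rows <- identical(res_base$person[o_base], res_key$person[o_key]) &&
  identical(res_base$value[o_base], res_key$value[o_key])
```

If `same_rows` is `TRUE`, the two approaches selected exactly the same rows and differ only in ordering.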