R: summarizing multiple columns with data.table

February 15, 2013

I'm trying to use data.table to speed up processing of a large data.frame (300k x 60) made of several smaller merged data.frames. I'm new to data.table. The code so far is as follows:

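    library(data.table)
    # two small tables with partly overlapping columns
    a = data.table(index=1:5, a=rnorm(5,10), b=rnorm(5,10), z=rnorm(5,10))
    b = data.table(index=6:10, a=rnorm(5,10), b=rnorm(5,10), c=rnorm(5,10), d=rnorm(5,10))
    # full outer merge on the shared columns; unmatched cells become NA
    dt = merge(a, b, by=intersect(names(a), names(b)), all=TRUE)
    dt$category = sample(letters[1:3], 10, replace=TRUE)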

and I wondered if there is a more efficient way than the following to summarize the data.

    summ = dt[i=TRUE, j=list(a=sum(a,na.rm=TRUE), b=sum(b,na.rm=TRUE), c=sum(c,na.rm=TRUE),
                             d=sum(d,na.rm=TRUE), z=sum(z,na.rm=TRUE)), by=category]

I don't want to type all 50 column calculations by hand, and an eval(paste(...)) seems clunky somehow.

I had a look at the example below, but it seems a bit complicated for my needs. Thanks.

How to summarize a data.table across multiple columns

You can use a simple lapply statement with .SD:

    dt[, lapply(.SD, sum, na.rm=TRUE), by=category ]

       category index        a        b        z         c        d
    1:        c    19 51.13289 48.49994 42.50884  9.535588 11.53253
    2:        b     9 17.34860 20.35022 10.32514 11.764105 10.53127
    3:        a    27 25.91616 31.12624  0.00000 29.197343 31.71285

If you only want to summarize over certain columns, you can add the .SDcols argument:

    # note that .SDcols also allows reordering of the columns
    dt[, lapply(.SD, sum, na.rm=TRUE), by=category, .SDcols=c("a", "c", "z") ]

       category        a         c        z
    1:        c 51.13289  9.535588 42.50884
    2:        b 17.34860 11.764105 10.32514
    3:        a 25.91616 29.197343  0.00000
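With 50 or so columns, the .SDcols vector does not have to be typed by hand either. Here is a minimal sketch, assuming the columns to skip are an identifier and the grouping column. The toy table and the names value_cols, x and y are made up for illustration and are not from the question.

```r
library(data.table)

# Hypothetical toy data standing in for the merged table
dt <- data.table(
  index    = 1:4,
  category = c("a", "a", "b", "b"),
  x        = c(1, 2, 3, 4),
  y        = c(10, NA, 30, 40)
)

# Build the column list programmatically: everything except
# the identifier and the grouping column
value_cols <- setdiff(names(dt), c("index", "category"))

# Sum each selected column within each category, ignoring NAs
summ <- dt[, lapply(.SD, sum, na.rm = TRUE), by = category, .SDcols = value_cols]
print(summ)
```

Leaving index out of .SDcols also avoids summing an identifier column, which is rarely meaningful.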

This is, of course, not limited to sum; you can use any function with lapply, including anonymous functions (i.e., it's a regular lapply statement).
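As a rough illustration of that point, the sketch below runs an anonymous function and a named function with an extra argument over .SD. The toy table and the names n_obs and means are assumptions for the example, not part of the original data.

```r
library(data.table)

# Hypothetical toy data, not the data from the question
dt <- data.table(
  category = c("a", "a", "b", "b"),
  x        = c(1, NA, 3, 5),
  y        = c(2, 4, NA, 8)
)

# Anonymous function: count the non-missing values per column and group
n_obs <- dt[, lapply(.SD, function(v) sum(!is.na(v))), by = category]

# Named function: extra arguments (here na.rm) are passed through lapply
means <- dt[, lapply(.SD, mean, na.rm = TRUE), by = category]

print(n_obs)
print(means)
```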

Lastly, there is no need to use i=TRUE and j= <..>. Personally, I think that makes the code less readable, but it is just a style preference.

edit: documentation

You can find the documentation for .SD and several other special variables under the help section of ?"[.data.table" (in the arguments section, under the info for by).

Also have a look at data.table FAQ 2.1:

http://datatable.r-forge.r-project.org/datatable-faq.pdf
