Guest post by Jake Russ
For a recent project I needed to make a simple sum calculation on a rather large data frame (0.8 GB, 4+ million rows, and ~80,000 groups). As an avid user of Hadley Wickham’s packages, my first thought was to use plyr. However, the job took plyr roughly 13 hours to complete.
plyr is extremely efficient and user friendly for most problems, so it was clear to me that I was using it for something it wasn’t meant to do, but I didn’t know of any alternative screwdrivers to use.
I asked for some help on the manipulatr Google group, and their feedback led me to data.table and dplyr, a new, still-in-progress package by Hadley.
What follows is a speed comparison of these three packages, incorporating all the feedback from the manipulatr folks. They found it informative, so Tal asked me to write it up as a reproducible example.