A few hours ago, Jens Oehlschlägel announced on the R-help mailing list the release of a new version of the ff package.
The ff package provides data structures that are stored on disk but behave (almost) as if they were in RAM by transparently mapping only a section (pagesize) in main memory – the effective virtual memory consumption per ff object.
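As a minimal sketch of what "stored on disk but behaving as if in RAM" looks like in practice (the vector length and the values are made up for illustration):

```r
library(ff)

# Create a disk-backed integer vector with one million elements.
x <- ff(vmode = "integer", length = 1e6)

# Ordinary subscripting works; only the touched section of the file is mapped into RAM.
x[1:5] <- 5:1
first <- x[1:5]   # an ordinary in-RAM integer vector

filename(x)       # path of the backing file on disk
delete(x)         # remove the backing file when done
```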
Here are the new features of ff, as Jens wrote in his announcement:
---
Dear R community,
The next release of package ff is available on CRAN. With the kind help of Brian Ripley, it now supports the Win64 and Sun versions of R. It has three major functional enhancements:
a) new fast in-memory sorting and ordering functions (single-threaded)
b) ff now supports on-disk sorting and ordering of ff vectors and ffdf dataframes
c) ff integer vectors now can be used as subscripts of ff vectors and ffdf dataframes
a) is achieved by careful implementation of NA-handling and exploiting context information
b) although ff objects are permanently stored on disk, sorting and ordering them can be faster than the standard routines in R
c) applying an order to ff vectors and ffdf dataframes is substantially slower than in pure R because it involves disk-access AND sorting index positions (to avoid random access).
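To illustrate b) and c), here is a minimal sketch that is not part of the original announcement. The data are made up; ffsort and fforder are the function names listed in the NEWS entry of this release.

```r
library(ff)

set.seed(1)
v <- sample(100000L)   # a random permutation of 1..100000 in RAM
x <- as.ff(v)          # disk-backed copy of v

# b) on-disk sorting and ordering
s <- ffsort(x)         # sorted copy, itself an ff vector
o <- fforder(x)        # ff integer vector holding the ordering positions

# c) an ff integer vector used as a subscript of an ff vector
y <- x[o]              # the values of x in sorted order

# Pull the results into RAM to compare them with base R.
sorted <- s[]
ord    <- o[]
picked <- y[]
```

With unique values, as here, the results should agree with sort(v) and order(v) from base R.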
There is still room for improvement; however, the current status should already be useful. I ran some comparisons with SAS (see end of mail):
– both could sort German census size (81e6 rows) on a 3GB notebook
– ff sorts and orders faster on single columns
– sorting big multicolumn-tables is faster in SAS
Win64 binaries and version 2.2.1 supporting Sun should appear on CRAN within the next few days. For the impatient: check out revision 67 or higher from R-Forge.
Non-Windows users: please note that you need to set appropriate values for options ‘ffbatchbytes’ and ‘ffmaxbytes’ yourself.
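A sketch of setting the two options by hand. The byte values below are arbitrary placeholders chosen for illustration, not recommendations from the announcement; pick limits that suit the RAM of your machine.

```r
library(ff)

# Hypothetical limits, given in bytes.
options(
  ffbatchbytes = 16 * 1024^2,    # RAM budget per chunk in chunked processing (here 16 MB)
  ffmaxbytes   = 512 * 1024^2    # larger budget for procedures such as sorting (here 512 MB)
)

batch_limit <- getOption("ffbatchbytes")
max_limit   <- getOption("ffmaxbytes")
```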
Note that virtual window support is now deprecated because it leads to overly complex code. Let us know if you urgently need it and why.
Feedback, ideas and contributions appreciated. To those who offered code during the last months: please forgive us that integrating and documenting was not possible with this release.
Jens & Daniel
P.S. NEWS
CHANGES IN ff VERSION 2.2.0
NEW FEATURES
o ff now supports the 64 bit Windows and Sun versions of R (thanks to Brian Ripley)
o ff now supports sorting and ordering of ff vectors and dataframes (see ramsort, ffsort, ffdfsort, ramorder, fforder, ffdforder)
o ff now supports ff vectors as subscripts of ff objects (currently positive integers only, booleans are planned)
o New option ‘ffmaxbytes’ which allows certain ff procedures like sorting to use a larger RAM limit than ‘ffbatchbytes’ in chunked processing. Such a higher limit is useful for (single-R-process) sorting compared to some multi-R-process chunked processing. It is a good idea to reduce ‘ffmaxbytes’ on slaves or avoid ff sorting there completely.
o New generic ‘pagesize’ with method ‘pagesize.ff’ which returns the current pagesize as defined on opening the ff object.
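A small sketch of the new generic, not part of the NEWS file itself; the vector below is made up for illustration.

```r
library(ff)

x <- ff(vmode = "double", length = 1000)

ps <- pagesize(x)          # page size in effect for this ff object
getOption("ffpagesize")    # default used when no 'pagesize' argument is passed to ff()

delete(x)                  # remove the backing file when done
```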
USER VISIBLE CHANGES
o [.ff now returns with the same vmode as the ff-object
o Certain operations are faster now because we worked around unnecessary copying triggered by many of R’s assignment functions. For example reading a factor from a (well-cached) file is now 20% faster and thus as fast as just creating this factor in-RAM using levels()<- and class()<- assignments. (consider this tuning temporary, hoping for a generic fix in base R)
o ff() can now open files larger than .Machine$integer.max elements (but gives access only to the first .Machine$integer.max elements)
o ff now has default pattern NULL translating to the pattern in ‘filename’ (and only to the previous default ‘ff’ if no filename is given)
o ff now sets the pattern in synch with a requested ‘filename’
o clone.ff now always creates a file consistent with the previous pattern
o clone.ff now always creates a finalizer consistent with the file location
o clone.ffdf has a new argument ‘nrow’ which allows creating an empty copy with a different number of rows (currently requires ‘initdata=NULL’)
o clone.default now deep-copies lists and atomic vectors
DEPRECATED
o virtual window support is deprecated. Let us know if you urgently need this and why.
BUG FIXES
o read.table.ffdf now also works if transFUN filters and returns fewer rows
Older version changes can be viewed in the package’s NEWS/changelog.
P.P.S. Below are some timings in seconds at 3e6, 9e6, 27e6 and 81e6 elements from a Lenovo 410s notebook
(3GB RAM, i5 m520, 2 real cores, 4 hyperthreaded cores, SSD drive, Windows7 32bit)
Legend for software
ram: new in-ram inplace operations receiving enough RAM to optimize for speed, not for memory
ff: new on-disk operations limiting RAM for this operation at ~500MB
R: timings from standard sort() and order()
SAS: timings from SAS 9.2 allowing for multithreaded sorting
Legend for type of random data
rboolean: bi-boolean with 50% FALSE and TRUE
rlogical: tri-boolean with 33% NA, FALSE and TRUE
rubyte: integers from 0..255
rbyte: 33% NA and 67% -127..127
rushort: integers from 0..65535
rshort: 33% NA and 67% -32767..32767
ruinteger: 50% NA and 50% integers
rinteger: random integers
rusingle: 50% NA and 50% singles
rsingle: random singles
rudouble: 50% NA and 50% doubles
rdouble: doubles
rfactor: factor with 64 levels of length 66 (being different at bytes 65 and 66)
rchar: 64 strings of length 66 (being different at bytes 65 and 66)
Legend for abbreviations
OOM: out of memory
OOD: out of disk
NT: not timed because too slow
NA: not available
Results for sorting a single column
=====================================
, , 3e6
    rboolean rlogical rubyte rbyte rushort rshort ruinteger rinteger rusingle rsingle rudouble rdouble rfactor  rchar
ram     0.02     0.03   0.02  0.04    0.02   0.02      0.17     0.11     0.66    0.36     0.66    0.36    0.03     NA
ff      0.25     0.33   0.22  0.25    0.28   0.26      0.38     0.30     1.02    0.65     0.92    0.67    0.39     NA
R         NA     0.35     NA    NA      NA     NA      0.83     0.54       NA      NA     1.28    0.90   64.83  51.20
SAS       NA       NA     NA    NA      NA     NA      1.61     1.32       NA      NA     1.57    1.29      NA  17.01
, , 9e6
    rboolean rlogical rubyte rbyte rushort rshort ruinteger rinteger rusingle rsingle rudouble rdouble rfactor  rchar
ram     0.04     0.07   0.03  0.08    0.03   0.07      0.50     0.31     1.88    0.97     1.87    0.97    0.04     NA
ff      0.72     0.93   0.61  0.73    0.84   0.75      1.08     0.86     2.68    1.62     2.57    1.67    0.78     NA
R         NA     0.90     NA    NA      NA     NA      2.84     1.78       NA      NA     3.51    2.12      NA     NT
SAS       NA       NA     NA    NA      NA     NA      4.99     3.90       NA      NA     4.91    4.48      NA  62.76
, , 27e6
    rboolean rlogical rubyte rbyte rushort rshort ruinteger rinteger rusingle rsingle rudouble rdouble rfactor  rchar
ram     0.10     0.24   0.09  0.23    0.11   0.23      1.58     1.00     6.06    3.15     6.00    3.23    0.16     NA
ff      2.19     2.98   1.92  2.21    2.56   2.31      3.22     2.68     8.49    5.18     8.10    5.35    2.58     NA
R         NA     2.72     NA    NA      NA     NA      9.69     5.80       NA      NA    12.34    6.97      NA     NT
SAS       NA       NA     NA    NA      NA     NA     17.02    12.67       NA      NA    17.05   14.07      NA 176.63
, , 81e6
    rboolean rlogical rubyte rbyte rushort rshort ruinteger rinteger rusingle rsingle rudouble rdouble rfactor  rchar
ram     0.27     0.67   0.28  0.67    0.33   0.72      5.58     3.23       NA      NA       NA      NA    0.49     NA
ff      6.56     9.06   5.93  6.88    8.52   7.15     10.70     8.54    51.35   28.98    70.20   44.13    7.91     NA
R        OOM      OOM    OOM   OOM     OOM    OOM       OOM      OOM      OOM     OOM      OOM     OOM     OOM    OOM
SAS       NA       NA     NA    NA      NA     NA     61.45    44.94       NA      NA    63.14   46.56      NA    OOD
Results for calculating the order on a single column
====================================================
, , 3e6
    rboolean rlogical rubyte rbyte rushort rshort ruinteger rinteger rusingle rsingle rudouble rdouble rfactor  rchar
ram     0.05     0.07   0.04  0.07    0.09   0.11      0.92     0.53     1.46    0.81     1.31    0.64    0.06     NA
ff      0.14     0.19   0.77  0.58    0.87   0.67      1.04     0.60     1.66    0.81     1.43    0.85    0.74     NA
R         NA     3.23     NA    NA      NA     NA      4.57     4.07       NA      NA     5.27    4.61    4.59 193.75
SAS       NA       NA     NA    NA      NA     NA      1.86     1.48       NA      NA     1.63    1.39      NA  16.83
, , 9e6
    rboolean rlogical rubyte rbyte rushort rshort ruinteger rinteger rusingle rsingle rudouble rdouble rfactor  rchar
ram     0.16     0.21   0.17  0.20    0.30   0.28      3.07     1.61     4.24    2.16     4.22    2.19    0.19     NA
ff      0.48     0.51   2.45  1.84    2.91   2.15      3.38     1.92     4.72    2.48     4.54    2.45    1.91     NA
R         NA    12.31     NA    NA      NA     NA     17.02    15.56       NA      NA    16.96   15.47      NT     NT
SAS       NA       NA     NA    NA      NA     NA      6.71     5.97       NA      NA     6.25    5.41      NA  59.27
, , 27e6
    rboolean rlogical rubyte rbyte rushort rshort ruinteger rinteger rusingle rsingle rudouble rdouble rfactor  rchar
ram     0.51     0.67    0.5  0.69    0.92   0.94      9.89     5.31    15.13    7.69    15.15    7.70    0.58     NA
ff      1.33     1.51    7.6  5.77    9.25   6.79     10.72     6.12    15.98    8.53    15.96    8.92    5.80     NA
R         NA    46.37     NA    NA      NA     NA     65.57    59.17       NA      NA    63.74   58.37      NT     NT
SAS       NA       NA     NA    NA      NA     NA     21.41    18.77       NA      NA    20.22   18.84      NA 182.74
, , 81e6
    rboolean rlogical rubyte rbyte rushort rshort ruinteger rinteger rusingle rsingle rudouble rdouble rfactor  rchar
ram     1.49     2.03    1.5  2.06    3.15   2.98     34.33    17.89       NA      NA       NA      NA    1.90     NT
ff      3.98     4.65   22.9 17.42   30.33  21.82     36.68    20.36    77.16   49.55   125.01   59.27   17.39     NT
R        OOM      OOM    OOM   OOM     OOM    OOM       OOM      OOM      OOM     OOM      OOM     OOM     OOM    OOM
SAS       NA       NA     NA    NA      NA     NA     86.24    70.32       NA      NA    84.40   68.66      NA     NA
Results for sorting all columns of a table with m columns of random double data (without NAs)
=============================================================================================
, , 3e6
ncol      1      2      5     10     20
SAS    1.65   1.83   3.71   6.90  14.06
ff     1.97   2.37   3.75   6.21  10.86
R      4.70   5.67   5.65   6.46   8.06
, , 9e6
ncol      1      2      5     10     20
SAS    5.18   6.70  14.02  19.25  41.65
ff     6.38   7.96  12.12  19.58  45.43
R     18.86  19.20  20.58    OOM    OOM
, , 27e6
ncol      1      2      5     10     20
SAS   17.79  19.52  35.03  83.30 142.09
ff    22.68  25.79  46.25  87.55 157.62
R     65.56    OOM    OOM    OOM    OOM
, , 81e6
ncol      1      2      5     10     20
SAS   64.78  83.39 143.59 242.23 408.72
ff   167.52 220.03 324.03 502.42 884.03
R       OOM    OOM    OOM    OOM    OOM