Managing a statistical analysis project – guidelines and best practices

Tal Galili

13 years ago

In the past two years, a growing community of R users (and statisticians in general) have been participating in two major Question-and-Answer websites:

The R tag page on Stackoverflow, and
Stat over flow (which will soon move to a new domain; no worries, I’ll write about it once it happens)

In that time, several long (and fascinating) discussion threads were started, reflecting on tips and best practices for managing a statistical analysis project. They are:

On the last thread in the list, the user chl started compiling all the tips and suggestions together. With his permission, I am now republishing that compilation here. I encourage you to contribute from your own experience (either in the comments, or by answering any of the threads I’ve linked to).

From here on is what “chl” wrote:

These guidelines were compiled from SO (as suggested by @Shane), Biostar (hereafter, BS), and SE. I tried my best to acknowledge ownership for each item, and to select the first or a highly upvoted answer. I also added things of my own, and flagged items that are specific to the [R] environment.

Data management

create a project structure that keeps everything in the right place (data, code, figures, etc.; giovanni /BS)
never modify raw data files (ideally, they should be read-only); copy or rename them to new files when transforming, cleaning, etc.
check data consistency (whuber /SE)
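
The three items above can be sketched in base R roughly as follows. The folder names, the file names, and the toy data set are illustrative assumptions only; the original threads do not prescribe a particular layout.

```r
# Hypothetical project root; a fresh temporary directory keeps the sketch safe to run.
root <- tempfile("my_project_")

# 1. Create a project structure (folder names are only a suggestion).
dirs <- file.path(root, c("data/raw", "data/processed", "code", "figures"))
for (d in dirs) dir.create(d, recursive = TRUE, showWarnings = FALSE)

# 2. Store a toy raw data file, then mark it read-only.
raw_file <- file.path(root, "data/raw", "measurements.csv")
write.csv(data.frame(id = 1:3, weight = c(61.2, 70.5, 58.9)),
          raw_file, row.names = FALSE)
Sys.chmod(raw_file, mode = "0444")

# 3. Check consistency before doing anything else with the data.
dat <- read.csv(raw_file)
stopifnot(anyDuplicated(dat$id) == 0,   # ids must be unique
          !anyNA(dat$weight),           # no missing weights
          all(dat$weight > 0))          # weights must be positive

# 4. Write transformed data to a new file; the raw file is never touched again.
dat$weight_z <- as.numeric(scale(dat$weight))
clean_file <- file.path(root, "data/processed", "measurements_clean.csv")
write.csv(dat, clean_file, row.names = FALSE)
```

Which checks belong in step 3 depends entirely on the data; the three shown here are just placeholders for whatever invariants your data set should satisfy.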

Coding

organize source code in logical units or building blocks (Josh Reich/hadley/ars /SO; giovanni/Khader Shameer /BS)
separate source code from editing stuff, especially for large projects (partly overlapping with the previous item and with reporting)
document everything, with e.g. [R]oxygen (Shane /SO) or consistent self-annotation in the source file
[R] custom functions can be put in a dedicated file (that can be sourced when necessary), in a new environment (so as to avoid populating the top-level namespace, Brendan OConnor /SO), or a package (Dirk Eddelbuettel/Shane /SO)
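
As a rough illustration of the last item, here is one way to keep custom functions in a dedicated file and load them into their own environment. The helper function `se_mean`, the environment name `project_helpers`, and the temporary file are hypothetical; in a real project the file would be maintained by hand, and the roxygen-style comments would only be turned into help pages once the code lives in a package.

```r
# Write a tiny helper file so the sketch is self-contained.
helper_file <- tempfile(fileext = ".R")
writeLines(c(
  "#' Standard error of the mean",
  "#'",
  "#' @param x numeric vector",
  "#' @return a single numeric value",
  "se_mean <- function(x) sd(x) / sqrt(length(x))"
), helper_file)

# Source the helpers into their own environment instead of the global workspace.
helpers <- new.env()
sys.source(helper_file, envir = helpers)

# Option A: call the function through the environment.
res_a <- helpers$se_mean(c(2, 4, 6))

# Option B: attach the environment to the search path under a recognisable name.
attach(helpers, name = "project_helpers")
res_b <- se_mean(c(2, 4, 6))
detach("project_helpers")
```

Either way, `ls()` at the top level stays free of helper functions, which is the point of the suggestion.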

Analysis

don’t forget to set/record the seed you used when calling RNG or stochastic algorithms (e.g. k-means)
for Monte Carlo studies, it may be interesting to store specs/parameters in a separate file (sumatra may be a good candidate, giovanni /BS)
don’t limit yourself to one plot per variable, use multivariate (Trellis) displays and interactive visualization tools (e.g. GGobi)
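
The first two items can be combined in a small sketch: the seed is treated as one more parameter and saved alongside the others. The parameter names, the seed value, and the simulated two-cluster data are assumptions made for illustration, and a plain `.rds` file stands in here for a dedicated tool such as sumatra.

```r
# Hypothetical simulation settings, kept together so they can be saved with the results.
params <- list(seed = 20240101, n = 100, centers = 2)

run_kmeans <- function(p) {
  set.seed(p$seed)  # fixes both the simulated data and the k-means starting centers
  x <- rbind(matrix(rnorm(p$n, mean = 0), ncol = 2),
             matrix(rnorm(p$n, mean = 4), ncol = 2))
  kmeans(x, centers = p$centers)$cluster
}

fit1 <- run_kmeans(params)
fit2 <- run_kmeans(params)
same_partition <- identical(fit1, fit2)  # the same seed gives the same partition

# Store the settings in a separate file next to the output.
param_file <- tempfile(fileext = ".rds")
saveRDS(params, param_file)
restored <- readRDS(param_file)
```

Re-running `run_kmeans(restored)` later should reproduce the same cluster assignment, provided the R version and its default random number generator have not changed in between.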

Versioning

use some kind of version control system (VCS) for easy tracking/export, e.g. Git (Sharpie/VonC/JD Long /SO); this follows from nice questions asked by @Jeromy and @Tal
back up everything, on a regular basis (Sharpie/JD Long /SO)
keep a log of your ideas, or rely on an issue tracker, like ditz (giovanni /BS) — partly redundant with the previous item since it is available in Git

Editing/Reporting

[R] Sweave (Matt Parker /SO)
[R] brew (Shane /SO)
[R] R2HTML or ascii
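
To give a flavour of the first option, here is a minimal Sweave sketch. The document content, the chunk label, and the use of the built-in `cars` data set are assumptions chosen for illustration; the call only produces a `.tex` file, which would still need to be compiled with LaTeX to obtain a PDF.

```r
# Write a tiny Sweave source file (LaTeX with embedded R chunks).
rnw_file <- tempfile(fileext = ".Rnw")
writeLines(c(
  "\\documentclass{article}",
  "\\begin{document}",
  "<<speed-summary, echo=TRUE>>=",
  "summary(cars$speed)",
  "@",
  "The mean speed is \\Sexpr{mean(cars$speed)}.",
  "\\end{document}"
), rnw_file)

# Sweave writes its output into the working directory, so switch there temporarily.
old_wd <- setwd(tempdir())
utils::Sweave(rnw_file, quiet = TRUE)
setwd(old_wd)

# The generated LaTeX file has the same base name as the source file.
tex_file <- file.path(tempdir(), sub("\\.Rnw$", ".tex", basename(rnw_file)))
tex_lines <- readLines(tex_file)
```

The code chunk and the inline `\Sexpr{}` expression are evaluated when the report is built, so the numbers in the text cannot drift away from the analysis.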