In the past two years, a growing community of R users (and statisticians in general) has been participating in two major Question-and-Answer websites:
- The R tag page on Stackoverflow, and
- Stat over flow (which will soon move to a new domain; no worries, I’ll write about it once it happens)
In that time, several long (and fascinating) discussion threads were started, reflecting on tips and best practices for managing a statistical analysis project. They are:
- “Workflow for statistical analysis and report writing”
- “Organizing R Source Code”
- “How to organize large R programs?”
- “R and version control for the solo data analyst”
- “How does software development compare with statistical programming/analysis ?”
- “How do you combine “Revision Control” with “WorkFlow” for R?”
- “How to efficiently manage a statistical analysis project?”
On the last thread in the list, the user chl started compiling all the tips and suggestions together. With his permission, I am now republishing his summary here. I encourage you to contribute from your own experience (either in the comments or by answering any of the threads I’ve linked to).
From here on is what “chl” wrote:
These guidelines were compiled from SO (as suggested by @Shane), Biostar (hereafter, BS), and SE. I tried my best to acknowledge ownership for each item, and to select the first or a highly upvoted answer. I also added things of my own, and flagged items that are specific to the [R] environment.
Data management
- create a project structure that keeps everything in the right place (data, code, figures, etc.; giovanni /BS)
- never modify raw data files (ideally, they should be read-only), copy/rename to new ones when making transformations, cleaning, etc.
- check data consistency (whuber /SE)
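As a rough illustration of the three points above, here is a minimal sketch in R. The directory names, the file name and the toy data are all hypothetical and only serve the example; the checks at the end are a simple stand-in for whatever consistency rules your own data call for.

```r
# Hypothetical project layout under a temporary directory (names are illustrative only)
project <- file.path(tempdir(), "my_analysis")
dirs <- file.path(project, c("data/raw", "data/processed", "R", "figures", "reports"))
for (d in dirs) dir.create(d, recursive = TRUE, showWarnings = FALSE)

# Store a toy raw data file once, then mark it read-only
# (Sys.chmod has limited effect on Windows)
raw_file <- file.path(project, "data", "raw", "measurements.csv")
if (!file.exists(raw_file)) {
  raw <- data.frame(id = 1:5, weight = c(70.2, 65.1, NA, 80.4, 59.9))
  write.csv(raw, raw_file, row.names = FALSE)
  Sys.chmod(raw_file, mode = "0444")
}

# Cleaning never touches the raw file: the result goes to a new file
dat <- read.csv(raw_file)
clean <- dat[!is.na(dat$weight), ]
write.csv(clean,
          file.path(project, "data", "processed", "measurements_clean.csv"),
          row.names = FALSE)

# Basic consistency checks: unique ids, plausible values, no rows gained
stopifnot(
  anyDuplicated(clean$id) == 0,
  all(clean$weight > 0),
  nrow(clean) <= nrow(dat)
)
```

Using `stopifnot()` makes the script halt as soon as a check fails, so an inconsistent file cannot silently flow into the analysis.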
Coding
- organize source code in logical units or building blocks (Josh Reich/hadley/ars /SO; giovanni/Khader Shameer /BS)
- separate source code from editing stuff, especially for large projects (partly overlapping with the previous item and with reporting)
- document everything, with e.g. [R]oxygen (Shane /SO) or consistent self-annotation in the source file
- [R] custom functions can be put in a dedicated file (that can be sourced when necessary), in a new environment (so as to avoid populating the top-level namespace, Brendan OConnor /SO), or a package (Dirk Eddelbuettel/Shane /SO)
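One possible way to combine the last two items is sketched below: a small helper file, documented with Roxygen-style comments, is sourced into its own environment. The file name `helpers.R` and the function `se()` are made up for this example, and this is only one of several ways to keep helpers out of the top-level workspace.

```r
# Write a hypothetical helper file so that the sketch is self-contained
helper_file <- file.path(tempdir(), "helpers.R")
writeLines(c(
  "#' Standard error of the mean",
  "#'",
  "#' @param x a numeric vector; missing values are dropped",
  "#' @return the standard error of the mean of x",
  "se <- function(x) {",
  "  x <- x[!is.na(x)]",
  "  sd(x) / sqrt(length(x))",
  "}"
), helper_file)

# Source the file into a dedicated environment instead of the global one
helpers <- new.env()
sys.source(helper_file, envir = helpers)

# The function is reached through the environment...
helpers$se(c(2, 4, 4, 4, 5, 5, 7, 9))

# ...and nothing called 'se' was added to the top-level workspace
exists("se", envir = globalenv(), inherits = FALSE)
```

Once the collection of helpers grows, moving them into a package is the more robust option, since it brings documentation and a namespace for free.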
Analysis
- don’t forget to set/record the seed you used when calling an RNG or stochastic algorithms (e.g. k-means)
- for Monte Carlo studies, it may be interesting to store specs/parameters in a separate file (Sumatra may be a good candidate, giovanni /BS)
- don’t limit yourself to one plot per variable, use multivariate (Trellis) displays and interactive visualization tools (e.g. GGobi)
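To make the first two items concrete, here is a minimal sketch that fixes the seed before a k-means run and stores the run parameters in a separate file. The seed value, the simulated data and the file name `run_spec.rds` are arbitrary choices made for this example.

```r
seed <- 2010L  # arbitrary value; what matters is that it is recorded

# Simulate two loose clusters of points
set.seed(seed)
x <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 3), ncol = 2))

# k-means picks random starting centers, so reset the seed before each run
set.seed(seed)
fit1 <- kmeans(x, centers = 2)
set.seed(seed)
fit2 <- kmeans(x, centers = 2)

# Same data and same seed give the same partition
stopifnot(identical(fit1$cluster, fit2$cluster))

# Keep the specs next to the results, in a separate file
run_spec <- list(seed = seed, centers = 2, r_version = R.version.string)
spec_file <- file.path(tempdir(), "run_spec.rds")
saveRDS(run_spec, spec_file)
```

Reading the file back with `readRDS(spec_file)` recovers the seed and parameters needed to rerun the analysis later.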
Versioning
- use some kind of version control system for easy tracking/export, e.g. Git (Sharpie/VonC/JD Long /SO); this follows from nice questions asked by @Jeromy and @Tal
- backup everything, on a regular basis (Sharpie/JD Long /SO)
- keep a log of your ideas, or rely on an issue tracker, like ditz (giovanni /BS) — partly redundant with the previous item since it is available in Git
Editing/Reporting
- [R] Sweave (Matt Parker /SO)
- [R] brew (Shane /SO)
- [R] R2HTML or ascii
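For readers who have not used Sweave, the sketch below writes a tiny, hypothetical `.Rnw` file and weaves it into a `.tex` file with `Sweave()` from the utils package that ships with R. The file names and the document content are invented for this example; turning the resulting `.tex` file into a PDF additionally requires a LaTeX installation, which is not shown here.

```r
# A minimal Sweave source: LaTeX text with one R code chunk and one inline \Sexpr
rnw <- file.path(tempdir(), "report.Rnw")
writeLines(c(
  "\\documentclass{article}",
  "\\begin{document}",
  "<<summary, echo=TRUE>>=",
  "x <- c(1, 2, 3, 4)",
  "summary(x)",
  "@",
  "The mean of x is \\Sexpr{mean(x)}.",
  "\\end{document}"
), rnw)

# Weave: run the chunk and write code, output and text to a .tex file
tex <- file.path(tempdir(), "report.tex")
utils::Sweave(rnw, output = tex, quiet = TRUE)

# Inspect the generated LaTeX
cat(readLines(tex), sep = "\n")
```

Because the numbers in the report are computed when the document is woven, rerunning `Sweave()` after a data change keeps text and results in sync.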