Managing a statistical analysis project – guidelines and best practices

In the past two years, a growing community of R users (and statisticians in general) has been participating in two major question-and-answer websites:

  1. The R tag page on Stackoverflow, and
  2. Stat over flow (which will soon move to a new domain; no worries, I’ll write about it once it happens)

In that time, several long (and fascinating) discussion threads were started, reflecting on tips and best practices for managing a statistical analysis project.  They are:

In the last thread in the list, the user chl has started compiling all the tips and suggestions together, and with his permission I am now republishing his summary here.  I encourage you to contribute from your own experience (either in the comments, or by answering in any of the threads I’ve linked to).

Continue reading “Managing a statistical analysis project – guidelines and best practices”

R syntax highlighting for bloggers on WordPress.com

Announcing the ability to highlight R syntax in WordPress.com blogs, thanks to the recent work of Yihui Xie, Yoav Farhi and Andrew Redd.

Good news for R bloggers who are using WordPress.com to host their blog.

This week, the good people running WordPress.com (special thanks go to Yoav Farhi) added the ability for all users of the WordPress.com platform to highlight R code inside their posts.

Basically you’ll need to wrap the code in your post like this:

[sourcecode language="r"]
test.function = function(r) {
    return(pi * r^2)
}
test.function(1)
[/sourcecode]

(Which will then look like this: [image: R syntax highlighted code example])

Further details (and other supported languages) can be read about on this WordPress.com support page.

This new feature was made possible thanks to the work of Yihui Xie (who created the famous, and cool, animation package for R), who wrote an R syntax brush for the SyntaxHighlighter WordPress plugin (the plugin used by WordPress.com for syntax highlighting). Thanks should also go to Andrew Redd, the creator of NppToR (which connects Notepad++ to R). He made some good suggestions and was game to take on the brush creation in case there were problems (which, thankfully, so far there aren’t).

P.S.: If you are a WordPress.org user (i.e., you have a self-hosted WordPress blog) and want to enable R syntax highlighting for your blog, I would recommend the WP-Syntax plugin (enhanced with GeSHi version 1.0.8.6), which can be downloaded here.

Open source and money – why paying R developers might not always help the project

This post can be summed up in two sentences: “We can’t buy love” (and “starting to pay for love could make it disappear”), while at the same time “we need money to live and love”. These two conflicting forces, in relation to open source, are the topic of this post.

This post is directed at the community of R users but is relevant to people in all open source projects. It deals with the question of open source projects and funding. Specifically: should a community of open source developers and users, once it exists, want to start raising/donating money for the main code contributors?

The conflict arises because, on the one hand, we intuitively wish to repay the people who have helped us, but we worry about the implications of behavioral studies suggesting that doing so might destroy the developers’ motivation to continue working without constantly getting paid, and that a shift from doing something for one reason (whatever it is) to doing it for money might not be easily reversed.
On the other hand, developers need to make a (good) living, and we (as a community) should strive for them to be well paid.
How can these two be reconciled?

This article won’t offer a decisive conclusion. My hope is to invite discussion on the matter (from both amateurs and professionals in the fields of open source and behavioral economics), so as to give people more ideas to base their opinions on.

Update: this post was substantially updated from its original version, thanks to responses both in the comments and, especially, by e-mail. I apologize for writing a post that needed so many corrections, and at the same time I am grateful to all the people who took the time to shed light on places where I was wrong.

* * * *

Motivation: R has issues – how do we get them fixed?

In the past two weeks there has been a raging debate regarding the future of R (hint: “what is R“). Without going deeper into the topic (I already wrote about it here, where you too can go and respond), I’ll sum up the issue with a quote from Ross Ihaka (one of the two founders of R) who recently wrote:

I’ve been worried for some time that R isn’t going to provide the base that we’re going to need for statistical computation in the future. (It may well be that the future is already upon us.) There are certainly efficiency problems (speed and memory use), but there are more fundamental issues too. Some of these were inherited from S and some are peculiar to R.

After this, several discussion threads were started around the web (for example: 0, 1, 2, 3, 4, 5, 6), but then a comment was made on the R-help mailing list by Jaroslaw Piskorski, who wrote:

A few days ago Tal Galili posted a message about some controversies concerning the future of R. Having read the discussions, especially those following Ross Ihaka’s post, I have come to the conclusion, that, as usual, the problem is money. I doubt there would be discussions about dropping R in its present form if the R-Foundation were properly funded and could hire computer scientists, programmers and statisticians. If a commercial company is able to provide big-database and multicore solutions, then so would a properly founded R-Foundation.

To which my response is: I strongly disagree with this statement.
That is, I do agree that money could help with things, and it could well be part of the solution. But I doubt that money is the core of this problem, or that the problem would be solved if only we could now hire “computer scientists, programmers and statisticians” (although that could be part of the solution).

And the reason I am doubtful stems from two sources:

Continue reading “Open source and money – why paying R developers might not always help the project”

Dumping functions from the global environment into an R script file

Looking at a project you didn’t touch for years poses many challenges. The less documentation and organization you had in your files, the more time you’ll have to spend tracing back what you did back when the code was written.

I just opened up such a project, from before I knew to split my .r files into “data.r”, “functions.r” and “do.r”. All I have are several versions of an old .RData file and many .r files with a mix of functions and commands (oh, the shame!).

One idea I had for tracing back was to take the latest version of the .RData file I had and see which functions were in its environment. Simply typing ls() wouldn’t do, since it only lists the names of all the objects. What I wanted was a list of all the functions that were defined in my .RData environment, together with their code. Thanks to the code recently published by Richie Cotton, I was able to create the function “save.functions.from.env”. It goes through all your defined functions and writes them into “d:\temp.r”.

I hope this might be useful to one of you in the future. Here is the code to do it:

save.functions.from.env <- function(file = "d:\\temp.r")
{
	# This function goes through all the functions defined in the global
	# environment and writes them into `file` (by default "d:\temp.r").
	# Note: the backslash must be escaped inside an R string ("\t" is a tab).

	# Let's get all the functions from the environment:
	funs <- Filter(is.function, mget(ls(".GlobalEnv"), envir = globalenv()))

	# Let's write the functions into the file, one after the other:
	for(i in seq_along(funs))
	{
		cat(	# number the function we are about to add
			paste("\n" , "#------ Function number ", i , "-----------------------------------" ,"\n"),
			append = TRUE, file = file
			)

		cat(	# print the function into the file
			paste(names(funs)[i] , "<-", paste(capture.output(funs[[i]]), collapse = "\n"), collapse = "\n"),
			append = TRUE, file = file
			)

		cat(
			paste("\n" , "#-----------------------------------------" ,"\n"),
			append = TRUE, file = file
			)
	}

	cat( # write at the end of the file how many new functions were added to it
		paste("# A total of ", length(funs), " functions were written into", file),
		append = TRUE, file = file
		)
	print(paste("A total of ", length(funs), " functions were written into", file))
}

# save.functions.from.env() # this is how you run it

Update: on Stack Overflow, Joshua Ulrich gave another solution to this challenge:

	newEnv <- new.env()
	load("myFunctions.Rdata", newEnv)
	dump(c(lsf.str(newEnv)), file="normalCodeFile.R", envir=newEnv)

He also suggested looking into ?prompt (which creates documentation files for objects) and/or ?package.skeleton.
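
To see the whole round trip in one place, here is a minimal sketch that needs no old .RData file. The objects in it (`old_env`, `area`, `double_it`, `some_data`) are made up for the illustration, and the output goes to a temporary file rather than to a fixed path:

```r
# Hypothetical example objects; in practice these would come from load()-ing
# your old .RData file into the environment.
old_env <- new.env()
assign("area", function(r) pi * r^2, envir = old_env)
assign("double_it", function(x) 2 * x, envir = old_env)
assign("some_data", 1:10, envir = old_env)  # not a function, should be skipped

# Keep only the names of the objects that are functions.
fun_names <- names(Filter(is.function, mget(ls(old_env), envir = old_env)))

# Write their definitions into a (temporary) R script file.
out_file <- tempfile(fileext = ".R")
dump(fun_names, file = out_file, envir = old_env)

# Source the script into a fresh environment to check that it is valid R code.
new_env <- new.env()
sys.source(out_file, envir = new_env)
ls(new_env)
```

Sourcing the dumped file into a fresh environment is just a sanity check that the script can be run on its own, without the original .RData file.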

Using the {plyr} (1.2) package parallel processing backend with windows

Hadley Wickham has just announced the release of a new R package, “reshape2”, which is (as Hadley wrote) “a reboot of the reshape package”. Alongside it, Hadley announced the release of plyr 1.2.1 (now faster and with support for parallel computation!).
Both releases are exciting thanks to the significant speed increase they have gained.

Yet in the case of the new plyr package, an even more interesting new feature is the parallel processing backend.

    Reminder: what the `plyr` package is all about

    (as written in Hadley’s announcement)

    plyr is a set of tools for a common set of problems: you need to __split__ up a big data structure into homogeneous pieces, __apply__ a function to each piece and then __combine__ all the results back together. For example, you might want to:

    • fit the same model to each patient subset of a data frame
    • quickly calculate summary statistics for each group
    • perform group-wise transformations like scaling or standardising

    It’s already possible to do this with base R functions (like split and the apply family of functions), but plyr makes it all a bit easier with:

    • totally consistent names, arguments and outputs
    • convenient parallelisation through the foreach package
    • input from and output to data.frames, matrices and lists
    • progress bars to keep track of long running operations
    • built-in error recovery, and informative error messages
    • labels that are maintained across all transformations

    Considerable effort has been put into making plyr fast and memory efficient, and in many cases plyr is as fast as, or faster than, the built-in functions.

    You can find out more at http://had.co.nz/plyr/, including a 20 page introductory guide, http://had.co.nz/plyr/plyr-intro.pdf.  You can ask questions about plyr (and data-manipulation in general) on the plyr mailing list. Sign up at http://groups.google.com/group/manipulatr
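
To make the split-apply-combine idea concrete, here is a small sketch that is not part of Hadley’s announcement. It uses the built-in `mtcars` data, first with base R only and then (only if plyr happens to be installed) with `ddply`. The summary chosen, mean `mpg` per number of cylinders, is just an illustration:

```r
# Split: one data frame per number of cylinders.
pieces <- split(mtcars, mtcars$cyl)

# Apply: compute a one-row summary for each piece.
results <- lapply(pieces, function(d) {
  data.frame(cyl = d$cyl[1], mean_mpg = mean(d$mpg))
})

# Combine: stack the pieces back into a single data frame.
by_cyl <- do.call(rbind, results)
rownames(by_cyl) <- NULL
print(by_cyl)

# The same summary with plyr, if it is installed:
if (requireNamespace("plyr", quietly = TRUE)) {
  by_cyl_plyr <- plyr::ddply(mtcars, "cyl", function(d) {
    data.frame(mean_mpg = mean(d$mpg))
  })
  print(by_cyl_plyr)
}
```

With plyr the three steps collapse into one call, and the grouping column is carried into the result for you.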

    What’s new in `plyr` (1.2.1)

    The exciting news about the new plyr version is the added support for parallel processing.

    l*ply, d*ply, a*ply and m*ply all gain a .parallel argument that, when TRUE, applies functions in parallel using a parallel backend registered with the foreach package.

    The new version also has some minor changes and bug fixes, all of which can be read about here.

    In the original announcement, Hadley gave an example of using the new parallel backend with the doMC package for Unix/Linux.  For Windows (the OS I’m using) you should use the doSMP package (as David mentioned in his post earlier today). However, this package is currently released only for “REvolution R” and not yet for R 2.11 (see more about it here).  But thanks to the kind help of Tao Shi, there is a solution for Windows users who want a parallel processing backend for plyr.

    All you need is to install the doSMP package, according to the instructions in the post “Parallel Multicore Processing with R (on Windows)“, and then use it like this:


    require(plyr) # make sure you have 1.2 or later installed
    x <- seq_len(20)
    wait <- function(i) Sys.sleep(0.1)
    system.time(llply(x, wait))
    # user system elapsed
    # 0 0 2

    require(doSMP)
    workers <- startWorkers(2) # My computer has 2 cores
    registerDoSMP(workers)
    system.time(llply(x, wait, .parallel = TRUE))
    # user system elapsed
    # 0.09 0.00 1.11

    Update (03.09.2012): the above code will no longer work with updated versions of R (R 2.15 etc.)

    Trying to run it will result in the error message:

    Loading required package: doSMP
    Warning message:
    In library(package, lib.loc = lib.loc, character.only = TRUE, logical.return = TRUE,  :
      there is no package called ‘doSMP’
    

    This is because trying to install the package gives the error message:

    > install.packages("doSMP")
    Installing package(s) into ‘D:/R/library’
    (as ‘lib’ is unspecified)
    Warning message:
    package ‘doSMP’ is not available (for R version 2.15.0)
    

    You can fix this by replacing the {doSMP} package with the {doParallel} and {foreach} packages. Here is how:

    if(!require(foreach)) install.packages("foreach")
    if(!require(doParallel)) install.packages("doParallel")
    # require(doSMP) # will no longer work...
    library(foreach)
    library(doParallel)
    workers <- makeCluster(2) # My computer has 2 cores
    registerDoParallel(workers)
    
    x <- seq_len(20)
    wait <- function(i) Sys.sleep(0.3)
    system.time(llply(x, wait)) # 6 sec
    system.time(llply(x, wait, .parallel = TRUE)) # 3.53 sec
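
For readers who want to try the idea without installing anything, here is a rough sketch using only the `parallel` package that ships with R (2.14.0 and later) instead of plyr. The function `wait_and_square` is made up for the illustration, and the sketch assumes that two worker processes can be started on your machine:

```r
library(parallel)  # part of the standard R distribution

# A hypothetical slow task: wait a bit, then return the square of the input.
wait_and_square <- function(i) {
  Sys.sleep(0.1)
  i^2
}

workers <- makeCluster(2)  # assumption: 2 worker processes are available

serial_result   <- lapply(1:8, wait_and_square)
parallel_result <- parLapply(workers, 1:8, wait_and_square)

stopCluster(workers)  # release the worker processes when you are done

identical(serial_result, parallel_result)
```

The call to `stopCluster()` is also worth adding at the end of the doParallel code above, so that the worker processes are not left running in the background.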
    

    Tips for the R beginner (a 5 page overview)

    In this post I publish a PDF document titled “A collection of tips for R in Finance”.
    It is a basic 5 page introduction to R in finance by Arnaud Amsellem (LinkedIn profile).

    The article offers tips related to the following points:

    • Code Editor
    • Organizing R code
    • Update packages
    • Getting external data into R
    • Communicating with external applications
    • Optimizing R code

    This article is well articulated, and offers a perspective of someone who is experienced in the field and touches points that I can imagine beginners might otherwise overlook. I hope publishing it here will be of use to some readers out there.

    Update: as some readers have noted (by e-mail and in the comments), this document touches very lightly on the topic of “finance” in R. I therefore decided to change the title from “R in finance – some tips for beginners” to its current form.

    Lastly: if you (a reader of this blog) feel you have an article (“post”) to contribute, but don’t feel like starting your own blog, feel welcome to contact me, and I’ll be glad to post what you have to say on my blog (and subsequently, also on R bloggers).

    Here is the article:
    Continue reading “Tips for the R beginner (a 5 page overview)”

    Rose plot using Deducer’s ggplot2 plot builder

    The (excellent!) LearnR blog had a post today about making a rose plot in ggplot2.

    Following today’s announcement by Ian Fellows of the release of the new version of Deducer (0.4), which offers strong support for ggplot2 through a GUI plot builder, Ian also sent an e-mail showing how to create a rose plot using the new ggplot2 GUI included in the latest version of Deducer. After the template is made, the plot can be generated with 4 clicks of the mouse.

    Here is a video tutorial (Ian published) to show how this can be used:

    The generated template file is available at:
    http://neolab.stat.ucla.edu/cranstats/rose.ggtmpl

    I am excited about the work Ian is doing, and hope to see more people publish use cases with Deducer.

    ggplot2 plot builder is now on CRAN! (through Deducer 0.4 GUI for R)

    Ian Fellows, a hard-working contributor to the R community (and a cool guy), announced today the release of Deducer (0.4) to CRAN (scheduled to update in the next day or so).
    This major update also includes the release of a new plug-in package (DeducerExtras), containing additional dialogs and functionality.

    Following is the e-mail he sent out with all the details and demo videos.

    Continue reading “ggplot2 plot builder is now on CRAN! (through Deducer 0.4 GUI for R)”

    ggplot2 gui: Major feature set complete

    (Written by Ian Fellows)

    There has been quite a bit of progress on the ggplot2 graphical user interface since the last post. All of the major features have been implemented, and are outlined in the vlog links below. What remains is to fix bugs, improve interface elements, and listen to feedback from users (that’s you). Please give it a try by installing the development version of Deducer:
    install.packages("Deducer",,"http://www.rforge.net",type="source")
    It is best used with the R console JGR, which you can find at http://rforge.net/JGR/ .

    Feature tour:
    http://neolab.stat.ucla.edu/cranstats/vlog4.mov

    Development and extension:
    http://neolab.stat.ucla.edu/cranstats/vlog5.mov

    Ian

    Blogging about R – presentation and audio

    At the useR! 2010 conference I had the honor of giving a (~15 minute) talk titled “Blogging about R”. The following is the abstract I submitted, followed by the slides of the talk and an audio recording I made of it (I am sad it got a bit of “hall echo”, but it’s still listenable…).

    P.S: this post does not absolve me from writing up something (with many thanks and links to people) about the useR2010 conference, but I can see it taking a bit longer till I do that.

    —————–

    Abstract of the talk

    This talk is a basic introduction to blogs: why to blog, how to blog, and the importance of the R blogosphere to the R community.

    Because R is an open-source project, the R community members rely (mostly) on each other’s help for statistical guidance, generating useful code, and general moral support.

    Current online tools available for us to help each other include the R mailing lists, the community R-wiki, and the R blogosphere. The emerging R blogosphere is the only source, besides the R journal, that provides our community with articles about R. While these articles are not peer reviewed, they do come in higher volume (and often are of very high quality).

    According to the meta-blog R-bloggers.com, the (English) R blogosphere has produced, in January 2010, about 115 “articles” about R. There are (currently) a bit over 50 bloggers (now about 100) who write about R, with about 1000 (now ~2200) subscribers who read them daily (through e-mails or RSS). These numbers allow me to believe that there is a genuine interest in our community for more people – perhaps you? – to start (and continue) blogging about R.

    In this talk I intend to share knowledge about blogging so that more people are able to participate (freely) in the R blogosphere – both as readers and as writers. The talk will have three main parts:

    • What is a blog
    • How to blog – using the (free) blogging service WordPress.com (with specific emphasis on R)
    • How to develop readership – integration with other social media/networks platforms, SEO, and other best practices

    * * *
    Tal Galili founded www.R-bloggers.com and blogs on www.R-statistics.com
    * * *

    Audio recording of the talk

    Continue reading “Blogging about R – presentation and audio”