stackoverflow | R-statistics blog

How to load the {rJava} package after the error "JAVA_HOME cannot be determined from the Registry"

In case you tried loading a package that depends on the {rJava} package (by Simon Urbanek), you might came across the following error:

Loading required package: rJava
library(rJava)
Error : .onLoad failed in loadNamespace() for ‘rJava’, details:
call: fun(libname, pkgname)
error: JAVA_HOME cannot be determined from the Registry

The error tells us that there is no entry in the Registry that tells R where Java is located. It is most likely that Java was not installed (or that the registry is corrupt).

This error is often resolved by installing a Java version (i.e. 64-bit Java or 32-bit Java) that fits to the type of R version that you are using (i.e. 64-bit R or 32-bit R). This problem can easily effect Windows 7 users, since they might have installed a version of Java that is different than the version of R they are using.

Note that it is necessary to ‘manually download and install’ the 64 bit version of JAVA. By default, the download page gives a 32 bit version .

You can pick the exact version of Java you wish to install from this link. If you might (for some reason) work on both versions of R, you can install both version of Java (Installing the “Java Runtime Environment” is probably good enough for your needs).
(Source: Uwe Ligges)

Other possible solutions is trying to re-install rJava.

If that doesn’t work, you could also manually set the directory of your Java location by setting it before loading the library:

Sys.setenv(JAVA_HOME='C:\\Program Files\\Java\\jre7') # for 64-bit version
Sys.setenv(JAVA_HOME='C:\\Program Files (x86)\\Java\\jre7') # for 32-bit version
library(rJava)

(Source: “nograpes” from Stackoverflow, which also describes the find.java in the rJava:::.onLoad function)

StackOverFlow and MetaOptimize are battling to be the #1 "Statistical Analysis Q&A website” – to whom would you signup?

A new statistical analysis Q&A website launched

While the proposal for a statistical analysis Q&A website on area51 (stackexchange) is taking it’s time, and the website is still collecting people who will commit to it,
Joseph Turian, who seems a nice guy from his various comments online, seem to feel this website is not what the community needs and that we shouldn’t hold up on our questions for the website to go online. Therefore, Joseph is pushing with all his might his newest creation “MetaOptimize QA“, a StackOverFlow like website for (long list follows): machine learning, natural language processing, artificial intelligence, text analysis, information retrieval, search, data mining, statistical modeling, and data visualization.
With all the bells and whistles that the OSQA framework (an open source stackoverflow clone, and more, system) can offer (you know, rankings, badges and so on).

Is this new website better then the area51 website? Will all the people go to just one of the two websites. or will we end up with two places that attracts more people then we had to begin with? These are the questions that come to mind when faced with the story in front of us.

My own suggestion is to try both websites (the stackoverflow statistical analysis website to come and “MetaOptimize QA“) and let time tell.

More info on this story bellow.

MetaOptimize online impact so far

The need for such a Q&A site is clearly evident. With just several days after being promoted online, MetaOptimize has claimed the eyes of almost 300 users, submitting 59 questions and 129 answers.
Already many bloggers in the statistical community have contributed their voices with encouraging posts, here is just a collection of the post I was able to find with some googling:

But is it goos to have two websites?

But wait, didn’t we just start pushing forward another statistical Q&A website two weeks ago? I am talking about the Stack Exchange Q&A site proposal: Statistical Analysis.

So what should we (the community of statistical minded people) to do the next time we have a question?

Should we wait for Stack Exchange offer for a new website to start? Or should we start using MetaOptimize?

Update: after lengthy e-mail exchange with Joseph (the person who founded MetaOptimize), I decided to erase what I originally wrote as my doubts, and instead give a Q&A session that him and I have had in the e-mails exchange. It is a bit edited from what was originally, and some of the content will probably get updated – so if you are into this subject, check in again in a few hours 🙂

Honestly, I am split in two (and Joseph, I do hope you’ll take this in a positive way, since personally I feel confident you are a good guy). I very strongly believe in the need and value of such a Q&A website. Yet I am wondering how I feel about such a website being hosted as MetaOptimize and outside the hands of the stackoverflow guys.
On the one hand, open source lovers (like myself) tend to like decentralization and reliance on OSS (open source software) solutions (such as the one OSQA framework offers). On the other hand, I do believe that the stackoverflow people have (much) more experience in handling such websites then Joseph. I can very easily trust them to do regular database backups, share the websites database dumps with the general community, smoothly test and upgrade to provide new features, and generally speaking perform in a more experienced way with the online Q&A community.
It doesn’t mean that Joseph won’t do a great job, personally I hope he will.

Q&A session with Joseph Turian (MetaOptimize founder)

Tal: Let’s start with the easy question, should I worry about technical issues in the website (like, for example, backups)?

Joseph:

The OSQA team (backed by DZone) have got my back. They have been very helpful since day one to all OSQA users, and have given me a lot of support. Thanks, especially Rick and Hernani!

They provide email and chat support for OSQA users.

I will commit to putting up regular automatic database dumps, whenever the OSQA team implements it:
http://meta.osqa.net/questions/3120/how-do-i-offer-database-dumps
If, in six months, they don’t have this feature as part of their core, and someone (e.g. you) emails me reminding me that they want a dump, I will manually do a database dump and strip the user table.

Also, I’ve got a scheduled daily database dump that is mirrored to Amazon S3.

Tal: Why did you start MetaOptimize instead of supporting the area51 proposal?
Joseph:

On Area51, people asked to have AI merged with ML, and ML merged with statistical analysis, but their requests seemed to be ignored. This seemed like a huge disservice to these communities.
Area 51 didn’t have academics in ML + NLP. I know from experience it’s hard to get them to buy in to new technology. So why would I risk my reputation getting them to sign up for Area 51, when I know that I will get a 1% conversion? They aren’t early adopters interested in the process, many are late adopters who won’t sign up for something until they have too.
If the Area 51 sites had a strong newbie bent, which is what it seemed like the direction was going, then the academic experts definitely wouldn’t waste their time. It would become a support
community for newbies, without core expert discussion. So basically, I know that I and a lot of my colleagues wanted the site I built. And I felt like area 51 was shaping the communities really incorrectly in several respects, and was also taking a while. I could have fought an institutional process and maybe gotten half the results above and it took a few months, or I could just build the site and invite my friends, and shape the community correctly.

Besides that, there are also personal motives:

I wanted the recognition for having a good vision for the community, and driving forward something they really like.
I wanted to experiment with some NLP and ML extensions for the Q+A software, to help organize the information better. Not possible on a closed platform.

Tal: Me (and maybe some other people) fear that this might fork the people in the field to two websites, instead of bringing them together. What are your thoughts about that?
Joseph:
How am I forking the community? I’m bringing a bunch of people in who wouldn’t have even been part of the Area 51 community.
Area 51 was going to fork it into five communities: stat analysis, ML, NLP, AI, and data mining. And then a lot fewer people would have been involved.

Tal: What are the things that people who support your website are saying?
Joseph:
Here are some quotes about my site:

Philip Resnick (UMD): “Looking at the questions being asked, the people responding, and the quality of the discussion, I can already see this becoming the go-to place for those ‘under the hood’ details
you rarely see in the textbooks or conference papers. This site is going to save a lot of people an awful lot of time and frustration.”
Aria Haghighi (Berkeley): “Both NLP and ML have a lot of folk wisdom about what works and what doesn’t. A site like this is crucial for facilitating the sharing and validation of this collective knowledge.”
Alexandre Passos (Unicamp): “Really thank you for that. As a machine learning phd student from somewhere far from most good research centers (I’m in brazil, and how many brazillian ML papers have you
seen in NIPS/ICML recently?), I struggle a lot with this folk wisdom. Most professors around here haven’t really interacted enough with the international ML community to be up to date”
(http://news.ycombinator.com/item?id=1476247)
Ryan McDonald (Google): “A tool like this will help disseminate and archive the tricks and best practices that are common in NLP/ML, but are rarely written about at length in papers.”
esoom on Reddit: “This is awesome. I’m really impressed by the quality of some of the answers, too. Within five minutes of skimming the site, I learned a neat trick that isn’t widely discussed in the literature.”
(http://www.reddit.com/r/MachineLearning/comments/ckw5k/stackoverflow_for_machine_learning_and_natural/c0tb3gc)
Tal: In order to be fair to area51 work, they have gotten wonderful responses for the “statistical analysis” proposal as well (see it here)
I have also contacted area51 directly and asked them and invited them to come and join the discussion. I’ll update this post with their reply.

So what’s next?

I don’t know.
If the Stack Exchange website where to launch today, I would probably focus on using it and hint to the site for MetaOptimize (for the reasons I just mentioned, and also for some that Rob Hyndman maintained when he first wrote on the subject).
If the stack exchange version of the website where to start in a few weeks, I would probably sit on the fence and see if people are using it. I suspect that by that time, there wouldn’t be many people left to populate it (but I could always be wrong).
And what if the website where to start in a week, what then? I have no clue.
Good question.
My current feeling is that I am glad to let this play out.
It seems this is a good case study for some healthy competition between platforms and models (OSQA vs stackoverflow/area51-system) – one that I hope will generate more good features from both companies. And also will make both parties work hard to get people to participate.
It also seems that this situation is getting many people in our field to be approached with the same idea (Q&A website). After Joseph input on the subject, I am starting to think that maybe at the end of the day this will benefit all of us. Instead of forking one community into two, maybe what we’ll end up with is getting more (experienced) people online (into two locations) that would otherwise would have stayed in the shadows.

The verdict is still out, but I am a bit more optimistic than I was when first writing this post. I’ll update this post after getting more input from people.

And as always – I would love to know your thoughts on the subject.

A new Q&A website for Data-Analysis (based on StackOverFlow engine) – is waiting for you

The bottom line of this post is for you to go to:
Stack Exchange Q&A site proposal: Statistical Analysis
And commit yourself to using the website for asking and answering questions. 144 peoples already committed to using the website, we need 356 more… 🙂
If you are looking for the reasons to do so – read on…

What is the StackOverFlow Q&A website about?

StackOverFlow.com (“SO” for short) is a programming Q & A site that’s free. Free to ask questions, free to answer questions, free to read. Free, And fast.

For the R community, SO offers a growing database of R related questions and answer (click the link to check them out).

You might be asking yourself what’s so special about SO over other available resources such as R mailing lists, R blogs, R wiki and so on?
That is a great question.

The answer is that SO succeeds in doing a great job synthesizing aspects of Wikis, Blogs, Forums, and Digg/Reddit to offer a very powerful Q&A website.

In SO, the new questions are like forum/blog posts (A main text with comments/answers). After someone answers a question, other users can give a thumb-up or a thumb-down to the answer (like digg/reddit). And all content can be edited, like a wiki page, by the users (provided the user has enough “karma points”).
You also get badges (“awards”) for a bunch of actions (like coming to the website every day for a month. Giving an answer that got X amount of thumb-ups and so on). The awards allows someone who is asking a question to see how much the person who had answered him has good reputation (in terms of acceptance/appreciation of his answers by other SO members).
It also offers a small (but effective) ego-boost for the person who gives answers.

So if StackOverFlow is so great – what is this new website you wrote about in the title?

Well, StackOverFlow has one limitation. It deals ONLY with programming questions. Other questions like:

Which of the following three graphics best displays this data set? Why?
Can you give an example of where I might prefer to use a z-test vs a t-test?
What is the relationship between Bayesian and neural networks?

Will not be answered, and the threads will get closed as being “off topic”. Why? because such questions are dealing with: statistics, data analysis, data mining, data visualization – But in no means in programming.

So there is no StackOverFlow-like Q&A website for data analysis… Until now!

In the past few weeks, Rob Hyndman and other users, have made much effort to push the creation of a new website, based on the StackOverFlow engine, to allow for statistically related Q&A.
His proposal for a new website is almost complete. All it need is for you (yes you), to go to the following link:
Stack Exchange Q&A site proposal: Statistical Analysis
And commit yourself to the website (that is, click the button called “commit” – so to declare that you will have interest in reading, asking and answering questions on such a website)

Once a ~~few more tens~~ 379 more people will commit – the website will go online!

Hope to see you there.

Correlation scatter-plot matrix for ordered-categorical data

When analyzing a questionnaire, one often wants to view the correlation between two or more Likert questionnaire item’s (for example: two ordered categorical vectors ranging from 1 to 5).

When dealing with several such Likert variable’s, a clear presentation of all the pairwise relation’s between our variable can be achieved by inspecting the (Spearman) correlation matrix (easily achieved in R by using the “cor.test” command on a matrix of variables).
Yet, a challenge appears once we wish to plot this correlation matrix. The challenge stems from the fact that the classic presentation for a correlation matrix is a scatter plot matrix – but scatter plots don’t (usually) work well for ordered categorical vectors since the dots on the scatter plot often overlap each other.

There are four solution for the point-overlap problem that I know of:

Jitter the data a bit to give a sense of the “density” of the points
Use a color spectrum to represent when a point actually represent “many points”
Use different points sizes to represent when there are “many points” in the location of that point
Add a LOWESS (or LOESS) line to the scatter plot – to show the trend of the data

In this post I will offer the code for the a solution that uses solution 3-4 (and possibly 2, please read this post comments). Here is the output (click to see a larger image):

And here is the code to produce this plot:

Continue reading “Correlation scatter-plot matrix for ordered-categorical data”

R Flashmob

Today I noticed a call for R users to gather around a single campfire for one hour and share their questions and answers.

The campfire name is stackoverflow.com, a site dedicated for handling programming questions. The event details are bellow:

From: The R Flashmob Project
Subject: R Flashmob #2
You are invited to take part in R Flashmob, the project that makes the
world a better place by posting helpful questions and answers about the
R statistical language to the programmer’s Q & A site stackoverflow.com
Please forward this to other people you know who might like to join.
FAQ
Q. Why would I want to join an inexplicable R mob?
A. Tons of other people are doing it.
Q. Why else?
A. Stackoverflow was built specifically for handling programming questions.
It’s a better mousetrap. It offers search (and is well indexed by search engines),
tagging, voting, the ability to choose the “best” answer to a question, and the ability to
edit questions and answers as technology progresses. It has a karma system to
reward people who are happy to help and discourage MLJs (mailing list jerks).
Q. Do the organizers of this MOB have any commercial interest in stackoverflow?
A. None at all. We’re just convinced it is the best way to help and promote R. All
the content submitted to stackoverflow is protected by a Creative Commons
CC-Wiki License, meaning anyone is free to copy, distribute, transmit, and
remix the information on stackoverflow. All the content on stackoverflow is
regularly made available for download by the public.
INSTRUCTIONS – R MOB #2
Location: stackoverflow.com
Start Date: Tuesday, September 8th, 2009
Start Time:
10:04 AM – US Pacific
11:04 AM – US Mountain
12:04 PM – US Central
1:04 PM – US Eastern
6:04 PM – UK
7:04 PM – Continental W. Europe
5:04 AM (Weds) – New Zealand (birthplace of R)
Duration: 50 minutes
(1) At some point during the day on September 8th, synchronize your watch to
http://timeanddate.com/worldclock/personal.html?cities=137,75,64,179,136,37,22
(2) The mob should form at precisely 4 minutes past the hour and not beforehand.
(3) At 4 minutes past the hour, you should arrive at stackoverflow.com, log in,
and post 3 R questions. Be sure to tag the questions “R”. See the posting
guidelines at http://stackoverflow.com/faq to understand what makes a good
question.
(4) Follow R Flashmob updates at http://twitter.com/rstatsmob
(5) Post twitter messages tagged #rstats and #rstatsmob during the mob,
providing links to your questions.
(6) During the R MOB, you can chat with other participants on the #R channel
on IRC (freenode). To do this, install the Chatzilla extension on Firefox.
Click “freenode” on the main screen. Then type /join #R in the field at the
bottom of the screen. Then chat.
(7) If you finish posting your three questions within the 50 minutes, stick
around to answer questions and give “up votes” to good questions and answers.
(8) IMPORTANT: After posting, sign the R Flashmob guestbook at
https://bit.ly/6F8B2
(9) Return to what you would otherwise have been doing. Await
instructions for R MOB #3.

This invitation already gained exposure from 3 blogs:

I am waiting to see who else will join the fun.