The audio files of the full talk by Richard Stallman are attached to the end of this post.
—————–
Videos of all the invited talks of the useR! 2010 conference can be viewed on the R User Group blog
—————–
Last week I had the honor of attending the talk given by Richard Stallman, the last keynote speaker at the useR! 2010 conference. In this post I will give a brief context for the talk, and then share the audio files of the talk, along with some description of what was said.
Context for the talk
Richard Stallman can be viewed as (one of) the fathers of free software (free as in speech, not as in beer).
He is the man who led the GNU project for the creation of a free (as in speech, not as in beer) operating system, on the basis of which GNU-Linux, with its numerous distributions, was created. Richard also developed a number of pieces of widely used software, including the original Emacs, the GNU Compiler Collection, the GNU Debugger, and many tools in GNU Coreutils.
Richard also initiated the free software movement, founded its formal organization (the Free Software Foundation) in October 1985, and co-founded the League for Programming Freedom in 1989.
Stallman pioneered the concept of “copyleft” and he is the main author of several copyleft licenses including the GNU General Public License, the most widely used free software license.
You can read about him in the Wikipedia article titled "Richard Stallman".
The useR! 2010 conference is an annual four-day conference of the community of people using R. R is free, open-source software for data analysis and statistical computing (here is a bit more about what R is).
The conference this year was truly a wonderful experience for me. I had the pleasure of giving two talks (about which I will blog later this month), listened to numerous talks on the use of R, and had a chance to meet many (many) kind and interesting people.
Richard Stallman's talk
The talk took place on July 23rd, 2010, at NIST (U.S.), and was the concluding talk of the useR! 2010 conference. The talk consisted of a two-hour lecture followed by a half-hour question-and-answer session.
On a personal note, I was very impressed by Richard's talk. Richard is not a shy computer geek, but rather a serious leader and thinker trying to stir people to action. His speech was a sermon on free software, the history of GNU-Linux, the various versions of the GPL, and his own history involving them.
I believe this talk would be of interest to anyone who cares about social solidarity, free software, programming and the hope of a better world for all of us.
I am eager for your thoughts in the comments (but please keep a kind tone).
(And also consider giving the contender, MetaOptimize, a visit)
* * * *
Statistical analysis Q&A website is about to go into BETA
A month ago I invited readers of this blog to commit to using a new Q&A website for data analysis (based on the StackOverFlow engine), once it opened (the site was originally proposed by Rob Hyndman). And now, a month later, I am happy to write that over 500 people have shown interest in the website and chose to commit. This means we have reached 100% completion of the website proposal process, and in the next few days we will move to the next step.
The next step is that the website will go into closed BETA for about a week. If you want to be part of this – now is the time to join (<--- call for action people).
From having been part of some other closed BETAs of similar projects, I can attest that the enthusiasm of the people trying to answer questions during the BETA is very impressive, so I strongly recommend the experience. If you don't make it in by the time you see this post, then no worries - about a week or so after the website goes online, it will be open to the general public. (p.s: thanks Romunov for pointing out to me that the BETA is about to open)
p.s: MetaOptimize
I would like to finish this post by mentioning MetaOptimize. This is a Q&A website with more of a "machine learning" than a "statistical" community. It also started a short while ago, and already it has around 700 users who have submitted ~160 questions, with ~520 answers given. From my experience on the site so far, I have enjoyed the high quality of the questions and answers. When I first came across the website, I feared that supporting it would split the community of R users between this website and the Area 51 StackExchange proposal. But after a lengthy discussion (published recently as a post) with MetaOptimize founder Joseph Turian, I came to have a more optimistic view of the competition between the two websites. Where at first I was afraid, I am now hopeful that each of the two websites will manage to draw somewhat different communities of people (who would otherwise not be present on the other website) - thus offering all of us a wider variety of knowledge to tap into.
(Written by Ian Fellows)
The RForge build error has been fixed. The package can now be tried with: install.packages("Deducer",,"http://www.rforge.net",type="source")
As prolific as CRAN is in packages, several R packages succeed in standing out for their widespread use (and quality); Hadley Wickham's ggplot2 and plyr are two such packages.
And today (through twitter) Hadley has updated the rest of us with the news:
just released new versions of plyr and ggplot2. source versions available on cran, compiled will follow soon #rstats
Going to the CRAN website shows that plyr has gone through the more major update of the two, with the last update (before the current one) taking place on 2009-06-23. And now, over a year later, we are presented with plyr version 1, which includes new functions, new features, some bug fixes, and much-anticipated speed improvements. ggplot2 has made a tiny leap from version 0.8.7 to 0.8.8, and was previously last updated on 2010-03-03.
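Until the compiled binaries show up on CRAN, users who have the standard build tools installed can already grab the new source versions; a minimal sketch (just the usual install.packages call, nothing package-specific):
install.packages(c("plyr", "ggplot2"), type = "source") # install the freshly released source versions from CRAN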
I, and I am sure many other R users, am very thankful for the amazing work that Hadley Wickham is doing (both on his code and in helping other useRs on the help lists). So Hadley, thank you!
Later in the conversation, Achim Zeileis surprised us (well, me) by writing the following:
I’ve thought about adding a plot() method for the coeftest() function in the “lmtest” package. Essentially, it relies on a coef() and a vcov() method being available – and that a central limit theorem holds. For releasing it as a general function in the package the code is still too raw, but maybe it’s useful for someone on the list. Hence, I’ve included it below.
(I allowed myself to add some bolds in the text)
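The idea Achim describes boils down to pairing each estimate from coef() with a Wald confidence interval built from the diagonal of vcov(), relying on approximate normality. Here is a minimal sketch of that idea (this is not Achim's code, which is linked below; simple_coefplot is a made-up name for illustration):
simple_coefplot <- function(model, level = 0.95) {
  est <- coef(model) # point estimates
  se <- sqrt(diag(vcov(model))) # standard errors from the variance-covariance matrix
  z <- qnorm(1 - (1 - level) / 2) # normal quantile for the requested confidence level
  plot(est, seq_along(est), xlim = range(est - z * se, est + z * se),
       pch = 19, yaxt = "n", xlab = "Coefficient estimate", ylab = "")
  axis(2, at = seq_along(est), labels = names(est), las = 1) # coefficient names on the y axis
  segments(est - z * se, seq_along(est), est + z * se, seq_along(est)) # confidence intervals
  abline(v = 0, lty = 2) # reference line at zero
}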
So for the convenience of all of us, I uploaded Achim’s code in a file for easy access. Here is an example of how to use it:
source("https://www.r-statistics.com/wp-content/uploads/2010/07/coefplot.r.txt") # load Achim's coefplot() function
data("Mroz", package = "car") # labor-force participation data
fm <- glm(lfp ~ ., data = Mroz, family = binomial) # logistic regression model
coefplot(fm, parm = -1) # plot the coefficients, excluding the first (the intercept)
Here is the resulting graph:
I hope Achim will get around to improving the function so that he might think it worthy of joining his "lmtest" package. I am glad he shared his code for the rest of us to have something to work with in the meantime 🙂
* * *
Update (07.07.10): Thanks to a comment by David Atkins, I found out there is a more mature version of this function (called coefplot) inside the {arm} package. This version offers many features, one of which is the ability to easily stack several confidence intervals one on top of the other.
It works for bayesglm, glm, lm, and polr objects, and a default method is available which takes pre-computed coefficients and associated standard errors from any suitable model.
Example (notice that comparing the probit model with the logit models does not make much sense directly, but it is enough to illustrate the use of the function):
library("arm")
data("Mroz", package = "car")
M1 <- glm(lfp ~ ., data = Mroz, family = binomial) # logit model
M2 <- bayesglm(lfp ~ ., data = Mroz, family = binomial) # Bayesian logit model
M3 <- glm(lfp ~ ., data = Mroz, family = binomial(probit)) # probit model
coefplot(M2, xlim = c(-2, 6), intercept = TRUE) # plot M2 first, setting up the axes
coefplot(M1, add = TRUE, col.pts = "red", intercept = TRUE) # overlay M1 in red
coefplot(M3, add = TRUE, col.pts = "blue", intercept = TRUE, offset = 0.2) # overlay M3 in blue, slightly offset
(Hat tip to Allan Engelhardt for helping to improve the code, and to Achim Zeileis for extending and improving the narration of the example)
Resulting plot
* * *
Lastly, another method worth mentioning is the nomogram, implemented in Frank Harrell's rms package.
Competitions with prizes are an amazing thing. If you are not sure of that, I urge you to listen to Peter Diamandis talk about his experience with the X Prize (start listening at minute 11:40):
In short: prizes can give up to a 1-to-50 ratio of return on the investment of the people funding the prize. The money is spent only when results are achieved. There is a lot of value in terms of public opinion and publicity. And best of all (for the promoter of the competition), prizes encourage people to take risks (at their own expense) in order to get results.
All of that said, I view prize-bearing competitions as something worth spreading, especially in cases where the results of the winning team will be shared with the public.
About the IEEE ICDM Contest
The IEEE ICDM Contest ("Road Traffic Prediction for Intelligent GPS Navigation") seems to be one of those cases. Due to a polite request, I am republishing here the details of this new competition, in the hope that some of my R colleagues will bring the community some pride 🙂
(Written by Ian Fellows)
Below is a link to the first of a weekly (or bi-weekly) screen-cast vlog of my progress building a GUI for the ggplot2 package:
http://neolab.stat.ucla.edu/cranstats/gsoc_vlog1.mov
Comments and suggestions are more than welcome…
The bottom line of this post is for you to go to the Stack Exchange Q&A site proposal: Statistical Analysis, and commit yourself to using the website for asking and answering questions. 144 people have already committed to using the website; we need 356 more… 🙂 If you are looking for reasons to do so – read on…
What is the StackOverFlow Q&A website about?
StackOverFlow.com ("SO" for short) is a programming Q&A site that's free. Free to ask questions, free to answer questions, free to read. Free, and fast.
You might be asking yourself what’s so special about SO over other available resources such as R mailing lists, R blogs, R wiki and so on? That is a great question. The answer is that SO succeeds in doing a great job synthesizing aspects of Wikis, Blogs, Forums, and Digg/Reddit to offer a very powerful Q&A website.
In SO, new questions are like forum/blog posts (a main text with comments/answers). After someone answers a question, other users can give a thumbs-up or a thumbs-down to the answer (like digg/reddit). And all content can be edited, like a wiki page, by the users (provided the user has enough "karma points"). You also get badges ("awards") for a bunch of actions (like coming to the website every day for a month, giving an answer that got X thumbs-up, and so on). The awards allow someone who is asking a question to see how good a reputation the person answering has (in terms of acceptance/appreciation of their answers by other SO members). They also offer a small (but effective) ego boost for the person who gives answers.
So if StackOverFlow is so great – what is this new website you wrote about in the title?
Well, StackOverFlow has one limitation. It deals ONLY with programming questions. Other questions like:
Which of the following three graphics best displays this data set? Why?
Can you give an example of where I might prefer to use a z-test vs a t-test?
What is the relationship between Bayesian and neural networks?
will not be answered, and the threads will get closed as "off topic". Why? Because such questions deal with statistics, data analysis, data mining, and data visualization – but by no means with programming.
So there is no StackOverFlow-like Q&A website for data analysis… Until now!
In the past few weeks, Rob Hyndman and other users have made much effort to push the creation of a new website, based on the StackOverFlow engine, to allow for statistically related Q&A. His proposal for the new website is almost complete. All it needs is for you (yes, you) to go to the following link: Stack Exchange Q&A site proposal: Statistical Analysis, and commit yourself to the website (that is, click the button called "commit" – to declare that you will have an interest in reading, asking, and answering questions on such a website).
Once 379 more people commit – the website will go online!
In hierarchical cluster analysis dendrogram graphs are used to visualize how clusters are formed. I propose an alternative graph named “clustergram” to examine how cluster members are assigned to clusters as the number of clusters increases. This graph is useful in exploratory analysis for non-hierarchical clustering algorithms like k-means and for hierarchical cluster algorithms when the number of observations is large enough to make dendrograms impractical.
A similar article was later written and was (maybe) published in "Computational Statistics".
Both articles give some nice background on known methods like k-means and methods for hierarchical clustering, and then go on to present examples of using these methods (with the clustergram) to analyze some datasets.
Personally, I understand the clustergram to be a type of parallel coordinates plot where each observation is given a vector. The vector contains the observation’s location according to how many clusters the dataset was split into. The scale of the vector is the scale of the first principal component of the data.
Clustergram in R (a basic function)
After finding out about this method of visualization, I was seized by curiosity to play with it a bit. Therefore, and since I didn't find any implementation of the graph in R, I went about writing the code to implement it.
The code only works for kmeans, but it shows how such a plot can be produced, and could later be modified to offer methods that connect to different clustering algorithms.
How does the function work: the function I present here takes a data.frame/matrix with a row for each observation and the variable dimensions in the columns. The function assumes the data is scaled. It then calculates the cluster centers for our data, for a varying number of clusters. For each number of clusters, the cluster centers are multiplied by the first loading of the principal components of the original data, thus offering a weighted mean of each cluster center's dimensions that might give a decent representation of that cluster (this method has the known limitations of using the first component of a PCA for dimensionality reduction, but I won't go into that in this post). Finally, all of our data points are ordered according to the projected value of their respective cluster center, and plotted against the number of clusters (thus creating the clustergram).
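To make the computation concrete, here is a minimal sketch of the core projection step (this is only an illustration; it is not the clustergram() function sourced below, and clustergram_centers is a made-up name):
clustergram_centers <- function(Data, k.range = 2:8) {
  pc1 <- prcomp(Data)$rotation[, 1] # loadings of the first principal component
  sapply(k.range, function(k) {
    fit <- kmeans(Data, centers = k) # cluster the data into k clusters
    center_1d <- as.vector(fit$centers %*% pc1) # weighted mean of each cluster center's dimensions
    center_1d[fit$cluster] # one projected value per observation
  }) # a matrix: rows = observations, columns = values of k.range
}
Plotting each row of this matrix as a line against k.range gives the clustergram.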
My thanks go to Hadley Wickham for offering some good tips on how to prepare the graph.
source("https://www.r-statistics.com/wp-content/uploads/2012/01/source_https.r.txt") # Making sure we can source code from github
source_https("https://raw.github.com/talgalili/R-code-snippets/master/clustergram.r")
data(iris)
set.seed(250)
par(cex.lab = 1.5, cex.main = 1.2)
Data <- scale(iris[,-5]) # notice I am scaling the variables
clustergram(Data, k.range = 2:8, line.width = 0.004) # notice how I am using line.width. Play with it on your problem, according to the scale of Y.
Here is the output:
Looking at the image we can notice a few interesting things. We notice that one of the clusters formed (the lower one) stays as is no matter how many clusters we allow (except for one observation that goes away and then comes back). We can also see that the second split is a solid one (in the sense that it splits the first cluster into two clusters which are not "close" to each other, and that about half the observations go to each of the new clusters). And then notice how moving to 5 clusters makes almost no difference. Lastly, notice how when going for 8 clusters, we are practically left with 4 clusters (remember - this is according to the mean of the cluster centers weighted by the loadings of the first component of the PCA on the data).
If I were to take something from this graph, I would say I have a strong tendency to use 3-4 clusters on this data.
But wait, did our clustering algorithm do a stable job? Let's try running the algorithm 6 more times (each run will have a different starting point for the clusters)
source("https://www.r-statistics.com/wp-content/uploads/2012/01/source_https.r.txt") # Making sure we can source code from github
source_https("https://raw.github.com/talgalili/R-code-snippets/master/clustergram.r")
set.seed(500)
Data <- scale(iris[,-5]) # notice I am scaling the variables
par(cex.lab = 1.2, cex.main = .7)
par(mfrow = c(3,2))
for(i in 1:6) clustergram(Data, k.range = 2:8 , line.width = .004, add.center.points = T)
Resulting with: (press the image to enlarge it)
Repeating the analysis offers even more insights. First, it would appear that up to 3 clusters, the algorithm gives rather stable results. From 4 clusters onwards we get various outcomes at each iteration. In some of the cases, we got 3 clusters when we asked for 4 or even 5 clusters.
Reviewing the new plots, I would prefer to go with the 3-cluster option, noting how the two "upper" clusters might have similar properties while the lower cluster is quite distinct from the other two.
By the way, the Iris data set is composed of three types of flowers. I imagine k-means did a decent job in distinguishing the three.
Limitation of the method (and a possible way to overcome it?!)
It is worth noting that the current way the algorithm is built has a fundamental limitation: the plot is good for detecting a situation where there are several clusters but each of them is clearly "bigger" than the one before it (on the first principal component of the data).
For example, let's create a dataset with 3 clusters, each one taken from a normal distribution with a higher mean:
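Here is a hypothetical way to generate such a dataset (not necessarily the exact code behind the figure below): three clusters drawn from normal distributions with increasing means, run through the same clustergram() call as before.
set.seed(250)
Data <- rbind(
  matrix(rnorm(100 * 3, mean = 0), ncol = 3), # first cluster, centered at 0
  matrix(rnorm(100 * 3, mean = 2), ncol = 3), # second cluster, centered at 2
  matrix(rnorm(100 * 3, mean = 5), ncol = 3) # third cluster, centered at 5
)
clustergram(scale(Data), k.range = 2:8, line.width = 0.004)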
The image shows a clear distinction between three ranks of clusters. There is no doubt (for me) from looking at this image, that three clusters would be the correct number of clusters.
But what if the clusters were different but didn't have an ordering to them? For example, look at the following 4-dimensional data:
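Again, a hypothetical sketch of this kind of data (not necessarily the exact code behind the figure below): four clusters in four dimensions, each shifted along a different dimension, so no single axis orders them.
set.seed(250)
Data4 <- do.call(rbind, lapply(1:4, function(i) {
  cluster_i <- matrix(rnorm(100 * 4), ncol = 4) # 100 observations in 4 dimensions
  cluster_i[, i] <- cluster_i[, i] + 5 # shift cluster i along dimension i only
  cluster_i
}))
clustergram(scale(Data4), k.range = 2:8, line.width = 0.004)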
In this situation, it is not clear from the location of the clusters on the Y axis that we are dealing with 4 clusters. But what is interesting is that, as the number of clusters grows, we can notice that there are 4 "strands" of data points moving more or less together (until we reach 4 clusters, at which point the clusters start breaking up). Another hope for handling this might be to use the color of the lines in some way, but I haven't yet figured out how.
Clustergram with ggplot2
Hadley Wickham has kindly played with recreating the clustergram using the ggplot2 engine. You can see the result here: http://gist.github.com/439761 And this is what he wrote about it in the comments:
I’ve broken it down into three components:
* run the clustering algorithm and get predictions (many_kmeans and all_hclust)
* produce the data for the clustergram (clustergram)
* plot it (plot.clustergram)
I don’t think I have the logic behind the y-position adjustment quite right though.
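For readers curious how the plotting component might look with ggplot2, here is a hypothetical sketch (this is not the code from Hadley's gist; the column names obs, k and center are made up), assuming the clustergram data has already been assembled into a long data frame:
library(ggplot2)
plot_clustergram_df <- function(df) {
  # df is assumed to hold one row per observation per value of k, with columns:
  # obs (observation id), k (number of clusters), center (PCA-weighted cluster center)
  ggplot(df, aes(x = k, y = center, group = obs)) +
    geom_line(alpha = 0.2) + # one semi-transparent line per observation
    labs(x = "Number of clusters (k)", y = "PCA-weighted cluster center")
}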
Conclusions (some rules of thumb and questions for the future)
At first look, it would appear that the clustergram can be of use. I can imagine using this graph to quickly run various clustering algorithms and then compare them to each other and review their stability (in the way I just demonstrated in the example above).
The three rules of thumb I have noticed by now are:
Look at the location of the cluster points on the Y axis. See when they remain stable, when they start flying around, and what happens to them at higher numbers of clusters (do they regroup together?)
Observe the strands of the data points. Even if the cluster centers are not ordered, the lines for each item might (this needs more research and thinking) tend to move together - hinting at the real number of clusters
Run the plot multiple times to observe the stability of the cluster formation (and location)
Yet there is more work to be done and questions to seek answers to:
The code needs to be extended to offer methods to various clustering algorithms.
How can the colors of the lines be used better?
How can this be done using other graphical engines (ggplot2/lattice?) - (Update: look at Hadley's reply in the comments)
What to do in case the first principal component doesn't capture enough of the data? (Maybe plot this graph for all the relevant components, but then - how do you draw conclusions from it?)
What other uses/conclusions can be made based on this graph?
I am looking forward to reading your input/ideas in the comments (or in reply posts).