In this post I will provide R code that implement’s the combination of repeated running quantile with the LOESS smoother to create a type of “quantile LOESS” (e.g: “Local Quantile Regression”).
This method is useful when the need arise to fit robust and resistant (Need to be verified) a smoothed line for a quantile (an example for such a case is provided at the end of this post).
If you wish to use the function in your own code, simply run inside your R console the following line:
source("https://www.r-statistics.com/wp-content/uploads/2010/04/Quantile.loess_.r.txt")
Background
I came a cross this idea in an article titled “High throughput data analysis in behavioral genetics” by Anat Sakov, Ilan Golani, Dina Lipkind and my advisor Yoav Benjamini. From the abstract:
In recent years, a growing need has arisen in different fields, for the development of computational systems for automated analysis of large amounts of data (high-throughput). Dealing with non-standard noise structure and outliers, that could have been detected and corrected in manual analysis, must now be built into the system with the aid of robust methods. […] we use a non-standard mix of robust and resistant methods: LOWESS and repeated running median.
The motivation for this technique came from “Path data” (of mice) which is
prone to suffer from noise and outliers. During progression a tracking system might lose track of the animal, inserting (occasionally very large) outliers into the data. During lingering, and even more so during arrests, outliers are rare, but the recording noise is large relative to the actual size of the movement. The statistical implications are that the two types of behavior require different degrees of smoothing and resistance. An additional complication is that the two interchange many times throughout a session. As a result, the statistical solution adopted needs not only to smooth the data, but also to recognize, adaptively, when there are arrests. To the best of our knowledge, no single existing smoothing technique has yet been able to fulfill this dual task. We elaborate on the sources of noise, and propose a mix of LOWESS (Cleveland, 1977) and the repeated running median (RRM; Tukey, 1977) to cope with these challenges
If all we wanted to do was to perform moving average (running average) on the data, using R, we could simply use the rollmean function from the zoo package.
But since we wanted also to allow quantile smoothing, we turned to use the rollapply function.
R function for performing Quantile LOESS
Here is the R function that implements the LOESS smoothed repeated running quantile (with implementation for using this with a simple implementation for using average instead of quantile):
# This code relies on the rollapply function from the "zoo" package. My thanks goes to Achim Zeileis and Gabor Grothendieck for their work on the package.
Quantile.loess<- function(Y, X = NULL,
number.of.splits = NULL,
window.size = 20,
percent.of.overlap.between.two.windows = NULL,
the.distance.between.each.window = NULL,
the.quant = .95,
window.alignment = c("center"),
window.function = function(x) {quantile(x, the.quant)},
# If you wish to use this with a running average instead of a running quantile, you could simply use:
# window.function = mean,
...)
{
# input: Y and X, and smothing parameters
# output: new y and x
# Extra parameter "..." goes to the loess
# window.size == the number of observation in the window (not the window length!)
# "number.of.splits" will override "window.size"
# let's compute the window.size:
if(!is.null(number.of.splits)) {window.size <- ceiling(length(Y)/number.of.splits)}
# If the.distance.between.each.window is not specified, let's make the distances fully distinct
if(is.null(the.distance.between.each.window)) {the.distance.between.each.window <- window.size}
# If percent.of.overlap.between.windows is not null, it will override the.distance.between.each.window
if(!is.null(percent.of.overlap.between.two.windows))
{
the.distance.between.each.window <- window.size * (1-percent.of.overlap.between.two.windows)
}
# loading zoo
if(!require(zoo))
{
print("zoo is not installed - please install it.")
install.packages("zoo")
}
if(is.null(X)) {X <- index(Y)} # if we don't have any X, then Y must be ordered, in which case, we can use the indexes of Y as X.
# creating our new X and Y
zoo.Y <- zoo(x = Y, order.by = X)
#zoo.X <- attributes(zoo.Y)$index
new.Y <- rollapply(zoo.Y, width = window.size,
FUN = window.function,
by = the.distance.between.each.window,
align = window.alignment)
new.X <- attributes(new.Y)$index
new.Y.loess <- loess(new.Y~new.X, family = "sym",...)$fitted
return(list(y = new.Y, x = new.X, y.loess = new.Y.loess))
}
More on the math of the algorithm can be found in the original article.
Example: Predicting "worst case scenario" Ozone levels using temperature
The following example uses the "airquality" dataset which gives us "Daily air quality measurements in New York, May to September 1973." With several variables, we will only look at Ozone level and Temperature.
Since high Ozone levels reduces the air quality we breath, I would like to give a prediction of the predicted "worst case" Ozone level (e.g: 95% Ozone level) using to the temperature of the same day.
How would you try to do something like that?
The first solution would be to use the "rq" function from the Quantile Regression R package, but if we where to look at the data, we would see that fitting a straight line is not suitable for our data (since we have a sharp change in slope around the temperature of 80 degrees).
This is a situation where Quantile LOESS (of 95%) might prove to be useful. Here is the code to produce the above plot.
data(airquality)
attach(airquality)
no.na <- !(is.na(Ozone) | is.na(Temp))
Ozone.2 <- Ozone[no.na]
Temp.2 <- jitter(Temp[no.na])
plot(Ozone ~ Temp, main = "Predicting the 95% Ozone level according to Temperature")
# fitting the Quantile regression
require(quantreg)
abline(rq(Ozone ~ Temp, tau = .95), col = "red")
# fitting the Quantile LOESS
source("https://www.r-statistics.com/wp-content/uploads/2010/04/Quantile.loess_.r.txt")
QL <- Quantile.loess(Y = Ozone.2, X = Temp.2,
the.quant = .95,
window.size = 10,
window.alignment = c("center"))
points(QL$y.loess ~ QL$x, type = "l", col = "green")
legend("topleft",legend = c("95% Quantile regression", "95% Quantile LOESS"), fill = c("red","green"))
Update: I changed in the article's name from LOWESS to LOESS
After A considerate e-mail from Dirk Eddelbuettel I corrected myself from using LOWESS to LOESS throughout the article. Here's an explanation to why I did it and also why I corrected it -
Dirk wrote to me:
You have a post entitled 'quantile lowess' but you then (correctly) use loess. Do you understand that there are two functions lowess() and loess()?
The former is sort-of a predecessor but nobody but really old books still talks about it. Google for (maybe) 'Brian Ripley lowess loess' as he drove
that point home a few times on r-help.
My answer was:
Thanks Dirk, [...]
Regarding the loess != lowess, I noticed that this is indeed the case when I first wrote the post but I was in a predicament:
On the one hand, LOESS is the more modern approach (and what I used in the script). But on the other hand, LOWESS is what the original article's authors where using. I ended up deciding I would call it the way I did, but after reading what you wrote, I realized I made a mistake.
I went through the article and corrected the lowess to loess, while also adding a paragraph for explain my reasoning.
Update: regarding the method being robust
After Nicholas's comment I went checking and came across a R-help thread by
Martin Maechler explaining how to update my code from above so that the system will be robust. Martin wrote (My notes are added in []):
One gotcha [when comparing lowess to loess is]-- particularly if you were used to the fact that lowess() by default is resistant to outliers {well, in many cases at least} :
- lowess() per default has "iter = 3" which means it uses 3 ``robustifying'' (also called "huberizing" for Huber (1960)) iterations .
- loess() on the other hand has an argument `family' with possible values "gaussian" and "symmetric" (can be abbreviated) where the *first* one is the default (unfortunately, in my opinion).
I.e., loess() by default is not resistant/robust where as lowess() is. [...] I would however recommend using loess(....., family = "sym") routinely.
* * *
If you find this code useful, please let me know about it in the comments.