Merging two data.frame objects while preserving the rows’ order

Tal Galili

11 years ago

Update (2017-02-03): the dplyr package offers a great solution for this issue; see the document Two-table verbs for more details.
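As a rough sketch of what the dplyr route can look like (an illustration added here, not code taken from the Two-table verbs document): to my understanding, left_join() keeps the rows of its first argument in their original order, so passing y first should keep y's order. The data mirrors the example used later in this post, and the block is guarded so it only runs when dplyr is installed.

```r
# Same example data as in the rest of the post
x <- data.frame(ref   = c("Ref1", "Ref2"),
                label = c("Label01", "Label02"),
                stringsAsFactors = FALSE)
y <- data.frame(id  = c("A1", "C2", "B3", "D4"),
                ref = c("Ref1", "Ref2", "Ref3", "Ref1"),
                val = c(1.11, 2.22, 3.33, 4.44),
                stringsAsFactors = FALSE)

if (requireNamespace("dplyr", quietly = TRUE)) {
  # y goes first because it is the object whose row order we want to keep;
  # rows of y without a match in x get NA in the "label" column
  joined <- dplyr::left_join(y, x, by = "ref")
  print(joined)
}
```

Note that the column order differs from the merge output: the columns of y come first, followed by the extra columns of x.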

Merging two data.frame objects in R is very easily done by using the merge function. While very powerful, the merge function does not (as of yet) offer a way to return a merged data.frame that preserves the original row order of one of the two merged data.frame objects.
In this post I describe this problem and offer some easy-to-use code to solve it.

Let us start with a simple example:

    x <- data.frame(
           ref = c( 'Ref1', 'Ref2' )
         , label = c( 'Label01', 'Label02' )
         )
    y <- data.frame(
          id = c( 'A1', 'C2', 'B3', 'D4' )
        , ref = c( 'Ref1', 'Ref2' , 'Ref3','Ref1' )
        , val = c( 1.11, 2.22, 3.33, 4.44 )
        )
 
#######################
# having a look at the two data.frame objects:
> x
   ref   label
1 Ref1 Label01
2 Ref2 Label02
> y
  id  ref  val
1 A1 Ref1 1.11
2 C2 Ref2 2.22
3 B3 Ref3 3.33
4 D4 Ref1 4.44

If we now merge the two objects, we will find that the order of the rows differs from the original order of the “y” object. This is true whether we use “sort=T” or “sort=F”. Note that the original order was ascending in the “val” variable:

> merge( x, y, by='ref', all.y = T, sort= T)
   ref   label id  val
1 Ref1 Label01 A1 1.11
2 Ref1 Label01 D4 4.44
3 Ref2 Label02 C2 2.22
4 Ref3    <NA> B3 3.33
> merge( x, y, by='ref', all.y = T, sort=F )
   ref   label id  val
1 Ref1 Label01 A1 1.11
2 Ref1 Label01 D4 4.44
3 Ref2 Label02 C2 2.22
4 Ref3    <NA> B3 3.33

This is explained in the help page of ?merge:

The rows are by default lexicographically sorted on the common columns, but for ‘sort = FALSE’ are in an unspecified order.

Or put differently: sort=FALSE doesn’t preserve the order of either of the two entered data.frame objects (x or y); instead, it gives us an unspecified (potentially random) order.

However, we may want the rows of the resulting merged data.frame to be ordered according to the order of one of the two original objects. To ensure that, we can add an extra “id” (row index number) sequence to the data.frame whose order we wish to keep. Then we can merge the two data.frame objects, sort by the sequence, and delete the sequence. (This was previously mentioned on the R-help mailing list by Bart Joosen.)
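The steps just described can be sketched by hand, without any wrapper function. This is only a minimal illustration: the column name ".row_order" is an arbitrary choice made for this sketch, and the data repeats the x and y objects defined earlier so the block can be run on its own.

```r
# Same example data as above
x <- data.frame(ref   = c("Ref1", "Ref2"),
                label = c("Label01", "Label02"),
                stringsAsFactors = FALSE)
y <- data.frame(id  = c("A1", "C2", "B3", "D4"),
                ref = c("Ref1", "Ref2", "Ref3", "Ref1"),
                val = c(1.11, 2.22, 3.33, 4.44),
                stringsAsFactors = FALSE)

# Step 1: add a temporary row-index column to the object whose order we keep
y_indexed <- y
y_indexed$.row_order <- seq_len(nrow(y_indexed))

# Step 2: merge; at this point the row order is not something we can rely on
merged <- merge(x, y_indexed, by = "ref", all.y = TRUE, sort = FALSE)

# Step 3: sort by the index; step 4: drop the index column
merged <- merged[order(merged$.row_order),
                 names(merged) != ".row_order",
                 drop = FALSE]
rownames(merged) <- NULL  # optional: renumber the rows

merged$id
# "A1" "C2" "B3" "D4"  (the original order of y)
```

Using all.y = TRUE matters here: without it, rows of y that have no match in x (such as Ref3) would be dropped from the result rather than kept in place.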

Following is a function that implements this logic, followed by an example for its use:

############## function:
	merge.with.order <- function(x,y, ..., sort = TRUE, keep_order)
	{
		# this function works just like merge, only that it adds the option to return the merged data.frame ordered by x (1) or by y (2)
		add.id.column.to.data <- function(DATA)
		{
			data.frame(DATA, id... = seq_len(nrow(DATA)))
		}
		# add.id.column.to.data(data.frame(x = rnorm(5), x2 = rnorm(5)))
		order.by.id...and.remove.it <- function(DATA)
		{
			# gets in a data.frame with the "id..." column.  Orders by it and returns it
			if(!any(colnames(DATA)=="id...")) stop("The function order.by.id...and.remove.it only works with data.frame objects which includes the 'id...' order column")
 
			ss_r <- order(DATA$id...)
			ss_c <- colnames(DATA) != "id..."
			DATA[ss_r, ss_c]
		}
 
		# tmp <- function(x) x==1; 1	# why we must check what to do if it is missing or not...
		# tmp()
 
		if(!missing(keep_order) && !is.null(keep_order))
		{
			if(keep_order == 1) return(order.by.id...and.remove.it(merge(x=add.id.column.to.data(x),y=y,..., sort = FALSE)))
			if(keep_order == 2) return(order.by.id...and.remove.it(merge(x=x,y=add.id.column.to.data(y),..., sort = FALSE)))
			# if you didn't get "return" by now - issue a warning, then fall through to the plain merge below.
			warning("The function merge.with.order only accepts NULL/1/2 values for the keep_order variable")
		}
		# keep_order is missing, NULL, or invalid: behave exactly like merge
		return(merge(x=x,y=y,..., sort = sort))
	}
 
######### example:
>     merge( x, y, by='ref', all.y = T, sort=F )
   ref   label id  val
1 Ref1 Label01 A1 1.11
2 Ref1 Label01 D4 4.44
3 Ref2 Label02 C2 2.22
4 Ref3    <NA> B3 3.33
>     merge.with.order( x, y, by='ref', all.y = T, sort=F ,keep_order = 1)
   ref   label id  val
1 Ref1 Label01 A1 1.11
2 Ref1 Label01 D4 4.44
3 Ref2 Label02 C2 2.22
4 Ref3    <NA> B3 3.33
>     merge.with.order( x, y, by='ref', all.y = T, sort=F ,keep_order = 2) # yay - works as we wanted it to...
   ref   label id  val
1 Ref1 Label01 A1 1.11
3 Ref2 Label02 C2 2.22
4 Ref3    <NA> B3 3.33
2 Ref1 Label01 D4 4.44

Here is a description of how to use the keep_order parameter:

keep_order can accept the numbers 1 or 2, in which case it will make sure the resulting merged data.frame is ordered according to the original order of rows of the data.frame entered as x (if keep_order=1) or as y (if keep_order=2). If keep_order is missing, merge will continue working as usual. If keep_order gets some input other than 1 or 2, it will issue a warning that it doesn’t accept these values, but will continue working as merge normally would. Notice that the parameter “sort” is effectively overridden when using keep_order (with the value 1 or 2).

The same code can be used to modify the original merge.data.frame function in base R so as to allow the use of keep_order; here is a link to the patched merge.data.frame function (on GitHub). If you can think of any way to improve the function (or happen to notice a bug), please let me know either on GitHub or in the comments. (It would also be fun to hear that you found the function useful.)

Update: Thanks to KY’s comment, I noticed the ?join function in the {plyr} package. This function is similar to merge (with fewer features, yet faster), and also automatically keeps the order of the x (first) data.frame used for merging, as explained in the ?join help page:

Unlike merge, (join) preserves the order of x no matter what join type is used. If needed, rows from y will be added to the bottom. Join is often faster than merge, although it is somewhat less featureful – it currently offers no way to rename output or merge on different variables in the x and y data frames.
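A small sketch of how join could be applied to the example from this post, assuming the plyr package is installed (the block is guarded so it is skipped otherwise). Since join keeps the order of its first argument, y is passed first here; this is an illustration added for clarity, not output from the original post.

```r
# Same example data as above
x <- data.frame(ref   = c("Ref1", "Ref2"),
                label = c("Label01", "Label02"),
                stringsAsFactors = FALSE)
y <- data.frame(id  = c("A1", "C2", "B3", "D4"),
                ref = c("Ref1", "Ref2", "Ref3", "Ref1"),
                val = c(1.11, 2.22, 3.33, 4.44),
                stringsAsFactors = FALSE)

if (requireNamespace("plyr", quietly = TRUE)) {
  # type = "left" keeps every row of the first argument (y),
  # filling "label" with NA where x has no matching "ref"
  joined <- plyr::join(y, x, by = "ref", type = "left")
  print(joined)
}
```

As with the dplyr approach, the columns of the first argument come first in the result, so the column order differs from what merge( x, y, ... ) returns.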