R - Subset data frame based on unique combination of multiple conditions

February 15, 2015

I can't seem to find an answer to this by searching on SO. I'm trying to select a subset of a data.frame based on 4 conditions (lon1, lon2, lat1, lat2). I have a huge dissimilarity matrix that has been vectorized, with the sites (lon1, lon2, lat1, lat2) cbind-ed to it. Here is an example data frame:

out1 <- data.frame(lon1 = sample(1:10), lon2 = sample(1:10),
                   lat1 = sample(1:10), lat2 = sample(1:10),
                   dissimilarity = sample(seq(0,1,.1),10))
> out1
   lon1 lon2 lat1 lat2 dissimilarity
1     2    6    4    4           0.6
2     4    2    1    3           1.0
3    10    9    2    6           0.0
4     3    1   10    8           0.5
5     9    5    9    1           0.8
6     5    7    5    9           0.9
7     1    8    6    7           0.2
8     8    3    8    5           0.7
9     7    4    3   10           0.3
10    6   10    7    2           0.1

out2 <- out1[c(2,5,6,8),]
  lon1 lon2 lat1 lat2 dissimilarity
1    4    2    1    3           1.0
2    9    5    9    1           0.8
3    5    7    5    9           0.9
4    8    3    8    5           0.7

I tried using the %in% function a few times in this manner:

test <- out1[(out1$lon1 %in% out2$lon1) & (out1$lon2 %in% out2$lon2) &
             (out1$lat1 %in% out2$lat1) & (out1$lat2 %in% out2$lat2), ]

This seems to work for the basic example I provide here. But when I apply it to my huge data frame (with many lats and lons repeated), I get a larger subset than the unique combinations I require. I assume this is because the match function behind %in% can only match one vector at a time. So it is evaluating condition1 & condition2 & condition3 & condition4 column by column and returning every row that passes, which gives a subset much the same as the original out1. I only want the cases where all 4 values match within the same row. That way I'll get just the subset of data with the pairwise dissimilarities I'm interested in.
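To make the over-matching concrete, here is a minimal sketch on made-up data. The data frames `big` and `wanted` are illustrative and not from the question, and only two key columns are used instead of four to keep it short. The column-wise %in% test accepts a row as soon as each value occurs somewhere in the matching column of the target, while a composite key built with paste() only accepts exact row combinations.

```r
# Illustrative data: 'wanted' holds the exact combinations we are after.
big <- data.frame(lon1 = c(1, 1, 2, 2),
                  lon2 = c(5, 6, 5, 6),
                  dissimilarity = c(0.1, 0.2, 0.3, 0.4))
wanted <- data.frame(lon1 = c(1, 2), lon2 = c(5, 6))

# Column-wise test: each value only has to occur somewhere in its column.
colwise <- big$lon1 %in% wanted$lon1 & big$lon2 %in% wanted$lon2

# Row-wise test: compare a composite key built from both columns.
key_big <- paste(big$lon1, big$lon2, sep = "_")
key_wanted <- paste(wanted$lon1, wanted$lon2, sep = "_")
rowwise <- key_big %in% key_wanted

sum(colwise)  # every row of 'big' passes the column-wise test
sum(rowwise)  # only the exact (lon1, lon2) pairs pass
```

The paste() key is just one possible way to compare whole rows; it assumes the separator cannot appear inside the values themselves.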

Any ideas on how to subset rows based on a unique combination of 4 variables would be appreciated.

I think this is what you're looking for. You want the duplicated function, which returns what you're expecting.

out1[duplicated(rbind(out2, out1)[, 1:4])[-seq_len(nrow(out2))], ]

How does this work? You first rbind out2 and out1, then call duplicated on the result. Rows that are in out2 and also in out1 are marked TRUE in the out1 part. That is because the first occurrence is in out2, where it is not a duplicate. The second time duplicated finds the entry, in out1, it knows there has been such a row before, so it marks it as duplicated. Now you have the duplicated entries. Next, keep only the elements that belong to out1 by removing the first n elements, where n = nrow(out2). Finally, subset out1 using that logical vector.

You can go through the explanation and run the code step by step to follow along. Here's a broken-down version for working out the logic.

tt <- rbind(out2, out1)
tt.dup <- duplicated(tt[, 1:4])  # marks rows in out1 that duplicate out2 on the first 4 cols
tt.dup <- tt.dup[-seq_len(nrow(out2))]  # remove the out2 entries (the first n)
out1[tt.dup, ]  # index the TRUE/duplicated elements of out1
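As a cross-check, the same subset can probably also be obtained with an inner join on the four key columns using base R's merge(). This is a swapped-in alternative, not the method from the answer above. The fixed seed and the helper names `keys`, `res_dup` and `res_merge` are assumptions made only for this sketch.

```r
set.seed(1)  # assumption: fixed seed so the sketch is reproducible
out1 <- data.frame(lon1 = sample(1:10), lon2 = sample(1:10),
                   lat1 = sample(1:10), lat2 = sample(1:10),
                   dissimilarity = sample(seq(0, 1, .1), 10))
out2 <- out1[c(2, 5, 6, 8), ]

keys <- c("lon1", "lon2", "lat1", "lat2")

# Approach from the answer: rows of out1 that duplicate a row of out2.
dup <- duplicated(rbind(out2, out1)[, keys])[-seq_len(nrow(out2))]
res_dup <- out1[dup, ]

# Alternative sketch: inner join of out1 against the distinct key
# combinations found in out2.
res_merge <- merge(out1, unique(out2[, keys]), by = keys)

nrow(res_dup)    # number of rows selected via duplicated()
nrow(res_merge)  # number of rows selected via merge()
```

One difference worth knowing: duplicated() also flags a row of out1 that merely repeats an earlier row of out1, even if that combination never occurs in out2, whereas the join keeps only combinations present in out2. By default merge() also sorts the result on the key columns and resets the row names, so the row order can differ from out1.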
