R - Subset data frame based on unique combination of multiple conditions
I can't seem to find an answer to this through search on SO. I'm trying to select a subset of a data.frame based on 4 conditions (lon1, lon2, lat1 and lat2). I have a huge dissimilarity matrix that has been vectorized, and the sites (lon1, lon2, lat1 and lat2) have been cbind-ed to it. Here is an example data frame:
out1 <- data.frame(lon1 = sample(1:10), lon2 = sample(1:10), lat1 = sample(1:10), lat2 = sample(1:10), dissimilarity = sample(seq(0,1,.1),10))
> out1
   lon1 lon2 lat1 lat2 dissimilarity
1     2    6    4    4           0.6
2     4    2    1    3           1.0
3    10    9    2    6           0.0
4     3    1   10    8           0.5
5     9    5    9    1           0.8
6     5    7    5    9           0.9
7     1    8    6    7           0.2
8     8    3    8    5           0.7
9     7    4    3   10           0.3
10    6   10    7    2           0.1

out2 <- out1[c(2,5,6,8),]
  lon1 lon2 lat1 lat2 dissimilarity
1    4    2    1    3           1.0
2    9    5    9    1           0.8
3    5    7    5    9           0.9
4    8    3    8    5           0.7
I tried using the %in% function a few times in this manner:
test <- out1[(out1$lon1 %in% out2$lon1) & (out1$lon2 %in% out2$lon2) & (out1$lat1 %in% out2$lat1) & (out1$lat2 %in% out2$lat2), ]
This seems to work for the basic example I provide here. But when I apply it to my huge data frame (with many lats and lons repeated), I get a larger subset than the unique combinations I require. I assume this is because the match function in %in% can only match one vector at a time. So it is matching condition1 & condition2 & condition3 & condition4 independently, and returning results that give me a subset the same as the original out1. I only want the case where all 4 values are in the same row. That way I'll have a subset of the data with the pairwise dissimilarities I'm interested in.
Any ideas on how to subset rows based on a unique combination of 4 variables would be appreciated.
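To make the over-matching concrete, here is a minimal sketch on made-up data. The names df and key are hypothetical, and it uses only two coordinate columns instead of four to keep it short. Each column is tested on its own, so a row passes as soon as each of its values appears somewhere in the corresponding column of key, even if that pair never occurs together.

```r
# Hypothetical toy data (not the data from the question)
df  <- data.frame(lon1 = c(1, 1, 2, 2), lat1 = c(5, 6, 5, 6))
key <- data.frame(lon1 = c(1, 2),       lat1 = c(5, 6))

# Column-wise %in%: each condition is checked independently of the others
colwise <- df[df$lon1 %in% key$lon1 & df$lat1 %in% key$lat1, ]

# All 4 rows of df are returned, although key only holds
# the 2 combinations (1, 5) and (2, 6)
nrow(colwise)
```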
I think this is what you're looking for. You want the duplicated function, which returns what you're expecting.
out1[duplicated(rbind(out2, out1)[, 1:4])[-seq_len(nrow(out2))], ]
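How does this work? First you rbind out2 and out1, in that order. Then you call duplicated on the first 4 columns of the result. Rows that are in out2 and also in out1 will be marked TRUE in out1. That is because the first occurrence is in out2, so it is not a duplicate there. The second time duplicated finds the entry, in out1, it knows there has been such a row before, so it marks it as duplicated. Now you have the duplicated entries. You keep only the elements that belong to out1 by removing the first n elements, where n = nrow(out2). Then you subset out1 using this logical vector.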
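Here is a small worked sketch of that logic on made-up data. The names big, small, stacked, dup and keep are hypothetical: big plays the role of out1 and small the role of out2, with two key columns instead of four.

```r
# Hypothetical toy data; `small` is a subset of `big`
big   <- data.frame(a = c(1, 1, 2, 2), b = c(5, 6, 5, 6),
                    d = c(0.1, 0.2, 0.3, 0.4))
small <- big[c(1, 4), ]

stacked <- rbind(small, big)           # rows of `small` come first
dup     <- duplicated(stacked[, 1:2])  # TRUE where an earlier row has the same keys
dup                                    # FALSE FALSE TRUE FALSE FALSE TRUE

keep <- dup[-seq_len(nrow(small))]     # drop the entries that belong to `small`
big[keep, ]                            # rows 1 and 4 of `big`
```

One thing to watch: duplicated flags any row that has appeared earlier, so if the larger data frame itself contains repeated key combinations that are not in the smaller one, the later repeats are flagged as well.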
You can go through the explanation and run the code step by step to follow along. Here is a broken-down version for working out the logic.
tt <- rbind(out2, out1)                  # stack out2 on top of out1
tt.dup <- duplicated(tt[, 1:4])          # marks duplicate rows in out1, using the 1st 4 cols
tt.dup <- tt.dup[-seq_len(nrow(out2))]   # remove the out2 entries (first n)
out1[tt.dup, ]                           # index the TRUE/duplicated elements of out1
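For comparison, here are two other ways to match on the combination of all four columns. Neither is the approach from the answer above; this is only a sketch under stated assumptions. The names coords, by_merge, key1, key2 and by_key are hypothetical, and the seed is arbitrary.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible
out1 <- data.frame(lon1 = sample(1:10), lon2 = sample(1:10),
                   lat1 = sample(1:10), lat2 = sample(1:10),
                   dissimilarity = sample(seq(0, 1, .1), 10))
out2 <- out1[c(2, 5, 6, 8), ]

coords <- c("lon1", "lon2", "lat1", "lat2")

# Option 1: inner join on the four coordinate columns.
# unique() guards against repeated combinations in out2 multiplying rows.
by_merge <- merge(out1, unique(out2[, coords]), by = coords)

# Option 2: build one key string per row, then use %in% on the keys,
# so the four values are compared together rather than column by column
key1 <- do.call(paste, c(out1[, coords], sep = "_"))
key2 <- do.call(paste, c(out2[, coords], sep = "_"))
by_key <- out1[key1 %in% key2, ]
```

Note that merge sorts the result by the key columns by default and renumbers the rows, while the key-string version keeps the order and row names of out1. The key-string version also relies on paste converting numbers to text, which is fine for integer coordinates like these but may need care with floating-point values.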