memory - Errors with PCoA in R due to a large dataset
For a work project I have to perform PCoA (principal coordinate analysis, a.k.a. multidimensional scaling). When using R to perform the analysis, I ran into a few problems.
The function cmdscale accepts either a matrix or a dist object as input, but the dist function gives this error:
Error: cannot allocate vector of size 4.2 Gb
In addition: Warning messages:
1: In dist(mydata[c(3, 4)], method = "euclidian", diag = FALSE, upper = FALSE) :
  Reached total allocation of 4020Mb: see help(memory.size)
2: In dist(mydata[c(3, 4)], method = "euclidian", diag = FALSE, upper = FALSE) :
  Reached total allocation of 4020Mb: see help(memory.size)
3: In dist(mydata[c(3, 4)], method = "euclidian", diag = FALSE, upper = FALSE) :
  Reached total allocation of 4020Mb: see help(memory.size)
4: In dist(mydata[c(3, 4)], method = "euclidian", diag = FALSE, upper = FALSE) :
  Reached total allocation of 4020Mb: see help(memory.size)

And when I use matrix, it changes the input into this:

     [,1]
[1,] Integer,33741
[2,] Integer,33741

The contents of the dataset cannot be posted online, but I can give the dimensions: the dataset is 33741 rows long and 11 columns wide, with the first column being the ID and the other 10 holding the values that need to be used for the PCoA.
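As you can see in the error, I use only 2 columns and still get a memory error.
Now my questions:
Is it possible to manipulate the data in such a way that I can stay within the memory limit of the dist function?
What am I doing wrong that makes the matrix function change the two column vectors into the two-row output shown above?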
What I have tried: running garbage collection, restarting the GUI, restarting the system.
System: Windows 7 x64, i7 920QM 1.8 GHz, 4 GB DDR3 RAM
Code used:
mydata <- read.table(file, header=TRUE)
mydist <- dist(mydata[c(3,4)], method="euclidian", diag=FALSE, upper=FALSE)
mymatrix <- matrix(mydata[c(3,4)], byrow=FALSE)
mymatrix <- matrix(cbind(mydata[c(3,4)]))
mycmdscale <- cmdscale(mydist, k=2, eig=FALSE, add=FALSE, x.ret=FALSE)
mycmdscale <- cmdscale(mymatrix, k=2, eig=FALSE, add=FALSE, x.ret=FALSE)
plot(mycmdscale)

Of course I did not run the code in this order; it contains the different methods I have tried for loading the data.
Thanks in advance for any replies.
You have far too little memory for this operation in R, which holds all objects in memory. I may not have the exact calculation quite right (I forget the size of R's objects), but to hold the full dissimilarity matrix you'll need ~9 GB of RAM.
> print(object.size(matrix(0, ncol = 34000, nrow = 34000)), units = "Gb")
8.6 Gb

dist gets away with less because its internal representation stores only 0.5 * (nr * (nr - 1)) doubles (where nr is the number of rows in the input data):

> print(object.size(numeric(length = 0.5 * 34000 * 33999)), units = "Gb")
4.3 Gb

[which is where the error you are seeing is coming from]
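The same arithmetic can be done without allocating anything. The sketch below assumes 8 bytes per double and ignores R's small per-object overhead; dist_memory_gib is a hypothetical helper name, not a base R function.

```r
# Hypothetical helper (not part of base R): rough memory, in GiB, needed to
# hold the distances between n observations. Assumes 8-byte doubles and
# ignores the small fixed overhead of an R object.
dist_memory_gib <- function(n, bytes_per_double = 8) {
  n_pairs <- n * (n - 1) / 2                      # entries kept by dist()
  c(dist   = n_pairs * bytes_per_double / 2^30,   # lower triangle only
    square = n^2 * bytes_per_double / 2^30)       # full n x n matrix
}

# 33741 is the row count reported in the question.
est <- dist_memory_gib(33741)
print(round(est, 1))
```

For 33741 rows the dist entry comes out at roughly 4.2, which is in line with the size R refused to allocate in the error message.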
Realistically, you'll need upwards of 20-30 GB of RAM to do anything useful with the dissimilarity matrix once you've computed it. If you do compute them, the eigenvectors of the PCoA solution would need ~9 GB of RAM on their own.
So the more pertinent question is: what do you hope to do with c. 34000 samples/observations?
To get a matrix from mydata[3:4] you can use

as.matrix(mydata[3:4])

or, if you have factors and want to preserve their numeric interpretation,

data.matrix(mydata[3:4])
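A small illustration of why matrix() produced the odd two-row output in the question, and how as.matrix() differs. The toy data frame and its values are made up for the example; only the column selection mirrors the original code.

```r
# Toy data frame standing in for mydata (values are invented for illustration).
toy <- data.frame(id = 1:3, x = c(1L, 4L, 7L), y = c(2L, 5L, 8L))

# matrix() treats the data frame as a list with one element per column,
# so the result is a list matrix with one cell per column.
m_list <- matrix(toy[c("x", "y")])

# as.matrix() converts the columns into a regular rows-by-columns matrix.
m_num <- as.matrix(toy[c("x", "y")])

print(dim(m_list))      # one row per selected column, a single column
print(is.list(m_list))  # the cells hold whole vectors, not numbers
print(dim(m_num))       # rows of the data by selected columns
print(is.numeric(m_num))
```

Note that fixing the conversion does not fix the memory problem: cmdscale still needs the pairwise distances between all rows.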