Represent a vector in R for sampling that is too large for existing memory


I am generating a data vector to sample from, and I sample from it without replacement.

If the dataset I am generating is large enough, the vector exceeds the limits of R.

How can I represent these data in such a way that I can sample without replacement but can still handle huge datasets?

Generating the vector of counts:

    counts <- vector()
    for (i in 1:1024) {
        counts <- c(counts, rep(i, times = data[i, ]$readcount))
    }
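To see why this runs out of memory, it can help to estimate the size of the expanded vector before building it. The following is a rough sketch, assuming the hypothetical example data `readcount = 1:1024` (the real read counts are not shown in the question) and the usual 4 bytes per integer and 8 bytes per double element:

```r
# Hypothetical stand-in for the real data: 1024 rows with a readcount column
data <- data.frame(readcount = 1:1024)

# Length the expanded vector would have: one element per read.
# as.numeric() avoids integer overflow when the counts are large.
n_elements <- sum(as.numeric(data$readcount))

# Approximate payload size, ignoring the small per-vector header
bytes_integer <- n_elements * 4
bytes_double  <- n_elements * 8

cat("elements:", n_elements, "\n")
cat("approx. size as integer vector (MB):", bytes_integer / 1024^2, "\n")
cat("approx. size as double vector (MB):", bytes_double / 1024^2, "\n")

# If the vector does fit, a single rep() call builds it without the repeated
# copying caused by growing it with c() inside a loop
counts_fast <- rep(seq_len(nrow(data)), times = data$readcount)
```

Growing a vector with `c()` in a loop copies it on every iteration, so the loop can need noticeably more working memory than the final vector itself.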

Sampling:

    trial_fn <- function(counts) {
        replicate(num_trials, sample(counts, size = trial_size, replace = FALSE), simplify = FALSE)
    }

    trials <- trial_fn(counts)

    Error: cannot allocate vector of size 32.0 Mb

Is there a more sparse or compressed way I can represent this and still be able to sample without replacement?

If I understand correctly, your data has 1024 rows with different readcount values. The vector you build has the first readcount value repeated once, the second readcount value repeated twice, and so on.

Then you want to sample from this vector without replacement. So basically, you're sampling the first readcount with a probability of 1 / sum(1:1024), the second readcount with a probability of 2 / sum(1:1024), and so on, and each time you extract one value, it is removed from the set.
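A scaled-down illustration of that reasoning, assuming only four values instead of 1024 (the variable names are made up for this sketch): the draw probabilities are the remaining counts divided by their sum, and they change after every draw because one copy leaves the pool.

```r
counts <- 1:4                      # value i occurs i times, as in the 1:1024 case
p_first <- counts / sum(counts)    # probabilities for the first draw

# Suppose the first draw returned the value 4: one copy of it leaves the pool
counts_after <- counts
counts_after[4] <- counts_after[4] - 1
p_second <- counts_after / sum(counts_after)

print(p_first)    # counts 1, 2, 3, 4 out of 10 elements
print(p_second)   # counts 1, 2, 3, 3 out of the 9 elements left
```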

Of course the fastest and easiest approach is yours, but you can do it with much less memory at the cost of losing speed (significantly). This can be done by passing the extraction probabilities to the sample function, extracting one value at a time, and manually "removing" the extracted value.

Here's an example:

    # an example of your data
    data <- data.frame(readcount = 1:1024)

    # custom function to sample
    mysample <- function(values, size, nelementspervalue) {
      nelementspervalue <- as.integer(nelementspervalue)
      if (sum(nelementspervalue) < size)
        stop("total number of elements per value is lower than the sample size")
      if (length(values) != length(nelementspervalue))
        stop("nelementspervalue must have the same length as values")
      if (any(nelementspervalue < 0))
        stop("nelementspervalue cannot contain negative numbers")

      # remove values having 0 elements inside
      # (the mask is computed once, before either vector is subsetted)
      keep <- nelementspervalue > 0
      values <- values[keep]
      nelementspervalue <- nelementspervalue[keep]

      # pre-allocate the result vector
      res <- rep.int(0.0, size)
      for (i in seq_len(size)) {
        # draw one index, weighted by the number of elements still available
        idx <- sample.int(length(values), size = 1, replace = FALSE, prob = nelementspervalue)
        res[i] <- values[idx]
        # remove the sampled value from nelementspervalue
        nelementspervalue[idx] <- nelementspervalue[idx] - 1
        # if 0 elements are left, remove the value itself
        if (nelementspervalue[idx] == 0) {
          values <- values[-idx]
          nelementspervalue <- nelementspervalue[-idx]
        }
      }
      return(res)
    }

    # for reproducibility
    set.seed(123)

    # sample 100k values from readcount
    system.time(
      smp <- mysample(data$readcount, 100000, 1:1024),
      gcFirst = TRUE)

    # on my machine it gives:
    #   user  system elapsed
    #  10.63    0.00   10.67
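A different technique, not part of the answer above, is to draw positions in the "virtual" expanded vector and map them back to values through cumulative counts. This is only a sketch: `sample_from_counts` is a made-up helper name, `num_trials` and `trial_size` are given arbitrary values because the question does not state them, and how much memory `sample.int` itself needs for a very large total depends on the R version (recent versions document a `useHash` argument for large `n` with a comparatively small `size`).

```r
# Sample `size` items without replacement from a multiset described by
# (values, counts), without building the expanded vector.
sample_from_counts <- function(values, counts, size) {
  stopifnot(length(values) == length(counts), all(counts >= 0))
  total <- sum(as.numeric(counts))       # as.numeric avoids integer overflow
  stopifnot(size <= total)
  # Positions in the virtual expanded vector, drawn without replacement
  pos <- sample.int(total, size = size, replace = FALSE)
  # Map each position back to the value whose block of copies contains it
  values[findInterval(pos - 1, cumsum(as.numeric(counts))) + 1]
}

set.seed(123)
data <- data.frame(readcount = 1:1024)   # same example data as in the answer

# In the question, value i occurs readcount[i] times
smp <- sample_from_counts(seq_len(nrow(data)), data$readcount, 100000)

# Several trials, as in the question (both settings are assumed values)
num_trials <- 5
trial_size <- 1000
trials <- replicate(
  num_trials,
  sample_from_counts(seq_len(nrow(data)), data$readcount, trial_size),
  simplify = FALSE
)
```

Because every position is drawn at most once, no value can appear more often than its count allows, which is the same guarantee the element-by-element loop gives, but with a single vectorised draw per trial.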
