Represent a vector in R for sampling that is too large for existing memory

August 15, 2014

I am generating a data vector to sample from, and I sample from it without replacement.

If the dataset I am generating is large enough, the vector exceeds the limits of R.

How can I represent these data in such a way that I can sample without replacement but can still handle huge datasets?

Generating the vector of counts:

counts <- vector()
for (i in 1:1024) {
    # append the row index i once per read in row i
    counts <- c(counts, rep(i, times=data[i,]$readcount))
}

Sampling:

trial_fn <- function(counts) {
    replicate(num_trials, sample(counts, size=trial_size, replace=FALSE), simplify=FALSE)
}

trials <- trial_fn(counts)

Error: cannot allocate vector of size 32.0 Mb

Is there a more sparse or compressed way that I can represent this and still be able to sample without replacement?

If I understand correctly, your data has 1024 rows with different readcount values. The vector you build has the first readcount value repeated once, the second readcount repeated twice, and so on.

Then you want to sample from this vector without replacement. So basically, you're sampling the first readcount with a probability of 1 / sum(1:1024), the second readcount with a probability of 2 / sum(1:1024), and so on, and each time you extract one value, it is removed from the set.
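To make the equivalence concrete, here is a small sketch with made-up counts (the three-element `counts` vector is purely illustrative and is not from the question). It shows that the first draw from the expanded vector has the same distribution as a weighted draw over the distinct values, and how the weights shift once one element has been removed.

```r
# Toy counts (hypothetical): value i occurs counts[i] times.
counts <- c(1, 2, 3)
values <- seq_along(counts)

# Expanded representation: one element per occurrence (what the question builds).
expanded <- rep(values, times = counts)

# Compact representation: probability of each value on the first draw.
prob_first <- counts / sum(counts)

# Suppose the first draw returned value 3: one copy of it is now gone,
# so the weights for the second draw are based on the decremented counts.
counts_after <- counts
counts_after[3] <- counts_after[3] - 1
prob_second <- counts_after / sum(counts_after)
```

Only `values` and `counts` need to be kept in memory; `expanded` is built here solely to compare the two views.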

Of course the fastest and easiest approach is yours, but you can do it with much less memory while losing speed (significantly). This can be done by giving the probabilities of extraction to the sample function, extracting one value at a time, and manually "removing" the extracted value.

Here's an example:

# Example data
data <- data.frame(readcount = 1:1024)

# Custom function to sample without replacement from (value, count) pairs
mysample <- function(values, size, nelementspervalue) {
  nelementspervalue <- as.integer(nelementspervalue)
  if (sum(nelementspervalue) < size)
    stop("total number of elements per value is lower than the sample size")
  if (length(values) != length(nelementspervalue))
    stop("nelementspervalue must have the same length as values")
  if (any(nelementspervalue < 0))
    stop("nelementspervalue cannot contain negative numbers")

  # Remove values having 0 elements; compute the mask once so that
  # values and nelementspervalue stay aligned
  keep <- nelementspervalue > 0
  values <- values[keep]
  nelementspervalue <- nelementspervalue[keep]

  # Pre-allocate the result vector
  res <- rep.int(0.0, size)
  for (i in seq_len(size)) {
    # Draw one index, weighted by the number of remaining elements
    idx <- sample.int(length(values), size = 1, prob = nelementspervalue)
    res[i] <- values[idx]
    # Remove the sampled element from nelementspervalue
    nelementspervalue[idx] <- nelementspervalue[idx] - 1
    # If no elements are left for this value, drop it
    if (nelementspervalue[idx] == 0) {
      values <- values[-idx]
      nelementspervalue <- nelementspervalue[-idx]
    }
  }
  return(res)
}

# For reproducibility
set.seed(123)

# Sample 100k values from readcount
system.time(
  sampled <- mysample(data$readcount, 100000, 1:1024),
  gcFirst = TRUE)

# On my machine it gives:
#   user  system elapsed
#  10.63    0.00   10.67
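If the one-draw-at-a-time loop is too slow, a different technique (not the one used in the answer above) is to sample positions in the expanded vector without ever building it, then map each position back to its value through the cumulative counts. The sketch below is a minimal illustration under that assumption; `sample_compact` is a hypothetical name. I believe recent R versions avoid allocating the full range in `sample.int` when the range is very large and the sample is comparatively small, but treat that as something to verify on your own setup.

```r
# Sketch: sample without replacement from (value, count) pairs by drawing
# positions in the virtual expanded vector and mapping them back to values.
# sample_compact is a hypothetical helper, not part of base R.
sample_compact <- function(values, counts, size) {
  stopifnot(length(values) == length(counts), all(counts >= 0))
  total <- sum(as.numeric(counts))  # numeric sum avoids integer overflow
  stopifnot(size <= total)

  # Distinct positions in 1..total; the expanded vector is never built.
  pos <- sample.int(total, size = size, replace = FALSE)

  # upper[i] is the last position that values[i] would occupy.
  upper <- cumsum(as.numeric(counts))

  # Position p belongs to the first i with upper[i] >= p.
  idx <- findInterval(pos - 1, upper) + 1L
  values[idx]
}

set.seed(123)
sampled_fast <- sample_compact(1:1024, 1:1024, 100000)
```

Because the drawn positions are distinct, no value can be returned more often than its count allows, which is the same guarantee the expanded vector gives.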
