Represent a vector in R for sampling that is too large for existing memory
I am generating a data vector to sample from with sample(), without replacement.
If the dataset I am generating from is large enough, the vector exceeds the limits of R.
How can I represent these data in such a way that I can sample without replacement but can still handle huge datasets?
Generating the vector of counts:

    counts <- vector()
    for (i in 1:1024) {
      counts <- c(counts, rep(i, times = data[i, ]$readcount))
    }
Sampling:

    trial_fn <- function(counts) {
      replicate(num_trials,
                sample(counts, size = trial_size, replace = FALSE),
                simplify = FALSE)
    }
    trials <- trial_fn(counts)

    Error: cannot allocate vector of size 32.0 Mb
Is there a more sparse or compressed way that I can represent this and still be able to sample without replacement?
If I understand correctly, your data has 1024 rows with different readcount values. The vector you build has the first readcount value repeated once, the second readcount value repeated twice, and so on.
Then you want to sample from this vector without replacement. Basically, you're sampling the first readcount with a probability of 1 / sum(1:1024), the second readcount with a probability of 2 / sum(1:1024), and so on, and each time you extract one value, it is removed from the set.
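To make those probabilities concrete, here is a minimal sketch on a hypothetical three-value table. The names cnt, expanded, p_first and p_second are illustrative only and do not come from the question. It checks that the per-value weights match the proportions in the expanded vector, and shows how the weights change once one element has been drawn.

```r
# Hypothetical toy counts: value 1 occurs once, value 2 twice, value 3 three times
cnt <- c(1, 2, 3)

# Expanded vector, built the same way as in the question
expanded <- rep(seq_along(cnt), times = cnt)   # 1 2 2 3 3 3

# Probability of each value on the first draw, from the counts alone
p_first <- cnt / sum(cnt)

# The same proportions computed from the expanded vector
p_expanded <- as.vector(table(expanded)) / length(expanded)

# Suppose value 3 is drawn first: its count drops by one,
# and the weights for the second draw are recomputed
cnt_after <- cnt
cnt_after[3] <- cnt_after[3] - 1
p_second <- cnt_after / sum(cnt_after)   # 1/5, 2/5, 2/5
```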
Of course, the fastest and easiest approach is yours, but you can use less memory at the cost of (significantly) losing speed. This can be done by giving the probabilities of extraction to the sample function, extracting one value at a time and manually "removing" the extracted value.
Here's an example:
    # Example of data
    data <- data.frame(readcount = 1:1024)

    # Custom function to sample
    mysample <- function(values, size, nelementspervalue) {
      nelementspervalue <- as.integer(nelementspervalue)
      if (sum(nelementspervalue) < size)
        stop("total number of elements per value is lower than the sample size")
      if (length(values) != length(nelementspervalue))
        stop("nelementspervalue must have the same length as values")
      if (any(nelementspervalue < 0))
        stop("nelementspervalue cannot contain negative numbers")

      # Remove values having 0 elements
      # (compute the mask once, before either vector is subset)
      keep <- nelementspervalue > 0
      values <- values[keep]
      nelementspervalue <- nelementspervalue[keep]

      # Pre-allocate the result vector
      res <- rep.int(0.0, size)
      for (i in seq_len(size)) {
        idx <- sample(seq_along(values), size = 1, replace = FALSE,
                      prob = nelementspervalue)
        res[i] <- values[idx]
        # Remove the sampled element from nelementspervalue
        nelementspervalue[idx] <- nelementspervalue[idx] - 1
        # If 0 elements are left, remove the value
        if (nelementspervalue[idx] == 0) {
          values <- values[-idx]
          nelementspervalue <- nelementspervalue[-idx]
        }
      }
      return(res)
    }

    # For reproducibility
    set.seed(123)

    # Sample 100k values from readcount
    system.time(
      res <- mysample(data$readcount, 100000, 1:1024),
      gcFirst = TRUE)

    # On my machine it gives:
    #    user  system elapsed
    #   10.63    0.00   10.67
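As a possible alternative to drawing one value at a time, the sketch below samples positions in the implied expanded vector and maps them back to values through the cumulative counts. This is a different technique from the loop above, not the answer's method, and the function name sample_from_counts and its arguments are hypothetical. It relies on sample.int() and findInterval() from base R. According to the sample documentation, sample.int() switches to a hashing algorithm when n is above 1e7 and size is at most n/2, so in that case the full range 1:n should not need to be allocated; for smaller n the range is small enough to allocate anyway.

```r
# Sketch: sample positions of the implied expanded vector, then map back to values.
# 'values' and 'counts' are parallel vectors; counts[i] is how often values[i] occurs.
sample_from_counts <- function(values, counts, size) {
  stopifnot(length(values) == length(counts), all(counts >= 0))

  # Use doubles so very large totals do not overflow integer arithmetic
  upper <- cumsum(as.numeric(counts))   # last position of each value's block
  total <- upper[length(upper)]
  stopifnot(size <= total)

  # Draw distinct positions in 1..total, i.e. without replacement
  pos <- sample.int(total, size = size, replace = FALSE)

  # Position p belongs to the first block whose upper bound is >= p
  idx <- findInterval(pos, upper, left.open = TRUE) + 1L
  values[idx]
}

set.seed(123)
res2 <- sample_from_counts(1:1024, 1:1024, 100000)
```

Each trial in the question could then call this function inside replicate(), so the expanded counts vector is never built. Every position is drawn with equal probability, which should correspond to sampling the expanded vector uniformly without replacement.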