R - Splitting a Dataframe into Confirmatory and Exploratory Samples


I have a large dataframe (n = 107,251), which I wish to split into relatively equal halves (~53,625). However, I would like the split to be done such that three variables are kept in equal proportion in the two sets (pertaining to gender, age category with 6 levels, and region with 5 levels).

I can generate the proportions for the variables independently (e.g., via prop.table(xtabs(~dat$gender))) or in combination (e.g., via prop.table(xtabs(~dat$gender + dat$region + dat$age))), but I'm not sure how to utilise this information for sampling.

Sample dataset:

    set.seed(42)
    gender <- sample(c("m", "f"), 1000, replace = TRUE)
    region <- sample(c("1","2","3","4","5"), 1000, replace = TRUE)
    age <- sample(c("1","2","3","4","5","6"), 1000, replace = TRUE)
    x1 <- rnorm(1000)
    dat <- data.frame(gender, region, age, x1)

Probabilities:

    round(prop.table(xtabs(~dat$gender)), 3)  # 48.5% female; 51.5% male
    round(prop.table(xtabs(~dat$age)), 3)     # 16.8, 18.2, ..., 16.0%
    round(prop.table(xtabs(~dat$region)), 3)  # 21.5%, 17.7, ..., 21.9%
    # multidimensional probabilities:
    round(prop.table(xtabs(~dat$gender + dat$age + dat$region)), 3)

The end goal for this dummy example would be two data frames with ~500 observations in each (completely independent, with no participant appearing in both), and approximately equivalent in terms of gender/region/age splits. In the real analysis, there is more disparity between the age and region weights, so doing a single random split-half isn't appropriate. In real-world applications, I'm not sure if every observation needs to be used or if it is better to get the splits more even.

I have been reading the documentation for package:sampling, but I'm not sure it is designed to do exactly what I require.

You can check out my stratified function, which you should be able to use like this:

    set.seed(1) ## so the split can be reproduced

    ## Take the first group
    sample1 <- stratified(dat, c("gender", "region", "age"), .5)

    ## Then select the remainder
    sample2 <- dat[!rownames(dat) %in% rownames(sample1), ]

    summary(sample1)
    #  gender  region  age          x1
    #  f:235   1:112   1:84   Min.   :-2.82847
    #  m:259   2: 90   2:78   1st Qu.:-0.69711
    #          3: 94   3:82   Median :-0.03200
    #          4: 97   4:80   Mean   :-0.01401
    #          5:101   5:90   3rd Qu.: 0.63844
    #                  6:80   Max.   : 2.90422
    summary(sample2)
    #  gender  region  age          x1
    #  f:238   1:114   1:85   Min.   :-2.76808
    #  m:268   2: 92   2:81   1st Qu.:-0.55173
    #          3: 97   3:83   Median : 0.02559
    #          4: 99   4:83   Mean   : 0.05789
    #          5:104   5:91   3rd Qu.: 0.74102
    #                  6:83   Max.   : 3.58466
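If the stratified function is not at hand, the underlying idea can be sketched in base R: group the rows by every gender/region/age combination and draw roughly half of each group. This is only a minimal sketch under that assumption, not the stratified function itself; `split_half` is a hypothetical helper name introduced here for illustration.

```r
# Rebuild the dummy data from the question
set.seed(42)
n <- 1000
dat <- data.frame(
  gender = sample(c("m", "f"), n, replace = TRUE),
  region = sample(as.character(1:5), n, replace = TRUE),
  age    = sample(as.character(1:6), n, replace = TRUE),
  x1     = rnorm(n)
)

# Hypothetical helper (not part of any package): within every combination of
# the grouping variables, pick about half of the rows at random and return a
# logical vector that marks the rows assigned to the first half.
split_half <- function(data, vars) {
  cells <- interaction(data[vars], drop = TRUE)
  in_first <- logical(nrow(data))
  for (rows in split(seq_len(nrow(data)), cells)) {
    k <- round(length(rows) / 2)
    # Sample positions rather than values, so single-row cells behave correctly
    picked <- rows[sample.int(length(rows), k)]
    in_first[picked] <- TRUE
  }
  in_first
}

set.seed(1)
in_first <- split_half(dat, c("gender", "region", "age"))
sample1 <- dat[in_first, ]   # first half
sample2 <- dat[!in_first, ]  # remainder, so no row appears in both
```

Because every row lands in exactly one of the two sets, all observations are used, and the two halves of any single cell differ by at most one row.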

Compare the following and see if they are within your expectations.

    x1 <- round(prop.table(
      xtabs(~dat$gender + dat$age + dat$region)), 3)
    x2 <- round(prop.table(
      xtabs(~sample1$gender + sample1$age + sample1$region)), 3)
    x3 <- round(prop.table(
      xtabs(~sample2$gender + sample2$age + sample2$region)), 3)
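Reading three 2 x 6 x 5 tables side by side is tedious, so it may help to reduce the comparison to a single number per half: the largest absolute gap between a cell's proportion in the half and in the full data. The sketch below is self-contained and therefore builds its own halves with a simple stand-in split (alternating rows within each cell after a random ordering); `joint_prop` is a hypothetical helper name, and the same comparison should apply to halves produced any other way.

```r
# Rebuild the dummy data; factors keep the table dimensions identical
# across subsets even if a level happens to be missing in one of them
set.seed(42)
n <- 1000
dat <- data.frame(
  gender = factor(sample(c("m", "f"), n, replace = TRUE)),
  region = factor(sample(as.character(1:5), n, replace = TRUE)),
  age    = factor(sample(as.character(1:6), n, replace = TRUE))
)

# Stand-in stratified split (an assumption for this example only): give each
# row a random rank within its gender/region/age cell and send odd ranks to
# the first half, even ranks to the second
set.seed(1)
pos <- ave(runif(n), dat$gender, dat$region, dat$age, FUN = rank)
sample1 <- dat[pos %% 2 == 1, ]
sample2 <- dat[pos %% 2 == 0, ]

# Hypothetical helper: joint proportions over the three variables
joint_prop <- function(d) prop.table(xtabs(~ gender + age + region, data = d))

p_all <- joint_prop(dat)
p1 <- joint_prop(sample1)
p2 <- joint_prop(sample2)

# Largest discrepancy between each half and the full data, over all 60 cells
max(abs(p1 - p_all))
max(abs(p2 - p_all))
```

What counts as "within expectations" is a judgment call; with cells this small, rounding alone shifts a proportion by a fraction of a percentage point.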

It should work fine with data of the size you describe, but a "data.table" version is in the works that promises to be much more efficient.


Update:

stratified now has a new logical argument "bothSets" which lets you keep both sets of samples as a list.

    set.seed(1)
    samples <- stratified(dat, c("gender", "region", "age"), .5, bothSets = TRUE)
    lapply(samples, summary)
    # $set1
    #  gender  region  age          x1
    #  f:235   1:112   1:84   Min.   :-2.82847
    #  m:259   2: 90   2:78   1st Qu.:-0.69711
    #          3: 94   3:82   Median :-0.03200
    #          4: 97   4:80   Mean   :-0.01401
    #          5:101   5:90   3rd Qu.: 0.63844
    #                  6:80   Max.   : 2.90422
    #
    # $set2
    #  gender  region  age          x1
    #  f:238   1:114   1:85   Min.   :-2.76808
    #  m:268   2: 92   2:81   1st Qu.:-0.55173
    #          3: 97   3:83   Median : 0.02559
    #          4: 99   4:83   Mean   : 0.05789
    #          5:104   5:91   3rd Qu.: 0.74102
    #                  6:83   Max.   : 3.58466
