R - Splitting a Dataframe into Confirmatory and Exploratory Samples
I have a large dataframe (n = 107,251) that I wish to split into relatively equal halves (~53,625). However, I would like the split to be done such that three variables are kept in equal proportion in the two sets (gender, age category with 6 levels, and region with 5 levels).
I can generate the proportions for the variables independently (e.g., via prop.table(xtabs(~dat$gender))) or in combination (e.g., via prop.table(xtabs(~dat$gender + dat$region + dat$age))), but I'm not sure how to utilise this information for sampling.
Sample dataset:
    set.seed(42)
    gender <- sample(c("m", "f"), 1000, replace = TRUE)
    region <- sample(c("1","2","3","4","5"), 1000, replace = TRUE)
    age <- sample(c("1","2","3","4","5","6"), 1000, replace = TRUE)
    x1 <- rnorm(1000)
    dat <- data.frame(gender, region, age, x1)
Probabilities:
    round(prop.table(xtabs(~dat$gender)), 3)  # 48.5% female; 51.5% male
    round(prop.table(xtabs(~dat$age)), 3)     # 16.8, 18.2, ..., 16.0%
    round(prop.table(xtabs(~dat$region)), 3)  # 21.5%, 17.7, ..., 21.9%
    # Multidimensional probabilities:
    round(prop.table(xtabs(~dat$gender + dat$age + dat$region)), 3)
The end goal for the dummy example would be two data frames with ~500 observations in each (completely independent, with no participant appearing in both), approximately equivalent in terms of gender/region/age splits. In the real analysis, there is more disparity between the age and region weights, so doing a single random split-half isn't appropriate. In real-world applications, I'm not sure whether every observation needs to be used or whether it is better to get the splits more even.
I have been reading the documentation for package:sampling, but I'm not sure it is designed to do what I require.
You can check out my stratified function, which you should be able to use like this:
    set.seed(1) ## So you can reproduce this

    ## Take your first group
    sample1 <- stratified(dat, c("gender", "region", "age"), .5)

    ## Then select the remainder
    sample2 <- dat[!rownames(dat) %in% rownames(sample1), ]

    summary(sample1)
    #  gender  region  age         x1
    #  f:235   1:112   1:84   Min.   :-2.82847
    #  m:259   2: 90   2:78   1st Qu.:-0.69711
    #          3: 94   3:82   Median :-0.03200
    #          4: 97   4:80   Mean   :-0.01401
    #          5:101   5:90   3rd Qu.: 0.63844
    #                  6:80   Max.   : 2.90422

    summary(sample2)
    #  gender  region  age         x1
    #  f:238   1:114   1:85   Min.   :-2.76808
    #  m:268   2: 92   2:81   1st Qu.:-0.55173
    #          3: 97   3:83   Median : 0.02559
    #          4: 99   4:83   Mean   : 0.05789
    #          5:104   5:91   3rd Qu.: 0.74102
    #                  6:83   Max.   : 3.58466
Compare the following and see if they are within your expectations.
    x1 <- round(prop.table(
      xtabs(~dat$gender + dat$age + dat$region)), 3)
    x2 <- round(prop.table(
      xtabs(~sample1$gender + sample1$age + sample1$region)), 3)
    x3 <- round(prop.table(
      xtabs(~sample2$gender + sample2$age + sample2$region)), 3)
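If eyeballing three 60-cell tables is tedious, the comparison can be reduced to a single number. The sketch below is only an illustration: max_prop_gap is a hypothetical helper (not part of any package), and the two halves are produced by a plain random split purely so the block runs on its own. In practice you would pass sample1 and sample2 instead.

```r
# Hypothetical helper: largest absolute difference between the joint
# cell proportions of two data frames over the given variables.
max_prop_gap <- function(a, b, vars) {
  # Give both data frames the same factor levels so the tables align,
  # even if one of them is missing a category entirely.
  for (v in vars) {
    lv <- sort(unique(c(as.character(a[[v]]), as.character(b[[v]]))))
    a[[v]] <- factor(a[[v]], levels = lv)
    b[[v]] <- factor(b[[v]], levels = lv)
  }
  pa <- prop.table(table(a[vars]))
  pb <- prop.table(table(b[vars]))
  max(abs(pa - pb))
}

# Dummy data in the same shape as the question's example.
set.seed(42)
dat <- data.frame(
  gender = sample(c("m", "f"), 1000, replace = TRUE),
  region = sample(as.character(1:5), 1000, replace = TRUE),
  age    = sample(as.character(1:6), 1000, replace = TRUE),
  x1     = rnorm(1000)
)

# Assumption: a simple random split-half, used here only as stand-in input.
set.seed(1)
in_first <- sample(nrow(dat), nrow(dat) / 2)
s1 <- dat[in_first, ]
s2 <- dat[-in_first, ]

vars <- c("gender", "region", "age")
gap_to_full <- c(set1 = max_prop_gap(s1, dat, vars),
                 set2 = max_prop_gap(s2, dat, vars))
gap_between <- max_prop_gap(s1, s2, vars)
```

A value near zero means the joint gender/age/region distribution is nearly identical in the two sets; what counts as "within expectations" is still a judgment call for your analysis.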
The stratified function should work fine with data of the size you describe, but a "data.table" version is in the works that promises to be more efficient.
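To see what stratified sampling of this kind does, here is a minimal base-R sketch of the same idea. It is not the implementation of the stratified function, just one way to draw half of every gender x region x age cell; the names strata, idx1, half1 and half2 are made up for the example, and rounding odd-sized cells down is an assumption.

```r
# Dummy data in the same shape as the question's example.
set.seed(42)
dat <- data.frame(
  gender = sample(c("m", "f"), 1000, replace = TRUE),
  region = sample(as.character(1:5), 1000, replace = TRUE),
  age    = sample(as.character(1:6), 1000, replace = TRUE),
  x1     = rnorm(1000)
)

# Row indices grouped by gender x region x age cell (at most 2 * 5 * 6 = 60).
strata <- split(
  seq_len(nrow(dat)),
  interaction(dat$gender, dat$region, dat$age, drop = TRUE)
)

# Within each cell, draw half of the rows (rounded down) for the first set.
# Indexing via sample.int() avoids the sample(x) surprise when a cell
# contains a single row.
set.seed(1)
idx1 <- unlist(lapply(strata, function(rows) {
  rows[sample.int(length(rows), floor(length(rows) / 2))]
}), use.names = FALSE)

# Everything not drawn goes to the second set, so the two never overlap.
half1 <- dat[idx1, ]
half2 <- dat[setdiff(seq_len(nrow(dat)), idx1), ]
```

Because odd-sized cells cannot be halved exactly, the second set ends up with at most one extra row per cell. If exactly equal halves matter more than using every observation, you could drop one row from each odd-sized cell before splitting.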
Update:
stratified now has a new logical argument, "bothSets", which lets you keep both sets of samples as a list.
    set.seed(1)
    samples <- stratified(dat, c("gender", "region", "age"), .5, bothSets = TRUE)
    lapply(samples, summary)
    # $set1
    #  gender  region  age         x1
    #  f:235   1:112   1:84   Min.   :-2.82847
    #  m:259   2: 90   2:78   1st Qu.:-0.69711
    #          3: 94   3:82   Median :-0.03200
    #          4: 97   4:80   Mean   :-0.01401
    #          5:101   5:90   3rd Qu.: 0.63844
    #                  6:80   Max.   : 2.90422
    #
    # $set2
    #  gender  region  age         x1
    #  f:238   1:114   1:85   Min.   :-2.76808
    #  m:268   2: 92   2:81   1st Qu.:-0.55173
    #          3: 97   3:83   Median : 0.02559
    #          4: 99   4:83   Mean   : 0.05789
    #          5:104   5:91   3rd Qu.: 0.74102
    #                  6:83   Max.   : 3.58466