问题描述
我有一个数据框,其中包含大约 20 个数字列,每个列都包含大量的 NA 值.我想选择这些列的一个子集,这将为我提供包含零 NA 值的最多行.详尽的搜索会花费大量的计算时间——有没有更好的方法来获得近似值?
I have a data frame with on the order of 20 numeric columns, each containing significant amounts of NA values. I would like to select a subset of these columns that will give me the most rows containing zero NA values. An exhaustive search would take a lot of computing time--is there a better way to get an approximation?
这是一个数据框较小的示例(完全任意):
Here is an example with a smaller data frame (completely arbitrary):
set.seed(2)
foo = as.data.frame(matrix(rnorm(200), nr = 20))
foo[sapply(foo, function(x) x > abs(x[1]))] = NA
foo = foo[-1, ]
round(foo, 3)
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
2 0.185 -1.200 -1.959 NA -1.696 0.261 0.139 0.410 -0.638 -1.262
3 NA 1.590 -0.842 -0.703 -0.533 -0.314 NA -0.807 -0.268 0.392
4 -1.130 1.955 NA 0.158 -1.372 -0.750 -0.431 0.086 0.360 -1.131
5 -0.080 0.005 NA 0.506 -2.208 -0.862 -1.044 NA -1.313 0.544
6 0.132 -2.452 NA -0.820 NA NA 0.538 -0.654 -0.884 NA
7 0.708 0.477 -0.305 -1.999 -0.653 0.940 -0.670 NA NA 0.025
8 -0.240 -0.597 -0.091 -0.479 -0.285 NA 0.639 0.550 -2.099 0.515
9 NA 0.792 -0.184 0.084 -0.387 -0.421 -1.724 -0.807 -1.239 -0.654
10 -0.139 0.290 -1.199 -0.895 0.387 -0.351 -1.742 -0.997 NA 0.504
11 0.418 0.739 -0.838 -0.921 NA -1.027 0.690 NA NA -1.272
12 NA 0.319 NA 0.330 NA -0.251 0.331 -0.169 NA -0.077
13 -0.393 1.076 -0.562 -0.142 -1.184 0.472 0.871 NA 0.057 -1.345
14 -1.040 -0.284 NA 0.435 -1.358 NA -2.016 -0.844 0.324 -0.266
15 NA -0.777 -1.048 -0.054 -1.513 0.564 1.213 NA -0.905 NA
16 -2.311 -0.596 -1.966 -0.907 -1.253 0.456 1.200 -1.343 -0.652 0.701
17 0.879 -1.726 -0.323 1.304 NA NA 1.032 NA -0.262 -0.443
18 0.036 -0.903 NA 0.772 0.008 NA 0.786 0.464 -0.935 -0.789
19 NA -0.559 NA 1.053 -0.843 0.107 NA 0.268 NA -0.857
20 0.432 -0.247 NA -1.410 -0.601 -0.783 -1.454 NA -1.624 -0.746
dim(na.omit(foo))
[1] 1 10
这是我制定详尽搜索的方式:
Here is how I've formulated an exhaustive search:
best.list = list()
for (i in 5:ncol(foo)) {
# get best subset for each size
collist = combn(ncol(foo), i)
numobs = apply(collist, 2, function(x) nrow(na.omit(foo[, x])))
cat("for subset size", i, "most complete obs is", max(numobs), "
")
best = which(numobs == max(numobs))[1]
best.list = c(best.list, list(collist[, best]))
}
例如,best.list[[1]]
告诉我,如果我保留 5 列,我可以有 12 个完整的观察结果(NA 为零的行),并且第 1、2、4 列, 7 和 10 是我应该选择的.
For example, best.list[[1]]
tells me that if I keep 5 columns I can have 12 complete observations (rows with zero NAs), and that columns 1, 2, 4, 7, and 10 are the ones I should choose.
虽然这适用于非常小的数据帧,但对于较大的数据帧,它很快就会变得令人望而却步.R中有没有办法有效地估计给定大小的最佳子集?我唯一能找到的是 subselect
包,虽然我不知道如何针对手头的问题实现它的方法.
While this works for very small data frames, it quickly becomes prohibitive with larger ones. Is there a way in R to efficiently estimate the best subset of a given size? The only thing I've been able to find is the subselect
package, though I can't figure out how to implement its methods for the problem at hand.
推荐答案
不确定这是否是完整的解决方案,但如果您想要快速的结果,data.table 和阴影矩阵是最可能的成分.
Not sure if this is the complete solution, but if you want fast results, data.table and a shadow matrix are the most probable ingredients.
library(data.table)
df = data.table(foo) # your foo dataframe, converted to data.table
y = sort(df[,lapply(.SD, function(x) sum(is.na(x)))]) # nr of NA in columns, increasing
setcolorder(df, names(y)) # now the columns are ordered - less NA first
df[, idx := rowSums(is.na(df))] # count nr of NA in rows
df = df[order(idx),] # sort by nr of NA in rows
df[, idx := NULL] # idx not needed anymore
# now your data.table is sorted: columns with least NA to the left,
# rows with with least NA on top
# shadow matrix
x= data.table(abs(!is.na(df))) # 0 = NA value
y = as.data.table(t(x))
y = y[,lapply(.SD, cumprod)]
y = as.data.table(t(y))
y[,lapply(.SD, sum)]
# nr of complete cases from column selections:
# V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
# 1: 19 18 16 14 11 10 7 5 2 1
这篇关于data.frame 列的子集以最大化“完整";观察的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,WP2