Subset of data.frame columns to maximize "complete" observations

Question

I have a data frame with on the order of 20 numeric columns, each containing significant amounts of NA values. I would like to select a subset of these columns that will give me the most rows containing zero NA values. An exhaustive search would take a lot of computing time--is there a better way to get an approximation?

Here is an example with a smaller data frame (completely arbitrary):

set.seed(2)
foo = as.data.frame(matrix(rnorm(200), nrow = 20))
foo[sapply(foo, function(x) x > abs(x[1]))] = NA
foo = foo[-1, ]

round(foo, 3)

       V1     V2     V3     V4     V5     V6     V7     V8     V9    V10
2   0.185 -1.200 -1.959     NA -1.696  0.261  0.139  0.410 -0.638 -1.262
3      NA  1.590 -0.842 -0.703 -0.533 -0.314     NA -0.807 -0.268  0.392
4  -1.130  1.955     NA  0.158 -1.372 -0.750 -0.431  0.086  0.360 -1.131
5  -0.080  0.005     NA  0.506 -2.208 -0.862 -1.044     NA -1.313  0.544
6   0.132 -2.452     NA -0.820     NA     NA  0.538 -0.654 -0.884     NA
7   0.708  0.477 -0.305 -1.999 -0.653  0.940 -0.670     NA     NA  0.025
8  -0.240 -0.597 -0.091 -0.479 -0.285     NA  0.639  0.550 -2.099  0.515
9      NA  0.792 -0.184  0.084 -0.387 -0.421 -1.724 -0.807 -1.239 -0.654
10 -0.139  0.290 -1.199 -0.895  0.387 -0.351 -1.742 -0.997     NA  0.504
11  0.418  0.739 -0.838 -0.921     NA -1.027  0.690     NA     NA -1.272
12     NA  0.319     NA  0.330     NA -0.251  0.331 -0.169     NA -0.077
13 -0.393  1.076 -0.562 -0.142 -1.184  0.472  0.871     NA  0.057 -1.345
14 -1.040 -0.284     NA  0.435 -1.358     NA -2.016 -0.844  0.324 -0.266
15     NA -0.777 -1.048 -0.054 -1.513  0.564  1.213     NA -0.905     NA
16 -2.311 -0.596 -1.966 -0.907 -1.253  0.456  1.200 -1.343 -0.652  0.701
17  0.879 -1.726 -0.323  1.304     NA     NA  1.032     NA -0.262 -0.443
18  0.036 -0.903     NA  0.772  0.008     NA  0.786  0.464 -0.935 -0.789
19     NA -0.559     NA  1.053 -0.843  0.107     NA  0.268     NA -0.857
20  0.432 -0.247     NA -1.410 -0.601 -0.783 -1.454     NA -1.624 -0.746

dim(na.omit(foo))
[1]  1 10

Here is how I've formulated an exhaustive search:

best.list = list()
for (i in 5:ncol(foo)) {
    # get best subset for each size
    collist = combn(ncol(foo), i)
    numobs = apply(collist, 2, function(x) nrow(na.omit(foo[, x])))
    cat("for subset size", i, "most complete obs is", max(numobs), "\n")
    best = which(numobs == max(numobs))[1]
    best.list = c(best.list, list(collist[, best]))
}

For example, best.list[[1]] tells me that if I keep 5 columns I can have 12 complete observations (rows with zero NAs), and that columns 1, 2, 4, 7, and 10 are the ones I should choose.
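
A count like this can be checked for any single column subset without rerunning the whole search. The sketch below is only an illustration: `count_complete` and the `toy` data frame are hypothetical names that do not appear in the original post.

```r
# Hypothetical helper (not from the original post): number of rows that have
# no NA in the chosen columns.
count_complete <- function(df, cols) {
  sum(complete.cases(df[, cols, drop = FALSE]))
}

# Small made-up data frame, used only to illustrate the call.
toy <- data.frame(
  a = c(1, NA, 3, 4),
  b = c(1, 2, NA, 4),
  c = c(NA, 2, 3, 4)
)

count_complete(toy, c("a", "b"))  # rows 1 and 4 are complete
count_complete(toy, 1:3)          # only row 4 is complete
```

With the data from the question, the equivalent call would be `count_complete(foo, c(1, 2, 4, 7, 10))`.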

While this works for very small data frames, it quickly becomes prohibitive with larger ones. Is there a way in R to efficiently estimate the best subset of a given size? The only thing I've been able to find is the subselect package, though I can't figure out how to implement its methods for the problem at hand.

Recommended answer

Not sure if this is the complete solution, but if you want fast results, data.table and a shadow matrix are the most probable ingredients.

library(data.table)
df = data.table(foo) # your foo dataframe, converted to data.table

y = sort(sapply(df, function(x) sum(is.na(x)))) # number of NAs per column (named vector), increasing
setcolorder(df, names(y)) # now the columns are ordered - fewest NAs first

df[, idx := rowSums(is.na(df))] # count nr of NA in rows
df = df[order(idx),] # sort by nr of NA in rows
df[, idx := NULL] # idx not needed anymore
# now your data.table is sorted: columns with the fewest NAs to the left,
# rows with the fewest NAs on top

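# shadow matrix: 1 = value present, 0 = NA
x = data.table(abs(!is.na(df)))
y = as.data.table(t(x))         # transpose: one column per row of df
y = y[, lapply(.SD, cumprod)]   # stays 1 only while no NA has been seen yet
y = as.data.table(t(y))         # transpose back: one row per observation
y[, lapply(.SD, sum)]

# number of complete cases when keeping the first k columns of the sorted table
# (the V1..V10 labels are positions in the sorted order, not the original names):
# V1 V2 V3 V4 V5 V6 V7 V8 V9 V10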
# 1: 19 18 16 14 11 10  7  5  2   1
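
The same idea can be sketched in base R without data.table. This is a restatement under stated assumptions, not the answerer's code: `complete_by_prefix` and the `toy` data frame are hypothetical names introduced here. Note that the ordering is a heuristic, not an optimum: in the numbers above, the first five sorted columns give 11 complete rows, while the exhaustive search in the question found 12 for five columns.

```r
# Hypothetical helper (not from the original answer): order columns from
# fewest to most NAs, then count the rows that are still complete when
# keeping only the first k columns, for every k.
complete_by_prefix <- function(df) {
  na_mat <- is.na(as.matrix(df))           # TRUE where a value is missing
  ord <- order(colSums(na_mat))            # columns with the fewest NAs first
  na_mat <- na_mat[, ord, drop = FALSE]

  counts <- integer(ncol(na_mat))
  still_complete <- rep(TRUE, nrow(na_mat))
  for (k in seq_len(ncol(na_mat))) {
    # a row stays complete only if column k is also non-missing
    still_complete <- still_complete & !na_mat[, k]
    counts[k] <- sum(still_complete)
  }
  names(counts) <- colnames(na_mat)        # original column names, in sorted order
  counts
}

# Small made-up data frame, used only to illustrate the call.
toy <- data.frame(
  a = c(1, NA, NA, 4, 5),
  b = c(1, 2, 3, 4, 5),
  c = c(NA, 2, 3, 4, 5)
)

complete_by_prefix(toy)
```

Unlike the transposed data.table result, the returned vector keeps the original column names, so the subset for a given size k can be read off as the first k names.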
