Python - Pandas, resample dataset to have balanced classes

Question

Given the following data frame, which has only 2 possible labels:

   name  f1  f2  label
0     A   8   9      1
1     A   5   3      1
2     B   8   9      0
3     C   9   2      0
4     C   8   1      0
5     C   9   1      0
6     D   2   1      0
7     D   9   7      0
8     D   3   1      0
9     E   5   1      1
10    E   3   6      1
11    E   7   1      1

I've written some code to group the data by the 'name' column and pivot the result into a numpy array, so that each row is the collection of all the samples of a specific group, and the labels are in a separate numpy array:

Data:

[[8 9] [5 3] [0 0]] # A label = 1
[[8 9] [0 0] [0 0]] # B label = 0
[[9 2] [8 1] [9 1]] # C label = 0
[[2 1] [9 7] [3 1]] # D label = 0
[[5 1] [3 6] [7 1]] # E label = 1

Labels:

[[1]
 [0]
 [0]
 [0]
 [1]]

Code:

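import pandas as pd
import numpy as np


def prepare_data(group_name):
    df = pd.read_csv("../data/tmp.csv")

    # Number the rows within each group (0, 1, 2, ...) so they can be pivoted.
    group_index = df.groupby(group_name).cumcount()
    # Pad every group with zero rows up to the size of the largest group.
    data = (df.set_index([group_name, group_index])
            .unstack(fill_value=0).stack())

    # One label per group, taken from the group's first row.
    target = np.array(data['label'].groupby(level=0).apply(lambda x: [x.values[0]]).tolist())
    # Drop the label column and collect each group's rows into a 3-D array.
    data = data.loc[:, data.columns != 'label']
    data = np.array(data.groupby(level=0).apply(lambda x: x.values.tolist()).tolist())
    print(data)
    print(target)


prepare_data('name')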

I would like to resample the data by deleting instances from the over-represented class.

For example:

[[8 9] [5 3] [0 0]] # A label = 1
[[8 9] [0 0] [0 0]] # B label = 0
[[9 2] [8 1] [9 1]] # C label = 0
# group D was deleted randomly from the '0' labels
[[5 1] [3 6] [7 1]] # E label = 1

would be an acceptable solution, since removing D (labeled '0') results in a balanced dataset with 2 groups labeled '1' and 2 groups labeled '0'.

Recommended answer

Provided that each name carries exactly one label (e.g. all rows of A are labeled 1), you can use the following approach:

Group the names by label and check which label has an excess (in terms of unique names).
Randomly remove names from the over-represented label class to eliminate the excess.
Select the part of the data frame that does not contain the removed names.

The code is as follows:

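# Assumes `df` is the data frame from the question and numpy is imported as np.
# Unique names per label.
labels = df.groupby('label').name.unique()
# Sort the over-represented class to the head.
labels = labels[labels.apply(len).sort_values(ascending=False).index]
# Number of surplus names in the larger class.
excess = len(labels.iloc[0]) - len(labels.iloc[1])
# Pick that many names at random and drop all of their rows.
remove = np.random.choice(labels.iloc[0], excess, replace=False)
df2 = df[~df.name.isin(remove)]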
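
For experimenting without the CSV file, the same idea can be wrapped in a self-contained sketch. This is not part of the original answer: the helper name balance_groups, its parameters and the seed argument are hypothetical, and the data frame is rebuilt by hand from the table in the question. The sketch also checks the premise that each name has exactly one label, and it keeps as many names per label as the rarest label has, so it would work for more than two labels as well.

```python
import numpy as np
import pandas as pd

# Sample data copied from the table in the question.
df = pd.DataFrame({
    "name":  list("AABCCCDDDEEE"),
    "f1":    [8, 5, 8, 9, 8, 9, 2, 9, 3, 5, 3, 7],
    "f2":    [9, 3, 9, 2, 1, 1, 1, 7, 1, 1, 6, 1],
    "label": [1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1],
})


def balance_groups(df, group_col="name", label_col="label", seed=None):
    """Hypothetical helper: randomly drop whole groups so that every label
    ends up with as many unique groups as the rarest label."""
    # Check the premise: each group carries exactly one label.
    if not df.groupby(group_col)[label_col].nunique().eq(1).all():
        raise ValueError("each group must have exactly one label")

    rng = np.random.default_rng(seed)
    # One entry per group: group name -> its label.
    group_labels = df.drop_duplicates(group_col).set_index(group_col)[label_col]
    # Number of groups to keep per label (size of the rarest label class).
    target = group_labels.value_counts().min()

    keep = []
    for _, names in group_labels.groupby(group_labels):
        # Sample `target` group names from this label class without replacement.
        keep.extend(rng.choice(names.index.to_numpy(), size=target, replace=False))
    return df[df[group_col].isin(keep)]


balanced = balance_groups(df, seed=0)
# Count the unique names that remain under each label.
print(balanced.groupby("label")["name"].nunique())
```

Because whole names are kept or dropped, the balanced frame can be passed on to the pivoting step from the question unchanged. Which of B, C and D is dropped depends on the seed.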
