Recommended anomaly detection technique for a simple, one-dimensional scenario?


Problem description

I have a scenario where I have several thousand instances of data. The data itself is represented as a single integer value. I want to be able to detect when an instance is an extreme outlier.

For example, with the following example data:

a = 10
b = 14
c = 25
d = 467
e = 12

d is clearly an anomaly, and I would want to perform a specific action based on this.

I was tempted to just try and use my knowledge of the particular domain to detect anomalies. For instance, figure out a distance from the mean value that is useful, and check for that, based on heuristics. However, I think it's probably better if I investigate more general, robust anomaly detection techniques, which have some theory behind them.

Since my working knowledge of mathematics is limited, I'm hoping to find a technique which is simple, such as using standard deviation. Hopefully the single-dimensioned nature of the data will make this quite a common problem, but if more information for the scenario is required please leave a comment and I will give more info.


Edit: thought I'd add more information about the data and what I've tried in case it makes one answer more correct than another.

The values are all positive and non-zero. I expect that the values will form a normal distribution. This expectation is based on an intuition of the domain rather than on analysis; if this is a bad thing to assume, please let me know. In terms of clustering, unless there are also standard algorithms for choosing a k-value, I would find it hard to provide this value to a k-means algorithm.

The action I want to take for an outlier/anomaly is to present it to the user and recommend that the data point be removed from the data set (I won't get into how they would do that, but it makes sense for my domain), so that it will not be used as input to another function.

So far I have tried the three-sigma rule and the IQR outlier test on my limited data set. IQR flags values which are not extreme enough, while three-sigma points out instances which better fit my intuition of the domain.


Information on algorithms, techniques or links to resources to learn about this specific scenario are valid and welcome answers.

What is a recommended anomaly detection technique for simple, one-dimensional data?

Solution

Check out the three-sigma rule:

mu  = mean of the data
std = standard deviation of the data
IF abs(x-mu) > 3*std  THEN  x is outlier
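
For reference, the rule above could be sketched in Python roughly as follows. This is only an illustration: the function name `three_sigma_outliers` and the sample list `data` are made up for the example, and the population standard deviation (`statistics.pstdev`) is an assumed choice, since the rule does not say which estimator to use.

```python
import statistics

def three_sigma_outliers(values):
    """Return the values lying more than 3 standard deviations from the mean."""
    mu = statistics.mean(values)
    std = statistics.pstdev(values)  # assumption: population standard deviation
    if std == 0:
        return []  # all values identical, nothing can be an outlier
    return [x for x in values if abs(x - mu) > 3 * std]

# Hypothetical sample: 19 "typical" values plus one extreme value.
data = [10, 14, 25, 12, 11, 13, 15, 12, 14, 16,
        10, 13, 12, 15, 11, 14, 13, 12, 16, 467]
print(three_sigma_outliers(data))  # → [467]
```

One caveat worth knowing: the outlier itself inflates both the mean and the standard deviation, so on very small samples the rule can fail to flag anything. With only the five values from the question (10, 14, 25, 467, 12), the threshold works out to roughly 542 while 467 is only about 361 away from the mean, so nothing is reported. With several thousand instances this is much less of a concern.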

An alternative method is the IQR outlier test:

Q25 = 25th_percentile
Q75 = 75th_percentile
IQR = Q75 - Q25         // inter-quartile range
IF (x < Q25 - 1.5*IQR) OR (Q75 + 1.5*IQR < x) THEN  x is a mild outlier
IF (x < Q25 - 3.0*IQR) OR (Q75 + 3.0*IQR < x) THEN  x is an extreme outlier

This test is usually employed by box plots (indicated by the whiskers).
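
A rough Python sketch of the IQR test might look like the following. The function name `iqr_outliers` and the sample `data` are hypothetical, and the quartiles come from `statistics.quantiles` with its default method; other percentile conventions can give slightly different fences.

```python
import statistics

def iqr_outliers(values):
    """Split values into 'mild' and 'extreme' outliers using the IQR fences."""
    q25, _, q75 = statistics.quantiles(values, n=4)  # default 'exclusive' method
    iqr = q75 - q25
    mild, extreme = [], []
    for x in values:
        if x < q25 - 3.0 * iqr or x > q75 + 3.0 * iqr:
            extreme.append(x)
        elif x < q25 - 1.5 * iqr or x > q75 + 1.5 * iqr:
            mild.append(x)
    return {"mild": mild, "extreme": extreme}

# Hypothetical sample: 19 "typical" values plus one extreme value.
data = [10, 14, 25, 12, 11, 13, 15, 12, 14, 16,
        10, 13, 12, 15, 11, 14, 13, 12, 16, 467]
print(iqr_outliers(data))  # → {'mild': [], 'extreme': [25, 467]}
```

Note that on this sample the value 25 is reported alongside 467. Because the fences depend only on the middle half of the data, tightly packed data produces narrow fences, which may be why the test can flag values that do not look extreme.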


EDIT:

For your case (simple 1D univariate data), I think my first answer is well suited. That however isn't applicable to multivariate data.

@smaclell suggested using K-means to find the outliers. Besides the fact that it is mainly a clustering algorithm (not really an outlier detection technique), the problem with k-means is that it requires knowing in advance a good value for the number of clusters K.

A better-suited technique is DBSCAN: a density-based clustering algorithm. Basically, it grows regions with sufficiently high density into clusters, each of which is a maximal set of density-connected points.

DBSCAN requires two parameters: epsilon and minPoints. It starts with an arbitrary point that has not been visited. It then finds all the neighbor points within distance epsilon of the starting point.

If the number of neighbors is greater than or equal to minPoints, a cluster is formed. The starting point and its neighbors are added to this cluster and the starting point is marked as visited. The algorithm then repeats the evaluation process for all the neighbors recursively.

If the number of neighbors is less than minPoints, the point is marked as noise.

If a cluster is fully expanded (all points within reach are visited) then the algorithm proceeds to iterate through the remaining unvisited points until they are depleted.

Finally, all points marked as noise are considered outliers.
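
To make the steps above concrete, here is a minimal pure-Python sketch of DBSCAN for one-dimensional data. It is an illustration under assumptions, not a tuned implementation: the name `dbscan_1d`, the label convention (`-1` for noise), and the parameter values `epsilon=15` and `min_points=2` are all chosen just for this example, and the neighbor search is a simple O(n²) scan.

```python
def dbscan_1d(values, epsilon, min_points):
    """Label each value with a cluster id (0, 1, ...) or -1 for noise."""
    NOISE, UNVISITED = -1, None
    labels = [UNVISITED] * len(values)

    def neighbors(i):
        # Indices of all points within epsilon of values[i] (including i itself).
        return [j for j, v in enumerate(values) if abs(v - values[i]) <= epsilon]

    cluster = 0
    for i in range(len(values)):
        if labels[i] is not UNVISITED:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_points:
            labels[i] = NOISE  # may still be claimed later as a border point
            continue
        labels[i] = cluster
        queue = list(seeds)
        while queue:  # expand the cluster iteratively instead of recursively
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a dense point
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_points:
                queue.extend(nbrs)  # j is dense too, so keep growing
        cluster += 1
    return labels

data = [10, 14, 25, 467, 12]  # the example values from the question
labels = dbscan_1d(data, epsilon=15, min_points=2)
outliers = [x for x, label in zip(data, labels) if label == -1]
print(labels)    # → [0, 0, 0, -1, 0]
print(outliers)  # → [467]
```

The result depends heavily on `epsilon` and `min_points`, so in practice these would have to be chosen with some knowledge of how spread out the normal values are.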
