Subquery and MySQL caching on an 18M+ row table

This article walks through a question about subqueries and MySQL caching on a table with 18M+ rows; the discussion and the recommended answer may be a useful reference for similar problems.

Problem Description

As this is my first post it seems I can only post one link, so I have listed the sites I'm referring to at the bottom. In a nutshell, my goal is to make the database return results faster. I have tried to include as much relevant information as I could think of to help frame the questions at the bottom of the post.

8 processors
model name      : Intel(R) Xeon(R) CPU           E5440  @ 2.83GHz
cache size      : 6144 KB
cpu cores       : 4 

top - 17:11:48 up 35 days, 22:22, 10 users,  load average: 1.35, 4.89, 7.80
Tasks: 329 total,   1 running, 328 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.0%us,  0.0%sy,  0.0%ni, 87.4%id, 12.5%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   8173980k total,  5374348k used,  2799632k free,    30148k buffers
Swap: 16777208k total,  6385312k used, 10391896k free,  2615836k cached

However, we are looking at moving the MySQL installation to a different machine in the cluster that has 256 GB of RAM.

My MySQL table looks like this:

CREATE TABLE ClusterMatches 
(
    id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    cluster_index INT, 
    matches LONGTEXT,
    tfidf FLOAT,
    INDEX(cluster_index)   
);

It has approximately 18M rows, with 1M unique cluster_index values and 6K unique matches. The SQL query I am generating in PHP looks like this:

$sql_query="SELECT `matches`,sum(`tfidf`) FROM 
(SELECT * FROM Test2_ClusterMatches WHERE `cluster_index` in (".$clusters.")) 
AS result GROUP BY `matches` ORDER BY sum(`tfidf`) DESC LIMIT 0, 10;";

where $clusters contains a string of approximately 3,000 comma-separated cluster_index values. This query touches approximately 50,000 rows and takes about 15 s to run; when the same query is run again it takes about 1 s.
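
As a hedged aside (not from the original post): with query_cache_size set to zero (see the variable dump below), the second-run speedup most likely comes from data pages already being cached (InnoDB buffer pool and OS file cache) rather than from the query cache. While benchmarking, the SQL_NO_CACHE hint can rule the query cache out of the comparison entirely:

SELECT SQL_NO_CACHE `matches`, SUM(`tfidf`) FROM
(SELECT * FROM ClusterMatches WHERE `cluster_index` IN (1, 2, 3)) -- full ~3,000-id list elided
AS result GROUP BY `matches` ORDER BY SUM(`tfidf`) DESC LIMIT 0, 10;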

  1. The contents of the table can be assumed to be static.
  2. The number of concurrent users is low.
  3. The query above is currently the only query that will be run against the table.

Subquery

Based on this post [stackoverflow: Cache/Re-Use a Subquery in MySQL][1] and the improvement in query time, I believe my subquery can be indexed.

mysql> EXPLAIN EXTENDED SELECT `matches`,sum(`tfidf`) FROM 
(SELECT * FROM ClusterMatches WHERE `cluster_index` in (1,2,...,3000)) 
AS result GROUP BY `matches` ORDER BY sum(`tfidf`) ASC LIMIT 0, 10;

+----+-------------+----------------------+-------+---------------+---------------+---------+------+-------+---------------------------------+
| id | select_type | table                | type  | possible_keys | key           | key_len | ref  | rows  | Extra                           |
+----+-------------+----------------------+-------+---------------+---------------+---------+------+-------+---------------------------------+
|  1 | PRIMARY     | <derived2>           | ALL   | NULL          | NULL          | NULL    | NULL | 48528 | Using temporary; Using filesort | 
|  2 | DERIVED     | ClusterMatches       | range | cluster_index | cluster_index | 5       | NULL | 53689 | Using where                     | 
+----+-------------+----------------------+-------+---------------+---------------+---------+------+-------+---------------------------------+

According to this older article, [Optimizing MySQL: Queries and Indexes][2], the bad ones to see in the Extra info are "Using temporary" and "Using filesort".

The query cache is available, but effectively turned off because its size is currently set to zero (a sketch of enabling it follows the variable listing below):


mysqladmin variables;
+---------------------------------+----------------------+
| Variable_name                   | Value                |
+---------------------------------+----------------------+
| bdb_cache_size                  | 8384512              | 
| binlog_cache_size               | 32768                | 
| expire_logs_days                | 0                    |
| have_query_cache                | YES                  | 
| flush                           | OFF                  |
| flush_time                      | 0                    |
| innodb_additional_mem_pool_size | 1048576              |
| innodb_autoextend_increment     | 8                    |
| innodb_buffer_pool_awe_mem_mb   | 0                    |
| innodb_buffer_pool_size         | 8388608              |
| join_buffer_size                | 131072               |
| key_buffer_size                 | 8384512              |
| key_cache_age_threshold         | 300                  |
| key_cache_block_size            | 1024                 |
| key_cache_division_limit        | 100                  |
| max_binlog_cache_size           | 18446744073709547520 | 
| sort_buffer_size                | 2097144              |
| table_cache                     | 64                   | 
| thread_cache_size               | 0                    | 
| query_cache_limit               | 1048576              |
| query_cache_min_res_unit        | 4096                 |
| query_cache_size                | 0                    |
| query_cache_type                | ON                   |
| query_cache_wlock_invalidate    | OFF                  |
| read_rnd_buffer_size            | 262144               |
+---------------------------------+----------------------+
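
Since have_query_cache is YES and query_cache_type is ON, the cache could be enabled at runtime just by giving it some memory. A minimal sketch; the 64 MB figure is an illustrative assumption, not a recommendation from the post:

SET GLOBAL query_cache_size = 64 * 1024 * 1024;  -- 64 MB, illustrative only
SHOW STATUS LIKE 'Qcache%';                      -- hit/insert counters confirm whether it is being used

Given assumption 1 above (the table contents are static), cached result sets would only ever be invalidated by explicit writes to the table.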

Based on this article on [MySQL Database Performance Tuning][3], I believe the values I need to tweak are (see the sketch after this list):

  1. table_cache
  2. key_buffer
  3. sort_buffer
  4. read_buffer_size
  5. record_rnd_buffer (used for GROUP BY and ORDER BY terms)
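
These correspond to the server variables table_cache, key_buffer_size, sort_buffer_size, read_buffer_size and read_rnd_buffer_size in the listing above. A minimal sketch of trying larger values at runtime follows; every size here is an illustrative assumption that would need benchmarking against this workload, not a recommendation from the post:

SET GLOBAL table_cache          = 512;                -- current value is 64
SET GLOBAL key_buffer_size      = 256 * 1024 * 1024;  -- MyISAM index cache; irrelevant if the table stays InnoDB
SET GLOBAL sort_buffer_size     = 8 * 1024 * 1024;    -- per-connection buffer used by filesort
SET GLOBAL read_buffer_size     = 2 * 1024 * 1024;    -- per-connection sequential-scan buffer
SET GLOBAL read_rnd_buffer_size = 8 * 1024 * 1024;    -- per-connection buffer for reading rows in sorted order
-- innodb_buffer_pool_size (currently 8 MB) matters most for an InnoDB table,
-- but on this MySQL version it can only be changed via my.cnf plus a restart.

For the changes to survive a restart they would also need to go into my.cnf.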

Identifying Areas for Improvement - MySQL Query Tuning

  1. Changing the datatype for matches to an int index pointing to another table, since ["MySQL will indeed use a dynamic row format if it contains variable length fields like TEXT or BLOB, which, in this case, means sorting needs to be done on disk. The solution is not to eschew these datatypes, but rather to split off such fields into an associated table."][4] (see the sketch after this list)
  2. Indexing the new match_index field so that the GROUP BY matches occurs faster, based on the statement ["You should probably create indices for any field on which you are selecting, grouping, ordering, or joining."][5]
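
A minimal sketch of that split, using the hypothetical lookup-table name MatchLookup and column name match_index (the answer below settles on a similar layout). On 18M rows the UPDATE join below would be slow without a prefix index on matches, so treat this as an outline rather than a tested migration:

CREATE TABLE MatchLookup
(
    match_index INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    matches LONGTEXT
);

-- roughly 6K distinct values according to the figures above
INSERT INTO MatchLookup (matches)
    SELECT DISTINCT matches FROM ClusterMatches;

ALTER TABLE ClusterMatches ADD COLUMN match_index INT UNSIGNED,
                           ADD INDEX (match_index);

UPDATE ClusterMatches cm
    JOIN MatchLookup ml ON ml.matches = cm.matches
    SET cm.match_index = ml.match_index;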

Tools

To tweak performance I plan to use:

  1. [EXPLAIN][6], with reference to the [output format][7]
  2. [ab - Apache HTTP server benchmarking tool][8]
  3. [Profiling][9] with [log data][10]

Future Database Size

The goal is to build a system that can have 1M unique cluster_index values and 1M unique match values, approximately 3,000,000,000 table rows, with a response time to the query of around 0.5 s (we can add more RAM as necessary and distribute the database across the cluster).

  1. I think we want to keep the entire recordset in RAM so that the query never touches disk; if we keep the whole database in the MySQL cache, does that remove the need for memcachedb?
  2. Is trying to keep the whole database in the MySQL cache a bad strategy because it isn't designed to be persistent? Would something like memcachedb or redis be a better approach, and if so why?
  3. Is the temporary table "result" created by the query destroyed automatically when the query completes?
  4. Should we switch from InnoDB to MyISAM [as it is suited to reading heavy data, whereas InnoDB is suited to writing heavy data][11]?
  5. With my cache apparently set to zero in my [query cache configuration][12], why does the current query run faster the second time I run it?
  6. Can I restructure my query to eliminate "Using temporary" and "Using filesort"? Should I use a join instead of a subquery (see the sketch after this list)?
  7. How do you view the size of the MySQL [Data Cache][13]?
  8. What sizes would you suggest as starting points for table_cache, key_buffer, sort_buffer, read_buffer_size and record_rnd_buffer?
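
For question 6, one hedged possibility (not something suggested in the original post) is simply to flatten the derived table, since the inner SELECT * does nothing the outer query cannot do itself; whether this actually removes the temporary table and filesort would still need to be confirmed with EXPLAIN:

SELECT `matches`, SUM(`tfidf`) AS total
FROM ClusterMatches
WHERE `cluster_index` IN (1, 2, 3)   -- full ~3,000-id list elided
GROUP BY `matches`
ORDER BY total DESC
LIMIT 0, 10;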

Links

  • 1:stackoverflow.com/questions/658937/cache-re-use-a-subquery-in-mysql
  • 2:databasejournal.com/features/mysql/article.php/10897_1382791_4/Optimizing-MySQL-Queries-and-Indexes.htm
  • 3:debianhelp.co.uk/mysqlperformance.htm
  • 4:20bits.com/articles/10-tips-for-optimizing-mysql-queries-that-dont-suck/
  • 5:20bits.com/articles/10-tips-for-optimizing-mysql-queries-that-dont-suck/
  • 6:dev.mysql.com/doc/refman/5.0/en/explain.html
  • 7:dev.mysql.com/doc/refman/5.0/en/explain-output.html
  • 8:httpd.apache.org/docs/2.2/programs/ab.html
  • 9:mtop.sourceforge.net/
  • 10:dev.mysql.com/doc/refman/5.0/en/slow-query-log.html
  • 11:20bits.com/articles/10-tips-for-optimizing-mysql-queries-that-dont-suck/
  • 12:dev.mysql.com/doc/refman/5.0/en/query-cache-configuration.html
  • 13:dev.mysql.com/tech-resources/articles/mysql-query-cache.html

Recommended Answer

    Changing the Tables

    Based on the suggestions in this post on how to pick indexes for ORDER BY and GROUP BY queries, the tables now look like this:

    CREATE TABLE ClusterMatches 
    (
        cluster_index INT UNSIGNED, 
        match_index INT UNSIGNED,
        id INT NOT NULL AUTO_INCREMENT,
        tfidf FLOAT,
        PRIMARY KEY (match_index,cluster_index,id,tfidf)
    );
    CREATE TABLE MatchLookup 
    (
        match_index INT UNSIGNED NOT NULL PRIMARY KEY,
        image_match TINYTEXT
    );
    

    The query without sorting the results by SUM(tfidf) looks like this:

    SELECT match_index, SUM(tfidf) FROM ClusterMatches 
    WHERE cluster_index in (1,2,3 ... 3000) GROUP BY match_index LIMIT 10;

    This eliminates "Using temporary" and "Using filesort":

    explain extended SELECT match_index, SUM(tfidf) FROM ClusterMatches 
    WHERE cluster_index in (1,2,3 ... 3000) GROUP BY match_index LIMIT 10;
    +----+-------------+----------------------+-------+---------------+---------+---------+------+-------+--------------------------+
    | id | select_type | table                | type  | possible_keys | key     | key_len | ref  | rows  | Extra                    |
    +----+-------------+----------------------+-------+---------------+---------+---------+------+-------+--------------------------+
    |  1 | SIMPLE      | ClusterMatches       | range | PRIMARY       | PRIMARY | 4       | NULL | 14938 | Using where; Using index | 
    +----+-------------+----------------------+-------+---------------+---------+---------+------+-------+--------------------------+

    However, if I add the ORDER BY SUM(tfidf) back in:

    SELECT match_index, SUM(tfidf) AS total FROM ClusterMatches
    WHERE cluster_index in (1,2,3 ... 3000) GROUP BY match_index 
    ORDER BY total DESC LIMIT 0,10;
    +-------------+--------------------+
    | match_index | total              |
    +-------------+--------------------+
    |         868 |   0.11126546561718 | 
    |        4182 | 0.0238558370620012 | 
    |        2162 | 0.0216601379215717 | 
    |        1406 | 0.0191618576645851 | 
    |        4239 | 0.0168981291353703 | 
    |        1437 | 0.0160425212234259 | 
    |        2599 | 0.0156466849148273 | 
    |         394 | 0.0155945559963584 | 
    |        3116 | 0.0151005545631051 | 
    |        4028 | 0.0149106932803988 | 
    +-------------+--------------------+
    10 rows in set (0.03 sec)

    The result is suitably fast at this scale, BUT having the ORDER BY SUM(tfidf) means it uses temporary and filesort:

    explain extended SELECT match_index, SUM(tfidf) AS total FROM ClusterMatches 
    WHERE cluster_index IN (1,2,3 ... 3000) GROUP BY match_index 
    ORDER BY total DESC LIMIT 0,10;
    +----+-------------+----------------------+-------+---------------+---------+---------+------+-------+-----------------------------------------------------------+
    | id | select_type | table                | type  | possible_keys | key     | key_len | ref  | rows  | Extra                                                     |
    +----+-------------+----------------------+-------+---------------+---------+---------+------+-------+-----------------------------------------------------------+
    |  1 | SIMPLE      | ClusterMatches       | range | PRIMARY       | PRIMARY | 4       | NULL | 65369 | Using where; Using index; Using temporary; Using filesort | 
    +----+-------------+----------------------+-------+---------------+---------+---------+------+-------+-----------------------------------------------------------+

    I'm looking for a solution that doesn't use temporary or filesort, along the lines of:

    SELECT match_index, SUM(tfidf) AS total FROM ClusterMatches 
    WHERE cluster_index IN (1,2,3 ... 3000) GROUP BY cluster_index, match_index 
    HAVING total>0.01 ORDER BY cluster_index;

    where I don't need to hardcode a threshold for total. Any ideas?

That concludes this article on subqueries and MySQL caching for an 18M+ row table; hopefully the recommended answer above is helpful.
