Gensim Word2Vec: selecting a subset of word vectors from a pretrained model

Question

I have a large pretrained Word2Vec model in gensim from which I want to use the pretrained word vectors for an embedding layer in my Keras model.

The problem is that the embedding is enormous and I don't need most of the word vectors (because I know which words can occur as input). So I want to get rid of them to reduce the size of my embedding layer.

Is there a way to keep only the desired word vectors (including the corresponding indices!), based on a whitelist of words?

Solution

Thanks to this answer (I've changed the code a little to improve it), you can use the following code to solve your problem.

The subset of words to keep is in restricted_word_set (it can be either a list or a set) and w2v is the model. Here is the function:

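import numpy as np

def restrict_w2v(w2v, restricted_word_set):
    """Shrink w2v in place so it only contains words in restricted_word_set.

    Uses the gensim 3.x attribute names (vocab, index2entity, vectors_norm).
    """
    new_vectors = []
    new_vocab = {}
    new_index2entity = []
    new_vectors_norm = []

    # Walk the vocabulary in index order so the kept words stay in order.
    for i in range(len(w2v.vocab)):
        word = w2v.index2entity[i]
        vec = w2v.vectors[i]
        vocab = w2v.vocab[word]
        vec_norm = w2v.vectors_norm[i]
        if word in restricted_word_set:
            # Reassign a contiguous index within the reduced vocabulary.
            vocab.index = len(new_index2entity)
            new_index2entity.append(word)
            new_vocab[word] = vocab
            new_vectors.append(vec)
            new_vectors_norm.append(vec_norm)

    # Replace every word-related attribute with its reduced version.
    w2v.vocab = new_vocab
    w2v.vectors = np.array(new_vectors)
    w2v.index2entity = np.array(new_index2entity)
    w2v.index2word = np.array(new_index2entity)
    w2v.vectors_norm = np.array(new_vectors_norm)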

WARNING: When the model is first loaded, vectors_norm is None, so calling this function right away raises an error. vectors_norm only becomes a numpy.ndarray after its first use, so call something like most_similar("cat") before using the function to make sure vectors_norm is not None.
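One way to avoid that error is to fill in vectors_norm before calling the function. The following is only a sketch under stated assumptions, not part of the original answer: ensure_vectors_norm is a hypothetical helper that L2-normalizes the rows of vectors with NumPy when vectors_norm is still None, and the SimpleNamespace object merely stands in for a loaded gensim 3.x KeyedVectors instance. It does not handle all-zero rows.

```python
from types import SimpleNamespace

import numpy as np


def ensure_vectors_norm(kv):
    """Populate kv.vectors_norm with unit-length rows if it is still None.

    Hypothetical helper: assumes kv exposes `vectors` and `vectors_norm`
    the way gensim 3.x KeyedVectors objects do.
    """
    if getattr(kv, "vectors_norm", None) is None:
        # Divide every row by its Euclidean length.
        norms = np.linalg.norm(kv.vectors, axis=1, keepdims=True)
        kv.vectors_norm = kv.vectors / norms
    return kv


# Stand-in for a freshly loaded model (illustration only, not real gensim).
toy = SimpleNamespace(
    vectors=np.array([[3.0, 4.0], [0.0, 2.0]], dtype=np.float32),
    vectors_norm=None,
)
ensure_vectors_norm(toy)
print(toy.vectors_norm)
```

With a real gensim 3.x model, calling most_similar once, as the warning suggests, should have the same effect.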

The function rewrites every word-related attribute of the Word2VecKeyedVectors object.
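The attributes touched here (vocab, index2entity, vectors_norm) belong to the gensim 3.x API and were removed in gensim 4.x. The same filtering idea can be sketched without relying on any of them, working only on an ordered word list and its matching matrix. restrict_arrays is a hypothetical helper written for illustration, and the toy data stands in for whatever word list and vector matrix your gensim version exposes.

```python
import numpy as np


def restrict_arrays(words, vectors, whitelist):
    """Keep only whitelisted rows and rebuild a contiguous word -> index map.

    Hypothetical, library-independent helper: `words` is the ordered list of
    vocabulary entries and `vectors` the matching (n_words, dim) matrix.
    """
    whitelist = set(whitelist)
    # Row positions of the words to keep, in their original order.
    keep = [i for i, word in enumerate(words) if word in whitelist]
    new_words = [words[i] for i in keep]
    new_vectors = np.asarray(vectors)[keep]
    # New indices are contiguous and start at 0.
    word_to_index = {word: i for i, word in enumerate(new_words)}
    return new_words, new_vectors, word_to_index


words = ["beer", "wine", "lager", "python"]
vectors = np.arange(8, dtype=np.float32).reshape(4, 2)
new_words, new_vectors, word_to_index = restrict_arrays(
    words, vectors, {"beer", "python"}
)
print(new_words)       # ['beer', 'python']
print(word_to_index)   # {'beer': 0, 'python': 1}
```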

Usage:

from gensim.models import KeyedVectors

w2v = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin.gz", binary=True)
w2v.most_similar("beer")

[('beers', 0.8409687876701355),
('lager', 0.7733745574951172),
('Beer', 0.71753990650177),
('drinks', 0.668931245803833),
('lagers', 0.6570086479187012),
('Yuengling_Lager', 0.655455470085144),
('microbrew', 0.6534324884414673),
('Brooklyn_Lager', 0.6501551866531372),
('suds', 0.6497018337249756),
('brewed_beer', 0.6490240097045898)]

restricted_word_set = {"beer", "wine", "computer", "python", "bash", "lagers"}
restrict_w2v(w2v, restricted_word_set)
w2v.most_similar("beer")

[('lagers', 0.6570085287094116),
('wine', 0.6217695474624634),
('bash', 0.20583480596542358),
('computer', 0.06677375733852386),
('python', 0.005948573350906372)]

restrict_w2v can also be used to remove specific words: pass the full vocabulary minus the words you want to drop as restricted_word_set.
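For the Keras use case in the question, the reduced vectors still have to be turned into an embedding matrix whose row numbers match the word indices fed to the model. The sketch below is an illustration only: build_embedding_matrix is a hypothetical helper, reserving row 0 for padding is an assumption about the downstream model, and the toy arrays stand in for the word list and vector matrix of the restricted model.

```python
import numpy as np


def build_embedding_matrix(words, vectors, reserve_padding=True):
    """Return (matrix, word_to_index) for an embedding layer.

    With reserve_padding=True, row 0 is an all-zero padding vector and real
    words start at index 1 (an assumption about the downstream model).
    """
    vectors = np.asarray(vectors, dtype=np.float32)
    offset = 1 if reserve_padding else 0
    matrix = np.zeros((len(words) + offset, vectors.shape[1]), dtype=np.float32)
    matrix[offset:] = vectors
    word_to_index = {word: i + offset for i, word in enumerate(words)}
    return matrix, word_to_index


# Toy stand-ins for the restricted model's word list and vector matrix.
words = ["beer", "wine", "lagers"]
vectors = np.array([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]])
matrix, word_to_index = build_embedding_matrix(words, vectors)
print(matrix.shape)      # (4, 2)
print(word_to_index)     # {'beer': 1, 'wine': 2, 'lagers': 3}
```

The matrix can then serve as the initial weights of a Keras Embedding layer with input_dim equal to matrix.shape[0] and output_dim equal to matrix.shape[1]; how the weights are passed in depends on the Keras version.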
