
[Paper Reading] Binary Embeddings


  • Paper title: Learning Compressed Sentence Representations for On-Device Text Processing
  • Chinese title: 面向机上文本处理的语句压缩表示学习
  • Paper download link: 1906.08340
  • Project code: GitHub@BinarySentEmb

Preface

This paper was published at ACL 2019 by seven researchers from Duke University, one from Stanford University, and one from Microsoft Research, and it concerns methods for learning sentence embeddings under low-resource constraints. Frankly, given the authors' strong backgrounds I expected deeper insights, but the approach is not especially novel and the content feels fairly shallow for a nine-author paper from people of this caliber; perhaps the experimental results carried it. Still, it works well as an introductory paper for getting familiar with this area.

Learning sentence embeddings is an important topic in NLP, because it provides pre-trained sentence representations for a great many downstream tasks, including semantic matching, text classification, sentiment recognition, and so on. Almost every NLP task above the word level, whether sentence-level or document-level, has to confront the problem of sentence embeddings. Some simple approaches take the mean (or a weighted average) of all word vectors in a sentence, or concatenate them; the former is too simplistic and lacks expressiveness, while the latter inevitably suffers from very high dimensionality and heavy resource consumption. Finding a simple yet highly usable sentence embedding is therefore crucial, and this paper is well suited for getting familiar with such methods. In terms of results, it reduces memory usage by two orders of magnitude while losing almost none of the original model's performance; at the very least, it is a practical workaround when the full model cannot be run.


Contents


Abstract

  1. Sentence-level vector representations are widely used in Natural Language Processing (NLP) problems, but the sentence vectors learned from data are usually continuous and real-valued, which makes them inconvenient to store and access on platforms with limited memory and computation, especially mobile devices such as phones;

  2. This paper proposes four strategies for converting general-purpose continuous sentence embeddings into a binarized form while preserving the semantic information contained in the original embeddings;

  3. The paper evaluates these binarized sentence embeddings on a wide range of downstream tasks; the results show that the method sacrifices only about 2% of model performance while reducing storage requirements by more than 98%. Moreover, for sentence-similarity problems one can simply compute the Hamming distance between two binarized embeddings instead of the inner product of two continuous embeddings, which lowers the computational cost (a small sketch follows);
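
To make the last point concrete, here is a minimal numpy sketch (not from the paper) contrasting the two similarity computations: the inner product of two continuous embeddings versus the Hamming distance between their binarized counterparts. The 4096 dimension matches the InferSent embeddings used later, and the zero threshold is only an illustration.

```python
import numpy as np

# toy continuous sentence embeddings (InferSent-sized, 4096 dims)
h1, h2 = np.random.randn(4096), np.random.randn(4096)
b1, b2 = (h1 > 0).astype(np.uint8), (h2 > 0).astype(np.uint8)   # illustrative binarization

inner = float(h1 @ h2)                        # continuous similarity: 4096 multiply-adds
hamming = int(np.count_nonzero(b1 != b2))     # binary distance: elementwise compare + count
print(inner, hamming)
```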


1 Introduction

  1. How sentence embeddings are learned:
  2. Prior work on sentence embeddings assumes that the embedding vectors are continuous and real-valued, which leads to the following problems:
  3. The design idea of this paper:
  4. The experimental results of this paper:
  5. The contributions of this paper:

2 Related Work

  1. Methods for learning pre-trained sentence embeddings from large-scale corpora fall roughly into two categories:
  2. Methods for learning memory-efficient embeddings with deep neural networks also fall roughly into two categories:
  3. In addition, this work is related to semantic hashing: see references [27, 39, 33, 38, 28];

3 Proposed Approach

  • The goal of this paper is to convert continuous sentence embeddings into compressed binarized sentence embeddings while preserving the semantic information of the former;
  • Mathematical formulation:
    • Let $x$ be an input sentence and $f$ the mapping defined by a pre-trained general-purpose sentence encoder; then $f(x)$ is the continuous sentence embedding produced by that encoder;
    • The paper learns a function $g$ that maps $f(x)$ to a highly informative binarized sentence embedding $g(f(x))$;

3.1 Hard Threshold

  1. Mathematical notation and formula of the hard-threshold method (a minimal sketch follows this list):
  2. Drawbacks of the hard-threshold method:
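
A minimal sketch of the hard-threshold idea, assuming each dimension of the continuous embedding is simply compared against a scalar threshold $s$ (0 here); the 4096 dimension matches InferSent but is otherwise arbitrary.

```python
import numpy as np

def hard_threshold(h, s=0.0):
    """Binarize a continuous embedding h elementwise: bit i is 1 iff h[i] > s."""
    return (h > s).astype(np.uint8)

h = np.random.randn(4096).astype(np.float32)   # a toy continuous sentence embedding
b = hard_threshold(h)                          # binary code in {0, 1}^4096, same dimension as h
```

Note that this keeps the original dimensionality; the compression comes only from replacing each 32-bit float with a single bit.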

3.2 Random Projection

  1. Overview of the random-projection method:
  2. Mathematical notation and formula of the random-projection method (see the sketch below):
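
A minimal sketch of the random-projection route, assuming a fixed Gaussian projection matrix shared across all sentences followed by the same zero threshold; the paper's exact choice of random matrix may differ.

```python
import numpy as np

def random_projection_binarize(h, k, seed=0):
    """Project h from its original dimension down to k with a fixed random matrix, then threshold at 0."""
    rng = np.random.default_rng(seed)            # fixed seed -> the same matrix for every sentence
    W = rng.standard_normal((k, h.shape[0]))
    return (W @ h > 0).astype(np.uint8)

h = np.random.randn(4096).astype(np.float32)
b = random_projection_binarize(h, k=2048)        # compressed binary code of length 2048
```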

3.3 Principal Component Analysis

  1. Principal component analysis can be used for dimensionality reduction, i.e. reducing the dimension of the pre-trained continuous sentence embeddings;

  2. The paper gives the detailed steps of principal component analysis (a minimal sketch follows):
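
A minimal sketch of the PCA route, assuming scikit-learn's PCA for the dimensionality reduction and a zero threshold for the final binarization; the paper's exact per-dimension thresholding may differ.

```python
import numpy as np
from sklearn.decomposition import PCA

# toy corpus of continuous embeddings: 2000 sentences x 4096 dims
H = np.random.randn(2000, 4096).astype(np.float32)

pca = PCA(n_components=512)
H_reduced = pca.fit_transform(H)        # mean-centered projection onto the top 512 components
B = (H_reduced > 0).astype(np.uint8)    # binarize the reduced embeddings
```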

3.4 Autoencoder Architecture

  1. Drawbacks of the previous methods and how they are addressed:
  2. The encoder network and its mathematical formulation:
  3. The decoder network and its mathematical formulation (a minimal sketch of the whole architecture follows this list):
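
A minimal PyTorch sketch of the autoencoder idea, assuming a single linear encoder/decoder pair, deterministic thresholding of the latent, and a straight-through estimator so gradients can pass through the binarization; the layer sizes and the exact gradient estimator are illustrative, not necessarily the paper's.

```python
import torch
import torch.nn as nn

class BinaryAutoencoder(nn.Module):
    """Encoder maps a continuous embedding h to a binary code b; decoder reconstructs h from b."""
    def __init__(self, dim_in=4096, dim_code=2048):
        super().__init__()
        self.encoder = nn.Linear(dim_in, dim_code)
        self.decoder = nn.Linear(dim_code, dim_in)

    def binarize(self, logits):
        probs = torch.sigmoid(logits)
        hard = (probs > 0.5).float()             # deterministic thresholding at 0.5
        return hard + probs - probs.detach()     # straight-through estimator for the backward pass

    def forward(self, h):
        b = self.binarize(self.encoder(h))
        return b, self.decoder(b)

model = BinaryAutoencoder()
h = torch.randn(64, 4096)                        # a batch of continuous embeddings
b, h_rec = model(h)
loss = nn.functional.mse_loss(h_rec, h)          # reconstruction objective
loss.backward()
```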

3.4.1 Semantic-preserving Regularizer

  1. Drawbacks of the plain autoencoder framework:
  2. Mathematical formulation of the semantic-preserving regularizer (an illustrative sketch follows):
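
Since this outline item does not reproduce the formula, the following is only an illustrative, hinge-style formulation of what a semantic-preserving regularizer could look like: it asks the binary codes to preserve the similarity ordering given by the continuous embeddings. This is an assumption for illustration, not the paper's exact Formula 7.

```python
import torch
import torch.nn.functional as F

def semantic_preserving_loss(h, b, margin=0.0):
    """Illustrative (assumed) regularizer: if the continuous embeddings say pair (0, 1) is more
    similar than pair (0, 2), penalize binary codes that invert that ordering."""
    cont_gap = F.cosine_similarity(h[0], h[1], dim=0) - F.cosine_similarity(h[0], h[2], dim=0)
    bin_gap = F.cosine_similarity(b[0], b[1], dim=0) - F.cosine_similarity(b[0], b[2], dim=0)
    return torch.relu(margin - bin_gap) * (cont_gap > 0).float()

h = torch.randn(3, 4096)                          # a toy triplet of continuous embeddings
b = (torch.randn(3, 2048) > 0).float()            # their (toy) binary codes
print(semantic_preserving_loss(h, b))
```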

3.5 Discussion

  1. Figure 1 illustrates the methods described above:
  2. Other possible strategies: empirical analysis shows that none of them work well;

4 Experimental Setup

4.1 Pre-trained Continuous Embeddings

  1. The paper uses InferSent from reference [7] as the pre-trained continuous embedding;
  2. The proposed methods also apply to other pre-trained general-purpose sentence embeddings;
  3. The sentence encoder is a Bi-LSTM with max-pooling over the hidden states;
  4. The model parameters are optimized on the SNLI (Stanford Natural Language Inference) dataset from reference [2] and the MultiNLI (Multi-Genre Natural Language Inference) dataset from reference [37], together with their corresponding natural language inference tasks;

4.2 Training Details

  1. Optimizer: Adam;
  2. Learning rate: $1 \times 10^{-5}$;
  3. Dimensions of the binarized embeddings: 512, 1024, 2048, 4096;
  4. Batch size: 64;
  5. Regularization coefficient $\lambda_{sp}$: 0.2, 0.5, 0.8, 1.0, with 0.8 performing best empirically;
  6. Autoencoder convergence: about 1 hour, so the method scales to larger datasets (the settings above are collected into a small setup sketch below);
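
Collecting the listed hyperparameters into one place, a minimal PyTorch setup might look as follows; the stand-in model is purely illustrative and not the paper's exact architecture.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(4096, 2048))                  # stand-in for the autoencoder's encoder
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)     # Adam with learning rate 1e-5
lambda_sp = 0.8                                               # best-performing regularizer weight
batch_size = 64                                               # listed batch size
code_dims = [512, 1024, 2048, 4096]                           # binarized embedding sizes explored
```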

4.3 Evaluation

  1. The paper uses the SentEval toolkit (GitHub@SentEval) proposed in reference [6] to evaluate the binarized sentence embeddings; specifically, the learned binarized embeddings are evaluated for their transfer performance on a series of downstream tasks, with the encoder weights kept fixed. These downstream tasks and datasets fall roughly into two categories (a usage sketch follows):
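
For reference, transfer evaluation with SentEval typically follows the pattern below. `encode_binary` is a hypothetical placeholder for the frozen encoder plus the binarization function $g(f(x))$ (random codes here just so the snippet runs), the task path must point to the downloaded SentEval data, and the task list shown is only a subset.

```python
import numpy as np
import senteval

def encode_binary(sentences):
    # hypothetical placeholder for the frozen encoder + binarization; random codes for illustration
    return (np.random.randn(len(sentences), 2048) > 0).astype(np.float32)

def prepare(params, samples):
    return  # nothing to precompute for a frozen encoder

def batcher(params, batch):
    sentences = [' '.join(tokens) for tokens in batch]
    return encode_binary(sentences)

params = {'task_path': 'data/senteval', 'usepytorch': True, 'kfold': 10}
se = senteval.engine.SE(params, batcher, prepare)
results = se.eval(['MR', 'CR', 'SUBJ', 'MPQA', 'SST2', 'TREC', 'MRPC'])
```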

4.4 Baselines

  1. Continuous-embedding baselines:
  2. Binarized-embedding baselines:

5 Experimental Results

  1. The paper experiments with five variants of the model for learning general-purpose binarized embeddings; detailed results are given in Table 1:

Table 1

5.1 Task Transfer Evaluation

  1. The evaluation results on the series of transfer tasks are also shown in Table 1:

5.2 Nearest Neighbor Retrieval

  1. Case study:
  2. Retrieval speed (a retrieval sketch follows this list):
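
A minimal numpy sketch of why Hamming-based retrieval is fast: binary codes are packed into bytes, and each query needs only an XOR plus a popcount per corpus entry instead of a floating-point inner product. The corpus size and code length are arbitrary.

```python
import numpy as np

def hamming_topk(query_bits, corpus_bits, k=5):
    """Return indices and distances of the k corpus codes closest to the query in Hamming distance."""
    xor = np.bitwise_xor(corpus_bits, query_bits)            # differing bits, byte-packed
    dists = np.unpackbits(xor, axis=1).sum(axis=1)           # popcount per corpus entry
    order = np.argsort(dists)[:k]
    return order, dists[order]

B = np.random.randn(10000, 2048) > 0                         # toy corpus of 10k binary codes
corpus_bits = np.packbits(B, axis=1)                         # 1 bit per dimension instead of 1 float
query_bits = np.packbits(B[42])
idx, d = hamming_topk(query_bits, corpus_bits)
print(idx, d)                                                # entry 42 comes back at distance 0
```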

5.3 Ablation Study

5.3.1 The Effect of the Semantic-preserving Loss

  1. By varying the hyperparameter $\lambda_{sp}$ in Formula 7 from 0.0 to 1.0, the comparison in Table 3 is obtained:

5.3.2 Sampling Strategy

  1. As described in Section 3.4, the binary latent vector $b$ can be generated either deterministically or by stochastic sampling against a threshold; the paper compares the two approaches, with results shown in Figure 2 (a small sketch of the two options follows):
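
A small sketch of the two binarization options being compared, assuming the encoder outputs Bernoulli probabilities: deterministic thresholding at 0.5 versus sampling each bit from its Bernoulli distribution.

```python
import torch

def binarize(probs, stochastic=False):
    """Deterministic: bit i is 1 iff probs[i] > 0.5; stochastic: bit i ~ Bernoulli(probs[i])."""
    return torch.bernoulli(probs) if stochastic else (probs > 0.5).float()

probs = torch.sigmoid(torch.randn(8))
print(binarize(probs), binarize(probs, stochastic=True))
```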

5.3.3 The Effect of Embedding Dimension

Figure 3


6 Conclusion

  • This section is not summarized here, as it mostly repeats earlier points;
  • In my view the paper is somewhat thin. Seeing authors from top institutions such as Stanford University and Microsoft Research, and an abstract promising four proposed strategies, I expected deeper insights; it turns out the first three are simply hard thresholding, random projection, and principal component analysis, all very plain methods. The fourth strategy is not particularly striking either, essentially a penalty term appended to a mean-squared-error loss. That this was accepted at ACL is a little surprising; perhaps institutional background also plays a role in acceptance, and even top conferences may not be entirely free of that.

References

[01] Yossi Adi, Einat Kermany, Yonatan Belinkov, Ofer Lavi, and Yoav Goldberg. 2017. Fine-grained analysis of sentence embeddings using auxiliary prediction tasks. CoRR, abs/1608.04207. 
[02] Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In EMNLP. 
[03] Miguel Á. Carreira-Perpiñán and Ramin Raziperchikolaei. 2015. Hashing with binary autoencoders. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 557–566. 
[04] Daniel Cer, Yinfei Yang, Sheng yi Kong, Nan Hua, Nicole Limtiaco, Rhomni St. John, Noah Constant, Mario Guajardo-Cespedes, Steve Yuan, Chris Tar, Yun-Hsuan Sung, Brian Strope, and Ray Kurzweil. 2018. Universal sentence encoder. CoRR, abs/1803.11175. 
[05] Ting Chen, Martin Renqiang Min, and Yizhou Sun. 2018. Learning k-way d-dimensional discrete codes for compact embedding representations. arXiv preprint arXiv:1806.09464. 
[06] Alexis Conneau and Douwe Kiela. 2018. Senteval: An evaluation toolkit for universal sentence representations. arXiv preprint arXiv:1803.05449. 
[07] Alexis Conneau, Douwe Kiela, Holger Schwenk, Loïc Barrault, and Antoine Bordes. 2017. Supervised learning of universal sentence representations from natural language inference data. In EMNLP. 
[08] Andrew M Dai and Quoc V Le. 2015. Semi-supervised sequence learning. In Advances in neural information processing systems, pages 3079–3087.
[09] Bo Dai, Ruiqi Guo, Sanjiv Kumar, Niao He, and Le Song. 2017. Stochastic generative hashing. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 913–922. JMLR.org. 
[10] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. 
[11] Zhe Gan, Yunchen Pu, Ricardo Henao, Chunyuan Li, Xiaodong He, and Lawrence Carin. 2017. Learning generic sentence representations using convolutional neural networks. In EMNLP. 
[12] Xavier Glorot and Yoshua Bengio. 2010. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pages 249–256. 
[13] Felix Hill, Kyunghyun Cho, and Anna Korhonen. 2016. Learning distributed representations of sentences from unlabelled data. In HLT-NAACL.
[14] Geoffrey Hinton. 2012. Neural networks for machine learning. Coursera [video lectures]. 
[15] Eric Jang, Shixiang Gu, and Ben Poole. 2016. Categorical reparameterization with gumbel-softmax. arXiv preprint arXiv:1611.01144. 
[16] Yacine Jernite, Samuel R. Bowman, and David A Sontag. 2017. Discourse-based objectives for fast unsupervised sentence representation learning. CoRR, abs/1705.00557. 
[17] Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. 2016. Bag of tricks for efficient text classification. arXiv preprint arXiv:1607.01759. 
[18] Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980. 
[19] Jamie Kiros and William Chan. 2018. Inferlite: Simple universal sentence representations from natural language inference data. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4868–4874.
[20] Ryan Kiros, Yukun Zhu, Ruslan Salakhutdinov, Richard S. Zemel, Antonio Torralba, Raquel Urtasun, and Sanja Fidler. 2015. Skip-thought vectors. In NIPS. 
[21] Lajanugen Logeswaran and Honglak Lee. 2018. An efficient framework for learning sentence representations. ICLR. 
[22] Allen Nie, Erin D. Bennett, and Noah D. Goodman. 2017. Dissent: Sentence representation learning from explicit discourse relations. CoRR, abs/1710.04334. 
[23] Matteo Pagliardini, Prakhar Gupta, and Martin Jaggi. 2018. Unsupervised learning of sentence embeddings using compositional n-gram features. In NAACL-HLT. 
[24] Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. URL https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf. 
[25] Sujith Ravi and Zornitsa Kozareva. 2018. Self-governing neural networks for on-device short text classification. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 804–810. 
[26] Sebastian Ruder and Jeremy Howard. 2018. Universal language model fine-tuning for text classification. In ACL. 
[27] Ruslan Salakhutdinov and Geoffrey Hinton. 2009. Semantic hashing. International Journal of Approximate Reasoning, 50(7):969–978. 
[28] Dinghan Shen, Qinliang Su, Paidamoyo Chapfuwa, Wenlin Wang, Guoyin Wang, Lawrence Carin, and Ricardo Henao. 2018. Nash: Toward end-to-end neural architecture for generative semantic hashing. In ACL. 
[29] Raphael Shu and Hideki Nakayama. 2017. Compressing word embeddings via deep compositional code learning. arXiv preprint arXiv:1711.01068. 
[30] Shuai Tang and Virginia R de Sa. 2018. Improving sentence representations with multi-view frameworks. arXiv preprint arXiv:1810.01064. 
[31] Julien Tissier, Amaury Habrard, and Christophe Gravier. 2019. Near-lossless binarization of word embeddings. AAAI. 
[32] Benjamin Van Durme and Ashwin Lall. 2010. Online generation of locality sensitive hash signatures. In Proceedings of the ACL 2010 Conference Short Papers, pages 231–235. Association for Computational Linguistics. 
[33] Jingdong Wang, Heng Tao Shen, Jingkuan Song, and Jianqiu Ji. 2014. Hashing for similarity search: A survey. arXiv preprint arXiv:1408.2927. 
[34] John Wieting, Mohit Bansal, Kevin Gimpel, and Karen Livescu. 2016. Towards universal paraphrastic sentence embeddings. CoRR, abs/1511.08198. 
[35] John Wieting and Kevin Gimpel. 2018. Paranmt-50m: Pushing the limits of paraphrastic sentence embeddings with millions of machine translations. In ACL. 
[36] John Wieting and Douwe Kiela. 2018. No training required: Exploring random encoders for sentence classification. CoRR, abs/1901.10444. 
[37] Adina Williams, Nikita Nangia, and Samuel R Bowman. 2017. A broad-coverage challenge corpus for sentence understanding through inference. arXiv preprint arXiv:1704.05426. 
[38] Jiaming Xu, Peng Wang, Guanhua Tian, Bo Xu, Jun Zhao, Fangyuan Wang, and Hongwei Hao. 2015. Convolutional neural networks for text hashing. In Twenty-Fourth International Joint Conference on Artificial Intelligence. 
[39] Dell Zhang, Jun Wang, Deng Cai, and Jinsong Lu. 2010. Self-taught hashing for fast similarity search. In Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval, pages 18–25. ACM.

Appendix: Summary of Figures and Tables

Figure 1
Table 1
Table 2
Table 3
Figure 2
Figure 3
