
2019 CS224N Assignment 1: Exploring Word Vectors


The most troublesome part of an experiment is often setting up the environment.
               ---- Lu Xun

For the complete notebook solutions, see my GitHub.

Importing packages
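
A minimal sketch of the imports the snippets below rely on (the assignment's actual setup cell may differ slightly):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import TruncatedSVD

import nltk
nltk.download('reuters')            # Part 1 builds counts from the Reuters corpus
from nltk.corpus import reuters

from gensim.models import KeyedVectors  # Part 2 loads the pretrained word2vec vectors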

Part 1: Count-Based Word Vectors

Most word-vector models derive from the following idea:

You shall know a word by the company it keeps. (J. R. Firth, 1957)

Co-Occurrence

Question 1.1: Implement distinct_words

def distinct_words(corpus):
    # Flatten the corpus, deduplicate with a set, and sort for a deterministic ordering.
    corpus_words = sorted(set(y for x in corpus for y in x))
    num_corpus_words = len(corpus_words)
    return corpus_words, num_corpus_words
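
A quick sanity check on a hypothetical toy corpus (START and END are the document markers the assignment wraps around every document):

test_corpus = [["START", "all", "that", "glitters", "is", "not", "gold", "END"],
               ["START", "all's", "well", "that", "ends", "well", "END"]]
words, num_words = distinct_words(test_corpus)
print(num_words, words)  # the count and the alphabetically sorted word list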

Question 1.2: Implement compute_co_occurrence_matrix

def compute_co_occurrence_matrix(corpus, window_size=4):
    """ Compute co-occurrence matrix for the given corpus and window_size (default of 4).
    
        Note: Each word in a document should be at the center of a window. Words near edges will have a smaller
              number of co-occurring words.
              
              For example, if we take the document "START All that glitters is not gold END" with window size of 4,
              "All" will co-occur with "START", "that", "glitters", "is", and "not".
    
        Params:
            corpus (list of list of strings): corpus of documents
            window_size (int): size of context window
        Return:
            M (numpy matrix of shape (number of corpus words, number of corpus words)): 
                Co-occurence matrix of word counts. 
                The ordering of the words in the rows/columns should be the same as the ordering of the words given by the distinct_words function.
            word2Ind (dict): dictionary that maps word to index (i.e. row/column number) for matrix M.
    """
    words, num_words = distinct_words(corpus)
    M = None
    word2Ind = {}
    
    # ------------------
    # Write your implementation here.
    for idx, val in enumerate(words):  # build the word-to-index mapping (a recurring need in NLP)
        word2Ind[val] = idx

    M = np.zeros((num_words, num_words))  # initialize the co-occurrence matrix
    for sen in corpus:  # iterate over every document
        for idx, cen in enumerate(sen):  # treat each word in turn as the window center
            for i in range(-window_size, window_size + 1):  # walk the offsets inside the window
                if i != 0 and 0 <= idx + i < len(sen):  # i != 0 skips the center word itself; the rest is a bounds check
                    M[word2Ind[cen]][word2Ind[sen[idx + i]]] += 1  # increment row `cen`, column `sen[idx+i]`
    # ------------------

    return M, word2Ind
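
Continuing with the toy corpus above, a window of size 1 keeps the matrix small enough to verify by hand:

M_test, word2Ind_test = compute_co_occurrence_matrix(test_corpus, window_size=1)
print(M_test[word2Ind_test["all"], word2Ind_test["START"]])  # 1.0: "all" sits next to START exactly once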

Question 1.3: Implement reduce_to_k_dim

from sklearn.decomposition import TruncatedSVD

svd = TruncatedSVD(n_components=k, n_iter=n_iters)  # k and n_iters come from the enclosing function
M_reduced = svd.fit_transform(M)  # shape: (number of corpus words, k)

Question 1.4: Implement plot_embeddings

for word in words:
    # Look up the word's 2-D coordinates in the reduced matrix.
    x = M_reduced[word2Ind[word]][0]
    y = M_reduced[word2Ind[word]][1]
    plt.scatter(x, y, marker='x', color='red')  # mark the point with a red x
    plt.text(x, y, word)                        # label the point with the word
plt.show()

Question 1.5: Co-Occurrence Plot Analysis

What clusters together in 2-dimensional embedding space?
What doesn’t cluster together that you might think should have?
Note: “bpd” stands for “barrels per day” and is a commonly used abbreviation in crude oil topic articles.
[Figure: 2-D plot of the co-occurrence word embeddings]
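
For reference, the plot above comes from running the Part 1 pipeline on the Reuters "crude" articles. read_corpus is the notebook's provided helper, reduce_to_k_dim and plot_embeddings are the functions from Questions 1.3 and 1.4, and the word list follows the handout:

reuters_corpus = read_corpus()  # notebook helper: Reuters "crude" documents with START/END markers
M_co, word2Ind_co = compute_co_occurrence_matrix(reuters_corpus)
M_reduced_co = reduce_to_k_dim(M_co, k=2)

# Normalize each row to unit length so the plot compares directions rather than magnitudes.
M_lengths = np.linalg.norm(M_reduced_co, axis=1)
M_normalized = M_reduced_co / M_lengths[:, np.newaxis]

words = ['barrels', 'bpd', 'ecuador', 'energy', 'industry', 'kuwait',
         'oil', 'output', 'petroleum', 'venezuela']
plot_embeddings(M_normalized, word2Ind_co, words)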

Part 2: Prediction-Based Word Vectors

The word2vec paper: https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf

Setting up the environment is the most troublesome part of an experiment.
"I never said that." ---- Lu Xun

Loading the word vectors

def load_word2vec():
    """ Load Word2Vec Vectors
        Return:
            wv_from_bin: All 3 million embeddings, each of length 300
    """
#    The two lines below would fetch the model through the gensim downloader;
#    instead we load it directly from the local gensim-data cache.
#    import gensim.downloader as api
#    wv_from_bin = api.load("word2vec-google-news-300")
    import os
    from gensim.models import KeyedVectors
    from gensim.downloader import base_dir
    path = os.path.join(base_dir, 'word2vec-google-news-300', "word2vec-google-news-300.gz")
    wv_from_bin = KeyedVectors.load_word2vec_format(path, binary=True)
    vocab = list(wv_from_bin.vocab.keys())
    print("Loaded vocab size %i" % len(vocab))
    return wv_from_bin
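
Loading is then a single call (the model file is several gigabytes, so the first run takes a while):

wv_from_bin = load_word2vec()
print(wv_from_bin['king'].shape)  # (300,) -- every embedding has length 300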

Question 2.1: Word2Vec Plot Analysis

What clusters together in 2-dimensional embedding space?
What doesn’t cluster together that you might think should have?
How is the plot different from the one generated earlier from the co-occurrence matrix?

Question 2.2: Polysemous Words

Cosine Similarity
That is, the distance between two words is measured by the angle between their word vectors:

$$s = \frac{p \cdot q}{\lVert p \rVert \, \lVert q \rVert}, \qquad s \in [-1, 1]$$
Please state the polysemous word you discover and the multiple meanings that occur in the top 10.
Why do you think many of the polysemous words you tried didn’t work?
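
A minimal probe for this question, assuming the wv_from_bin loaded above; "leaves" is only an illustrative query:

# Top-10 neighbors ranked by cosine similarity; for a genuinely polysemous
# word you hope to see more than one sense among them.
for word, sim in wv_from_bin.most_similar("leaves", topn=10):
    print("%s\t%.3f" % (word, sim))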

Question 2.3: Synonyms & Antonyms
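
The question asks for a triple (w1, w2, w3) where w1 and w2 are synonyms, w1 and w3 are antonyms, yet the cosine distance from w1 to w3 comes out smaller than from w1 to w2. A sketch with illustrative words, not a verified answer:

w1, w2, w3 = "happy", "cheerful", "sad"
print(wv_from_bin.distance(w1, w2))  # cosine distance to the synonym
print(wv_from_bin.distance(w1, w3))  # cosine distance to the antonym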

Question 2.4: Finding Analogies
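
Analogies are solved with vector arithmetic through most_similar; the classic man : king :: woman : ? amounts to computing king - man + woman and reading off the nearest neighbors:

import pprint
pprint.pprint(wv_from_bin.most_similar(positive=["woman", "king"], negative=["man"]))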

Question 2.5: Incorrect Analogy

Question 2.6: Guided Analysis of Bias in Word Vectors

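The notebook's guided probe compares the two directions of the same analogy (sketched here from the handout):

import pprint
# Words most similar to "woman" and "boss" but dissimilar to "man", then the reverse.
pprint.pprint(wv_from_bin.most_similar(positive=["woman", "boss"], negative=["man"]))
pprint.pprint(wv_from_bin.most_similar(positive=["man", "boss"], negative=["woman"]))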

Question 2.7: Independent Analysis of Bias in Word Vectors

Question 2.8: Thinking About Bias

Summary

Source: https://blog.csdn.net/weixin_42017042/article/details/104436689