
Accessing Text Corpora and Lexical Resources

2021-02-11 09:29:22 Author: Internet

Accessing Text Corpora

Gutenberg Corpus

Method 1 (verbose)

>>> import nltk
>>> nltk.corpus.gutenberg.fileids()
['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt',
'blake-poems.txt', 'bryant-stories.txt', 'burgess-busterbrown.txt',
'carroll-alice.txt', 'chesterton-ball.txt', 'chesterton-brown.txt',
'chesterton-thursday.txt', 'edgeworth-parents.txt', 'melville-moby_dick.txt',
'milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt',
'shakespeare-macbeth.txt', 'whitman-leaves.txt']

>>> emma = nltk.corpus.gutenberg.words('austen-emma.txt')
>>> len(emma)
192427

Method 2:

>>> from nltk.corpus import gutenberg
>>> gutenberg.fileids()
['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', ...]
>>> emma = gutenberg.words('austen-emma.txt')

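The raw() function gives us the contents of a file without any linguistic processing. So, for example, len(gutenberg.raw('blake-poems.txt')) tells us how many characters occur in the text, including the spaces between words. The sents() function divides the text into sentences, where each sentence is a list of words.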
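The difference between the raw, word, and sentence views can be sketched without NLTK. The snippet below is only a rough approximation on a made-up string: the regular expressions are illustrative stand-ins for NLTK's tokenizers, so counts on real corpus files will differ.

```python
import re

# Hypothetical sample text standing in for gutenberg.raw(fileid).
raw = "Hello world. How are you?"

# raw view: every character counts, including spaces and punctuation.
num_chars = len(raw)

# words view (approximation): runs of word characters, or single punctuation marks.
words = re.findall(r"\w+|[^\w\s]", raw)

# sents view (approximation): split after '.', '!' or '?', then tokenize each piece.
sents = [re.findall(r"\w+|[^\w\s]", s)
         for s in re.split(r"(?<=[.!?])\s+", raw) if s]

print(num_chars, len(words), len(sents))
```

The point is that len() of the raw string measures characters, while len() of the word or sentence list measures tokens or sentences.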

Web and Chat Text

Informal language

>>> from nltk.corpus import webtext

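Brown Corpus

The Brown Corpus is a convenient resource for studying systematic differences between genres, a kind of linguistic inquiry known as stylistics.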

>>> from nltk.corpus import brown

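Reuters Corpus

The documents are classified into 90 topics and split into two sets, "training" and "test". Thus, the document with fileid 'test/14826' belongs to the test set.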

>>> from nltk.corpus import reuters

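Corpus methods accept either a single fileid or a list of fileids as an argument. Categories can overlap, because one document may cover several topics.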

>>> reuters.categories('training/9865')
['barley', 'corn', 'grain', 'wheat']

>>> reuters.categories(['training/9865', 'training/9880'])
['barley', 'corn', 'grain', 'money-fx', 'wheat']

>>> reuters.fileids('barley')
['test/15618', 'test/15649', 'test/15676', 'test/15728', 'test/15871', ...]

>>> reuters.fileids(['barley', 'corn'])
['test/14832', 'test/14858', 'test/15033', 'test/15043', 'test/15106',
'test/15287', 'test/15341', 'test/15618', 'test/15618', 'test/15648', ...]
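The two-way lookup between files and categories shown above can be sketched with plain dictionaries. This is not how NLTK's corpus reader is implemented; it is a minimal illustration. The categories for 'training/9865' are taken from the output above, while the single category given to 'training/9880' is an assumption made for the example.

```python
from collections import defaultdict

# Toy file -> categories table. The 'training/9880' entry is an assumption.
file_categories = {
    "training/9865": ["barley", "corn", "grain", "wheat"],
    "training/9880": ["money-fx"],
}

# Invert the table so we can also go from a category to its files.
category_files = defaultdict(list)
for fileid, cats in file_categories.items():
    for cat in cats:
        category_files[cat].append(fileid)

def categories(fileids):
    """Sorted union of the categories of one fileid or a list of fileids."""
    if isinstance(fileids, str):
        fileids = [fileids]
    return sorted({cat for f in fileids for cat in file_categories[f]})

def fileids(cats):
    """Sorted files belonging to one category or to any of a list of categories."""
    if isinstance(cats, str):
        cats = [cats]
    return sorted({f for c in cats for f in category_files.get(c, [])})

print(categories(["training/9865", "training/9880"]))
print(fileids("barley"))
```

Both helpers normalize a single string to a one-element list, which mirrors the "single item or list" convention described above.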

Inaugural Address Corpus

Each text is one president's address.

>>> from nltk.corpus import inaugural
>>> inaugural.fileids()
['1789-Washington.txt', '1793-Washington.txt', '1797-Adams.txt', ...]
>>> [fileid[:4] for fileid in inaugural.fileids()]
['1789', '1793', '1797', '1801', '1805', '1809', '1813', '1817', '1821', ...]

The year of each text appears in its filename; fileid[:4] extracts the first four characters to obtain it.
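As a small extension of the slicing trick, the sketch below splits each fileid into a year and a name. It assumes every fileid follows the 'YYYY-Name.txt' pattern seen in the three IDs listed above; parse_fileid is a hypothetical helper, not part of NLTK.

```python
from collections import Counter

# The three fileids shown in the output above.
fileids = ['1789-Washington.txt', '1793-Washington.txt', '1797-Adams.txt']

def parse_fileid(fileid):
    """Split 'YYYY-Name.txt' into (year, name). Assumes that exact pattern."""
    year = int(fileid[:4])      # first four characters are the year
    president = fileid[5:-4]    # skip the hyphen, drop the '.txt' suffix
    return year, president

parsed = [parse_fileid(f) for f in fileids]
addresses_per_president = Counter(name for _, name in parsed)

print(parsed)
print(addresses_per_president)
```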

Basic Corpus Functions Defined in NLTK

(More documentation can be found with help(nltk.corpus.reader) and in the online Corpus HOWTO at http://www.nltk.org/howto.)

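Example	Description
fileids()	The files of the corpus
fileids([categories])	The files of the corpus corresponding to these categories
categories()	The categories of the corpus
categories([fileids])	The categories of the corpus corresponding to these files
raw()	The raw content of the corpus
raw(fileids=[f1,f2,f3])	The raw content of the specified files
raw(categories=[c1,c2])	The raw content of the specified categories
words()	The words of the whole corpus
words(fileids=[f1,f2,f3])	The words of the specified files
words(categories=[c1,c2])	The words of the specified categories
sents()	The sentences of the whole corpus
sents(fileids=[f1,f2,f3])	The sentences of the specified files
sents(categories=[c1,c2])	The sentences of the specified categories
abspath(fileid)	The location of the given file on disk
encoding(fileid)	The encoding of the file (if known)
open(fileid)	Open a stream for reading the given corpus file
root()	The path to the root of the locally installed corpus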

Conditional Frequency Distributions

Counting Words by Genre

FreqDist() takes a simple list as input, whereas ConditionalFreqDist() takes a list of pairs.

>>> from nltk.corpus import brown
>>> cfd = nltk.ConditionalFreqDist(
... (genre, word)
... for genre in brown.categories()
... for word in brown.words(categories=genre))

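Let's look at just two genres, news and romance. For each genre ②, we loop over every word in the genre ③, producing pairs consisting of the genre and the word ①.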

>>> genre_word = [(genre, word) # ①
... for genre in ['news', 'romance'] # ②
... for word in brown.words(categories=genre)] # ③
>>> len(genre_word)
170576

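We use this list of pairs to create a ConditionalFreqDist and save it in a variable cfd. As usual, we can type the name of the variable to inspect it ① and verify that it has two conditions ②.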

>>> cfd = nltk.ConditionalFreqDist(genre_word)
>>> cfd # ① 
<ConditionalFreqDist with 2 conditions>
>>> cfd.conditions()
['news', 'romance'] # ②

We can access the two conditions; each one is just a frequency distribution.

>>> cfd['news']
<FreqDist with 100554 outcomes>
>>> cfd['romance']
<FreqDist with 70022 outcomes>
>>> list(cfd['romance'])
[',', '.', 'the', 'and', 'to', 'a', 'of', '``', "''", 'was', 'I', 'in', 'he', 'had',
'?', 'her', 'that', 'it', 'his', 'she', 'with', 'you', 'for', 'at', 'He', 'on', 'him',
'said', '!', '--', 'be', 'as', ';', 'have', 'but', 'not', 'would', 'She', 'The', ...]
>>> cfd['romance']['could']
193
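The idea behind a conditional frequency distribution can be sketched with the standard library: one Counter per condition. This is a simplified stand-in, not NLTK's ConditionalFreqDist class, and the pairs below are made up rather than drawn from the Brown Corpus.

```python
from collections import Counter, defaultdict

# Made-up (condition, sample) pairs; not Brown Corpus data.
genre_word = [
    ("news", "the"), ("news", "could"), ("news", "the"),
    ("romance", "could"), ("romance", "could"), ("romance", "she"),
]

# One frequency distribution (Counter) per condition (genre).
cfd = defaultdict(Counter)
for genre, word in genre_word:
    cfd[genre][word] += 1

print(sorted(cfd))                   # the conditions
print(cfd["romance"]["could"])       # count of one sample under one condition
print(sum(cfd["news"].values()))     # total outcomes under one condition
```

Indexing works in two steps, just as with cfd['romance']['could'] above: the first key selects the condition, the second selects the sample.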

Generating Random Text with Bigrams

The bigrams() function takes a list of words and builds a list of consecutive word pairs.

>>> sent = ['In', 'the', 'beginning', 'God', 'created', 'the', 'heaven',
... 'and', 'the', 'earth', '.']
>>> list(nltk.bigrams(sent))
[('In', 'the'), ('the', 'beginning'), ('beginning', 'God'), ('God', 'created'),
('created', 'the'), ('the', 'heaven'), ('heaven', 'and'), ('and', 'the'),
('the', 'earth'), ('earth', '.')]
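The notes name text generation in the heading but stop at the bigram list, so here is one possible sketch of the remaining step, written in plain Python. It treats each word as a condition and counts the words that follow it, then repeatedly emits the most frequent successor. The helper names bigrams, successors, and generate are hypothetical, and picking the first of several equally frequent successors is an arbitrary choice made for this example.

```python
from collections import Counter, defaultdict

sent = ['In', 'the', 'beginning', 'God', 'created', 'the', 'heaven',
        'and', 'the', 'earth', '.']

def bigrams(words):
    """Consecutive word pairs, built by zipping the list with itself shifted by one."""
    return list(zip(words, words[1:]))

# Condition = first word of the pair, sample = the word that follows it.
successors = defaultdict(Counter)
for first, second in bigrams(sent):
    successors[first][second] += 1

def generate(start, num=3):
    """Emit up to `num` words, each time moving to the most frequent successor."""
    out, word = [], start
    for _ in range(num):
        out.append(word)
        if word not in successors:   # no known continuation, stop early
            break
        word = max(successors[word], key=successors[word].get)
    return out

print(bigrams(sent)[:3])
print(generate('God', 3))
```

Always taking the most likely successor tends to get stuck in loops on larger texts; sampling among successors is a common alternative.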

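Conditional frequency distributions in NLTK: commonly used methods and idioms for defining, accessing, and visualizing a conditional frequency distribution of counts

Example	Description
cfdist = ConditionalFreqDist(pairs)	Create a conditional frequency distribution from a list of pairs
cfdist.conditions()	The conditions, sorted alphabetically
cfdist[condition]	The frequency distribution for this condition
cfdist[condition][sample]	The frequency of the given sample for this condition
cfdist.tabulate()	Tabulate the conditional frequency distribution
cfdist.tabulate(samples, conditions)	Tabulation limited to the specified samples and conditions
cfdist.plot()	Plot the conditional frequency distribution
cfdist.plot(samples, conditions)	Plot limited to the specified samples and conditions
cfdist1 < cfdist2	Test whether samples in cfdist1 occur less frequently than in cfdist2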

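Lexical Resources

Wordlist Corpora

Filtering a text: this program computes the vocabulary of a text, then removes every item that occurs in an existing wordlist, leaving just the uncommon or misspelled words.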

def unusual_words(text):
    text_vocab = set(w.lower() for w in text if w.isalpha())
    english_vocab = set(w.lower() for w in nltk.corpus.words.words())
    unusual = text_vocab.difference(english_vocab)
    return sorted(unusual)
>>> unusual_words(nltk.corpus.gutenberg.words('austen-sense.txt'))
['abbeyland', 'abhorrence', 'abominably', 'abridgement', 'accordant', 'accustomary',
'adieus', 'affability', 'affectedly', 'aggrandizement', 'alighted', 'allenham',
'amiably', 'annamaria', 'annuities', 'apologising', 'arbour', 'archness', ...]
>>> unusual_words(nltk.corpus.nps_chat.words())
['aaaaaaaaaaaaaaaaa', 'aaahhhh', 'abou', 'abourted', 'abs', 'ack', 'acros',
'actualy', 'adduser', 'addy', 'adoted', 'adreniline', 'ae', 'afe', 'affari', 'afk',
'agaibn', 'agurlwithbigguns', 'ahah', 'ahahah', 'ahahh', 'ahahha', 'ahem', 'ahh', ...]
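The filtering step is just a set difference, which can be checked on a toy example that needs no corpus download. The variant below takes the wordlist as a second parameter; that parameter, the tiny wordlist, and the token list are all assumptions made for illustration.

```python
def unusual_words(text, known_vocab):
    """Lowercased alphabetic tokens of `text` that are missing from `known_vocab`."""
    text_vocab = {w.lower() for w in text if w.isalpha()}
    english_vocab = {w.lower() for w in known_vocab}
    return sorted(text_vocab - english_vocab)

# Hypothetical tiny wordlist and token list.
known = ["the", "cat", "sat", "on", "mat"]
tokens = ["The", "cat", "sat", "on", "teh", "mat", ",", "lol", "42"]

print(unusual_words(tokens, known))
```

Punctuation and numbers are dropped by isalpha() before the comparison, so only the misspelling and the slang token remain.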

Stopwords corpus: high-frequency words such as the and to

>>> from nltk.corpus import stopwords
>>> stopwords.words('english')
['a', "a's", 'able', 'about', 'above', 'according', 'accordingly', 'across',
'actually', 'after', 'afterwards', 'again', 'against', "ain't", 'all', 'allow',
'allows', 'almost', 'alone', 'along', 'already', 'also', 'although', 'always', ...]
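A typical use of a stopword list is to measure how much of a text consists of content words. The sketch below assumes a tiny hand-written stopword list and token list instead of stopwords.words('english'); content_fraction is an illustrative helper name.

```python
def content_fraction(text, stopword_list):
    """Fraction of tokens in `text` that are not in `stopword_list` (case-insensitive)."""
    stop = set(w.lower() for w in stopword_list)
    content = [w for w in text if w.lower() not in stop]
    return len(content) / len(text)

# Hypothetical stopword sample and token list.
stop_sample = ["the", "to", "a", "of"]
tokens = ["The", "cat", "went", "to", "the", "edge", "of", "a", "pond", "quickly"]

print(content_fraction(tokens, stop_sample))
```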

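More Python: Reusing Code

Functions

A function is defined with the keyword def, followed by the function name and any input parameters, and then the body of the function.

A Python function: this function tries to produce the plural form of any English noun.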

def plural(word):
    if word.endswith('y'):
        return word[:-1] + 'ies'
    elif word[-1] in 'sx' or word[-2:] in ['sh', 'ch']:
        return word + 'es'
    elif word.endswith('an'):
        return word[:-2] + 'en'
    else:
        return word + 's'
>>> plural('fairy')
'fairies'
>>> plural('woman')
'women'
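The function only tries to form plurals, and a few extra calls show where its rules break down. The block repeats the definition so that it runs on its own; the test words are chosen for illustration.

```python
def plural(word):
    """Heuristic English pluralizer, as defined above."""
    if word.endswith('y'):
        return word[:-1] + 'ies'
    elif word[-1] in 'sx' or word[-2:] in ['sh', 'ch']:
        return word + 'es'
    elif word.endswith('an'):
        return word[:-2] + 'en'
    else:
        return word + 's'

# Cases the rules handle well.
print(plural('box'), plural('dish'))

# Cases where the rules over-apply: 'fan' matches the '-an' rule,
# and 'boy' matches the '-y' rule even though a vowel precedes the 'y'.
print(plural('fan'), plural('boy'))
```

The order of the branches matters: the first matching rule wins, so more specific rules have to come before more general ones.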

Modules

How to import a name from another module:

from module_name import method_name
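For comparison, here are the two common import forms side by side, using the standard-library collections module as a stand-in for module_name. The choice of module and the variable names are only for illustration.

```python
# Form 1: import the module, then qualify each name with the module name.
import collections

# Form 2: import one name directly into the current namespace.
from collections import Counter

a = collections.Counter("banana")   # qualified access
b = Counter("banana")               # direct access to the same class

print(a == b, b["a"])
```

Both names refer to the same class; the second form saves typing, while the first makes it clearer where a name comes from.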

WordNet

Senses and Synonyms

>>> from nltk.corpus import wordnet as wn
>>> wn.synsets('motorcar')
[Synset('car.n.01')]

So motorcar has just one possible meaning, identified as car.n.01, the first noun sense of car.

car.n.01 is called a synset, or "synonym set": a collection of synonymous words (or "lemmas"):

>>> wn.synset('car.n.01').lemma_names()
['car', 'auto', 'automobile', 'machine', 'motorcar']

Synsets also come with a general definition and some example sentences:

>>> wn.synset('car.n.01').definition()
'a motor vehicle with four wheels; usually propelled by an internal combustion engine'
>>> wn.synset('car.n.01').examples()
['he needs a car to get to work']

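Lemma: the pairing of a synset with a word (for example, car.n.01.automobile or car.n.01.motorcar).

>>> wn.synset('car.n.01').lemmas() # ① get all the lemmas of a given synset
[Lemma('car.n.01.car'), Lemma('car.n.01.auto'), Lemma('car.n.01.automobile'),
Lemma('car.n.01.machine'), Lemma('car.n.01.motorcar')]
>>> wn.lemma('car.n.01.automobile') # ② look up a particular lemma
Lemma('car.n.01.automobile')
>>> wn.lemma('car.n.01.automobile').synset() # ③ get the synset corresponding to a lemma
Synset('car.n.01')
>>> wn.lemma('car.n.01.automobile').name() # ④ get the "name" of a lemma
'automobile'

Note: in NLTK 3, accessors such as lemma_names, definition, examples, lemmas, synset, and name are methods. If the interpreter prints "bound method ..." instead of a value, add parentheses after the name.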
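The relationship between words, synsets, and lemmas can be pictured as a small data structure. The sketch below is a toy model, not WordNet's storage format: it holds only the single synset shown above, and the names synsets, lemma_index, and lemma_ids are hypothetical.

```python
from collections import defaultdict

# Toy model: synset name -> its lemma names (data from the output above).
synsets = {
    "car.n.01": ["car", "auto", "automobile", "machine", "motorcar"],
}

# Inverted index: word -> names of the synsets that contain it.
lemma_index = defaultdict(list)
for synset_name, lemma_names in synsets.items():
    for lemma_name in lemma_names:
        lemma_index[lemma_name].append(synset_name)

def lemma_ids(synset_name):
    """A lemma pairs a synset with one of its words: 'car.n.01.automobile'."""
    return [f"{synset_name}.{lemma}" for lemma in synsets[synset_name]]

print(lemma_index["motorcar"])
print(lemma_ids("car.n.01"))
```

In the real WordNet an ambiguous word maps to several synsets, which is why a lookup by word returns a list.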

The WordNet Hierarchy

Hyponyms: the concepts immediately below a synset, which are more specific.

>>> motorcar = wn.synset('car.n.01')
>>> types_of_motorcar = motorcar.hyponyms()
>>> types_of_motorcar[26]
Synset('ambulance.n.01')
>>> sorted([lemma.name() for synset in types_of_motorcar for lemma in synset.lemmas()])
['Model_T', 'S.U.V.', 'SUV', 'Stanley_Steamer', 'ambulance', 'beach_waggon',
'beach_wagon', 'bus', 'cab', 'compact', 'compact_car', 'convertible',
'coupe', 'cruiser', 'electric', 'electric_automobile', 'electric_car',
'estate_car', 'gas_guzzler', 'hack', 'hardtop', 'hatchback', 'heap',
'horseless_carriage', 'hot-rod', 'hot_rod', 'jalopy', 'jeep', 'landrover',
'limo', 'limousine', 'loaner', 'minicar', 'minivan', 'pace_car', 'patrol_car',
'phaeton', 'police_car', 'police_cruiser', 'prowl_car', 'race_car', 'racer',
'racing_car', 'roadster', 'runabout', 'saloon', 'secondhand_car', 'sedan',
'sport_car', 'sport_utility', 'sport_utility_vehicle', 'sports_car', 'squad_car',
'station_waggon', 'station_wagon', 'stock_car', 'subcompact', 'subcompact_car',
'taxi', 'taxicab', 'tourer', 'touring_car', 'two-seater', 'used-car', 'waggon',
'wagon']

We can also navigate up the hierarchy by visiting hypernyms.

>>> motorcar.hypernyms()
[Synset('motor_vehicle.n.01')]
>>> paths = motorcar.hypernym_paths()
>>> len(paths)
2
>>> [synset.name() for synset in paths[0]]
['entity.n.01', 'physical_entity.n.01', 'object.n.01', 'whole.n.02', 'artifact.n.01',
'instrumentality.n.03', 'container.n.01', 'wheeled_vehicle.n.01',
'self-propelled_vehicle.n.01', 'motor_vehicle.n.01', 'car.n.01']
>>> [synset.name() for synset in paths[1]]
['entity.n.01', 'physical_entity.n.01', 'object.n.01', 'whole.n.02', 'artifact.n.01',
'instrumentality.n.03', 'conveyance.n.03', 'vehicle.n.01', 'wheeled_vehicle.n.01',
'self-propelled_vehicle.n.01', 'motor_vehicle.n.01', 'car.n.01']

We can get the most general hypernym (the root hypernym) of a synset as follows:

>>> motorcar.root_hypernyms()
[Synset('entity.n.01')]
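Why there are two paths becomes clear once the hypernym links are written out as a graph. The sketch below transcribes the links from the two paths printed above into a plain dictionary and enumerates the paths recursively. It is a toy reimplementation for illustration; the function only borrows the name hypernym_paths and is not NLTK's code.

```python
# Child -> list of direct hypernyms, transcribed from the two paths above.
# 'wheeled_vehicle.n.01' has two parents, which is where the paths diverge.
hypernyms = {
    "car.n.01": ["motor_vehicle.n.01"],
    "motor_vehicle.n.01": ["self-propelled_vehicle.n.01"],
    "self-propelled_vehicle.n.01": ["wheeled_vehicle.n.01"],
    "wheeled_vehicle.n.01": ["container.n.01", "vehicle.n.01"],
    "container.n.01": ["instrumentality.n.03"],
    "vehicle.n.01": ["conveyance.n.03"],
    "conveyance.n.03": ["instrumentality.n.03"],
    "instrumentality.n.03": ["artifact.n.01"],
    "artifact.n.01": ["whole.n.02"],
    "whole.n.02": ["object.n.01"],
    "object.n.01": ["physical_entity.n.01"],
    "physical_entity.n.01": ["entity.n.01"],
    "entity.n.01": [],
}

def hypernym_paths(name):
    """All root-to-`name` paths, each listed from the root down."""
    parents = hypernyms[name]
    if not parents:                  # a root has no hypernyms
        return [[name]]
    return [path + [name] for p in parents for path in hypernym_paths(p)]

paths = hypernym_paths("car.n.01")
roots = sorted({path[0] for path in paths})

print(len(paths), [len(p) for p in paths])
print(roots)
```

Every path ends at the same root, which matches the single result returned by root_hypernyms() above.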

Use dir() to see the lexical relations and the other methods defined on a synset. For example, try dir(wn.synset('harmony.n.02')).

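Summary

A text corpus is a large, structured collection of texts. NLTK comes with many corpora, for example the Brown Corpus, nltk.corpus.brown.
Some text corpora are categorized, for example by genre or topic; sometimes the categories of a corpus overlap each other.
A conditional frequency distribution is a collection of frequency distributions, each one for a different condition. They can be used for counting word frequencies given a context or a genre.
Python programs more than a few lines long should be entered using a text editor, saved to a file with a .py extension, and accessed using an import statement.
Python functions let you associate a name with a particular block of code and reuse that code as often as necessary.
Some functions, known as "methods", are associated with an object; we call them by giving the object name, followed by a period, followed by the method name, as in x.funct(y) or word.isalpha().
To find out about a variable v, type help(v) in the Python interactive interpreter to read the help entry for this kind of object.
WordNet is a semantically oriented dictionary of English, consisting of synonym sets (synsets) and organized into a network.
Some functions are not available by default and must be accessed using Python's import statement.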

Source: https://blog.csdn.net/weixin_50397333/article/details/113736548