python - Unable to process accented words with the NLTK tokeniser
I am trying to use the following code to count word frequencies in a UTF-8 encoded text file. After successfully tokenising the file contents and then iterating over the words, my program fails to read the accented characters correctly.
import csv
import operator          # needed for operator.itemgetter below
import nltk
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords

# lang, path, file_name and stem() are defined elsewhere in the script (not shown)
print "computing word frequency..."

if lang == "fr":
    stop = stopwords.words("french")
    stop = [word.encode("utf-8") for word in stop]
    stop.append("les")
    stop.append("a")
elif lang == "en":
    stop = stopwords.words("english")

rb = csv.reader(open(path + file_name))
wb = csv.writer(open('results/all_words_' + file_name, 'wb'))

tokenizer = RegexpTokenizer(r'\w+')
word_dict = {}
i = 0
for row in rb:
    i += 1
    if i == 5:
        break
    text = tokenizer.tokenize(row[0].lower())
    text = [j for j in text if j not in stop]
    #print text
    for doc in text:
        try:
            try:
                word_dict[doc] += 1
            except:
                word_dict[doc] = 1
        except:
            print row[0]
    print " ".join(text)

word_dict2 = sorted(word_dict.iteritems(), key=operator.itemgetter(1), reverse=True)
if lang == "English":
    for item in word_dict2:
        wb.writerow([item[0], stem(item[0]), item[1]])
else:
    for item in word_dict2:
        wb.writerow([item[0], item[1]])

print "Finished"
Input text file:
rt annesorose envie crêpes
envoyé jerrylee bonjour monde dimanche crepes dimanche
The output written in a file is destroying certain words.
bonnes crepes tour nouveau vélo
aime crepe soleil ça fera bien recharger batteries vu jours hard annoncent
Resulting output:
crepes,2
dimanche,2
rt,1
nouveau,1
envie,1
v�,1
jerrylee,1
cleantext,1
lo,1
bonnes,1
tour,1
crêpes,1
monde,1
bonjour,1
annesorose,1
envoy�,1
envoy� is actually envoyé in the original file.
How can I fix this problem with the accented characters?
Solution:
If you are using Python 2.x, reset the default encoding to "utf8":
import sys
reload(sys)
sys.setdefaultencoding('utf8')
Alternatively, you can use the ucsv module; see General Unicode/UTF-8 support for csv files in Python 2.6.
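A rough sketch of that approach, assuming the package is installed and acts as a drop-in replacement for the stdlib csv module that yields unicode cells (the exact package and import name are described in the linked answer):

import ucsv as csv   # assumption: drop-in replacement for the stdlib csv module

# path and file_name are the same variables as in the question's script
rb = csv.reader(open(path + file_name, 'rb'))
for row in rb:
    # row[0] is now a unicode string, so accented characters survive
    print row[0]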
Or use io.open():
$echo """rt annesorose envie crêpes
> envoyé jerrylee bonjour monde dimanche crepes dimanche
> The output written in a file is destroying certain words.
> bonnes crepes tour nouveau vélo
> aime crepe soleil ça fera bien recharger batteries vu jours hard annoncent""" > someutf8.txt
$python
>>> import io, csv
>>> text = io.open('someutf8.txt', 'r', encoding='utf8').read().split('\n')
>>> for row in text:
... print row
...
rt annesorose envie crêpes
envoyé jerrylee bonjour monde dimanche crepes dimanche
The output written in a file is destroying certain words.
bonnes crepes tour nouveau vélo
aime crepe soleil ça fera bien recharger batteries vu jours hard annoncent
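The same idea carries over to the original script: decode the file to unicode before tokenising. A minimal sketch, using the sample someutf8.txt created above and the same RegexpTokenizer as in the question (whose default flags include re.UNICODE, so \w+ matches accented letters on unicode input):

import io
from nltk.tokenize import RegexpTokenizer

tokenizer = RegexpTokenizer(r'\w+')
with io.open('someutf8.txt', 'r', encoding='utf8') as f:
    for line in f:
        # the tokens are unicode objects, so accented letters stay intact
        print u' '.join(tokenizer.tokenize(line.lower()))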
Finally, rather than using such a complicated way of reading and counting, simply use FreqDist in NLTK; see section 3.1 of http://www.nltk.org/book/ch01.html
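For example, something along these lines (assuming NLTK 3, where FreqDist behaves like collections.Counter, and that the punkt models used by word_tokenize are installed):

import io
from nltk import word_tokenize, FreqDist

text = io.open('someutf8.txt', 'r', encoding='utf8').read()
fdist = FreqDist(word_tokenize(text))
# most_common() returns (word, count) pairs, highest count first
for word, freq in fdist.most_common():
    print word, freq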
Or, personally, I prefer collections.Counter:
$python
>>> import io
>>> text = io.open('someutf8.txt', 'r', encoding='utf8').read()
>>> from collections import Counter
>>> from nltk import word_tokenize
>>> Counter(word_tokenize(text))
Counter({u'crepes': 2, u'dimanche': 2, u'fera': 1, u'certain': 1, u'is': 1, u'bonnes': 1, u'v\xe9lo': 1, u'batteries': 1, u'envoy\xe9': 1, u'vu': 1, u'file': 1, u'in': 1, u'The': 1, u'rt': 1, u'jerrylee': 1, u'destroying': 1, u'bien': 1, u'jours': 1, u'.': 1, u'written': 1, u'annesorose': 1, u'annoncent': 1, u'nouveau': 1, u'envie': 1, u'hard': 1, u'cr\xeapes': 1, u'\xe7a': 1, u'monde': 1, u'words': 1, u'bonjour': 1, u'a': 1, u'crepe': 1, u'soleil': 1, u'tour': 1, u'aime': 1, u'output': 1, u'recharger': 1})
>>> myFreqDist = Counter(word_tokenize(text))
>>> for word, freq in myFreqDist.items():
... print word, freq
...
fera 1
crepes 2
certain 1
is 1
bonnes 1
vélo 1
batteries 1
envoyé 1
vu 1
file 1
in 1
The 1
rt 1
jerrylee 1
destroying 1
bien 1
jours 1
. 1
written 1
dimanche 2
annesorose 1
annoncent 1
nouveau 1
envie 1
hard 1
crêpes 1
ça 1
monde 1
words 1
bonjour 1
a 1
crepe 1
soleil 1
tour 1
aime 1
output 1
recharger 1
Tags: text-mining, python, nltk    Source: https://codeday.me/bug/20191013/1907704.html