
python - Can't process accented words using the NLTK tokeniser


I am trying to compute the frequency of words in a UTF-8 encoded text file with the following code. After tokenising the file contents and looping through the words, my program is unable to read the accented characters.

import csv
import operator  # needed for operator.itemgetter() used when sorting word_dict
import nltk
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords

# lang, path, file_name and stem() are defined earlier in the script (not shown)

print "computing word frequency..."
if lang == "fr":
    stop = stopwords.words("french")
    stop = [word.encode("utf-8") for word in stop]
    stop.append("les")
    stop.append("a")
elif lang == "en":
    stop = stopwords.words("english")


rb = csv.reader(open(path+file_name))
wb = csv.writer(open('results/all_words_'+file_name,'wb'))

tokenizer = RegexpTokenizer(r'\w+')

word_dict = {}

i = 0

for row in rb:
    i += 1
    if i == 5:
        break
    text = tokenizer.tokenize(row[0].lower())
    text = [j for j in text if j not in stop]
    #print text
    for doc in text:
        try:

            try:
                word_dict[doc] += 1

            except:

                word_dict[doc] = 1
        except:
            print row[0]
            print " ".join(text)




word_dict2 = sorted(word_dict.iteritems(), key=operator.itemgetter(1), reverse=True)

if lang == "English":
    for item in word_dict2:
        wb.writerow([item[0],stem(item[0]),item[1]])
else:
    for item in word_dict2:
        wb.writerow([item[0],item[1]])

print "Finished"

Input text file:

rt annesorose envie crêpes
envoyé jerrylee bonjour monde dimanche crepes dimanche
The output written in a file is destroying certain words.
bonnes crepes tour nouveau vélo
aime crepe soleil ça fera bien recharger batteries vu jours hard annoncent

Resulting output:

crepes,2
dimanche,2
rt,1
nouveau,1
envie,1
v�,1 
jerrylee,1
cleantext,1
lo,1
bonnes,1
tour,1
crêpes,1
monde,1
bonjour,1
annesorose,1
envoy�,1

Here envoy� is actually envoyé in the original file.

How can I fix this issue with the accented characters?

Solution:

If you are using Python 2.x, reset the default encoding to 'utf8':

import sys
reload(sys)
sys.setdefaultencoding('utf8')

Alternatively, you can use the ucsv module; see General Unicode/UTF-8 support for csv files in Python 2.6.
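
That approach amounts to decoding every CSV cell from UTF-8 bytes to unicode after parsing. A minimal sketch of the idea (assuming Python 2 and a UTF-8 encoded file; unicode_csv_reader is an illustrative helper, not part of any module):

import csv

def unicode_csv_reader(utf8_file, **kwargs):
    # csv in Python 2 parses byte strings; decode each cell afterwards so the
    # tokenizer receives unicode and keeps accented characters in one piece.
    for row in csv.reader(utf8_file, **kwargs):
        yield [cell.decode('utf-8') for cell in row]

rb = unicode_csv_reader(open(path + file_name))  # drop-in replacement for csv.reader(), path/file_name as in the question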

Or use io.open():

$echo """rt annesorose envie crêpes
> envoyé jerrylee bonjour monde dimanche crepes dimanche
> The output written in a file is destroying certain words.
> bonnes crepes tour nouveau vélo
> aime crepe soleil ça fera bien recharger batteries vu jours hard annoncent""" > someutf8.txt
$python
>>> import io, csv
>>> text = io.open('someutf8.txt', 'r', encoding='utf8').read().split('\n')
>>> for row in text:
...     print row
... 
rt annesorose envie crêpes
envoyé jerrylee bonjour monde dimanche crepes dimanche
The output written in a file is destroying certain words.
bonnes crepes tour nouveau vélo
aime crepe soleil ça fera bien recharger batteries vu jours hard annoncent
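
Because io.open() returns unicode, the original tokenising and counting loop can be reused almost unchanged. A minimal sketch under that assumption (not part of the original answer):

import io
from nltk.tokenize import RegexpTokenizer

tokenizer = RegexpTokenizer(r'\w+')
word_dict = {}
with io.open('someutf8.txt', 'r', encoding='utf8') as f:
    for line in f:
        # tokens are unicode, so accented words such as u'v\xe9lo' stay intact
        for token in tokenizer.tokenize(line.lower()):
            word_dict[token] = word_dict.get(token, 0) + 1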

Finally, instead of such a complicated way of reading and counting, simply use FreqDist from NLTK; see section 3.1 of http://www.nltk.org/book/ch01.html
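
A minimal sketch of that route (assuming NLTK 3, where FreqDist behaves like a Counter and provides most_common()):

import io
from nltk import word_tokenize, FreqDist

text = io.open('someutf8.txt', 'r', encoding='utf8').read()
fdist = FreqDist(word_tokenize(text))
for word, freq in fdist.most_common():
    print word, freq   # accented tokens such as u'envoy\xe9' display with their accents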

Or, personally, I prefer collections.Counter:

$python
>>> import io
>>> text = io.open('someutf8.txt', 'r', encoding='utf8').read()
>>> from collections import Counter
>>> from nltk import word_tokenize
>>> Counter(word_tokenize(text))
Counter({u'crepes': 2, u'dimanche': 2, u'fera': 1, u'certain': 1, u'is': 1, u'bonnes': 1, u'v\xe9lo': 1, u'batteries': 1, u'envoy\xe9': 1, u'vu': 1, u'file': 1, u'in': 1, u'The': 1, u'rt': 1, u'jerrylee': 1, u'destroying': 1, u'bien': 1, u'jours': 1, u'.': 1, u'written': 1, u'annesorose': 1, u'annoncent': 1, u'nouveau': 1, u'envie': 1, u'hard': 1, u'cr\xeapes': 1, u'\xe7a': 1, u'monde': 1, u'words': 1, u'bonjour': 1, u'a': 1, u'crepe': 1, u'soleil': 1, u'tour': 1, u'aime': 1, u'output': 1, u'recharger': 1})
>>> myFreqDist = Counter(word_tokenize(text))
>>> for word, freq in myFreqDist.items():
...     print word, freq
... 
fera 1
crepes 2
certain 1
is 1
bonnes 1
vélo 1
batteries 1
envoyé 1
vu 1
file 1
in 1
The 1
rt 1
jerrylee 1
destroying 1
bien 1
jours 1
. 1
written 1
dimanche 2
annesorose 1
annoncent 1
nouveau 1
envie 1
hard 1
crêpes 1
ça 1
monde 1
words 1
bonjour 1
a 1
crepe 1
soleil 1
tour 1
aime 1
output 1
recharger 1

Tags: text-mining, python, nltk
Source: https://codeday.me/bug/20191013/1907704.html