
Python spelling corrector using NLTK

Author: Internet

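I'm trying to write a spelling corrector in Python for my corpus of tweets (I'm new to both Python and NLTK). The tweets are in XML format and have already been tokenized. I tried using SpellChecker from enchant.checker, but I seem to be hitting a bug: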

>>> text = "this is sme text with a speling mistake."
>>> from enchant.checker import SpellChecker
>>> chkr = SpellChecker("en_US", text)
>>> for err in chkr:
...     err.replace("SPAM")
... 
>>> chkr.get_text()
'this is SPAM text with a SPAMSSPSPAM.SSPSPAM'

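when each misspelled word should have been replaced exactly once. What I ultimately want is "this is some text with a spelling mistake."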

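I have also written a spelling corrector for single words that I'm happy with, but I'm struggling to work out how to run it over the tokenized tweet file: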

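import enchant
from nltk.metrics.distance import edit_distance


# The class line was missing from the post; the name SpellingReplacer is assumed.
class SpellingReplacer:
    def __init__(self, dict_name='en_GB', max_dist=2):
        # Use the dict_name argument instead of a hard-coded dictionary.
        self.spell_dict = enchant.Dict(dict_name)
        self.max_dist = max_dist

    def replace(self, word):
        # Correctly spelled words are returned unchanged.
        if self.spell_dict.check(word):
            return word

        suggestions = self.spell_dict.suggest(word)

        # Accept the top suggestion only if it is close enough to the input.
        if suggestions and edit_distance(word, suggestions[0]) <= self.max_dist:
            return suggestions[0]
        else:
            return word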

Can anyone help me?

Thanks

Solution:

I saw your post and thought I'd play around with it. Here is what I got.

I added a few print statements to see what was going on:

from enchant.checker import SpellChecker

text = "this is sme text with a speling mistake."

chkr = SpellChecker("en_US", text)
for err in chkr:
    print(err.word + " at position " + str(err.wordpos))  #<----
    err.replace("SPAM")

t = chkr.get_text()
print("\n" + t)  #<----

Here is the result of running the code:

sme at position 8
speling at position 25
ing at position 29
ng at position 30
AMMstake at position 32
ake at position 37
ke at position 38
AMM at position 40

this is SPAM text with a SPAMSSPSPAM.SSPSPAM

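As you can see, once a misspelled word is replaced with "SPAM", the spell checker's view of the text appears to shift while it is still iterating: it keeps scanning at offsets that no longer line up with the modified text, so the later entries in the err variable (such as "AMMstake") contain fragments of the inserted "SPAM".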
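
One way to sidestep this kind of shifting-offset problem is to avoid editing the text while iterating over it: first collect the position of every misspelled word, then apply the replacements from right to left so that earlier offsets stay valid. The sketch below is only an illustration of that idea, not pyenchant's own method. `KNOWN_WORDS`, `find_errors`, and `replace_errors` are hypothetical names, and a small word set stands in for a real dictionary check such as `enchant.Dict.check`.

```python
import re

# Assumption: a tiny word set stands in for a real dictionary lookup.
KNOWN_WORDS = {"this", "is", "text", "with", "a", "mistake"}


def find_errors(text, is_known):
    """Return (start, end, word) for every word that fails the is_known check."""
    errors = []
    for match in re.finditer(r"[A-Za-z']+", text):
        word = match.group()
        if not is_known(word.lower()):
            errors.append((match.start(), match.end(), word))
    return errors


def replace_errors(text, errors, replacement_for):
    """Apply replacements right to left so earlier offsets are not shifted."""
    for start, end, word in sorted(errors, reverse=True):
        text = text[:start] + replacement_for(word) + text[end:]
    return text


text = "this is sme text with a speling mistake."
errors = find_errors(text, KNOWN_WORDS.__contains__)
result = replace_errors(text, errors, lambda word: "SPAM")
print(result)  # this is SPAM text with a SPAM mistake.
```

Because every offset is computed against the unmodified text and the edits are applied from the end backwards, a replacement that is longer or shorter than the original word cannot disturb the positions still waiting to be processed.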

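I also tried the original code from http://pythonhosted.org/pyenchant/api/enchant.checker.html, which looks like what you used in your question, but still got some unexpected results.

Note: the only thing I added is the print statement.

Original: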

>>> text = "This is sme text with a fw speling errors in it."
>>> chkr = SpellChecker("en_US",text)
>>> for err in chkr:
...   err.replace("SPAM")
...
>>> chkr.get_text()
'This is SPAM text with a SPAM SPAM errors in it.'

My code:

from enchant.checker import SpellChecker

text = "This is sme text with a fw speling errors in it."

chkr = SpellChecker("en_US", text)
for err in chkr:
    print(err.word + " at position " + str(err.wordpos))
    err.replace("SPAM")

t = chkr.get_text()
print("\n" + t)

The output does not match the website:

sme at position 8
fw at position 25
speling at position 30
ing at position 34
ng at position 35
AMMrors at position 37  #<---- seems to add in parts of "SPAM"

This is SPAM text with a SPAM SPAMSSPSPAM in it.  #<---- my output ???

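Anyway, here is something I came up with that addresses some of the problems. Instead of replacing with "SPAM", I used a version of the single-word replacement code you posted and substituted the actual suggested word. It is important to note that in this example the "suggested" word is wrong 100% of the time. I have run into this problem before: how do you implement spelling correction without user interaction? That question goes well beyond what you asked, but I think you will need some NLP to get accurate results.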

import enchant
from enchant.checker import SpellChecker
from nltk.metrics.distance import edit_distance

class MySpellChecker():

    def __init__(self, dict_name='en_US', max_dist=2):
        self.spell_dict = enchant.Dict(dict_name)
        self.max_dist = max_dist

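    def replace(self, word):
        suggestions = self.spell_dict.suggest(word)

        # Return the first suggestion that is within max_dist edits of the word.
        for suggestion in suggestions:
            if edit_distance(word, suggestion) <= self.max_dist:
                return suggestion

        # No suggestion was close enough, so keep the original word.
        return word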


if __name__ == '__main__':
    text = "this is sme text with a speling mistake."

    my_spell_checker = MySpellChecker(max_dist=1)
    chkr = SpellChecker("en_US", text)
    for err in chkr:
        print(err.word + " at position " + str(err.wordpos))
        err.replace(my_spell_checker.replace(err.word))

    t = chkr.get_text()
    print("\n" + t)
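
To run a word-level corrector over tokenized tweets, it is often enough to map the replace function over each token list and skip tokens that should not be spell-checked (mentions, hashtags, URLs, punctuation). The sketch below is one hedged way to do that, and it picks the candidate with the smallest edit distance instead of the first one in range. `levenshtein`, `closest_suggestion`, `correct_tokens`, and the toy `SUGGESTIONS` table are hypothetical names introduced for illustration. In real use the suggestions would come from `enchant.Dict.suggest` and the distance from `nltk.metrics.distance.edit_distance`. The tweets are assumed to be already parsed into lists of tokens, since the XML layout is not shown in the question.

```python
def levenshtein(a, b):
    """Plain dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def closest_suggestion(word, suggestions, max_dist=2):
    """Return the suggestion nearest to word, or word itself if none is close."""
    if not suggestions:
        return word
    best = min(suggestions, key=lambda s: levenshtein(word, s))
    return best if levenshtein(word, best) <= max_dist else word


def correct_tokens(tokens, suggest, max_dist=2):
    """Correct alphabetic tokens; leave mentions, hashtags, URLs, etc. alone."""
    corrected = []
    for token in tokens:
        if token.isalpha():
            corrected.append(closest_suggestion(token, suggest(token), max_dist))
        else:
            corrected.append(token)
    return corrected


# Assumption: a toy lookup table stands in for a real suggestion source.
SUGGESTIONS = {"sme": ["same", "some", "me"], "speling": ["spelling", "spewing"]}

tweets = [["this", "is", "sme", "text", "@user"],
          ["a", "speling", "mistake", "#oops"]]
for tweet in tweets:
    print(correct_tokens(tweet, lambda w: SUGGESTIONS.get(w, [w])))
```

Ties are broken by list order, so "sme" resolves to whichever of its equally close candidates happens to come first. That is the same limitation noted above: without context, edit distance alone cannot tell which correction was intended.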

Tags: python, twitter, nltk, enchant
Source: https://codeday.me/bug/20190612/1226176.html