
UnicodeDecodeError: handling the 'ascii' codec with the Snowball stemming algorithm in Python


I'm having some trouble reading an ordinary file into a program I've written. The problem I'm currently running into is that the PDF is based on some mutated form of UTF-8, including a BOM that throws a wrench into my whole operation. In my application I'm using the Snowball stemming algorithm, which requires ASCII input. There are many topics on resolving these errors for UTF-8, but none of them involve passing the result on to the Snowball algorithm, or consider that ASCII is the final result I want. Currently the file I'm using is a Notepad file saved with the standard ANSI encoding. The specific error message I get is this:

File "C:\Users\svictoroff\Desktop\Alleyoop\Python_Scripts\Keywords.py", line 38, in Map_Sentence_To_Keywords
    Word = Word.encode('ascii', 'ignore')
UnicodeDecodeError: 'ascii' codec can't decode byte 0x96 in position 0: ordinal not in range(128)

My understanding was that in Python, including the ignore argument would simply pass over any non-ASCII characters encountered, and that way I would bypass any BOM or special characters, but apparently that isn't the case. The actual code being called is here:

def Map_Sentence_To_Keywords(Sentence, Keywords):
    '''Takes in a sentence and a list of Keywords, returns a tuple where the
    first element is the sentence, and the second element is a set of
    all keywords appearing in the sentence. Uses Snowball algorithm'''
    Equivalence = stem.SnowballStemmer('english')
    Found = []
    Sentence = re.sub(r'^(\W*?)(.*)(\n?)$', r'\2', Sentence)
    Words = Sentence.split()
    for Word in Words:
        Word = Word.lower().strip()
        Word = Word.encode('ascii', 'ignore')
        Word = Equivalence.stem(Word)
        Found.append(Word)
    return (Sentence, Found)
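To see why the ignore argument doesn't help here, the following sketch (runnable under Python 3, where text and bytes are distinct types; the question's code is Python 2) shows that errors='ignore' only applies once you already have decoded text, whereas calling .encode on a raw byte string in Python 2 first performs an implicit ASCII decode — which is exactly the step that raises the UnicodeDecodeError in the traceback above:

```python
# errors='ignore' drops unencodable characters, but only when applied to
# *decoded* text (unicode in Python 2, str in Python 3).
text = u"r\u00e9sum\u00e9"
ascii_bytes = text.encode("ascii", "ignore")   # accented characters dropped

# The Python 2 trap: calling .encode on a *byte* string first triggers an
# implicit ascii decode of those bytes, and byte 0x96 is not valid ASCII.
raw = b"\x96 dash"                  # 0x96 is a Windows-1252 en dash
try:
    raw.decode("ascii")             # the implicit step Python 2 inserts
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True

print(ascii_bytes, decode_failed)
```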

By including the general non-greedy removal of non-word characters at the front of the string, I also expected the troublesome characters to be removed, but that isn't the case. Besides ASCII I've tried a number of other encodings, and a strict base64 encoding does work, but it's very far from ideal for my application. Any ideas on how to fix this in an automated way?

The initial decode of Element fails, but it then returns a unicode error when actually passed to the encoder.

for Element in Curriculum_Elements:
    try:
        Element = Element.decode('utf-8-sig')
    except:
        print Element
    Curriculum_Tuples.append(Map_Sentence_To_Keywords(Element, Keywords))

def scraping(File):
    '''Takes in a txt file of curriculum, removes all newline/carriage-return
    pairs that occur after a letter or comma, then splits at all remaining
    line breaks'''
    Curriculum_Elements = []
    Document = open(File, 'rb').read()
    Document = re.sub(r'(?<=[a-zA-Z,])\r?\n', ' ', Document)
    Curriculum_Elements = Document.split('\r\n')
    return Curriculum_Elements
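A quick check of what that lookbehind regex does (shown here on a text string in Python 3 syntax; in the question's Python 2 code it operates on the raw file bytes): it joins a line break onto the previous line only when that line ends in a letter or comma, i.e. when the break fell mid-sentence, and leaves the remaining breaks as element separators.

```python
import re

# Line breaks after a letter or comma are mid-sentence wraps and get joined;
# the break after the period survives and still separates elements.
doc = "first line,\r\nwrapped continuation.\r\nSECOND ELEMENT\r\nthird"
joined = re.sub(r'(?<=[a-zA-Z,])\r?\n', ' ', doc)
elements = joined.split('\r\n')
print(elements)
```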

The code shown above generates the curriculum elements in question.

for Element in Curriculum_Elements:
    try:
        Element = unicode(Element, 'utf-8-sig', 'ignore')
    except:
        print Element

This type-conversion hack-around does actually work, but the conversion back to ASCII is a bit shaky. It returns this warning:

Warning (from warnings module):
  File "C:\Python27\lib\encodings\utf_8_sig.py", line 19
    if input[:3] == codecs.BOM_UTF8:
UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal

Solution:

Try decoding the UTF-8 input into a unicode string first, and then encoding that into ASCII (ignoring non-ASCII characters). It really doesn't make sense to encode a string that is already encoded.

input = file.read()   # Replace with your file input code...
input = input.decode('utf-8-sig')   # '-sig' handles BOM

# Now isinstance(input, unicode) is True

# ...
Sentence = Sentence.encode('ascii', 'ignore')
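The '-sig' variant mentioned in the comment consumes a leading BOM if one is present, and is harmless when there isn't one — which is why decoding with it up front avoids both the stray BOM bytes and the UnicodeWarning seen earlier (that warning comes from the utf_8_sig codec comparing already-decoded text against the BOM byte string). A small Python 3 demonstration:

```python
import codecs

# 'utf-8-sig' strips a leading BOM if present, and is a no-op if absent.
with_bom = codecs.BOM_UTF8 + "hello".encode("utf-8")
stripped = with_bom.decode("utf-8-sig")      # BOM removed
no_bom = b"hello".decode("utf-8-sig")        # still fine without a BOM

# Plain 'utf-8' keeps the BOM as U+FEFF, which then pollutes later processing.
kept = with_bom.decode("utf-8")
print(stripped, no_bom, repr(kept))
```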

After the edit, I see that you have already tried decoding the strings before encoding to ASCII. However, it seems the decoding is happening too late, after the contents of the file have already been manipulated. This can cause problems, since not every UTF-8 byte is a character (some characters take several bytes to encode). Imagine an encoding that transforms any string into a sequence of as and bs: you wouldn't want to manipulate it before decoding it, because you'd see as and bs everywhere even if there weren't any in the unencoded string. The same problem arises with UTF-8, although much more subtly, because most bytes really are characters.
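A concrete version of that point, in Python 3 syntax: slicing UTF-8 bytes can cut a multi-byte character in half and make the later decode fail, while the same slice on already-decoded text always operates on whole characters.

```python
# 'é' is one character but two bytes in UTF-8.
raw = "caf\u00e9".encode("utf-8")    # b'caf\xc3\xa9' -- 5 bytes, 4 characters
try:
    raw[:4].decode("utf-8")          # the slice cuts \xc3\xa9 in half
    broke_mid_character = False
except UnicodeDecodeError:
    broke_mid_character = True

# The same slice on decoded text is safe.
safe = "caf\u00e9"[:4]
print(len(raw), broke_mid_character, safe)
```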

So, decode once, before you do anything else:

def scraping(File):
    '''Takes in a txt file of curriculum, removes all newline/carriage-return
    pairs that occur after a letter or comma, then splits at all remaining
    line breaks'''
    Curriculum_Elements = []
    Document = open(File, 'rb').read().decode('utf-8-sig')
    Document = re.sub(r'(?<=[a-zA-Z,])\r?\n', ' ', Document)
    Curriculum_Elements = Document.split('\r\n')
    return Curriculum_Elements

# ...

for Element in Curriculum_Elements:
    Curriculum_Tuples.append(Map_Sentence_To_Keywords(Element, Keywords))

Your original Map_Sentence_To_Keywords function should work without modification, although I would recommend encoding to ASCII before the split, just for efficiency/readability.
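That reordering might look like the following sketch (Python 3 syntax, with the function name lowercased and the Snowball stemming step omitted, since the stemmer itself is unchanged; in Python 2 the encode alone would suffice, without the decode back to text):

```python
def map_sentence_to_keywords(sentence):
    # Encode once, up front, so non-ASCII characters are dropped before the
    # split instead of once per word; decode back so we keep working with text.
    sentence = sentence.encode("ascii", "ignore").decode("ascii")
    words = [w.lower().strip() for w in sentence.split()]
    # Each word would be passed through the Snowball stemmer at this point.
    return (sentence, words)

print(map_sentence_to_keywords("Caf\u00e9 culture \u2013 a history"))
```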

Tags: python-unicode, byte-order-mark, python, encoding, regex
Source: https://codeday.me/bug/20190902/1788347.html