UnicodeDecodeError: ASCII handling for the Snowball stemming algorithm in Python
I'm having some trouble reading regular files into a program I've written. The problem at the moment is that the PDFs are based on some mutated form of UTF-8, including a BOM that throws a wrench into my whole operation. In my application I'm using the Snowball stemming algorithm, which requires ASCII input. There are many topics on resolving UTF-8 errors, but none of them involve feeding the text into the Snowball algorithm, or consider that ASCII is the final result I want. Currently the file I'm using is a Notepad file saved with the standard ANSI encoding. The specific error message I get is this:
File "C:\Users\svictoroff\Desktop\Alleyoop\Python_Scripts\Keywords.py", line 38, in Map_Sentence_To_Keywords
Word = Word.encode('ascii', 'ignore')
UnicodeDecodeError: 'ascii' codec can't decode byte 0x96 in position 0: ordinal not in range(128)
My understanding was that in Python, including the ignore argument would simply pass over any non-ASCII characters encountered, so I would sidestep any BOM or special characters, but apparently that's not the case. The actual code being called is here:
import re
from nltk import stem  # imports implied by the re.sub and stem.SnowballStemmer calls below

def Map_Sentence_To_Keywords(Sentence, Keywords):
    '''Takes in a sentence and a list of Keywords, returns a tuple where the
    first element is the sentence, and the second element is a set of
    all keywords appearing in the sentence. Uses Snowball algorithm'''
    Equivalence = stem.SnowballStemmer('english')
    Found = []
    Sentence = re.sub(r'^(\W*?)(.*)(\n?)$', r'\2', Sentence)
    Words = Sentence.split()
    for Word in Words:
        Word = Word.lower().strip()
        Word = Word.encode('ascii', 'ignore')
        Word = Equivalence.stem(Word)
        Found.append(Word)
    return (Sentence, Found)
By including the non-greedy removal of non-word characters at the front of the string, I had also hoped to strip out the troublesome characters, but that's not the case. I've tried a number of other encodings besides ASCII, and a strict base64 encoding does work, but it's far from ideal for my application. Any ideas on how to fix this in an automated way?
The initial decoding of Element fails, but then it returns a unicode error when actually passed to the encoder.
Curriculum_Tuples = []  # assumed to be initialised before this loop
for Element in Curriculum_Elements:
    try:
        Element = Element.decode('utf-8-sig')
    except:
        print Element
    Curriculum_Tuples.append(Map_Sentence_To_Keywords(Element, Keywords))
def scraping(File):
    '''Takes in txt file of curriculum, removes all newlines and returns that occur \
    after a lowercase character, then splits at all remaining newlines'''
    Curriculum_Elements = []
    Document = open(File, 'rb').read()
    Document = re.sub(r'(?<=[a-zA-Z,])\r?\n', ' ', Document)
    Curriculum_Elements = Document.split('\r\n')
    return Curriculum_Elements
The code shown generates the curriculum elements in question.
for Element in Curriculum_Elements:
    try:
        Element = unicode(Element, 'utf-8-sig', 'ignore')
    except:
        print Element
This type-conversion hack-around does work, but the conversion back to ASCII is a bit shaky. It returns this warning:
Warning (from warnings module):
  File "C:\Python27\lib\encodings\utf_8_sig.py", line 19
    if input[:3] == codecs.BOM_UTF8:
UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
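For what it's worth, this warning is Python 2's generic bytes-versus-unicode comparison complaint, and a minimal reproduction (a sketch, not the code above) looks like this:

import codecs
# Comparing any unicode string against non-ASCII bytes makes Python 2 try
# to ascii-decode the bytes for the comparison, fail, and then treat the
# two values as unequal - emitting exactly this UnicodeWarning.
u'abc'[:3] == codecs.BOM_UTF8  # UnicodeWarning, evaluates to False

Seeing it raised inside utf_8_sig.py suggests the decoder was handed text that was already unicode, i.e. the data is being decoded a second time.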
Solution:
Try decoding the UTF-8 input into a unicode string first, and then encoding that to ASCII (ignoring non-ASCII characters). It really doesn't make sense to encode a string that is already encoded.
input = file.read() # Replace with your file input code...
input = input.decode('utf-8-sig') # '-sig' handles BOM
# Now isinstance(input, unicode) is True
# ...
Sentence = Sentence.encode('ascii', 'ignore')
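As an aside, here is a sketch of why the original call raised a UnicodeDecodeError even though .encode() was the method being called: in Python 2, calling .encode() on a byte string implicitly decodes it with the ascii codec first. (The latin-1 codec below is purely illustrative; for your files, utf-8-sig as above is the right choice.)

b = '\x96word'                     # byte string with a non-ASCII byte (0x96)
# b.encode('ascii', 'ignore')      # implicit b.decode('ascii') runs first and
#                                  # raises UnicodeDecodeError, as in the traceback
u = b.decode('latin-1')            # decode explicitly to unicode...
print u.encode('ascii', 'ignore')  # ...then encoding works: prints 'word'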
After the edit, I see that you had already tried decoding the string before encoding it to ASCII. However, the decoding seems to happen too late, after the file's contents have already been manipulated. This can cause problems, since not every UTF-8 byte is a character (some characters take several bytes to encode). Imagine an encoding that transforms any string into a sequence of a's and b's: you wouldn't want to manipulate it before decoding it, because you would see a's and b's everywhere even if there weren't any in the unencoded string. UTF-8 has the same problem, albeit much more subtly, because most bytes really are characters.
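A concrete sketch of that hazard (Python 2, matching the question's code): a byte-level slice of UTF-8 data can split a multi-byte character in half, leaving bytes that no longer decode at all.

data = u'caf\xe9'.encode('utf-8')  # 'caf\xc3\xa9': 4 characters, 5 bytes
half = data[:4]                    # the slice cuts the 2-byte '\xe9' in half
half.decode('utf-8')               # raises UnicodeDecodeError: unexpected end of data

The same goes for regex substitution and splitting: they all operate on raw bytes unless the data has been decoded first.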
So, decode once, before you do anything else:
def scraping(File):
    '''Takes in txt file of curriculum, removes all newlines and returns that occur \
    after a lowercase character, then splits at all remaining newlines'''
    Curriculum_Elements = []
    Document = open(File, 'rb').read().decode('utf-8-sig')
    Document = re.sub(r'(?<=[a-zA-Z,])\r?\n', ' ', Document)
    Curriculum_Elements = Document.split('\r\n')
    return Curriculum_Elements
# ...
for Element in Curriculum_Elements:
    Curriculum_Tuples.append(Map_Sentence_To_Keywords(Element, Keywords))
Your original Map_Sentence_To_Keywords function should then work without modification, though I'd suggest encoding to ASCII before splitting, for efficiency/readability.
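A sketch of that suggested tweak, assuming the same re and nltk.stem imports as the question's code: encode the whole sentence once, rather than once per word inside the loop.

def Map_Sentence_To_Keywords(Sentence, Keywords):
    '''Same behaviour as the original, but encodes to ASCII once, up front.'''
    Equivalence = stem.SnowballStemmer('english')
    Found = []
    Sentence = re.sub(r'^(\W*?)(.*)(\n?)$', r'\2', Sentence)
    Ascii_Sentence = Sentence.encode('ascii', 'ignore')  # one encode instead of one per word
    for Word in Ascii_Sentence.split():
        Found.append(Equivalence.stem(Word.lower().strip()))
    return (Sentence, Found)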
Tags: python-unicode, byte-order-mark, python, encoding, regex
Source: https://codeday.me/bug/20190902/1788347.html