Python tokenization

2019-10-13 05:55:18 Author: Internet

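I'm new to Python and have a tokenization task.
The input is a .txt file containing sentences.
The output is a .txt file containing tokens. By tokens I mean plain words and the marks ",", "!", "?", "." and '"'.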

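I have this function:
Input:
Elemnt: a word with or without punctuation, for example: Hi or said: or said"
StrForCheck: the array of punctuation marks I want to separate from the words
TokenFile: my output file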

def CheckIfSEmanExist(Elemnt, StrForCheck, TokenFile):
    # 0 means no punctuation mark was found at either end of the word
    FirstOrLastIsSeman = 0

    for seman in StrForCheck:
        WordSplitOnSeman = Elemnt.split(seman)
        if len(WordSplitOnSeman) > 1:
            if Elemnt[len(Elemnt)-1] == seman:
                FirstOrLastIsSeman = len(Elemnt)-1
            elif Elemnt[0] == seman:
                FirstOrLastIsSeman = 1

    if FirstOrLastIsSeman == 1:
        # punctuation at the start: write the mark, then the word
        TokenFile.write(Elemnt[0])
        TokenFile.write('\n')
        TokenFile.write(Elemnt[1:-1])
        TokenFile.write('\n')

    elif FirstOrLastIsSeman == len(Elemnt)-1:
        # punctuation at the end: write the word, then the mark
        TokenFile.write(Elemnt[0:-1])
        TokenFile.write('\n')
        TokenFile.write(Elemnt[len(Elemnt)-1])
        TokenFile.write('\n')

    elif FirstOrLastIsSeman == 0:
        # no punctuation found: write the word unchanged
        TokenFile.write(Elemnt)
        TokenFile.write('\n')

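The code loops over the punctuation array. If it finds a mark, I check whether that mark is the first or the last character of the word, and then write the word and the mark to the output file on separate lines.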

My problem is that it works great on the whole text except for these words:
work", "create", "public", "police"

Solution:

Note that

for l in open('some_file.txt', 'r'):
    ...

iterates over each line of the file, so you only need to think about what to do with a single line.
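As a rough sketch of that line-by-line loop, the snippet below wraps it in a small generator. The name iter_lines, the utf-8 encoding, and the demo file content are assumptions made for illustration; they are not part of the original answer. The with statement closes the file when the loop finishes, and the trailing newline is stripped so it does not end up inside the last token of each line.

```python
import os
import tempfile


def iter_lines(path):
    """Yield each line of a text file without its trailing newline."""
    # utf-8 is an assumption; use whatever encoding the input file has.
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            yield line.rstrip("\n")


# Small demonstration with a throwaway file (hypothetical content).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "some_file.txt")
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write("Hi, there!\nSecond line.\n")
    lines = list(iter_lines(demo_path))

print(lines)
```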

Consider the following function:

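def tokenizer(l):
    # start index of the word currently being scanned
    prev_i = 0
    for (i, c) in enumerate(l):
        if c in ',.?!- ':
            # emit the word collected before this delimiter, if any
            if prev_i != i:
                yield l[prev_i: i]
            # emit the delimiter itself (a space is emitted as a token too)
            yield c
            prev_i = i + 1
    # emit the trailing word, unless the line ends with a delimiter
    if prev_i < len(l):
        yield l[prev_i: ]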

It "spits out" tokens as it goes. You can use it like this:

l = "hello, hello, what's all this shouting? We'll have no trouble here"
for tok in tokenizer(l):
    print(tok)

hello
,

hello
,

what's

all

this

shouting
?

We'll

have

no

trouble

here
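The output above still contains the spaces as tokens, and the function does not split off quotation marks, which the question asks for. One possible variation, swapping in a regular expression for the character loop, is sketched below. The names TOKEN_RE, tokenize_line and write_tokens are hypothetical, and the punctuation set is an assumption based on the marks listed in the question plus the colon; adjust it to your data. An io.StringIO object stands in for the output .txt file.

```python
import io
import re

# Punctuation treated as separate tokens. This set is an assumption taken
# from the marks listed in the question (, ! ? . ") plus ':'.
# First alternative: one punctuation mark. Second: a run of anything that
# is neither whitespace nor one of those marks (so "what's" stays whole).
TOKEN_RE = re.compile(r'[,!?.":]|[^\s,!?.":]+')


def tokenize_line(line):
    """Return the words and punctuation marks of one line; whitespace is dropped."""
    return TOKEN_RE.findall(line)


def write_tokens(lines, out_file):
    """Write every token of every line to out_file, one token per line."""
    for line in lines:
        for tok in tokenize_line(line):
            out_file.write(tok + "\n")


out = io.StringIO()  # stands in for the output .txt file
write_tokens(['He said: "public", "police"!'], out)
print(out.getvalue())
```

Because every mark is matched on its own, a word wrapped in several marks, such as a quoted word followed by a comma, is split into the word plus each mark, regardless of how many marks surround it.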

Tags: python, tokenize
Source: https://codeday.me/bug/20191013/1906034.html