python – Check whether a string can be segmented into words

Author: Internet

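This is a follow-up to this response and the pseudocode algorithm its author posted. Because that question is old, I did not comment on it. I only want to verify whether a string can be split into words; the algorithm does not need to actually split the string. Here is the answer from the related question: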

Let S[1..length(w)] be a table with Boolean entries. S[i] is true if
the word w[1..i] can be split. Then set S[1] = isWord(w[1]) and for
i=2 to length(w) calculate

S[i] = (isWord(w[1..i]) or for any j in {2..i}: S[j-1] and
isWord(w[j..i])).

I am translating this algorithm into simple Python code, but I am not sure I understand it correctly. Code:

def is_all_words(a_string, dictionary):
    str_len = len(a_string)
    S = [False] * str_len
    S[0] = is_word(a_string[0], dictionary)
    for i in range(1, str_len):
        check = is_word(a_string[0:i], dictionary)
        if (check):
            S[i] = check
        else:
            for j in range(1, str_len):
                check = (S[j - 1] and is_word(a_string[j:i]), dictionary)
                if (check):
                    S[i] == True
                    break
    return S

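I have two related questions. 1) Is this code a correct translation of the linked algorithm into Python, and if so, 2) now that I have S, how do I use it to tell whether the string consists only of words? Here, is_word is a function that simply looks up the given word in a list. I have not implemented it as a trie yet.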

Update: after updating the code to include the suggested change, it still does not work. Here is the updated code:

def is_all_words(a_string, dictionary):
    str_len = len(a_string)
    S = [False] * str_len
    S[0] = is_word(a_string[0], dictionary)
    for i in range(1, str_len):
        check = is_word(a_string[0:i], dictionary)
        if (check):
            S[i] = check
        else:
            for j in range(1, i): #THIS LINE WAS UPDATED
                check = (S[j - 1] and is_word(a_string[j:i]), dictionary)
                if (check):
                    S[i] == True
                    break
    return S

a_string = "carrotforever"
S = is_all_words(a_string, dictionary)
print(S[len(S) - 1]) #prints FALSE

a_string = "hello"
S = is_all_words(a_string, dictionary)
print(S[len(S) - 1]) #prints TRUE

It should return True for both.

Solution:

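Here is a modified version of your code that should return the right results.
Note that your mistake was only in converting the pseudocode's array indices (starting at 1) to Python's array indices (starting at 0), so S[0] and S[1] were filled with the same value, while S[L-1] was never actually computed. You can easily trace this error by printing the whole of S. You will find that in the first example S[3] is set to True, when it should be S[2] for the word "car".
In addition, you can speed up the process by storing the indices of the composite words found so far, instead of testing every position.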

def is_all_words(a_string, dictionary):
    str_len = len(a_string)
    S = [False] * str_len
    # I replaced the is_word function with a simple list lookup;
    # feel free to replace it with whatever function you use.
    # Tries or suffix trees are best for this.
    S[0] = a_string[0] in dictionary
    for i in range(1, str_len):
        check = a_string[0:i + 1] in dictionary  # i + 1 instead of i
        if check:
            S[i] = check
        else:
            # Start j at 1 so that S[j - 1] never wraps around to S[-1].
            for j in range(1, i + 1):  # i + 1 instead of i
                if S[j - 1] and (a_string[j:i + 1] in dictionary):  # i + 1 instead of i
                    S[i] = True
                    break
    return S

a_string = "carrotforever"
S = is_all_words(a_string, ["a","car","carrot","for","eve","forever"])
print(S[len(a_string)-1]) #prints TRUE

a_string = "helloworld"
S = is_all_words(a_string, ["hello","world"])
print(S[len(a_string)-1]) #prints TRUE
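
The speed-up mentioned above (remembering the indices where a valid split already ends, instead of testing every position) could be sketched roughly as follows. This is only an illustration, not part of the original answer: the name can_segment, the use of a set for lookups, and the choice to treat the empty string as splittable are all assumptions. It returns a single boolean, which is also one way to answer the "how do I use S" question: only the entry for the full length matters.

```python
def can_segment(text, words):
    """Return True if `text` can be split entirely into entries of `words`.

    Hypothetical helper for illustration; not from the original answer.
    """
    vocabulary = set(words)  # set membership tests are O(1) on average
    # good_ends holds every position p for which text[:p] is known to be
    # splittable. Position 0 (the empty prefix) is splittable by definition.
    good_ends = [0]
    for end in range(1, len(text) + 1):
        # Only try split points where a valid segmentation already ends,
        # rather than every index below `end`.
        if any(text[start:end] in vocabulary for start in good_ends):
            good_ends.append(end)
    # The whole string is splittable only if its full length was recorded.
    return good_ends[-1] == len(text)


print(can_segment("carrotforever", ["a", "car", "carrot", "for", "eve", "forever"]))
print(can_segment("helloworld", ["hello", "world"]))
```

Working with prefix lengths 0..len(text) rather than character indices sidesteps the 1-based versus 0-based conversion that caused the original bug.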

Tags: python, algorithm, nlp, dynamic-programming, text-segmentation
Source: https://codeday.me/bug/20190621/1250121.html