编程语言
首页 > 编程语言> > 在Python中一次迭代String字

在Python中一次迭代String字

作者:互联网

我有一个巨大的文本文件的字符串缓冲区.我必须在字符串缓冲区中搜索给定的单词/短语.什么是有效的方法呢?

我尝试使用re模块匹配.但由于我有一个巨大的文本语料库,我必须搜索.这需要花费大量时间.

给出单词和短语词典.

我遍历每个文件,将其读入字符串,搜索字典中的所有单词和短语,并在找到键时增加字典中的计数.

我们认为的一个小优化是将短语/单词的字典排序为最大单词数.然后比较字符串缓冲区中的每个单词起始位置并比较单词列表.如果找到一个短语,我们不会搜索其他短语(因为它匹配最长的短语,这是我们想要的)

有人可以建议如何在字符串缓冲区中逐字逐句. (逐字迭代字符串缓冲区)?

此外,还有其他优化可以做到吗?

data = str(file_content)
for j in dictionary_entity.keys():
    cnt = data.count(j+" ")
    if cnt != -1:
        dictionary_entity[j] = dictionary_entity[j] + cnt
f.close()

解决方法:

通过文件的内容(在我的案例中来自Project Gutenberg的绿野仙踪)逐字逐句地迭代,有三种不同的方式:

from __future__ import with_statement
import time
import re
from cStringIO import StringIO

def word_iter_std(filename):
    start = time.time()
    with open(filename) as f:
        for line in f:
            for word in line.split():
                yield word
    print 'iter_std took %0.6f seconds' % (time.time() - start)

def word_iter_re(filename):
    start = time.time()
    with open(filename) as f:
        txt = f.read()
    for word in re.finditer('\w+', txt):
        yield word
    print 'iter_re took %0.6f seconds' % (time.time() - start)

def word_iter_stringio(filename):
    start = time.time()
    with open(filename) as f:
        io = StringIO(f.read())
    for line in io:
        for word in line.split():
            yield word
    print 'iter_io took %0.6f seconds' % (time.time() - start)

woo = '/tmp/woo.txt'

for word in word_iter_std(woo): pass
for word in word_iter_re(woo): pass
for word in word_iter_stringio(woo): pass

导致:

% python /tmp/junk.py
iter_std took 0.016321 seconds
iter_re took 0.028345 seconds
iter_io took 0.016230 seconds

标签:string-matching,python,string
来源: https://codeday.me/bug/20190726/1546272.html