Iterating over a string one word at a time in Python
I have a huge text file in a string buffer, and I have to search the buffer for given words/phrases. What is an efficient way to do this?
I tried matching with the re module, but since I have a huge corpus of text to search, it takes a large amount of time.
I am given a dictionary of words and phrases.
I iterate over each file, read it into a string, search for every word and phrase from the dictionary, and increment the count in the dictionary whenever a key is found.
One small optimization we considered is sorting the dictionary of phrases/words by the maximum number of words. Then we compare each word start position in the string buffer against the word list; if a phrase is found, we do not search for the other phrases (since it matched the longest phrase, which is what we want).
Can someone suggest how to go through the string buffer word by word (iterate over the string buffer one word at a time)?
Also, are there any other optimizations that could be made?
data = str(file_content)
for j in dictionary_entity.keys():
    cnt = data.count(j + " ")
    # str.count() returns 0 when nothing is found (it is str.find()
    # that returns -1), so the count can be added unconditionally.
    dictionary_entity[j] += cnt
f.close()
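The longest-match-first idea described in the question can be sketched with a single regex alternation whose branches are sorted longest-first, so the regex engine prefers the longest phrase at each position. This is a minimal Python 3 sketch with a made-up sample dictionary and text (not the asker's actual data):

```python
import re

# Hypothetical dictionary of words/phrases to count (for illustration only).
counts = {"wizard of oz": 0, "wizard": 0, "oz": 0}

# Sort keys longest-first so the alternation matches the longest phrase;
# \b word boundaries avoid matching inside other words.
pattern = re.compile(
    r"\b(" + "|".join(re.escape(k) for k in sorted(counts, key=len, reverse=True)) + r")\b"
)

data = "the wizard of oz met a wizard in oz"
for m in pattern.finditer(data):
    counts[m.group(1)] += 1

print(counts)  # "wizard of oz" consumes the first occurrence, so each key counts once here
```

Because the regex engine consumes each match, a region counted as "wizard of oz" is not also counted as "wizard" or "oz", which is the behavior the question asks for.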
Solution:
Three different ways to iterate word by word through the contents of a file (in my case, The Wizard of Oz from Project Gutenberg):
from __future__ import with_statement
import time
import re
from cStringIO import StringIO

def word_iter_std(filename):
    start = time.time()
    with open(filename) as f:
        for line in f:
            for word in line.split():
                yield word
    print 'iter_std took %0.6f seconds' % (time.time() - start)

def word_iter_re(filename):
    start = time.time()
    with open(filename) as f:
        txt = f.read()
        for word in re.finditer(r'\w+', txt):
            yield word
    print 'iter_re took %0.6f seconds' % (time.time() - start)

def word_iter_stringio(filename):
    start = time.time()
    with open(filename) as f:
        io = StringIO(f.read())
        for line in io:
            for word in line.split():
                yield word
    print 'iter_io took %0.6f seconds' % (time.time() - start)

woo = '/tmp/woo.txt'

for word in word_iter_std(woo): pass
for word in word_iter_re(woo): pass
for word in word_iter_stringio(woo): pass
Which results in:
% python /tmp/junk.py
iter_std took 0.016321 seconds
iter_re took 0.028345 seconds
iter_io took 0.016230 seconds
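The code above is Python 2; as a Python 3 sketch of how such a word iterator can be combined with the asker's counting step in a single pass, here is a hypothetical `count_words` helper (names and sample text are my own, not from the answer):

```python
import re
from collections import Counter

def count_words(text, vocabulary):
    # One pass over the text: iterate words with re.finditer and
    # tally only those present in the vocabulary set.
    vocab = set(vocabulary)
    counts = Counter()
    for m in re.finditer(r"\w+", text):
        word = m.group(0).lower()
        if word in vocab:
            counts[word] += 1
    return counts

print(count_words("The Wizard of Oz, the wizard!", {"wizard", "oz"}))
```

Using a set for membership tests makes each lookup O(1) on average, so the whole count is a single linear scan of the text instead of one `str.count` scan per dictionary key.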
Tags: string-matching, python, string  Source: https://codeday.me/bug/20190726/1546272.html