Smart filter with Python

Author: Internet

Hi,
I need to filter out all rows that do not contain a symbol from a huge "necessary" list. Example code:

import re

def any_it(iterable):
      for element in iterable:
          if element: return True
      return False

regexp = re.compile(r'fruit=([A-Z]+)')
necessary = ['YELLOW', 'GREEN', 'RED', ...] # huge list of 10 000 members
f = open("huge_file", "r") ## file with > 100 000 lines
lines = f.readlines()
f.close()

## File rows like, let's say:
# 1 djhds fruit=REDSOMETHING sdkjld
# 2 sdhfkjk fruit=GREENORANGE lkjfldk
# 3 dskjldsj fruit=YELLOWDOG sldkfjsdl
# 4 gfhfg fruit=REDSOMETHINGELSE fgdgdfg

filtered = (line for line in lines if any_it(regexp.findall(line)[0].startswith(x) for x in necessary))

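I have Python 2.4, so I can't use the built-in any().
The filtering takes a very long time. Is there a way to optimize it? For example, rows 1 and 4 both contain the "RED.." pattern. Once we have found that "RED.." matches, can we skip searching the 10,000-member list for the same pattern on row 4?
Are there other ways to optimize the filtering?
Thanks.
... Edit ...
UPD: see the real example data in the comments to this post. I'm also interested in sorting the resulting lines by "fruit". Thanks!
... End edit ...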

Solution:

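I'm convinced Zach's answer is on the right track. Out of curiosity, I've implemented another version (incorporating Zach's comments about using a dict instead of bisect) and folded it into a solution that matches your example.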

#!/usr/bin/env python
import re
from trieMatch import PrefixMatch # https://gist.github.com/736416

pm = PrefixMatch(['YELLOW', 'GREEN', 'RED', ]) # huge list of 10 000 members
# if list is static, it might be worth pickling "pm" to avoid rebuilding each time

f = open("huge_file.txt", "r") ## file with > 100 000 lines
lines = f.readlines()
f.close()

regexp = re.compile(r'^.*?fruit=([A-Z]+)')
filtered = (line for line in lines if pm.match(regexp.match(line).group(1)))

For brevity, the implementation of PrefixMatch is published separately (see the gist URL in the import comment above).
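The gist is not reproduced here, so the following is only a minimal sketch of how a trie-based prefix matcher could work, not the actual PrefixMatch code. The class name PrefixTrie and its internals are assumptions made for illustration; only the match() method name mirrors the usage above.

```python
class PrefixTrie(object):
    """Hypothetical stand-in for PrefixMatch: a trie built from nested dicts."""

    _END = None  # marker key: a complete prefix ends at this node

    def __init__(self, prefixes):
        self.root = {}
        for prefix in prefixes:
            node = self.root
            for char in prefix:
                node = node.setdefault(char, {})
            node[self._END] = True

    def match(self, text):
        """Return True if any stored prefix is a prefix of text."""
        node = self.root
        for char in text:
            if self._END in node:
                return True   # a stored prefix ended before this character
            if char not in node:
                return False  # no stored prefix continues with this character
            node = node[char]
        return self._END in node


trie = PrefixTrie(['YELLOW', 'GREEN', 'RED'])
print(trie.match('REDSOMETHING'))   # True
print(trie.match('GREENORANGE'))    # True
print(trie.match('BLUEBERRY'))      # False
print(trie.match('RE'))             # False
```

With this layout a lookup walks at most len(text) nodes, so its cost depends on the length of the fruit token rather than on the 10,000 entries in the list.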

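If your list of necessary prefixes is static or changes infrequently, you can speed up subsequent runs by pickling and reusing the PrefixMatch object rather than rebuilding it each time.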
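As a rough sketch of the pickling idea, assume the matcher is a plain picklable structure (here a hypothetical nested-dict trie rather than the real PrefixMatch object). It can then be cached on disk and reloaded on later runs. The helper names build_trie and load_or_build and the cache file name are illustrative only.

```python
import os
import pickle
import tempfile


def build_trie(prefixes):
    """Build a nested-dict trie; the None key marks the end of a prefix."""
    root = {}
    for prefix in prefixes:
        node = root
        for char in prefix:
            node = node.setdefault(char, {})
        node[None] = True
    return root


def load_or_build(cache_path, prefixes):
    """Load a pickled trie from cache_path, or build it and write the cache."""
    if os.path.exists(cache_path):
        f = open(cache_path, 'rb')
        try:
            return pickle.load(f)
        finally:
            f.close()
    trie = build_trie(prefixes)
    f = open(cache_path, 'wb')
    try:
        pickle.dump(trie, f, pickle.HIGHEST_PROTOCOL)
    finally:
        f.close()
    return trie


# Hypothetical cache location in a fresh temporary directory.
cache_path = os.path.join(tempfile.mkdtemp(), 'prefixes.pickle')
first = load_or_build(cache_path, ['YELLOW', 'GREEN', 'RED'])  # builds and caches
second = load_or_build(cache_path, [])  # cache exists, so the list is ignored
print(first == second)
```

This sketch never invalidates the cache, so the file has to be deleted by hand whenever the prefix list changes.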

Update (regarding sorting the results)

According to the changelog for Python 2.4:

key should be a single-parameter function that takes a list element and
returns a comparison key for the element. The list is then sorted using
the comparison keys.

Also, in the source code, line 1792:

/* Special wrapper to support stable sorting using the decorate-sort-undecorate
   pattern.  Holds a key which is used for comparisons and the original record
   which is returned during the undecorate phase.  By exposing only the key
   .... */

This means your regex pattern is evaluated only once per entry (not once per comparison), so it shouldn't be too expensive:

# key must be a function of one line; sorted() returns a list, not a generator
sorted_generator = sorted(filtered, key=lambda line: regexp.match(line).group(1))
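To illustrate the once-per-entry behaviour, here is a small sketch that counts how often the key function runs. The names fruit_key, calls and sorted_lines are hypothetical. The sample lines come from the question, and the code assumes every line contains a fruit= field (a non-matching line would raise AttributeError).

```python
import re

regexp = re.compile(r'^.*?fruit=([A-Z]+)')
calls = []  # records every key computation, for illustration only


def fruit_key(line):
    """Return the captured fruit token; assumes the line matches regexp."""
    calls.append(line)
    return regexp.match(line).group(1)


lines = [
    '1 djhds fruit=REDSOMETHING sdkjld',
    '2 sdhfkjk fruit=GREENORANGE lkjfldk',
    '3 dskjldsj fruit=YELLOWDOG sldkfjsdl',
    '4 gfhfg fruit=REDSOMETHINGELSE fgdgdfg',
]

sorted_lines = sorted(lines, key=fruit_key)
for line in sorted_lines:
    print(line)
print(len(calls))  # one key computation per input line
```

The lines come out ordered by fruit token (GREENORANGE, REDSOMETHING, REDSOMETHINGELSE, YELLOWDOG), and the counter equals the number of input lines.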

Tags: python, optimization, text, filter, python-2-4
Source: https://codeday.me/bug/20190721/1497039.html