How can I improve file line reading in Python C Extensions?

2019-10-10 07:08:47 Author: Internet

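This was originally asked in "Are there alternative and portable algorithm implementation for reading lines from a file on Windows (Visual Studio Compiler) and Linux?", but that question was closed as too broad, so here I am trying to narrow its scope with a more concise use case.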

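My goal is to implement my own file reading module for Python, using Python C Extensions with a line caching policy. The pure Python implementation, without any line caching policy, is this: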

# This takes 1 second to parse 100MB of log data
with open('myfile', 'r', errors='replace') as myfile:
    for line in myfile:
        if 'word' in line: 
            pass

Summary of the Python C Extension implementation (the original post links to the full code with the line caching policy):

// other code to open the file on the std::ifstream object and create the iterator
...

static PyObject * PyFastFile_iternext(PyFastFile* self, PyObject* args)
{
    std::string newline;

    if( std::getline( self->fileifstream, newline ) ) {
        return PyUnicode_DecodeUTF8( newline.c_str(), newline.size(), "replace" );
    }

    PyErr_SetNone( PyExc_StopIteration );
    return NULL;
}

static PyTypeObject PyFastFileType =
{
    PyVarObject_HEAD_INIT( NULL, 0 )
    "fastfilepackage.FastFile" /* tp_name */
};

// create the module
PyMODINIT_FUNC PyInit_fastfilepackage(void)
{
    PyFastFileType.tp_iternext = (iternextfunc) PyFastFile_iternext;
    Py_INCREF( &PyFastFileType );

    PyObject* thismodule;
    // other module code creating the iterator and context manager
    ...

    PyModule_AddObject( thismodule, "FastFile", (PyObject *) &PyFastFileType );
    return thismodule;
}

This is the Python code that uses the Python C Extension to open a file and read it line by line:

import fastfilepackage

# This takes 3 seconds to parse 100MB of log data
iterable = fastfilepackage.FastFile( 'myfile' )
for item in iterable:
    if 'word' in iterable():
        pass

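Right now, the Python C Extension code fastfilepackage.FastFile, built on C++11 std::ifstream, takes 3 seconds to parse 100MB of log data, while the pure Python implementation shown above takes 1 second.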

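The file myfile contains only log lines, each about 100 to 300 characters long. The characters are plain ASCII (modulo 256), but because of bugs in the logger engine it can contain invalid ASCII or Unicode characters. That is why I use the errors='replace' policy when opening the file.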

I would like to know whether I can replace or improve this Python C Extension implementation to bring the Python program's run time below those 3 seconds.

I used this to do the benchmark:

import time
import datetime
import fastfilepackage

# usually a file with 100MB
testfile = './myfile.log'

timenow = time.time()
with open( testfile, 'r', errors='replace' ) as myfile:
    for item in myfile:
        if None:
            var = item

python_time = time.time() - timenow
timedifference = datetime.timedelta( seconds=python_time )
print( 'Python   timedifference', timedifference, flush=True )
# prints about 3 seconds

timenow = time.time()
iterable = fastfilepackage.FastFile( testfile )
for item in iterable:
    if None:
        var = iterable()

fastfile_time = time.time() - timenow
timedifference = datetime.timedelta( seconds=fastfile_time )
print( 'FastFile timedifference', timedifference, flush=True )
# prints about 1 second

print( 'fastfile_time %.2f%%, python_time %.2f%%' % ( 
        fastfile_time/python_time, python_time/fastfile_time ), flush=True )

Answer:

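Reading line by line is going to cause unavoidable slowdowns here. Python's built-in text-oriented, read-only file objects are actually three layers: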

> io.FileIO: raw, unbuffered access to the file
> io.BufferedReader: buffers the underlying FileIO
> io.TextIOWrapper: wraps the BufferedReader to implement buffered decoding to str
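To make the layering concrete, here is a small sketch that opens a file in text mode and walks down through the three objects via the standard `buffer` and `raw` attributes. The temporary file and its contents are made up purely so the snippet is self-contained.

```python
import os
import tempfile

# Hypothetical sample file, created only so the snippet can run on its own.
with tempfile.NamedTemporaryFile('wb', suffix='.log', delete=False) as tmp:
    tmp.write(b'first line\nsecond line\n')
    path = tmp.name

with open(path, 'r', errors='replace') as text_layer:
    buffered_layer = text_layer.buffer   # the BufferedReader under the text layer
    raw_layer = buffered_layer.raw       # the FileIO under the buffered layer
    layer_names = [type(layer).__name__
                   for layer in (text_layer, buffered_layer, raw_layer)]
    lines = list(text_layer)             # iteration happens at the text layer

os.remove(path)
print(layer_names)
```

Iterating the outermost object is what produces `str` lines; the two inner objects only ever deal in bytes.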

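While iostream does perform buffering, it only does the job of io.BufferedReader, not io.TextIOWrapper. io.TextIOWrapper adds an extra layer of buffering: it reads 8 KB chunks out of the BufferedReader and decodes them in bulk to str (when a chunk ends in an incomplete character, it saves the remaining bytes to prepend to the next chunk), then yields individual lines from the decoded chunk on request until it is exhausted (when a decoded chunk ends in a partial line, the remainder is prepended to the next decoded chunk).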
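The "save the remaining bytes for the next chunk" behaviour can be seen in isolation with the standard library's incremental decoder. This is only an illustration of the idea, not the code path TextIOWrapper itself runs; the sample text and the split position are chosen for the demonstration.

```python
import codecs

# Incremental decoder with the same error policy used in the question.
decoder = codecs.getincrementaldecoder('utf-8')(errors='replace')

text = 'na\u00efve caf\u00e9\n'
data = text.encode('utf-8')

# Split the bytes in the middle of the two-byte sequence 0xC3 0xAF,
# as if a chunk boundary had landed inside one character.
split_at = data.index(b'\xc3') + 1
first_chunk, second_chunk = data[:split_at], data[split_at:]

part1 = decoder.decode(first_chunk)    # the lone lead byte is held back
part2 = decoder.decode(second_chunk)   # completed once the next chunk arrives
tail = decoder.decode(b'', final=True) # flush: nothing is left over here

print(repr(part1), repr(part2), repr(tail))
```

A plain `bytes.decode` call on `first_chunk` alone would instead emit a replacement character for the truncated sequence, which is why chunked decoding needs a stateful decoder.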

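By contrast, you consume one line at a time with std::getline, decode one line at a time with PyUnicode_DecodeUTF8, and then yield back to the caller. By the time the caller requests the next line, odds are that at least some of the code associated with your tp_iternext implementation has left the CPU cache (or at least the fastest parts of the cache). A tight loop decoding 8 KB of UTF-8 text is going to be extremely fast; repeatedly leaving the loop and decoding only 100-300 bytes at a time is going to be slower.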
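A rough way to get a feel for "many small decodes versus one bulk decode" is the timing sketch below. It is only an analogy: at the Python level the difference comes mostly from per-call overhead rather than CPU cache effects, the log lines are synthetic stand-ins for the real file, and the numbers will vary by machine, so no expected timings are given.

```python
import timeit

# Synthetic ASCII log lines (an assumption standing in for the real log file).
raw_lines = [b'2019-10-10 07:08:47 INFO some log word message %06d\n' % i
             for i in range(20000)]
blob = b''.join(raw_lines)

def decode_per_line():
    # One decode call per line, similar in spirit to decoding after each getline.
    return [line.decode('utf-8', 'replace') for line in raw_lines]

def decode_in_bulk():
    # One decode call for everything, then split the decoded text into lines.
    return blob.decode('utf-8', 'replace').splitlines(keepends=True)

per_line_seconds = timeit.timeit(decode_per_line, number=20)
bulk_seconds = timeit.timeit(decode_in_bulk, number=20)
print('per line: %.3fs, bulk: %.3fs' % (per_line_seconds, bulk_seconds))
```

Both functions produce the same list of lines for this data; only the amount of work done per call differs.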

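The solution is to do roughly what io.TextIOWrapper does: read in chunks rather than lines, decode them in bulk (preserving incomplete UTF-8 encoded characters for the next chunk), then search for newlines to fish substrings out of the decoded buffer until it is exhausted (don't trim the buffer each time, just track indices). When no complete lines remain in the decoded buffer, trim what you have already yielded, then read, decode, and append a new chunk.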
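The scheme described above can be sketched in Python before porting it to C++. This is a minimal illustration under stated assumptions, not the actual TextIOWrapper implementation: `iter_lines_chunked` is a hypothetical name, only '\n' is treated as a line ending, and the 8192-byte default simply mirrors the 8 KB figure mentioned above.

```python
import codecs
import io

def iter_lines_chunked(binary_file, chunk_size=8192):
    """Yield decoded lines from a binary file object, decoding chunk by chunk."""
    decoder = codecs.getincrementaldecoder('utf-8')(errors='replace')
    buffer = ''   # decoded text that has not been fully yielded yet
    start = 0     # index in buffer of the first character not yet yielded
    while True:
        end = buffer.find('\n', start)
        if end != -1:
            # A complete line is available: slice it out and advance the index
            # without trimming the buffer.
            yield buffer[start:end + 1]
            start = end + 1
            continue
        # No complete line left: read the next chunk of raw bytes.
        chunk = binary_file.read(chunk_size)
        if not chunk:
            # End of file: flush the decoder and emit any trailing partial line.
            buffer = buffer[start:] + decoder.decode(b'', final=True)
            if buffer:
                yield buffer
            return
        # Trim what was already yielded, then append the newly decoded text.
        buffer = buffer[start:] + decoder.decode(chunk)
        start = 0

# Usage with an in-memory file; the tiny chunk size forces many chunk boundaries.
sample = io.BytesIO(b'alpha\nbeta \xff gamma\nlast line without newline')
sample_lines = list(iter_lines_chunked(sample, chunk_size=4))
print(sample_lines)
```

In the C extension the same structure would apply, with the decoded buffer held as a Python str (or a raw character buffer) on the iterator object and the indices stored alongside it.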

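There is some room for improvement over Python's underlying implementation of io.TextIOWrapper.readline (for example, it has to construct a Python-level int each time it reads a chunk, and it calls the read indirectly, since it cannot guarantee that it is wrapping a BufferedReader), but it is a solid basis for reimplementing your own scheme.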

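Update: On checking your full code (which is wildly different from what you posted), you have other issues. Your tp_iternext just repeatedly yields None, requiring you to call your object to retrieve the string. That is unfortunate. It more than doubles the Python interpreter overhead per item: tp_iternext is cheap to call, being quite specialized, while tp_call is not nearly so cheap, because it goes through convoluted general-purpose code paths and requires the interpreter to pass an empty tuple of args that you never use. As a side note, PyFastFile_tp_call should accept a third argument for kwds; you can ignore it, but it must still be accepted. Casting to ternaryfunc silences the error, but this will break on some platforms.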
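The two calling patterns can be mimicked in pure Python to show what the caller has to do in each case. This is only an analogy for the C-level tp_iternext and tp_call slots; `CallToFetch` and `DirectIterator` are hypothetical classes written for this illustration, not part of fastfilepackage.

```python
class CallToFetch:
    """Mimics the full code: iteration yields None, calling returns the line."""

    def __init__(self, lines):
        self._source = iter(lines)
        self._current = None

    def __iter__(self):
        return self

    def __next__(self):
        # Advance and remember the line, but hand back None to the loop.
        self._current = next(self._source)
        return None

    def __call__(self):
        # A second, general-purpose call is needed to get the actual line.
        return self._current


class DirectIterator:
    """Mimics the suggested design: iteration yields the line itself."""

    def __init__(self, lines):
        self._source = iter(lines)

    def __iter__(self):
        return self

    def __next__(self):
        return next(self._source)


lines = ['alpha word\n', 'beta\n', 'gamma word\n']

via_call = []
reader = CallToFetch(lines)
for item in reader:            # item is always None here
    via_call.append(reader())  # extra call per line to fetch the string

via_iter = []
for line in DirectIterator(lines):
    via_iter.append(line)      # the loop variable already is the string

print(via_call == via_iter)
```

Both loops collect the same lines, but the first one pays for two dispatches per item where the second pays for one.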

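Final note (not really relevant to performance for all but the smallest files): the contract for tp_iternext does not require you to set an exception when the iterator is exhausted, only that you return NULL. You can remove your call to PyErr_SetNone( PyExc_StopIteration ); as long as no other exception is set, return NULL; alone indicates the end of iteration, so you can save some work by not setting it at all.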

Tags: c++, python, c, python-c-api
Source: https://codeday.me/bug/20191010/1884895.html