c++ - Boost Spirit QI is slow
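I am trying to parse TPC-H files with Boost Spirit QI.
My implementation is inspired by the employee example from Spirit QI (http://www.boost.org/doc/libs/1_52_0/libs/spirit/example/qi/employee.cpp).
The data is in CSV format, with tokens delimited by a "|" character.
It works, but it is very slow (20 seconds for 1 GB).
Here is my qi grammar for the lineitem file: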
struct lineitem {
    int l_orderkey;
    int l_partkey;
    int l_suppkey;
    int l_linenumber;
    std::string l_quantity;
    std::string l_extendedprice;
    std::string l_discount;
    std::string l_tax;
    std::string l_returnflag;
    std::string l_linestatus;
    std::string l_shipdate;
    std::string l_commitdate;
    std::string l_recepitdate;
    std::string l_shipinstruct;
    std::string l_shipmode;
    std::string l_comment;
};

BOOST_FUSION_ADAPT_STRUCT( lineitem,
    (int, l_orderkey)
    (int, l_partkey)
    (int, l_suppkey)
    (int, l_linenumber)
    (std::string, l_quantity)
    (std::string, l_extendedprice)
    (std::string, l_discount)
    (std::string, l_tax)
    (std::string, l_returnflag)
    (std::string, l_linestatus)
    (std::string, l_shipdate)
    (std::string, l_commitdate)
    (std::string, l_recepitdate)
    (std::string, l_shipinstruct)
    (std::string, l_shipmode)
    (std::string, l_comment))

vector<lineitem>* lineitems = new vector<lineitem>();
phrase_parse(state->dataPointer,
             state->dataEndPointer,
             (*(int_ >> "|" >>
                int_ >> "|" >>
                int_ >> "|" >>
                int_ >> "|" >>
                +(char_ - '|') >> "|" >>
                +(char_ - '|') >> "|" >>
                +(char_ - '|') >> "|" >>
                +(char_ - '|') >> "|" >>
                +(char_ - '|') >> '|' >>
                +(char_ - '|') >> '|' >>
                +(char_ - '|') >> '|' >>
                +(char_ - '|') >> '|' >>
                +(char_ - '|') >> '|' >>
                +(char_ - '|') >> '|' >>
                +(char_ - '|') >> '|' >>
                +(char_ - '|') >> '|'
             )), space, *lineitems
);
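The problem seems to be the character parsing. It is much slower than the other conversions.
Is there a better way to parse variable-length tokens into a string?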
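As a point of comparison (not from the original post), one way to take variable-length tokens without appending character by character is to keep views into the input buffer. The sketch below assumes C++17 and uses only the standard library rather than Spirit. `split_fields` and `to_int` are hypothetical names introduced here for illustration.

```cpp
#include <charconv>
#include <cstddef>
#include <string_view>
#include <system_error>
#include <vector>

// Hypothetical helper, not part of Boost.Spirit: splits one record into
// views of its '|'-terminated fields without copying any characters.
inline std::vector<std::string_view> split_fields(std::string_view record) {
    std::vector<std::string_view> fields;
    std::size_t start = 0;
    while (start < record.size()) {
        std::size_t bar = record.find('|', start);
        if (bar == std::string_view::npos) break;  // ignore an unterminated tail
        fields.push_back(record.substr(start, bar - start));
        start = bar + 1;
    }
    return fields;
}

// Hypothetical helper: converts a decimal field to int. Returns false when
// the field is empty or is not entirely an integer.
inline bool to_int(std::string_view field, int& out) {
    const char* first = field.data();
    const char* last = first + field.size();
    auto result = std::from_chars(first, last, out);
    return result.ec == std::errc() && result.ptr == last;
}
```

The views stay valid only as long as the underlying buffer does, so a field that has to outlive the buffer still needs to be copied once, as a whole, into a std::string or a fixed array.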
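Solution:
I found a solution to my problem. As described in the post "Boost Spirit QI grammar slow for parsing delimited strings",
the performance bottleneck is Spirit Qi's string handling. All other data types seem to be quite fast.
I avoid the problem by handling the data myself instead of letting Spirit Qi do it.
My solution uses a helper class that provides a function for each field of the CSV file. The functions store the values into a struct. Strings are stored in char[] arrays. When the parser reaches the end of a line, it calls a function that appends the struct to the result vector.
The Boost parser calls these functions instead of storing the values into a vector itself.
Here is my code for the region.tbl file of the TPC-H benchmark: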
struct region{
    int r_regionkey;
    char r_name[25];
    char r_comment[152];
};

// Helper whose member functions are invoked by the parser's semantic actions.
class regionStorage{
public:
    regionStorage(vector<region>* regions) :regions(regions), pos(0) {}
    void storer_regionkey(int const&i){
        currentregion.r_regionkey = i;
    }
    // Appends one character of the name field.
    // Note: there is no bounds check and no terminating '\0' is written.
    void storer_name(char const&i){
        currentregion.r_name[pos] = i;
        pos++;
    }
    // Appends one character of the comment field (same caveats as above).
    void storer_comment(char const&i){
        currentregion.r_comment[pos] = i;
        pos++;
    }
    // Called at the end of a string field so the next field starts at index 0.
    void resetPos() {
        pos = 0;
    }
    // Called after the last field: stores the finished record.
    void endOfLine() {
        pos = 0;
        regions->push_back(currentregion);
    }
private:
    vector<region>* regions;
    region currentregion;
    int pos;
};

void parseRegion(){
    vector<region> regions;
    regionStorage regionstorageObject(&regions);
    phrase_parse(dataPointer, /*< start iterator >*/
                 state->dataEndPointer, /*< end iterator >*/
                 (*(lexeme[
                     +(int_[boost::bind(&regionStorage::storer_regionkey, &regionstorageObject, _1)] - '|') >> '|' >>
                     +(char_[boost::bind(&regionStorage::storer_name, &regionstorageObject, _1)] - '|') >> char_('|')[boost::bind(&regionStorage::resetPos, &regionstorageObject)] >>
                     +(char_[boost::bind(&regionStorage::storer_comment, &regionstorageObject, _1)] - '|') >> char_('|')[boost::bind(&regionStorage::endOfLine, &regionstorageObject)]
                 ])), space);
    cout << regions.size() << endl;
}
It is not a pretty solution, but it works and it is much faster (2.2 seconds for 1 GB of TPC-H data, multithreaded).
Tags: boost-spirit-qi, c++, csv, parsing, boost-spirit Source: https://codeday.me/bug/20191002/1840817.html