
python - lark-parser: indentation-based DSL and multiline docstrings

Author: Internet

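I'm trying to implement a record-definition DSL using lark. It is indentation-based, which makes things a bit more complicated.

Lark is a great tool, but I'm running into some difficulties.

Here is a snippet of the DSL I'm implementing: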

record Order :
    """Order record documentation
    should have arbitrary size"""

    field1 Int
    field2 Datetime:
        """Attributes should also have
        multiline documentation"""

    field3 String "inline documentation also works"

This is the grammar I'm using:

?start: (_NEWLINE | redorddef)*

simple_type: NAME

multiline_doc:  MULTILINE_STRING _NEWLINE
inline_doc: INLINE_STRING

?element_doc:  ":" _NEWLINE _INDENT multiline_doc _DEDENT | inline_doc

attribute_name: NAME
attribute_simple_type: attribute_name simple_type [element_doc] _NEWLINE
attributes: attribute_simple_type+
_recordbody: _NEWLINE _INDENT [multiline_doc] attributes _DEDENT
redorddef: "record" NAME ":" _recordbody



MULTILINE_STRING: /"""([^"\\]*(\\.[^"\\]*)*)"""/
INLINE_STRING: /"([^"\\]*(\\.[^"\\]*)*)"/

_WS_INLINE: (" "|/\t/)+
COMMENT: /#[^\n]*/
_NEWLINE: ( /\r?\n[\t ]*/ | COMMENT )+

%import common.CNAME -> NAME
%import common.INT

%ignore /[\t \f]+/  // WS
%ignore /\\[\t \f]*\r?\n/   // LINE_CONT
%ignore COMMENT
%declare _INDENT _DEDENT

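It works for the multiline docstring of the record definition and for inline documentation on attributes, but not for multiline docstrings on attributes.

The code I use to run it is: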

import sys
import pprint

from pathlib import Path

from lark import Lark, UnexpectedInput
from lark.indenter import Indenter

scheman_data_works = '''
record Order :
        """Order record documentation
        should have arbitrary size"""

        field1 Int
        # field2 Datetime:
        #   """Attributes should also have
        #   multiline documentation"""

        field3 String "inline documentation also works"
'''

scheman_data_wrong = '''
record Order :
        """Order record documentation
        should have arbitrary size"""

        field1 Int
        field2 Datetime:
                """Attributes should also have
                multiline documentation"""

        field3 String "inline documentation also works"
'''
grammar = r'''

?start: (_NEWLINE | redorddef)*

simple_type: NAME

multiline_doc:  MULTILINE_STRING _NEWLINE
inline_doc: INLINE_STRING

?element_doc:  ":" _NEWLINE _INDENT multiline_doc _DEDENT | inline_doc

attribute_name: NAME
attribute_simple_type: attribute_name simple_type [element_doc] _NEWLINE
attributes: attribute_simple_type+
_recordbody: _NEWLINE _INDENT [multiline_doc] attributes _DEDENT
redorddef: "record" NAME ":" _recordbody



MULTILINE_STRING: /"""([^"\\]*(\\.[^"\\]*)*)"""/
INLINE_STRING: /"([^"\\]*(\\.[^"\\]*)*)"/

_WS_INLINE: (" "|/\t/)+
COMMENT: /#[^\n]*/
_NEWLINE: ( /\r?\n[\t ]*/ | COMMENT )+

%import common.CNAME -> NAME
%import common.INT

%ignore /[\t \f]+/  // WS
%ignore /\\[\t \f]*\r?\n/   // LINE_CONT
%ignore COMMENT
%declare _INDENT _DEDENT

'''

class SchemanIndenter(Indenter):
    NL_type = '_NEWLINE'
    OPEN_PAREN_types = ['LPAR', 'LSQB', 'LBRACE']
    CLOSE_PAREN_types = ['RPAR', 'RSQB', 'RBRACE']
    INDENT_type = '_INDENT'
    DEDENT_type = '_DEDENT'
    tab_len = 4

scheman_parser = Lark(grammar, parser='lalr', postlex=SchemanIndenter())
print(scheman_parser.parse(scheman_data_works).pretty())
print("\n\n")
print(scheman_parser.parse(scheman_data_wrong).pretty())

The result is:

redorddef
  Order
  multiline_doc """Order record documentation
        should have arbitrary size"""
  attributes
    attribute_simple_type
      attribute_name    field1
      simple_type       Int
    attribute_simple_type
      attribute_name    field3
      simple_type       String
      inline_doc        "inline documentation also works"




Traceback (most recent call last):
File "schema_parser.py", line 83, in <module>
    print(scheman_parser.parse(scheman_data_wrong).pretty())
File "/Users/branquif/Dropbox/swf_projects/schema-manager/.venv/lib/python3.7/site-packages/lark/lark.py", line 228, in parse
    return self.parser.parse(text)
File "/Users/branquif/Dropbox/swf_projects/schema-manager/.venv/lib/python3.7/site-packages/lark/parser_frontends.py", line 38, in parse
    return self.parser.parse(token_stream, *[sps] if sps is not NotImplemented else [])
File "/Users/branquif/Dropbox/swf_projects/schema-manager/.venv/lib/python3.7/site-packages/lark/parsers/lalr_parser.py", line 68, in parse
    for token in stream:
File "/Users/branquif/Dropbox/swf_projects/schema-manager/.venv/lib/python3.7/site-packages/lark/indenter.py", line 31, in process
    for token in stream:
File "/Users/branquif/Dropbox/swf_projects/schema-manager/.venv/lib/python3.7/site-packages/lark/lexer.py", line 319, in lex
    for x in l.lex(stream, self.root_lexer.newline_types, self.root_lexer.ignore_types):
File "/Users/branquif/Dropbox/swf_projects/schema-manager/.venv/lib/python3.7/site-packages/lark/lexer.py", line 167, in lex
    raise UnexpectedCharacters(stream, line_ctr.char_pos, line_ctr.line, line_ctr.column, state=self.state)
lark.exceptions.UnexpectedCharacters: No terminal defined for 'f' at line 11 col 2

        field3 String "inline documentation also
^

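I understand that indentation-based grammars are more complex, and lark seems to make them easier, but I can't find the mistake here.

PS: I also tried pyparsing, without success for this scenario, and switching to PLY would be hard given the amount of code it would probably need.

Solution:

The error comes from misplaced _NEWLINE terminals. In general, it is a good idea to keep rules balanced with respect to their role in the grammar. So this is how element_doc should be defined: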

?element_doc:  ":" _NEWLINE _INDENT multiline_doc _DEDENT
            | inline_doc _NEWLINE

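Note the added _NEWLINE: whichever of the two alternatives the parser takes, it ends up in a similar state syntactically (the _DEDENT branch has also already consumed a newline).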
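To make the imbalance concrete, here is a simplified, library-free sketch of what an indentation post-lexer does with the failing attribute. The regexes and the `tokenize` helper are illustrative stand-ins written for this example, not Lark's actual implementation; they only mimic the idea that each newline token is followed by any _INDENT or _DEDENT tokens it implies.

```python
import re

# Simplified stand-ins for the grammar's terminals (illustration only).
TOKEN_RE = re.compile(r'''
    (?P<MULTILINE_STRING>"""[^"\\]*(?:\\.[^"\\]*)*""")
  | (?P<INLINE_STRING>"[^"\\]*(?:\\.[^"\\]*)*")
  | (?P<_NEWLINE>(?:\r?\n[\t ]*)+)
  | (?P<NAME>[A-Za-z_]\w*)
  | (?P<COLON>:)
  | (?P<WS>[\t \f]+)
''', re.VERBOSE)


def tokenize(text, tab_len=4):
    """Yield token type names, adding _INDENT/_DEDENT after newline tokens."""
    indent_stack = [0]
    pos = 0
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character {text[pos]!r} at {pos}")
        pos = match.end()
        kind = match.lastgroup
        if kind == "WS":
            continue  # inline whitespace is ignored, as in the grammar
        yield kind
        if kind != "_NEWLINE":
            continue
        # The indentation is the whitespace after the last line break.
        indent_str = match.group().rsplit("\n", 1)[1]
        indent = indent_str.count(" ") + indent_str.count("\t") * tab_len
        if indent > indent_stack[-1]:
            indent_stack.append(indent)
            yield "_INDENT"
        else:
            while indent < indent_stack[-1]:
                indent_stack.pop()
                yield "_DEDENT"
    # Close any blocks still open at end of input.
    while len(indent_stack) > 1:
        indent_stack.pop()
        yield "_DEDENT"


SNIPPET = (
    'field2 Datetime:\n'
    '    """Attributes should also have\n'
    '    multiline documentation"""\n'
    '\n'
    'field3 String "inline documentation also works"\n'
)

tokens = list(tokenize(SNIPPET))
print(" ".join(tokens))
# NAME NAME COLON _NEWLINE _INDENT MULTILINE_STRING _NEWLINE _DEDENT
# NAME NAME INLINE_STRING _NEWLINE   (printed on a single line)
```

In this stream the newline after the docstring comes before _DEDENT and the very next token is the NAME of field3, so a rule that demands one more _NEWLINE after the multiline branch of element_doc has nothing left to match.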

The second change follows from the first:

attribute_simple_type: attribute_name simple_type (element_doc|_NEWLINE)

Since element_doc now consumes the newline itself, the rule must not try to match it a second time.

Tags: grammar, lark-parser, python, parsing
Source: https://codeday.me/bug/20191108/2006587.html