Stanford Universal Dependencies with Python NLTK
Is there any way to get the Universal dependencies using Python or NLTK? I can only generate parse trees.
Example:
Input sentence:
My dog also likes eating sausage.
Output:
Universal dependencies
nmod:poss(dog-2, My-1)
nsubj(likes-4, dog-2)
advmod(likes-4, also-3)
root(ROOT-0, likes-4)
xcomp(likes-4, eating-5)
dobj(eating-5, sausage-6)
Solution:
Wordseer's stanford-corenlp-python fork is a good place to start, as it works with the latest CoreNLP release (3.5.2). However, it gives you raw output that you have to transform manually. For example, assuming the wrapper is running:
>>> import json, jsonrpclib
>>> from pprint import pprint
>>>
>>> server = jsonrpclib.Server("http://localhost:8080")
>>>
>>> pprint(json.loads(server.parse('John loves Mary.'))) # doctest: +SKIP
{u'sentences': [{u'dependencies': [[u'root', u'ROOT', u'0', u'loves', u'2'],
                                   [u'nsubj',
                                    u'loves',
                                    u'2',
                                    u'John',
                                    u'1'],
                                   [u'dobj', u'loves', u'2', u'Mary', u'3'],
                                   [u'punct', u'loves', u'2', u'.', u'4']],
                 u'parsetree': [],
                 u'text': u'John loves Mary.',
                 u'words': [[u'John',
                             {u'CharacterOffsetBegin': u'0',
                              u'CharacterOffsetEnd': u'4',
                              u'Lemma': u'John',
                              u'PartOfSpeech': u'NNP'}],
                            [u'loves',
                             {u'CharacterOffsetBegin': u'5',
                              u'CharacterOffsetEnd': u'10',
                              u'Lemma': u'love',
                              u'PartOfSpeech': u'VBZ'}],
                            [u'Mary',
                             {u'CharacterOffsetBegin': u'11',
                              u'CharacterOffsetEnd': u'15',
                              u'Lemma': u'Mary',
                              u'PartOfSpeech': u'NNP'}],
                            [u'.',
                             {u'CharacterOffsetBegin': u'15',
                              u'CharacterOffsetEnd': u'16',
                              u'Lemma': u'.',
                              u'PartOfSpeech': u'.'}]]}]}
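If all you need is the rel(head-index, dep-index) notation from the question, it can be printed straight off this raw output, since each dependency entry has the form [rel, head_word, head_index, dep_word, dep_index]. A minimal sketch, assuming the server above is still running (whether the labels are Universal dependencies or the older Stanford typed dependencies depends on the CoreNLP version behind the wrapper):
>>> deps = json.loads(server.parse('John loves Mary.'))['sentences'][0]['dependencies']
>>> for rel, head, head_n, word, n in deps:
...     print('{}({}-{}, {}-{})'.format(rel, head, head_n, word, n))
root(ROOT-0, loves-2)
nsubj(loves-2, John-1)
dobj(loves-2, Mary-3)
punct(loves-2, .-4)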
If you want to work with the dependency parses, with a little effort you can reuse NLTK's DependencyGraph:
>>> import jsonrpclib, json
>>> from pprint import pprint
>>> from nltk.parse import DependencyGraph
>>>
>>> server = jsonrpclib.Server("http://localhost:8080")
>>> parses = json.loads(
...     server.parse(
...         'John loves Mary. '
...         'I saw a man with a telescope. '
...         'Ballmer has been vocal in the past warning that Linux is a threat to Microsoft.'
...     )
... )['sentences']
>>>
>>> def transform(sentence):
...     for rel, _, head, word, n in sentence['dependencies']:
...         n = int(n)
...
...         word_info = sentence['words'][n - 1][1]
...         tag = word_info['PartOfSpeech']
...         lemma = word_info['Lemma']
...         if rel == 'root':
...             # NLTK expects that the root relation is labelled as ROOT!
...             rel = 'ROOT'
...
...         # Hack: Return values we don't know as '_'.
...         #       Also, consider tag and ctag to be equal.
...         # n is used to sort words as they appear in the sentence.
...         yield n, '_', word, lemma, tag, tag, '_', head, rel, '_', '_'
...
>>> dgs = [
...     DependencyGraph(
...         ' '.join(items)  # NLTK expects an iterable of strings...
...         for n, *items in sorted(transform(parse))
...     )
...     for parse in parses
... ]
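To see exactly what DependencyGraph is being fed, you can print the joined pseudo-CoNLL lines for the first sentence; the leading '_' fills the unused ID column, and n is only used for ordering. A small sketch (exact tags depend on your CoreNLP version):
>>> print('\n'.join(' '.join(items)
...                 for n, *items in sorted(transform(parses[0]))))
_ John John NNP NNP _ 2 nsubj _ _
_ loves love VBZ VBZ _ 0 ROOT _ _
_ Mary Mary NNP NNP _ 2 dobj _ _
_ . . . . _ 2 punct _ _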
>>>
>>> # Play around with the information we've got.
>>>
>>> pprint(list(dgs[0].triples()))
[(('loves', 'VBZ'), 'nsubj', ('John', 'NNP')),
 (('loves', 'VBZ'), 'dobj', ('Mary', 'NNP')),
 (('loves', 'VBZ'), 'punct', ('.', '.'))]
>>>
>>> print(dgs[1].tree())
(saw I (man a (with (telescope a))) .)
>>>
>>> print(dgs[2].to_conll(4)) # doctest: +NORMALIZE_WHITESPACE
Ballmer     NNP     4       nsubj
has         VBZ     4       aux
been        VBN     4       cop
vocal       JJ      0       ROOT
in          IN      4       prep
the         DT      8       det
past        JJ      8       amod
warning     NN      5       pobj
that        WDT     13      dobj
Linux       NNP     13      nsubj
is          VBZ     13      cop
a           DT      13      det
threat      NN      8       rcmod
to          TO      13      prep
Microsoft   NNP     14      pobj
.           .       4       punct
<BLANKLINE>
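Since to_conll(4) emits the standard 4-column word/tag/head/rel format, the graphs can also be saved and reloaded later without rerunning the server. A small round-trip sketch, assuming your NLTK version parses 4-column input with the ROOT label, as the code above already relies on:
>>> conll = dgs[2].to_conll(4)
>>> DependencyGraph(conll).tree() == dgs[2].tree()
True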
Setting up CoreNLP is not that hard; the JSON-RPC server used above is typically started by running the fork's corenlp.py script, which listens on port 8080 by default (check the fork's README, as entry points vary between versions). See http://www.eecs.qmul.ac.uk/~dm303/stanford-dependency-parser-nltk-and-anaconda.html for detailed instructions.
Tags: stanford-nlp, nlp, nltk, python