java: How to extract noun phrases from parsed text

Author: Internet

I parsed a text with a constituency parser and copied the result into a text file like this:

(ROOT (S (NP (NN Yesterday)) (, ,) (NP (PRP we)) (VP (VBD went) (PP (TO to)....
(ROOT (FRAG (SBAR (SBAR (IN While) (S (NP (PRP I)) (VP (VBD was) (NP (NP (EX...
(ROOT (S (NP (NN Yesterday)) (, ,) (NP (PRP I)) (VP (VBD went) (PP (TO to.....
(ROOT (FRAG (SBAR (SBAR (IN While) (S (NP (NNP Jim)) (VP (VBD was) (NP (NP (....
(ROOT (S (S (NP (PRP I)) (VP (VBD started) (S (VP (VBG talking) (PP.....

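I need to extract all noun phrases (NP) from this text file. I wrote the following code, which extracts only the first NP from each line, but I need to extract all of them. My code is: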

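import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;

public class nounPhrase {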

    public static int findClosingParen(char[] text, int openPos) {
        int closePos = openPos;
        int counter = 1;
        while (counter > 0) {
            char c = text[++closePos];
            if (c == '(') {

                counter++;
            }
            else if (c == ')') {
                counter--;
            }
        }
        return closePos;
    }

     public static void main(String[] args) throws IOException {

        ArrayList npList = new ArrayList ();
        String line;
        String line1;
        int np;

        String Input = "/local/Input/Temp/Temp.txt";

        String Output = "/local/Output/Temp/Temp-out.txt";  

        FileInputStream fis = new FileInputStream(Input);
        BufferedReader br = new BufferedReader(new InputStreamReader(fis, "UTF-8"));
        while ((line = br.readLine()) != null) {
            char[] lineArray = line.toCharArray();
            // indexOf finds only the first "(NP" on the line,
            // so only the first NP of each line is extracted
            np = findClosingParen(lineArray, line.indexOf("(NP"));
            line1 = line.substring(line.indexOf("(NP"), np + 1);
            System.out.print(line1 + "\n");
        }
        br.close();
    }
}

The output is:

(NP (NN Yesterday))...I need other NPs in this line also
(NP (PRP I)).....I need other NPs in this line also
(NP (NNP Jim)).....I need other NPs in this line also
(NP (PRP I)).....I need other NPs in this line also

My code uses the closing parenthesis to get only the first NP on each line, but I need to extract all NPs from the text.

Solution:

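Writing your own tree parser is a good exercise (!), but if you just want the result, the simplest approach is to use more of the Stanford NLP tools, namely Tregex, which is designed for exactly this kind of task. You can change the final while loop to the following: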

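// Needs these imports at the top of the file:
//   import edu.stanford.nlp.trees.Tree;
//   import edu.stanford.nlp.trees.tregex.TregexMatcher;
//   import edu.stanford.nlp.trees.tregex.TregexPattern;
TregexPattern tPattern = TregexPattern.compile("NP"); // matches nodes labelled NP
while ((line = br.readLine()) != null) {
    Tree t = Tree.valueOf(line); // read one bracketed tree per line
    TregexMatcher tMatcher = tPattern.matcher(t);
    while (tMatcher.find()) {
      // print each matching NP subtree, nested ones included
      System.out.println(tMatcher.getMatch());
    }
}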
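
If you would rather not add a library and want to try the "good exercise" route, the bracket-matching idea from the question can be extended to cover every NP on a line. The following is only a rough sketch, not a real tree parser: the class name AllNounPhrases, the method extractNounPhrases, and the sample sentence are made up for illustration, and it assumes that constituents are labelled exactly "NP" followed by a space and that parentheses in the sentence itself are escaped (e.g. as -LRB-/-RRB-).

```java
import java.util.ArrayList;
import java.util.List;

public class AllNounPhrases {

    // Returns the index of the ')' that closes the '(' at openPos,
    // or -1 if the brackets are unbalanced (for example, a truncated line).
    public static int findClosingParen(String text, int openPos) {
        int depth = 0;
        for (int i = openPos; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '(') {
                depth++;
            } else if (c == ')' && --depth == 0) {
                return i;
            }
        }
        return -1;
    }

    // Collects every bracketed constituent that starts with "(NP ",
    // including NPs nested inside other NPs.
    public static List<String> extractNounPhrases(String line) {
        List<String> result = new ArrayList<>();
        int start = line.indexOf("(NP ");
        while (start >= 0) {
            int end = findClosingParen(line, start);
            if (end < 0) {
                break; // unbalanced from here on, nothing more to extract
            }
            result.add(line.substring(start, end + 1));
            // Resume one character later so nested NPs are found as well.
            start = line.indexOf("(NP ", start + 1);
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical sample tree, not taken from the question's file.
        String line = "(ROOT (S (NP (PRP I)) (VP (VBD saw) "
                + "(NP (NP (DT the) (NN dog)) (PP (IN in) (NP (DT the) (NN park)))))))";
        for (String np : extractNounPhrases(line)) {
            System.out.println(np);
        }
    }
}
```

In the question's read loop you would call extractNounPhrases(line) for each line instead of the single indexOf lookup. The search resumes at start + 1 rather than after the closing bracket, which is what lets it pick up NPs nested inside an NP it has already reported.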

Tags: stanford-nlp, java
Source: https://codeday.me/bug/20190830/1765062.html