How to extract quotations from text (PHP)?
Hello!
I'd like to extract all quotations from a text. In addition, the name of the person being quoted should be extracted. DayLife does this very well.
Example:
“They think it’s ‘game over,’ ” one senior administration official said.
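The phrase “They think it’s ‘game over’” should be extracted, together with the cited person, “one senior administration official”.
Do you think that's possible? You can only distinguish between quotations and words that merely appear in quotation marks if you check whether a cited person is mentioned.
Example: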
“I think it is serious and it is deteriorating,” Admiral Mullen said Sunday on CNN’s “State of the Union” program.
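“State of the Union” is not a quotation. But how do you detect that?
a) You check whether a cited person is mentioned.
b) You count the blank spaces in the supposed quotation. If there are fewer than 3 blank spaces, it won't be a quotation, right?
I'd prefer b), since the cited person isn't always named.
How to start?
I would first replace all types of quotation marks with a single type, so that you only have to check for one quotation mark later.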
<?php
$text = '';
$quote_marks = array('“', '”', '„', '»', '«');
$text = str_replace($quote_marks, '"', $text);
?>
Then I would extract all phrases between quotation marks that contain more than 3 blank spaces:
<?php
function extract_quotations($text) {
    $result = preg_match_all('/"([^"]+)"/', $text, $found_quotations);
    if ($result == TRUE) {
        // check for count of blank spaces
        return $found_quotations;
    }
    return array();
}
?>
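The blank-space check that is only a comment in extract_quotations() could be filled in roughly as follows. This is a sketch, not a tested solution: the function name extract_long_quotations and the $threshold parameter are made up for illustration. Note that “State of the Union” contains exactly 3 blank spaces, so the comparison has to be strictly greater than 3 for the example above to come out right.

```php
<?php
// Hypothetical helper (name and $threshold are assumptions): keeps only
// quoted phrases with MORE than $threshold blank spaces, on the guess that
// shorter phrases are titles or single words rather than real quotations.
function extract_long_quotations($text, $threshold = 3) {
    // Normalize typographic quotation marks to a plain double quote first.
    $quote_marks = array('“', '”', '„', '»', '«');
    $text = str_replace($quote_marks, '"', $text);

    // Capture group 1 holds the text between two double quotes.
    if (!preg_match_all('/"([^"]+)"/', $text, $found)) {
        return array();
    }

    $quotations = array();
    foreach ($found[1] as $phrase) {
        $phrase = trim($phrase);
        if (substr_count($phrase, ' ') > $threshold) {
            $quotations[] = $phrase;
        }
    }
    return $quotations;
}

$text = '“I think it is serious and it is deteriorating,” Admiral Mullen said Sunday on CNN’s “State of the Union” program.';
print_r(extract_long_quotations($text));
```

With this input only the Admiral Mullen phrase survives. The heuristic is fragile, though: a four-word title or a one-word quotation would still be misclassified.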
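How could this be improved?
I hope you can help me. Thanks a lot in advance!
Solution:
As ceejayoz already pointed out, this won't fit into a single function. What you describe in your question (detecting the grammatical function of a quoted part of a sentence, i.e. “I think it is serious and it is deteriorating” versus “State of the Union”) would best be solved with a library that can break natural language down into tokens. I am not aware of any such library in PHP, but to get a sense of the size of such a project, have a look at what you would use in Python: http://www.nltk.org/
I think the best you can do is to define a set of syntax rules that you verify manually. How about something like this: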
abstract class QuotationExtractor {
    // Every extractor registers itself here when it is constructed.
    protected static $instances = array();

    // Runs $string through all registered extractors and merges the results.
    public static function getAllPossibleQuotations($string) {
        $possibleQuotations = array();
        foreach (self::$instances as $instance) {
            $possibleQuotations = array_merge(
                $possibleQuotations,
                $instance->extractQuotations($string)
            );
        }
        return $possibleQuotations;
    }

    public function __construct() {
        self::$instances[] = $this;
    }

    public abstract function extractQuotations($string);
}

class RegexExtractor extends QuotationExtractor {
    // Each rule is array(regex, index of the quote group, index of the cited-person group).
    protected $rules = array();

    public function extractQuotations($string) {
        $quotes = array();
        foreach ($this->rules as $rule) {
            preg_match_all($rule[0], $string, $matches, PREG_SET_ORDER);
            foreach ($matches as $match) {
                $quotes[] = array(
                    'quote' => trim($match[$rule[1]]),
                    'cited' => trim($match[$rule[2]])
                );
            }
        }
        return $quotes;
    }

    public function addRule($regex, $quoteIndex, $authorIndex) {
        $this->rules[] = array($regex, $quoteIndex, $authorIndex);
    }
}

$regexExtractor = new RegexExtractor();
// "quote," said somebody.
$regexExtractor->addRule('/"(.*?)[,.]?\h*"\h*said\h*(.*?)\./', 1, 2);
// "quote" somebody said
$regexExtractor->addRule('/"(.*?)\h*"(.*)said/', 1, 2);
// . Somebody (once) said "quote"
$regexExtractor->addRule('/\.\h*(.*)(once)?\h*said[\-]*"(.*?)"/', 3, 1);
class AnotherExtractor extends Quot...
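If you have a structure like the one above, you can run the same text through any or all of the extractors and list the possible quotations, then pick the correct ones. I ran the code with this thread as test input, and the result was: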
array(4) {
[0]=>
array(2) {
["quote"]=>
string(15) "Not necessarily"
["cited"]=>
string(8) "ceejayoz"
}
[1]=>
array(2) {
["quote"]=>
string(28) "They think it's `game over,'"
["cited"]=>
string(34) "one senior administration official"
}
[2]=>
array(2) {
["quote"]=>
string(46) "I think it is serious and it is deteriorating,"
["cited"]=>
string(14) "Admiral Mullen"
}
[3]=>
array(2) {
["quote"]=>
string(16) "Not necessarily,"
["cited"]=>
string(0) ""
}
}
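The dump above contains “Not necessarily” twice, once without a cited person, so some post-processing is needed before picking the correct quotations. A minimal sketch of one way to do it, assuming the candidate list has the same 'quote'/'cited' shape as the dump; the function name filter_candidates and the rule of dropping candidates without a cited person are assumptions, not part of the code above:

```php
<?php
// Hypothetical post-processing step: $candidates is a list of arrays with
// 'quote' and 'cited' keys, as in the dump above.
function filter_candidates(array $candidates) {
    $seen = array();
    $kept = array();
    foreach ($candidates as $candidate) {
        // Strip trailing blanks, commas and periods so that
        // "Not necessarily," and "Not necessarily" compare as equal.
        $quote = rtrim($candidate['quote'], " ,.");
        $cited = trim($candidate['cited']);
        // Skip candidates without a cited person and quotes already kept.
        if ($cited === '' || isset($seen[$quote])) {
            continue;
        }
        $seen[$quote] = true;
        $kept[] = array('quote' => $quote, 'cited' => $cited);
    }
    return $kept;
}

// The four candidates from the dump above.
$candidates = array(
    array('quote' => 'Not necessarily', 'cited' => 'ceejayoz'),
    array('quote' => "They think it's `game over,'", 'cited' => 'one senior administration official'),
    array('quote' => 'I think it is serious and it is deteriorating,', 'cited' => 'Admiral Mullen'),
    array('quote' => 'Not necessarily,', 'cited' => ''),
);
print_r(filter_candidates($candidates));
```

For this input three candidates remain. Dropping unattributed candidates conflicts with the wish in the question to also catch quotations whose speaker is not named, so that rule may need to be relaxed.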
Tags: php, regex, quotations
Source: https://codeday.me/bug/20190827/1742896.html