Why does PHP preg_match_all with PCRE_UTF8 give different results on the CLI and under Apache / mod_php?

Author: Internet

The following code produces different results when run through the CLI and through Apache / mod_php:

<pre>
<?php
error_reporting(E_ALL);
ini_set('display_errors', '1');

echo setlocale(LC_ALL, 0)."\n";
// echo setlocale(LC_ALL, "en_GB.UTF-8")."\n";

$terms = array
(
    //Always matches:
    "Label Generation",
    //Doesn't match when using u (PCRE_UTF8) modifier:
    "Receipt of Prescription and Validation of Patient Information",

);

$text       = "Some terms to match: ".implode(", ",$terms);
$pattern    = "/(".implode(")|(", $terms).")/is";
$regexps    = array
(
   "Unicode"     => $pattern."u", //Add u (PCRE_UTF8) modifier
   "Non-unicode" => $pattern
);

echo "Text:\n'$text'\n";

foreach($regexps as $type=>$regexp)
{
    $matches    = array();
    $total      = preg_match_all($regexp,$text,$matches);

    echo "\n\n";
    echo "$type regex:\n'$regexp'\n\n";
    echo "Total $type matches: ";
    var_dump($total);
    echo "\n$type matches: ";
    var_dump($matches[0]);
}
?>
</pre>

CLI output (correct):

<pre>
/en_GB.UTF-8/C/C/C/C/C
Text:
'Some terms to match: Label Generation, Receipt of Prescription and Validation of Patient Information'


Unicode regex:
'/(Label Generation)|(Receipt of Prescription and Validation of Patient Information)/isu'

Total Unicode matches: int(2)

Unicode matches: array(2) {
  [0]=>
  string(16) "Label Generation"
  [1]=>
  string(61) "Receipt of Prescription and Validation of Patient Information"
}


Non-unicode regex:
'/(Label Generation)|(Receipt of Prescription and Validation of Patient Information)/is'

Total Non-unicode matches: int(2)

Non-unicode matches: array(2) {
  [0]=>
  string(16) "Label Generation"
  [1]=>
  string(61) "Receipt of Prescription and Validation of Patient Information"
}
</pre>

Apache / mod_php web server output (incorrect: the second string is matched only when the /u modifier is not used):

<pre>
/en_GB.ISO8859-1/C/C/C/C/C
Text:
'Some terms to match: Label Generation, Receipt of Prescription and Validation of Patient Information'


Unicode regex:
'/(Label Generation)|(Receipt of Prescription and Validation of Patient Information)/isu'

Total Unicode matches: int(1)

Unicode matches: array(1) {
  [0]=>
  string(16) "Label Generation"
}


Non-unicode regex:
'/(Label Generation)|(Receipt of Prescription and Validation of Patient Information)/is'

Total Non-unicode matches: int(2)

Non-unicode matches: array(2) {
  [0]=>
  string(16) "Label Generation"
  [1]=>
  string(61) "Receipt of Prescription and Validation of Patient Information"
}
</pre>

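With the /u (PCRE_UTF8) option, the web server fails to match both strings; only the first one is found.
I tried setlocale(LC_ALL, "en_GB.UTF-8"); to make the web server locale match the CLI locale, where the script works, but it made no difference to the output.
I suspect a problem with the PCRE library, but I don't understand how the CLI and the web server can differ, since PHP reports the same library version in both environments:
PHP 5.4.14
PCRE (Perl Compatible Regular Expressions) Support => enabled
PCRE Library Version => 8.32 2012-11-30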
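
One way to compare the two environments more closely is to run the same small report through the CLI and through Apache and diff the output. The sketch below is only an illustration: `pcre_environment_report()` is a hypothetical helper name, and the set of values it collects is an assumption about what might differ between the two SAPIs, not a confirmed list of causes.

```php
<?php
// Hypothetical helper: gather settings that can differ between the CLI and mod_php.
function pcre_environment_report()
{
    return array(
        'sapi'            => php_sapi_name(),          // e.g. "cli" or an Apache handler name
        'php_version'     => PHP_VERSION,
        'pcre_version'    => PCRE_VERSION,             // version of the PCRE library PHP is using
        'locale'          => setlocale(LC_ALL, '0'),   // "0" only queries the current locale
        'ini_file'        => php_ini_loaded_file(),    // the CLI and Apache may load different files
        'pcre.jit'        => ini_get('pcre.jit'),      // false if this PHP has no such directive
        'backtrack_limit' => ini_get('pcre.backtrack_limit'),
        'recursion_limit' => ini_get('pcre.recursion_limit'),
    );
}

// Print one "key value" line per setting so the two outputs are easy to diff.
foreach (pcre_environment_report() as $key => $value) {
    echo str_pad($key, 18), var_export($value, true), "\n";
}
```

If the two reports differ only in `sapi` and `locale`, the difference is more likely to lie in how the library was built or loaded than in the PHP configuration.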

pcretest reports no UTF-8 support, yet the CLI version still produces the correct result:

<pre>
$> pcretest -C
PCRE version 8.32 2012-11-30
Compiled with
  8-bit support
  No UTF-8 support
  No Unicode properties support
  No just-in-time compiler support
  Newline sequence is LF
  \R matches all Unicode newlines
  Internal link size = 2
  POSIX malloc threshold = 10
  Default match limit = 10000000
  Default recursion depth limit = 10000000
  Match recursion uses stack
</pre>
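
Because `pcretest` describes the system PCRE library, while PHP may have been built against its own bundled copy, the output above does not necessarily describe the library that the CLI or mod_php actually uses. A more direct check is to ask PHP itself whether the `u` modifier works. The following is a minimal sketch under that assumption; `pcre_utf8_probe()` is a hypothetical name introduced here for illustration.

```php
<?php
// Hypothetical probe: does the PCRE build used by this PHP process handle /u?
function pcre_utf8_probe()
{
    $subject = "caf\xC3\xA9";   // "café" written as explicit UTF-8 bytes

    // With /u the dot consumes the two-byte character as one code point.
    // The @ suppresses the warning PHP raises if the pattern cannot be compiled.
    $result = @preg_match('/^caf.$/u', $subject);

    return array(
        'result'     => $result,            // 1 on match, 0 on no match, false on error
        'last_error' => preg_last_error(),  // PREG_NO_ERROR when nothing went wrong
    );
}

var_dump(PCRE_VERSION, pcre_utf8_probe());
```

Running this through both the CLI and the web server shows whether each of them has working UTF-8 support, independently of what `pcretest -C` prints.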

Solution:

This PHP setting helped me:

pcre.jit=0
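
For testing without editing php.ini, the same setting can probably be applied per request with `ini_set()`. The sketch below rests on two assumptions: that the running PHP knows the `pcre.jit` directive (the PHP manual lists it as available from PHP 7.0) and that it may be changed at runtime. The `preg_quote()` call is an addition that is not in the original script; it only makes sure the terms are treated as literal text.

```php
<?php
// ini_get() returns false when the running PHP does not know the directive.
$jitBefore = ini_get('pcre.jit');
if ($jitBefore !== false) {
    ini_set('pcre.jit', '0');   // intended to mirror pcre.jit=0, for this request only
}

$terms = array(
    "Label Generation",
    "Receipt of Prescription and Validation of Patient Information",
);
$text = "Some terms to match: " . implode(", ", $terms);

// Escape each term so regex metacharacters in it cannot change the pattern.
$quoted  = array_map(function ($term) {
    return preg_quote($term, '/');
}, $terms);
$pattern = "/(" . implode(")|(", $quoted) . ")/isu";

$matches = array();
$total   = preg_match_all($pattern, $text, $matches);

var_dump($jitBefore, ini_get('pcre.jit'), $total, $matches[0]);
```

If the match count under Apache changes once the JIT is switched off this way, that points to the JIT rather than to the locale or to UTF-8 support.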

Tags: preg-match-all, php, utf-8, pcre
Source: https://codeday.me/bug/20190831/1775460.html