PHP: splitting a large string into an array without breaking tags at the split points
Author: Internet
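I wrote a script that sends large blocks of text to Google for translation, but sometimes the text (which is HTML source) ends up being split in the middle of an HTML tag, and Google returns the code incorrectly.
I already know how to split a string into an array, but is there a better way to do it that ensures each output string is at most 5000 characters and is never split inside a tag?
Update: Thanks for the answers. This is the code I ended up using in my project, and it works well: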
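To make the problem concrete, here is a minimal sketch of what a plain fixed-width split does to markup. The sample HTML, the 20-byte limit, and the helper name ends_inside_tag are assumptions made up for this illustration; they are not part of the original script.

```php
<?php
// Hypothetical helper: true when the chunk stops after a '<' that has
// no matching '>' yet, i.e. the cut landed inside a tag.
function ends_inside_tag(string $chunk): bool
{
    $lastOpen = strrpos($chunk, '<');
    if ($lastOpen === false) {
        return false; // no tag starts in this chunk
    }
    $lastClose = strrpos($chunk, '>');
    return $lastClose === false || $lastClose < $lastOpen;
}

// Sample input and an artificially small limit, for demonstration only.
$html   = '<p>Hello <a href="https://example.com">world</a></p>';
$chunks = str_split($html, 20); // str_split() knows nothing about tags

foreach ($chunks as $i => $chunk) {
    echo $i, ': ', $chunk, ends_inside_tag($chunk) ? '   <-- cut inside a tag' : '', PHP_EOL;
}
```

The first chunk here ends partway through the href attribute, which is the kind of fragment a translation service cannot be expected to handle.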
function handleTextHtmlSplit($text, $maxSize) {
    // Our collection array, starting with one empty group
    $niceHtml = [''];
    // Splits on tags, but also includes each tag as an item in the result
    $pieces = preg_split('/(<[^>]*>)/', $text, -1, PREG_SPLIT_DELIM_CAPTURE);
    // The current position of the index
    $currentPiece = 0;
    // Start assembling a group until it gets to max size
    foreach ($pieces as $piece) {
        // Make sure the string length of this group will not exceed max size when the piece is inserted
        if (strlen($niceHtml[$currentPiece] . $piece) > $maxSize) {
            // Advance the current piece;
            // this will put the overflow into the next group
            $currentPiece += 1;
            // Create an empty string as the value for the next piece in the index
            $niceHtml[$currentPiece] = '';
        }
        // Insert the piece into our master array
        $niceHtml[$currentPiece] .= $piece;
    }
    // Return the array of nicely handled html
    return $niceHtml;
}
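As a quick check on what the preg_split call above hands to the loop, the following sketch prints the pieces for a tiny input. The sample string is an assumption chosen for illustration. Note that with PREG_SPLIT_DELIM_CAPTURE alone the result also contains empty strings wherever two tags touch or the input starts or ends with a tag; adding PREG_SPLIT_NO_EMPTY drops them.

```php
<?php
// Sample input, for illustration only.
$sample = '<p><b>x</b></p>';

// Same pattern and flag as in handleTextHtmlSplit: tags are kept as items.
$pieces = preg_split('/(<[^>]*>)/', $sample, -1, PREG_SPLIT_DELIM_CAPTURE);
var_export($pieces);
echo PHP_EOL;
// Pieces: '', '<p>', '', '<b>', 'x', '</b>', '', '</p>', ''

// Optional variant: also discard the empty strings.
$nonEmpty = preg_split(
    '/(<[^>]*>)/',
    $sample,
    -1,
    PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY
);
var_export($nonEmpty);
echo PHP_EOL;
// Pieces: '<p>', '<b>', 'x', '</b>', '</p>'
```

Either way, concatenating the pieces gives back the original string, so grouping whole pieces never cuts through a tag.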
Answer:
Note: I haven't had a chance to test this (so there may be a small bug or two), but it should give you the idea:
function get_groups_of_5000_or_less($input_string) {
    // Splits on tags, but also includes each tag as an item in the result
    $pieces = preg_split('/(<[^>]*>)/', $input_string,
        -1, PREG_SPLIT_DELIM_CAPTURE);
    $groups = [''];
    $current_group = 0;
    // Compare against null: preg_split can return empty-string pieces (and a
    // piece can be "0"), which a plain truthiness test would treat as the end
    while (($cur_piece = array_shift($pieces)) !== null) {
        if ($cur_piece === '') {
            continue; // nothing to add
        }
        $piecelen = strlen($cur_piece);
        if (strlen($groups[$current_group]) + $piecelen > 5000) {
            // Adding the next piece whole would go over the limit,
            // figure out what to do.
            if ($cur_piece[0] == '<') {
                // Tag goes over the limit, just put it into a new group
                $groups[++$current_group] = $cur_piece;
            } else {
                // Non-tag goes over the limit, split it and put the
                // remainder back on the list of un-grabbed pieces.
                // max() covers a group that already holds an oversized tag.
                $grab_amount = max(0, 5000 - strlen($groups[$current_group]));
                $groups[$current_group] .= substr($cur_piece, 0, $grab_amount);
                $groups[++$current_group] = '';
                array_unshift($pieces, substr($cur_piece, $grab_amount));
            }
        } else {
            // Adding this piece doesn't go over the limit, so just add it
            $groups[$current_group] .= $cur_piece;
        }
    }
    return $groups;
}
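Also note that this can split in the middle of regular words. If you don't want that, modify the part that begins with "// Non-tag goes over the limit" so that it chooses a better value for $grab_amount. I didn't bother coding that in, since this is only meant as an example of how to get around splitting tags, not as a drop-in solution.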
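One possible way to choose that better value is sketched below, on the assumption that cutting just after the last whitespace character is acceptable. The function name pick_grab_amount and its parameters are hypothetical and not part of the answer; it falls back to a hard cut when the text contains no whitespace within the limit.

```php
<?php
// Hypothetical helper: how many bytes of $text to take so that the cut
// does not exceed $limit and, where possible, falls right after whitespace.
function pick_grab_amount(string $text, int $limit): int
{
    if ($limit <= 0) {
        return 0; // no room left in the current group
    }
    if (strlen($text) <= $limit) {
        return strlen($text); // the whole text fits
    }
    // Look only at the part that would fit, and greedily match up to
    // (and including) its last whitespace character.
    $head = substr($text, 0, $limit);
    if (preg_match('/^.*\s/s', $head, $m)) {
        return strlen($m[0]);
    }
    return $limit; // one long word: fall back to a hard cut
}

// Example: a limit of 13 bytes would otherwise cut "foo" after its first letter.
$text = 'hello world foo';
$n    = pick_grab_amount($text, 13);
echo '[', substr($text, 0, $n), '] [', substr($text, $n), ']', PHP_EOL;
```

In the answer's function, the result would replace the fixed subtraction, for example by passing the non-tag piece and the remaining room in the current group. Like the rest of the code here, this counts bytes with strlen, so the hard-cut fallback can still land inside a multi-byte character.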
Tags: string-split, php. Source: https://codeday.me/bug/20191209/2097718.html