JavaScript replace() regex is too greedy

Author: Internet

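I'm trying to clean up an HTML input field. I want to keep some tags but not all of them, so I can't simply use .text() when reading the element's value. I'm having some trouble with a regular expression in JavaScript on Safari. Here is the code snippet (I copied this part of the regex from an answer in another SO thread):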

aString.replace (/<\s*a.*href=\"(.*?)\".*>(.*?)<\/a>/gi, '$2 (Link->$1)' ) ;

Here is a sample input on which it fails:

<a href="http://blar.pirates.net/black/ship.html">Go here please.</a></p><p class="p1"><a href="http://blar.pirates.net/black/ship.html">http://blar.pirates.net/black/ship.html</a></p>

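The idea is that the href is pulled out and written as plain text next to the text that was linked. So the output for the input above should end up looking like this: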

Go here please (Link->http://blar.pirates.net/black/ship.html)
http://blar.pirates.net/black/ship.html (Link->http://blar.pirates.net/black/ship.html)

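However, the regex keeps running on to the second </a> tag within the first match, so I lose the first line of output. (In fact, as long as the anchor elements are adjacent, it consumes as far down the list as it can.) The input is one long string, not split by CR/LF or anything else.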

I tried making a quantifier non-greedy like this (note the second question mark):

/<\s*a.*href=\"(.*?)\".*?>(.*?)<\/a>/ig

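But that didn't seem to change anything (at least in the few testers/parsers I tried, one of which is here: http://refiddle.com). I also tried the /U flag, but it didn't help (or those parsers didn't recognize it).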

Any suggestions?

Solution:

The pattern contains several mistakes, and some improvements are possible:

/<
\s*    # Not needed (browsers don't recognize "< a" as an "a" tag).

a      # To avoid confusing an "a" tag with the start of an "abbr" tag, you can
       # add a word boundary or, better, a "\s+", since there is at least one
       # whitespace character after the tag name.

.      # The dot matches everything except newlines, so if an "a" tag spans
       # several lines your pattern will fail. JavaScript originally had no
       # "singleline" or "dotall" mode (ES2018 added the "s" flag), so the
       # portable fix is to replace the dot with `[\s\S]`, which matches any
       # character (everything that is whitespace + everything that is not).

*      # Quantifiers are greedy by default. ".*" matches everything up to the end
       # of the line, and "[\s\S]*" matches everything up to the end of the string!
       # The regex engine then has to backtrack until it finds the last "href"
       # (which is not always the one you want).

href=  # You can add a word boundary before the "h" and allow optional spaces around
       # the equals sign to make your pattern more "waterproof": \bhref\s*=\s*

\"     # Doesn't need to be escaped. Also, as Markasoftware notes, an attribute
       # value is not always enclosed in double quotes. You can have single quotes
       # or no quotes at all. (1)
(.*?)
\"     # Same thing.
.*     # Same thing: greedy, so it matches everything up to the last ">".
>(.*?)<\/a>/gi
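
The greediness described in these comments can be checked against the sample input from the question. The following is a minimal sketch, not part of the original answer: the variable names and the `allLazy` variant are assumptions made for illustration. It suggests that the asker's second attempt changes nothing because the first `.*` (right after the `a`) is still greedy and still runs to the last `href`.

```javascript
// Sample input from the question: two adjacent anchors in one long string.
const input =
  '<a href="http://blar.pirates.net/black/ship.html">Go here please.</a></p><p class="p1">' +
  '<a href="http://blar.pirates.net/black/ship.html">http://blar.pirates.net/black/ship.html</a></p>';

// Original pattern from the question: every ".*" is greedy.
const greedy = /<\s*a.*href=\"(.*?)\".*>(.*?)<\/a>/gi;
// Second attempt from the question: only the ".*" before ">" is made lazy.
const secondAttempt = /<\s*a.*href=\"(.*?)\".*?>(.*?)<\/a>/gi;
// Illustrative variant (assumption): the first ".*" is made lazy as well.
const allLazy = /<\s*a.*?href=\"(.*?)\".*?>(.*?)<\/a>/gi;

// Count how many separate matches a global regex finds in the input.
const countMatches = (re) => (input.match(re) || []).length;

console.log(countMatches(greedy));        // 1 (both anchors swallowed by one match)
console.log(countMatches(secondAttempt)); // 1 (the first ".*" is still greedy)
console.log(countMatches(allLazy));       // 2 (one match per anchor)
```

Making every quantifier lazy is only the smallest change that separates the two anchors; it does not address the quoting and multi-line issues listed above.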

(1) -> About quotes and the href attribute value:

To handle single quotes, double quotes, or no quotes at all, you can use a capture group and a backreference:

\bhref\s*=\s*(["']?)([^"'\s>]*)\1

Details:

\bhref\s*=\s*
(["']?)     # capture group 1: contains a single quote, a double quote, or nothing
([^"'\s>]*) # capture group 2: everything that is not a quote (to stop before the
            # possible closing quote), not whitespace (URLs don't contain spaces,
            # although javascript code can), and not a ">" (to stop at the first
            # space or before the end of the tag when no quotes are used).
\1          # backreference to capture group 1
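
As a quick check of this subpattern, here is a small sketch that runs it against the three quoting styles. The sample tags, the example.com URLs, and the variable names are made up for illustration and are not from the original answer.

```javascript
// The href subpattern on its own (case-insensitive, no global flag needed here).
const hrefPattern = /\bhref\s*=\s*(["']?)([^"'\s>]*)\1/i;

// Hypothetical sample tags covering double quotes, single quotes, no quotes,
// and spaces around the equals sign.
const samples = [
  '<a href="http://example.com/a">',
  "<a href='http://example.com/b'>",
  '<a href=http://example.com/c>',
  '<a class="x" href = "http://example.com/d" target="_blank">',
];

// Group 1 is the quote character (possibly empty); group 2 is the URL.
const values = samples.map((tag) => {
  const m = hrefPattern.exec(tag);
  return m ? m[2] : null;
});

console.log(values);
```

In every case the URL ends up in group 2, which is why the replacement string has to be renumbered once this subpattern is used.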

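Note that this subpattern adds a capture group: group 1 is now the quote character, the href value is in capture group 2, and the content between the tags is in capture group 3. In the replacement string, change $1 to $2 and $2 to $3.

So you can write the pattern like this:

aString.replace(/<a\s+[\s\S]*?\bhref\s*=\s*(["']?)([^"'\s>]*)\1[^>]*>([\s\S]*?)<\/a>/gi,
               '$3 (Link->$2)');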
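
For reference, here is a minimal sketch of this pattern applied to the sample input from the question. The helper name `linksToText` and the variable names are hypothetical. Because capture group 1 holds the quote character, the sketch uses `$2` for the URL and `$3` for the link text.

```javascript
// Hypothetical helper name; the pattern is the one from the answer.
function linksToText(html) {
  return html.replace(
    /<a\s+[\s\S]*?\bhref\s*=\s*(["']?)([^"'\s>]*)\1[^>]*>([\s\S]*?)<\/a>/gi,
    '$3 (Link->$2)' // $1 = quote character, $2 = URL, $3 = link text
  );
}

// Sample input from the question.
const sampleHtml =
  '<a href="http://blar.pirates.net/black/ship.html">Go here please.</a></p><p class="p1">' +
  '<a href="http://blar.pirates.net/black/ship.html">http://blar.pirates.net/black/ship.html</a></p>';

const output = linksToText(sampleHtml);
console.log(output);
// Go here please. (Link->http://blar.pirates.net/black/ship.html)</p><p class="p1">http://blar.pirates.net/black/ship.html (Link->http://blar.pirates.net/black/ship.html)</p>
```

Each anchor is replaced separately, and the surrounding `</p><p class="p1">` markup is left untouched for whatever later cleanup step handles the other tags.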

Tags: javascript, non-greedy, regex
Source: https://codeday.me/bug/20190728/1563925.html