首页 > 其他分享> > 字符串题目：HTML 实体解析器

字符串题目：HTML 实体解析器

2022-01-06 18:02:00 作者：互联网

文章目录

题目
解法

题目

标题和出处

标题：HTML 实体解析器

出处：1410. HTML 实体解析器

难度

6 级

题目描述

要求

「HTML 实体解析器」是一种特殊的解析器，它将 HTML 代码作为输入，并用字符本身替换掉所有这些特殊的字符实体。

HTML 里这些特殊字符和它们对应的字符实体包括：

双引号：字符实体为 " \texttt{\"} "，对应的字符是 " \texttt{"} "。
单引号：字符实体为 ' \texttt{\'} '，对应的字符是 ’ \texttt{'} ’。
与符号：字符实体为 & \texttt{\&} &，对应对的字符是 & \texttt{\&} &。
大于号：字符实体为 > \texttt{\>} >，对应的字符是 > \texttt{>} >。
小于号：字符实体为 < \texttt{\<} <，对应的字符是 < \texttt{<} <。
斜线号：字符实体为 &frasl; \texttt{\&frasl;} &frasl;，对应的字符是 / \texttt{/} /。

给你输入字符串 text \texttt{text} text，请你实现一个 HTML 实体解析器，返回解析器解析后的结果。

示例

示例 1：

输入： text = "& is an HTML entity but &ambassador; is not." \texttt{text = "\& is an HTML entity but \&ambassador; is not."} text = "& is an HTML entity but &ambassador; is not."
输出： "& is an HTML entity but &ambassador; is not." \texttt{"\& is an HTML entity but \&ambassador; is not."} "& is an HTML entity but &ambassador; is not."
解释：解析器把字符实体 & \texttt{\&} & 用 & \texttt{\&} & 替换

示例 2：

输入： text = "and I quote: "..."" \texttt{text = "and I quote: \"...\""} text = "and I quote: "...""
输出： "and I quote: \"...\"" \texttt{"and I quote: \textbackslash"...\textbackslash""} "and I quote: \"...\""

示例 3：

输入： text = "Stay home! Practice on Leetcode :)" \texttt{text = "Stay home! Practice on Leetcode :)"} text = "Stay home! Practice on Leetcode :)"
输出： "Stay home! Practice on Leetcode :)" \texttt{"Stay home! Practice on Leetcode :)"} "Stay home! Practice on Leetcode :)"

示例 4：

输入： text = "x > y && x < y is always false" \texttt{text = "x \> y \&\& x \< y is always false"} text = "x > y && x < y is always false"
输出： "x > y && x < y is always false" \texttt{"x > y \&\& x < y is always false"} "x > y && x < y is always false"

示例 5：

输入： text = "leetcode.com&frasl;problemset&frasl;all" \texttt{text = "leetcode.com\&frasl;problemset\&frasl;all"} text = "leetcode.com&frasl;problemset&frasl;all"
输出： "leetcode.com/problemset/all" \texttt{"leetcode.com/problemset/all"} "leetcode.com/problemset/all"

数据范围

1 ≤ text.length ≤ 10 5 \texttt{1} \le \texttt{text.length} \le \texttt{10}^\texttt{5} 1≤text.length≤105
字符串可能包含 256 \texttt{256} 256 个ASCII 字符中的任意字符。

解法

思路和算法

HTML 中的每个特殊字符的字符实体都以 ‘ ‘ & " ``\&" ‘‘&" 开始，以 ‘ ‘ ; " ``;" ‘‘;" 结束，其余字符都是小写字母。对于输入字符串 text \textit{text} text，其中的每一个符合上述特点的子串都可能是特殊字符的字符实体。遍历输入字符串 text \textit{text} text，在遍历的同时解析特殊字符。

理想情况下， ‘ ‘ & " ``\&" ‘‘&" 和 ‘ ‘ ; " ``;" ‘‘;" 应该成对出现，但是实际情况不一定成对出现，例如在 ‘ ‘ ; " ``;" ‘‘;" 之前没有 ‘ ‘ & " ``\&" ‘‘&"，或者在 ‘ ‘ & " ``\&" ‘‘&" 之后没有 ‘ ‘ ; " ``;" ‘‘;"，或者在一个 ‘ ‘ ; " ``;" ‘‘;" 之前出现两个 ‘ ‘ & " ``\&" ‘‘&"，因此在从左到右遍历 text \textit{text} text 时，对于每个 ‘ ‘ ; " ``;" ‘‘;"，应该找到在这个 ‘ ‘ ; " ``;" ‘‘;" 之前的最后一个 ‘ ‘ & " ``\&" ‘‘&"（如果存在）作为字符实体的开始。

具体做法是，维护字符实体的开始下标 entityStart \textit{entityStart} entityStart，遍历过程中进行如下操作：

如果遇到 ‘ ‘ & " ``\&" ‘‘&"，当 entityStart ≥ 0 \textit{entityStart} \ge 0 entityStart≥0 时，需要首先将 text \textit{text} text 从 entityStart \textit{entityStart} entityStart 到当前下标前一个位置的字符拼接到解析后的结果，由于 ‘ ‘ & " ``\&" ‘‘&" 表示字符实体的开始，因此将 entityStart \textit{entityStart} entityStart 设为当前下标；
对于非 ‘ ‘ & " ``\&" ‘‘&" 的字符，当 entityStart ≥ 0 \textit{entityStart} \ge 0 entityStart≥0 时，如果遇到 ‘ ‘ ; " ``;" ‘‘;"，则遇到一个以 ‘ ‘ & " ``\&" ‘‘&" 开头和以 ‘ ‘ ; " ``;" ‘‘;" 结尾的子串，如果该子串是特殊字符的字符实体则将字符实体转换成对应的特殊字符拼接到解析后的结果，否则将该子串直接拼接到解析后的结果，然后令 entityStart = − 1 \textit{entityStart} = -1 entityStart=−1，表示上一个字符实体已经遍历完毕；
对于非 ‘ ‘ & " ``\&" ‘‘&" 的字符，当 entityStart < 0 \textit{entityStart} < 0 entityStart<0 时，当前字符不属于字符实体中的某个字符，因此将当前字符拼接到解析后的结果。

遍历结束后，如果 entityStart ≥ 0 \textit{entityStart} \ge 0 entityStart≥0，则将 text \textit{text} text 从 entityStart \textit{entityStart} entityStart 到末尾的字符拼接到解析后的结果。

代码

class Solution {
    public String entityParser(String text) {
        StringBuffer sb = new StringBuffer();
        int length = text.length();
        int entityStart = -1;
        for (int i = 0; i < length; i++) {
            char c = text.charAt(i);
            if (c == '&') {
                if (entityStart >= 0) {
                    sb.append(text.substring(entityStart, i));
                }
                entityStart = i;
            } else {
                if (entityStart >= 0) {
                    if (c == ';') {
                        String entity = text.substring(entityStart, i + 1);
                        String specialCharacter = getCharacter(entity);
                        sb.append(specialCharacter);
                        entityStart = -1;
                    }
                } else {
                    sb.append(c);
                }
            }
        }
        if (entityStart >= 0) {
            sb.append(text.substring(entityStart));
        }
        return sb.toString();
    }

    public String getCharacter(String entity) {
        switch (entity) {
        case "&quot;":
            return "\"";
        case "&apos;":
            return "\'";
        case "&amp;":
            return "&";
        case "&gt;":
            return ">";
        case "&lt;":
            return "<";
        case "&frasl;":
            return "/";
        default:
            return entity;
        }
    }
}

复杂度分析

时间复杂度： O ( n ) O(n) O(n)，其中 n n n 是字符串 text \textit{text} text 的长度。需要遍历字符串 text \textit{text} text 一次，完成解析和拼接解析后的结果。
空间复杂度： O ( n ) O(n) O(n)，其中 n n n 是字符串 text \textit{text} text 的长度。需要额外创建一个长度为 O ( n ) O(n) O(n) 的 StringBuffer \texttt{StringBuffer} StringBuffer 或 StringBuilder \texttt{StringBuilder} StringBuilder 类型的对象用于存储解析后的结果。由于 Java 中的 String \texttt{String} String 类型的对象不可变，因此空间复杂度至少为 O ( n ) O(n) O(n)。

标签：解析器,字符,entityStart,text,texttt,textit,HTML,字符串,amp
来源： https://blog.csdn.net/stormsunshine/article/details/120387943