Cannot set the character encoding in java.util.Scanner

Author: Internet

I use Apache Tika to detect the file's encoding.

            FileInputStream fis = new FileInputStream(my_file);
            final AutoDetectReader detector = new AutoDetectReader(fis);
            fis.close();
            System.out.println("Encoding:" + detector.getCharset().toString());

I use a Scanner to read values from the file.

                Scanner scanner = new Scanner(my_file, detector.getCharset().toString());
                Map<String, String> values = new HashMap<>();
                String line, key = null, value = null;
                while (scanner.hasNextLine()) {
                    line = scanner.nextLine();
                    if (line.contains(":")) {
                        if (key != null) {
                            values.put(key, value.trim());
                            key = null;
                            value = null;
                        }
                        int indexOfColon = line.indexOf(":");
                        key = line.substring(0, indexOfColon);
                        value = line.substring(indexOfColon + 1);
                    } else {
                        value += " " + line;
                    }
                }

The Scanner cannot read text from a file encoded as windows-1252; I get empty strings.
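
One thing that may help narrow this down: Scanner does not propagate an IOException from its source. It treats the error as end of input, so hasNextLine() simply returns false. The sketch below, which is only an assumption about how one might diagnose the problem and not a confirmed cause, reads a file with an explicit charset name and then asks ioException() whether anything was swallowed. The class name and the helpers readLines and writeTemp are hypothetical.

```java
import java.io.File;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class ScannerCharsetCheck {

    /** Reads all lines with an explicit charset name and fails loudly
     *  if the Scanner swallowed an IOException along the way. */
    static List<String> readLines(File file, String charsetName) {
        List<String> lines = new ArrayList<>();
        try (Scanner scanner = new Scanner(file, charsetName)) {
            while (scanner.hasNextLine()) {
                lines.add(scanner.nextLine());
            }
            // ioException() returns the last IOException thrown by the underlying source, or null
            IOException hidden = scanner.ioException();
            if (hidden != null) {
                throw new IllegalStateException("Scanner stopped early", hidden);
            }
        } catch (FileNotFoundException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }

    /** Hypothetical helper: writes raw bytes to a temporary file for the demo. */
    static File writeTemp(byte[] data) {
        try {
            Path p = Files.createTempFile("scanner-demo", ".txt");
            Files.write(p, data);
            return p.toFile();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // "café: crème" as windows-1252 bytes (0xE9 = é, 0xE8 = è)
        byte[] data = {'c', 'a', 'f', (byte) 0xE9, ':', ' ', 'c', 'r', (byte) 0xE8, 'm', 'e', '\n'};
        System.out.println(readLines(writeTemp(data), "windows-1252"));
    }
}
```

If the list comes back empty and no exception is raised, the file content itself is the next thing to inspect.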

Update 2018.11.07.
I have the same problem with BufferedReader.

                    Map<String, String> values = new HashMap<>();
                    String line, key = null, value = null;
                    FileInputStream is = new FileInputStream(my_file);
                    InputStreamReader isr = new InputStreamReader(is, getEncoding(my_file));
                    BufferedReader buffReader = new BufferedReader(isr);

                    // Read each line exactly once; calling readLine() twice per iteration skips every other line
                    while ((line = buffReader.readLine()) != null) {
                        if (line.contains(":")) {
                            if (key != null) {
                                values.put(key, value.trim());
                                key = null;
                                value = null;
                            }
                            int indexOfColon = line.indexOf(":");
                            key = line.substring(0, indexOfColon);
                            value = line.substring(indexOfColon + 1);
                        } else {
                            value += " " + line;
                        }
                    }
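
For comparison, here is a minimal sketch of the same key/value parsing with the charset passed explicitly to Files.newBufferedReader, which reports malformed input as an IOException instead of silently returning nothing. The class and method names are hypothetical, and the sketch assumes the format implied by the question: "key:value" lines, with lines lacking a colon continuing the previous value. Unlike the loops in the question, it also stores the last entry after the loop ends.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.Map;

class KeyValueReader {

    /** Parses "key:value" lines; a line without a colon continues the previous value. */
    static Map<String, String> parse(BufferedReader reader) {
        Map<String, String> values = new LinkedHashMap<>();
        String key = null;
        StringBuilder value = new StringBuilder();
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                int colon = line.indexOf(':');
                if (colon >= 0) {
                    if (key != null) {
                        values.put(key, value.toString().trim());
                    }
                    key = line.substring(0, colon);
                    value.setLength(0);
                    value.append(line.substring(colon + 1));
                } else if (key != null) {
                    value.append(' ').append(line);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // Store the final entry, which has no following "key:" line to trigger the put
        if (key != null) {
            values.put(key, value.toString().trim());
        }
        return values;
    }

    /** Opens the file with an explicit charset, e.g. Charset.forName("windows-1252"). */
    static Map<String, String> parse(Path file, Charset charset) {
        try (BufferedReader reader = Files.newBufferedReader(file, charset)) {
            return parse(reader);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```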

Solution:

I would try reading characters instead of lines, along these lines:

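import java.io.BufferedInputStream;
import java.io.ByteArrayOutputStream;
import java.io.FileInputStream;
import java.io.InputStream;

ByteArrayOutputStream line = new ByteArrayOutputStream();

// Read raw bytes from a stream: Scanner.nextInt() parses integer tokens, not bytes
try (InputStream in = new BufferedInputStream(new FileInputStream(my_file))) {
    int c;
    // read every line
    while ((c = in.read()) != -1) {
        if (c != '\n') {
            line.write(c); // collect the bytes of the current line
            continue;
        }
        String output = new String(line.toByteArray(), "windows-1252"); // This should do the trick

        // We have a string here, do your logic (trim() removes a trailing '\r')

        line.reset();
    }
    // A last line without a trailing newline is still in `line` here; decode it the same way
}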

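This approach is ugly, but it uses new String, which lets you specify the encoding explicitly. I have not tested or run this code at all, but at least it will tell you whether anything is actually being read.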
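
A more compact variant of the same idea, sketched under the assumption that the file is small enough to fit in memory: read all bytes at once, decode them with new String and an explicit Charset, then split on line breaks. new String(byte[], Charset) replaces malformed input rather than throwing, so it always yields text to inspect. The class and method names are hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

class WholeFileDecode {

    /** Decodes raw bytes with the given charset and splits the text into lines. */
    static List<String> decodeLines(byte[] raw, Charset charset) {
        String text = new String(raw, charset);
        // \R matches any line break, including "\r\n" as a single unit
        return Arrays.asList(text.split("\\R"));
    }

    /** Reads the whole file into memory, then decodes it. */
    static List<String> readLines(Path file, Charset charset) {
        try {
            return decodeLines(Files.readAllBytes(file), charset);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Usage might look like `WholeFileDecode.readLines(my_file.toPath(), Charset.forName("windows-1252"))`, assuming my_file is a java.io.File as in the question.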

Tags: java, java-util-scanner, apache-tika
Source: https://codeday.me/bug/20190710/1424745.html