
Tesseract-OCR 4.1.0 Installation and Usage on Windows and CentOS (with Java Source Code)

Author: Internet

 

As of this writing (2019.12.25), the latest stable release of tesseract-ocr is 4.1.0. tesseract-ocr depends on leptonica, whose latest stable release at the time of writing is 1.78.0.

Testing led to the following conclusions:

 

When reposting, please credit the source: https://www.cnblogs.com/NaughtyCat/p/how-to-install-tesseract-ocr-on-windows-and-centos.html

(1) On Windows, download the installer from the link below and install it; details are on that page:

https://github.com/UB-Mannheim/tesseract/wiki

(2) To configure the environment variables (the same way as for Java) and add TESSDATA_PREFIX, see:

https://www.cnblogs.com/jianqingwang/p/6978724.html
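To illustrate what the TESSDATA_PREFIX setting is used for, here is a minimal sketch that resolves the tessdata directory from an environment map and falls back to a default path when the variable is unset. The class name `TessdataLocator`, the method `resolve`, and the fallback path are hypothetical. It also assumes that, for Tesseract 4.x, TESSDATA_PREFIX points at the tessdata directory itself; check this against your own installation.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Map;

class TessdataLocator {
    // Hypothetical helper: returns the directory named by TESSDATA_PREFIX,
    // or the given fallback when the variable is missing or blank.
    // Assumption: TESSDATA_PREFIX names the tessdata directory itself.
    static Path resolve(Map<String, String> env, Path fallback) {
        String prefix = env.get("TESSDATA_PREFIX");
        if (prefix == null || prefix.trim().isEmpty()) {
            return fallback;
        }
        return Paths.get(prefix.trim());
    }

    public static void main(String[] args) {
        // The fallback is the Windows install path used later in this article.
        Path fallback = Paths.get("C:/Program Files/Tesseract-OCR/tessdata");
        System.out.println("tessdata directory: " + resolve(System.getenv(), fallback));
    }
}
```

Passing the environment as a map rather than calling `System.getenv()` inside the helper keeps the lookup easy to exercise without changing the real environment.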

Note that you also need to download the trained data files (traineddata):

https://github.com/tesseract-ocr/tessdata

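For Chinese, pick the following four:

chi_sim.traineddata (Simplified Chinese; for the SimSun typeface at a resolution of at least 300 dpi the recognition rate reaches 100%, and the recognition rate for English letters and Arabic numerals is above 90%)
chi_sim_vert.traineddata (Simplified Chinese, vertical text)
chi_tra.traineddata (Traditional Chinese)
chi_tra_vert.traineddata (Traditional Chinese, vertical text)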
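Before running OCR it can help to confirm that these files actually landed in the tessdata directory. The sketch below reports which of the four Chinese files are missing; the class name `TrainedDataCheck`, the method `missing`, and the default directory are hypothetical and should be adapted to your setup.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class TrainedDataCheck {
    // The four Chinese language codes listed above.
    static final List<String> CHINESE_LANGS =
            Arrays.asList("chi_sim", "chi_sim_vert", "chi_tra", "chi_tra_vert");

    // Returns the names of the <lang>.traineddata files not found in tessdataDir.
    static List<String> missing(Path tessdataDir, List<String> langs) {
        List<String> result = new ArrayList<>();
        for (String lang : langs) {
            String fileName = lang + ".traineddata";
            if (!Files.isRegularFile(tessdataDir.resolve(fileName))) {
                result.add(fileName);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Assumed default location; pass another directory as the first argument.
        Path dir = Paths.get(args.length > 0 ? args[0] : "C:/Program Files/Tesseract-OCR/tessdata");
        System.out.println("Missing files: " + missing(dir, CHINESE_LANGS));
    }
}
```

An empty list means all four files are present in the directory that was checked.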

(1) On CentOS, download the Leptonica and Tesseract source code

wget http://www.leptonica.org/source/leptonica-1.78.0.tar.gz
wget -O tesseract-ocr-4.1.0.tar.gz https://github.com/tesseract-ocr/tesseract/archive/4.1.0.tar.gz

(2) Configure, build, and install

 $ tar xzvf leptonica-1.78.0.tar.gz      
 $ cd leptonica-1.78.0     
 $ ./configure
 $ make
 $ sudo make install

 $ tar xzf tesseract-ocr-4.1.0.tar.gz
 $ cd tesseract-4.1.0
 $ ./autogen.sh
 $ ./configure
 $ make
 $ sudo make install
 $ sudo ldconfig

(3) Download the language pack and copy it into tessdata

$ wget http://tesseract-ocr.googlecode.com/files/tesseract-ocr-3.02.eng.tar.gz       
$ tar xzf tesseract-ocr-3.02.eng.tar.gz       
$ sudo cp tesseract-ocr/tessdata/* /usr/local/share/tessdata
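Once the build and the language data are in place, a quick smoke test can confirm that the `tesseract` binary runs and can see its languages. The sketch below shells out to the command-line tool from Java. The class name `TesseractCli` and its methods are hypothetical, and it assumes `tesseract` is on the PATH. It relies on the CLI form `tesseract <image> <outputbase> -l <lang>`, using `stdout` as the output base so that the text is printed rather than written to a file.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

class TesseractCli {
    // Builds the command: tesseract <image> stdout -l <lang>
    static List<String> ocrCommand(String imagePath, String lang) {
        return Arrays.asList("tesseract", imagePath, "stdout", "-l", lang);
    }

    // Runs a command and returns its combined stdout and stderr as text.
    static String run(List<String> command) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command).redirectErrorStream(true).start();
        StringBuilder output = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(process.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                output.append(line).append(System.lineSeparator());
            }
        }
        process.waitFor();
        return output.toString();
    }

    public static void main(String[] args) throws Exception {
        // Version and installed languages; both need tesseract on the PATH.
        System.out.print(run(Arrays.asList("tesseract", "--version")));
        System.out.print(run(Arrays.asList("tesseract", "--list-langs")));
        if (args.length > 0) {
            // Optional: OCR the image given as the first argument, in English.
            System.out.print(run(ocrCommand(args[0], "eng")));
        }
    }
}
```

If `--list-langs` does not show the language you copied, the data files are probably not in the directory that Tesseract is reading.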

For a detailed description and test results, see: https://ocr.space/blog/2015/03/best-ocr-software-for-chinese.html

For the test images, see: https://github.com/A9T9/OCR-Benchmark

For training, see the official wiki: https://github.com/tesseract-ocr/tesseract/wiki/Training-Tesseract-3.00%E2%80%933.02

(1) The source code is as follows (it supports recognizing multiple images)

import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import java.util.Arrays;
import java.util.List;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.ocr.TesseractOCRConfig;
import org.apache.tika.parser.ocr.TesseractOCRParser;
import org.apache.tika.sax.BodyContentHandler;
import org.junit.Test;

// The original class declaration was not shown; this class name is a placeholder.
public class TesseractOcrTest {

    @Test
    public void testCode() throws Exception {
        // Test images, expected under E:/tika/testData
        List<String> fileNames = Arrays.asList(
                "chi_eng.png", "chi_eng01.png", "chi_old.png",
                "chi-scan-75dpi.jpg", "chi-scan-100dpi.jpg", "chi-scan-300dpi.jpg",
                "chi-smartphone.jpg", "chi-subtitle-v1.jpg", "english00.png",
                "pdf_shaomiao.png", "test.tiff", "weather.png");

        TesseractOCRParser parser = new TesseractOCRParser();

        TesseractOCRConfig config = new TesseractOCRConfig();
        // Use the Simplified Chinese trained data
        config.setLanguage("chi_sim");
        // Tesseract installation path
        config.setTesseractPath("C:/Program Files/Tesseract-OCR");
        // Trained data (tessdata) path
        config.setTessdataPath("C:/Program Files/Tesseract-OCR/tessdata");

        ParseContext context = new ParseContext();
        context.set(TesseractOCRConfig.class, config);
        context.set(TesseractOCRParser.class, parser);

        for (String fileName : fileNames) {
            File file = new File("E:/tika/testData", fileName);
            if (!file.exists()) {
                System.err.println("Skipping missing file: " + file);
                continue;
            }
            // Use a fresh handler and metadata object for every image
            BodyContentHandler handler = new BodyContentHandler();
            Metadata metadata = new Metadata();
            try (InputStream stream = new FileInputStream(file)) {
                parser.parse(stream, handler, metadata, context);
                // Print the recognized text instead of discarding it
                System.out.println("==== " + fileName + " ====");
                System.out.println(handler.toString());
            } catch (Exception e) {
                // Report the failure instead of silently swallowing it
                System.err.println("OCR failed for " + fileName + ": " + e.getMessage());
            }
        }
    }
}

(2) Original image and result

 

The conversion result is shown in the figure below:

References:

1) https://stackoverflow.com/questions/23792373/installing-tesseract-ocr-on-centos-6

2) http://www.zmonster.me/2015/04/17/tesseract-install-usage.html
