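Writing a Web Crawler in Python on Linux (Part 1)
Author: Internet
Reference book: 《Python3 网络爬虫开发实战》, first edition, April 2018
System: Ubuntu 18.04.2 LTS
Background: Tesseract and the multi-language data pack tessdata are already installed
Install command: pip3 install tesserocr pillow
Error output: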
Collecting tesserocr
Using cached https://files.pythonhosted.org/packages/92/2d/05a7f8387e93c192919b508e4f4936f232bd3d2ca388b9130ae538a9f9ad/tesserocr-2.4.0.tar.gz
Collecting pillow
Using cached https://files.pythonhosted.org/packages/d2/c2/f84b1e57416755e967236468dcfb0fad7fd911f707185efc4ba8834a1a94/Pillow-6.0.0-cp36-cp36m-manylinux1_x86_64.whl
Building wheels for collected packages: tesserocr
Running setup.py bdist_wheel for tesserocr ... error
Complete output from command /usr/bin/python3 -u -c "import setuptools, tokenize;__file__='/tmp/pip-build-n7t6st2b/tesserocr/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" bdist_wheel -d /tmp/tmpn73hfamcpip-wheel- --python-tag cp36:
Supporting tesseract v4.0.0-beta.1
Configs from pkg-config: {'include_dirs': ['/usr/include'], 'libraries': ['lept', 'tesseract'], 'cython_compile_time_env': {'TESSERACT_VERSION': 60397825}}
/usr/lib/python3.6/distutils/dist.py:261: UserWarning: Unknown distribution option: 'long_description_content_type'
warnings.warn(msg)
running bdist_wheel
running build
running build_ext
building 'tesserocr' extension
creating build
creating build/temp.linux-x86_64-3.6
x86_64-linux-gnu-gcc -pthread -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -I/usr/include -I/usr/include/python3.6m -c tesserocr.cpp -o build/temp.linux-x86_64-3.6/tesserocr.o -std=c++11 -DUSE_STD_NAMESPACE
tesserocr.cpp:42:10: fatal error: Python.h: No such file or directory
#include "Python.h"
^~~~~~~~~~
compilation terminated.
error: command 'x86_64-linux-gnu-gcc' failed with exit status 1
----------------------------------------
Failed building wheel for tesserocr
Running setup.py clean for tesserocr
Failed to build tesserocr
Installing collected packages: tesserocr, pillow
Running setup.py install for tesserocr ... error
Complete output from command /usr/bin/python3 -u -c "import setuptools, tokenize;__file__='/tmp/pip-build-n7t6st2b/tesserocr/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record /tmp/pip-7bsa_hbd-record/install-record.txt --single-version-externally-managed --compile --user --prefix=:
Supporting tesseract v4.0.0-beta.1
Configs from pkg-config: {'include_dirs': ['/usr/include'], 'libraries': ['lept', 'tesseract'], 'cython_compile_time_env': {'TESSERACT_VERSION': 60397825}}
/usr/lib/python3.6/distutils/dist.py:261: UserWarning: Unknown distribution option: 'long_description_content_type'
warnings.warn(msg)
running install
running build
running build_ext
building 'tesserocr' extension
creating build
creating build/temp.linux-x86_64-3.6
x86_64-linux-gnu-gcc -pthread -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -I/usr/include -I/usr/include/python3.6m -c tesserocr.cpp -o build/temp.linux-x86_64-3.6/tesserocr.o -std=c++11 -DUSE_STD_NAMESPACE
tesserocr.cpp:42:10: fatal error: Python.h: No such file or directory
#include "Python.h"
^~~~~~~~~~
compilation terminated.
error: command 'x86_64-linux-gnu-gcc' failed with exit status 1
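The build in the log above stops at Python.h, the header that ships with the CPython development files rather than with the interpreter itself. The following is a minimal sketch for checking whether that header exists for the interpreter you are running. The helper name find_python_header is made up for illustration, and the hint about python3-dev assumes a stock Ubuntu Python.

```python
import os
import sysconfig


def find_python_header():
    """Return the path to Python.h for this interpreter, or None if it is missing."""
    # sysconfig reports the directory where this interpreter expects its C headers.
    include_dir = sysconfig.get_paths()["include"]
    header = os.path.join(include_dir, "Python.h")
    return header if os.path.isfile(header) else None


if __name__ == "__main__":
    header = find_python_header()
    if header:
        print("Found", header)
    else:
        # Assumption: on Ubuntu the header is provided by the python3-dev package.
        print("Python.h not found; C extensions such as tesserocr cannot be compiled")
```

If the function returns None, pip has no way to compile tesserocr.cpp, which matches the fatal error shown in the log.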
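Solution: install Tesseract with the following command instead:
sudo apt install tesseract-ocr
Note that the fatal error in the log, "Python.h: No such file or directory", means the Python development headers are missing. On Ubuntu they are provided by the python3-dev package (sudo apt install python3-dev), which pip needs in order to compile tesserocr.
(PS: The command here differs from the one in the book, possibly because this version is not the "Pillow friendly" version.)
(PPS: Pillow is the friendly PIL fork by Alex Clark and Contributors. PIL is the Python Imaging Library by Fredrik Lundh and Contributors.)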
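After retrying the installation, it may help to confirm that both packages can actually be imported. This is a rough sketch rather than part of the original post: check_modules is a hypothetical helper, and the final step only runs when tesserocr and Pillow are both present.

```python
import importlib.util


def check_modules(names=("tesserocr", "PIL")):
    """Map each module name to True if it can be located, False otherwise."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    status = check_modules()
    for name, available in status.items():
        print(name, "is available" if available else "is MISSING")

    if all(status.values()):
        import tesserocr

        # Report which Tesseract build and which language packs tesserocr sees.
        print(tesserocr.tesseract_version())
        print(tesserocr.get_languages())
```

Pillow installs under the module name PIL, which is why the check looks for PIL rather than pillow.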
The original text follows:
Linux
To install Tesseract 4.x you can simply run the following command on your Ubuntu 18.xx bionic:
sudo apt install tesseract-ocr
If you wish to install the Developer Tools which can be used for training, run the following command:
sudo apt install libtesseract-dev
The following instructions are for building on Linux, which also can be applied to other UNIX like operating systems.
Dependencies
A compiler for C and C++: GCC or Clang
GNU Autotools: autoconf, automake, libtool
pkg-config
Leptonica
libpng, libjpeg, libtiff
Ubuntu
If they are not already installed, you need the following libraries (Ubuntu 16.04/14.04):
sudo apt-get install g++ # or clang++ (presumably)
sudo apt-get install autoconf automake libtool
sudo apt-get install pkg-config
sudo apt-get install libpng-dev
sudo apt-get install libjpeg8-dev
sudo apt-get install libtiff5-dev
sudo apt-get install zlib1g-dev
if you plan to install the training tools, you also need the following libraries:
sudo apt-get install libicu-dev
sudo apt-get install libpango1.0-dev
sudo apt-get install libcairo2-dev
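Before building from source, the command-line tools in the dependency list above can be checked with a short script. This is only a sketch: missing_tools is a hypothetical helper, the executable names are assumptions about what the Ubuntu packages put on PATH (for example, the libtool package is assumed to provide libtoolize), and the development libraries such as libpng-dev are not covered by this check.

```python
import shutil

# Assumed executable names for the build tools listed above; adjust for your system.
BUILD_TOOLS = ["g++", "autoconf", "automake", "libtoolize", "pkg-config"]


def missing_tools(tools=BUILD_TOOLS):
    """Return the tools from the list that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing build tools:", ", ".join(missing))
    else:
        print("All listed build tools were found on PATH")
```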
Original source: https://github.com/tesseract-ocr/tesseract/wiki/Compiling
Source: https://www.cnblogs.com/chowkaiyat/p/10958834.html