python – Extracting text from highlighted annotations in a PDF file
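Since yesterday I have been trying to extract the text from some highlighted annotations in a PDF using python-poppler-qt4.
According to the documentation, it looks like I have to get the text with the Page.text() method, passing it a Rectangle argument taken from the highlighted annotation, which I obtain with Annotation.boundary(). But I only get blank text. Can someone help me? I have copied my code below and added a link to the PDF I am using. Thanks for your help!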
import popplerqt4
import sys
import PyQt4


def main():
    doc = popplerqt4.Poppler.Document.load(sys.argv[1])
    total_annotations = 0
    for i in range(doc.numPages()):
        page = doc.page(i)
        annotations = page.annotations()
        if len(annotations) > 0:
            for annotation in annotations:
                if isinstance(annotation, popplerqt4.Poppler.Annotation):
                    total_annotations += 1
                    if(isinstance(annotation, popplerqt4.Poppler.HighlightAnnotation)):
                        print str(page.text(annotation.boundary()))
    if total_annotations > 0:
        print str(total_annotations) + " annotation(s) found"
    else:
        print "no annotations found"

if __name__ == "__main__":
    main()
Test PDF:
https://www.dropbox.com/s/10plnj67k9xd1ot/test.pdf
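Solution:
Looking at the documentation for Annotations, it seems that the boundary property returns the annotation's bounding rectangle in normalized coordinates. That presumably explains the blank text: Page.text() is given a rectangle that covers only a tiny corner of the page. Although the normalized coordinates seem odd, we can simply scale them by the page.pageSize().width() and .height() values.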
import popplerqt4
import sys
import PyQt4


def main():
    doc = popplerqt4.Poppler.Document.load(sys.argv[1])
    total_annotations = 0
    for i in range(doc.numPages()):
        #print("========= PAGE {} =========".format(i+1))
        page = doc.page(i)
        annotations = page.annotations()
        # Page size, used to scale the normalized annotation coordinates
        (pwidth, pheight) = (page.pageSize().width(), page.pageSize().height())
        if len(annotations) > 0:
            for annotation in annotations:
                if isinstance(annotation, popplerqt4.Poppler.Annotation):
                    total_annotations += 1
                    if(isinstance(annotation, popplerqt4.Poppler.HighlightAnnotation)):
                        quads = annotation.highlightQuads()
                        txt = ""
                        for quad in quads:
                            # Scale two opposite corners of the quad to page coordinates
                            rect = (quad.points[0].x() * pwidth,
                                    quad.points[0].y() * pheight,
                                    quad.points[2].x() * pwidth,
                                    quad.points[2].y() * pheight)
                            bdy = PyQt4.QtCore.QRectF()
                            bdy.setCoords(*rect)
                            # Append the text of this quad, followed by a space
                            txt = txt + unicode(page.text(bdy)) + ' '
                        #print("========= ANNOTATION =========")
                        print(unicode(txt))
    if total_annotations > 0:
        print str(total_annotations) + " annotation(s) found"
    else:
        print "no annotations found"

if __name__ == "__main__":
    main()
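In addition, I decided to concatenate the text from .highlightQuads() to better represent what was actually highlighted.
Note the explicit space that I append to the text of each quad region.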
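The scaling step can also be looked at in isolation. The following is a minimal sketch in plain Python, with no Poppler or Qt dependency, assuming a quad is given as four (x, y) pairs in normalized coordinates. The helper name quad_to_rect and the sample numbers are made up for illustration, and taking the min/max over all four corners is a substitution for the fixed points[0] and points[2] indices used in the code above, so that the result does not depend on the corner order.

```python
def quad_to_rect(points, page_width, page_height):
    """Scale a quad given in normalized (0..1) coordinates to page units.

    `points` is a sequence of four (x, y) pairs. The result is a tuple
    (left, top, right, bottom), the argument order of QRectF.setCoords().
    """
    xs = [x * page_width for x, _ in points]
    ys = [y * page_height for _, y in points]
    # min/max instead of fixed corner indices: the bounding rectangle is
    # the same whichever corner the quad happens to list first.
    return (min(xs), min(ys), max(xs), max(ys))


# Hypothetical quad on a hypothetical 600 x 800 page.
quad = [(0.25, 0.5), (0.75, 0.5), (0.75, 0.625), (0.25, 0.625)]
rect = quad_to_rect(quad, 600, 800)
```

Inside the loop above, the points could be collected with [(p.x(), p.y()) for p in quad.points] and the result passed on with bdy.setCoords(*rect).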
In the example document, the returned QString could not be passed directly to print() or str(); the workaround is to use unicode() instead.
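The conversion and the space-joining can be wrapped in two small helpers. This is only a sketch: the names text_type, to_text and join_fragments are hypothetical, the fallback to str is there so the snippet also runs on Python 3 (where unicode does not exist), and it assumes the value returned by page.text() can be converted by calling the text type on it, as the answer's code does with unicode(). Unlike the loop above, the join does not leave a trailing space.

```python
try:
    text_type = unicode  # Python 2: the built-in unicode type
except NameError:
    text_type = str  # Python 3: str is already a Unicode string


def to_text(value):
    """Convert a string-like value to a Unicode string."""
    return text_type(value)


def join_fragments(fragments):
    """Join per-quad text fragments with single spaces, skipping empty ones."""
    parts = [to_text(fragment).strip() for fragment in fragments]
    return u" ".join(part for part in parts if part)


# Plain strings stand in here for the values returned by page.text().
joined = join_fragments(["first line of", "", " the highlight "])
```

In the answer's loop, each page.text(bdy) result would be appended to a list and the list passed to join_fragments once per annotation.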
I hope this helps you as much as it helped me.
Note: page rotation may affect the scaling values; I was not able to test this.
Tags: python, pdf, qt, poppler. Source: https://codeday.me/bug/20190831/1773610.html