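[OCR Technology Series, Part 8] End-to-End Variable-Length Text Recognition: CRNN Code Implementation
Author: Internet
CRNN is a classic and widely used recognition algorithm in OCR. Its theoretical background is covered in my previous article; this post focuses on the CRNN code implementation and the recognition results.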
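Data processing
We synthesized a large batch of text images with image-processing techniques, about 3.6 million samples in total. They look like this:
We split the data into a training set and a test set (10:1) and stored each split as its own text file:
The label format inside the text files is as follows: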
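The original illustration of the label format is not reproduced here. Judging from the conversion script further down, which takes `line.split()[0]` as the image path and `line.split()[1]` as the label, each line appears to hold a path and a label separated by whitespace. A minimal parsing sketch under that assumption (the helper name and the sample lines are made up for illustration):

```python
def parse_label_line(line):
    """Split one line of the label file into (image_path, label).

    Assumes the layout '<image_path> <label>' with no whitespace inside
    the label, which is what line.split()[0] / line.split()[1] imply.
    """
    parts = line.split()
    if len(parts) < 2:
        raise ValueError('expected "<path> <label>", got: %r' % line)
    return parts[0], parts[1]


# Hypothetical sample lines, not taken from the real dataset.
sample_lines = [
    '/images/000001.jpg 你好世界\n',
    '/images/000002.jpg OCR2019\r\n',
]
pairs = [parse_label_line(line) for line in sample_lines]
print(pairs)
```

Because `str.split()` with no argument splits on any run of whitespace, trailing `\n` or `\r\n` characters are dropped automatically.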
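At this point the dataset is still in its raw form. For deep-learning training on images, raw datasets are commonly converted to LMDB format so that data loading during training is easier, and we do the same here. The code below performs the conversion. The idea is simple: read each image together with its text label, keep the pair in a dictionary (cache), then use the put function of the lmdb package to write the key/value pairs in the cache to LMDB. The cache is flushed once every 1000 samples.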
import os

import cv2
import lmdb
import numpy as np


def checkImageIsValid(imageBin):
    """Return True if imageBin decodes to a non-empty grayscale image."""
    if imageBin is None:
        return False
    try:
        # np.frombuffer replaces the deprecated np.fromstring
        imageBuf = np.frombuffer(imageBin, dtype=np.uint8)
        img = cv2.imdecode(imageBuf, cv2.IMREAD_GRAYSCALE)
        imgH, imgW = img.shape[0], img.shape[1]
    except Exception:
        return False
    return imgH * imgW != 0


def writeCache(env, cache):
    """Flush the cached key/value pairs to LMDB in one write transaction."""
    with env.begin(write=True) as txn:
        for k, v in cache.items():
            # LMDB stores bytes, so encode str keys and values first
            if isinstance(k, str):
                k = k.encode('utf-8')
            if isinstance(v, str):
                v = v.encode('utf-8')
            txn.put(k, v)


def createDataset(outputPath, imagePathList, labelList, lexiconList=None, checkValid=True):
    """
    Create LMDB dataset for CRNN training.
    ARGS:
        outputPath    : LMDB output path
        imagePathList : list of image path
        labelList     : list of corresponding groundtruth texts
        lexiconList   : (optional) list of lexicon lists
        checkValid    : if true, check the validity of every image
    """
    assert (len(imagePathList) == len(labelList))
    nSamples = len(imagePathList)
    env = lmdb.open(outputPath, map_size=1099511627776)  # 1 TB upper bound
    cache = {}
    cnt = 1
    for i in range(nSamples):
        # each entry is a full line of the label file; the path is its first field
        imagePath = ''.join(imagePathList[i]).split()[0].replace('\n', '').replace('\r\n', '')
        # print(imagePath)
        label = ''.join(labelList[i])
        print(label)
        # if not os.path.exists(imagePath):
        #     print('%s does not exist' % imagePath)
        #     continue
        # read raw bytes: the image file must be opened in binary mode
        with open('.' + imagePath, 'rb') as f:
            imageBin = f.read()
        if checkValid:
            if not checkImageIsValid(imageBin):
                print('%s is not a valid image' % imagePath)
                continue
        imageKey = 'image-%09d' % cnt
        labelKey = 'label-%09d' % cnt
        cache[imageKey] = imageBin
        cache[labelKey] = label
        if lexiconList:
            lexiconKey = 'lexicon-%09d' % cnt
            cache[lexiconKey] = ' '.join(lexiconList[i])
        if cnt % 1000 == 0:
            # flush every 1000 samples to keep memory usage bounded
            writeCache(env, cache)
            cache = {}
            print('Written %d / %d' % (cnt, nSamples))
        cnt += 1
        print(cnt)
    nSamples = cnt - 1
    cache['num-samples'] = str(nSamples)
    writeCache(env, cache)
    print('Created dataset with %d samples' % nSamples)


OUT_PATH = '../crnn_train_lmdb'
IN_PATH = './train.txt'

if __name__ == '__main__':
    outputPath = OUT_PATH
    if not os.path.exists(OUT_PATH):
        os.mkdir(OUT_PATH)
    with open(IN_PATH, encoding='utf-8') as imgdata:
        imagePathList = list(imgdata)
    labelList = []
    for line in imagePathList:
        word = line.split()[1]  # second field of each line is the label
        labelList.append(word)
    createDataset(outputPath, imagePathList, labelList)
Running the code above produces the LMDB databases for the training set and the test set.
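The key layout written by createDataset (`image-%09d`, `label-%09d`, plus a `num-samples` entry) also determines how samples are looked up again. The sketch below illustrates that lookup with a plain dict standing in for the LMDB database, so it runs without lmdb installed; `build_keys` and `read_sample` are hypothetical helper names, and the byte strings are placeholders rather than real image data.

```python
def build_keys(index):
    """Return the (image, label) keys for the 1-based sample index."""
    return 'image-%09d' % index, 'label-%09d' % index


def read_sample(store, index):
    """Fetch one (image_bytes, label) pair from a dict-like store."""
    image_key, label_key = build_keys(index)
    return store[image_key], store[label_key]


# A dict standing in for the LMDB database (assumption for illustration).
store = {}
labels = ['你好', 'CRNN']
for cnt, label in enumerate(labels, start=1):
    image_key, label_key = build_keys(cnt)
    store[image_key] = b'<jpeg bytes %d>' % cnt
    store[label_key] = label
store['num-samples'] = str(len(labels))

# A reader first asks for the sample count, then walks indices 1..n.
n = int(store['num-samples'])
samples = [read_sample(store, i) for i in range(1, n + 1)]
print(samples)
```

Sample indices start at 1 because the conversion script initializes `cnt = 1`.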
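One more step in data preparation deserves emphasis: converting text labels to numbers, that is, representing every character (Chinese character, English letter, punctuation mark) by an integer id. For example, "我" might map to id 1, "l" to id 1000, "?" to id 90, and so on. A dictionary is all that is needed for this encoding and decoding: labels are encoded (encode) before training, and at prediction time the network output is decoded (decode) back into text.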
import torch


class strLabelConverter(object):
    """Convert between str and label.

    NOTE:
        Insert `blank` to the alphabet for CTC.

    Args:
        alphabet (str): set of the possible characters.
        ignore_case (bool, default=False): whether or not to ignore all of the case.
    """

    def __init__(self, alphabet, ignore_case=False):
        self._ignore_case = ignore_case
        if self._ignore_case:
            alphabet = alphabet.lower()
        self.alphabet = alphabet + '-'  # for `-1` index

        self.dict = {}
        for i, char in enumerate(alphabet):
            # NOTE: 0 is reserved for 'blank' required by wrap_ctc
            self.dict[char] = i + 1

    def encode(self, text):
        """Support batch or single str.

        Args:
            text (str or list of str): texts to convert.

        Returns:
            torch.IntTensor [length_0 + length_1 + ... length_{n - 1}]: encoded texts.
            torch.IntTensor [n]: length of each text.
        """
        if isinstance(text, (str, bytes)):
            text = [text]  # treat a single string as a batch of one
        length = []
        result = []
        for item in text:
            if isinstance(item, bytes):
                # labels read back from LMDB arrive as UTF-8 bytes
                item = item.decode('utf-8', 'strict')
            if self._ignore_case:
                item = item.lower()
            length.append(len(item))
            for char in item:
                result.append(self.dict[char])
        return (torch.IntTensor(result), torch.IntTensor(length))

    def decode(self, t, length, raw=False):
        """Decode encoded texts back into strs.

        Args:
            torch.IntTensor [length_0 + length_1 + ... length_{n - 1}]: encoded texts.
            torch.IntTensor [n]: length of each text.

        Raises:
            AssertionError: when the texts and its length does not match.

        Returns:
            text (str or list of str): texts to convert.
        """
        if length.numel() == 1:
            length = int(length[0])
            assert t.numel() == length, "text with length: {} does not match declared length: {}".format(
                t.numel(), length)
            if raw:
                # index 0 (blank) maps to alphabet[-1], i.e. '-'
                return ''.join([self.alphabet[int(i) - 1] for i in t])
            else:
                char_list = []
                for i in range(length):
                    # skip blanks and repeats of the previous index (CTC collapse)
                    if t[i] != 0 and (not (i > 0 and t[i - 1] == t[i])):
                        char_list.append(self.alphabet[int(t[i]) - 1])
                return ''.join(char_list)
        else:
            # batch mode
            assert t.numel() == length.sum(), "texts with length: {} does not match declared length: {}".format(
                t.numel(), length.sum())
            texts = []
            index = 0
            for i in range(length.numel()):
                l = int(length[i])
                texts.append(
                    self.decode(
                        t[index:index + l], torch.IntTensor([l]), raw=raw))
                index += l
            return texts
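To see the encoding and the CTC decoding rule in isolation, here is a torch-free sketch that uses plain lists where the class above uses IntTensor. The three-character alphabet is a made-up toy example; as in the class above, index 0 is reserved for the CTC blank.

```python
ALPHABET = 'abc'  # hypothetical toy alphabet
CHAR_TO_ID = {ch: i + 1 for i, ch in enumerate(ALPHABET)}  # 0 = CTC blank


def encode(texts):
    """Flatten a batch of strings into one id list plus per-text lengths."""
    ids = [CHAR_TO_ID[ch] for text in texts for ch in text]
    lengths = [len(text) for text in texts]
    return ids, lengths


def decode(ids, raw=False):
    """Map ids back to text; unless raw, drop blanks and collapse repeats."""
    if raw:
        return ''.join('-' if i == 0 else ALPHABET[i - 1] for i in ids)
    chars = []
    for pos, i in enumerate(ids):
        if i != 0 and not (pos > 0 and ids[pos - 1] == i):
            chars.append(ALPHABET[i - 1])
    return ''.join(chars)


ids, lengths = encode(['cab', 'bb'])
print(ids, lengths)               # [3, 1, 2, 2, 2] [3, 2]

# A per-frame network output: a blank between the b's keeps both of them.
frames = [3, 3, 0, 1, 0, 2, 2, 0, 2]
print(decode(frames, raw=True))   # cc-a-bb-b
print(decode(frames))             # cabb
```

The blank is what allows doubled characters to survive decoding: without a blank between them, two adjacent identical ids collapse into one character.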
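Network design
According to the CRNN paper, CRNN is built from three parts, CNN → RNN → CTC, which correspond to the convolutional layers, the recurrent layers, and the transcription layer. The CNN extracts low-level features; the RNN, a BiLSTM, learns the sequential context and predicts the label distribution; CTC handles sequence alignment and outputs the prediction.
To feed the features into the recurrent layers, the following steps are taken:
- First, the image is scaled to 32×W×3.
- After the CNN it becomes 1×(W/4)×512.
- For the LSTM, set T = W/4 and D = 512, and the features can then be fed to the LSTM.
The above is the idealized training setup, but the network input described in the CRNN paper is a normalized 100×32 grayscale image, i.e. the width is fixed at 100. Below is the CRNN network architecture diagram. The CNN follows the classic VGG16 design. Note that in the 3rd and 4th max-pooling layers CRNN uses a 1×2 rectangular pooling window (w×h), unlike the 2×2 square window of the classic VGG16. The reason is that text images are mostly short and wide, so their feature maps are short and wide as well; a 1×2 window keeps more resolution along the width, which suits narrow characters such as English letters (for example, telling i from l). The VGG16 part also adds BatchNormalization modules to speed up convergence. Also note that the CRNN input is a grayscale image, i.e. the image depth is 1. The output of the CNN part is a 512×1×16 (c×h×w) feature map.
Next, the RNN layers. The RNN part uses bidirectional LSTMs with 256 hidden units, and CRNN stacks two BiLSTM layers. The output of the RNN part has shape (s, b, class_num), where class_num is the total number of character classes.
Note that the LSTM in PyTorch expects a 3-D tensor as input (with the default batch_first=False), and the meaning of each dimension must not be mixed up. The first dimension is the sequence, the second is the mini-batch, and the third holds the elements of the input. If mini-batches are not used, that dimension can be set to 1, but it must still be present.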
LSTM input
input of shape (seq_len, batch, input_size): tensor containing the features of the input sequence.
The input can also be a packed variable length sequence.
input shape (a, b, c)
a: seq_len -> sequence length
b: batch
c: input_size -> number of input features
To satisfy the LSTM input requirements, the CNN output has to be rearranged into the form [seq_len, batch, input_size]. Concretely: first use squeeze to remove the h dimension, then use permute to reorder the dimensions from [b, c, w] to [w, b, c] = [seq_len, batch, input_size], which here is [16, batch, 512]. The tensor can then be fed into the RNN layers.
x = self.cnn(x)
b, c, h, w = x.size()
# print(x.size()): b,c,h,w
assert h == 1 # "the height of conv must be 1"
x = x.squeeze(2) # remove h dimension, b *512 * width
x = x.permute(2, 0, 1) # [w, b, c] = [seq_len, batch, input_size]
x = self.rnn(x)
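The effect of squeeze followed by permute on the indices can be checked without PyTorch. The following sketch does the same rearrangement on nested Python lists; `squeeze_and_permute` is a hypothetical helper written only for this illustration, and each toy value encodes its own position so the result is easy to read.

```python
def squeeze_and_permute(x):
    """Turn a nested list shaped [b][c][1][w] into one shaped [w][b][c]."""
    b, c, h, w = len(x), len(x[0]), len(x[0][0]), len(x[0][0][0])
    assert h == 1, 'the height of conv must be 1'
    return [[[x[bi][ci][0][wi] for ci in range(c)]
             for bi in range(b)]
            for wi in range(w)]


# Toy feature map: batch=2, channels=3, height=1, width=4.
# Each value encodes its own position as 100*b + 10*c + w.
b, c, w = 2, 3, 4
features = [[[[100 * bi + 10 * ci + wi for wi in range(w)]]
             for ci in range(c)]
            for bi in range(b)]

seq = squeeze_and_permute(features)
print(len(seq), len(seq[0]), len(seq[0][0]))  # 4 2 3
print(seq[3][1])                              # [103, 113, 123]
```

Each time step `seq[t]` now holds, for every batch element, the full channel vector of column t of the feature map, which is the per-step input the LSTM expects.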
The RNN output has the following format. Because we use a bidirectional LSTM, the last dimension of the output is hidden_unit * 2.
Outputs: output, (h_n, c_n)
output of shape (seq_len, batch, num_directions * hidden_size)
h_n of shape (num_layers * num_directions, batch, hidden_size)
c_n (num_layers * num_directions, batch, hidden_size)
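We then apply a linear layer, self.embedding1 = torch.nn.Linear(hidden_unit * 2, 512), so that the output dimension becomes 512 again before it is fed into the second LSTM layer. The second LSTM layer is followed by another linear layer, torch.nn.Linear(hidden_unit * 2, class_num), so that the last dimension of the RNN output equals the total number of character classes.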
import torch
import torch.nn.functional as F


class Vgg_16(torch.nn.Module):

    def __init__(self):
        super(Vgg_16, self).__init__()
        self.convolution1 = torch.nn.Conv2d(1, 64, 3, padding=1)
        self.pooling1 = torch.nn.MaxPool2d(2, stride=2)
        self.convolution2 = torch.nn.Conv2d(64, 128, 3, padding=1)
        self.pooling2 = torch.nn.MaxPool2d(2, stride=2)
        self.convolution3 = torch.nn.Conv2d(128, 256, 3, padding=1)
        self.convolution4 = torch.nn.Conv2d(256, 256, 3, padding=1)
        self.pooling3 = torch.nn.MaxPool2d((1, 2), stride=(2, 1))  # notice stride of the non-square pooling
        self.convolution5 = torch.nn.Conv2d(256, 512, 3, padding=1)
        self.BatchNorm1 = torch.nn.BatchNorm2d(512)
        self.convolution6 = torch.nn.Conv2d(512, 512, 3, padding=1)
        self.BatchNorm2 = torch.nn.BatchNorm2d(512)
        self.pooling4 = torch.nn.MaxPool2d((1, 2), stride=(2, 1))
        self.convolution7 = torch.nn.Conv2d(512, 512, 2)

    def forward(self, x):
        x = F.relu(self.convolution1(x), inplace=True)
        x = self.pooling1(x)
        x = F.relu(self.convolution2(x), inplace=True)
        x = self.pooling2(x)
        x = F.relu(self.convolution3(x), inplace=True)
        x = F.relu(self.convolution4(x), inplace=True)
        x = self.pooling3(x)
        x = self.convolution5(x)
        x = F.relu(self.BatchNorm1(x), inplace=True)
        x = self.convolution6(x)
        x = F.relu(self.BatchNorm2(x), inplace=True)
        x = self.pooling4(x)
        x = F.relu(self.convolution7(x), inplace=True)
        return x  # b*512x1x16


class RNN(torch.nn.Module):

    def __init__(self, class_num, hidden_unit):
        super(RNN, self).__init__()
        self.Bidirectional_LSTM1 = torch.nn.LSTM(512, hidden_unit, bidirectional=True)
        self.embedding1 = torch.nn.Linear(hidden_unit * 2, 512)
        self.Bidirectional_LSTM2 = torch.nn.LSTM(512, hidden_unit, bidirectional=True)
        self.embedding2 = torch.nn.Linear(hidden_unit * 2, class_num)

    def forward(self, x):
        x = self.Bidirectional_LSTM1(x)  # LSTM output: output, (h_n, c_n)
        T, b, h = x[0].size()  # x[0]: (seq_len, batch, num_directions * hidden_size)
        x = self.embedding1(x[0].view(T * b, h))  # pytorch view() reshape as [T * b, nOut]
        x = x.view(T, b, -1)  # [16, b, 512]
        x = self.Bidirectional_LSTM2(x)
        T, b, h = x[0].size()
        x = self.embedding2(x[0].view(T * b, h))
        x = x.view(T, b, -1)
        return x  # [16,b,class_num]


# output: [s,b,class_num]
class CRNN(torch.nn.Module):

    def __init__(self, class_num, hidden_unit=256):
        super(CRNN, self).__init__()
        self.cnn = torch.nn.Sequential()
        self.cnn.add_module('vgg_16', Vgg_16())
        self.rnn = torch.nn.Sequential()
        self.rnn.add_module('rnn', RNN(class_num, hidden_unit))

    def forward(self, x):
        x = self.cnn(x)
        b, c, h, w = x.size()
        # print(x.size()): b,c,h,w
        assert h == 1, "the height of conv must be 1"
        x = x.squeeze(2)  # remove h dimension, b *512 * width
        x = x.permute(2, 0, 1)  # [w, b, c] = [seq_len, batch, input_size]
        # x = x.transpose(0, 2)
        # x = x.transpose(1, 2)
        x = self.rnn(x)
        return x
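The sequence length that reaches the RNN follows from the kernel, stride, and padding of each layer in Vgg_16. The sketch below traces the spatial size through those layers with the standard output-size formula for convolution and pooling (dilation 1, floor rounding); it is plain arithmetic, not a run of the network. In PyTorch the kernel and stride tuples are ordered (height, width). `out_size` and `trace_shapes` are hypothetical helper names.

```python
def out_size(size, kernel, stride=1, padding=0):
    """Output length of a conv/pool layer along one axis (dilation 1)."""
    return (size + 2 * padding - kernel) // stride + 1


# (name, (kernel_h, kernel_w), (stride_h, stride_w), padding) per layer,
# copied from the Vgg_16 definition above.
LAYERS = [
    ('convolution1', (3, 3), (1, 1), 1),
    ('pooling1', (2, 2), (2, 2), 0),
    ('convolution2', (3, 3), (1, 1), 1),
    ('pooling2', (2, 2), (2, 2), 0),
    ('convolution3', (3, 3), (1, 1), 1),
    ('convolution4', (3, 3), (1, 1), 1),
    ('pooling3', (1, 2), (2, 1), 0),
    ('convolution5', (3, 3), (1, 1), 1),
    ('convolution6', (3, 3), (1, 1), 1),
    ('pooling4', (1, 2), (2, 1), 0),
    ('convolution7', (2, 2), (1, 1), 0),
]


def trace_shapes(height, width):
    """Print the feature-map size after every layer; return the final (h, w)."""
    for name, (kh, kw), (sh, sw), pad in LAYERS:
        height = out_size(height, kh, sh, pad)
        width = out_size(width, kw, sw, pad)
        print('%-13s -> h=%d, w=%d' % (name, height, width))
    return height, width


print(trace_shapes(32, 100))  # (1, 22)
```

For a 32×100 input the height collapses to 1, as the assert in CRNN.forward requires, and the width becomes 22, which agrees with the `w=22` comment in trainBatch. The sequence length therefore depends on the input width, so other widths give other values.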
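Loss function design
The CNN and RNN layers are now in place, so we turn to the transcription layer, which translates the RNN output into the final recognized text and thereby makes variable-length recognition possible. The PyTorch version used for this post has no built-in CTC loss (PyTorch 1.0 and later provide torch.nn.CTCLoss), so the loss is built on a third-party implementation from GitHub, warp-ctc. It is installed as follows: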
git clone https://github.com/SeanNaren/warp-ctc.git
cd warp-ctc
mkdir build; cd build
cmake ..
make
cd ../pytorch_binding/
python setup.py install
cd ../build
cp libwarpctc.so ../../usr/lib
Once the installation is finished, CTC loss can be called directly. A small example shows how it is used.
import torch
from warpctc_pytorch import CTCLoss
ctc_loss = CTCLoss()
# expected shape of seqLength x batchSize x alphabet_size
probs = torch.FloatTensor([[[0.1, 0.6, 0.1, 0.1, 0.1], [0.1, 0.1, 0.6, 0.1, 0.1]]]).transpose(0, 1).contiguous()
labels = torch.IntTensor([1, 2])
label_sizes = torch.IntTensor([2])
probs_sizes = torch.IntTensor([2])
probs.requires_grad_(True) # tells autograd to compute gradients for probs
cost = ctc_loss(probs, labels, probs_sizes, label_sizes)
cost.backward()
CTCLoss(size_average=False, length_average=False)
# size_average (bool): normalize the loss by the batch size (default: False)
# length_average (bool): normalize the loss by the total number of frames in the batch. If True, supersedes size_average (default: False)
forward(acts, labels, act_lens, label_lens)
# acts: Tensor of (seqLength x batch x outputDim) containing output activations from network (before softmax)
# labels: 1 dimensional Tensor containing all the targets of the batch in one large sequence
# act_lens: Tensor of size (batch) containing size of each output sequence from the network
# label_lens: Tensor of (batch) containing label length of each example
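As the code above shows, CTCLoss takes [probs, labels, probs_sizes, label_sizes], i.e. the predictions, the labels, the length of each predicted sequence, and the length of each label. Following this example, we set up the CTC loss for CRNN.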
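What the loss computes for the toy example can be reproduced with the CTC forward recursion in plain Python. This is a minimal, unoptimized sketch for a single sequence, assuming (as the comments above state) that the activations are pre-softmax scores and that index 0 is the blank. It is not the warp-ctc implementation, and `ctc_loss` here is a hypothetical stand-alone function.

```python
import math

NEG_INF = -math.inf


def log_softmax(row):
    """Numerically stable log-softmax of one frame of activations."""
    m = max(row)
    lse = m + math.log(sum(math.exp(v - m) for v in row))
    return [v - lse for v in row]


def log_add(a, b):
    """log(exp(a) + exp(b)) that tolerates -inf operands."""
    if a == NEG_INF:
        return b
    if b == NEG_INF:
        return a
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))


def ctc_loss(acts, labels, blank=0):
    """Negative log-likelihood of `labels` given T frames of activations."""
    logp = [log_softmax(frame) for frame in acts]
    # Extended label sequence: blank, l1, blank, l2, ..., blank
    ext = [blank]
    for label in labels:
        ext += [label, blank]
    S = len(ext)
    alpha = [NEG_INF] * S
    alpha[0] = logp[0][ext[0]]
    if S > 1:
        alpha[1] = logp[0][ext[1]]
    for t in range(1, len(logp)):
        new = [NEG_INF] * S
        for s in range(S):
            a = alpha[s]                      # stay on the same symbol
            if s >= 1:
                a = log_add(a, alpha[s - 1])  # advance by one position
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                a = log_add(a, alpha[s - 2])  # skip the blank in between
            new[s] = a + logp[t][ext[s]]
        alpha = new
    total = alpha[S - 1]
    if S > 1:
        total = log_add(total, alpha[S - 2])
    return -total


# Same numbers as the toy example: 2 frames, 5 classes, labels [1, 2].
acts = [[0.1, 0.6, 0.1, 0.1, 0.1],
        [0.1, 0.1, 0.6, 0.1, 0.1]]
loss = ctc_loss(acts, [1, 2])
print(round(loss, 4))  # 2.4629
```

With two frames and two distinct labels there is only one valid alignment (label 1 in the first frame, label 2 in the second), so the loss reduces to the negative sum of the two log-softmax values.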
preds = net(image)
preds_size = Variable(torch.IntTensor([preds.size(0)] * batch_size)) # preds.size(0)=w=16
cost = criterion(preds, text, preds_size, length) / batch_size # length holds the length of each text label in the batch; dividing by batch_size gives the mean loss
cost.backward()
Training design
Next we flesh out the training procedure. We also wrote a trainBatch function that performs one gradient update on a batch.
def trainBatch(net, criterion, optimizer, train_iter):
    # image, text, length and converter are module-level objects created during setup
    data = next(train_iter)  # train_iter.next() only exists in Python 2
    cpu_images, cpu_texts = data
    batch_size = cpu_images.size(0)
    lib.dataset.loadData(image, cpu_images)
    t, l = converter.encode(cpu_texts)
    lib.dataset.loadData(text, t)
    lib.dataset.loadData(length, l)
    preds = net(image)
    # print("preds.size=%s" % preds.size)
    preds_size = Variable(torch.IntTensor([preds.size(0)] * batch_size))  # preds.size(0)=w=22
    cost = criterion(preds, text, preds_size, length) / batch_size  # length= a list that contains the len of text label in a batch
    net.zero_grad()
    cost.backward()
    optimizer.step()
    return cost
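The overall training flow is: create the CTC loss object -> create the CRNN network -> initialize the image, text, and length tensors -> initialize the optimizer, then loop over the epochs, running validation and saving the model every fixed number of iterations. The CRNN paper uses the Adadelta optimizer, but in my experiments Adadelta converged very slowly, so I switched to RMSprop, which made the model converge much faster.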
criterion = CTCLoss()
net = Net.CRNN(n_class)
print(net)
net.apply(lib.utility.weights_init)

image = torch.FloatTensor(Config.batch_size, 3, Config.img_height, Config.img_width)
text = torch.IntTensor(Config.batch_size * 5)
length = torch.IntTensor(Config.batch_size)

if cuda:
    net.cuda()
    image = image.cuda()
    criterion = criterion.cuda()

image = Variable(image)
text = Variable(text)
length = Variable(length)

loss_avg = lib.utility.averager()

optimizer = optim.RMSprop(net.parameters(), lr=Config.lr)
# optimizer = optim.Adadelta(net.parameters(), lr=Config.lr)
# optimizer = optim.Adam(net.parameters(), lr=Config.lr,
#                        betas=(Config.beta1, 0.999))

for epoch in range(Config.epoch):
    train_iter = iter(train_loader)
    i = 0
    while i < len(train_loader):
        for p in net.parameters():
            p.requires_grad = True
        net.train()

        cost = trainBatch(net, criterion, optimizer, train_iter)
        loss_avg.add(cost)
        i += 1

        if i % Config.display_interval == 0:
            print('[%d/%d][%d/%d] Loss: %f' %
                  (epoch, Config.epoch, i, len(train_loader), loss_avg.val()))
            loss_avg.reset()

        if i % Config.test_interval == 0:
            val(net, test_dataset, criterion)

        # do checkpointing
        if i % Config.save_interval == 0:
            torch.save(
                net.state_dict(), '{0}/netCRNN_{1}_{2}.pth'.format(Config.model_dir, epoch, i))
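The training loop relies on `lib.utility.averager`, which is not listed in this post. A running-average helper with the same `add` / `val` / `reset` interface might look like the sketch below. This is an assumed stand-in that works on plain floats, not the author's implementation, and the loss values are invented for the demonstration.

```python
class Averager(object):
    """Running mean of the values added since the last reset."""

    def __init__(self):
        self.reset()

    def add(self, value):
        # float() also accepts one-element tensors, so a loss can be passed in
        self.total += float(value)
        self.count += 1

    def val(self):
        # Return 0.0 rather than dividing by zero when nothing was added
        return self.total / self.count if self.count else 0.0

    def reset(self):
        self.total = 0.0
        self.count = 0


# Mimic the display_interval logic of the training loop with made-up losses.
loss_avg = Averager()
display_interval = 2
for step, loss in enumerate([2.0, 1.0, 0.6, 0.4], start=1):
    loss_avg.add(loss)
    if step % display_interval == 0:
        print('step %d, mean loss %.2f' % (step, loss_avg.val()))
        loss_avg.reset()
```

Resetting after each report means every printed value is the mean over the most recent interval only, which is how the loop above uses `loss_avg`.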
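Training process and test design
The figure below shows the CRNN training process. The number of character classes is 6732, training ran for 20 epochs, and batch_size was set to 64, which gives 51244 iterations per epoch.
After 4 epochs the loss had dropped to about 0.1 and the accuracy had risen to 0.98.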
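The figure of 51244 iterations per epoch ties the batch size to the size of the training set. The short calculation below works backwards from it, assuming the data loader keeps the final incomplete batch (the post does not state whether it does); `iterations_per_epoch` is a hypothetical helper.

```python
import math

BATCH_SIZE = 64
ITERS_PER_EPOCH = 51244


def iterations_per_epoch(num_samples, batch_size):
    """Number of batches when the final partial batch is kept."""
    return math.ceil(num_samples / batch_size)


# Range of training-set sizes that would yield exactly 51244 iterations.
max_samples = ITERS_PER_EPOCH * BATCH_SIZE
min_samples = (ITERS_PER_EPOCH - 1) * BATCH_SIZE + 1
print(min_samples, max_samples)  # 3279553 3279616
```

So the training set holds roughly 3.28 million images, which is in line with about 3.6 million samples split 10:1 between training and test.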
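Next comes the inference code. First initialize the CRNN network and load the trained model, then read the image to be recognized and resize it to a grayscale image of height 32, feed the image to the network, and finally decode the network output into text.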
import time
import torch
import os
from torch.autograd import Variable
import lib.convert
import lib.dataset
from PIL import Image
import Net.net as Net
import alphabets
import sys
import Config

os.environ['CUDA_VISIBLE_DEVICES'] = "4"

crnn_model_path = './bs64_model/netCRNN_9_48000.pth'
IMG_ROOT = './test_images'
running_mode = 'gpu'
alphabet = alphabets.alphabet
nclass = len(alphabet) + 1  # +1 for the CTC blank


def crnn_recognition(cropped_image, model):
    converter = lib.convert.strLabelConverter(alphabet)  # label converter
    image = cropped_image.convert('L')  # convert to grayscale

    ### Testing images are scaled to have height 32. Widths are
    # proportionally scaled with heights, but at least 100 pixels
    w = int(image.size[0] / (280 * 1.0 / Config.infer_img_w))
    # scale = image.size[1] * 1.0 / Config.img_height
    # w = int(image.size[0] / scale)

    transformer = lib.dataset.resizeNormalize((w, Config.img_height))
    image = transformer(image)
    if torch.cuda.is_available():
        image = image.cuda()
    image = image.view(1, *image.size())  # add the batch dimension
    image = Variable(image)

    model.eval()
    preds = model(image)

    _, preds = preds.max(2)  # most likely class per time step
    preds = preds.transpose(1, 0).contiguous().view(-1)

    preds_size = Variable(torch.IntTensor([preds.size(0)]))
    sim_pred = converter.decode(preds.data, preds_size.data, raw=False)  # decode the prediction into text
    print('results: {0}'.format(sim_pred))


if __name__ == '__main__':
    # crnn network
    model = Net.CRNN(nclass)

    # Load the trained model; CPU and GPU loading differ and are handled separately
    if running_mode == 'gpu' and torch.cuda.is_available():
        model = model.cuda()
        model.load_state_dict(torch.load(crnn_model_path))
    else:
        model.load_state_dict(torch.load(crnn_model_path, map_location='cpu'))

    print('loading pretrained model from {0}'.format(crnn_model_path))

    files = sorted(os.listdir(IMG_ROOT))  # sort by file name
    for file in files:
        started = time.time()
        full_path = os.path.join(IMG_ROOT, file)
        print("=============================================")
        print("ocr image is %s" % full_path)
        image = Image.open(full_path)

        crnn_recognition(image, model)
        finished = time.time()
        print('elapsed time: {0}'.format(finished - started))
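The inference script contains two ways of choosing the resize width: the active line scales the width relative to a reference width of 280, and the commented-out lines keep the aspect ratio for the target height. The sketch below puts the two side by side. The value 160 used for `infer_img_w` is an assumption made only for this example, since the actual `Config.infer_img_w` is not given in the post, and both function names are hypothetical.

```python
def width_by_reference(orig_w, infer_img_w, ref_w=280):
    """Active rule: a ref_w-pixel-wide image maps to infer_img_w pixels."""
    return int(orig_w / (ref_w * 1.0 / infer_img_w))


def width_by_aspect(orig_w, orig_h, target_h=32):
    """Commented-out rule: keep the aspect ratio at the target height."""
    scale = orig_h * 1.0 / target_h
    return int(orig_w / scale)


INFER_IMG_W = 160  # assumed value, not taken from the original Config

print(width_by_reference(280, INFER_IMG_W))  # 160
print(width_by_reference(560, INFER_IMG_W))  # 320
print(width_by_aspect(560, 64))              # 280
```

The reference rule ignores the original height, so it suits inputs whose line height is roughly constant, while the aspect rule adapts to any crop size.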
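Recognition results and summary
First I took a few images from the test set and fed them to the model; all of them were recognized correctly.
I also cropped text-line images at random from some document pictures and scanned images and fed them to the model. The results were good as well, mostly correct, which suggests that the model generalizes well.
A brief summary: for end-to-end variable-length text recognition, CRNN is the classic algorithm, and in practice it performs very well. As the results above show, although the training data was generated by ourselves, the model gives good results on PDF documents, scanned images, and similar material. To further improve recognition on text images from a specific domain, adding a large number of such images to the training data is enough. The complete CRNN code can be found on my GitHub.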
Source: https://www.cnblogs.com/skyfsm/p/10345305.html