This tutorial presents and explains the database structure and the code behind the paper titled VML-HD: The Historical Arabic Documents Dataset for Recognition Systems, which was presented at ASAR 2017. Feel free to download the paper from the link here; it can be found under the publications section.
This work came to life after experiencing difficulties acquiring annotated data that is publicly available and free to use. The scarcity of annotated datasets, especially for non-Latin scripts, is hindering advances in research. Because of this, I decided, with the assistance of a few of my colleagues, to annotate a dataset of Arabic manuscripts and release it for free research use.
The database is based on five books written by different writers between the years 1088 and 1451. All 668 pages are fully annotated at the subword level: for each page, we manually applied bounding boxes to the individual subwords and annotated their sequences of characters. In total, the database contains 159,149 subword appearances comprising 326,289 characters, drawn from a vocabulary of 5,509 subword forms. It is designed for training and testing recognition systems for handwritten Arabic subwords. The database is available for research purposes, and we encourage researchers to develop and test new methods using it.
To acquire the annotated data, refer here. The dataset contains annotations in the Hadara format; another, more convenient format is also included here. Either format can be used to extract the annotated subword images from the dataset.
A sample page of each manuscript can be seen above. In the next section, we detail both formats and provide code segments for easy extraction of the subwords from the dataset.
In the database, each of the five books has its own manuscript folder. Inside each manuscript folder, a file named docElementsXml.ashx contains the complete annotation information for that manuscript.
The HadaraXML format contains two main tags: an image tag and a content tag. The image tag carries the meta information, i.e. the bounding boxes of each manuscript page, while the content tag carries the labeling of each bounding box defined under the image tag. The tags you will encounter are:

<image id="781" src="0003-1"> — one manuscript page; src is the name of the page image.
<zone id="113804"> — one bounding box on the page, identified by its id.
<polygon> — the outline of a zone, given as a list of points.
<point y="324" x="764" /> — one corner of a polygon.
<content image_id="781"> — the labeling block of the page whose image tag has the matching id.
<section type="page"> — groups the segments of one page.
<segment id="113804" ref_id="113804"> — the label of the zone whose id equals ref_id.
<transcriptionInfo id="113804"/> — transcription metadata (here only the id).
<transcription>لم</transcription> — the subword's sequence of characters.
<HADARA>
  <document nbpages="196" id="61">
    <image id="781" src="0003-1">
      <page>
        <zone id="113804">
          <polygon>
            <point y="324" x="764" />
            <point y="324" x="821" />
            <point y="391" x="821" />
            <point y="391" x="764" />
          </polygon>
        </zone>
        <zone id="113805">
          <polygon>
            <point y="332" x="831" />
            <point y="332" x="839" />
            <point y="374" x="839" />
            <point y="374" x="831" />
          </polygon>
        </zone>
        ...
        <zone id="113808">
          <polygon>
            <point y="318" x="717" />
            <point y="318" x="744" />
            <point y="384" x="744" />
            <point y="384" x="717" />
          </polygon>
        </zone>
      </page>
    </image>
    <content image_id="781">
      <section type="page">
        <segment id="113804" ref_id="113804">
          <transcriptionInfo id="113804"/>
          <transcription>لم</transcription>
        </segment>
        <segment id="113805" ref_id="113805">
          <transcriptionInfo id="113805"/>
          <transcription>ا</transcription>
        </segment>
        ...
        <segment id="113808" ref_id="113808">
          <transcriptionInfo id="113808"/>
          <transcription>ذ</transcription>
        </segment>
      </section>
    </content>
  </document>
</HADARA>
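The one-click parser presented later in this tutorial works with the simpler per-page XML format, but the HadaraXML file can be mined directly as well. The following is a minimal sketch, assuming the structure shown above and a hypothetical path to one manuscript's docElementsXml.ashx; it pairs each zone's bounding polygon with its transcription via the segment's ref_id.

import xml.etree.ElementTree as ET

# Minimal sketch: pair every zone's bounding polygon with its label.
# The path below is a hypothetical example of one manuscript folder.
document = ET.parse('3157556/docElementsXml.ashx').getroot().find('document')

# 1) collect every zone's bounding polygon, keyed by zone id
polygons = {}
for zone in document.iter('zone'):
    polygons[zone.get('id')] = [(int(p.get('x')), int(p.get('y')))
                                for p in zone.iter('point')]

# 2) walk the content blocks and match each segment to its zone via ref_id
for content in document.iter('content'):
    for segment in content.iter('segment'):
        label = segment.findtext('transcription')
        print(content.get('image_id'), segment.get('id'),
              label, polygons[segment.get('ref_id')])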
The XML format is simpler. For each page PNG in a manuscript, one XML file with the same name is present. It is essentially an array of elements, where each element stores the information of one subword.
The main fields of each DocumentElement are:

<ID> — the element identifier, matching the zone id in the HadaraXML file.
<ElementType> — PartOfWord for subword elements.
<X>, <Y> — the top-left corner of the subword's bounding box.
<Width>, <Height> — the dimensions of the bounding box.
<Transcript> — the subword's sequence of characters.
<ArrayOfDocumentElement>
  <DocumentElement>
    <ID>113804</ID>
    <ParentID xsi:nil="true" />
    <ElementType>PartOfWord</ElementType>
    <X>764</X>
    <Y>324</Y>
    <Width>57</Width>
    <Height>67</Height>
    <Transcript>لم</Transcript>
    <Threshold>100</Threshold>
    <OriginX>806</OriginX>
    <OriginY>377</OriginY>
  </DocumentElement>
  <DocumentElement>
    <ID>113805</ID>
    <ParentID xsi:nil="true" />
    <ElementType>PartOfWord</ElementType>
    <X>831</X>
    <Y>332</Y>
    <Width>8</Width>
    <Height>42</Height>
    <Transcript>ا</Transcript>
    <Threshold>100</Threshold>
    <OriginX>835</OriginX>
    <OriginY>350</OriginY>
  </DocumentElement>
  ...
  <DocumentElement>
    <ID>113808</ID>
    <ParentID xsi:nil="true" />
    <ElementType>PartOfWord</ElementType>
    <X>717</X>
    <Y>318</Y>
    <Width>27</Width>
    <Height>66</Height>
    <Transcript>ذ</Transcript>
    <Threshold>100</Threshold>
    <OriginX>736</OriginX>
    <OriginY>375</OriginY>
  </DocumentElement>
</ArrayOfDocumentElement>
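Note that X/Y/Width/Height describe the same axis-aligned box the HadaraXML polygon spans: for zone 113804 above, X=764, Y=324, Width=821-764=57, Height=391-324=67. As a minimal sketch (the page path below is a hypothetical example following the naming shown earlier), listing the subwords of a single page takes just a few lines:

import xml.etree.ElementTree as ET

# Minimal sketch: list the subwords of one page file
root = ET.parse('/DATA/majeek/dataset/3157556/0003-1.xml').getroot()
for element in root.iter('DocumentElement'):
    if element.findtext('ElementType') == 'PartOfWord':
        print(element.findtext('ID'), element.findtext('Transcript'),
              element.findtext('X'), element.findtext('Y'),
              element.findtext('Width'), element.findtext('Height'))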
The following section contains a one-click parser for ease of use. The parser reads the XML file of each manuscript page, segments the subwords, and stores each of them as its own image file in a folder named after the page. The subword images of each page are numbered, and a text file pairing each image number with its labeling is stored alongside them.
import xml.etree.ElementTree as ET
import os
import cv2

# manuscript folders, their page-name paddings, and their page-number ranges
manuscripts = ['3157556', '3158466', '3249138', '3368132', '3426930']
paddings = ['00', '_0', '00', '_00', '00']
ranges = [[3, 100], [5, 72], [7, 57], [4, 50], [10, 79]]
root_path = '/DATA/majeek/dataset/'
pixelPadding = 3  # extra pixels kept around each bounding box

for manuscript, padding, one_range in zip(manuscripts, paddings, ranges):
    print('parsing manuscript', manuscript)
    path = root_path + manuscript + '/'
    current = one_range[0]
    while current <= one_range[1]:
        perm = 1
        while perm < 3:  # two page images per number (suffixes -1 and -2)
            if current < 10:
                curImgName = padding + '0' + str(current) + '-' + str(perm)
            elif current < 100:
                curImgName = padding + str(current) + '-' + str(perm)
            else:
                curImgName = '0' + str(current) + '-' + str(perm)
            img = cv2.imread(path + curImgName + '.png')
            root = ET.parse(path + curImgName + '.xml').getroot()
            # make a folder named after the page, with a CCs subfolder
            pageDir = path + 'segments/' + curImgName
            ccDir = pageDir + '/CCs'
            if not os.path.exists(pageDir):
                os.makedirs(pageDir)
            if not os.path.exists(ccDir):
                os.makedirs(ccDir)
            idx = 0
            annotations = []
            idxList = []
            for documentElement in root:  # list of bounding boxes
                elementType = documentElement.find('ElementType').text
                if elementType == 'PartOfWord':
                    x = documentElement.find('X').text
                    y = documentElement.find('Y').text
                    width = documentElement.find('Width').text
                    height = documentElement.find('Height').text
                    try:
                        # crop the padded bounding box and save it
                        roi = img[
                            int(y) - pixelPadding:int(y) + int(height) + pixelPadding,
                            int(x) - pixelPadding:int(x) + int(width) + pixelPadding
                        ]
                        transcript = documentElement.find('Transcript').text
                        annotations.append(transcript)
                        idxList.append(idx)
                        cv2.imwrite(ccDir + '/' + manuscript + '-' +
                                    curImgName + '-' + str(idx) + '.png', roi)
                    except (TypeError, AttributeError) as err:
                        print('error', idx, curImgName, err)
                    finally:
                        idx += 1
            # write the image numbers and their labels to the page's text file
            with open(pageDir + '/' + curImgName + '.txt', 'w',
                      encoding='utf-8') as outFile:
                for i, annotation in zip(idxList, annotations):
                    outFile.write(str(i) + '\t' + annotation + '\n')
            perm += 1
        current += 1
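Once the parser has run, loading the labeled subword images back for training is straightforward. Below is a minimal sketch under the directory layout produced above; load_page is a hypothetical helper, not part of the original tutorial.

import os
import cv2

def load_page(manuscript_path, manuscript, page_name):
    # hypothetical helper: pair each numbered subword image in CCs/
    # with its label from the page's tab-separated text file
    page_dir = os.path.join(manuscript_path, 'segments', page_name)
    samples = []
    with open(os.path.join(page_dir, page_name + '.txt'),
              encoding='utf-8') as f:
        for line in f:
            idx, label = line.rstrip('\n').split('\t')
            img = cv2.imread(os.path.join(
                page_dir, 'CCs',
                manuscript + '-' + page_name + '-' + idx + '.png'))
            samples.append((img, label))
    return samples

pairs = load_page('/DATA/majeek/dataset/3157556/', '3157556', '0003-1')
print(len(pairs), 'labeled subwords loaded')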
I hope you enjoyed this tutorial. Please carry out your research on the dataset and publish your work for the betterment of humanity.
If you have any questions or comments, feel free to email me at majeek @ cs bgu ac il.
More information about me and my work can be found on my webpage and my LinkedIn.