Introduction

This tutorial presents and explains the database structure and the code behind the paper titled VML-HD: The Historical Arabic Documents Dataset for Recognition Systems, which was presented at ASAR 2017. Feel free to download the paper from the link here; it can be found under the publications section.

Background and Data

This work came to life after we experienced trouble acquiring annotated data that is publicly available and free to use. The scarcity of annotated datasets, especially for non-Latin scripts, is hindering advances in research. For this reason, I decided, with the assistance of a few of my colleagues, to annotate a dataset of Arabic manuscripts and release it for free research use.

The database is based on five books written by different writers in the years 1088-1451. All 668 pages are fully annotated at the subword level: for each page, we manually drew bounding boxes around the subwords and annotated their character sequences. The database consists of 159,149 subword appearances comprising 326,289 characters, drawn from a vocabulary of 5,509 subword forms. It is designed for training and testing recognition systems for handwritten Arabic subwords, is available for research purposes, and we encourage researchers to develop and test new methods with it.

To acquire the annotated data, refer here. The dataset contains annotations in the Hadara format. Another, more convenient format is also included here. Either format can be used to extract the annotated subword images from the dataset.

A sample page of each manuscript can be seen above. In the next sections, we detail both formats and provide code segments for easy extraction of the subwords from the dataset.

Hadara Format

In the database, each of the five manuscripts has its own folder. Inside each manuscript folder, a file named docElementsXml.ashx contains the complete annotation information for that manuscript.

The HadaraXML format contains two main tags: an image tag and a content tag. The image tag holds the meta information, namely the bounding boxes for each manuscript page, while the content tag holds the label for each bounding box found under the image tag.

Image Tag

  • <image id="781" src="0003-1"></image>
    • src denotes the image file name; the image format is png.
    • all annotated subwords of image src are found under this tag.
  • <zone id="113804"> </zone>
    • each zone contains one bounding box of a subword, identified by id.
  • <polygon></polygon>
    • each zone contains exactly one polygon.
    • each polygon lists the four corner coordinates of the bounding box, each denoted by a point tag.
  • <point y="324" x="764" />
    • y denotes the row coordinate of a bounding box corner.
    • x denotes the column coordinate of a bounding box corner.

Content Tag

  • <content image_id="781"></content>
    • each image_id value corresponds to the id of an image tag described above.
    • each bounding box under the image tag with the same id value has one segment containing its transcription.
  • <section type="page"></section>
    • all segments of the page are found inside this section.
  • <segment id="113804" ref_id="113804">
    • each segment has a ref_id that corresponds to one zone id value.
    • the transcription information is found inside this tag.
  • <transcriptionInfo id="113804"/>
    • it is followed by a transcription tag that contains the label of the bounding box.

Example XML

<HADARA>
  <document nbpages="196" id="61">
    <image id="781" src="0003-1">
      <page>
        <zone id="113804">
          <polygon>
            <point y="324" x="764" />
            <point y="324" x="821" />
            <point y="391" x="821" />
            <point y="391" x="764" />
          </polygon>
        </zone>
        <zone id="113805">
          <polygon>
            <point y="332" x="831" />
            <point y="332" x="839" />
            <point y="374" x="839" />
            <point y="374" x="831" />
          </polygon>
        </zone>
          .
          .
          .
        <zone id="113808">
          <polygon>
            <point y="318" x="717" />
            <point y="318" x="744" />
            <point y="384" x="744" />
            <point y="384" x="717" />
          </polygon>
        </zone>
      </page>
    </image>
    <content image_id="781">
      <section type="page">
        <segment id="113804" ref_id="113804">
          <transcriptionInfo id="113804"/>
          <transcription>لم</transcription>
        </segment>
        <segment id="113805" ref_id="113805">
          <transcriptionInfo id="113805"/>
          <transcription>ا</transcription>
        </segment>
          .
          .
          .
        <segment id="113808" ref_id="113808">
          <transcriptionInfo id="113808"/>
          <transcription>ذ</transcription>
        </segment>
      </section>
    </content>
  </document>
</HADARA>
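The structure above can be parsed with a few lines of Python. The sketch below is a minimal example using the standard ElementTree module: it builds a bounding box for each zone from its four point tags and pairs it with the matching segment transcription via ref_id. The function name and the returned layout are my own choices, not part of the dataset.

```python
import xml.etree.ElementTree as ET

def parse_hadara(xml_path):
    """Return {image src: [(bbox, transcription), ...]} where bbox is
    (x_min, y_min, x_max, y_max), derived from the four <point> tags."""
    root = ET.parse(xml_path).getroot()
    # first pass: collect bounding boxes keyed by zone id, per image
    pages = {}
    for image in root.iter('image'):
        zones = {}
        for zone in image.iter('zone'):
            xs = [int(p.get('x')) for p in zone.iter('point')]
            ys = [int(p.get('y')) for p in zone.iter('point')]
            zones[zone.get('id')] = (min(xs), min(ys), max(xs), max(ys))
        pages[image.get('id')] = (image.get('src'), zones)
    # second pass: match each segment to its zone through ref_id
    results = {}
    for content in root.iter('content'):
        src, zones = pages[content.get('image_id')]
        entries = []
        for segment in content.iter('segment'):
            bbox = zones[segment.get('ref_id')]
            label = segment.findtext('transcription')
            entries.append((bbox, label))
        results[src] = entries
    return results
```

Calling parse_hadara on a manuscript's docElementsXml.ashx yields, for each page image, the list of subword boxes and their labels ready for cropping.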

XML Format

The XML format is simpler. For each page png in a manuscript, one xml file is present with the same name.

It is essentially an array of elements; the information for each subword is stored under one element.

Important Tags

  • <ArrayOfDocumentElement>
    • main tag, contains all bounding box information.
  • <DocumentElement>
    • one bounding box is stored under one element.
  • <X>764</X>
    • top left column coordinate.
  • <Y>324</Y>
    • top left row coordinate.
  • <Width>57</Width>
    • width of bounding box.
  • <Height>67</Height>
    • height of bounding box.
  • <Transcript>لم</Transcript>
    • transcription value for the bounding box.
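Given these fields, cropping a subword out of a page image is a single array slice: rows are indexed by Y and columns by X. A minimal sketch (the function name is my own; the values match the example entry below):

```python
import numpy as np

def crop_subword(img, x, y, width, height):
    """Crop one subword bounding box from a page image.

    img is a NumPy array as returned by cv2.imread; rows correspond
    to the Y coordinate and columns to the X coordinate."""
    return img[y:y + height, x:x + width]
```

For the first example element (X=764, Y=324, Width=57, Height=67), the resulting crop has shape (67, 57).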

Example XML

<ArrayOfDocumentElement>
  <DocumentElement>
    <ID>113804</ID>
    <ParentID xsi:nil="true" />
    <ElementType>PartOfWord</ElementType>
    <X>764</X>
    <Y>324</Y>
    <Width>57</Width>
    <Height>67</Height>
    <Transcript>لم</Transcript>
    <Threshold>100</Threshold>
    <OriginX>806</OriginX>
    <OriginY>377</OriginY>
  </DocumentElement>
  <DocumentElement>
    <ID>113805</ID>
    <ParentID xsi:nil="true" />
    <ElementType>PartOfWord</ElementType>
    <X>831</X>
    <Y>332</Y>
    <Width>8</Width>
    <Height>42</Height>
    <Transcript>ا</Transcript>
    <Threshold>100</Threshold>
    <OriginX>835</OriginX>
    <OriginY>350</OriginY>
  </DocumentElement>
    .
    .
    .
  <DocumentElement>
    <ID>113808</ID>
    <ParentID xsi:nil="true" />
    <ElementType>PartOfWord</ElementType>
    <X>717</X>
    <Y>318</Y>
    <Width>27</Width>
    <Height>66</Height>
    <Transcript>ذ</Transcript>
    <Threshold>100</Threshold>
    <OriginX>736</OriginX>
    <OriginY>375</OriginY>
  </DocumentElement>
</ArrayOfDocumentElement>

Database Parsing Script

The following section contains a one-click parser for ease of use. The parser reads the xml file for each manuscript page, segments the subwords, and stores each one as its own file in a folder named after the page. The subword images for each page are numbered, and a text file pairing each subword image number with its label is stored alongside them.

import xml.etree.ElementTree as ET
import os
import cv2

manuscripts = ['3157556', '3158466', '3249138', '3368132', '3426930']
paddings = ['00', '_0', '00', '_00', '00']
ranges = [[3, 100], [5, 72], [7, 57], [4, 50], [10, 79]]
root_path = '/DATA/majeek/dataset/'

for manuscript, padding, one_range in zip(manuscripts, paddings, ranges):
    print('parsing manuscript', manuscript)
    path = root_path + manuscript + '/'
    for current in range(one_range[0], one_range[1] + 1):
        for perm in (1, 2):
            # page file names are zero-padded per manuscript convention
            if current < 10:
                curImgName = padding + '0' + str(current) + '-' + str(perm)
            elif current < 100:
                curImgName = padding + str(current) + '-' + str(perm)
            else:
                curImgName = '0' + str(current) + '-' + str(perm)

            img = cv2.imread(path + curImgName + '.png')
            tree = ET.parse(path + curImgName + '.xml')
            root = tree.getroot()

            # one output folder per page, with a CCs subfolder for the crops
            segmentsDir = path + 'segments/' + curImgName + '/'
            ccDir = segmentsDir + 'CCs/'
            os.makedirs(ccDir, exist_ok=True)

            idx = 0
            pixelPadding = 3
            annotations = []
            idxList = []
            for documentElement in root:  # one element per bounding box
                if documentElement.find('ElementType').text != 'PartOfWord':
                    continue
                x = int(documentElement.find('X').text)
                y = int(documentElement.find('Y').text)
                width = int(documentElement.find('Width').text)
                height = int(documentElement.find('Height').text)

                try:
                    roi = img[y - pixelPadding:y + height + pixelPadding,
                              x - pixelPadding:x + width + pixelPadding]
                    transcript = documentElement.find('Transcript').text
                    annotations.append(transcript)
                    idxList.append(idx)
                    cv2.imwrite(ccDir + manuscript + '-' + curImgName + '-' +
                                str(idx) + '.png', roi)
                except (TypeError, AttributeError) as err:
                    print('error', idx, curImgName, err)
                finally:
                    idx += 1
            # write index/label pairs as UTF-8 text, one subword per line
            with open(segmentsDir + curImgName + '.txt', 'w',
                      encoding='utf-8') as outFile:
                for i, annotation in zip(idxList, annotations):
                    outFile.write(str(i) + '\t' + annotation + '\n')
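Once the parser has run, the per-page text files can be read back into memory with a small helper. A minimal sketch, assuming the tab-separated index/label layout produced above and UTF-8 encoding (the function name is my own):

```python
def load_annotations(txt_path):
    """Return {subword index: transcription} from one page's
    tab-separated annotation file."""
    labels = {}
    with open(txt_path, encoding='utf-8') as f:
        for line in f:
            idx, transcript = line.rstrip('\n').split('\t')
            labels[int(idx)] = transcript
    return labels
```

The integer keys match the numbers in the crop file names under CCs/, so each label can be paired with its subword image.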

Final Words

I hope you enjoyed this tutorial. Please use the dataset in your research and publish your work for the betterment of humanity.

If you have any questions or comments, feel free to email me at majeek @ cs bgu ac il.

More information regarding my work and me can be found at my webpage, and my linkedin.
