Thursday, December 11, 2008

Index Based XML Parser - Introduction

This work aims to create an efficient DOM parser for XML documents. The main attraction of IBXP (Index Based XML Parser) is a significant reduction in processing and storage overhead, achieved through an indexing mechanism and lighter data structures.

Existing DOM parsers break the input XML string (a character sequence) into many small substrings (tokens) and then create a tree of objects (the DOM) in memory. In addition, heavy data structures such as Maps and Lists are used to store attributes and child nodes respectively. This process consumes both time and memory.
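
To make the object overhead concrete, here is a small sketch that parses a document with the standard JAXP DocumentBuilder and counts the node objects in the resulting tree. The class name ConventionalDomExample and the helper countNodes are made up for this illustration. The count covers only nodes reachable through the DOM API and says nothing about a particular parser's internal allocations.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

/** Illustrative only: shows how many DOM node objects a tiny document produces. */
final class ConventionalDomExample {

    /** Parses an XML string into a conventional object-tree DOM via JAXP. */
    static Document parse(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return builder.parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }

    /** Counts this node, its attributes, and all descendant nodes. */
    static int countNodes(Node node) {
        int total = 1;                                // the node itself
        NamedNodeMap attrs = node.getAttributes();    // null for non-element nodes
        if (attrs != null) {
            total += attrs.getLength();               // one Attr object per attribute
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            total += countNodes(children.item(i));
        }
        return total;
    }

    public static void main(String[] args) {
        Document doc = parse("<book id=\"1\"><title>XML</title></book>");
        // book element + id attribute + title element + text node
        System.out.println(countNodes(doc.getDocumentElement()));
    }
}
```

Even this one-line document yields four separate node objects, in addition to the strings that hold their names and values.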

In IBXP, the (optimized) input XML is stored as a single string, and a data structure containing integer offsets is created to represent the DOM. Only offsets are stored to navigate through the elements, attributes, children and other nodes, which significantly reduces both initial processing time and memory consumption. Subsequent updates to the DOM are maintained in a separate modify-buffer. The algorithm also uses metadata to speed up searches.
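
The offset idea can be sketched roughly as follows. This is not the IBXP implementation. It is a minimal illustration that assumes a simplified XML subset (only start tags, end tags and text; no attributes, comments, self-closing tags or modify-buffer), and the class OffsetIndexSketch with all of its fields and methods is hypothetical.

```java
import java.util.Arrays;

/** Illustrative only: indexes the elements of a simplified XML subset as int offsets. */
final class OffsetIndexSketch {
    private final String xml;                  // the whole document, kept as one string
    private int[] nameStart = new int[8];      // offset of the first character of the tag name
    private int[] nameEnd = new int[8];        // offset just past the tag name
    private int[] contentStart = new int[8];   // offset just past the '>' of the start tag
    private int[] contentEnd = new int[8];     // offset of the '<' of the matching end tag
    private int[] parent = new int[8];         // index of the parent element, -1 for the root
    private int count = 0;

    OffsetIndexSketch(String xml) {
        this.xml = xml;
        int current = -1;                      // index of the innermost open element
        int pos = xml.indexOf('<');
        while (pos >= 0) {
            int close = xml.indexOf('>', pos);
            if (close < 0) throw new IllegalArgumentException("unterminated tag at " + pos);
            if (xml.charAt(pos + 1) == '/') {  // end tag: close the innermost open element
                if (current < 0) throw new IllegalArgumentException("unexpected end tag at " + pos);
                contentEnd[current] = pos;     // tag names are not cross-checked in this sketch
                current = parent[current];
            } else {                           // start tag: record offsets, allocate no node object
                ensureCapacity();
                nameStart[count] = pos + 1;
                nameEnd[count] = close;
                contentStart[count] = close + 1;
                contentEnd[count] = close + 1; // patched when the end tag is seen
                parent[count] = current;
                current = count++;
            }
            pos = xml.indexOf('<', close + 1);
        }
        if (current != -1) throw new IllegalArgumentException("unclosed element");
    }

    /** Doubles all offset arrays when they are full. */
    private void ensureCapacity() {
        if (count < parent.length) return;
        int n = parent.length * 2;
        nameStart = Arrays.copyOf(nameStart, n);
        nameEnd = Arrays.copyOf(nameEnd, n);
        contentStart = Arrays.copyOf(contentStart, n);
        contentEnd = Arrays.copyOf(contentEnd, n);
        parent = Arrays.copyOf(parent, n);
    }

    int size() { return count; }

    int parentOf(int node) { return parent[node]; }

    /** Substrings are created only on demand, not during indexing. */
    String name(int node) { return xml.substring(nameStart[node], nameEnd[node]); }

    String content(int node) { return xml.substring(contentStart[node], contentEnd[node]); }

    /** Returns the index of the first element with the given name, or -1 if none matches. */
    int findFirst(String name) {
        for (int i = 0; i < count; i++) {
            int len = nameEnd[i] - nameStart[i];
            // Compare against the source string in place, without allocating a substring.
            if (len == name.length() && xml.regionMatches(nameStart[i], name, 0, len)) return i;
        }
        return -1;
    }

    public static void main(String[] args) {
        OffsetIndexSketch idx =
            new OffsetIndexSketch("<book><title>XML</title><author>Ann</author></book>");
        int author = idx.findFirst("author");
        System.out.println(idx.name(author) + " = " + idx.content(author));
    }
}
```

In this sketch each element costs five integers rather than a node object plus its name and value strings, and a lookup such as findFirst compares names directly against the source string. The linear scan is only a placeholder: the metadata-assisted search is not reproduced here.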

Thus, IBXP gains efficiency by using indexes and reducing object creation, and could save significant processing time and memory compared with existing open source DOM parsers. The algorithm currently targets the Java platform and will support the current version of JAXP; with slight modifications, it should work on any platform.