Only about 15 percent to 20 percent of most commercial organizations'
operational data resides in a structured format (databases) while
the remaining 80 percent to 85 percent resides in some form of unstructured
format (e-mail, documents, images, etc.). Government organizations
typically have even more unstructured data because of the need for
paper-based forms. For example, agencies that provide citizen benefits,
such as health services and immigration, and agencies that require
documentation from external sources, such as investigative and intelligence
agencies, have large unstructured content repositories.
Over the past seven years many agencies have deployed e-forms using
technologies like the Adobe PDF format, while others have scanned
paper documents into bitmapped images for on-line retrieval through
a document management system. Use of these technologies has allowed
agencies to increase effectiveness and reduce costs by increasing
processing speed and reducing paper management and hard-copy file
storage. However, the actual data are often "stuck" in
the PDF or the image and not readily available for querying and
reporting. Many organizations are now finding that while they achieved
some efficiency by reducing the use of paper, they now cannot use
those data as effectively as they could if they existed in a structured
format.
The need to bring structure to unstructured data is growing, and
technologies are emerging to help solve that problem. Enterprise
Content Management (ECM) is an emerging field that takes document
management, database management, and knowledge management to the
next level.
"Structured data" is a fairly straightforward concept:
it is anything whose composition is enforced down to the atomic data
types. Most correctly designed relational database management systems
(RDBMS) store data in a structured format. The relationships between
elements are understood, the data have context because they usually support
a specific process, and the data can be queried and reported.
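As a rough illustration of "enforced composition," the sketch below uses Python's built-in sqlite3 module. The table name, column names, and sample values are hypothetical and chosen only for this example; the point is that the database itself rejects a value that does not fit the declared type, which is what makes the stored data reliably queryable.

```python
import sqlite3

# In-memory database; the schema below is a hypothetical example.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE benefit_claims (
        claim_id   INTEGER PRIMARY KEY,
        claimant   TEXT    NOT NULL,
        filed_on   TEXT    NOT NULL CHECK (filed_on LIKE '____-__-__'),
        amount_usd INTEGER NOT NULL
                   CHECK (typeof(amount_usd) = 'integer' AND amount_usd >= 0)
    )
""")

def try_insert(row):
    """Insert one (claimant, filed_on, amount_usd) row; report whether it was accepted."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO benefit_claims (claimant, filed_on, amount_usd) "
                "VALUES (?, ?, ?)",
                row,
            )
        return True
    except sqlite3.IntegrityError:
        return False

# A well-formed row is stored; free text in a numeric field is refused.
accepted = try_insert(("J. Doe", "2004-01-05", 1200))
rejected = try_insert(("J. Doe", "2004-01-05", "about twelve hundred"))

# Because every stored amount is an integer, aggregation is safe.
total = conn.execute("SELECT SUM(amount_usd) FROM benefit_claims").fetchone()[0]
print(accepted, rejected, total)
```

The same sentence typed into a word-processed document would be accepted without complaint, which is the practical difference between the two kinds of data.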
Unstructured data, on the other hand, are everything else. To be
more precise, unstructured data contain two basic categories of
data:
- bitmap objects: non-language-based images, video, or audio files; and
- textual objects: language-based, word-processed documents, e-mails, spreadsheets, PDFs, etc.
Both object types are considered data, but they are formatted,
retrieved, and rendered in different ways. Many technologies exist
to parse and create structured data from unstructured textual objects.
However, parsing bitmap object data is much harder. Outside of optical
character recognition (OCR), this type of technology still resides
in the scientific and intelligence communities and has not reached
the maturity level needed for effective business application.
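To make the textual case concrete, here is a minimal sketch of pulling structured fields out of an e-mail using only Python's standard library. The message text, the addresses, and the claim-number format (four digits, a hyphen, five digits) are invented for this example; a real parser would need rules matched to the actual documents.

```python
import re
from email import message_from_string

# Hypothetical e-mail; all names and numbers are made up for illustration.
RAW_EMAIL = """\
From: jane.doe@example.gov
To: benefits@example.gov
Subject: Claim 2004-00123 status

Please update me on claim 2004-00123. My case worker is R. Smith.
"""

# Assumed claim-number format: four digits, a hyphen, five digits.
CLAIM_PATTERN = re.compile(r"\b(\d{4}-\d{5})\b")

def extract_record(raw_text):
    """Turn one e-mail into a flat dictionary suitable for a database row."""
    msg = message_from_string(raw_text)
    body = msg.get_payload()  # plain string for a non-multipart message
    searchable = msg["Subject"] + "\n" + body
    claim_ids = sorted(set(CLAIM_PATTERN.findall(searchable)))
    return {
        "sender": msg["From"],
        "recipient": msg["To"],
        "subject": msg["Subject"],
        "claim_ids": claim_ids,
    }

record = extract_record(RAW_EMAIL)
print(record)
```

The headers are already semi-structured, so they map directly to fields; the body needs pattern matching, and anything the pattern does not anticipate stays unstructured.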
Many agencies that make effective use of OCR (e.g., the IRS) can
do so because the documents they scan and convert already contain
highly structured formatting. In this scenario, organizations store
both the scanned images in a document management system and the
matching structured data in a traditional database. The records
are linked by a key, such as a social security number.
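The linked-by-key arrangement can be sketched as two tables joined on a shared identifier. This is only an illustration using Python's built-in sqlite3 module: the table names, columns, file paths, and the placeholder identifier are hypothetical, and in practice the image store would be a separate document management system rather than a second table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Structured data captured from the form (hypothetical schema).
    CREATE TABLE filings (
        filer_id   TEXT PRIMARY KEY,   -- the linking key
        filer_name TEXT NOT NULL,
        tax_year   INTEGER NOT NULL
    );
    -- Stand-in for the document management system's index of scanned images.
    CREATE TABLE scanned_documents (
        doc_id     INTEGER PRIMARY KEY,
        filer_id   TEXT NOT NULL REFERENCES filings(filer_id),
        image_path TEXT NOT NULL
    );
""")

# Placeholder identifier and paths, invented for this example.
conn.execute("INSERT INTO filings VALUES ('000-00-0001', 'J. Doe', 2003)")
conn.executemany(
    "INSERT INTO scanned_documents (filer_id, image_path) VALUES (?, ?)",
    [
        ("000-00-0001", "/scans/2003/form_page1.tif"),
        ("000-00-0001", "/scans/2003/form_page2.tif"),
    ],
)

def documents_for(filer_id):
    """Return the structured name plus every scanned image linked by the key."""
    rows = conn.execute(
        """
        SELECT f.filer_name, d.image_path
        FROM filings AS f
        JOIN scanned_documents AS d ON d.filer_id = f.filer_id
        WHERE f.filer_id = ?
        ORDER BY d.doc_id
        """,
        (filer_id,),
    ).fetchall()
    return rows

linked = documents_for("000-00-0001")
print(linked)
```

The query can find the images for a filer, but it still cannot search inside them; only the fields copied into the structured table are available for reporting.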
This is the entry point for enterprise solution vendors, which are
starting to offer ways to link unstructured data with the traditional
structured data in the same repository. For example, SAP, Siebel,
PeopleSoft, and Oracle now offer mechanisms to hook into ECM tools
like FileNet, Documentum, and IBM Content Manager. Many companies
are using these links to create holistic customer views, as is the
case for CRM. Other demands drove the same integration: ERP vendors
needed scalable ways to handle volumes of scanned invoices, and complex
web sites required sophisticated content management and publishing
capabilities. However, this method of application and data integration
greatly increases complexity and cost. What is needed is a mechanism
to store all of that data and content in one repository.

Figure 1. Structured and Unstructured data repositories running in parallel vs. a combined ECM storage management system.
ECM systems offer one way to combine the storage of structured and unstructured data in one master repository. At this point, however, these systems provide only data and content storage. While these solutions greatly reduce overall infrastructure costs and the complexity of application and data integration, they do not solve the problem of getting useful structured data out of the unstructured content. Doing that requires the creation of semi-structured data, which we will address in future newsletters with examples using XML and XBRL for Sarbanes-Oxley compliance.