Taming Unstructured Data with Enterprise Content Management

Only about 15 to 20 percent of most commercial organizations' operational data resides in a structured format (databases), while the remaining 80 to 85 percent resides in some unstructured format (e-mail, documents, images, etc.). Government organizations typically have even more unstructured data because of the need for paper-based forms. For example, agencies that provide citizen benefits such as health services and immigration, and agencies that require documentation from external sources, such as investigative and intelligence agencies, have large unstructured content repositories.

Over the past seven years, many agencies have deployed e-forms using technologies like the Adobe PDF format, while others have scanned paper documents into bitmapped images for on-line retrieval through a document management system. These technologies have allowed agencies to increase effectiveness and reduce costs by speeding up processing and cutting paper management and hard-copy file storage. However, the actual data are often "stuck" in the PDF or the image and not readily available for querying and reporting. Many organizations are now finding that, while they achieved some efficiency by reducing the use of paper, they cannot use those data as effectively as they could if the data existed in a structured format.

The need to bring structure to unstructured data is growing, and technologies are emerging to help solve that problem. Enterprise Content Management (ECM) is an emerging field that takes document management, database management, and knowledge management to the next level.

"Structured data" is a fairly straightforward concept: it is any data whose composition is enforced down to its atomic data types. Most correctly designed relational database management systems (RDBMS) store data in a structured format. The relationships between elements are understood, the data have context because they usually support a specific process, and the data can be queried and reported.
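As a minimal sketch of what "enforced composition" can look like in practice, the following uses Python's built-in sqlite3 module. The `applicant` and `claim` tables, their columns, and the function names are hypothetical and chosen only for illustration; they do not describe any particular agency system.

```python
import sqlite3


def build_benefits_db():
    """Create an in-memory database whose schema enforces types and relationships.

    The applicant/claim tables are hypothetical examples, not a real agency schema.
    """
    conn = sqlite3.connect(":memory:")
    # SQLite only enforces foreign keys when this pragma is switched on.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(
        """
        CREATE TABLE applicant (
            applicant_id INTEGER PRIMARY KEY,
            name         TEXT NOT NULL
        );
        CREATE TABLE claim (
            claim_id     INTEGER PRIMARY KEY,
            applicant_id INTEGER NOT NULL REFERENCES applicant(applicant_id),
            amount       REAL NOT NULL CHECK (amount >= 0)
        );
        """
    )
    return conn


def total_claims_by_applicant(conn):
    """Report total claim amount per applicant, sorted by name."""
    return conn.execute(
        """
        SELECT a.name, SUM(c.amount)
        FROM applicant AS a
        JOIN claim AS c ON c.applicant_id = a.applicant_id
        GROUP BY a.name
        ORDER BY a.name
        """
    ).fetchall()
```

Because the schema declares the relationship between the two tables and constrains each column, the database can reject malformed rows and answer reporting queries directly, which is the property that unstructured content lacks.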

Unstructured data, on the other hand, are everything else. To be more precise, unstructured data fall into two basic categories:

  1. bitmap objects: non-language based images, video, or audio files; and,
  2. textual objects: language-based, word-processed documents, e-mails, spreadsheets, PDFs, etc.

Both object types are considered data, but they are formatted, retrieved, and rendered in different ways. Many technologies exist to parse unstructured textual objects and create structured data from them. Parsing bitmap objects, however, is much harder. Outside of optical character recognition (OCR), this type of technology still resides in the scientific and intelligence communities and has not reached the maturity level needed for effective business application.
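To illustrate the easier of the two cases, here is a small sketch of pulling structured fields out of a textual object with a regular expression. It assumes the text contains "Label: value" lines; the labels (`Case Number`, `Applicant`, `Date Received`) and the function name are hypothetical, and real documents would usually need far more robust parsing.

```python
import re

# Hypothetical field labels; a real document would define its own.
FIELD_PATTERN = re.compile(
    r"^(Case Number|Applicant|Date Received):[ \t]*(.+?)[ \t]*$",
    re.MULTILINE,
)


def extract_fields(text):
    """Return a dict of the labeled fields found in free text.

    Lines that do not match a known label are ignored.
    """
    return {label: value for label, value in FIELD_PATTERN.findall(text)}
```

Text that does not follow the expected labels is simply skipped, so the result is only as complete as the document's formatting is consistent.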

Many agencies that make effective use of OCR (e.g., the IRS) can do so because the documents they scan and convert already have a highly structured layout. In this scenario, organizations store the scanned images in a document management system and the matching structured data in a traditional database. The records are linked by a key, such as a social security number.
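The key-based link described above can be sketched as follows. This is an illustration under stated assumptions, not a description of any agency's system: the `ScannedImage` and `FormRecord` types, the `link_by_key` function, and the in-memory lists standing in for the document management system and the database are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ScannedImage:
    """A stand-in for an entry in a document management system."""
    record_key: str  # shared key, e.g. an identifier printed on the form
    image_path: str


@dataclass
class FormRecord:
    """A stand-in for a row of structured data captured by OCR."""
    record_key: str
    fields: dict = field(default_factory=dict)


def link_by_key(images, records):
    """Pair each structured record with the scanned images that share its key."""
    paths_by_key = {}
    for image in images:
        paths_by_key.setdefault(image.record_key, []).append(image.image_path)
    return {
        record.record_key: (record.fields, paths_by_key.get(record.record_key, []))
        for record in records
    }
```

The two stores stay separate in this arrangement; only the shared key ties a row to its source images, which is why the integration effort grows as more repositories are involved.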

This is the entry point for enterprise solution vendors, which are starting to offer ways to link unstructured data with traditional structured data. For example, SAP, Siebel, PeopleSoft, and Oracle now offer mechanisms to hook into ECM tools like FileNet, Documentum, and IBM Content Manager. Many companies are using these links to create holistic customer views, as is the case for CRM. ERP vendors need scalable ways to handle volumes of scanned invoices, and complex web sites require sophisticated content management and publishing capabilities. However, this method of application and data integration greatly increases complexity and cost. What is needed is a mechanism to store all that data and content in one repository.


Figure 1. Structured and unstructured data repositories running in parallel vs. a combined ECM storage management system.

ECM systems offer one way to combine the storage of structured and unstructured data in one master repository. At this point, however, these systems provide only data and content storage. While these solutions greatly reduce overall infrastructure costs and the complexity of application and data integration, they do not solve the problem of getting useful structured data out of the unstructured content. Doing that requires the creation of semi-structured data, which we will address in future newsletters with examples using XML and XBRL for Sarbanes-Oxley compliance.


©2006 Pivotal Insight, LLC. All rights reserved.