Bringing Structure to Unstructured Data

In the April issue of the Pivotal Insight newsletter, we addressed using Enterprise Content Management (ECM) systems to manage structured and unstructured data. This approach to data management is sound for consolidating infrastructure and reducing overall complexity. But to truly gain business value from unstructured data, those data must be given some of the same properties as structured data: the ability to search for and find the appropriate data, to query the data, and to produce reports. Text searching and result scoring may help you find the document you are looking for, but they do little to help you query the data or run reports from it, and they do not scale well or remain usable beyond a few hundred thousand documents.

Content management software vendors and the industries that leverage ECM are taking a number of approaches to bring structure to unstructured data. Some in the industry call the resulting data “semi-structured” because not all of the unstructured data need to be captured in a structured format. In this article, we present two case studies of how organizations are working to structure unstructured data. Problems persist, but these cases may provide insight into the applicability of these technologies for your organization.

SEC Financial Reporting
For years, publicly traded companies have had to file standard financial reports with the SEC using the EDGAR system. A company, for example, sent the SEC an electronic Form 10-Q via FTP in a text format. The SEC augmented the unstructured text with basic structured data, such as company name, address, filing type, and filing date, for basic indexing. The Form 10-Q text submitted was really no different from the information any investor received in the mail; it made great reading for humans but was of little use to machines. The real meat of the filing was still in a highly unstructured format that provided no value from a business intelligence standpoint. Data of this kind are extremely difficult to parse for querying and reporting unless the pertinent values are captured manually.
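The limitation described above can be sketched in a few lines of Python. This is only an illustration: the `Filing` class, its field names, and the sample records are assumptions made for the example, not the SEC's actual index layout.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass
class Filing:
    # Structured index fields of the kind the article lists (names are hypothetical).
    company_name: str
    filing_type: str
    filing_date: date
    # The filing itself: free text that the index knows nothing about.
    body_text: str


# Invented sample records, for illustration only.
SAMPLE_FILINGS = [
    Filing("Example Corp", "10-Q", date(2005, 5, 10), "Net income for the quarter rose."),
    Filing("Example Corp", "10-K", date(2005, 3, 1), "Annual results are discussed below."),
    Filing("Sample Inc", "10-Q", date(2004, 11, 9), "Revenue declined compared with last year."),
]


def find_filings(filings: List[Filing], filing_type: str, since: date) -> List[Filing]:
    """Query the structured fields only; the body text is never inspected."""
    return [
        f for f in filings
        if f.filing_type == filing_type and f.filing_date >= since
    ]


recent_quarterlies = find_filings(SAMPLE_FILINGS, "10-Q", date(2005, 1, 1))
```

A query such as "all 10-Q filings since January 2005" works because it touches only the index fields. A question such as "which companies reported rising net income" cannot be answered this way, because that information exists only inside `body_text`.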

Under the Sarbanes-Oxley Act, the Accounting Review Board is trying to standardize the meanings of certain financial terms (e.g., those defined under GAAP) and to structure the data and terms in an XML-based format, the Extensible Business Reporting Language (XBRL). Because the semantics will be common, these changes will allow organizations to search filings and compile analyses of companies’ financial reports. For example, income as reported by Microsoft will have the same definition as income as reported by Apple (although the values may differ). The data are tagged so that they become structured and thus reportable. These changes bring a new level of atomic detail to the data and make true data mining of SEC filings possible, which many argue will allow for better monitoring of financial performance.
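To illustrate why common tags make the data reportable, here is a minimal Python sketch. It assumes a heavily simplified tagged format: real XBRL uses namespaces, contexts, and units, and the element names (`NetIncome`, `Revenue`), company names, and figures below are invented for the example.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Optional

# Hypothetical, simplified tagged filings. These are not real XBRL documents.
TAGGED_FILINGS = {
    "Company A": "<filing><NetIncome>1200</NetIncome><Revenue>9000</Revenue></filing>",
    "Company B": "<filing><NetIncome>800</NetIncome><Revenue>4000</Revenue></filing>",
}


def extract_fact(xml_text: str, tag: str) -> Optional[float]:
    """Return the numeric value stored under `tag`, or None if it is absent."""
    root = ET.fromstring(xml_text)
    element = root.find(tag)
    if element is None or element.text is None:
        return None
    return float(element.text)


def compare(filings: Dict[str, str], tag: str) -> Dict[str, Optional[float]]:
    """Pull the same tagged fact from every filing so the values line up."""
    return {company: extract_fact(xml_text, tag) for company, xml_text in filings.items()}


def net_margin(xml_text: str) -> Optional[float]:
    """Derive a ratio from two tagged facts; None if either is missing or revenue is zero."""
    income = extract_fact(xml_text, "NetIncome")
    revenue = extract_fact(xml_text, "Revenue")
    if income is None or not revenue:
        return None
    return income / revenue


income_by_company = compare(TAGGED_FILINGS, "NetIncome")
```

Because both filings use the same tag for the same concept, one query returns comparable values for every company, and derived measures such as `net_margin` can be computed without reading any narrative text.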

Issues remain. Still unresolved, for example, are how to account for footnotes and notes to consolidated financial statements.

Department of Education Financial Aid Application Management
The U.S. Department of Education (DoE) is tackling the issue of structuring unstructured data in its e-grant program using a very common document storage format, the Adobe PDF. Applicants can complete the Free Application for Federal Student Aid (FAFSA) on paper, through electronic PDF forms, or via a web-based form. Regardless of the format the applicant selects, the data are stored in the same place; the images of paper documentation (the unstructured data) are stored in conjunction with the structured data. The data capture process for the structured data is automatic and requires no data entry on the part of DoE because every field in the form has an XML tag and definition behind it. The different application submittal methods follow the paths outlined below.

The added benefit of using Adobe’s e-form is that a digital signature can be incorporated into the PDF document, along with password protection and data encryption. The web form requires a PIN to be mailed to the user and entered into the system as a form of digital signature. DoE would like to move all forms to electronic or web formats, but the breadth of the customer base it must reach rules out such an undertaking at this time. Apart from the added optical character recognition step and related hardware costs for paper submissions, the resulting data are the same.
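The capture process described above, in which every form field carries an XML tag and paper images are stored alongside the structured data, might be sketched as follows. The XML layout, field names, `ApplicationRecord` class, and image path are all assumptions made for illustration; they do not describe DoE's actual schema or systems.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import Dict, Optional

# Hypothetical form payload: each field is identified by a name attribute.
SAMPLE_FORM_XML = """
<form id="example-application">
  <field name="last_name">Doe</field>
  <field name="household_size">4</field>
</form>
"""


@dataclass
class ApplicationRecord:
    fields: Dict[str, str]     # structured data captured from the tagged fields
    source: str                # "paper", "pdf", or "web"
    image_path: Optional[str]  # scanned image kept with paper submissions


def parse_form_xml(xml_text: str, source: str,
                   image_path: Optional[str] = None) -> ApplicationRecord:
    """Turn tagged form fields into one structured record, whatever the source."""
    root = ET.fromstring(xml_text)
    fields = {
        field.get("name"): (field.text or "").strip()
        for field in root.findall("field")
        if field.get("name")  # skip fields without a tag name
    }
    return ApplicationRecord(fields=fields, source=source, image_path=image_path)


# A PDF submission and a paper submission (after OCR) yield the same fields;
# only the paper record also points to a stored image.
pdf_record = parse_form_xml(SAMPLE_FORM_XML, "pdf")
paper_record = parse_form_xml(SAMPLE_FORM_XML, "paper", image_path="scans/example-0001.tif")
```

The point of the sketch is that the structured fields are identical across submission paths, so downstream queries and reports do not need to know how an application arrived.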

Both of these examples show that, given strong planning and the right application of technology, organizations can bring structure to their unstructured data. That structure, combined with an ECM system, provides a powerful mechanism for processing large amounts of data in multiple formats with the least complexity and cost.

©2006 Pivotal Insight, LLC. All rights reserved.