In the April issue of the Pivotal Insight newsletter we addressed
using Enterprise Content Management (ECM) systems to manage structured
and unstructured data. This approach to data management is sound
for consolidating infrastructure and reducing overall complexity.
But to truly gain business value from unstructured data, those data
have to be given some of the same properties as structured data,
such as the ability to search and find the appropriate data, to
query the data, and to provide reports. Text searching and result
scoring may help you find the document you are looking for, but
they do little to help you query the data or run reports from it,
and they do not scale well past a few hundred thousand documents.
There are a number of approaches that content management software
vendors and those industries that leverage ECM are taking to bring
structure to unstructured data. Some in the industry call the resulting
data “semi-structured” because not all of the unstructured
data need to be captured in a structured format. In this article,
we present two case studies of how different organizations are
working to structure unstructured data. Problems persist, but
these cases may provide insight into the applicability of these
technologies for your organization.
SEC Financial Reporting
For years publicly traded companies have had to file standard financial
reports with the SEC using the EDGAR system. A company, for example,
sent the SEC an electronic form 10-Q via FTP in a text format. The
SEC augmented the unstructured text data with basic structured data
such as company name, address, filing type, and filing date for
basic indexing purposes. The form 10-Q text submitted was really
no different from the information any investor received in the mail;
it made fine reading for humans but was useless for machine processing.
The real meat of the filing remained in a highly unstructured format
that provided no value from a business intelligence standpoint.
Querying and reporting on this kind of data is extremely difficult
without manually capturing the pertinent values.
Under the Sarbanes-Oxley Act, the Accounting Review Board is trying
to standardize the meanings of certain financial terms (e.g., GAAP)
and structure the data and terms into an XML format, Extensible
Business Reporting Language (XBRL). These changes will allow organizations
to search filings and compile analyses on companies’ financial
reports because the semantics will be common. In essence, income
as reported by Microsoft will have the same definition as income
as reported by Apple (although the values may differ).
The data are tagged so that they become structured and thus reportable.
These changes bring a whole new level of atomic detail to the data
and make true data mining of the SEC filings possible, which many
argue will allow for better financial performance monitoring.
Issues remain. Still unresolved, for example, is how to account
for footnotes and notes to consolidated financial statements.
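To make the tagging idea concrete, here is a minimal Python sketch.
It assumes a heavily simplified, hypothetical filing layout: the
element names (NetIncome, Revenue), the placeholder companies, and
the figures are all invented for illustration, and real XBRL instances
use namespaced taxonomy elements with contexts and units. The point
is only that once two filings share a tag with a common definition,
the same query works on both.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified filings. The tag names, companies,
# and figures are invented for illustration; this is not real XBRL.
FILINGS = {
    "Company A": """
        <filing>
          <NetIncome unit="USD" period="Q1">1200</NetIncome>
          <Revenue unit="USD" period="Q1">9000</Revenue>
        </filing>
    """,
    "Company B": """
        <filing>
          <NetIncome unit="USD" period="Q1">800</NetIncome>
          <Revenue unit="USD" period="Q1">4000</Revenue>
        </filing>
    """,
}


def extract_fact(xml_text, tag):
    """Return the numeric value of the first child named `tag`, or None."""
    root = ET.fromstring(xml_text.strip())
    element = root.find(tag)
    if element is None or element.text is None:
        return None
    return float(element.text)


def compare(filings, tag):
    """Pull the same tagged fact out of every filing for side-by-side use."""
    return {company: extract_fact(xml, tag) for company, xml in filings.items()}


def net_margin(xml_text):
    """Derive a ratio from two tagged facts; None if either is unavailable."""
    income = extract_fact(xml_text, "NetIncome")
    revenue = extract_fact(xml_text, "Revenue")
    if income is None or not revenue:
        return None
    return income / revenue


if __name__ == "__main__":
    print(compare(FILINGS, "NetIncome"))
    for company, xml in FILINGS.items():
        print(company, net_margin(xml))
```

Because both filings use the same tag for the same concept, compare
and net_margin need no per-company parsing logic. Free-text material
such as footnotes has no tag to query in this sketch, which is the
unresolved issue noted above.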
Department of Education Financial Aid Application Management
The U.S. Department of Education (DoE) is tackling the issue of
structuring unstructured data in its e-grant program using a very
common document storage format, the Adobe PDF. Applicants can complete
the Free Application for Federal Student Aid (FAFSA) on paper, through
electronic PDF forms, or via a web-based form. Regardless of the
format the applicant selects, the data are stored in the same place;
the images of paper documentation (the unstructured data) are stored
in conjunction with the structured data. The data capture process
for the structured data is automatic and requires no data entry
on the part of DoE because every field in the form has an XML tag
and definition behind it. The different application submittal methods
follow the paths outlined below.

An added benefit of using Adobe’s e-form is that a digital
signature, password protection, and data encryption can all be
incorporated into the PDF document. The web form requires a PIN to be
mailed to the user and entered into the system as a form of digital
signature.
DoE would like to move all forms to electronic or web formats, but
the breadth of the applicant population it must reach rules out
such an undertaking at this time. Apart from the optical character
recognition step and related hardware costs that paper forms add,
the resulting data are the same.
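The automatic capture described earlier, in which every form field
has an XML tag behind it, can be sketched in Python as follows. This
is an illustration under stated assumptions, not DoE's actual system:
the submission layout, the field names, and the capture function are
hypothetical, and the sample applicant is a placeholder.

```python
import xml.etree.ElementTree as ET

# Hypothetical field names and layout, not the real application schema.
REQUIRED_FIELDS = ("last_name", "first_name", "date_of_birth")

# Placeholder submission; the applicant data is invented.
SAMPLE = """
<form id="aid-application">
  <field name="last_name">Doe</field>
  <field name="first_name">Jane</field>
  <field name="date_of_birth">1990-01-01</field>
</form>
"""


def capture(submission_xml, channel):
    """Turn a tagged form submission into a flat record without manual keying.

    `channel` records how the form arrived, e.g. "paper", "pdf", or "web".
    """
    root = ET.fromstring(submission_xml.strip())
    # Each <field> carries its own name, so values map straight to columns.
    fields = {f.get("name"): (f.text or "").strip() for f in root.iter("field")}
    missing = [name for name in REQUIRED_FIELDS if not fields.get(name)]
    if missing:
        raise ValueError(f"missing required fields: {missing}")
    return {"channel": channel, "fields": fields}


if __name__ == "__main__":
    for channel in ("paper", "pdf", "web"):
        print(capture(SAMPLE, channel))
```

In this sketch the channel changes only the metadata attached to
the record. The captured fields are identical however the form arrived,
mirroring the point that the resulting data are the same.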
Both these examples show that, given strong planning and the right
application of technology, organizations can bring structure to
their unstructured data. That, combined with an ECM system, provides
a powerful mechanism for processing large amounts of data in multiple
formats with the least amount of complexity and cost.