Introduction to Data Integration


This course teaches you how to perform data integration, i.e., the merging of two or more disparate data sources into a single coherent data set to support the information needs of the target business or enterprise. The methods and procedures described herein are based on, and take advantage of, the power of XML and the robustness of the tools available to manipulate data once it has been 'tagged' in accordance with the XML syntax.

It is recommended that you read all sections in the order in which they are  presented since the examples and techniques of each new chapter build on those  introduced in the preceding ones.

No in-depth knowledge of database design, data modeling or programming is  required but a certain familiarity with them is definitely an asset.

To make the most of this course you should have access to a computer running Windows or some other operating system such as Linux, and you should have a suite of office tools capable of performing word processing, spreadsheet manipulations and basic database operations, e.g., MS Office or StarOffice.

Finally, you should also have an XML-enabled browser such as Internet  Explorer version 5.0 or higher.


Part 1

Chapter 1 provides the road map for the entire book. It gives a high-level description of the methods and procedures for integrating data using XML technologies and the role each of them plays in the process.

Chapter 2 covers the essential concepts of XML. The reader should be aware  that this is not a book about XML, but about how to use XML for the stated  purpose of data integration. Therefore, specialized aspects of the XML  technologies such as browser rendering, style sheets, etc., are not covered.  If you need to learn more about those aspects of XML you should consult the  specialized literature (see the suggested Bibliography at the end of this  introduction).

Chapter 3 covers the World Wide Web Consortium (W3C) XML Schema Definition (XSD) language. Although the main use of XSDs is for validation, in this book we look at schemas as a form of ‘data modeling’ in the context of XML, i.e., they represent the specification of the semantics and syntax of the data as well as their relationships.

Part 2

Chapter 4 covers XSD alignment using the results developed in Chapter 3. Specifically, it shows how to apply basic data modeling concepts to perform semantic and syntactic alignment, starting with the XSDs created for the data sources that will be integrated.

Chapter 5 covers the use of Extensible Stylesheet Language Transformations (XSLT) to convert the properly tagged data sources into XML instance documents that conform to the semantics and syntax specification of the integrated data source (the target XSD). Again, the focus here is not to explore all the variations and applications of these two rich areas, but simply to provide you with the tools required to merge data from different sources via XML in an efficient manner.

Part 3

Chapter 6 describes the theory behind Bayesian algorithms.

Chapter 7 applies the Bayesian approach to the sample data created in Chapter 4. The final data set is one that shows no data redundancy and reflects the data structure of the integrated data source. The chapter also shows how the data load can be performed using XML.

Statement of the Problem

Health care providers collect and maintain large quantities of data. Except in rare cases, the structure of the data of one health care provider bears no similarity to the structure of any other. Yet data communication and data sharing are becoming more important as organizations see the advantages of integrating their activities and the cost benefits that accrue when data can be reused rather than recreated from scratch.

Given this diversity of data sources, how does one efficiently and reliably  integrate them? One approach, described in this book, is to leverage the power  and ease of implementation of eXtensible Markup Language (XML) technologies to  accomplish that goal.

Role of XML Technologies

Figure 0-1 below depicts the roles and steps necessary to accomplish data integration using XML technologies. According to this methodology, the first step consists of taking the raw data sources (spreadsheets, text files, etc.) and converting them into well-formed XML documents. This is needed to take advantage of all the other XML technologies, which will enable the health care provider to integrate relevant data for its organization.
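
As a minimal sketch of this first step, assuming a comma-delimited source with a header row, the following Python uses only the standard library to wrap each row in an element. The sample data, the function name csv_to_xml and the element names patients and patient are hypothetical illustrations, not taken from the course material.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical raw data source: a comma-delimited list with a header row.
RAW = "id,last_name,birth_date\n1,Smith,1970-01-31\n2,Jones,1985-11-02\n"


def csv_to_xml(text, root_tag, row_tag):
    """Wrap every row of a delimited text in an element, one child per column."""
    root = ET.Element(root_tag)
    for row in csv.DictReader(io.StringIO(text)):
        record = ET.SubElement(root, row_tag)
        for column, value in row.items():
            # Assumes the column headers are already valid XML names.
            ET.SubElement(record, column).text = value
    return root


root = csv_to_xml(RAW, "patients", "patient")
xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```

Because the serializer escapes special characters such as & and <, the result is well-formed as long as the column headers are valid element names; real spreadsheets often need their headers cleaned up first.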

Once data has been XMLized the next step is to analyze and document its structure. Within the context of XML technologies this is done by creating the XML schemas (XSDs) for each of the data sources. As we will see, this is basically another form of data modeling, i.e., understanding and specifying the semantics and syntax of the data, as well as their relationships.
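
Before writing a schema by hand it can help to survey what the XMLized data actually contains. The sketch below, which assumes a flat list of records and uses hypothetical sample data and a hypothetical helper named child_occurrences, counts how often each child element appears per record. The observed minimum and maximum only suggest values for the minOccurs and maxOccurs attributes of an XSD; a sample cannot prove what the source allows.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical XMLized source; the second record has no phone element.
SOURCE = """<patients>
  <patient><id>1</id><last_name>Smith</last_name><phone>555-0100</phone></patient>
  <patient><id>2</id><last_name>Jones</last_name></patient>
</patients>"""


def child_occurrences(root, record_tag):
    """For each child element name, return (min, max) occurrences per record."""
    records = root.findall(record_tag)
    counts = [Counter(child.tag for child in record) for record in records]
    names = sorted({name for count in counts for name in count})
    # A Counter yields 0 for names that a record does not contain.
    return {n: (min(c[n] for c in counts), max(c[n] for c in counts)) for n in names}


occurrences = child_occurrences(ET.fromstring(SOURCE), "patient")
for name, (low, high) in occurrences.items():
    print(f"{name}: observed minOccurs={low}, maxOccurs={high}")
```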

Once the different XSDs, i.e., the ‘data models’ for each of the data sources, are available, the health care provider can begin to align them and generate a single integrated XSD, i.e., an integrated data model for all of them.

Out of this XSD we can create both the target database and the specifications for how to recast the XMLized data sources in terms of this new vocabulary. This transformation from the original data structure into the target data structure is readily carried out with another XML technology, namely, Extensible Stylesheet Language Transformations (XSLT).
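
To make the idea of recasting concrete, the sketch below renames the elements of one source into a target vocabulary. It is written in plain Python rather than XSLT, so it is a stand-in for the technique the text describes and not an XSLT stylesheet; the mapping and all element names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from one source's element names to the target vocabulary.
RENAME = {
    "patients": "persons",
    "patient": "person",
    "last_name": "familyName",
    "birth_date": "dateOfBirth",
}


def recast(element, rename):
    """Return a copy of the tree with element names mapped to the target vocabulary."""
    # Names missing from the mapping are carried over unchanged.
    target = ET.Element(rename.get(element.tag, element.tag), element.attrib)
    target.text = element.text
    target.tail = element.tail
    for child in element:
        target.append(recast(child, rename))
    return target


source = ET.fromstring(
    "<patients><patient><last_name>Smith</last_name>"
    "<birth_date>1970-01-31</birth_date></patient></patients>"
)
recast_text = ET.tostring(recast(source, RENAME), encoding="unicode")
print(recast_text)
```

A pure rename covers only the simplest case. Restructuring, such as splitting one field into two or changing the nesting, is where a declarative transformation language earns its keep.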

Once the data transformation has taken place the data sources will have a consistent, common structure but will most likely not be completely free of redundancies. The assessment of data overlap and its elimination can be performed automatically using machine learning algorithms. In this book we describe how a Bayesian predictor approach can determine which records are duplicates and, therefore, need to be removed from the final data pool.
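
One simple way to picture a Bayesian duplicate check is to apply Bayes' rule to field-by-field agreement between two records. The sketch below assumes the fields are conditionally independent (a naive Bayes assumption) and uses made-up probabilities, field names and records; it is an illustration of the general idea, not necessarily the predictor developed later in the book.

```python
# Hypothetical per-field probabilities, for illustration only:
#   m = P(field agrees | records are duplicates)
#   u = P(field agrees | records are distinct)
FIELDS = {
    "familyName": (0.95, 0.02),
    "dateOfBirth": (0.98, 0.01),
    "postalCode": (0.90, 0.10),
}
PRIOR = 0.05  # assumed prior probability that a random pair is a duplicate


def duplicate_probability(a, b, fields=FIELDS, prior=PRIOR):
    """Posterior P(duplicate | field agreements), assuming independent fields."""
    weight_dup, weight_distinct = prior, 1.0 - prior
    for name, (m, u) in fields.items():
        agrees = a.get(name) == b.get(name)
        weight_dup *= m if agrees else 1.0 - m
        weight_distinct *= u if agrees else 1.0 - u
    # Bayes' rule: normalize over the two hypotheses.
    return weight_dup / (weight_dup + weight_distinct)


r1 = {"familyName": "Smith", "dateOfBirth": "1970-01-31", "postalCode": "20001"}
r2 = dict(r1)  # same person recorded in a second source
r3 = {"familyName": "Jones", "dateOfBirth": "1985-11-02", "postalCode": "94105"}

p_same = duplicate_probability(r1, r2)
p_diff = duplicate_probability(r1, r3)
print(f"r1 vs r2: {p_same:.4f}")
print(f"r1 vs r3: {p_diff:.6f}")
```

In practice the m and u values would be estimated from labeled or sampled record pairs, and a threshold on the posterior would decide which records to drop.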

Figure 0-1. XML Technologies for Data Integration Roadmap

Once the transformed source data is free of duplications the health care provider can proceed to upload it into the target database. In this book we show how XML can also facilitate the upload of data into the data store.
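
As a rough illustration of the load step, the sketch below parses an integrated XML document and inserts one row per record into an in-memory SQLite database. SQLite stands in here for the target database only because it ships with Python; the table, column and element names are hypothetical, and a commercial database would normally be loaded through its own XML import facility.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical integrated, de-duplicated data in the target vocabulary.
DOC = """<persons>
  <person><familyName>Smith</familyName><dateOfBirth>1970-01-31</dateOfBirth></person>
  <person><familyName>Jones</familyName><dateOfBirth>1985-11-02</dateOfBirth></person>
</persons>"""

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (family_name TEXT, date_of_birth TEXT)")

# One tuple per <person> record; a missing child element becomes NULL.
rows = [
    (person.findtext("familyName"), person.findtext("dateOfBirth"))
    for person in ET.fromstring(DOC).findall("person")
]
conn.executemany("INSERT INTO person VALUES (?, ?)", rows)
conn.commit()

loaded = conn.execute("SELECT COUNT(*) FROM person").fetchone()[0]
print(f"loaded {loaded} rows")
```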

Data Integration Activity Model

Figure 0-2 below shows the context activity diagram for the entire data integration activity model using the IDEF0 notation. The inputs (arrows coming into the box from the left) for this process are the raw data sources, which can be spreadsheets, text lists (e.g., tab-delimited or comma-delimited files), as well as any other type of ASCII file containing the data that needs to be integrated. The controls (arrows coming into the box from the top) are the goals and objectives that a given health care provider has with respect to the data to be integrated. The output (arrow exiting the box on the right) is the fully aligned and integrated data once it has undergone the data integration process. The mechanisms (arrows coming into the box from the bottom) are the resources required to accomplish the stated activity.

Figure 0-2. Data Integration Activity Model (IDEF0 Context Diagram)


Data sources constitute a valuable asset for health care providers. As these organizations coordinate and combine their activities they need to be able to reuse their data. Because in most cases data sources are structured in dissimilar ways, the health care provider must first integrate them. XML technologies provide an easy and efficient way of accomplishing this goal. Most data sources can be XMLized in a few simple steps. Once in that form their structure can be defined and documented in the form of schemas (XSDs). From the individual XSDs the health care provider can generate a common, integrated specification of the semantics and syntax for all the pertinent data sources. This specification serves not only to generate the physical target database, but also to specify the transformations that recast the source data in a form conformant with the new semantics and syntax. This step is accomplished using XSLT. The final step prior to loading the data is to remove redundancies. An efficient way to do this is to use Bayesian predictor methods. Once the source data is free of redundancies it can be loaded into the target data repository. This step can also be accomplished using XML since most commercial RDBMSs are now capable of automatically importing XML documents.


