Introduction to Data Integration

Introduction

This course is aimed at teaching you how to perform data integration, i.e., the merging of two or more disparate data sources into a single coherent data set to support the information needs of the target business or enterprise. The methods and procedures described herein are based, and, take advantage of the power of XML and the robustness of the tools available to manipulate data once it has been 'tagged' in accordance with the XML syntax.

It is recommended that you read all sections in the order in which they are presented since the examples and techniques of each new chapter build on those introduced in the preceding ones.

No in-depth knowledge of database design, data modeling or programming is required but a certain familiarity with them is definitely an asset.

To make the most of it you should have access to a computer running Windows or some other operating system such as Linux and you should have a suite of office tools capable of performing word processing, spreadsheet manipulations and basic database operations, e.g., MS Office or Star Office.

Finally, you should also have an XML-enabled browser such as Internet Explorer version 5.0 or higher.

Part 1�

Chapter 1 provides the road map for the entire book. This chapter gives a high level description of the methods and procedures for how to integrate data using XML technologies and the role each one of them plays in the process.

Chapter 2 covers the essential concepts of XML. The reader should be aware that this is not a book about XML, but about how to use XML for the stated purpose of data integration. Therefore, specialized aspects of the XML technologies such as browser rendering, style sheets, etc., are not covered. If you need to learn more about those aspects of XML you should consult the specialized literature (see the suggested Bibliography at the end of this introduction).

Chapter 3 covers World Wide Web Consortium (W3C) XML Schema Document (XSD). Although the main use of XSDs is for validation, in this book we look at schemas as a form of ‘data modeling’ in the context of XML, i.e., they represent the specification of the semantics and syntax of the data as well as their relationships.

Part 2—

Chapter 4 covers XSD alignment using the results developed in Chapter 3. Specifically, it shows how to apply basic data modeling concepts to perform semantic and syntactic alignment starting with the XSD’s created for data sources that will be integrated.

Chapter 5 covers the use of XML Style sheet Language/Transformation (XSLT) to convert the properly tagged data sources into XML instance documents that conforms to the semantics and syntax specification of the integrated data source (the target XSD). Again, the focus here is not to explore all the variations and applications of these two rich areas, but simply to provide you with the necessary tools required to merge data from different sources via XML in an efficient manner.

Part 3�

Chapter 6 describes the theory behind Bayesian algorithms.

Chapter 7 applies the Bayesian approach to the sample data that has been created in Chapter 4. The final data set is one which shows no data redundancy and reflects the data structure of the integrated data source. The chapter also shows how data load can be performed using XML.

Statement of the Problem

Health care providers collect and maintain large quantities of data. Except for rare cases the structure of the data of one health care provider bears no similarity to the structure of any other one. Yet data communication and data sharing is becoming more important as organizations see the advantages of integrating their activities and the cost benefits that accrue when data can be reused rather than recreated from scratch.

Given this diversity of data sources, how does one efficiently and reliably integrate them? One approach, described in this book, is to leverage the power and ease of implementation of eXtensible Markup Language (XML) technologies to accomplish that goal.

Role of XML Technologies

Figure 0-1 below depicts the roles and steps necessary to accomplish data integration using XML technologies. According to this methodology the first step consists in taking the raw data sources—spreadsheets, text, etc—and converting them into well-formed XML documents. This is needed to take advantage of all the other XML technologies which will enable the health provider to integrate relevant data for its organization.

Once data has been XMLized the next step is to analyze and document its structure. Within the context of XML technologies this is done by creating the XML schemas (XSD) for each of the data sources. As we will see, this is basically another form of doing data modeling, i.e., understanding and specifying the semantics and syntax of the data, as well as their relationships.

Once the different XSD’s—i.e., the ‘data models’—for each of the data sources are available the health care provider can begin to align them and generate a single integrated XSD, i.e., an integrated data model for all of them.

Out of this XSD we can create both the target data base as well as the specifications for how to recast the XMLized data sources in terms of this new vocabulary. This transformation from the original data structure into the target data structure is readily carried out with another XML technology, namely, XML Style sheet Language/Transformation (XSL/T).

Once the data transformation has taken place the data sources will have a consistent, common structure but most likely may not be completely free of redundancies. The assessment of data overlap and its elimination can be automatically performed using machine learning algorithms. In this book we describe how a Bayesian predictor approach can determine which records are duplicate and, therefore, need to be removed from the final data pool.

Figure 0-1. XML Technologies for Data Integration Roadmap

Once the transformed source data is free of duplications the health care provider can proceed to upload it into the target data base. In this book we show how XML can also facilitate data upload into data store.

Data Integration Activity Model

Figure 0-2 below shows the context activity diagram for the entire data integration activity model using the IDEF0 notation. The inputs (arrows coming into the box from the left) for this process are the raw data sources, which can be spreadsheets, text lists (e.g., tab delimited, or comma delimited files), as well as any other type of ASCII file containing the data that needs to be integrated. The controls (arrows coming from the top into the box) are the goals and objectives that a given health care provider has with respect to the data to be integrated. The output (arrow exiting the box on the right) is the fully aligned and integrated data once it has undergone the data integration process. The mechanisms (arrows coming into the box from the bottom) are the resources required to accomplish the stated activity.

Figure 0-2. XML Technologies for Data Integration Roadmap

Summary

Data sources constitute a valuable asset for health care providers. As these organizations coordinate and combine their activities they need to be able to reuse their data. Because in most cases data sources are structured in dissimilar ways the health care provider must first integrate them. XML technologies provide an easy and efficient way for accomplishing this goal. Most data sources can be XMLized in a few simple steps. Once in that form their structure can be defined and documented in the form of schemas (XSDs). From the individual XSDs the health care provider can generate a common, integrated specification of the semantics and syntax for all the pertinent data sources. This specification can generate not only the physical target data base, but serves to specify the transformations to recast the source data in a form conformant with the new semantics and syntax. This step is accomplished using XSL/T. The final step prior to loading the data is to remove redundancies. An efficient way to do this is to use Bayesian predictor methods. Once the source data is free of redundancies it can be loaded into the target data repository. This step can also be accomplished using XML since most commercial RDBMs are now capable of automatically importing XML documents.

Presentations

Following resources are available:

Power Point slides for introduction to XML
Listen to lecture on Introduction to XML.

Narrated lectures require use of Flash .

What do you know?

Before we get started, it is important to make an inventory of what you know so that by the end you can better understand your progress. Please take the following brief self assessment.

Record Linkage Strategies, Outpatient Procedures, and Administrative Data

This page is part of the course on Data Integration , the lecture on Introduction to Data Integration. This page was last edited on 10/22/2011. ©Copyright protected. For more information contact us.