George Mason University
Process Improvement




The previous section introduced the basic concepts of an XSD. Specifically, it discussed how the tags in an XML document can be specified both in terms of their semantics (using <xs:annotation> and <xs:documentation>) and in terms of their syntax, either with predefined data types such as xs:string and xs:date, or with user-defined data types, for example by taking a base data type such as xs:decimal and imposing a restriction: <xs:restriction base="xs:decimal"><xs:minInclusive value="15.00"/></xs:restriction>. This section discusses the analytical principles that can be applied when two data sources must be brought together into a common format and we base our analysis on their respective schemas.

Scenarios and Initial Considerations

As we saw in Chapter 0, the main goal of this course is to teach you how to leverage the power of XML technologies to integrate health care data. When multiple sources of health care data must be brought into a single real or virtual repository, the first issue to resolve is whether the target repository is intended to support a unidirectional data flow (i.e., legacy data will be transformed and collected in the new format) or a bidirectional one, in which the target repository also provides data back to its legacy sources.

This is a very important issue, since choices that are appropriate for a unidirectional data flow may not be appropriate for a bidirectional one. For example, if the target repository specifies the patient's name as a single, comma-separated character string, it is easy to take data from a source system where the name is broken into FirstName, MiddleInitial, and LastName: all that is required is concatenating the three fields. On the other hand, patient names entered directly into the target repository may not be easily parsed back into the source system's format unless business rules are enforced at data entry, since both "Maertens, Philippe D" and "Philippe D Maertens" would be acceptable in the target repository but would yield two different results if one assumed that the string has a predefined order.
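The round-trip problem above can be sketched in a few lines of Python. This is a hypothetical illustration, assuming a 'Last, First Middle' convention in the target repository; the function and field names are not part of any source schema:

```python
# Hypothetical sketch of the name round-trip problem described above.
# Function and field names are assumptions, not part of any source schema.

def to_target(first, middle, last):
    """Unidirectional mapping: granular fields -> single string (always safe)."""
    return f"{last}, {first} {middle}"

def parse_back(name):
    """Bidirectional mapping that assumes the 'Last, First Middle' order."""
    last, rest = name.split(",", 1)
    first, _, middle = rest.strip().partition(" ")
    return first, middle, last.strip()

assert to_target("Philippe", "D", "Maertens") == "Maertens, Philippe D"
print(parse_back("Maertens, Philippe D"))  # ('Philippe', 'D', 'Maertens')

# Without an enforced business rule, entries in the other order break parsing:
try:
    parse_back("Philippe D Maertens")
except ValueError:
    print("cannot parse: the comma-separated order was not enforced")
```

The forward mapping never fails, while the reverse mapping depends entirely on a data-entry rule, which is exactly why directionality must be settled first.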

QUESTION: How many solutions can you think of for resolving the data integration of record identifiers where one source system uses xs:decimal and another uses xs:string?

The second consideration is whether the level of detail provided in the XSD for the source data is sufficient to support the analysis. In other words, there must be sufficient understanding of the meaning of the data as it is used in the current system, as well as of its intended use in the target repository.

For example, if Health Care Corporation Lammont has acquired Clinic SmallTown, and as part of the merger Lammont will provide all employees of SmallTown health care coverage at one of Lammont's hospitals, one could take the personnel records of SmallTown and convert them to EnsuredPatients in the Lammont information system.

In this example we see that the semantics of the personnel records in SmallTown has changed, but since the intended use of the records in the target repository allows such a change, it may be acceptable.

QUESTION: What can you say about the directionality implicit in the previous example?

To recap, before starting an alignment analysis one must have a good understanding of the source data (both its meaning and its syntax), of the intended use of the data, and of the overall concept of operations of the new data repository (directionality).

Basic Alignment Principles

  1. If the flow is unidirectional, then for each element in the target system choose the more generic semantics.

Example: If one of the data sources to be integrated defines <patient> as "an individual with severe cancer symptoms" and the target repository is intended to collect state-wide statistics on cancer patients in general, then <patient> could be defined as "an individual who has been diagnosed with cancer". In this way, other data sources could be mapped to the same element without difficulty.

  2. If the flow is bidirectional, then use a generalization and retain the source element as a subtype.

Example: Under this principle, the original record for <patient> would retain its semantics but would be mapped to a subtype of the <patient> element in the target repository (e.g., <SevSymptPatient>), while the element <patient> itself would be used for those cases where the source data does not imply that the severity of the symptoms is essential to its meaning.
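The generalization-with-subtype idea can be sketched with a small class hierarchy. The names below follow the hypothetical example in the text; this is an illustration of the principle, not part of any actual schema:

```python
# A minimal sketch of the bidirectional principle: the target repository keeps
# the generic element, and the legacy source element becomes a subtype of it.
# All names here are hypothetical, following the example in the text.

class Patient:
    """Target semantics: an individual who has been diagnosed with cancer."""
    def __init__(self, patient_id, name):
        self.patient_id = patient_id
        self.name = name

class SevSymptPatient(Patient):
    """Legacy semantics retained as a subtype:
    an individual with severe cancer symptoms."""
    severity = "severe"

# Legacy records keep their narrower meaning...
legacy = SevSymptPatient("001", "Maertens, Philippe D")
# ...yet remain usable wherever the generic target element is expected.
assert isinstance(legacy, Patient)
```

Because the subtype is substitutable for the generic element, data can flow back to the legacy source without losing its original, narrower meaning.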

  3. Whenever possible, define the target data elements to conform to normalization rules even if the source data does not.

  4. Devise business rules and procedures for the pre-processing of aggregate data.

Example: Given the choice between <PatientName> on the one hand and <PatientFirstName>, <PatientMiddleInitial>, and <PatientLastName> on the other, it is better to adopt the latter, since it is always possible to map from it to an aggregate element such as <PatientName>.
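A brief sketch of why the normalized choice is the safe one. The element names come from the example above; the record and mapping function are hypothetical illustrations:

```python
# Normalization sketch: store the granular elements and derive the aggregate
# on demand; the reverse direction would require enforced business rules.
# Element names follow the example; the record itself is hypothetical.

record = {
    "PatientFirstName": "Philippe",
    "PatientMiddleInitial": "D",
    "PatientLastName": "Maertens",
}

def to_patient_name(rec):
    """Granular -> aggregate is always a simple, lossless concatenation."""
    return "{PatientLastName}, {PatientFirstName} {PatientMiddleInitial}".format(**rec)

print(to_patient_name(record))  # Maertens, Philippe D
```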

  5. If the data types of the sources belong to the same class but differ in their facets, choose the larger facet value for the target system.

Example: If one source uses <xs:restriction base="xs:string"><xs:maxLength value="35"/></xs:restriction>, while another uses <xs:restriction base="xs:string"><xs:maxLength value="55"/></xs:restriction>, then the target system should use at a minimum <xs:restriction base="xs:string"><xs:maxLength value="55"/></xs:restriction>.
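The facet-merging rule can be checked with a tiny simulation of xs:maxLength validation. The facet values (35 and 55) are taken from the example; the function name is an assumption:

```python
# Sketch of the facet rule: the target facet must be at least the maximum of
# the source facets, or data from the wider source would be rejected.
# The values 35 and 55 come from the example in the text.

def fits(value, max_length):
    """Mimics xs:maxLength facet validation on a string value."""
    return len(value) <= max_length

long_name = "x" * 55
assert not fits(long_name, 35)        # rejected by the narrower source facet
assert fits(long_name, max(35, 55))   # accepted by the merged target facet
```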

  6. If the data types of the sources belong to different classes and the flow is unidirectional, then choose the more generic class. If the flow is bidirectional, then create a new element to track the legacy data.

Example: If one data source specifies its <PatientID> element as numeric strings of type xs:decimal, whereas another uses alphanumeric strings for <PatientID>, then xs:string could accommodate both. The assumption here, however, is that the flow is unidirectional. If the flow is intended to be bidirectional, then you may need to create a new element <PatientAltID> to retain the legacy data while redefining <PatientID> as the new record identifier.
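The unidirectional case can be sketched as a simple normalization to the more generic class. The sample identifiers are hypothetical:

```python
# Sketch of the different-classes rule (unidirectional case): xs:string is the
# more generic class, so both numeric and alphanumeric identifiers map into it.
# The sample identifiers below are hypothetical.

decimal_ids = [1042, 7]          # source A: xs:decimal identifiers
string_ids = ["A-77", "B-03"]    # source B: xs:string identifiers

# Converting everything to strings loses no information going forward...
target_ids = [str(i) for i in decimal_ids] + list(string_ids)
print(target_ids)  # ['1042', '7', 'A-77', 'B-03']

# ...but the reverse mapping is not guaranteed: '7' converts back to a
# decimal, while 'A-77' does not, which is why a bidirectional flow needs a
# separate element (e.g., <PatientAltID>) to retain the legacy values.
```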


  7. Whenever possible avoid the use of 'intelligent keys'.

So-called intelligent keys should be avoided because of the ripple effects that take place when either the range of the scheme is exhausted or new requirements make the record identification scheme inadequate. For example, imagine a clinic where the PatientId is an eight-digit number, giving room for 100,000,000 distinct records. If the first digit of the record identifier is set to "1" to indicate that the patient is ambulatory or "2" when the patient requires hospitalization, then only the remaining seven digits identify the record, leaving 2 x 10,000,000 = 20,000,000 records. If, furthermore, the next two digits are used to indicate the type of disease the patient is diagnosed with, then not only does this restrict the number of possible diagnoses to 100, but the number of records that can be identified per patient type and diagnosis is further reduced to 100,000. Further partitioning of the key to accommodate other information may reduce the number of records that can be uniquely identified to a very small number. When this occurs, the key structure will have to be modified, and all applications that depended on the scheme built into the key will have to be rewritten. If the information system is large, the cost of such a change can be quite high.
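The capacity arithmetic above can be checked directly (the digit assignments follow the eight-digit example in the text):

```python
# Capacity arithmetic for the intelligent-key example above:
# an eight-digit identifier, with digits progressively given fixed meanings.

total = 10 ** 8                    # all eight digits free: 100,000,000 ids
after_type = 2 * 10 ** 7           # first digit restricted to "1" or "2"
per_type_and_diagnosis = 10 ** 5   # two more digits spent encoding diagnosis

assert total == 100_000_000
assert after_type == 20_000_000
assert per_type_and_diagnosis == 100_000
# Each partition of the key shrinks the usable identifier space sharply.
```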

  8. Consider the use of a record identification scheme that is flexible, easily implementable, and scalable.

In our modern global economy it is becoming quite common to have IT-support contractors who reside at locations other than the place where the data is produced and used.

Example: The health care corporation Lammont has acquired clinics and hospitals in New York, Washington, D.C., and San Francisco. For cost efficiency it has assigned the Washington, D.C. integration tasks to two outfits in Virginia, while the New York data integration will be performed by an outfit in New Delhi and the San Francisco integration by an outfit in Dublin, Ireland. The integrated data must be retrievable throughout the U.S. from any of the new servers, independent of the physical location of the user.

QUESTION: What record identification scheme should be adopted so that no two records in the system have the same identifier?

What are the advantages and disadvantages of the two answers provided below?

ANSWER: (a) Assign identifier blocks to each of the integrators; (b) adopt a scheme where the record identifiers are built from two segments, a 'seed' and a 'suffix'. Seeds are managed centrally and given out on a first-come, first-served basis. Suffixes are managed locally, with the only constraint that they do not repeat.
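Answer (b) can be sketched as follows. The class names and the id format are hypothetical; the point is that seeds are issued centrally while suffixes need only local uniqueness:

```python
# A minimal sketch of the seed/suffix scheme in answer (b).
# Class names and the "seed-suffix" format are hypothetical illustrations.

import itertools

class CentralRegistry:
    """Hands out seeds on a first-come, first-served basis."""
    def __init__(self):
        self._next_seed = itertools.count(1)
    def issue_seed(self):
        return next(self._next_seed)

class LocalIntegrator:
    """Generates record ids; suffixes only need to be unique locally."""
    def __init__(self, seed):
        self.seed = seed
        self._next_suffix = itertools.count(1)
    def new_id(self):
        return f"{self.seed}-{next(self._next_suffix)}"

registry = CentralRegistry()
virginia = LocalIntegrator(registry.issue_seed())   # receives seed 1
new_delhi = LocalIntegrator(registry.issue_seed())  # receives seed 2

ids = [virginia.new_id(), virginia.new_id(), new_delhi.new_id()]
print(ids)  # ['1-1', '1-2', '2-1'] -- globally unique without coordination
```

Because each integrator owns a distinct seed, no two integrators can ever produce the same identifier, even though they never coordinate on suffixes.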

Final Remarks

Alignment of data sources encompasses both the semantic and the syntactic aspects of the data to be integrated. The more precise and exhaustive the documentation and the understanding of the nature of the data sources, the easier it will be to integrate them.

The alignment process is by nature iterative. Each of the principles stated above is likely to be applicable to the efficient resolution of an integration task in one or more of its refinement cycles. Both top-down and bottom-up techniques should be considered when approaching a data integration activity.

In this section we have seen some of the key principles that one should apply when dealing with a data integration problem. We have taken advantage of the documentation that an XSD can provide in order to discern the optimal alternatives for bringing legacy source data into a new target repository. We have also seen how the data types provided in the XSD specification allow us to resolve the syntactic component of data integration, and that the standard techniques involved in normalizing databases also apply to this domain. Lastly, we have explored the issues related to record identification and how the choice of a robust key management scheme can support implementation and scalability without running into the problems inherent in the use of traditional intelligent keys.

Copyright © 1996. For more information contact us. Created in January 2003. Most recent revision 10/22/2011. This page is part of the course on Data Integration.