Essential XML Concepts

As shown in the roadmap discussed in the introductory chapter, the first step towards data integration in the context of XML technologies is to create instance XML documents out of the raw data sources targeted for integration. To accomplish this, health care providers need to learn the basic principles of XML so that they can generate well-formed XML documents.

This chapter gives an overview of the XML syntax and shows how raw data sources in the form of spreadsheets can be readily manipulated to generate a well-formed instance XML document. The methodology can be equally applied to comma-delimited files or other textual sources, although the less structured the source, the more manipulation it may require.

Why Data Integration?

We define a 'stove-pipe' data store as a data repository whose semantics and syntax do not conform to those of another data store engaged in a similar kind of business. There is nothing wrong with 'stove-pipe' data stores. In fact, every time we launch our favorite database application and create a database to organize patient records or keep track of our physician certification process we are likely to create a 'stove-pipe' data store, i.e., one whose tables and columns are likely to be different from those of someone else undertaking basically the same activity. At the personal level there is nothing wrong with that because it is highly unlikely that we will be interested in asking for someone else's data, i.e., we seldom share any type of stored data. At the business level the opposite is more common. When a company is acquired by a larger corporation its data may have to be merged with that of the acquiring enterprise in order for the parent company to take advantage of that resource and to minimize maintenance costs. When a company has multiple data systems, these systems need to be able to talk to each other. A merged enterprise-wide database cannot be created without integrating data from different types of databases with different structure and dissimilar meanings.

After the events of September 11 it has become quite apparent that data sharing is key to a successful Homeland Defense strategy, and the Federal government is actively pursuing ways of making data more freely shared among the respective agencies. On the other hand, the enactment of HIPAA legislation restricts the sharing of patient-identifiable information across health care organizations in the community, so there is a need to develop methodologies that permit data integration without relying on such things as the patient’s Social Security Number. The expected benefits of increased sharing, however, may not materialize unless data is restructured into a common format capable of automated processing: although humans are quite good at figuring out whether one kind of data is essentially the same as another, they are notoriously slow and prone to error.

How does a 'Markup Language' help?

XML is probably the most robust syntax for performing 'markup'. If you are not familiar with XML this probably didn't tell you much. So let's back up a little. The idea behind markup is quite powerful because data which contains markup can be processed much more efficiently. Let's look at the example shown in Table 1.1 below.

The numbers in Table 1.1 above are both positive and negative and they could represent, for example, the deviations of a patient’s systolic blood pressure from the ideal value of 120 over a period of time. If one were to enclose the negative numbers between one type of appropriate symbol and the positive numbers between a different one, as shown in Table 1.2, then one could automatically process the table to filter them into separate files.

The pseudo-code shown below would filter the negative blood pressure measurements to a new file called LowBP.dat, and the high values to HighBP.dat respectively:
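
The filter can be sketched in Python along the following lines. The sketch assumes, purely for illustration, that negative deviations are enclosed in curly braces and positive ones in square brackets; these symbols and the sample values are hypothetical, while the output file names are the ones mentioned above.

```python
import re

# Hypothetical marked-up readings: {...} encloses negative deviations,
# [...] encloses positive ones. Symbols and values are assumptions.
marked_up = "[19] {-8} [26] {-3} [4] {-15}"

def split_by_markup(text):
    """Return (low, high) lists of deviations, selected only by their delimiters."""
    low = [int(v) for v in re.findall(r"\{(-?\d+)\}", text)]
    high = [int(v) for v in re.findall(r"\[(-?\d+)\]", text)]
    return low, high

def write_values(path, values):
    """Write one value per line to the given file."""
    with open(path, "w", encoding="ascii") as handle:
        for value in values:
            handle.write(f"{value}\n")

low, high = split_by_markup(marked_up)

if __name__ == "__main__":
    write_values("LowBP.dat", low)
    write_values("HighBP.dat", high)
```

Note that the program never inspects the sign of a number; it relies entirely on the markup, which is the point of the exercise.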

Question 1.1: Are all symbols equally appropriate for the example above? For example, could one use numbers as 'appropriate symbols' for this type of 'markup'?

Question 1.2: What is the relationship between 'markup languages' and 'browsers'?

X stands for eXtensible

The preceding paragraphs highlight the substantial benefits that accrue when data receives 'markup'. They also show that, with few exceptions, pretty much anything goes when one is doing 'markup'. Therein lies the appeal of XML. It is a 'standard' way of doing 'markup', yet it is also flexible enough to accommodate the needs of anyone doing markup because, unlike fixed vocabularies for special purposes such as HTML, the user is free to invent any type of 'tag' (provided it does not conflict with a couple of syntactical restrictions discussed below) to introduce structure into the data.

Couple this with the ability to add attributes to the tags and now we have not just an arbitrary combination of symbols to delimit the data but potentially a complete way of expressing what the data means. XML therefore allows one to specify not just the structure but also the semantics of the data.

XML applications

When certain health care organizations agree on the need to exchange data in a certain way they may agree to restrict the XML tags that can be used for their exchanges and to define strictly the meaning and use of the tags. When this occurs we speak of an XML application. For example, if hospital laboratories agree to use certain tags to indicate that the character string enclosed between a given pair of tags represents test results, then that is an XML application, namely, the Hospital Laboratory Markup Language (HLML). Once health care organizations create their own XML application they can develop specialized tools that are able to manipulate the markup for any purpose they want. The most typical tools are browsers for presentation or rendering of the data on a computer screen or formatted output to a printer. But XML applications also allow other types of manipulations such as importing data into databases.

XML Basics—the XML Tag and the XML Element

Figure 1.1 below shows the structure of a start XML tag. As shown therein an XML tag consists of (a) delimiters, (b) a name and (c) zero, one or more attributes. The value of the attribute is enclosed in quotation marks.

Figure 1.2 below shows the closing tag corresponding to the start tag shown in Figure 1.1, as well as the content. Note that the closing tag starts with "</" and does not contain any attributes. The content of the element goes between the start tag and the end tag. In the example shown the content is an ASCII string. Elements can contain both ASCII strings and other XML tags.

XML tags are the physical expression of the XML elements which constitute the structure of any well-formed XML instance document.
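
The parts of a tag can be made concrete with a short Python sketch that uses the standard library's ElementTree parser. The element below is a hypothetical stand-in for the one in the figures: the attribute name "unit" and its value are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical element: the attribute "unit" and its value are assumptions.
snippet = '<BloodPressure unit="mmHg">139</BloodPressure>'

element = ET.fromstring(snippet)

print(element.tag)     # the name shared by the start and end tags
print(element.attrib)  # the attributes carried by the start tag
print(element.text)    # the content between the start and end tags
```

The parser reports a single element with a name, a set of attributes and content, which is the logical unit that the start and end tags physically express.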

XML Syntax Rules

From Sections 1.1 through 1.4 above we now know that XML is a formalized way of performing markup: so what are the rules of the XML markup syntax?

Well, they are fortunately very simple, which is one of the reasons for the popularity of XML.

Naming Rules: element names, attribute names, as well as the names of several less common constructs:
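
As a rough illustration of the naming rules in XML 1.0, a name must begin with a letter or an underscore (a colon is also permitted but is reserved for namespaces), may continue with letters, digits, hyphens, underscores and periods, cannot contain spaces, and should not begin with the letters "xml" in any combination of upper and lower case because such names are reserved. The Python sketch below checks these rules for ASCII names only; the function name is hypothetical and the specification additionally admits many non-ASCII letters that the sketch ignores.

```python
import re

# ASCII-only simplification: the XML specification also admits many
# non-ASCII letters, which this sketch deliberately ignores.
NAME_PATTERN = re.compile(r"[A-Za-z_:][A-Za-z0-9_:.\-]*\Z")

def is_plausible_xml_name(name):
    """Check a candidate name against the simplified ASCII naming rules."""
    if name.lower().startswith("xml"):
        return False  # names beginning with "xml" (any case) are reserved
    return NAME_PATTERN.match(name) is not None

for candidate in ["BloodPressure", "hospital-stay", "2ndReading", "Blood Pressure", "xmlData"]:
    print(candidate, is_plausible_xml_name(candidate))
```

Because XML names are case-sensitive, a name that passes this check must still be spelled identically in the start and end tags.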

XML Elements & Processing Instructions

As we briefly alluded to in Section 1.5 above, every XML instance document is made up of elements whose physical representation is in the form of XML tags, both start and end, as well as content.

The content of an element can be empty, in which case the element itself is said to be 'empty', and, if desired, the start and end tags can be conflated into a single one. For example, if there had been no content for the element BloodPressure in Figure 1.2, then, instead of writing the start tag immediately followed by the end tag, one could have written <BloodPressure/> to indicate that there was no content for this element.
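
The equivalence of the two notations can be checked with a small Python sketch based on the standard library parser; the helper name is_empty is an assumption introduced for this example.

```python
import xml.etree.ElementTree as ET

# The same empty element written in the long and the conflated form.
long_form = ET.fromstring("<BloodPressure></BloodPressure>")
short_form = ET.fromstring("<BloodPressure/>")

def is_empty(element):
    """An element is empty when it has no text and no child elements."""
    return not element.text and len(element) == 0

print(is_empty(long_form), is_empty(short_form))
```

A parser treats both forms alike, so the choice between them is a matter of style.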

Structurally speaking, all well-formed XML instance documents can be represented as a 'tree structure'. This means that they must have only one root element.
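
A parser enforces the tree structure. The Python sketch below, which assumes only the standard library and uses a hypothetical helper named is_well_formed, shows a document with a single root being accepted, while a document with two roots and one with overlapping tags are both rejected.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True when the parser accepts the text as a single XML tree."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

one_root = "<PatientRecords><Person/><Person/></PatientRecords>"
two_roots = "<Person/><Person/>"
overlapping = "<Person><PersonName>Peter Jonas</Person></PersonName>"

for sample in (one_root, two_roots, overlapping):
    print(is_well_formed(sample), sample)
```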

XML instance documents can contain comments in addition to elements. To signal to the parser that a string is a comment one encloses it between "<!--" and "-->". Comments are useful for providing further context and information to the user of the XML document.

Finally, since XML documents are produced so that other applications can process them, for example for display on computer screens, it is also possible to include in an XML instance document strings which are directed to the processing application, i.e., processing instructions. All processing instructions begin with "<?" immediately followed by the target (the application intended to process the instruction) and end with "?>". When a parser encounters a processing instruction it can either process it or, if it has no means of interpreting it, ignore it. One of the most commonly used processing instructions is the style sheet instruction, which is used to tell browsers how to render the contents of the XML instance document on the screen.

The style sheet processing instruction must always be inserted before the root element of the XML instance document.
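
As an illustration, the Python sketch below places an xml-stylesheet processing instruction before the root element and then lists the processing instructions that the standard library's minidom parser finds at the top level of the document. The style sheet file name records.xsl is hypothetical.

```python
from xml.dom import minidom, Node

# The href value "records.xsl" is a hypothetical file name.
document_text = (
    '<?xml version="1.0"?>'
    '<?xml-stylesheet type="text/xsl" href="records.xsl"?>'
    "<PatientRecords><Person/></PatientRecords>"
)

dom = minidom.parseString(document_text)

# Collect (target, data) for every processing instruction outside the root.
instructions = [
    (node.target, node.data)
    for node in dom.childNodes
    if node.nodeType == Node.PROCESSING_INSTRUCTION_NODE
]

print(instructions)
```

The target, xml-stylesheet, names the kind of application the instruction is meant for; an application that does not recognize the target is free to skip the instruction.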

Arguably the XML declaration is a kind of processing instruction, although for historical reasons it is given its own name. The purpose of the XML declaration is to instruct the application about some important parameters concerning the document. Typical attributes of the XML declaration are the version, the encoding, and a flag that signals to the processing application whether the declarations used for validation are contained in the document or are external.

A Sample Document

The best way to remember all these rules is to use them right away and to test what happens when we try to break them. Using Notepad or some equivalent text editor type the following:

<?xml version="1.0" encoding="ASCII" ?>


<PatientRecords>



<Person>
<PersonName>Peter Jonas</PersonName>
<BloodPressure>139</BloodPressure>
<CholesterolLevel>209</CholesterolLevel>
<Triglycerides>90</Triglycerides>
<Date>9/12/99</Date>
<Clinician>Dr. John Martins</Clinician>
<Hospital>St. Michael Hospital</Hospital>
</Person>


<Person>
<PersonName>Sally Federsen</PersonName>
<BloodPressure>146</BloodPressure>
<CholesterolLevel>209</CholesterolLevel>
<Triglycerides>90</Triglycerides>
<Date>10/01/01</Date>
<Clinician>Dr. Sally Bielefeld</Clinician>
<Hospital>St. Michael Hospital</Hospital>
</Person>


</PatientRecords>
Listing 1.1. Example of an XML Document

Although one can use more sophisticated word processors one should always remember to save the final documents as text-only to ensure that all extraneous codes needed for formatting are removed.

You can omit any or all of the comments, and you can insert carriage returns to make the document more legible. They will be ignored by Internet Explorer when it processes the file for display on the screen.

Recommended Style Rules

It is good practice to develop a certain style for writing all XML documents since this facilitates review and minimizes errors of omission. The following are some suggested style rules:

Example:

Step 2. We can add a comment to explain what the document is about. In this example we are writing a description of an influenza outbreak:

<?xml version="1.0" encoding="ASCII" ?>

Step 3. Once this is done we can add the root tag. For this example we will use the Object Oriented style alluded to above. For the root tag we can choose a descriptive tag such as <Influenza_2002>. We write both the start and the end tag to make sure that our document is well formed.

<?xml version="1.0" encoding="ASCII" ?>

<Influenza_2002>
</Influenza_2002>

Step 4. The contents of the document are now going to be inserted between the start and end tags of the root tag <Influenza_2002>. As suggested in the style rules above we should have a record delimiter tag. We are going to choose a tag named <Report> for that purpose. Now we can think of what we want to say about each influenza report instance. For example, we may want to track the city where it occurred, how many patients were hospitalized, the average number of hospitalization days, and the number of deaths attributed to the outbreak.

<?xml version="1.0" encoding="ASCII" ?>

<Influenza_2002>
   <Report>
      <cityName>Atlanta</cityName>
      <patientQuantity>2,138</patientQuantity>
      <hospitalStay unit="days">8</hospitalStay>
      <casualtyQuantity>14</casualtyQuantity>
   </Report>
   <Report>
      <cityName>Los Angeles</cityName>
      <patientQuantity>15,764</patientQuantity>
      <hospitalStay unit="days">5</hospitalStay>
      <casualtyQuantity>77</casualtyQuantity>
   </Report>
   <Report>
      <cityName>New York</cityName>
      <patientQuantity>22,349</patientQuantity>
      <hospitalStay unit="days">7</hospitalStay>
      <casualtyQuantity>146</casualtyQuantity>
   </Report>
   <Report>
      <cityName>Chicago</cityName>
      <patientQuantity>12,808</patientQuantity>
      <hospitalStay unit="days">6</hospitalStay>
      <casualtyQuantity>94</casualtyQuantity>
   </Report>
</Influenza_2002>

Step 5. If the XML instance document is to be processed by a special application then the pertinent processing instructions can be added after the document section, i.e., after the closing tag of the root element. In Listing 1.2 above there are no processing instructions so the XML instance document is ready to be viewed using an XML-enabled browser such as Microsoft IE 5.0 or higher. Figure 1.4 below shows the results.
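
Well-formedness can also be checked outside a browser. The following Python sketch, which is an illustration rather than part of the procedure above, parses the first two reports of the influenza listing with the standard library and reads both element content and the unit attribute. The dictionary keys are names chosen for this example.

```python
import xml.etree.ElementTree as ET

# First two reports of the influenza listing, copied verbatim.
listing = """<?xml version="1.0" encoding="ASCII" ?>
<Influenza_2002>
   <Report>
      <cityName>Atlanta</cityName>
      <patientQuantity>2,138</patientQuantity>
      <hospitalStay unit="days">8</hospitalStay>
      <casualtyQuantity>14</casualtyQuantity>
   </Report>
   <Report>
      <cityName>Los Angeles</cityName>
      <patientQuantity>15,764</patientQuantity>
      <hospitalStay unit="days">5</hospitalStay>
      <casualtyQuantity>77</casualtyQuantity>
   </Report>
</Influenza_2002>"""

root = ET.fromstring(listing)  # raises ParseError if not well formed

reports = [
    {
        "city": report.findtext("cityName"),
        # Thousands separators must be removed before numeric conversion.
        "patients": int(report.findtext("patientQuantity").replace(",", "")),
        "stay": int(report.findtext("hospitalStay")),
        "stay_unit": report.find("hospitalStay").get("unit"),
        "deaths": int(report.findtext("casualtyQuantity")),
    }
    for report in root.findall("Report")
]

total_deaths = sum(r["deaths"] for r in reports)
print(total_deaths)
```

The need to strip the commas from the patient counts suggests that numeric content is easier to process when it is stored without formatting characters.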

It should be noted that the above are recommended style rules; for example, it is perfectly valid to insert processing instructions other than at the very end of the document section. In fact, processing instructions can be inserted anywhere a comment can be inserted—see Listing 1.1. The only restriction is that a processing instruction should not be inside a tag itself. In other words, a processing instruction cannot be treated as an attribute of a tag.

A Simple Data Taxonomy

Before we proceed with further basic XML concepts and techniques it is probably worthwhile to spend a moment considering some of the general characteristics of data, since they affect to a certain degree the choice of technologies and applications. If we look at the data we deal with on a daily basis we can see that some of it is mostly in the form of lists, whereas other data is mostly in the form of narratives. Lists of hospital names, patient names, medical treatments, diagnoses, medical product suppliers, medical product classes, etc., are what we call structured data. The reason for using this label is that when we examine the attributes of this type of data we see that all the instances of a particular set share the same characteristics. For example, if we look at a list of patient names (assuming the list does not contain extraneous entries) it is fair to say that every entry conveys a well-defined meaning, i.e., every name in the list is that of an individual who has received treatment from the health care provider. We express this fact by stating that the semantics of the members of the list is common to all its members. In addition, the way in which the data is encoded is probably also fixed. For example, the patient names are all character data that cannot exceed a certain number of characters, e.g., 255. We express this fact by stating that the syntax of the members of the list is common to all its members.

Lists are examples of data-centric documents. They are characterized by regular structure, as well as atomicity, i.e., they are semantically at the lowest level of required decomposition. In addition, the order of the elements that constitute a data-centric document is not essential. For example in Listing 1.2 above we could have chosen to put the <casualtyQuantity> element first and this would not have changed the overall meaning of the information conveyed by the records.
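
The point about element order can be illustrated with a short Python sketch. It reads two versions of the same report, which differ only in the order of their child elements, into dictionaries and compares them; the helper name as_record is an assumption of this example.

```python
import xml.etree.ElementTree as ET

original = ET.fromstring(
    "<Report><cityName>Atlanta</cityName>"
    "<casualtyQuantity>14</casualtyQuantity></Report>"
)
reordered = ET.fromstring(
    "<Report><casualtyQuantity>14</casualtyQuantity>"
    "<cityName>Atlanta</cityName></Report>"
)

def as_record(report):
    """Map each child element name to its content, ignoring element order."""
    return {child.tag: child.text for child in report}

print(as_record(original) == as_record(reordered))
```

For a data-centric document the two records carry the same information; the same comparison applied to the paragraphs of a narrative would discard something essential.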

Spreadsheets and, even more so, databases are applications that have been optimized to handle structured data (i.e., lists with well-defined semantics and syntax).

At the other extreme there is narrative data or document-centric data. The main feature of this type of data is that it does not conform to the definition of a list. It is characterized by a less regular or completely irregular structure. At the semantic level the content of the elements is not atomic. And, lastly, the order in which the elements occur is almost always significant. For example, this entire chapter can be considered narrative data. It clearly does not look anything like a list, and it matters whether this paragraph appears at the beginning or in its current position. Although spreadsheets and databases can also be used to store this type of data, the functions that work so well with lists, such as searches and comparisons, will not work with the same efficiency and, depending on the application, may not even be available when the data stored is a narrative as opposed to a list.

It should be noted that multimedia constitutes a new and emerging category of data which poses its own unique type of challenges. For example, digitized X-ray images, MRI and CAT scans, as well as voice and video streams may require special applications that combine relational and object-oriented capabilities to perform comparisons and searches based on the actual content of the binary objects.

In addition to the division of data into structured and unstructured types, data may also have a temporal aspect dictated by its refresh rate. For example, when we look at lists we notice that some are fairly stable over time, e.g., the list of all the US hospitals can be considered static, if none are going bankrupt or going through mergers. Other data is highly dynamic; for example, the list of patients treated by a health care provider changes from visit to visit. In between these two extremes we have semi-static data. A good example of this type of data is the employee telephone directory of a hospital, which may be updated every six months or every year. This type of data has a fixed periodicity which is shorter than that of static data, but much longer than that of dynamic data.

Applicability of XML

The implications of the data taxonomy discussed above when dealing with data integration are as follows. Historically, there have been two distinct communities applying XML: the community dealing with data-centric documents and the community producing narrative documents. The former has primarily looked at XML as a data transport mechanism, with documents meant to support automated messaging among applications, e.g., database-to-database exchanges. The use of XML in that setting is a matter of convenience rather than necessity. For data integration, however, XML constitutes the essential component for facilitating the merging of diverse data sources. It is also clear that data integration works best when the data has been distilled into data-centric documents.

The community producing narrative documents has used XML as a means to define information content models. For example, an entire book can be described in terms of elements such as <Chapter>, <Section>, <Paragraph>. The use of this information content model permits the application of specific processing instructions to the content of each kind of element. For example, if we want to display the document on a browser we may choose one font for the content of <Chapter> elements, and another for the content of <Section> elements. We may automatically create a table of contents for the document by selecting only the contents of the <Chapter> and <Section> elements but not that of the <Paragraph> elements. The lessons learned and techniques developed by this community are of secondary importance when doing data integration, and will not be covered in any detail in this book.
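
The table of contents idea can be sketched in a few lines of Python. The sketch assumes a deliberately flat content model in which <Chapter> and <Section> hold heading text; the <Book> root element and the sample content are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical flat content model: <Book> and the sample text are assumptions.
book = ET.fromstring(
    "<Book>"
    "<Chapter>Essential XML Concepts</Chapter>"
    "<Section>Why Data Integration?</Section>"
    "<Paragraph>We define a stove-pipe data store as a repository.</Paragraph>"
    "<Section>XML Syntax Rules</Section>"
    "<Paragraph>The rules are fortunately very simple.</Paragraph>"
    "</Book>"
)

# Keep chapter and section headings, indent the sections, skip paragraphs.
toc = [
    ("  " if element.tag == "Section" else "") + element.text
    for element in book
    if element.tag in ("Chapter", "Section")
]

print("\n".join(toc))
```

Here the order of the elements is preserved on purpose, since in a narrative document it carries meaning.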

Inserting XML Tags Manually

Having stipulated that the main focus of XML for data integration is on structured data, we are going to show via a simple example how one could XMLize it so that all the other XML technologies can be brought to bear. Because many data sources exist in the form of spreadsheets, or can be ported readily to that form, we will work out the techniques for how to transform raw source data into well-formed XML documents using such a source. The following steps show the process.

Step 1. In the spreadsheet to be transformed, analyze the columns to decide the names of the tags you want to introduce. In the example shown in Table 01 below, the columns already have headers which can serve as the basis for the names of the XML tags. Because XML names cannot contain spaces, the header "Medical Test" is shortened. We will, therefore, create three XML tags, namely, <Name>, <MedTest> and <Date>.

Name	Medical Test	Date
John Robertson	Blood Test	2-Jan-02
Maria Estrada	Cholesterol Test	27-Jul-02
Sandy Hellerman	X-Ray	13-Mar-02
Julia Moriarty	Chest MRI	23-Dec-01
Pedro Martinez	EKG	11-Feb-02
Sayeed Rashidi	Cholesterol Test	11-Nov-02

Step 2. Insert columns to the right and to the left of each column and paste the start and end XML tags chosen in Step 1 above. The resulting spreadsheet should look as shown in Table 02 below.

	Name			Medical Test			Date
<Name>	John Robertson	</Name>	<MedTest>	Blood Test	</MedTest>	<Date>	2-Jan-02	</Date>
<Name>	Maria Estrada	</Name>	<MedTest>	Cholesterol Test	</MedTest>	<Date>	27-Jul-02	</Date>
<Name>	Sandy Hellerman	</Name>	<MedTest>	X-Ray	</MedTest>	<Date>	13-Mar-02	</Date>
<Name>	Julia Moriarty	</Name>	<MedTest>	Chest MRI	</MedTest>	<Date>	23-Dec-01	</Date>
<Name>	Pedro Martinez	</Name>	<MedTest>	EKG	</MedTest>	<Date>	11-Feb-02	</Date>
<Name>	Sayeed Rashidi	</Name>	<MedTest>	Cholesterol Test	</MedTest>	<Date>	11-Nov-02	</Date>

Step 3. Insert a record delimiter tag for each record (i.e., for each row). In our example we may use a record delimiter tag such as <PatRec>.

		Name
<PatRec>	<Name>	John Robertson	</Name>	...	</PatRec>
<PatRec>	<Name>	Maria Estrada	</Name>	...	</PatRec>
<PatRec>	<Name>	Sandy Hellerman	</Name>	...	</PatRec>
<PatRec>	<Name>	Julia Moriarty	</Name>	...	</PatRec>
<PatRec>	<Name>	Pedro Martinez	</Name>	...	</PatRec>
<PatRec>	<Name>	Sayeed Rashidi	</Name>	...	</PatRec>

Step 4. Copy the spreadsheet, without the column headers, and paste it into your preferred word processing application. Remove the tabs and insert a paragraph return between each "><" sequence. For example, if you are using MS Word, paste the spreadsheet using Paste Special.../Unformatted Text. In Edit/Replace... enter "^t" in the Find what field, leave the Replace with field blank, and then press Replace All. Next, while in Edit/Replace..., enter "><" in the Find what field and ">^p<" in the Replace with field, and then press Replace All. The text should now look as shown in Listing 1.3 below.

<PatRec>
<Name>John Robertson</Name>
<MedTest>Blood Test</MedTest>
<Date>2002-01-03</Date>
</PatRec>
<PatRec>
<Name>Maria Estrada</Name>
<MedTest>Cholesterol Test</MedTest>
<Date>2002-07-27</Date>
</PatRec>
<PatRec>
<Name>Sandy Hellerman</Name>
<MedTest>X-Ray</MedTest>
<Date>2002-04-13</Date>
</PatRec>
<PatRec>
<Name>Julia Moriarty</Name>
<MedTest>Chest MRI</MedTest>
<Date>2002-12-23</Date>
</PatRec>
<PatRec>
<Name>Pedro Martinez</Name>
<MedTest>EKG</MedTest>
<Date>2003-02-11</Date>
</PatRec>
<PatRec>
<Name>Sayeed Rashidi</Name>
<MedTest>Cholesterol Test</MedTest>
<Date>2002-11-11</Date>
</PatRec>

Step 5. Insert the XML declaration, as well as the root element tags. In the example one could use <MedRecords> for the root element. The resulting document should now look as shown in Listing 1.4 below.

<?xml version="1.0" encoding="ASCII" ?>
<MedRecords>
<PatRec>
<Name>John Robertson</Name>
<MedTest>Blood Test</MedTest>
<Date>2002-01-03</Date>
</PatRec>
<PatRec>
<Name>Maria Estrada</Name>
<MedTest>Cholesterol Test</MedTest>
<Date>2002-07-27</Date>
</PatRec>
<PatRec>
<Name>Sandy Hellerman</Name>
<MedTest>X-Ray</MedTest>
<Date>2002-04-13</Date>
</PatRec>
<PatRec>
<Name>Julia Moriarty</Name>
<MedTest>Chest MRI</MedTest>
<Date>2002-12-23</Date>
</PatRec>
<PatRec>
<Name>Pedro Martinez</Name>
<MedTest>EKG</MedTest>
<Date>2003-02-11</Date>
</PatRec>
<PatRec>
<Name>Sayeed Rashidi</Name>
<MedTest>Cholesterol Test</MedTest>
<Date>2002-11-11</Date>
</PatRec>
</MedRecords>

Step 6. Save the document in text-only format with the appropriate name and extension. For example, you can save the document as "MedRecords.xml". Figure 1.5 below shows how the instance XML document would appear when viewed with IE 5.0.
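
For larger spreadsheets the manual steps above can be automated. The Python sketch below is one possible approach, not the procedure described in this chapter: it assumes the spreadsheet has been exported as tab-delimited text with the column headers of Table 01, and it shows only the first two rows. The function name table_to_xml and the TAGS mapping are names chosen for this example.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Tab-delimited export of the spreadsheet (first two rows only).
table = (
    "Name\tMedical Test\tDate\n"
    "John Robertson\tBlood Test\t2-Jan-02\n"
    "Maria Estrada\tCholesterol Test\t27-Jul-02\n"
)

# Column header -> XML tag name (tag names may not contain spaces).
TAGS = {"Name": "Name", "Medical Test": "MedTest", "Date": "Date"}

def table_to_xml(text):
    """Build a <MedRecords> tree with one <PatRec> element per row."""
    root = ET.Element("MedRecords")
    for row in csv.DictReader(io.StringIO(text), delimiter="\t"):
        record = ET.SubElement(root, "PatRec")
        for header, tag in TAGS.items():
            ET.SubElement(record, tag).text = row[header]
    return root

root = table_to_xml(table)
xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```

Building the tree through a library rather than by pasting strings has the side benefit that characters such as "<" and "&" in the data are escaped automatically when the document is written out.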

Summary

In this chapter we have learned the rationale for including 'markup' in our document-centric and data-centric documents, namely, that it facilitates their automated processing. We have also learned that XML is a very powerful markup syntax whose strength lies in the fact that it is extensible. The syntax of XML is very simple and in this chapter we have seen what the naming conventions are as well as the requirements for well-formedness.

We also have learned that XML instance documents are made up of elements (with or without attributes), as well as comments and processing instructions. A special kind of 'processing instruction' is the XML declaration which is normally the first entry in every complete XML document. When viewed structurally XML instance documents can be considered as 'tree structures', thus they cannot have more than one root and they must not contain overlapping nesting structures.

We have seen how data contained in a spreadsheet (or, for that matter, in a word processing document as comma- or tab-delimited text) can be readily transformed into an XML document that can be viewed on an XML-enabled browser such as IE 5 or higher.