Table 1.1
| 25 |
| -10 |
| 30 |
| -15 |
The numbers in Table 1.1 above are both positive and negative. They could represent, for example, the deviations of a patient's systolic blood pressure from the ideal value of 120 over a period of time. If one were to enclose the negative numbers between one type of appropriate symbol and the positive numbers between a different one, as shown in Table 1.2, then one could automatically process the table to filter the numbers into separate files.
Table 1.2: Low BP.dat
| $$ | 25 | $$ |
| ## | -10 | ## |
| $$ | 30 | $$ |
| ## | -15 | ## |
The pseudo-code shown below would copy the negative blood pressure deviations to a new file called LowBP.dat and the positive ones to HighBP.dat:
if number_string is enclosed between '##' then
    copy number_string to LowBP.dat
else
    copy number_string to HighBP.dat
end if
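The pseudo-code can be sketched in Python as follows. This is only an illustration: it assumes the marked-up numbers arrive as one text string with each number on its own line between its delimiters, and it collects the results in lists instead of writing LowBP.dat and HighBP.dat. The names MARKED_UP and split_by_markup are hypothetical.

```python
import re

# Hypothetical marked-up input: '$$' encloses positive deviations and
# '##' encloses negative ones (the line layout is an assumption).
MARKED_UP = "$$ 25 $$\n## -10 ##\n$$ 30 $$\n## -15 ##\n"


def split_by_markup(text):
    """Return (low, high) lists of number strings chosen by their delimiters."""
    low = re.findall(r"##\s*(-?\d+)\s*##", text)
    high = re.findall(r"\$\$\s*(-?\d+)\s*\$\$", text)
    return low, high


low, high = split_by_markup(MARKED_UP)
print(low)   # ['-10', '-15']
print(high)  # ['25', '30']
```

Writing each list to its own file would complete the filter. The point of the sketch is that the delimiters alone, not the sign of the number, decide where each value goes.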
The resulting LowBP.dat table would now look as follows:
| ## | -10 | ## |
| ## | -15 | ## |
Question 1.1: Are all symbols equally appropriate for the example above? For example, could one use numbers as 'appropriate symbols' for this type of 'markup'?
Question 1.2: What is the relationship between 'markup languages' and 'browsers'?
The preceding paragraphs highlight the substantial benefits that accrue when data receives 'markup'. They also show that, with few exceptions, pretty much anything goes when one is doing 'markup'. Therein lies the appeal of XML. It is a 'standard' way of doing 'markup', yet it is flexible enough to accommodate the needs of anyone doing markup because, unlike fixed vocabularies for special purposes such as HTML, the user is free to invent any type of 'tag' (provided it does not conflict with a couple of syntactical restrictions discussed below) to introduce structure into the data.
Couple this with the ability to add attributes to the tags and now we have not just an arbitrary combination of symbols to delimit the data but potentially a complete way of expressing what the data means. XML therefore allows one to specify not just the structure but also the semantics of the data.
When certain health care organizations agree on the need to exchange data in a certain way, they may agree to restrict the XML tags that can be used for their exchanges and to define strictly the meaning and use of the tags. When this occurs we speak of an XML application. For example, if hospital laboratories agree to use certain tags to indicate that the character string enclosed between a given pair of tags represents test results, then that is an XML application, namely, the Hospital Laboratory Markup Language (HLML). Once health care organizations create their own XML application they can develop specialized tools that manipulate the markup for any purpose they want. The most typical tools are browsers for presenting or rendering the data on a computer screen or as formatted output to a printer. But XML applications also allow other types of manipulation, such as importing data into databases.
Figure 1.1 below shows the structure of an XML start tag. As shown therein, an XML start tag consists of (a) delimiters, (b) a name and (c) zero, one or more attributes. The value of each attribute is enclosed in quotation marks.
Figure 1.1. Structure of the XML Start Tag
Figure 1.2 below shows the closing tag that corresponds to the start tag shown in Figure 1.1, as well as its content. Note that the closing tag starts with "</" and does not contain any attributes. The content of the XML tag goes between the start tag and the end tag. In the example shown the content is ASCII text. XML tags can contain ASCII strings as well as other XML tags.
XML tags are the physical expression of the XML elements which constitute the structure of any well-formed XML instance document.
Figure 1.2. Example of an XML Closing Tag and Content
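The parts of a tagged element (name, attribute and content) can also be inspected programmatically. The sketch below uses Python's standard xml.etree.ElementTree module. The attribute name unit and its value mmHg are assumptions made for illustration and are not taken from the figures.

```python
import xml.etree.ElementTree as ET

# Build one element with a name, one attribute and text content.
# The attribute "unit" and its value "mmHg" are illustrative assumptions.
element = ET.Element("BloodPressure", {"unit": "mmHg"})
element.text = "139"

# Serialize to text: start tag with attribute, content, then end tag.
serialized = ET.tostring(element, encoding="unicode")
print(serialized)  # <BloodPressure unit="mmHg">139</BloodPressure>

# Parsing the text back recovers the name, the attributes and the content.
parsed = ET.fromstring(serialized)
print(parsed.tag, parsed.attrib, parsed.text)
```

Note that the serialized end tag carries only the element name, while the attribute appears in the start tag alone.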
From Sections 1.1 through 1.4 above we now know that XML is a formalized way of performing markup: so what are the rules of the XML markup syntax?
Well, fortunately they are very simple, which is one of the reasons for the popularity of XML.
Naming Rules: element names, attribute names, as well as the names of several less common constructs:
Well-Formedness:
As we briefly alluded to in Section 1.5 above, every XML instance document is made up of elements whose physical representation takes the form of XML tags, both start and end, as well as content.
The content of a tag can be NULL, in which case the element is said to be 'empty' and, if desired, the start and end tags can be conflated into a single tag. For example, if there had been no content for the element BloodPressure in Figure 1.2, then, instead of writing the start tag immediately followed by the end tag, one could have written <BloodPressure/> to indicate that there was no content for this element.
Structurally speaking, all well-formed XML instance documents can be represented as a 'tree structure'. This means that they must have exactly one root element and that elements must be properly nested, without overlapping.
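One way to explore these requirements is to hand small fragments to a parser and see which ones it rejects. The sketch below is a minimal illustration using Python's standard xml.etree.ElementTree module. The helper name is_well_formed and the sample fragments are hypothetical.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the parser accepts xml_text as well-formed XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Hypothetical fragments, each probing one requirement.
samples = {
    "single root": "<Person><BloodPressure>139</BloodPressure></Person>",
    "empty element": "<Person><BloodPressure/></Person>",
    "two roots": "<Person/><Person/>",
    "overlapping tags": "<Person><Name>Peter</Person></Name>",
    "case mismatch": "<Person></person>",
}

for label, text in samples.items():
    print(f"{label}: {is_well_formed(text)}")
```

The first two fragments are accepted. The last three are rejected because they have more than one root, overlap their tags, or close an element with a name that differs in case from the start tag.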
XML instance documents can contain comments in addition to elements. To signal the parser that the string is a comment one encloses it between "<!--" and "-->". Comments are useful for providing further context and information to the user of the XML document.
Finally, since XML documents are produced so that other applications can process them (for example, for display on computer screens), it is also possible to include in an XML instance document strings which are directed to the processing application, i.e., processing instructions. All processing instructions begin with "<?" immediately followed by the target (the application intended to process the instruction) and end with "?>". When a parser encounters a processing instruction it can either process it or, if it has no means of interpreting it, ignore it. One of the most commonly used processing instructions is the style sheet processing instruction, which tells browsers how to render the contents of the XML instance document on the screen.
Style sheet Processing Instruction:
<?xml-stylesheet href="expenses.css" type="text/css"?>
The style sheet processing instruction must always be inserted before the root element of the XML instance document.
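To see how a processing application might pick up such an instruction, the sketch below parses a small document with Python's standard xml.dom.minidom module and lists the target and data of each processing instruction found before the root element. The <Expenses> document is a hypothetical example built around the expenses.css style sheet named above.

```python
from xml.dom import minidom

# Hypothetical document whose prolog carries a style sheet processing instruction.
DOC = (
    '<?xml-stylesheet href="expenses.css" type="text/css"?>'
    "<Expenses><Item>Lab test</Item></Expenses>"
)

document = minidom.parseString(DOC)

# Top-level nodes include the processing instruction and the root element.
instructions = [
    (node.target, node.data)
    for node in document.childNodes
    if node.nodeType == node.PROCESSING_INSTRUCTION_NODE
]
print(instructions)
```

The parser itself does nothing with the instruction beyond reporting it. It is up to the application that recognizes the target xml-stylesheet to act on the data.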
Arguably the XML declaration is a kind of processing instruction, although for historical reasons it is given its own name. The purpose of the XML declaration is to instruct the application about some important parameters concerning the document. Typical attributes of the XML declaration are the version, the encoding and the standalone flag, which signals to the processing application whether the document is self-contained or depends on declarations held externally.
XML Declaration Example:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
When present, the XML declaration must be the very first thing in the document; it precedes any other processing instruction, comment or element.
The best way to remember all these rules is to use them right away and to test what happens when we try to break them. Using Notepad or some equivalent text editor type the following:
<?xml version="1.0" encoding="ASCII" ?>
<!-- This is our first XML document. We begin with the
XML Declaration as shown above -->
<!-- Next we declare the root tag of the document -->
<PatientRecords>
<!-- The records are organized by person -->
<!-- The person's name, medical tests, date, clinician
and hospital are currently tracked -->
<!-- begin of first record -->
<Person>
<PersonName>Peter Jonas</PersonName>
<BloodPressure>139</BloodPressure>
<CholesterolLevel>209</CholesterolLevel>
<Triglycerides>90</Triglycerides>
<Date>9/12/99</Date>
<Clinician>Dr. John Martins</Clinician>
<Hospital>St. Michael Hospital</Hospital>
</Person>
<!-- end of first record -->
<!-- begin of second record -->
<Person>
<PersonName>Sally Federsen</PersonName>
<BloodPressure>146</BloodPressure>
<CholesterolLevel>209</CholesterolLevel>
<Triglycerides>90</Triglycerides>
<Date>10/01/01</Date>
<Clinician>Dr. Sally Bielefeld</Clinician>
<Hospital>St. Michael Hospital</Hospital>
</Person>
<!-- end of second record -->
<!-- end of document -->
</PatientRecords>
Listing 1.1. Example of an XML Document
Although one can use more sophisticated word processors, one should always remember to save the final documents as text-only so that all extraneous formatting codes are removed.
Figure 1.3. Rendering of XML Document from Listing 1.1 Using IE 5.0
You can omit any or all of the comments, and you can insert carriage returns to make the document more legible. They will be ignored by Internet Explorer when it processes the file for display on the screen.
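Comments are skipped not only by browsers but also by typical parsers. The sketch below, which assumes Python's standard xml.etree.ElementTree module and uses an abbreviated copy of Listing 1.1 (one record, two fields), reads the records into dictionaries and shows that only the elements survive.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of Listing 1.1: first record only, with fewer fields.
LISTING = """<PatientRecords>
<!-- begin of first record -->
<Person>
<PersonName>Peter Jonas</PersonName>
<BloodPressure>139</BloodPressure>
</Person>
<!-- end of first record -->
</PatientRecords>"""

root = ET.fromstring(LISTING)

# The default parser drops comments, so only element children are visited.
records = [
    {child.tag: child.text for child in person}
    for person in root.findall("Person")
]
print(records)  # [{'PersonName': 'Peter Jonas', 'BloodPressure': '139'}]
```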
It is good practice to develop a certain style for writing all XML documents since this facilitates review and minimizes errors of omission. The following are some suggested style rules:
Step 1. Let's begin with the prolog by writing the XML declaration:
<?xml version="1.0" encoding="ASCII" ?>
Step 2. We can add a comment to explain what the document is about. In this example we are writing a description of an influenza outbreak:
<?xml version="1.0" encoding="ASCII" ?>
<!-- data from public health agencies on influenza outbreak in 2002 -->
Step 3. Once this is done we can add the root tag. For this example we will use the Object Oriented style alluded to above. For the root tag we can choose a descriptive tag such as <Influenza_2002>. We write both the start and the end tag to make sure that our document is well-formed.
<?xml version="1.0" encoding="ASCII" ?>
<!-- data from public health agencies on influenza outbreak in 2002 -->
<Influenza_2002>
</Influenza_2002>
Step 4. The contents of the document are now going to be inserted between the start and end tags of the root tag <Influenza_2002>. As suggested in the style rules above we should have a record delimiter tag. We are going to choose a tag named <Report> for that purpose. Now we can think of what we want to say about each influenza report instance. For example, we may want to track the city where it occurred, how many patients were hospitalized, the average number of hospitalization days, and the number of deaths attributed to the outbreak.
<?xml version="1.0" encoding="ASCII" ?>
<!-- data from public health agencies on influenza outbreak in 2002 -->
<Influenza_2002>
<Report>
<cityName>Atlanta</cityName>
<patientQuantity>2,138</patientQuantity>
<hospitalStay unit="days">8</hospitalStay>
<casualtyQuantity>14</casualtyQuantity>
</Report>
<Report>
<cityName>Los Angeles</cityName>
<patientQuantity>15,764</patientQuantity>
<hospitalStay unit="days">5</hospitalStay>
<casualtyQuantity>77</casualtyQuantity>
</Report>
<Report>
<cityName>New York</cityName>
<patientQuantity>22,349</patientQuantity>
<hospitalStay unit="days">7</hospitalStay>
<casualtyQuantity>146</casualtyQuantity>
</Report>
<Report>
<cityName>Chicago</cityName>
<patientQuantity>12,808</patientQuantity>
<hospitalStay unit="days">6</hospitalStay>
<casualtyQuantity>94</casualtyQuantity>
</Report>
</Influenza_2002>
Listing 1.2. Application of Style Rules
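A document with this structure can also be generated by a program instead of being typed by hand. The sketch below is one possible approach using Python's standard xml.etree.ElementTree module. It reuses the tag names and two of the reports from Listing 1.2, while the REPORTS tuple layout and the build_document helper are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Two of the reports from Listing 1.2; the tuple layout is an assumption.
REPORTS = [
    ("Atlanta", "2,138", "8", "14"),
    ("Los Angeles", "15,764", "5", "77"),
]


def build_document(reports):
    """Build the root element with one <Report> record per tuple."""
    root = ET.Element("Influenza_2002")
    for city, patients, stay, deaths in reports:
        report = ET.SubElement(root, "Report")  # record delimiter tag
        ET.SubElement(report, "cityName").text = city
        ET.SubElement(report, "patientQuantity").text = patients
        ET.SubElement(report, "hospitalStay", unit="days").text = stay
        ET.SubElement(report, "casualtyQuantity").text = deaths
    return root


root = build_document(REPORTS)
ET.indent(root)  # pretty-printing helper, available in Python 3.9 and later
xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```

Because the library writes every start tag together with its end tag, the output is well-formed by construction.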
Step 5. If the XML instance document is to be processed by a special application then the pertinent processing instructions can be added after the document section, i.e., after the closing tag of the root element. In Listing 1.2 above there are no processing instructions so the XML instance document is ready to be viewed using an XML-enabled browser such as Microsoft IE 5.0 or higher. Figure 1.4 below shows the results.
Figure 1.4. Rendering of XML Document from Listing 1.2 Using IE 5.0
It should be noted that the above are recommended style rules; for example, it is perfectly valid to insert processing instructions other than at the very end of the document section. In fact, processing instructions can be inserted anywhere a comment can be inserted—see Listing 1.1. The only restriction is that a processing instruction should not be inside a tag itself. In other words, a processing instruction cannot be treated as an attribute of a tag.
Before we proceed with further basic XML concepts and techniques it is probably worthwhile to spend a moment considering some of the general characteristics of data, since they impact to a certain degree the choices of technologies and applications. If we look at the data we deal with on a daily basis we can see that some of it is mostly in the form of lists, whereas other data is mostly in the form of narratives. Lists of hospital names, patient names, medical treatments, diagnoses, medical product suppliers, medical product classes, etc., are what we call structured data. The reason for using this label is that when we examine the attributes of this type of data we see that all the instances of a particular set share the same characteristics. For example, if we look at a list of patient names (assuming the list does not contain extraneous entries) it is fair to say that every entry conveys a well-defined meaning, i.e., every name in the list is that of an individual who has received treatment from the health care provider. We express this fact by stating that the semantics of the members of the list is common to all its members. In addition, the way in which the data is encoded is probably also fixed. For example, the patient names are all character data that cannot exceed a certain number of characters, e.g., 255. We express this fact by stating that the syntax of the members of the list is common to all its members.
Lists are examples of data-centric documents. They are characterized by regular structure, as well as atomicity, i.e., they are semantically at the lowest level of required decomposition. In addition, the order of the elements that constitute a data-centric document is not essential. For example in Listing 1.2 above we could have chosen to put the <casualtyQuantity> element first and this would not have changed the overall meaning of the information conveyed by the records.
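The claim about element order can be checked with a short sketch. It assumes Python's standard xml.etree.ElementTree module and a shortened <Report> record with two of the fields from Listing 1.2. The helper as_record is hypothetical.

```python
import xml.etree.ElementTree as ET

# The same record written with its child elements in two different orders.
ORIGINAL = (
    "<Report><cityName>Atlanta</cityName>"
    "<casualtyQuantity>14</casualtyQuantity></Report>"
)
REORDERED = (
    "<Report><casualtyQuantity>14</casualtyQuantity>"
    "<cityName>Atlanta</cityName></Report>"
)


def as_record(xml_text):
    """Map each child element name to its text, discarding element order."""
    return {child.tag: child.text for child in ET.fromstring(xml_text)}


print(as_record(ORIGINAL) == as_record(REORDERED))  # True
```

An application that looks up fields by name, as this one does, sees the two records as identical.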
Spreadsheets and, even more so, databases are applications that have been optimized to handle structured data (i.e., lists with well-defined semantics and syntax).
At the other extreme there is narrative data or document-centric data. The main feature of this type of data is that it does not conform to the definition of a list. It is characterized by a less regular or completely irregular structure. At the semantic level the content of the elements is not atomic. And, lastly, the order in which the elements occur is almost always significant. For example, this entire chapter can be considered narrative data. It clearly does not look anything like a list, and it matters whether this paragraph appears at the beginning or in its current position. Although spreadsheets and databases can also be used to store this type of data, the functions that work so well with lists, such as searches and comparisons, will not work with the same efficiency and, in some cases, depending on the application, they may not even be available when the data stored is a narrative as opposed to a list.
It should be noted that multimedia constitutes a new and emerging category of data which poses its own unique type of challenges. For example, digitized X-ray images, MRI and CAT scans, as well as voice and video streams may require special applications that combine relational and object-oriented capabilities to perform comparisons and searches based on the actual content of the binary objects.
In addition to the division of data into structured and unstructured types, data may also have a temporal aspect dictated by its refresh rate. For example, when we look at lists we notice that some are fairly stable over time, e.g., the list of all the US hospitals can be considered static, provided none are going bankrupt or going through mergers. Other data is highly dynamic; for example, the list of patients treated by a health care provider changes from visit to visit. In between these two extremes we have semi-static data. A good example of this type of data is the employee telephone directory of a hospital, which may be updated every six months or every year. This type of data has a fixed periodicity which is shorter than that of static data, but much longer than that of dynamic data.
The implications of the data taxonomy discussed above when dealing with data integration are as follows. Historically, there have been two distinct communities applying XML: the community dealing with data-centric documents and the community producing narrative documents. The former has primarily looked at XML as a data transport mechanism, and its documents are meant to support automated messaging among applications, e.g., database to database exchanges. The use of XML in that way is one of convenience rather than necessity. For data integration, however, XML constitutes the essential component for facilitating the merging of diverse data sources. It is also clear that data integration works best when the data has been distilled into the form of data-centric documents.
The community producing narrative documents has used XML as a means to define information content models. For example, an entire book can be described in terms of elements such as <Chapter>, <Section> and <Paragraph>. The use of this information content model permits the application of specific processing instructions to the content of each kind of element. For example, if we want to display the document on a browser we may choose one font for the content of <Chapter> elements and another for the content of <Section> elements. We may automatically create a table of contents for the document by selecting only the contents of the <Chapter> and <Section> elements but not that of the <Paragraph> elements. The lessons learned and techniques developed by this community are of secondary importance when doing data integration, and will not be covered in any detail in this book.
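As a brief illustration of the table of contents idea, the sketch below selects only the <Chapter> and <Section> elements of a small book and skips the <Paragraph> elements. It assumes Python's standard xml.etree.ElementTree module. The <Book> root element, the title attribute and the sample titles are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical book; the <Book> root and the title attribute are assumptions.
BOOK = """<Book>
<Chapter title="Essential XML Concepts">
<Section title="Markup"><Paragraph>Some narrative text.</Paragraph></Section>
<Section title="Well-Formedness"><Paragraph>More narrative text.</Paragraph></Section>
</Chapter>
</Book>"""


def table_of_contents(xml_text):
    """Collect chapter and section titles in document order."""
    entries = []
    for element in ET.fromstring(xml_text).iter():
        if element.tag == "Chapter":
            entries.append(element.get("title"))
        elif element.tag == "Section":
            entries.append("  " + element.get("title"))  # indent sections
    return entries


toc = table_of_contents(BOOK)
print("\n".join(toc))
```

Here the order of the elements matters: the entries come out in the sequence in which the chapters and sections appear in the document.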
Having stipulated that the main focus of XML for data integration is on structured data, we are going to show via a simple example how one could XMLize it so that all the other XML technologies can be brought to bear. Because many data sources exist in the form of spreadsheets, or can be ported readily to that form, we will work out the techniques for how to transform raw source data into well-formed XML documents using such a source. The following steps show the process.
Step 1. In the spreadsheet to be transformed, analyze the columns to decide the names of the tags you want to introduce. In the example shown in Table 01 below, the columns already have headers which can serve as the basis for the names of the XML tags. We will, therefore, create three XML tags, namely, <Name>, <MedTest> and <Date>.
Table 01. Initial Sample Spreadsheet
| Name | Medical Test | Date |
| John Robertson | Blood Test | 2-Jan-02 |
| Maria Estrada | Cholesterol Test | 27-Jul-02 |
| Sandy Hellerman | X-Ray | 13-Mar-02 |
| Julia Moriarty | Chest MRI | 23-Dec-01 |
| Pedro Martinez | EKG | 11-Feb-02 |
| Sayeed Rashidi | Cholesterol Test | 11-Nov-02 |
Step 2. Insert columns to the right and to the left of each column and paste the start and end XML tags chosen in Step 1 above. The resulting spreadsheet should look as shown in Table 02 below.
Table 02. Sample Spreadsheet with XML Tags Inserted
| | Name | | | Medical Test | | | Date | |
| <Name> | John Robertson | </Name> | <MedTest> | Blood Test | </MedTest> | <Date> | 02-Jan-03 | </Date> |
| <Name> | Maria Estrada | </Name> | <MedTest> | Cholesterol Test | </MedTest> | <Date> | 27-Jul-02 | </Date> |
| <Name> | Sandy Hellerman | </Name> | <MedTest> | X-Ray | </MedTest> | <Date> | 13-Mar-02 | </Date> |
| <Name> | Julia Moriarty | </Name> | <MedTest> | Chest MRI | </MedTest> | <Date> | 23-Dec-01 | </Date> |
| <Name> | Pedro Martinez | </Name> | <MedTest> | EKG | </MedTest> | <Date> | 11-Feb-03 | </Date> |
| <Name> | Sayeed Rashidi | </Name> | <MedTest> | Cholesterol Test | </MedTest> | <Date> | 11-Nov-02 | </Date> |
Step 3. Insert a record delimiter tag for each record (i.e., for each row). In our example we may use a record delimiter tag such as <PatRec>.
| | | Name | | | |
| <PatRec> | <Name> | John Robertson | </Name> | ... | </PatRec> |
| <PatRec> | <Name> | Maria Estrada | </Name> | ... | </PatRec> |
| <PatRec> | <Name> | Sandy Hellerman | </Name> | ... | </PatRec> |
| <PatRec> | <Name> | Julia Moriarty | </Name> | ... | </PatRec> |
| <PatRec> | <Name> | Pedro Martinez | </Name> | ... | </PatRec> |
| <PatRec> | <Name> | Sayeed Rashidi | </Name> | ... | </PatRec> |
Step 4. Copy the spreadsheet, without the column headers, and paste it into your preferred word processing application. Remove the tabs and insert a paragraph return between each "><" sequence. For example, if you are using MSWord, paste the spreadsheet using Paste Special.../Unformatted Text. In Edit/Replace... enter "^t" in the Find what field, leave the Replace with field blank, and then press Replace All. Next, while still in Edit/Replace..., enter "><" in the Find what field and ">^p<" in the Replace with field, and then press Replace All. The text should now look as shown in Listing 1.3 below.
<PatRec>
<Name>John Robertson</Name>
<MedTest>Blood Test</MedTest>
<Date>2002-01-03</Date>
</PatRec>
<PatRec>
<Name>Maria Estrada</Name>
<MedTest>Cholesterol Test</MedTest>
<Date>2002-07-27</Date>
</PatRec>
<PatRec>
<Name>Sandy Hellerman</Name>
<MedTest>X-Ray</MedTest>
<Date>2002-04-13</Date>
</PatRec>
<PatRec>
<Name>Julia Moriarty</Name>
<MedTest>Chest MRI</MedTest>
<Date>2002-12-23</Date>
</PatRec>
<PatRec>
<Name>Pedro Martinez</Name>
<MedTest>EKG</MedTest>
<Date>2003-02-11</Date>
</PatRec>
<PatRec>
<Name>Sayeed Rashidi</Name>
<MedTest>Cholesterol Test</MedTest>
<Date>2002-11-11</Date>
</PatRec>
Listing 1.3. Tagged Text for the Sample Spreadsheet
Step 5. Insert the XML declaration, as well as the root element tags. In the example one could use <MedRecords> for the root element. The resulting document should now look as shown in Listing 1.4 below.
<?xml version="1.0" encoding="ASCII" ?>
<MedRecords>
<PatRec>
<Name>John Robertson</Name>
<MedTest>Blood Test</MedTest>
<Date>2002-01-03</Date>
</PatRec>
<PatRec>
<Name>Maria Estrada</Name>
<MedTest>Cholesterol Test</MedTest>
<Date>2002-07-27</Date>
</PatRec>
<PatRec>
<Name>Sandy Hellerman</Name>
<MedTest>X-Ray</MedTest>
<Date>2002-04-13</Date>
</PatRec>
<PatRec>
<Name>Julia Moriarty</Name>
<MedTest>Chest MRI</MedTest>
<Date>2002-12-23</Date>
</PatRec>
<PatRec>
<Name>Pedro Martinez</Name>
<MedTest>EKG</MedTest>
<Date>2003-02-11</Date>
</PatRec>
<PatRec>
<Name>Sayeed Rashidi</Name>
<MedTest>Cholesterol Test</MedTest>
<Date>2002-11-11</Date>
</PatRec>
</MedRecords>
Listing 1.4. Finalized XML document for "MedRecords.xml"
Step 6. Save the document in text-only format with the appropriate name and extension. For example, you can save the document as "MedRecords.xml". Figure 1.5 below shows how the instance XML document would appear when viewed with IE 5.0.
Figure 1.5. The MedRecords.xml document viewed with IE 5.0
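The manual steps above can also be automated. The sketch below is one possible approach, not the procedure described in this section: it assumes the spreadsheet has been exported as tab-delimited text with its header row, and it uses Python's standard csv and xml.etree.ElementTree modules. The names SPREADSHEET, TAGS and spreadsheet_to_xml are hypothetical, the dates are copied exactly as they appear in the cells, and the declaration names UTF-8 as the encoding, which is a choice made for this sketch.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Tab-delimited export of the first two rows of Table 01, header row included.
SPREADSHEET = (
    "Name\tMedical Test\tDate\n"
    "John Robertson\tBlood Test\t2-Jan-02\n"
    "Maria Estrada\tCholesterol Test\t27-Jul-02\n"
)

# Step 1: map each column header to the tag name chosen for it.
TAGS = {"Name": "Name", "Medical Test": "MedTest", "Date": "Date"}


def spreadsheet_to_xml(text):
    """Turn tab-delimited rows into a <MedRecords> element tree."""
    root = ET.Element("MedRecords")  # Step 5: root element
    for row in csv.DictReader(io.StringIO(text), delimiter="\t"):
        record = ET.SubElement(root, "PatRec")  # Step 3: record delimiter
        for header, tag in TAGS.items():
            # Step 2: wrap each cell in its start and end tags.
            ET.SubElement(record, tag).text = row[header]
    return root


root = spreadsheet_to_xml(SPREADSHEET)

# Step 5 (continued): prepend the XML declaration to the serialized tree.
document = '<?xml version="1.0" encoding="UTF-8"?>\n' + ET.tostring(
    root, encoding="unicode"
)
print(document)
```

Saving the printed text as "MedRecords.xml" would correspond to Step 6. A side benefit of generating the tags with a library is that characters such as "&" or "<" inside a cell are escaped automatically.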
In this chapter we have learned the rationale for including 'markup' in our document-centric and data-centric documents, namely, that it facilitates their automated processing. We have also learned that XML is a very powerful markup syntax whose strength lies in the fact that it is extensible. The syntax of XML is very simple and in this chapter we have seen what the naming conventions are as well as the requirements for well-formedness.
We also have learned that XML instance documents are made up of elements (with or without attributes), as well as comments and processing instructions. A special kind of 'processing instruction' is the XML declaration which is normally the first entry in every complete XML document. When viewed structurally XML instance documents can be considered as 'tree structures', thus they cannot have more than one root and they must not contain overlapping nesting structures.
We have seen how data contained in a spreadsheet (or, for that matter, in a word processing document as comma- or tab-delimited text) can be readily transformed into an XML document that can be viewed on an XML-enabled browser such as IE 5 or higher.
Following the steps described in this section, convert the list below into a well-formed, complete XML document in which the names are broken into <firstName>, <middleInitial> and <lastName>. Use the root element <PersonNames> and the record delimiter <PersonName>. Save the document as "PersonNames.xml" and render it with your preferred browser.
Frederick M. Wheelock
Martin A. Gardner
Denise G. Robertson
Claudia P. Gonzales
Martina H. Ramirez
Leilah A. Rashidi
Igmar S. Bergstrom
Mishiko Takahashi
Please bring your work to class and be prepared to present it.
This page is part of the course on Data Integration, in the lecture on Essential XML Concepts. It was last edited on 05/12/2003. For more information contact us. © Copyright protected.