George Mason University
Process Improvement
 


Essential XML Concepts


Preliminary Remarks

As shown in the roadmap discussed in the introductory chapter, the first step towards data integration in the context of XML technologies is to create instance XML documents out of the raw data sources targeted for integration. To accomplish this, health care providers need to learn the basic principles of XML so that they can generate well-formed XML documents.

This chapter gives an overview of the XML syntax and shows how raw data sources in the form of spreadsheets can be readily manipulated to generate a well-formed instance XML document. The methodology can be equally applied to comma-delimited files or other textual sources, although the less structured the source, the more manipulation it may require.

Why Data Integration?

We define a 'stove-pipe' data store as a data repository whose semantics and syntax do not conform to those of another data store engaged in a similar kind of business. There is nothing wrong with 'stove-pipe' data stores. In fact, every time we launch our favorite database application and create a database to organize patient records or keep track of our physician certification process we are likely to create a 'stove-pipe' data store, i.e., one whose tables and columns are likely to be different from those of someone else undertaking basically the same activity. At the personal level there is nothing wrong with that because it is highly unlikely that we will be interested in asking for someone else's data, i.e., we seldom share any type of stored data. At the business level the opposite is more common. When a company is acquired by a larger corporation its data may have to be merged with that of the acquiring enterprise in order for the parent company to take advantage of that resource and to minimize maintenance costs. When a company has multiple data systems, these systems need to be able to talk to each other. A merged enterprise-wide database cannot be created without integrating data from different types of databases with different structure and dissimilar meanings.

After the events of September 11 it has become quite apparent that data sharing is key to a successful Homeland Defense strategy, and the Federal government is actively pursuing ways of making data more freely shared among the respective agencies. On the other hand, the enactment of HIPAA legislation restricts the sharing of patient-identifiable information across health care organizations in the community, so there is a need to develop methodologies that permit data integration without relying on such things as the patient’s Social Security Number. The expected benefits of increased sharing, however, may not materialize unless data is restructured into a common format capable of automated processing: although humans are quite good at figuring out whether one kind of data is essentially the same as another, they are notoriously slow and prone to error.

How does a 'Markup Language' help?

XML is probably the most robust syntax for performing 'markup'. If you are not familiar with XML this probably didn't tell you much. So let's back up a little. The idea behind markup is quite powerful because data which contains markup can be processed much more efficiently. Let's look at the example shown in Table 1.1 below.

Table 1.1

25

-10

30

-15

The numbers in Table 1.1 above are both positive and negative; they could represent, for example, the deviations of a patient’s systolic blood pressure from the ideal value of 120 over a period of time. If one were to enclose the negative numbers between one type of symbol and the positive numbers between a different one, as shown in Table 1.2, then one could automatically process the table to filter the numbers into separate files.

Table 1.2:  Blood pressure deviations with markup

$$

25

$$

##

-10

##

$$

30

$$

##

-15

##

The pseudo-code shown below would filter the negative deviations (the low blood pressure measurements) to a new file called LowBP.dat, and the positive ones to HighBP.dat:

if number_string is enclosed between '##' then
    copy number_string to LowBP.dat
else
    copy number_string to HighBP.dat
end if

The resulting LowBP.dat table would now look as follows:

Table 1.3:  LowBP.dat

##

-10

##

##

-15

##
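The filtering idea can also be sketched in Python. This is only an illustration: the '$$' and '##' delimiters follow Table 1.2, while the function name split_by_markup and the use of in-memory lists instead of the files LowBP.dat and HighBP.dat are assumptions made for the sketch. Unlike Table 1.3, the sketch keeps only the values and drops the markers.

```python
# Sketch only: the '$$' / '##' delimiters follow Table 1.2; the function
# name and the in-memory lists (instead of files) are assumptions.

def split_by_markup(lines):
    """Route each value to 'low' or 'high' according to its enclosing marker."""
    low, high = [], []
    current_marker = None          # marker that opened the current value, if any
    for raw in lines:
        token = raw.strip()
        if not token:
            continue               # skip blank lines
        if token in ("##", "$$"):
            # A marker either opens a value or closes the one in progress.
            current_marker = None if current_marker == token else token
        elif current_marker == "##":
            low.append(token)      # would be copied to LowBP.dat
        elif current_marker == "$$":
            high.append(token)     # would be copied to HighBP.dat
    return low, high


marked_up = ["$$", "25", "$$", "##", "-10", "##",
             "$$", "30", "$$", "##", "-15", "##"]
low_bp, high_bp = split_by_markup(marked_up)
print(low_bp)   # ['-10', '-15']
print(high_bp)  # ['25', '30']
```

Without the markers the program would have to inspect each value to decide where it belongs; with them, the decision depends on the markup alone.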

Question 1.1:  Are all symbols equally appropriate for the example above? For example, could one use numbers as 'appropriate symbols' for this type of 'markup'?

Question 1.2:  What is the relationship between 'markup languages' and 'browsers'?

X stands for eXtensible

The preceding paragraphs highlight the substantial benefits that accrue when data receives 'markup'. They also show that, with few exceptions, pretty much anything goes when one is doing 'markup'. Therein lies the appeal of XML. It is a 'standard' way of doing 'markup', yet it is flexible enough to accommodate the needs of anyone doing markup because, unlike fixed vocabularies for special purposes such as HTML, the user is free to invent any type of 'tag' (provided it does not conflict with a couple of syntactical restrictions discussed below) to introduce structure into the data.

Couple this with the ability to add attributes to the tags and we now have not just an arbitrary combination of symbols to delimit the data but potentially a complete way of expressing what the data means. XML therefore allows one to specify not just the structure but also the semantics of the data.

XML applications

When certain health care organizations agree on the need to exchange data in a certain way they may agree to restrict the XML tags that can be used for their exchanges and to define strictly the meaning and use of the tags. When this occurs we speak of an XML application. For example, if hospital laboratories agree to use certain tags to indicate that the character string enclosed between a given pair of tags represents test results, then that is an XML application, namely, the Hospital Laboratory Markup Language (HLML). Once health care organizations create their own XML application they can develop specialized tools that are able to manipulate the markup for any purpose they want. The most typical tools are browsers for presentation or rendering of the data on a computer screen or formatted output to a printer. But XML applications also allow other types of manipulations such as importing data into databases.

XML Basics—the XML Tag and the XML Element

Figure 1.1 below shows the structure of a start XML tag. As shown therein an XML tag consists of (a) delimiters, (b) a name and (c) zero, one or more attributes. The value of the attribute is enclosed in quotation marks.

Figure 1.1.  Structure of the XML Start Tag

Figure 1.2 below shows the closing XML tag corresponding to the start tag shown in Figure 1.1, as well as its content. Note that the closing tag starts with "</" and does not contain any attributes. The content of the XML tag goes between the start tag and the end tag. In the example shown the content is ASCII. XML tags can contain ASCII strings as well as other XML tags.

XML tags are the physical expression of the XML elements which constitute the structure of any well-formed XML instance document.

Figure 1.2.  Example of an XML Closing Tag and Content
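To make the parts of a tag concrete, the following sketch reads one element with Python's standard xml.etree.ElementTree module and prints its name, attributes, and content. Since the figures are not reproduced here, the element is hypothetical: the BloodPressure name is borrowed from the text, but the attribute name unit and its value mmHg are assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical element: the attribute name 'unit' and its value 'mmHg'
# are illustrative assumptions, not taken from the figures.
snippet = '<BloodPressure unit="mmHg">139</BloodPressure>'

element = ET.fromstring(snippet)
print(element.tag)      # the name of the tag
print(element.attrib)   # the attributes, as a dictionary
print(element.text)     # the content between the start and end tags
```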

XML Syntax Rules

From Sections 1.1 through 1.4 above we now know that XML is a formalized way of performing markup: so what are the rules of the XML markup syntax?

Well, they are fortunately very simple, which is one of the reasons for the popularity of XML.

Naming Rules: the following rules apply to element names, attribute names, and the names of several less common constructs:

  • XML names are case sensitive
  • XML names cannot begin with a digit, a hyphen, or a period
  • They may contain essentially any alphanumeric character. This includes the standard English letters A through Z and a through z as well as the digits 0 through 9. XML names may also include non-English letters, numbers, and ideograms. They may also include these three punctuation characters:
    • the underscore
    • the hyphen
    • the period
  • XML names may not contain other punctuation characters such as quotation marks, apostrophes, dollar signs, carets, percent symbols, and semicolons. The colon is allowed, but its use is reserved for namespaces.
  • white space of any kind, i.e., a space, a carriage return, a line feed, a non-breaking space, and so forth is strictly forbidden in XML names.
  • XML names should not begin with the string XML in any combination of lower and upper case whatsoever. Such names are reserved by the XML specification; many parsers do not reject them, but they should be avoided.
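As a rough illustration, the naming rules above can be approximated with a regular expression. The sketch below is deliberately restricted to ASCII letters and digits (so it rejects the non-English letters that XML does allow), leaves out the colon because of its namespace role, and uses a hypothetical function name, check_xml_name.

```python
import re

# ASCII-only approximation of the naming rules listed above (real XML also
# admits many non-English letters). The function name is hypothetical.
_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_.\-]*\Z")


def check_xml_name(name):
    """Return 'ok', 'reserved' or 'invalid' for a candidate XML name."""
    if not _NAME.match(name):
        return "invalid"
    if name.lower().startswith("xml"):
        return "reserved"      # names starting with 'xml' in any letter case
    return "ok"


for candidate in ["BloodPressure", "blood-pressure.v2", "9lives",
                  "blood pressure", "cost$", "XmlData"]:
    print(candidate, "->", check_xml_name(candidate))
```

A check like this is convenient when tag names are derived automatically, for example from spreadsheet column headers, which often contain spaces.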

Well-Formedness:

  • Every XML start-tag must have a matching end-tag.
  • XML elements may nest, but may not overlap.
  • There must be exactly one root element.
  • Attribute values must be quoted.
  • An element may not have two attributes with the same name.
  • Comments and processing instructions may not appear inside tags.
  • No un-escaped "<" or "&" signs may occur in the character data of an element or attribute.
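One way to see these rules in action is to hand small fragments to a parser and observe which ones it rejects. The sketch below uses Python's standard xml.etree.ElementTree parser (an assumption; any conforming XML parser should behave the same way), and the helper name is_well_formed is hypothetical.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if the parser accepts the text, False otherwise."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


samples = {
    "properly nested":      "<a><b>1</b></a>",
    "missing end-tag":      "<a><b>1</a>",
    "overlapping elements": "<a><b>1</a></b>",
    "two root elements":    "<a>1</a><b>2</b>",
    "unquoted attribute":   "<a unit=days>1</a>",
    "duplicate attribute":  '<a unit="d" unit="h">1</a>',
    "unescaped ampersand":  "<a>Smith & Jones</a>",
    "escaped ampersand":    "<a>Smith &amp; Jones</a>",
}
for label, text in samples.items():
    print(f"{label:22} {is_well_formed(text)}")
```

Only the first and last samples are accepted; each of the others breaks one of the rules listed above.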

XML Elements & Processing Instructions

As we briefly alluded to in Section 1.5 above, every XML instance document is made up of elements whose physical representation is in the form of XML tags, both start and end, as well as content.

An element may have no content at all, in which case it is said to be 'empty' and, if desired, the start and end tags can be conflated into a single one. For example, if there had been no content for the element BloodPressure in Figure 1.2, then, instead of writing the start tag immediately followed by the end tag, one could have written <BloodPressure/> to indicate that there was no content for this element.
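A short sketch, again using Python's standard xml.etree.ElementTree as an assumed parser, suggests that the two ways of writing an empty element are equivalent once parsed:

```python
import xml.etree.ElementTree as ET

# The same empty element written with a start/end pair and in conflated form.
long_form = ET.fromstring("<BloodPressure></BloodPressure>")
short_form = ET.fromstring("<BloodPressure/>")

# Neither element carries text or child elements.
print(long_form.text, short_form.text)
print(len(long_form), len(short_form))

# Both serialize to the same bytes.
print(ET.tostring(long_form) == ET.tostring(short_form))
```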

Structurally speaking, all well-formed XML instance documents can be represented as a 'tree structure'. This means that they must have only one root element.

XML instance documents can contain comments in addition to elements. To signal the parser that the string is a comment one encloses it between "<!--" and "-->". Comments are useful for providing further context and information to the user of the XML document.

Finally, since XML documents are produced so that other applications can process them, for example for display on computer screens, it is also possible to include in an XML instance document strings which are directed to the processing application, i.e., processing instructions. All processing instructions begin with "<?" immediately followed by the target (the application intended to process the instruction) and end with "?>". When a parser encounters a processing instruction it can either process it or, if it does not have a means for interpreting it, ignore it. One of the most commonly used processing instructions is the style sheet processing instruction, which is used to tell browsers how to render the contents of the XML instance document on the screen.

Style sheet Processing Instruction:

<?xml-stylesheet href="expenses.css" type="text/css"?>

The style sheet processing instruction must always be inserted before the root element of the XML instance document.
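The sketch below shows how a parser might expose the target and the data of a processing instruction. It uses Python's standard xml.dom.minidom module; the Expenses root element and the comment are assumptions added only to make a complete document.

```python
from xml.dom import minidom, Node

# Small illustrative document; the <Expenses> root element is an assumption.
document_text = (
    '<?xml version="1.0"?>'
    '<?xml-stylesheet href="expenses.css" type="text/css"?>'
    '<!-- a comment in the prolog -->'
    '<Expenses/>'
)

dom = minidom.parseString(document_text)
for node in dom.childNodes:
    if node.nodeType == Node.PROCESSING_INSTRUCTION_NODE:
        print("processing instruction:", node.target, "|", node.data)
    elif node.nodeType == Node.COMMENT_NODE:
        print("comment:", node.data)
    elif node.nodeType == Node.ELEMENT_NODE:
        print("root element:", node.tagName)
```

The parser reports the target (xml-stylesheet) separately from the rest of the instruction; interpreting that remainder is left to the application the instruction is aimed at.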

Arguably the XML declaration is a kind of processing instruction, although for historical reasons it is given its own name. The purpose of the XML declaration is to inform the application about some important parameters concerning the document. Typical attributes of the XML declaration are the version, the encoding, and a 'standalone' flag that signals to the processing application whether the declarations used to validate the document are contained in the document itself or are external.

XML Declaration Example:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>

The XML declaration precedes any other processing instruction.

 

A Sample Document

The best way to remember all these rules is to use them right away and to test what happens when we try to break them.  Using Notepad or some equivalent text editor type the following:

<?xml version="1.0" encoding="ASCII" ?>
<!-- This is our first XML document. We begin with the
XML Declaration as shown above -->
<!-- Next we declare the root tag of the document -->
<PatientRecords>
<!-- The records are organized by person -->
<!-- The person's name, medical tests, date, clinician
and hospital are currently tracked -->
<!-- begin of first record -->
<Person>
<PersonName>Peter Jonas</PersonName>
<BloodPressure>139</BloodPressure>
<CholesterolLevel>209</CholesterolLevel>
<Triglycerides>90</Triglycerides>
<Date>9/12/99</Date>
<Clinician>Dr. John Martins</Clinician>
<Hospital>St. Michael Hospital</Hospital>
</Person>
<!-- end of first record -->
<!-- begin of second record -->
<Person>
<PersonName>Sally Federsen</PersonName>
<BloodPressure>146</BloodPressure>
<CholesterolLevel>209</CholesterolLevel>
<Triglycerides>90</Triglycerides>
<Date>10/01/01</Date>
<Clinician>Dr. Sally Bielefeld</Clinician>
<Hospital>St. Michael Hospital</Hospital>
</Person>
<!-- end of second record -->
 <!-- end of document -->
</PatientRecords>
 
Listing 1.1.  Example of an XML Document

Although one can use more sophisticated word processors one should always remember to save the final documents as text-only to ensure that all extraneous codes needed for formatting are removed.


Figure 1.3.  Rendering of XML Document from Listing 1.1 Using IE 5.0

You can omit any or all of the comments, and you can insert carriage returns to make the document more legible. They will be ignored by Internet Explorer when it processes the file for display on the screen.
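A browser is not the only way to inspect such a document. The following sketch reads an abbreviated copy of Listing 1.1 with Python's standard xml.etree.ElementTree module (an assumed tool, not one used in the text) and prints each record as a dictionary. The comments and the line breaks between tags do not affect the result.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of Listing 1.1 (first record only, fewer fields).
listing = """<?xml version="1.0" encoding="ASCII" ?>
<!-- This is our first XML document -->
<PatientRecords>
  <!-- begin of first record -->
  <Person>
    <PersonName>Peter Jonas</PersonName>
    <BloodPressure>139</BloodPressure>
  </Person>
</PatientRecords>"""

root = ET.fromstring(listing)
print(root.tag)
for person in root.findall("Person"):
    # Comments never show up here: the default parser discards them.
    record = {field.tag: field.text for field in person}
    print(record)
```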

Recommended Style Rules

It is good practice to develop a certain style for writing all XML documents since this facilitates review and minimizes errors of omission. The following are some suggested style rules:

  • Create a 'prolog' section consisting of the XML declaration plus any comments related to the purpose, contents, or general description of the document you are about to write
  • Create the document section by writing the root tag and closing it. This way when writing very long documents you will not forget to terminate the document proper
  • Insert between the start and the end root tag the actual content of the document. Whenever possible create record delimiter tags to indicate clearly where the record begins and where it ends
  • Use indentation to indicate the different levels of the document
  • Use letter case and the separator consistently. For example you may want to write all your tag names in upper-case and use underscores to separate the nouns. This makes it easy to import into databases such as ORACLE. Alternatively, you may want to use a style similar to that employed in Object Oriented programming, the so-called 'UpperCamel-lowerCamel' style, where there are no separators between the nouns: the first letter of each noun in an element corresponding to a table name in a database is capitalized, whereas the elements corresponding to columns in a table start with lower-case and the first letter of each successive noun is capitalized. (See example below)
  • Use meaningful tag names. XML does not penalize you for being verbose. Cryptic tag names may be appropriate in some circumstances, but long, explicit names go a long way in aiding understanding of the document by users not intimately familiar with it.
  • Adopt a consistent use of 'attributes'. In general processing is easier when using elements as opposed to encapsulating information as attributes of the tags.
  • Use comments when appropriate. Remember that comments are not processed, therefore, if they contain relevant information you should consider creating elements to capture it so that it can be accessible to your application.

Example:

Step 1. Let's begin with the prolog by writing the XML declaration:

<?xml version="1.0" encoding="ASCII" ?>

Step 2. We can add a comment to explain what the document is about.  In this example we are writing a description of an influenza outbreak:

<?xml version="1.0" encoding="ASCII" ?>
<!-- data from public health agencies on influenza outbreak in 2002 -->

Step 3. Once this is done we can add the root tag.  For this example we will use the Object Oriented style alluded to above.  For the root tag we can choose a descriptive tag such as <Influenza_2002>.  We write both the start and the end tag to make sure that our document is well formed.

<?xml version="1.0" encoding="ASCII" ?>
<!-- data from public health agencies on influenza outbreak in 2002 -->
<Influenza_2002>
</Influenza_2002>

Step 4. The contents of the document are now going to be inserted between the start and end tags of the root tag <Influenza_2002>.  As suggested in the style rules above we should have a record delimiter tag.  We are going to choose a tag named <Report> for that purpose.  Now we can think of what we want to say about each influenza report instance.  For example we may want to track the city where it occurred, how many patients were hospitalized, the average number of hospitalization days, and the number of deaths attributed to the outbreak.

<?xml version="1.0" encoding="ASCII" ?>
<!-- data from public health agencies on influenza outbreak in 2002 -->
<Influenza_2002>
   <Report>
      <cityName>Atlanta</cityName>
      <patientQuantity>2,138</patientQuantity>
      <hospitalStay unit="days">8</hospitalStay>
      <casualtyQuantity>14</casualtyQuantity>
   </Report>
   <Report>
      <cityName>Los Angeles</cityName>
      <patientQuantity>15,764</patientQuantity>
      <hospitalStay unit="days">5</hospitalStay>
      <casualtyQuantity>77</casualtyQuantity>
   </Report>
   <Report>
      <cityName>New York</cityName>
      <patientQuantity>22,349</patientQuantity>
      <hospitalStay unit="days">7</hospitalStay>
      <casualtyQuantity>146</casualtyQuantity>
   </Report>
   <Report>
      <cityName>Chicago</cityName>
      <patientQuantity>12,808</patientQuantity>
      <hospitalStay unit="days">6</hospitalStay>
      <casualtyQuantity>94</casualtyQuantity>
   </Report>
</Influenza_2002>

Listing 1.2.  Application of Style Rules

Step 5. If the XML instance document is to be processed by a special application then the pertinent processing instructions can be added after the document section, i.e., after the closing tag of the root element. In Listing 1.2 above there are no processing instructions so the XML instance document is ready to be viewed using an XML-enabled browser such as Microsoft IE 5.0 or higher. Figure 1.4 below shows the results.

Figure 1.4.  Rendering of XML Document from Listing 1.2 Using IE 5.0

It should be noted that the above are recommended style rules; for example, it is perfectly valid to insert processing instructions other than at the very end of the document section.  In fact, processing instructions can be inserted anywhere a comment can be inserted—see Listing 1.1.  The only restriction is that a processing instruction should not be inside a tag itself.  In other words, a processing instruction cannot be treated as an attribute of a tag.
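As an aside, a document with the shape of Listing 1.2 could also be produced programmatically, which applies several of the style rules (root first, one record delimiter per record, consistent indentation) automatically. The sketch below uses Python's standard xml.etree.ElementTree module; the tuple layout of the input data is an assumption, and only two of the four reports are included.

```python
import xml.etree.ElementTree as ET

# Two of the four reports from Listing 1.2; the tuple layout is an assumption.
reports = [
    ("Atlanta", "2,138", "8", "14"),
    ("Los Angeles", "15,764", "5", "77"),
]

root = ET.Element("Influenza_2002")            # create the root first...
for city, patients, stay, deaths in reports:   # ...then one <Report> per record
    report = ET.SubElement(root, "Report")
    ET.SubElement(report, "cityName").text = city
    ET.SubElement(report, "patientQuantity").text = patients
    ET.SubElement(report, "hospitalStay", unit="days").text = stay
    ET.SubElement(report, "casualtyQuantity").text = deaths

ET.indent(root, space="   ")                   # one indent step per level (Python 3.9+)
print(ET.tostring(root, encoding="unicode"))
```

Because the library writes every start tag together with its end tag, the resulting document cannot end up with a forgotten closing tag.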

A Simple Data Taxonomy

Before we proceed with further basic XML concepts and techniques it is probably worthwhile to spend a moment considering some of the general characteristics of data, since they impact to a certain degree the choices of technologies and applications. If we look at the data we deal with on a daily basis we can see that some of it is mostly in the form of lists, whereas other data is mostly in the form of narratives.  Lists of hospital names, patient names, medical treatments, diagnoses, medical product suppliers, medical product classes, etc., are what we call structured data.  The reason for using this label is that when we examine the attributes of this type of data we see that all the instances of a particular set seem to share the same characteristics.  For example, if we look at a list of patient names (assuming the list does not contain extraneous entries) it is fair to say that every entry conveys a well-defined meaning, i.e., every name in the list is that of an individual who has received treatment from the health care provider.  We express this fact by stating that the semantics of the members of the list is common to all its members.  In addition, the way in which the data is encoded is probably also fixed.  For example, the patient names are all character data that cannot exceed a certain number of characters, e.g., 255.  We express this fact by stating that the syntax of the members of the list is common to all its members.

Lists are examples of data-centric documents.  They are characterized by regular structure, as well as atomicity, i.e., they are semantically at the lowest level of required decomposition.  In addition, the order of the elements that constitute a data-centric document is not essential.  For example in Listing 1.2 above we could have chosen to put the <casualtyQuantity> element first and this would not have changed the overall meaning of the information conveyed by the records.
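The claim that element order is not essential in a data-centric record can be illustrated with a small sketch. It assumes Python's standard xml.etree.ElementTree module and a hypothetical helper, record_as_dict, which only works when each child tag occurs once per record.

```python
import xml.etree.ElementTree as ET


def record_as_dict(xml_text):
    """Map each child tag to (text, attributes); assumes child tags are unique."""
    record = ET.fromstring(xml_text)
    return {child.tag: (child.text, dict(child.attrib)) for child in record}


original = ('<Report><cityName>Atlanta</cityName>'
            '<casualtyQuantity>14</casualtyQuantity></Report>')
reordered = ('<Report><casualtyQuantity>14</casualtyQuantity>'
             '<cityName>Atlanta</cityName></Report>')

# The two records carry the same information despite the different order.
print(record_as_dict(original) == record_as_dict(reordered))
```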

Spreadsheets, and, even more so, databases are applications that have been optimized to handle structured data (i.e., lists with well-defined semantics and syntax).

At the other extreme there is narrative data or document-centric data.  The main feature of this type of data is that it does not conform to the definition of a list.  It is characterized by less regular or completely irregular structure.  At the semantic level the content of the elements is not atomic.  And, lastly, the order in which the elements occur is almost always significant.  For example this entire chapter can be considered as narrative data.  It clearly does not look anything like a list, and it matters whether this paragraph appears at the beginning or in its current position.  Although spreadsheets and databases can also be used to store this type of data, the functions that work so well with lists, such as searches and comparisons, won't work with the same efficiency, and, in some cases, depending on the application, they may not even be available when the data stored is a narrative as opposed to a list.

It should be noted that multimedia constitutes a new and emerging category of data which poses its own unique type of challenges. For example, digitized X-ray images, MRI and CAT scans, as well as voice and video streams may require special applications that combine relational and object-oriented capabilities to perform comparisons and searches based on the actual content of the binary objects.

In addition to the division of data into structured and unstructured types, data also may have a temporal aspect dictated by its refresh rate.  For example, when we look at lists we notice that some are fairly stable over time, e.g., the list of all the US hospitals can be considered static, if none are going bankrupt or going through mergers.  Other data is highly dynamic, for example the list of patients treated by a health care provider changes from visit to visit.  In between these two extremes we have semi-static data.  A good example of this type of data is the employee telephone directory of a hospital, which may be updated every six months or every year.  This type of data has a fixed periodicity which is shorter than that of static data, but much longer than that of dynamic data.

 

Applicability of XML

 

The implications of the data taxonomy discussed above when dealing with data integration are as follows. Historically, there have been two distinct communities applying XML: the community dealing with data-centric documents and the community producing narrative documents.  The former has primarily looked at XML as a data transport mechanism, and its documents are meant to support automated messaging among applications, e.g., database-to-database exchanges.  The fact that XML has been used in that way is one of convenience rather than necessity. For data integration, however, XML constitutes the essential component for facilitating the merging of diverse data sources.  It is also clear that data integration works best when the data has been distilled in the form of data-centric documents.

The community producing narrative documents has used XML as a means to define information content models.  For example, an entire book can be described in terms of elements such as <Chapter>, <Section>, <Paragraph>.  The use of this information content model permits the application of specific processing instructions to the content of each kind of element.  For example if we want to display the document on a browser we may choose one font for the content of <Chapter> elements, and another for the content of <Section> elements.  We may automatically create a table of contents for the document by selecting only the contents of the <Chapter> and <Section> elements but not that of the <Paragraph> elements. The lessons learned and techniques developed by this community are of secondary importance when doing data integration, and will not be covered in any detail in this book.
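The table-of-contents idea can be sketched as follows. The element names Chapter, Section and Paragraph come from the text, but the Book root element, the title attribute, and the sample wording are assumptions made for this illustration, as is the use of Python's standard xml.etree.ElementTree module.

```python
import xml.etree.ElementTree as ET

# Hypothetical book markup; the <Book> root and the 'title' attribute are
# assumptions made for this sketch and do not come from the text.
book = """<Book>
  <Chapter title="Essential XML Concepts">
    <Section title="Why Data Integration?">
      <Paragraph>We define a stove-pipe data store as a repository.</Paragraph>
    </Section>
    <Section title="XML Syntax Rules">
      <Paragraph>The rules are fortunately very simple.</Paragraph>
    </Section>
  </Chapter>
</Book>"""

toc = []
for chapter in ET.fromstring(book).iter("Chapter"):
    toc.append(chapter.get("title"))
    for section in chapter.iter("Section"):
        toc.append("  " + section.get("title"))   # indent sections under their chapter

print("\n".join(toc))
```

The Paragraph elements are simply never selected, which is what keeps the body text out of the table of contents.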

Inserting XML Tags Manually

Having stipulated that the main focus of XML for data integration is on structured data, we are going to show via a simple example how one could XMLize it so that all the other XML technologies can be brought to bear.  Because many data sources exist in the form of spreadsheets, or can be ported readily to that form, we will work out the techniques for how to transform raw source data into well-formed XML documents using such a source.  The following steps show the process.

Step 1.  In the spreadsheet to be transformed, analyze the columns to decide the names of the tags you want to introduce.  In the example, shown in Table 01 below, the columns already have headers which can serve as the basis for the XML tag names.  Because XML names cannot contain spaces, the header 'Medical Test' is shortened to MedTest.  We will, therefore, create three XML tags, namely, <Name>, <MedTest> and <Date>.

Table 01.  Initial Sample Spreadsheet

Name               Medical Test        Date
John Robertson     Blood Test          2-Jan-02
Maria Estrada      Cholesterol Test    27-Jul-02
Sandy Hellerman    X-Ray               13-Mar-02
Julia Moriarty     Chest MRI           23-Dec-01
Pedro Martinez     EKG                 11-Feb-02
Sayeed Rashidi     Cholesterol Test    11-Nov-02


Step 2. Insert columns to the right and to the left of each column and paste the start and end XML tags chosen in Step 1 above. The resulting spreadsheet should look as shown in Table 02 below.

Table 02.  Sample Spreadsheet with XML Tags Inserted

          Name                                       Medical Test                                Date
<Name>    John Robertson     </Name>    <MedTest>    Blood Test          </MedTest>    <Date>    02-Jan-03    </Date>
<Name>    Maria Estrada      </Name>    <MedTest>    Cholesterol Test    </MedTest>    <Date>    27-Jul-02    </Date>
<Name>    Sandy Hellerman    </Name>    <MedTest>    X-Ray               </MedTest>    <Date>    13-Mar-02    </Date>
<Name>    Julia Moriarty     </Name>    <MedTest>    Chest MRI           </MedTest>    <Date>    23-Dec-01    </Date>
<Name>    Pedro Martinez     </Name>    <MedTest>    EKG                 </MedTest>    <Date>    11-Feb-03    </Date>
<Name>    Sayeed Rashidi     </Name>    <MedTest>    Cholesterol Test    </MedTest>    <Date>    11-Nov-02    </Date>

 

Step 3.  Insert a record delimiter tag for each record (i.e., for each row).  In our example we may use a record delimiter tag such as <PatRec>.

                      Name
<PatRec>    <Name>    John Robertson     </Name>    ...    </PatRec>
<PatRec>    <Name>    Maria Estrada      </Name>    ...    </PatRec>
<PatRec>    <Name>    Sandy Hellerman    </Name>    ...    </PatRec>
<PatRec>    <Name>    Julia Moriarty     </Name>    ...    </PatRec>
<PatRec>    <Name>    Pedro Martinez     </Name>    ...    </PatRec>
<PatRec>    <Name>    Sayeed Rashidi     </Name>    ...    </PatRec>

Step 4.  Cut and paste the spreadsheet, without the column headers, into your preferred word processing application.  Remove the tabs and insert a paragraph return in the middle of each "><" sequence.  For example, if you are using MSWord, paste the spreadsheet using Paste Special.../Unformatted Text.  In Edit/Replace... enter "^t" in the Find what field, leave the Replace with field blank, and then press Replace All.  Next, while in Edit/Replace..., enter "><" in the Find what field, ">^p<" in the Replace with field, and then press Replace All.  The text should now look as shown in Listing 1.3 below (where the dates have additionally been written in YYYY-MM-DD form).

<PatRec>
<Name>John Robertson</Name>
<MedTest>Blood Test</MedTest>
<Date>2002-01-03</Date>
</PatRec>
<PatRec>
<Name>Maria Estrada</Name>
<MedTest>Cholesterol Test</MedTest>
<Date>2002-07-27</Date>
</PatRec>
<PatRec>
<Name>Sandy Hellerman</Name>
<MedTest>X-Ray</MedTest>
<Date>2002-04-13</Date>
</PatRec>
<PatRec>
<Name>Julia Moriarty</Name>
<MedTest>Chest MRI</MedTest>
<Date>2002-12-23</Date>
</PatRec>
<PatRec>
<Name>Pedro Martinez</Name>
<MedTest>EKG</MedTest>
<Date>2003-02-11</Date>
</PatRec>
<PatRec>
<Name>Sayeed Rashidi</Name>
<MedTest>Cholesterol Test</MedTest>
<Date>2002-11-11</Date>
</PatRec>

Listing 1.3.  Tagged Text for the Sample Spreadsheet
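The manual tagging of Steps 1 through 4 can also be approximated with a short script. The sketch below assumes the spreadsheet has been exported as tab-delimited text without the header row (an assumed export format), uses the tag names of the listing above, and relies only on Python's standard library. Two rows are shown for brevity.

```python
import csv
import io
from xml.sax.saxutils import escape

# Two rows of the sample spreadsheet as tab-delimited text (an assumed
# export format); the tag names follow the listing above.
exported = ("John Robertson\tBlood Test\t2002-01-03\n"
            "Maria Estrada\tCholesterol Test\t2002-07-27\n")
tags = ["Name", "MedTest", "Date"]

lines = []
for row in csv.reader(io.StringIO(exported), delimiter="\t"):
    lines.append("<PatRec>")                       # record delimiter, start
    for tag, value in zip(tags, row):
        # escape() replaces &, < and > so the cell content stays well-formed
        lines.append(f"<{tag}>{escape(value)}</{tag}>")
    lines.append("</PatRec>")                      # record delimiter, end

tagged_text = "\n".join(lines)
print(tagged_text)
```

Escaping the cell values is a precaution the manual procedure does not take: a cell containing "&" or "<" would otherwise break well-formedness.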

Step 5.  Insert the XML declaration, as well as the root element tags.  In the example one could use <MedRecords> for the root element.  The resulting document should now look as shown in Listing 1.4 below.

<?xml version="1.0" encoding="ASCII" ?>
<MedRecords>
<PatRec>
<Name>John Robertson</Name>
<MedTest>Blood Test</MedTest>
<Date>2002-01-03</Date>
</PatRec>
<PatRec>
<Name>Maria Estrada</Name>
<MedTest>Cholesterol Test</MedTest>
<Date>2002-07-27</Date>
</PatRec>
<PatRec>
<Name>Sandy Hellerman</Name>
<MedTest>X-Ray</MedTest>
<Date>2002-04-13</Date>
</PatRec>
<PatRec>
<Name>Julia Moriarty</Name>
<MedTest>Chest MRI</MedTest>
<Date>2002-12-23</Date>
</PatRec>
<PatRec>
<Name>Pedro Martinez</Name>
<MedTest>EKG</MedTest>
<Date>2003-02-11</Date>
</PatRec>
<PatRec>
<Name>Sayeed Rashidi</Name>
<MedTest>Cholesterol Test</MedTest>
<Date>2002-11-11</Date>
</PatRec>
</MedRecords>

Listing 1.4.  Finalized XML document for "MedRecords.xml"

Step 6.  Save the document in text-only format with the appropriate name and extension.  For example, you can save the document as "MedRecords.xml".  Figure 1.5 below shows how the instance XML document would appear when viewed with IE 5.0.

Figure 1.5.  The MedRecords.xml document viewed with IE 5.0
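Before opening the file in a browser, the finished document can be checked with a parser. The following sketch wraps tagged records in the XML declaration and the root element, as in Step 5, and parses the result with Python's standard xml.etree.ElementTree module. The helper name wrap_document is hypothetical, and only one record is used for brevity.

```python
import xml.etree.ElementTree as ET

# One record only, for brevity; the helper name is hypothetical.
records = ("<PatRec>\n"
           "<Name>John Robertson</Name>\n"
           "<MedTest>Blood Test</MedTest>\n"
           "<Date>2002-01-03</Date>\n"
           "</PatRec>")


def wrap_document(body, root_tag="MedRecords"):
    """Add the XML declaration and the root element around tagged records."""
    return ('<?xml version="1.0" encoding="ASCII" ?>\n'
            f"<{root_tag}>\n{body}\n</{root_tag}>\n")


document = wrap_document(records)
root = ET.fromstring(document)          # raises ParseError if not well-formed
print(root.tag, len(root.findall("PatRec")))

# To save the document as in Step 6, uncomment the following lines:
# with open("MedRecords.xml", "w", encoding="ascii") as handle:
#     handle.write(document)
```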

Summary

 

In this chapter we have learned the rationale for including 'markup' in our document-centric and data-centric documents, namely, that it facilitates their automated processing.  We have also learned that XML is a very powerful markup syntax whose strength lies in the fact that it is extensible.  The syntax of XML is very simple and in this chapter we have seen what the naming conventions are as well as the requirements for well-formedness.

 

We also have learned that XML instance documents are made up of elements (with or without attributes), as well as comments and processing instructions.  A special kind of 'processing instruction' is the XML declaration which is normally the first entry in every complete XML document.  When viewed structurally XML instance documents can be considered as 'tree structures', thus they cannot have more than one root and they must not contain overlapping nesting structures.

 

We have seen how data contained in a spreadsheet (or for that matter in a word document as comma or tab delimited text) can be readily transformed into an XML document that can be viewed on an XML-enabled browser such as IE 5 or higher.

 

Presentations

 

The following resources are available:

  1. PowerPoint slides for the lecture on Essential XML concepts
  2. Narrated slides for essential XML concepts (SWF file)
  3. Example of an XML document
  4. Example of converting Excel files into XML
  5. Example of merging text files using XML

Narrated lectures and video examples require use of Flash.

What Do You Know?

  1. Which of the following XML tags do not conform to the naming rules:

      <_9 cats>
      <_myId>
      <Personnel_Record.Alias>
      <xml_records>
      <1_chairman>
      <G_number>
      <mbel>
      <paymonth/payday/dept>
  2. Which of the following statements are true (Hint: build a simple XML document and test each statement using Internet Explorer or some other XML-enabled browser; review Listing 1.2 above if you don't recall how to build an XML document):

The XML declaration must always specify the encoding

The XML declaration must always specify the version

The XML declaration can appear anywhere in the document

The XML declaration cannot be preceded by any comments

The XML declaration is an empty tag and therefore must end with "/>"

The root element can never be called <root_element>

The content of an XML document must always have record delimiters

XML tag names longer than 32 characters are illegal

All long comments must be broken into multiple lines

The xml-stylesheet processing instruction must precede the root tag

Your Contact Information

Please provide us with your contact information so that we can contact you and discuss your personal improvement with you.

Your first name:

Your last name:

Enter your email, for example yourname@server.com:

Submit your work by email to your instructor. 

 

Assignment

  1. Following the steps described in this section, convert the list below into a well-formed, complete XML document in which the names are broken into <firstName>, <middleInitial> and <lastName>.  Use the root element <PersonNames> and the record delimiter <PersonName>.  Save the document as "PersonNames.xml" and render it with your preferred browser.

Frederick M. Wheelock

Martin A. Gardner

Denise G. Robertson

Claudia P. Gonzales

Martina H. Ramirez

Leilah A. Rashidi

Igmar S. Bergstrom

Mishiko Takahashi
 

See a video on how to answer this question.

  2. Enclosed is a sample of data from people on probation tested for substance abuse.  Transform this spreadsheet into a well-formed XML document using the procedures described in this section.  You may create whatever XML tags you wish.  See a video on how to answer this question.

  3. Create a common XML version for these two curriculum vitae:

    See a video on how to answer this question.

Please bring your work to class and be prepared to present it.

 

More
 

In this section you will find links to other resources. 

 


This page is part of the course on Data Integration, the lecture on Essential XML Concepts.  It was last edited on 05/12/2003.  For more information contact us.  © Copyright protected.