Have you ever had a debate with someone only to find that they are missing an important piece of knowledge? I humbly submit that "mixed content" in XML is one such piece of knowledge. I have been involved in debates about XML processing techniques that seemed to be going around in circles. More often than not, the disagreement stemmed from differing conceptual models of XML processing and, more often than not, that difference revolved around the important concept of mixed content in XML. If one party to a debate sees it in their mind-map of XML and the other does not, communication problems are likely to ensue.

What follows is a debug trace of a conversation bug attributable to mixed content myopia:

A: You know what XML is?
B: Sure. XML is a way of adding tags to text to denote the text's meaning. Tags come in pairs -- one at the start and one at the end.
A: Okay, but what about the text that is not tagged?
B: What do you mean, "not tagged"? If you add tags, you tag up all the text; there is no text outside of the tagging. Adding tags allows you to retrieve the values of those tags (i.e., the text between the start-tag and the end-tag) using APIs.
A: So you think of tag names as field names then?
B: Yes. Furthermore, XML APIs should allow me to read the value of any particular field.
A: Ah, okay. I see where you are coming from. Oh dear! Is that the time? Must dash!

The fact is, not all text is guaranteed to be tightly tagged in XML. Text occurring in so-called "mixed content models" can sit outside of any direct tagging. Further, such freestanding text can be freely intermixed with tags.

Such mixed content models offer both advantages and disadvantages. A big advantage is that they are a natural model for narrative text. For example, the p element in XHTML can contain mixed content as shown in the following fragment:
<p>This is a para with a <a>link</a> in it</p>
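To make the pieces of this fragment concrete, here is a minimal sketch of how one XML API exposes them. It uses Python's standard xml.etree.ElementTree module and assumes the fragment is marked up as a p element wrapping an a element around the word "link"; the href attribute is left out because the article does not give one. The variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

# Assumed markup for the fragment; no href is shown because the
# article does not supply one.
FRAGMENT = "<p>This is a para with a <a>link</a> in it</p>"

p = ET.fromstring(FRAGMENT)
a = p.find("a")

# ElementTree stores the character data of p in three separate places.
before = p.text   # text between the p start-tag and the a start-tag
inside = a.text   # text inside the a element
after = a.tail    # text after the a end-tag, still inside p

# One possible "value" of p: every piece of text in document order.
flattened = "".join(p.itertext())

print(repr(before))
print(repr(inside))
print(repr(after))
print(repr(flattened))
```

An application that treats p as a field and reads only p.text gets just the leading text and silently loses the rest, which is the kind of mismatch behind the conversation traced earlier.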
The value of the a element is the text "link", but what is the value of the p element? Is it the text before the a start-tag? The text before and after the a element? Does it include the word "link", which is in the a sub-element? We could remove this ambiguity by fully tagging the text like this:
The advantage of this style is that each element either contains a fragment of raw text or a collection of other elements, but never a *mixture* of both. On the downside, it is more complex and ugly to tag by hand.

Now you may be inclined to think that some hairs are being split here. Let us move over from doc-land to data-land and disprove this thought. In the following XML fragment, what is the value of the
This story, "Mixed content myopia" was originally published by ITworld.