WordML for Stylesheet Designers

Understanding the Microsoft Word Schema

What it is

WordML is nothing more than Word in XML. That isn't meant to downplay its significance; it is merely meant to point out that Word is still Word. Unlike RTF (rich text format), which has long been a struggle to understand, the WordML schema is published, documented, licensed, and freely available.

The easiest way to learn about the structure of a Word 2003 document is to save it as XML and then view it in a text editor. Be forewarned, however: there is a significant amount of overhead associated with each file. If you have ever examined the RTF version of a Word document, or saved a Word document as HTML, you'll be familiar with the enormous size of a file that contains only a line or two of text. The overhead is what makes it possible for another user to open your Word document and see it exactly as it appears on your screen; while unwieldy, it has its purpose.

What it isn't

WordML is nothing more than Word in XML. It bears repeating. Word has not been morphed into a structured editor (although there is other cool technology that enables that), nor has the way in which Word creates and maintains information really changed. Only the vocabulary used to identify the infobits is new. There is no hierarchical structure; no nesting. It's still a flat structure.
Namespaces

Microsoft Office Word 2003 takes advantage of several namespaces. For those of us who spend most of our time working with documents in editors like Epic Editor, XMetaL, or FrameMaker+XML, namespaces can be a bit overwhelming at first. Namespaces allow data from disparate sources to be mixed in a single instance. Each element is identified by its prefix and follows the rules of its own namespace.

Thanks to namespaces, Word is able to handle customer-specific XML. All of the information relevant to Word is maintained in the various Microsoft namespaces; your specific schema maintains its own namespace (or is given one by default). Since a Word document includes a number of items that are common to any Office document, there's a special namespace for those elements. The vast majority of elements within a Word document fall within the w namespace.
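To see how the namespaces divide up a document, it can help to count elements per namespace. The following is a minimal sketch in Python, not part of any Word tooling. The two Microsoft namespace URIs are written from memory and should be checked against a file saved from Word 2003; the inv prefix, its URI, and the trimmed sample document are purely hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Namespace URIs as used by Word 2003 XML files (stated here as assumptions).
W = "http://schemas.microsoft.com/office/word/2003/wordml"
O = "urn:schemas-microsoft-com:office:office"

# Hypothetical, heavily trimmed document; real files carry far more overhead.
# The inv namespace stands in for a customer-specific schema.
SAMPLE = f"""<w:wordDocument xmlns:w="{W}" xmlns:o="{O}" xmlns:inv="urn:example:invoice">
  <o:DocumentProperties><o:Title>Demo</o:Title></o:DocumentProperties>
  <w:body>
    <inv:invoice>
      <w:p><w:r><w:t>Total due</w:t></w:r></w:p>
    </inv:invoice>
  </w:body>
</w:wordDocument>"""

def count_by_namespace(xml_text):
    """Count elements per namespace URI."""
    counts = Counter()
    for elem in ET.fromstring(xml_text).iter():
        # ElementTree stores qualified names as "{uri}local".
        uri = elem.tag[1:].split("}")[0] if elem.tag.startswith("{") else ""
        counts[uri] += 1
    return counts

counts = count_by_namespace(SAMPLE)
for uri, n in counts.most_common():
    print(n, uri)
```

Run against a real file, a tally like this should show the w namespace dominating, which matches the observation above.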
Document Properties

The <o:DocumentProperties> element and its children belong to the Office (o) namespace and include information about the document itself, such as the title, author name, creation date, date last edited, and number of pages. This is the XML representation of the data that can be seen by viewing the Properties pop-up window. A second element, <o:CustomDocumentProperties>, contains the information found on the Custom tab within the Properties window. Note that specific element names are undefined; CustomDocumentProperties is defined with a content model of "any." This allows the application to assign element names based on the Custom Property name and use the value as the element content.

It's not necessary to write these elements when creating a WordML instance; Word will automatically populate them when the new file is first opened in Word. Similarly, in most instances this information will be discarded when transforming a Word XML instance to another format.

Fonts

The <w:fonts> element has two children: <w:defaultFonts> and <w:font>. The default font is basically the one that would come up if you were creating a new document based on the Normal template, i.e., Times New Roman. Each <w:font> element contains details about a particular font used within the actual document instance. Again, it is not necessary to write these elements when creating a WordML instance; Word will automatically populate them when the new file is first opened in Word.

Lists

The <w:lists> element also has two children: <w:listDef> and <w:list>. These two are interrelated. Each listDef has an ID attribute associated with it that links it to the appropriate list element. The list definition element is described as referring to "base list definitions." List definitions are not used directly; instead, each is referred to by an individual list element, which in turn is referred to by a paragraph property. Sounds a bit confusing? That's because it is.
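The chain from paragraph to list to list definition may be easier to follow in code. This is a minimal sketch under stated assumptions: the names ilfo, ilst, and listDefId follow the text, but their exact placement (ilfo as a child of <w:listPr>, ilst as a child of <w:list>) is an assumption to verify against the schema, and the sample fragment is hypothetical and far smaller than anything Word writes.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.microsoft.com/office/word/2003/wordml"  # assumed w namespace URI
NS = {"w": W}

def q(name):
    """Build the ElementTree key for a w-prefixed attribute or element."""
    return f"{{{W}}}{name}"

# Hypothetical fragment; the exact placement of ilfo and ilst is an assumption.
SAMPLE = f"""<w:wordDocument xmlns:w="{W}">
  <w:lists>
    <w:listDef w:listDefId="0"/>
    <w:listDef w:listDefId="7"/>
    <w:list w:ilfo="1"><w:ilst w:val="7"/></w:list>
  </w:lists>
  <w:body>
    <w:p>
      <w:pPr><w:listPr><w:ilvl w:val="0"/><w:ilfo w:val="1"/></w:listPr></w:pPr>
      <w:r><w:t>First item</w:t></w:r>
    </w:p>
  </w:body>
</w:wordDocument>"""

def list_def_for(paragraph, root):
    """Follow paragraph -> list -> listDef; return the listDef element or None."""
    ilfo = paragraph.find("w:pPr/w:listPr/w:ilfo", NS)
    if ilfo is None:
        return None  # not a list paragraph
    for lst in root.findall("w:lists/w:list", NS):
        if lst.get(q("ilfo")) == ilfo.get(q("val")):
            ilst = lst.find("w:ilst", NS)
            if ilst is None:
                return None
            for list_def in root.findall("w:lists/w:listDef", NS):
                if list_def.get(q("listDefId")) == ilst.get(q("val")):
                    return list_def
    return None

root = ET.fromstring(SAMPLE)
para = root.find("w:body/w:p", NS)
list_def = list_def_for(para, root)
print(list_def.get(q("listDefId")))
```

The two lookups mirror the two levels of indirection: the paragraph never names a list definition directly.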
Basically, a paragraph carries an "ilfo" value that points to the ilfo attribute on an individual list element. The individual list element carries another value, "ilst", which points to the listDefId attribute of the listDef element. If, instead, the list is created as part of a paragraph style, the style name is referenced in the listDef element hierarchy.

Styles

The <w:styles> element consists of <w:style> children. Each <w:style> element contains all of the details about a style of one of four specific types: paragraph, character, list, or table. While each set of child elements is particular to the type of style, there are values that represent each of the options available on the style panes. As above, it's not necessary to write all of the details about a particular style when creating a new WordML document instance. It is, however, critical that you have at least the <w:style> element and the style name for each style referenced in the body of the document. Word will automatically pick up the rest of the information from the referenced template. Styles will be explained in more detail in the next section.

Body

The <w:body> element contains what is typically thought of as the document content. Everything that appears on a printed page is contained within the start and end body tags, including headers, footers, footnotes, images, and text boxes. It can get pretty wild in here, with binary data, proofing errors, grammar errors, and change tracking interrupting the text runs. These are the areas most likely to cause problems when trying to convert a WordML document instance into something else.

Styles in Microsoft Word are inherited; that is, any of the style characteristics associated with the "based on" style name are automatically associated with the style unless overridden.
If your application needs to replicate styles, you'll need to be able to navigate through the style hierarchy to ensure that you've captured each of the relevant characteristics.

The next element (<w:next>) identifies what happens when the Enter key is pressed. By default, the new paragraph takes on the current style; by indicating a different style name here, however, the designer can have more control over the document's look and feel. For instance, when defining a heading style, the "next" style might be set to Normal.

The paragraph and run properties are the same elements that are used within the body of the document; any local settings (that is, settings within the body element) override those in the style definition. There are more than two dozen child elements of the paragraph property element (<w:pPr>); refer to the actual Word 2003 schema documentation for the complete list.
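Walking the style hierarchy, as described above, might look something like the following sketch. It assumes that <w:basedOn> refers to the parent by its w:styleId, that a property element without a w:val (such as <w:b/>) means "on", and that only run properties are of interest; the two-style sample is hypothetical.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.microsoft.com/office/word/2003/wordml"  # assumed w namespace URI
NS = {"w": W}
VAL = f"{{{W}}}val"
STYLE_ID = f"{{{W}}}styleId"

# Hypothetical styles block with only the children needed for this sketch.
SAMPLE = f"""<w:styles xmlns:w="{W}">
  <w:style w:type="paragraph" w:styleId="Normal">
    <w:name w:val="Normal"/>
    <w:rPr><w:sz w:val="24"/></w:rPr>
  </w:style>
  <w:style w:type="paragraph" w:styleId="Heading1">
    <w:name w:val="heading 1"/>
    <w:basedOn w:val="Normal"/>
    <w:next w:val="Normal"/>
    <w:rPr><w:b/><w:sz w:val="32"/></w:rPr>
  </w:style>
</w:styles>"""

def effective_run_properties(style_id, styles):
    """Merge w:rPr children along the basedOn chain; derived styles win."""
    by_id = {s.get(STYLE_ID): s for s in styles.findall("w:style", NS)}
    chain, seen = [], set()
    while style_id in by_id and style_id not in seen:
        seen.add(style_id)  # guard against circular basedOn references
        style = by_id[style_id]
        chain.append(style)
        based_on = style.find("w:basedOn", NS)
        style_id = based_on.get(VAL) if based_on is not None else None
    props = {}
    for style in reversed(chain):  # apply base styles first
        for prop in style.findall("w:rPr/*", NS):
            local_name = prop.tag.split("}")[1]
            props[local_name] = prop.get(VAL, "on")
    return props

styles = ET.fromstring(SAMPLE)
print(effective_run_properties("Heading1", styles))
```

Applying the base style first and the derived style last gives the "unless overridden" behavior. Local settings in the body would be layered on top in the same way.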
The run properties are also the same as those used for character styles; see the character styles section below for an overview. Amazingly, there are more elements associated with the run property element (<w:rPr>) than with the paragraph property element above: over three dozen unique child elements are possible. Bold, italic, caps, small caps, strikethrough, underline, outline, shadow, emboss, color, size, and, of course, font are the ones you're most likely to encounter, unless you're working with Asian or right-to-left fonts.
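As a rough illustration of reading run properties, the sketch below reports which on/off properties are set locally on each run. The element names b, i, caps, smallCaps, and strike, and the "off" value that switches a property off, are assumptions based on the property names listed above; check them against the schema documentation before relying on them.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.microsoft.com/office/word/2003/wordml"  # assumed w namespace URI
NS = {"w": W}
VAL = f"{{{W}}}val"

# Toggle-style run properties this sketch looks for (a small, assumed subset).
TOGGLES = ("b", "i", "caps", "smallCaps", "strike")

# Hypothetical paragraph with three runs and different local settings.
SAMPLE = f"""<w:p xmlns:w="{W}">
  <w:r><w:t>Plain, </w:t></w:r>
  <w:r><w:rPr><w:b/><w:i w:val="off"/></w:rPr><w:t>bold, </w:t></w:r>
  <w:r><w:rPr><w:i/><w:sz w:val="28"/></w:rPr><w:t>italic</w:t></w:r>
</w:p>"""

def run_toggles(run):
    """Return the set of toggle properties switched on locally for one run."""
    active = set()
    for name in TOGGLES:
        prop = run.find(f"w:rPr/w:{name}", NS)
        # An empty element such as <w:b/> means "on"; w:val="off" disables it.
        if prop is not None and prop.get(VAL, "on") != "off":
            active.add(name)
    return active

paragraph = ET.fromstring(SAMPLE)
for run in paragraph.findall("w:r", NS):
    text = run.findtext("w:t", default="", namespaces=NS)
    print(repr(text), sorted(run_toggles(run)))
```

This looks only at settings on the run itself; properties inherited from a paragraph or character style would have to be resolved separately.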
Of course, if any of these run properties are set as part of a paragraph style, their effects will be seen on the entire paragraph rather than on a portion of it. For instance, if you have a paragraph style associated with a heading, it's likely that the entire paragraph will have certain characteristics, such as bold and a larger point size, associated with it.

Immediately following the start of a paragraph element will be a paragraph properties (<w:pPr>) element. It is similar to the paragraph properties element used in the style model, but in this case it most likely contains the child style (<w:pStyle>) element, which tells Word which paragraph style to use for this specific paragraph instance. If there is list formatting involved (without using an actual named list style), the style element will be followed by a list properties (<w:listPr>) element. While you may be expecting to see some actual text, that won't appear until we get to the run element, below.

Finally, we get to the last line of defense: the text (<w:t>) element. This is where the actual text you see in a document is stored. Remember that in many cases a single paragraph consists of multiple text runs. This is due to interruptions by proofing errors, comments, change tracking, tables, pictures, font changes, or any of the other elements allowable within a paragraph element.

If you are converting a WordML document to some other structured format, it is recommended that you first open the document in Word, turn off grammar and spell checking, and resolve any change tracking. Then save the document. This will eliminate much of the extraneous markup that's not really needed for the task at hand.
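Because a paragraph's text is scattered across runs, a conversion usually starts by stitching the <w:t> elements back together. Here is a minimal sketch of that step. The sample body is hypothetical, and the <w:proofErr> element with its w:type values is shown as an assumed example of the proofing markup that interrupts runs.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.microsoft.com/office/word/2003/wordml"  # assumed w namespace URI
NS = {"w": W}
VAL = f"{{{W}}}val"

# Hypothetical body fragment: one sentence split into runs by a proofing mark.
SAMPLE = f"""<w:body xmlns:w="{W}">
  <w:p>
    <w:pPr><w:pStyle w:val="Heading1"/></w:pPr>
    <w:r><w:t>Understanding </w:t></w:r>
    <w:proofErr w:type="spellStart"/>
    <w:r><w:t>WordML</w:t></w:r>
    <w:proofErr w:type="spellEnd"/>
    <w:r><w:rPr><w:i/></w:rPr><w:t> quickly</w:t></w:r>
  </w:p>
  <w:p><w:r><w:t>Body text.</w:t></w:r></w:p>
</w:body>"""

def paragraphs(body):
    """Yield (style name or None, joined text) for each paragraph in the body."""
    for p in body.iter(f"{{{W}}}p"):
        style = p.find("w:pPr/w:pStyle", NS)
        # Only w:t carries text; w:proofErr and w:rPr contribute nothing.
        text = "".join(t.text or "" for t in p.iter(f"{{{W}}}t"))
        yield (style.get(VAL) if style is not None else None, text)

result = list(paragraphs(ET.fromstring(SAMPLE)))
for style_name, text in result:
    print(style_name, "|", text)
```

Cleaning the document in Word first, as recommended above, reduces how many interruptions a routine like this has to skip over.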
© 2004 Mary P McRae All Rights Reserved