Converting Legacy Word Documents to XML

The first step is to open each of the legacy documents in Word 2003 and save it as XML. This will allow me to use XSLT for the actual conversion. No styles were used to create these lists; the only distinguishing feature is that the member name is in boldface type. They look like this:
Notice the last line: nothing associates it with the previous line other than a slight indent (two spaces on average in the source files). Rather than trying to create an XSLT that produces a perfect XML instance on the first pass, I'm going to build this in steps, refining the markup each time until I have an instance that is valid against the schema. Here's what I want to accomplish on the first pass:
This will give me the hooks I need for the next pass. Gotchas:
My result documents are now much easier to read. Here's the next set of tasks to be accomplished in the second XSLT:
When you create your next stylesheet, you'll no longer need all of the namespace declarations required for WordML. You do need to make sure the existing markup continues to pass through and doesn't get lost. Since all four tasks involve the <r> element, we'll need a big choice group. By creating various tests, we should be able to apply the appropriate markup.

The most interesting part so far is testing whether the <r> element contains a <b> child element and then varying the output depending on whether the test is true or false. This is handled in XSLT by the choose element:

<xsl:template match="r">

The above code first matches the <r> element. We created these from each of the runs within the WordML markup. The xsl:choose element can contain one or more when elements. The otherwise element is the fall-through; that is, if none of the previous when tests match, the otherwise rule is processed. We have three tests and a fall-through defined above.

The first test looks to see if the <r> element has a <b> child element. In WordML, <b> represents bold, which in our document marks the member name. Therefore, if we find the child, we output the start and end <Name> tags along with the content.

The second test searches for the string "(Alt" within the text node. If it finds the string, it outputs an open <VotingMemberName> tag, processes the text node, and then outputs the close tag.

The third test is also a string test, but this time we need to make sure we output all of the content preceding the open square bracket as the <Organization> element. We then output the open <Classification> tag, followed by everything between the open and close brackets, and then the close tag.

Our last rule processes everything else within the <AdditionalInfo> element. The next step is to put the proper wrappers on the elements:
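Before adding those wrappers, here is a minimal sketch of what the second-pass stylesheet described above might look like. It is a reconstruction from the prose, not the original listing. It assumes the first pass left plain, un-namespaced <r> elements whose text (possibly nested inside a <b> child) can be read through the element's string value. The identity template and the use of normalize-space() are my own assumptions, added to keep other markup passing through and to trim the stray indentation.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="xml" indent="yes"/>

  <!-- Assumed identity template: copies anything not matched below,
       so markup from the first pass is not lost. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="r">
    <xsl:choose>
      <!-- Test 1: a <b> child means bold, i.e. the member name. -->
      <xsl:when test="b">
        <Name>
          <xsl:value-of select="normalize-space(.)"/>
        </Name>
      </xsl:when>

      <!-- Test 2: the run contains the string "(Alt". -->
      <xsl:when test="contains(., '(Alt')">
        <VotingMemberName>
          <xsl:value-of select="normalize-space(.)"/>
        </VotingMemberName>
      </xsl:when>

      <!-- Test 3: the run contains an open square bracket.
           Text before "[" becomes the organization; text between
           "[" and "]" becomes the classification. -->
      <xsl:when test="contains(., '[')">
        <Organization>
          <xsl:value-of select="normalize-space(substring-before(., '['))"/>
        </Organization>
        <Classification>
          <xsl:value-of select="substring-before(substring-after(., '['), ']')"/>
        </Classification>
      </xsl:when>

      <!-- Fall-through: everything else. -->
      <xsl:otherwise>
        <AdditionalInfo>
          <xsl:value-of select="normalize-space(.)"/>
        </AdditionalInfo>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>

</xsl:stylesheet>
```

Because xsl:choose stops at the first matching when, the order of the tests matters: a bold run that also happens to contain a bracket would be output as <Name>, not split into <Organization> and <Classification>.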
© 2004 Mary P McRae. All Rights Reserved.