**********
** TABLE OF CONTENTS

  - copyright and usage
  - verbosity controls
  - output controls
  - revision history
  - XML format

**********
** COPYRIGHT AND USAGE

XML parser v1.3
written by Guy Boo, August 2005

USAGE: perl <parser> [--usage] [--help] [-s|-w|-v|-d]
            [-f <flash file>] [-h <html file>] <xml file>

If the file is executable, you can also use this alternate form:

USAGE: ./<parser> [--usage] [--help] [-s|-w|-v|-d]
         [-f <flash file>] [-h <html file>] <xml file>

The <parser> field must be the complete filename of the parser,
including any suffix it may have. The order of the arguments does not
matter, except that a destination file name must follow the -f and -h
options, and that the last argument must be a file to parse.

Entering the argument '--help' or '--usage' causes the program to
display this dialogue and exit.

By default, the program will display nothing while it is running, and
this may cause the terminal to appear hung. Please allow up to 30
seconds for the program to finish. If you want more information about
the program's progress, consider setting it to a higher verbosity.

**********
** VERBOSITY CONTROLS

If more than one of these options is specified, the last one is used.
The default is -s. Verbosity settings are cumulative rather than
specific to a type of output: a higher verbosity setting also outputs
every message that appears at all lower verbosity settings.

  -s  silent mode. Only fatal errors appear on stdout during parsing.

  -w  warning mode. Important warnings and events will be output during
      parsing, such as when the program encounters an unknown node or
      when each stage of processing begins and completes.

  -v  verbose mode. A full trace of all events encountered during
      parsing will be output, along with a trace of the current path in
      the XML document tree and a diagram of the current XML tree in
      memory.

  -d  debug mode. Outputs debugging strings.

**********
** OUTPUT CONTROLS

This program creates two HTML files that link to each other, so it needs
to know the names of both in advance.
This version of the parser does not support an option to send resulting
HTML to STDOUT; it always creates two files. If the output file names
are not specified with the following two options, then they are
'index.html' for the pure HTML file and 'index-flash.html' for the
flash-playing file.

If files already exist in the current directory with the output file
names, they are renamed to a name containing .old#, where # is any
number such that no file already exists with that name.

Note that while the Perl parser doesn't care what you call the source
file, the flash parser will always look for a file called 'content.xml'
in the same directory as the .swf file. Note also that if you specify a
filename, it must be in the CURRENT DIRECTORY or some links in the
generated HTML and javascript will be broken.

  -f <flash file>  specifies the name of the file to create containing
                   the HTML that plays the flash version of the site.

  -h <html file>   specifies the name of the file to create containing
                   the pure HTML version of the site.

**********
** REVISION HISTORY (partial)

  1.0 - initial debugged version.
  1.1 - removed the ability to send output to stdout.
  1.2 - added javascript redirection to the output HTML files from a
        third file, added the -j option to specify the name of this
        third file.
      - filenames that include characters illegal in URL composition
        are no longer accepted.
  1.3 - moved the javascript from the .js file into the HTML of the
        other two files.
      - changed the default HTML filename to 'index.html' and the
        default flash filename to 'index-flash.html'.
      - improved progress reporting via the -w verbosity tag.

**********
** XML FORMAT

The brackets in this example, except those of the CDATA node, are used
to demarcate optional node attributes and should not appear in the
actual XML. XML is case sensitive; all node names and attributes must be
specified in lowercase. The interpreters are whitespace insensitive
unless presented with a text node (see below), so you are free to adopt
any formatting you choose.
Attribute values are whitespace sensitive, but cannot be broken across
lines. All URL attributes may be either relative or absolute.

    regular text

>>> PAGE <<<

There must be exactly one of this kind of node in the XML, forming the
root of the tree the Flash and Perl parsers must interpret. Accordingly,
it cannot be contained by any other node - it has to be the node on the
outside.

The optional title attribute specifies the title of the HTML pages
generated by the program. In more explicit terms, it is used to fill in
the <title> tag in the HTML headers. This attribute is also used to fill
in the text banner above the navbar on the Flash page.

The optional flash and html attributes specify the output filenames for
both the automatic linking behavior and the explicit link tags %html,
%flash, %html-noredir, and %flash-noredir. When you wish to add a link
from the flash file to the html file or vice-versa in the url attribute
of any node, you should use the explicit link tags instead of copying
the values of the flash and html attributes. That way you don't have to
find and update every explicit link every time you want to change a
filename.

The optional bgcolor, text, link, alink, hlink, and vlink attributes
allow you to specify, respectively, the colors of the background, plain
text, unused links, currently clicked link, currently hovered-over link,
and previously visited links. They should be specified in six digit
hexadecimal with no leading #, 0x, or any other form of demarcation.
Their default values are, respectively, 333333, ffffff, ffffff, ff2f2f,
ff2f2f, ffffff.

>>> META <<<

This node is used to include a meta tag in the resulting HTML (allowing
search engines to crawl it, for example). This node must be a child of
the page node - it will not be respected anywhere else - and cannot
contain children. The required name and content attributes are used to
fill in the corresponding fields in the HTML.
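
The color rules given for the page node (six hexadecimal digits, no
leading # or 0x, with the listed defaults) can be sketched in Python as
follows. This is only an illustration, not the parser's own code: the
names DEFAULT_COLORS and page_colors are hypothetical, and raising an
error on a malformed value is an assumption, since this document does
not say how the parser reacts to one.

```python
import re

# Defaults as listed in the PAGE section, in the documented order.
DEFAULT_COLORS = {
    "bgcolor": "333333",
    "text": "ffffff",
    "link": "ffffff",
    "alink": "ff2f2f",
    "hlink": "ff2f2f",
    "vlink": "ffffff",
}

# Exactly six hex digits, with no '#', '0x', or other prefix.
_HEX6 = re.compile(r"[0-9a-fA-F]{6}")


def page_colors(attrs):
    """Merge the color attributes of a page node with the defaults.

    `attrs` is a plain dict of attribute names to values. Rejecting a
    malformed value with ValueError is an assumption of this sketch.
    """
    colors = dict(DEFAULT_COLORS)
    for name in DEFAULT_COLORS:
        if name in attrs:
            value = attrs[name]
            if not _HEX6.fullmatch(value):
                raise ValueError(f"{name} must be six hex digits: {value!r}")
            colors[name] = value
    return colors
```

For example, page_colors({"bgcolor": "000000"}) keeps every default
except the background, while a value such as "#000000" is rejected
because of the leading #.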

>>> IMG <<<

This node is used to include an image. The required src attribute must
be the URL of the source image to include. If the optional url attribute
is specified, then the image will function as a hyperlink and, when
clicked, take the browser to the specified URL. Text entered via the
optional caption attribute will be included beneath the image when it is
displayed.

This node may contain children. In HTML, children of this node will be
included in an unordered list beneath it. The same is true in flash,
provided the image is not a child of the page node. (This special case
is somewhat complicated and applies to everything else, so it is
explained below in the explanation of panel nodes.)

Unfortunately, Flash can only display the .jpg image format.

>>> LINE <<<

This node is used to include a regular line of text. The required text
attribute specifies the actual text of the line. Like img nodes, if the
optional url attribute is included, the line will function as a
hyperlink and take a browser to the specified URL. If the line is a
hyperlink, it will demonstrate this to the user by fading in when the
mouse hovers over it.

The optional icon attribute can be used to attach an image to the right
of the line's text. The value of this attribute must be the URL of the
image source.

This node may contain children, which are handled identically to those
of img nodes.

>>> CDATA / text <<<

These two nodes are interpreted as containing pure HTML that is to be
directly cut-and-pasted into the resulting HTML and Flash pages.

A text node is entered simply by typing something that can't be
recognized as any other kind of node. (In overly simplified terms,
anything that doesn't have '<' or '>' in it.) A character data - or
"CDATA" - node is specified with the opening tag <![CDATA[ and the
closing tag ]]>.
The primary difference between them is that a text node is subject to
many more constraints than a CDATA node - you can't directly include the
&, <, >, ", or ' characters, for instance. A CDATA node, on the other
hand, is subject to the single constraint that it cannot include the ]]>
sequence of characters.

Another slightly irritating difference between the CDATA and text nodes
is that there is typically no way to specify that a text node should not
include all of the leading whitespace since the last > and all of the
trailing whitespace until the next <.

One major distinguishing feature of CDATA and text nodes is that there
is no way to directly give them attributes, so if you need to specify an
owner on a certain block of HTML included as a text node, you must make
it a child of another node such as a group node.

Including HTML in HTML is trivial, but including it in Flash is much
more complicated. Flash has some support for including pure HTML in a
.swf document via HTML text fields. When a block of HTML is encountered
by the Flash parser, it adds a text field and turns on its HTML flag,
thereby setting aside a certain region of screen space for that HTML
text, and then leaves the interpretation of the HTML up to Flash. In
general:

  - Flash claims to understand only a very limited number of HTML tags,
    and *actually* understands a lot fewer.
  - You cannot get around the .jpg limitation with an HTML node.
  - Text starts off with a Flash-determined default color and font,
    completely unrelated to the font and color of the rest of the text
    in the flash.
  - There is no direct way to watch for when the images of an HTML node
    are loaded, and, more importantly, there is no direct way to resize
    the space to accommodate a newly-loaded image. If you wish to use
    the <img> HTML tag in your text or CDATA node, be sure to also
    specify the height and width attributes or you'll end up with an
    extremely ugly page.
  - HTML text cannot be cleanly animated, but that doesn't mean the
    flash won't give it a good try. The end result is that whenever the
    HTML area is displayed with anything less than 100% opacity, or at
    any angle other than completely upright, or at any scale other than
    100%, it simply won't display at all. Watch out for flickering.

>>> PANEL <<<

Panel nodes are intended to represent the primary divisions of your
page, and accordingly demonstrate some special behaviors.

In HTML: panel nodes are handled identically to line nodes without url
or icon attributes, and may contain children and be children. Panel
nodes that are children of the page node, however, cause the Perl
interpreter to generate horizontal "table of contents" style quick
links. In more explicit terms, every run of consecutive panel nodes
generates a table of quick links that link to only those nodes. For
example, if I have:

    <page>
      <panel name="one" />
      <panel name="two" />
      <![CDATA[ ]]>
      <panel name="three" />
      <panel name="one" />
    </page>

Then the resulting HTML would have a table of quick links for the two
panels above the CDATA node and a distinct table of quick links for the
two panels below the CDATA node, even though the empty CDATA node
contributed absolutely nothing to the appearance of the page.

In Flash: direct children of the page node are somewhat privileged and
somewhat handicapped. They are privileged in the sense that they are
elevated above all other nodes and constantly on display in the main
horizontal navbar, but because of this they cannot have children in the
classic "just indent 'em and be done with it" sense. In fact, the flash
interpreter won't even glance at grandchildren of the page node via an
img or line. This is where panels come in. The textual representation of
a panel in flash has the special property that it causes the currently
displayed list of elements to change when rolled over by the mouse.
The list that displays when the name of a panel is rolled over is
generated from the descendants of that panel's node. Panels, then, are
ideally suited for use in the navbar, though flash won't mind a bit if
you include one somewhere else. (The visitors to your page might be
another story...)

Note that this exclusionary behavior is *entirely* different from the
HTML parser! The upshot is that anything that is a grandchild of page,
but not via a group or panel node, might as well be owned by html!

>>> GROUP <<<

The group node is unique in that it can take any attributes at all, and
corresponds to absolutely no graphical representation in any medium. In
a sense, group nodes aren't even really there - a child of a group node
is interpreted as being a child of the group node's parent. For example:

    <page title="This is my lovely title">
      <line text="hi! How ya doin?" />
      <group>
        <panel name="I am a panel" />
        <img src="bob_dylan.jpg">
          <img src="bob_jr.jpg" />
        </img>
      </group>
    </page>

is completely equivalent to:

    <page title="This is my lovely title">
      <line text="hi! How ya doin?" />
      <panel name="I am a panel" />
      <img src="bob_dylan.jpg">
        <img src="bob_jr.jpg" />
      </img>
    </page>

Far from being useless, however, group nodes are used to provide an
'environment' for their children. Any attributes specified on a group
node are propagated down to every one of their descendants that doesn't
specify that attribute already. For example:

    <page>
      <group text="howdy doo?" src="jumbalaya.jpg" owner="html">
        <img />                            <!-- img 1 -->
        <line>                             <!-- line 1 -->
          <line />                         <!-- line 2 -->
          <group text="alternate!">
            <line text="rebuffed!">        <!-- line 3 -->
              <line />                     <!-- line 4 -->
            </line>
            <img />                        <!-- img 2 -->
          </group>
        </line>
        <![CDATA[This is some text that I'm getting away with.]]>
      </group>
    </page>

This XML is accepted by the parser because the required src attribute of
both img nodes and the required text attribute of line nodes 1 and 2 are
provided by the outer group node.
The inner group node *overrides* the text attribute of the outer group
node, making the text value "alternate!" available to all of its
descendants. Line 3 already has a value specified on it, so only line 4
accepts "alternate!" as the value of its text attribute.

Notice also that the outer group node was used to specify an owner for
every one of the nodes in the tree, even the CDATA node. In general, if
any node specifies an owner, all of its descendants are reserved for
parsing by that owner as well, regardless of what they themselves
specify. So in:

    <page>
      <panel owner="html" name="panel boy">
        <line text="THIS WILL NEVER BE SEEN. EVER." owner="flash">
          <img src="whocares?" />
        </line>
      </panel>
    </page>

Only the html parser is willing to look at the panel, but it refuses to
notice the only element in there. Ultimately, the source for the image
node is irrelevant, so it can be as invalid as we want it to be! (As
long as it remains well-formed XML, that is...)

The following case, though, is somewhat more interesting:

    <page>
      <group owner="flash">
        <group owner="html">
          <line text="Does html see me?" />
        </group>
        <line text="My owner is obviously flash." />
      </group>
    </page>

The solution here follows the rules that children of group nodes are
interpreted as children of their parents, and that all their attribute
values are determined by their innermost 'environment'. Therefore, both
lines are actually children of page, one of them is owned by html and
the other is owned by flash. This is different, however, from the
following scenario:

    <page>
      <group owner="flash">
        <panel name="Hi!">
          <line text="Does html see me?" owner="html" />
        </panel>
      </group>
    </page>

Here, the owner attribute is propagated down to panel, which follows the
standard rules that the entire subtree starting from that node is parsed
only by flash.

All this complexity aside, you really only *need* the group node when
you want to specify the owner of a text or CDATA node with no
presentational consequences.
It can, however, be a useful tool. With it, for example, you can
conveniently author entirely different pages for each owner from the
same file. Note that the root of the XML tree still MUST be a page
node - you can't legally get fancy with group nodes there.

>>> OTHER NODES <<<

Only the above nodes are recognized by the interpreters. Other nodes and
their descendants are ignored completely, though if you let the
interpreters complain about them they will.
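
As a recap of the GROUP rules above, the following Python sketch shows
one way the two group behaviors could be modeled: a group's children
are re-attached to the group's parent, and a group's attributes are
inherited by every descendant that does not set them itself. It is a
minimal illustration, not the code of the Perl or Flash interpreters.
The function name resolve is hypothetical, text and CDATA content is
ignored, and the separate rule that an owner on a non-group node claims
its whole subtree is deliberately left out.

```python
import xml.etree.ElementTree as ET


def resolve(node, env):
    """Yield copies of `node` with group nodes dissolved.

    `env` is the dict of attributes inherited from enclosing group
    nodes. A group yields its resolved children instead of itself; any
    other node yields one copy whose own attributes override `env`.
    Only group attributes are inherited, so a non-group node passes
    `env` to its children unchanged.
    """
    if node.tag == "group":
        inner = {**env, **node.attrib}  # inner group overrides outer
        for child in node:
            yield from resolve(child, inner)
    else:
        copy = ET.Element(node.tag, {**env, **node.attrib})
        for child in node:
            copy.extend(list(resolve(child, env)))
        yield copy


def flatten(xml_text):
    """Parse `xml_text` and return its root with all groups dissolved."""
    return next(resolve(ET.fromstring(xml_text), {}))
```

Calling flatten on the "howdy doo?" example from the GROUP section
should give a tree with no group nodes in which line 3 keeps its own
text value while line 4 and img 2 take theirs from the inner group.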