Left bracket inside attribute value
Type: bug fix
Component: Parser
Description: The old parser allowed the < character inside a quoted
attribute value. The XML spec says this is illegal, and the new parser gives an error.
Impact: Some erroneous files that used to parse will no longer parse.
Omitted right bracket
Type: bug fix
Component: Parser
Description: The old parser allowed the right bracket closing a tag to be
omitted. The new parser gives an error.
Impact: Some erroneous files that used to parse will no longer parse.
Illegal tag syntax
Type: bug fix
Component: Parser
Description: The old parser allowed a tag to be written as <foo = "bar" >.
The new parser gives an error.
Impact: Documents containing this illegal syntax will now receive a parse error.
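As an illustration of the three syntax errors described so far, the following minimal sketch feeds each construct to a conforming parser. It uses Python's standard xml.etree.ElementTree module (backed by expat) as a stand-in, not the parser described in these notes. The helper name is_well_formed and the sample documents are made up for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(document: str) -> bool:
    """Return True if a conforming XML parser accepts the document."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False


# Hypothetical samples of the constructs that the old parser tolerated.
SAMPLES = {
    "left bracket in attribute value": '<item title="a < b"/>',
    "omitted right bracket": "<channel><item</channel>",
    "illegal tag syntax": '<foo = "bar" ></foo>',
}

if __name__ == "__main__":
    for label, document in SAMPLES.items():
        status = "ok" if is_well_formed(document) else "parse error"
        print(f"{label}: {status}")
```

Files that need a literal < inside an attribute value can escape it as &lt;, which a conforming parser accepts.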
Illegal characters
Type: difference
Component: Parser
Description: Bytes in the range 0x80 to 0xBF are UTF-8 continuation bytes and are not
legal as standalone characters. However, the old parser replaced them with spaces instead
of giving errors. Since some existing CDF files contain these bytes, the new parser does
not give an error either, but it leaves the incorrect character in the text instead of
replacing it with a space. In the Windows code pages for European languages (e.g.
windows-1252 and windows-1254) this range includes the left and right quote characters,
the trademark symbol, and so on.
Impact: Some erroneous files will now work better than before. Nothing should break.
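To make the byte range concrete, here is a small sketch showing why such bytes are rejected by a strict UTF-8 decoder yet read as quotes and a trademark symbol under windows-1252. The sample bytes and helper names are illustrative assumptions, and Python's codecs stand in for the parser's own decoding.

```python
# 0x93, 0x94 and 0x99 all fall in the 0x80-0xBF range discussed above.
RAW = b"\x93CDF\x94 \x99"


def is_utf8_continuation_byte(byte: int) -> bool:
    """True for bytes of the form 10xxxxxx, i.e. 0x80 through 0xBF."""
    return (byte & 0xC0) == 0x80


def decodes_as_utf8(data: bytes) -> bool:
    """True if a strict UTF-8 decoder accepts the data."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


if __name__ == "__main__":
    print(decodes_as_utf8(RAW))   # strict UTF-8 view of the bytes
    print(RAW.decode("cp1252"))   # windows-1252 view of the same bytes
```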
Illegal PCDATA characters
Type: difference
Component: Parser
Description: The XML spec says PCDATA characters must be 0x9, 0xA, 0xD, or in the ranges
0x20-0xD7FF or 0xE000-0xFFFD. Anything outside these ranges is an error. The old parser
did not check this at all; the new parser does.
Impact: Some files containing these erroneous characters will break.
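The character check can be sketched as a simple range test. This is an illustrative sketch using only the ranges listed in this entry, not the parser's actual code, and the function names are hypothetical. Note that the Char production in XML 1.0 also permits 0x10000-0x10FFFF; that range is left out here to mirror the text.

```python
# Inclusive ranges and single code points as listed in the entry above.
LEGAL_RANGES = ((0x20, 0xD7FF), (0xE000, 0xFFFD))
LEGAL_SINGLES = {0x9, 0xA, 0xD}


def is_legal_char(code_point: int) -> bool:
    """Return True if the code point is in one of the listed ranges."""
    if code_point in LEGAL_SINGLES:
        return True
    return any(low <= code_point <= high for low, high in LEGAL_RANGES)


def first_illegal_char(text: str):
    """Return (index, code point) of the first illegal character, or None."""
    for index, char in enumerate(text):
        if not is_legal_char(ord(char)):
            return index, ord(char)
    return None


if __name__ == "__main__":
    print(first_illegal_char("title\x01text"))
```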
Element text
Type: difference
Component: IXMLElement::get_text
Description: There were inconsistencies in what the old parser returned from get_text
where there is no text. For a container element, it returned S_FALSE and null. For a leaf
element, it returned S_OK and didn't alter the text pointer (bug!). The new parser returns
S_OK and an empty string.
Impact: Should not break any clients.
Poorly formed attribute values
Type: bug fix
Component: Parser
Description: The old parser ignored extra tokens following an attribute value in an
element's opening tag, for example <foo value="first" "second" 34 >. The new parser
gives an error.
Impact: This change could cause files that previously parsed without errors to have errors.
Processing instructions
Type: bug fix
Component: Parser
Description: The old parser and the new one differ in how they handle processing
instructions (PIs):
The old parser stored PI contents as upper case; the new one doesn't.
The old parser did not fold PI target names to upper case; the new one does.
Impact: There is no change to which files parse or not, but the target names and text of
PIs may be different. CDF should not be affected.
Error reporting
Type: difference
Component: IXMLError
Description: The interface for reporting errors is bogus and not fully supported by
the new parser. First, the string representing the line where the error occurred is not
returned. In the worst case, the parsed file could consist of a single line of text, so
there's no limit to how big this string could be. Since the parser does not keep the whole
file in memory as it is being parsed, the string might not be available. Rather than keeping
a copy of the text around in case it's needed for error reporting, the parser does not
return the string if an error occurs. Second, the interface where an "expected token" and
a "found token" are returned is not appropriate for all types of errors. For those types
where it is appropriate, the new parser returns this information. Otherwise, it does not.
There are three possible combinations of returned strings:
Both found and expected strings are returned.
Found string is returned and expected string is empty. This means there are several
or many possible expected tokens.
Found string is empty. The expected string contains a descriptive error message.
Impact: Some code that processes IXMLError information may not work as well as before.
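Client code that consumes these strings may want to handle each of the three combinations explicitly. The sketch below shows one way to do that. The function name, the message wording, and the fallback for the case where both strings are empty are assumptions, not part of IXMLError.

```python
def describe_error(found: str, expected: str) -> str:
    """Turn a found/expected string pair into one readable message."""
    if found and expected:
        # Both strings returned: a specific token was expected.
        return f"expected {expected} but found {found}"
    if found:
        # Expected string is empty: several or many tokens were possible.
        return f"unexpected {found}"
    if expected:
        # Found string is empty: expected holds a descriptive message.
        return expected
    # Not one of the three combinations listed; this fallback is an assumption.
    return "unknown parse error"


if __name__ == "__main__":
    print(describe_error("<", ">"))
    print(describe_error("34", ""))
    print(describe_error("", "Invalid character"))
```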
XML declaration attributes
Type: difference
Component: Parser
Description: The old parser ignored unrecognized attributes in the XML declaration.
The new one gives an error.
Impact: Some erroneous files that used to parse will no longer parse. Any files that
contain the obsolete XML declaration attribute "RMD" will no longer parse. (The RMD
attribute was never supported, so it should not be in use.)
Normalizing attribute values
Type: bug fix
Component: Parser
Description: The old parser did not correctly normalize attribute values. The XML
spec says that runs of white space should be replaced by a single space. The new parser
does this correctly.
Impact: Documents containing attributes whose values contain runs of white space or
newlines will be parsed differently. It's possible but unlikely that this will break any
CDF files.
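The normalization described in this entry can be sketched as a single regular-expression substitution. This is a minimal illustration of the behavior as stated here, not the parser's implementation, and the function name is hypothetical. The full XML 1.0 rule is more detailed: every white space character becomes a space, and runs are collapsed (with leading and trailing spaces dropped) only for attribute types other than CDATA.

```python
import re

# A run of one or more XML white space characters: space, tab, CR, LF.
_WHITE_SPACE_RUN = re.compile(r"[ \t\r\n]+")


def normalize_attribute_value(value: str) -> str:
    """Replace each run of white space with a single space.

    Leading and trailing white space is collapsed but not removed.
    """
    return _WHITE_SPACE_RUN.sub(" ", value)


if __name__ == "__main__":
    print(repr(normalize_attribute_value("first \n\t second")))
```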
CDATA now supported
Type: new feature
Component: Parser
Description: The old parser did not support CDATA; the new one does. If a document
contains CDATA, the old parser will appear to process it (put_url appears to work), but
get_root will then fail.
Impact: Some documents that did not parse before will now parse.
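For readers unfamiliar with CDATA, the sketch below shows what a CDATA section looks like and how a parser that supports it returns the enclosed text verbatim. It uses Python's standard xml.etree.ElementTree module as a stand-in for the new parser, and the sample document is made up.

```python
import xml.etree.ElementTree as ET

# Inside a CDATA section, < and & need no escaping.
DOCUMENT = "<item><![CDATA[if a < b && c > d]]></item>"


def cdata_text(document: str) -> str:
    """Parse the document and return the root element's text content."""
    return ET.fromstring(document).text


if __name__ == "__main__":
    print(cdata_text(DOCUMENT))
```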
DOCTYPE name
Type: bug fix
Component: IXMLDocument::get_doctype
Description: The old parser did not fold the DOCTYPE name to upper case; the new one does.
Impact: Doctype was not used by the old parser, so the impact should be minimal.
Version
Type: difference
Component: IXMLDocument::get_version
Description: The old parser returned S_FALSE for get_version if the document did
not have an XML declaration. This was inconsistent with the behavior of get_encoding,
which always returned S_OK, and returned "UTF-8" (the default) if there was no XML
declaration. The new parser returns S_OK and "1.0" if there is no XML declaration.
Impact: Should be minimal.