Left bracket inside attribute value

Type: bug fix
Component: Parser
Description: The old parser allowed the < character inside a quoted attribute value. The XML spec says this is illegal, and the new parser gives an error.
Impact: Some erroneous files that used to parse will no longer parse.

Omitted right bracket

Type: bug fix
Component: Parser
Description: The old parser allowed the right bracket closing a tag to be omitted. The new parser gives an error.
Impact: Some erroneous files that used to parse will no longer parse.

Illegal tag syntax

Type: bug fix
Component: Parser
Description: The old parser allowed a tag to be written as <foo = "bar" >. The new parser gives an error.
Impact: Documents containing this illegal syntax will now receive a parse error.
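
The three entries above all concern input that is not well-formed XML. As a rough illustration of the new behavior, the sketch below feeds each construct to a conforming parser and reports whether it is rejected. It uses Python's standard xml.etree.ElementTree module (backed by Expat) purely as a stand-in; it is not the parser described in these notes, and the helper name is_well_formed and the sample strings are assumptions made for the example.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if a conforming XML parser accepts the text."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Hypothetical sample documents, one per entry above, plus a legal variant.
samples = {
    "left bracket in attribute value": '<foo value="a<b"/>',
    "escaped left bracket (legal)": '<foo value="a&lt;b"/>',
    "omitted right bracket": "<foo><bar</foo>",
    "illegal tag syntax": '<foo = "bar" ></foo>',
}

for label, text in samples.items():
    verdict = "accepted" if is_well_formed(text) else "rejected"
    print(f"{label}: {verdict}")
```

Files that relied on the old, lenient behavior can usually be repaired by escaping the < character as &lt; and by closing every tag explicitly.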

Illegal characters

Type: difference
Component: Parser
Description: Bytes in the range 0x80 to 0xBF are not legal UTF-8 characters on their own; in UTF-8 they can appear only as continuation bytes of a multi-byte sequence. However, the old parser replaced them with spaces instead of giving errors. Since some existing CDF files contain these bytes, the new parser does not give an error either, but it leaves the incorrect character in the text instead of replacing it with a space. In the European code pages (e.g. windows-1252, windows-1254), this range includes the left and right quote characters, the trademark symbol, etc.
Impact: Some erroneous files will now work better than before. Nothing should break.
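
To make the byte range concrete, the following sketch checks a few windows-1252 bytes: each one is rejected as stand-alone UTF-8, yet decodes to a quote or trademark character under windows-1252. It relies only on Python's built-in codecs; the helper name is_valid_utf8 and the choice of sample bytes are assumptions for illustration.

```python
# Bytes that windows-1252 maps to curly quotes and the trademark sign.
CP1252_SAMPLES = bytes([0x91, 0x92, 0x93, 0x94, 0x99])


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes form a complete, valid UTF-8 sequence."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


for byte in CP1252_SAMPLES:
    raw = bytes([byte])
    code_point = ord(raw.decode("cp1252"))
    print(
        f"0x{byte:02X}: valid UTF-8 alone? {is_valid_utf8(raw)}; "
        f"windows-1252 reads it as U+{code_point:04X}"
    )
```

A file saved in windows-1252 but parsed as UTF-8 is the typical source of these bytes; declaring the real encoding in the XML declaration avoids the problem.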

Illegal PCDATA characters

Type: difference
Component: Parser
Description: The XML spec says PCDATA characters must be 0x9, 0xA, 0xD, or in the ranges 0x20-0xD7FF or 0xE000-0xFFFD. Anything outside these ranges is an error. The old parser did not check this at all.
Impact: Some files containing these erroneous characters will break.
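
The range check can be sketched as a small pre-flight scan that locates the first offending character before a file is handed to the parser. The function names are hypothetical, and the ranges are the ones listed in this entry; the XML 1.0 Char production additionally permits 0x10000-0x10FFFF, which this sketch deliberately leaves out to match the entry.

```python
def is_legal_pcdata_char(code: int) -> bool:
    """Check a character code against the ranges listed in this entry."""
    return (
        code in (0x9, 0xA, 0xD)
        or 0x20 <= code <= 0xD7FF
        or 0xE000 <= code <= 0xFFFD
    )


def first_illegal_char(text: str):
    """Return (index, code) of the first illegal character, or None."""
    for index, ch in enumerate(text):
        if not is_legal_pcdata_char(ord(ch)):
            return index, ord(ch)
    return None


print(first_illegal_char("plain\ttext"))
print(first_illegal_char("form\x0cfeed"))
```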

Element text

Type: difference
Component: IXMLElement::get_text
Description: The old parser was inconsistent in what it returned from get_text when an element had no text. For a container element, it returned S_FALSE and null. For a leaf element, it returned S_OK and didn't alter the text pointer (bug!). The new parser returns S_OK and an empty string.
Impact: Should not break any clients.

Poorly formed attribute values

Type: bug fix
Component: Parser
Description: The old parser ignored extra tokens following an attribute in an element's opening tag, as in <foo value="first" "second" 34 >. The new parser gives an error.
Impact: This change could cause files that previously parsed without errors to have errors.

Processing instructions

Type: bug fix
Component: Parser
Description: There are three differences between the old parser and the new one related to processing instructions (PIs):
  • The old parser stored PI contents as upper case; the new one doesn't.
  • The old parser did not fold PI target names to upper case; the new one does.
Impact: There is no change to which files parse or not, but the target names and text of PIs may be different. CDF should not be affected.

Error reporting

Type: difference
Component: IXMLError
Description: The interface for reporting errors is bogus and not fully supported by the new parser. First, the string representing the line where the error occurred is not returned. In the worst case, the parsed file could consist of a single line of text, so there's no limit to how big this string could be. Since the parser does not keep the whole file in memory as it is being parsed, the string might not be available. Rather than keeping a copy of the text around in case it's needed for error reporting, the parser does not return the string if an error occurs. Second, the interface where an "expected token" and a "found token" are returned is not appropriate for all types of errors. For those types where it is appropriate, the new parser returns this information. Otherwise, it does not. There are three possible combinations of returned strings:
  • Both found and expected strings are returned.
  • Found string is returned and expected string is empty. This means there are several or many possible expected tokens.
  • Found string is empty. The expected string contains a descriptive error message.
Impact: Some code that processes IXMLError information may not work as well as before.
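
Client code that formats error messages may need to handle all three combinations. The sketch below shows one possible way to do so, assuming the found and expected strings have already been retrieved from the error interface as plain strings. The function name describe_error and the message wording are hypothetical, and the final fallback covers a case these notes do not describe.

```python
def describe_error(found: str, expected: str) -> str:
    """Build a readable message from the found/expected strings."""
    if found and expected:
        return f"expected {expected!r} but found {found!r}"
    if found:
        # Several or many tokens could have been accepted here.
        return f"unexpected token {found!r}"
    if expected:
        # The expected string carries a descriptive message, not a token.
        return expected
    # Not covered by the notes: neither string was returned.
    return "unknown parse error"


print(describe_error(">", "="))
print(describe_error("34", ""))
print(describe_error("", "Invalid character in attribute value"))
```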

XML declaration attributes

Type: difference
Component: Parser
Description: The old parser ignored unrecognized attributes in the XML declaration. The new one gives an error.
Impact: Some erroneous files that used to parse will no longer parse. Any files that contain the obsolete XML declaration attribute "RMD" will no longer parse. (The RMD attribute was never supported, so it should not be in use.)

Normalizing attribute values

Type: bug fix
Component: Parser
Description: The old parser did not correctly normalize attribute values. The XML spec says that runs of white space should be replaced by a single space. The new parser does this correctly.
Impact: Documents containing attributes whose values contain multiple white space sequences or newlines will be parsed differently. It's possible but unlikely that this will break any CDF files.
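
The rule as stated in this entry can be sketched in a few lines. This is only the white space collapsing step; the full normalization procedure in the XML spec has further steps (such as expanding character references and, for some attribute types, trimming leading and trailing spaces) that are not shown here. The function name is an assumption.

```python
import re

# XML white space characters: space, tab, carriage return, line feed.
_XML_WHITESPACE_RUN = re.compile(r"[ \t\r\n]+")


def normalize_attribute_value(value: str) -> str:
    """Replace each run of XML white space with a single space."""
    return _XML_WHITESPACE_RUN.sub(" ", value)


print(repr(normalize_attribute_value("first \n\t second")))
```

Code that compares attribute values against strings containing literal newlines or repeated spaces is the kind most likely to notice the difference.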

CDATA now supported

Type: new feature
Component: Parser
Description: The old parser did not support CDATA; the new one does. If a document contains CDATA, the old parser will appear to process it (put_url appears to work), but get_root will then fail.
Impact: Some documents that did not parse before will now parse.
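
For readers unfamiliar with CDATA sections, the sketch below shows what a parser that supports them is expected to deliver: the section's content arrives as ordinary element text, with < and & taken literally. Python's xml.etree.ElementTree is used as a stand-in for illustration; the sample document is made up.

```python
import xml.etree.ElementTree as ET

# Inside a CDATA section, markup characters need no escaping.
document = "<item><title><![CDATA[Fish & chips <today>]]></title></item>"

root = ET.fromstring(document)
title_text = root.find("title").text
print(title_text)
```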

DOCTYPE name

Type: bug fix
Component: IXMLDocument::get_doctype
Description: The old parser did not fold the DOCTYPE name to upper case; the new one does.
Impact: Doctype was not used by the old parser, so the impact should be minimal.

Version

Type: difference
Component: IXMLDocument::get_version
Description: The old parser returned S_FALSE for get_version if the document did not have an XML declaration. This was inconsistent with the behavior of get_encoding, which always returned S_OK, and returned "UTF-8" (the default) if there was no XML declaration. The new parser returns S_OK and "1.0" if there is no XML declaration.
Impact: Should be minimal.
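
The defaulting rule described here can be sketched as follows: when no XML declaration is present, report version "1.0" and encoding "UTF-8". The helper version_and_encoding is hypothetical and uses a deliberately simplified regular expression; it only mirrors the defaults described in this entry and is not how the parser itself reads the declaration.

```python
import re

# Simplified pattern for an XML declaration at the very start of the text.
_XML_DECL = re.compile(
    r"""^<\?xml\s+version\s*=\s*(["'])(?P<version>[^"']*)\1"""
    r"""(?:\s+encoding\s*=\s*(["'])(?P<encoding>[^"']*)\3)?"""
)


def version_and_encoding(text: str):
    """Return (version, encoding), applying defaults when undeclared."""
    match = _XML_DECL.match(text)
    if match is None:
        return "1.0", "UTF-8"
    return match.group("version"), match.group("encoding") or "UTF-8"


print(version_and_encoding("<doc/>"))
print(version_and_encoding('<?xml version="1.0" encoding="windows-1252"?><doc/>'))
```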