JDK-8145969

SAX parser reports incorrect attribute value in the presence of surrogate pairs

    Details

    • Type: Bug
    • Status: Closed
    • Priority: P4
    • Resolution: Duplicate
    • Affects Version/s: 8u45
    • Fix Version/s: None
    • Component/s: xml
    • Labels:

      Description

      FULL PRODUCT VERSION :
      1.8.0_25. Also reproduced with 1.6.0_27

      ADDITIONAL OS VERSION INFORMATION :
      OS X 10.10.5

      EXTRA RELEVANT SYSTEM CONFIGURATION :
      Tested in multiple configurations

      A DESCRIPTION OF THE PROBLEM :
      The input is a UTF-8 encoded XML file containing a single element with a single attribute, whose value consists of two non-BMP characters (U+1D6A4 twice). In the string reported to the SAX ContentHandler, the attribute value contains three non-BMP characters (U+1D6A4 three times).
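
As a minimal sketch of what "non-BMP character" means for the attribute value here: in a Java String, U+1D6A4 is stored as a UTF-16 surrogate pair, so a correct value of two characters occupies four code units. The class and method names below are hypothetical and are not part of the attached program.

```java
final class SurrogatePairSketch {

    /** Formats each UTF-16 code unit of s as lowercase hex, separated by single spaces. */
    static String hexUnits(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(Integer.toHexString(s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // One non-BMP character, built from its code point.
        String one = new String(Character.toChars(0x1D6A4));
        // The attribute value from the report: the same character twice.
        String value = one + one;

        System.out.println(hexUnits(one));                            // d835 dea4
        System.out.println(value.length());                           // 4 (UTF-16 code units)
        System.out.println(value.codePointCount(0, value.length()));  // 2 (characters)
    }
}
```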

      The problem does not occur with Apache Xerces.

      The problem occurs with all known versions of the JDK XML parser.

      We have been aware of occasional corruption of XML attribute values for years, but this is the first time a client has provided such a simple demonstration of the problem.

      Originally reported (incorrectly) as a bug in the Saxon product: https://saxonica.plan.io/issues/2533

      ADDITIONAL REGRESSION INFORMATION:
      1.8.0_25

      STEPS TO FOLLOW TO REPRODUCE THE PROBLEM :
      Ensure that Apache Xerces is on the classpath. Run the attached program supplying the name of the attached XML file as the only argument:

      java commands.JDKParserBug input.xml

      The program prints output for the JDK parser and for the Apache Xerces parser. The Apache output is correct; the JDK output is incorrect.

      (Alternatively, comment out the reference to Apache Xerces. It is only there to provide additional verification.)

      EXPECTED VERSUS ACTUAL BEHAVIOR :
      EXPECTED -
      SAXParser: ....
      ELEMENT: name
        ATTRIBUTE sortable: d835 dea4 d835 dea4
      ACTUAL -
      SAXParser: com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl
      ELEMENT: name
        ATTRIBUTE sortable: d835 dea4 d835 dea4 d835 dea4
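
The hex dumps above list UTF-16 code units, not characters. A rough sketch of how such a dump could be turned back into a character count is shown below; the helper name is hypothetical and assumes the space-separated hex format printed by showString.

```java
final class HexDumpCheck {

    /** Counts the Unicode code points encoded by a space-separated dump of UTF-16 code units. */
    static int codePointsInDump(String dump) {
        String trimmed = dump.trim();
        if (trimmed.isEmpty()) {
            return 0;
        }
        StringBuilder sb = new StringBuilder();
        for (String unit : trimmed.split("\\s+")) {
            // Each token is one 16-bit code unit written in hexadecimal.
            sb.append((char) Integer.parseInt(unit, 16));
        }
        return sb.codePointCount(0, sb.length());
    }

    public static void main(String[] args) {
        System.out.println(codePointsInDump("d835 dea4 d835 dea4"));            // expected output: 2
        System.out.println(codePointsInDump("d835 dea4 d835 dea4 d835 dea4"));  // actual output: 3
    }
}
```

For the EXPECTED line this yields 2 characters; for the ACTUAL line it yields 3, which is the extra U+1D6A4 described in the report.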

      REPRODUCIBILITY :
      This bug can be reproduced always.

      ---------- BEGIN SOURCE ----------
      package commands;


      import org.xml.sax.Attributes;
      import org.xml.sax.InputSource;
      import org.xml.sax.SAXException;
      import org.xml.sax.XMLReader;
      import org.xml.sax.helpers.XMLFilterImpl;

      import javax.xml.parsers.ParserConfigurationException;
      import javax.xml.parsers.SAXParser;
      import javax.xml.parsers.SAXParserFactory;
      import java.io.IOException;
      import java.io.StringReader;

      public class JDKParserBug {


          public static void main(String[] args) {
              try {
                  System.err.println(System.getProperty("java.version"));

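                  // The attribute value is U+1D6A4 twice, i.e. two UTF-16 surrogate pairs (D835 DEA4 each).
                  String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><name sortable=\"\uD835\uDEA4\uD835\uDEA4\"/>";

                  // Parse the same document with Apache Xerces first, then with the JDK's built-in parser.
                  for (String factoryName : new String[]{
                          "org.apache.xerces.jaxp.SAXParserFactoryImpl",
                          "com.sun.org.apache.xerces.internal.jaxp.SAXParserFactoryImpl"}) {

                      // "".getClass().getClassLoader() is the class loader of String, which is null (bootstrap).
                      SAXParserFactory factory = SAXParserFactory.newInstance(factoryName, "".getClass().getClassLoader());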
                      factory.setNamespaceAware(true);
                      SAXParser parser = factory.newSAXParser();
                      System.err.println("SAXParser: " + parser.getClass().getName());
                      XMLReader reader = parser.getXMLReader();
                      reader.setContentHandler(new XMLFilterImpl() {
                          @Override
                          public void startElement(String uri, String localName, String qName, Attributes atts) throws SAXException {
                              System.err.println("ELEMENT: " + localName);
                              for (int i=0; i<atts.getLength(); i++) {
                                  System.err.println(" ATTRIBUTE " + atts.getLocalName(i) + ": " +
                                      showString(atts.getValue(i)));
                              }
                          }
                      });
                      reader.parse(new InputSource(new StringReader(xml)));
                  }
              } catch (ParserConfigurationException e) {
                  e.printStackTrace();
              } catch (SAXException e) {
                  e.printStackTrace();
              } catch (IOException e) {
                  e.printStackTrace();
              }
          }

          /** Returns the UTF-16 code units of s in hexadecimal, each followed by a space. */
          public static String showString(String s) {
              StringBuilder result = new StringBuilder();
              for (int i=0; i<s.length(); i++) {
                  int c = s.charAt(i);
                  result.append(Integer.toHexString(c)).append(" ");
              }
              return result.toString();
          }
      }

      ---------- END SOURCE ----------

      CUSTOMER SUBMITTED WORKAROUND :
      Use Apache Xerces in place of the JDK parser. (I have been advising my clients to do this for years, largely because of this bug.)
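
One possible way to apply this workaround in code is to request the Xerces factory by class name, as the test program does. The sketch below assumes Apache Xerces is on the classpath; the class name XercesPreferredFactory and the fallback to the default factory are illustrative choices, not part of the report. An application that must avoid the JDK parser entirely may prefer to let the error propagate instead of falling back.

```java
import javax.xml.parsers.FactoryConfigurationError;
import javax.xml.parsers.SAXParserFactory;

final class XercesPreferredFactory {

    // Factory class name of Apache Xerces, as used in the test program above.
    static final String XERCES_FACTORY = "org.apache.xerces.jaxp.SAXParserFactoryImpl";

    /** Returns a namespace-aware factory, preferring Apache Xerces when it can be loaded. */
    static SAXParserFactory newFactory() {
        SAXParserFactory factory;
        try {
            factory = SAXParserFactory.newInstance(
                    XERCES_FACTORY, XercesPreferredFactory.class.getClassLoader());
        } catch (FactoryConfigurationError e) {
            // Xerces is not available: fall back to the platform default (assumed acceptable here).
            factory = SAXParserFactory.newInstance();
        }
        factory.setNamespaceAware(true);
        return factory;
    }

    public static void main(String[] args) {
        // Prints which implementation was actually selected.
        System.out.println(newFactory().getClass().getName());
    }
}
```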

              People

              • Assignee: Aleksej Efimov (aefimov)
              • Reporter: Webbug Group (webbuggrp)
              • Votes: 0
              • Watchers: 2
