JDK-6836089

Swing HTML parser can't properly decode codepoints outside the Unicode Plane 0 into a surrogate pair

    Details

    • Subcomponent:
    • Resolved In Build:
      b07
    • CPU:
      generic
    • OS:
      generic
    • Verification:
      Verified

      Backports

        Description

        The statement

           System.out.println("\ud840\udc00".codePointAt(0));

        prints

           131072 (U+20000), because \ud840 and \udc00 are the high and low surrogates of a single surrogate pair.
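        The value 131072 can be checked against the UTF-16 decoding rule. The following is a minimal sketch; the class and method names are illustrative and not part of the original report. It decodes the pair once with the arithmetic written out and once with Character.toCodePoint.

```java
class SurrogateDecodeDemo {
    // UTF-16 rule: 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
    static int decodeManually(char high, char low) {
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
    }

    // The same decoding using the standard library.
    static int decodeWithLibrary(char high, char low) {
        return Character.toCodePoint(high, low);
    }

    public static void main(String[] args) {
        String s = "\ud840\udc00";
        // Two UTF-16 code units, but only one code point.
        System.out.println(s.length());                      // 2
        System.out.println(s.codePointCount(0, s.length())); // 1
        System.out.println(decodeManually(s.charAt(0), s.charAt(1)));    // 131072
        System.out.println(decodeWithLibrary(s.charAt(0), s.charAt(1))); // 131072
    }
}
```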

        If one says

           JTextPane htmlPane = new JTextPane();
           htmlPane.setEditorKit(new HTMLEditorKit());

           htmlPane.setText("<html><head></head><body>&#131072;</body></html>");

        the entity reference won't be parsed correctly into a surrogate pair.

           System.out.println(htmlPane.getText());

        prints

        <html>
          <head>
            
          </head>
          <body>
            &#0;
          </body>
        </html>

        rather than

        <html>
          <head>
            
          </head>
          <body>
            &#55360;&#56320;
          </body>
        </html>


        or at least

        <html>
          <head>
            
          </head>
          <body>
            &#131072;
          </body>
        </html>

          Activity

          vkarnauk Vladislav Karnaukhov added a comment -
          BT2:SUGGESTED FIX

          Parser.parseEntityReference(), which is part of HTML parsing, does not check whether the decoded code point lies within the BMP.

          The suggested fix is to check whether the code point lies within the BMP and, if it does not, to convert it into a surrogate pair. In pseudocode:

          if(codepoint <= BMP_HIGHER_LIMIT) {
              //default behaviour
          } else {
              //convert into surrogate pair
              //form string "&#HIGH_SURROGATE;&#LOW_SURROGATE;"
              //return toCharArray()
          }
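
          The pseudocode above could be fleshed out roughly as follows. This is only a sketch under stated assumptions: the class and both helper methods are hypothetical, they are not the code of Parser.parseEntityReference(), and they are not necessarily the fix that was integrated. BMP_HIGHER_LIMIT is taken to be 0xFFFF.

```java
class SupplementaryEntitySketch {
    // Assumption: the highest code point of the Basic Multilingual Plane.
    static final int BMP_HIGHER_LIMIT = 0xFFFF;

    // Hypothetical helper: turns a numeric character reference value into UTF-16 chars.
    static char[] decodeNumericEntity(int codePoint) {
        if (!Character.isValidCodePoint(codePoint)) {
            throw new IllegalArgumentException("Not a Unicode code point: " + codePoint);
        }
        if (codePoint <= BMP_HIGHER_LIMIT) {
            // Default behaviour: a single char is enough.
            return new char[] { (char) codePoint };
        }
        // Supplementary code point: produce a high/low surrogate pair.
        return Character.toChars(codePoint);
    }

    // Hypothetical helper: forms the string "&#HIGH_SURROGATE;&#LOW_SURROGATE;".
    static String toSurrogateEntities(int codePoint) {
        char high = Character.highSurrogate(codePoint);
        char low = Character.lowSurrogate(codePoint);
        return "&#" + (int) high + ";&#" + (int) low + ";";
    }

    public static void main(String[] args) {
        System.out.println(decodeNumericEntity(131072).length); // 2
        System.out.println(toSurrogateEntities(131072));        // &#55360;&#56320;
    }
}
```

          For the code point from the description, 131072, the second helper yields the string "&#55360;&#56320;", which matches the output the reporter expected.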
          vkarnauk Vladislav Karnaukhov added a comment -
          BT2:EVALUATION

          Parser.parseEntityReference(), which is part of HTML parsing, does not check whether the code point lies within the BMP (Basic Multilingual Plane). As a result, the parser cannot convert a supplementary code point into the corresponding surrogate pair.
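
          One plausible way the missing check leads to the reported "&#0;" output is a narrowing conversion of the parsed value to a single char. That is an assumption made for illustration, not something confirmed by the parser source here. The sketch below, with a hypothetical class and method name, shows what such a narrowing does to 131072.

```java
class NarrowingDemo {
    // Casting an int to char keeps only the low 16 bits of the value.
    static int narrowToChar(int codePoint) {
        return (char) codePoint;
    }

    public static void main(String[] args) {
        // 131072 is 0x20000; its low 16 bits are all zero.
        System.out.println(narrowToChar(131072)); // 0
    }
}
```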

            People

            • Assignee:
              vkarnauk Vladislav Karnaukhov
              Reporter:
              jloefflm Johann Löfflmann (Inactive)