JDK / JDK-8227584

treating some characters as double-byte un-mappable for Windows-31J charset

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: P4
    • Resolution: Won't Fix
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: core-libs
    • Labels: None

      Description

      The attached reproducer demonstrates the issue. The test converts two byte sequences from Windows-31J to UTF_16BE. The output is shown below:

      Windows-31J : 81 E8 81 E8
      UTF_16BE : 22 2C 22 2C

      Windows-31J : 81 E8 81 E9 81 E8
      UTF_16BE : 22 2C FF FD 9A 55 FF FD

      The first sequence consists of two identical characters (“double integral”, U+222C).
      This character appears in the code chart at
      https://en.wikipedia.org/wiki/JIS_X_0208#Character_set_0x22_(row_number_2,_special_characters)
      Its position is 2-74. This sequence is converted to 222C 222C, which is the expected result.

      The second sequence consists of three characters, at positions 2-74 2-75 2-74 (an “empty cell” at position 2-75 is inserted between the two characters). One way to handle this case would be to convert the whole empty cell to the replacement character (FFFD), so that the sequence is converted to 222C FFFD 222C. The current behavior, however, is that only the first byte of the empty cell is converted to FFFD. The second byte (E9) is then paired with the following byte (81) and decoded as 9A55, and the trailing E8 becomes another FFFD, so the sequence is converted to 222C FFFD 9A55 FFFD.
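
      The attached reproducer is not included here. A minimal sketch of an equivalent test might look like the following. The class and method names are hypothetical, and it assumes the runtime provides the "windows-31j" charset (part of the extended charsets in a standard JDK). It relies on `new String(byte[], Charset)` substituting U+FFFD for input it cannot decode.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical reproducer sketch; not the attachment from this report.
class Windows31JRepro {

    // Decodes the given Windows-31J bytes (undecodable input becomes U+FFFD),
    // re-encodes the result as UTF-16BE, and formats it as space-separated hex.
    static String toUtf16BeHex(int... windows31jBytes) {
        byte[] in = new byte[windows31jBytes.length];
        for (int i = 0; i < in.length; i++) {
            in[i] = (byte) windows31jBytes[i];
        }
        String decoded = new String(in, Charset.forName("windows-31j"));
        byte[] out = decoded.getBytes(StandardCharsets.UTF_16BE);

        StringBuilder sb = new StringBuilder();
        for (byte b : out) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Two identical characters at position 2-74.
        System.out.println("UTF_16BE : " + toUtf16BeHex(0x81, 0xE8, 0x81, 0xE8));
        // Same characters with the empty cell 2-75 (81 E9) in between.
        System.out.println("UTF_16BE : " + toUtf16BeHex(0x81, 0xE8, 0x81, 0xE9, 0x81, 0xE8));
    }
}
```

      Running `main` should print the two UTF_16BE lines to compare against the output quoted above.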


      After digging into the source code, my understanding is that the current behavior was introduced as part of the patch for https://bugs.openjdk.java.net/browse/JDK-8008386

      The specific change is in DoubleByte.java (http://hg.openjdk.java.net/jdk8/jdk8/jdk/rev/3b00bf85a6f5#l1.43). The fallback logic treats only the first byte as invalid if any of the following conditions is met: 1) the first byte is not a leading byte, 2) the second byte is a leading byte, 3) the second byte can be decoded as a single-byte character.
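
      One way to observe this classification from outside the JDK is to run a decoder with both error actions set to REPORT and inspect the returned CoderResult. The sketch below uses hypothetical class and method names and assumes the "windows-31j" charset is available. For the empty cell followed by a valid character (81 E9 81 E8), the analysis above suggests the decoder reports malformed input of length 1 rather than an unmappable character of length 2.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;

// Hypothetical probe; names are illustrative only.
class Windows31JProbe {

    // Decodes the bytes with error reporting enabled and returns the first
    // CoderResult: an error result, or UNDERFLOW if all input was consumed.
    static CoderResult firstResult(int... bytes) {
        byte[] in = new byte[bytes.length];
        for (int i = 0; i < in.length; i++) {
            in[i] = (byte) bytes[i];
        }
        CharsetDecoder decoder = Charset.forName("windows-31j").newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        // One char per input byte is always enough room for this charset.
        CharBuffer out = CharBuffer.allocate(in.length);
        return decoder.decode(ByteBuffer.wrap(in), out, true);
    }

    public static void main(String[] args) {
        CoderResult cr = firstResult(0x81, 0xE9, 0x81, 0xE8);
        System.out.println("malformed=" + cr.isMalformed()
                + " unmappable=" + cr.isUnmappable()
                + " length=" + (cr.isError() ? cr.length() : 0));
    }
}
```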

      For the scenario above (with the empty cell), the second byte is a valid leading byte, so only the first byte is replaced with FFFD. It might make sense to relax this check slightly by dropping condition 2), so that the empty cell is treated as an unmappable double-byte character.
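
      The difference between the current check and the proposed relaxation can be sketched as plain boolean logic. This is an illustration of the three conditions as described in this report, not the JDK source. The class, enum, and parameter names are hypothetical, and the byte classification (leading byte, single byte) is passed in rather than computed.

```java
// Hypothetical model of the fallback decision; not the DoubleByte.java code.
class FallbackSketch {

    enum Verdict { MALFORMED_1, UNMAPPABLE_2 }

    // Current behavior as described: any of the three conditions makes only
    // the first byte invalid.
    static Verdict current(boolean b1IsLeading, boolean b2IsLeading, boolean b2IsSingle) {
        if (!b1IsLeading || b2IsLeading || b2IsSingle) {
            return Verdict.MALFORMED_1;
        }
        return Verdict.UNMAPPABLE_2;
    }

    // Proposed relaxation: condition 2) (second byte is a leading byte) is dropped.
    static Verdict relaxed(boolean b1IsLeading, boolean b2IsLeading, boolean b2IsSingle) {
        if (!b1IsLeading || b2IsSingle) {
            return Verdict.MALFORMED_1;
        }
        return Verdict.UNMAPPABLE_2;
    }

    public static void main(String[] args) {
        // Empty cell 81 E9: 81 is a leading byte, and E9 is a leading byte
        // that is not a single-byte character.
        System.out.println("current: " + current(true, true, false));
        System.out.println("relaxed: " + relaxed(true, true, false));
    }
}
```

      With the relaxed check, both bytes of the empty cell would be consumed together, which corresponds to the 222C FFFD 222C result suggested above.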

            People

            • Assignee: Naoto Sato (naoto)
            • Reporter: Dmitry Cherepanov (dcherepanov)