  1. JDK
  2. JDK-4296969

Incorrect behaviours of several character converters

    Details

    • Subcomponent:
    • CPU:
      generic, x86
    • OS:
      generic, windows_nt, windows_2000

      Description

      'UN2' indicates a mapping to the '\u001A' character.
      'UN3' indicates a mapping to no character ('').
      'MIS' indicates a mapping from one character to an entirely different character
      (other than UN1 or UN2).

      (For the MacDingbat encoding, every mismatch mapping was to '\u271F'.)

      For 8859_1, EX1 = 0, EX2 = 0, UN1 = 64255, UN2 = 0, UN3 = 1024, MIS = 0.
      For 8859_2, EX1 = 0, EX2 = 0, UN1 = 64255, UN2 = 0, UN3 = 1024, MIS = 0.
      For 8859_3, EX1 = 0, EX2 = 0, UN1 = 64262, UN2 = 0, UN3 = 1024, MIS = 0.
      For 8859_4, EX1 = 0, EX2 = 0, UN1 = 64255, UN2 = 0, UN3 = 1024, MIS = 0.
      For 8859_5, EX1 = 0, EX2 = 0, UN1 = 64255, UN2 = 0, UN3 = 1024, MIS = 0.
      For 8859_6, EX1 = 0, EX2 = 0, UN1 = 64300, UN2 = 0, UN3 = 1024, MIS = 0.
      For 8859_7, EX1 = 0, EX2 = 0, UN1 = 64261, UN2 = 0, UN3 = 1024, MIS = 0.
      For 8859_8, EX1 = 0, EX2 = 0, UN1 = 64293, UN2 = 0, UN3 = 1024, MIS = 0.
      For 8859_9, EX1 = 0, EX2 = 0, UN1 = 64255, UN2 = 0, UN3 = 1024, MIS = 0.
      For Big5, EX1 = 1024, EX2 = 0, UN1 = 50680, UN2 = 0, UN3 = 0, MIS = 0.
      Exc1: [java.lang.InternalError: Converter malfunction:
      sun.io.CharToByteBig5]

      For CNS11643, EX1 = 0, EX2 = 0, UN1 = 47696, UN2 = 0, UN3 = 1, MIS = 0.
      For Cp037, EX1 = 0, EX2 = 0, UN1 = 0, UN2 = 64255, UN3 = 1024, MIS = 0.
      For Cp1006, EX1 = 0, EX2 = 0, UN1 = 64255, UN2 = 0, UN3 = 1024, MIS = 0.
      For Cp1025, EX1 = 0, EX2 = 0, UN1 = 0, UN2 = 64255, UN3 = 1024, MIS = 0.
      For Cp1026, EX1 = 0, EX2 = 0, UN1 = 0, UN2 = 64255, UN3 = 1024, MIS = 0.
      For Cp1046, EX1 = 0, EX2 = 0, UN1 = 64256, UN2 = 0, UN3 = 1024, MIS = 0.
      For Cp1097, EX1 = 0, EX2 = 0, UN1 = 0, UN2 = 64255, UN3 = 1024, MIS = 0.
      For Cp1098, EX1 = 0, EX2 = 0, UN1 = 64258, UN2 = 0, UN3 = 1024, MIS = 0.
      For Cp1112, EX1 = 0, EX2 = 0, UN1 = 0, UN2 = 64255, UN3 = 1024, MIS = 0.
      For Cp1122, EX1 = 0, EX2 = 0, UN1 = 0, UN2 = 64255, UN3 = 1024, MIS = 0.
      For Cp1123, EX1 = 0, EX2 = 0, UN1 = 0, UN2 = 64255, UN3 = 1024, MIS = 0.
      For Cp1124, EX1 = 0, EX2 = 0, UN1 = 64255, UN2 = 0, UN3 = 1024, MIS = 0.
      For Cp1250, EX1 = 0, EX2 = 0, UN1 = 64261, UN2 = 0, UN3 = 1024, MIS = 0.
      For Cp1251, EX1 = 0, EX2 = 0, UN1 = 64256, UN2 = 0, UN3 = 1024, MIS = 0.
      For Cp1252, EX1 = 0, EX2 = 0, UN1 = 64262, UN2 = 0, UN3 = 1024, MIS = 0.
      For Cp1253, EX1 = 0, EX2 = 0, UN1 = 64272, UN2 = 0, UN3 = 1024, MIS = 0.
      For Cp1254, EX1 = 0, EX2 = 0, UN1 = 64262, UN2 = 0, UN3 = 1024, MIS = 0.
      For Cp1255, EX1 = 0, EX2 = 0, UN1 = 64284, UN2 = 0, UN3 = 1024, MIS = 0.
      For Cp1256, EX1 = 0, EX2 = 0, UN1 = 64263, UN2 = 0, UN3 = 1024, MIS = 0.
      For Cp1257, EX1 = 0, EX2 = 0, UN1 = 64267, UN2 = 0, UN3 = 1024, MIS = 0.
      For Cp1258, EX1 = 0, EX2 = 0, UN1 = 64264, UN2 = 0, UN3 = 1024, MIS = 0.
      For Cp1381, EX1 = 0, EX2 = 0, UN1 = 55022, UN2 = 0, UN3 = 1024, MIS = 0.
      For Cp1383, EX1 = 0, EX2 = 0, UN1 = 55517, UN2 = 0, UN3 = 1024, MIS = 0.
      For Cp273, EX1 = 0, EX2 = 0, UN1 = 0, UN2 = 64255, UN3 = 1024, MIS = 0.
      For Cp277, EX1 = 0, EX2 = 0, UN1 = 0, UN2 = 64255, UN3 = 1024, MIS = 0.
      For Cp278, EX1 = 0, EX2 = 0, UN1 = 0, UN2 = 64255, UN3 = 1024, MIS = 0.
      For Cp280, EX1 = 0, EX2 = 0, UN1 = 0, UN2 = 64255, UN3 = 1024, MIS = 0.
      For Cp284, EX1 = 0, EX2 = 0, UN1 = 0, UN2 = 64255, UN3 = 1024, MIS = 0.
      For Cp285, EX1 = 0, EX2 = 0, UN1 = 0, UN2 = 64255, UN3 = 1024, MIS = 0.
      For Cp297, EX1 = 0, EX2 = 0, UN1 = 0, UN2 = 64255, UN3 = 1024, MIS = 0.
      For Cp33722, EX1 = 0, EX2 = 0, UN1 = 55140, UN2 = 0, UN3 = 1024, MIS = 0.
      For Cp420, EX1 = 0, EX2 = 0, UN1 = 0, UN2 = 64263, UN3 = 1024, MIS = 0.
      For Cp424, EX1 = 0, EX2 = 0, UN1 = 0, UN2 = 64293, UN3 = 1024, MIS = 0.
      For Cp437, EX1 = 0, EX2 = 0, UN1 = 64255, UN2 = 0, UN3 = 1024, MIS = 0.
      For Cp500, EX1 = 0, EX2 = 0, UN1 = 0, UN2 = 64255, UN3 = 1024, MIS = 0.
      For Cp737, EX1 = 0, EX2 = 0, UN1 = 64255, UN2 = 0, UN3 = 1024, MIS = 0.
      For Cp775, EX1 = 0, EX2 = 0, UN1 = 64255, UN2 = 0, UN3 = 1024, MIS = 0.
      For Cp838, EX1 = 0, EX2 = 0, UN1 = 0, UN2 = 64260, UN3 = 1024, MIS = 0.
      For Cp850, EX1 = 0, EX2 = 0, UN1 = 64255, UN2 = 0, UN3 = 1024, MIS = 0.
      For Cp852, EX1 = 0, EX2 = 0, UN1 = 64255, UN2 = 0, UN3 = 1024, MIS = 0.
      For Cp855, EX1 = 0, EX2 = 0, UN1 = 64255, UN2 = 0, UN3 = 1024, MIS = 0.
      For Cp857, EX1 = 0, EX2 = 0, UN1 = 64258, UN2 = 0, UN3 = 1024, MIS = 0.
      For Cp860, EX1 = 0, EX2 = 0, UN1 = 64255, UN2 = 0, UN3 = 1024, MIS = 0.
      For Cp861, EX1 = 0, EX2 = 0, UN1 = 64255, UN2 = 0, UN3 = 1024, MIS = 0.
      For Cp862, EX1 = 0, EX2 = 0, UN1 = 64255, UN2 = 0, UN3 = 1024, MIS = 0.
      For Cp863, EX1 = 0, EX2 = 0, UN1 = 64255, UN2 = 0, UN3 = 1024, MIS = 0.
      For Cp864, EX1 = 0, EX2 = 0, UN1 = 64261, UN2 = 0, UN3 = 1024, MIS = 0.
      For Cp865, EX1 = 0, EX2 = 0, UN1 = 64255, UN2 = 0, UN3 = 1024, MIS = 0.
      For Cp866, EX1 = 0, EX2 = 0, UN1 = 64255, UN2 = 0, UN3 = 1024, MIS = 0.
      For Cp868, EX1 = 0, EX2 = 0, UN1 = 64256, UN2 = 0, UN3 = 1024, MIS = 0.
      For Cp869, EX1 = 0, EX2 = 0, UN1 = 64264, UN2 = 0, UN3 = 1024, MIS = 0.
      For Cp870, EX1 = 0, EX2 = 0, UN1 = 0, UN2 = 64255, UN3 = 1024, MIS = 0.
      For Cp871, EX1 = 0, EX2 = 0, UN1 = 0, UN2 = 64255, UN3 = 1024, MIS = 0.
      For Cp874, EX1 = 0, EX2 = 0, UN1 = 64291, UN2 = 0, UN3 = 1024, MIS = 0.
      For Cp875, EX1 = 0, EX2 = 0, UN1 = 0, UN2 = 64261, UN3 = 1024, MIS = 0.
      For Cp918, EX1 = 0, EX2 = 0, UN1 = 0, UN2 = 64255, UN3 = 1024, MIS = 0.
      For Cp921, EX1 = 0, EX2 = 0, UN1 = 64255, UN2 = 0, UN3 = 1024, MIS = 0.
      For Cp922, EX1 = 0, EX2 = 0, UN1 = 64255, UN2 = 0, UN3 = 1024, MIS = 0.
      For Cp930, EX1 = 11635, EX2 = 0, UN1 = 52648, UN2 = 0, UN3 = 1026, MIS = 0.
      Exc1: [java.lang.InternalError: Converter malfunction:
      sun.io.CharToByteCp930]

      For Cp933, EX1 = 10888, EX2 = 0, UN1 = 53406, UN2 = 0, UN3 = 1026, MIS = 0.
      Exc1: [java.lang.InternalError: Converter malfunction:
      sun.io.CharToByteCp933]

      For Cp935, EX1 = 9356, EX2 = 0, UN1 = 54990, UN2 = 0, UN3 = 1026, MIS = 0.
      Exc1: [java.lang.InternalError: Converter malfunction:
      sun.io.CharToByteCp935]

      For Cp937, EX1 = 20075, EX2 = 0, UN1 = 44273, UN2 = 0, UN3 = 1026, MIS = 0.
      Exc1: [java.lang.InternalError: Converter malfunction:
      sun.io.CharToByteCp937]

      For Cp939, EX1 = 11635, EX2 = 0, UN1 = 52648, UN2 = 0, UN3 = 1026, MIS = 0.
      Exc1: [java.lang.InternalError: Converter malfunction:
      sun.io.CharToByteCp939]

      For Cp942, EX1 = 0, EX2 = 0, UN1 = 55170, UN2 = 0, UN3 = 1024, MIS = 0.
      For Cp948, EX1 = 0, EX2 = 0, UN1 = 44305, UN2 = 0, UN3 = 1024, MIS = 0.
      For Cp949, EX1 = 0, EX2 = 0, UN1 = 54144, UN2 = 0, UN3 = 1024, MIS = 130.
      For Cp950, EX1 = 0, EX2 = 0, UN1 = 44308, UN2 = 0, UN3 = 1024, MIS = 0.
      For Cp964, EX1 = 0, EX2 = 0, UN1 = 44278, UN2 = 0, UN3 = 1024, MIS = 0.
      For Cp970, EX1 = 0, EX2 = 0, UN1 = 55819, UN2 = 0, UN3 = 1024, MIS = 122.
      For EUCJIS, EX1 = 1024, EX2 = 0, UN1 = 51372, UN2 = 0, UN3 = 0, MIS = 2.
      Exc1: [java.lang.InternalError: Converter malfunction:
      sun.io.CharToByteEUC_JP]

      For GB2312, EX1 = 1024, EX2 = 0, UN1 = 56938, UN2 = 0, UN3 = 0, MIS = 0.
      Exc1: [java.lang.InternalError: Converter malfunction:
      sun.io.CharToByteEUC_CN]

      For GBK, EX1 = 1024, EX2 = 0, UN1 = 40443, UN2 = 0, UN3 = 0, MIS = 0.
      Exc1: [java.lang.InternalError: Converter malfunction:
      sun.io.CharToByteGBK]

      For ISO2022CN_CNS, EX1 = 7650, EX2 = 57885, UN1 = 0, UN2 = 0, UN3 = 0, MIS = 0.
      Exc1: [java.lang.ArrayIndexOutOfBoundsException]
      Exc2: [java.io.UnsupportedEncodingException: ISO2022CN_CNS]

      For ISO2022CN_GB, EX1 = 0, EX2 = 65535, UN1 = 0, UN2 = 0, UN3 = 0, MIS = 0.
      Exc2: [java.io.UnsupportedEncodingException: ISO2022CN_GB]

      For ISO2022KR, EX1 = 0, EX2 = 8224, UN1 = 0, UN2 = 0, UN3 = 57186, MIS = 0.
      Exc2: [java.lang.NullPointerException]

      For JIS, EX1 = 1024, EX2 = 0, UN1 = 57439, UN2 = 0, UN3 = 3, MIS = 0.
      Exc1: [java.lang.InternalError: Converter malfunction:
      sun.io.CharToByteISO2022JP]

      For JIS0208, EX1 = 1024, EX2 = 0, UN1 = 0, UN2 = 0, UN3 = 57632, MIS = 0.
      Exc1: [java.lang.InternalError: Converter malfunction:
      sun.io.CharToByteJIS0208]

      For KOI8_R, EX1 = 0, EX2 = 0, UN1 = 64255, UN2 = 0, UN3 = 1024, MIS = 0.
      For KSC5601, EX1 = 1024, EX2 = 0, UN1 = 56159, UN2 = 0, UN3 = 0, MIS = 0.
      Exc1: [java.lang.InternalError: Converter malfunction:
      sun.io.CharToByteEUC_KR]

      For MS874, EX1 = 0, EX2 = 0, UN1 = 64287, UN2 = 0, UN3 = 1024, MIS = 0.
      For MacArabic, EX1 = 0, EX2 = 0, UN1 = 64281, UN2 = 0, UN3 = 1024, MIS = 0.
      For MacCentralEurope, EX1 = 0, EX2 = 0, UN1 = 64255, UN2 = 0, UN3 = 1024, MIS = 0.
      For MacCroatian, EX1 = 0, EX2 = 0, UN1 = 64255, UN2 = 0, UN3 = 1024, MIS = 0.
      For MacCyrillic, EX1 = 0, EX2 = 0, UN1 = 64255, UN2 = 0, UN3 = 1024, MIS = 0.
      For MacDingbat, EX1 = 0, EX2 = 0, UN1 = 0, UN2 = 0, UN3 = 1024, MIS = 64290.
      For MacGreek, EX1 = 0, EX2 = 0, UN1 = 64256, UN2 = 0, UN3 = 1024, MIS = 0.
      For MacHebrew, EX1 = 0, EX2 = 0, UN1 = 64297, UN2 = 0, UN3 = 1024, MIS = 0.
      For MacIceland, EX1 = 0, EX2 = 0, UN1 = 64255, UN2 = 0, UN3 = 1024, MIS = 0.
      For MacRoman, EX1 = 0, EX2 = 0, UN1 = 64255, UN2 = 0, UN3 = 1024, MIS = 0.
      For MacRomania, EX1 = 0, EX2 = 0, UN1 = 64255, UN2 = 0, UN3 = 1024, MIS = 0.
      For MacSymbol, EX1 = 0, EX2 = 0, UN1 = 64311, UN2 = 0, UN3 = 1024, MIS = 0.
      For MacThai, EX1 = 0, EX2 = 0, UN1 = 64261, UN2 = 0, UN3 = 1024, MIS = 0.
      For MacTurkish, EX1 = 0, EX2 = 0, UN1 = 64256, UN2 = 0, UN3 = 1024, MIS = 0.
      For MacUkraine, EX1 = 0, EX2 = 0, UN1 = 64255, UN2 = 0, UN3 = 1024, MIS = 0.
      For SJIS, EX1 = 1024, EX2 = 0, UN1 = 57439, UN2 = 0, UN3 = 0, MIS = 2.
      Exc1: [java.lang.InternalError: Converter malfunction:
      sun.io.CharToByteSJIS]

      For UTF8, EX1 = 0, EX2 = 0, UN1 = 1024, UN2 = 0, UN3 = 0, MIS = 0.

      I came across this bug while trying to convert between different encodings. I was
      trying to get some idea of the data loss, but because so many different methods
      are used to indicate 'no mapping', this was made very difficult. Much of this
      would be addressed by bug 4241124. I also read several bugs indicating that not
      all encodings are 'reversible', which accounts for many of the 'EX2' errors.
      However, what I cannot understand is how I can map a character from Unicode to
      byte[] and back to Unicode, and get an entirely different character! This must
      be an error in the underlying conversion tables.

      I think that at the very least these inconsistencies between encodings should be
      documented somewhere. I had been under the impression that 'no mapping' would
      be indicated by '?' in the native form, and with the SUBSTITUTE character in
      Unicode. I was not aware that some characters would be omitted in the
      conversion, that different methods would be used to indicate 'no mapping'
      within the same encoding, that all sorts of errors could be generated, or that
      conversions were not reversible.
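
For anyone wanting to reproduce counts like the ones above, the per-character classification can be sketched roughly as follows. This is not the program that produced the numbers in this report; the class name `RoundTripClassifier`, the `Result` enum, and `classify` are illustrative names, and the rule of inspecting only the first decoded character is an assumption. On a current JDK the converters behave differently (unmappable characters are typically replaced rather than raising `InternalError`), so the totals will not match.

```java
import java.io.UnsupportedEncodingException;

// Hypothetical helper, not part of the original report.
class RoundTripClassifier {
    // Categories as defined in the report's legend, plus OK for a clean round trip.
    enum Result { OK, EX1, EX2, UN1, UN2, UN3, MIS }

    static Result classify(char c, String enc) {
        byte[] bytes;
        try {
            bytes = String.valueOf(c).getBytes(enc);   // Unicode -> byte[]
        } catch (UnsupportedEncodingException | RuntimeException | InternalError e) {
            return Result.EX1;
        }
        String back;
        try {
            back = new String(bytes, enc);             // byte[] -> Unicode
        } catch (UnsupportedEncodingException | RuntimeException | InternalError e) {
            return Result.EX2;
        }
        if (back.isEmpty()) return Result.UN3;         // mapped to no character
        char r = back.charAt(0);                       // assumption: first char only
        if (r == c) return Result.OK;                  // checked first so '?' itself is OK
        if (r == '?') return Result.UN1;               // '\u003F'
        if (r == (char) 0x1A) return Result.UN2;       // SUBSTITUTE, '\u001A'
        return Result.MIS;                             // some other character
    }

    public static void main(String[] args) {
        String enc = args.length > 0 ? args[0] : "ISO-8859-1";
        int[] counts = new int[Result.values().length];
        for (int c = Character.MIN_VALUE; c <= Character.MAX_VALUE; c++) {
            counts[classify((char) c, enc).ordinal()]++;
        }
        for (Result r : Result.values()) {
            System.out.println(r + " = " + counts[r.ordinal()]);
        }
    }
}
```

Running it with an encoding name as the only argument prints one total per category for that encoding.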
      (Review ID: 100000)
      ======================================================================

      Name: skT45625 Date: 05/09/2000


      java version "1.3.0rc1"
      Java(TM) 2 Runtime Environment, Standard Edition (build 1.3.0rc1-T)
      Java HotSpot(TM) Client VM (build 1.3.0rc1-S, mixed mode)

      1. Open the command prompt in Korean Windows 2000
      (the default codepage is 949).
      2. Run the code below as follows:
      C:> java -Duser.language=en -Duser.region=US -classpath . ShowLocale
      import java.util.Locale;
      public class ShowLocale {
              public static void main(String[] args) {
                      System.out.println("default locale is " + Locale.getDefault());
              }
      }
      3. The result is:
      default locale is ko_KR
      I expected en_US.
      4. If I change the codepage in the console like this,
      C:> chcp 1252
      then everything works as expected:
      C:> java -Duser.language=blah -Duser.region=YADDA -classpath . ShowLocale
      The result is:
      default locale is blah_YADDA

      This problem also occurs in JDK 1.2.2.
      (Review ID: 102774)
      ======================================================================


      Name: krT82822 Date: 12/05/99


      12/5/99 eval1127@eng -- kestrel RA produces errors for several of the codepages. Submitting this to supplement existing encoding bugs open for kestrel.

      /*
      J:\borsotti\jtest>java -version
      java version "1.2"
      Classic VM (build JDK-1.2-V, native threads)

      There are several problems with the character converters.
      They can be summarized as follows:

        - converters which are listed in the jdk documentation,
          but do not exist,
        - converters which do not map all Unicode characters, or
          do not decode back (to) what they encoded,
        - converters which crash

      This java program tests each converter in turn and reports
      the errors found:
      */

      import java.io.*;
      import java.util.*;
      public class EncErr {

          /**
           * This is the list of encodings reported in
           *
      http://java.sun.com/products/jdk/1.2/docs/guide/internat/encoding.doc.html
           */

          private static String[] encodings = new String[] {
               "ASCII", // ASCII
               "ISO8859_1", // ISO 8859-1
               "ISO8859_2", // ISO 8859-2
               "ISO8859_3", // ISO 8859-3
               "ISO8859_4", // ISO 8859-4
               "ISO8859_5", // ISO 8859-5
               "ISO8859_6", // ISO 8859-6
               "ISO8859_7", // ISO 8859-7
               "ISO8859_8", // ISO 8859-8
               "ISO8859_9", // ISO 8859-9
               "Big5", // Big5, Traditional Chinese
               "Cp037", // USA, Canada(Bilingual, French), Netherlands, Portugal, Brazil, Australia
               "Cp1006", // IBM AIX Pakistan (Urdu)
               "Cp1025", // IBM Multilingual Cyrillic: Bulgaria, Bosnia, Herzegovina, Macedonia(FYR)
               "Cp1026", // IBM Latin-5, Turkey
               "Cp1046", // IBM Open Edition US EBCDIC
               "Cp1097", // IBM Iran(Farsi)/Persian
               "Cp1098", // IBM Iran(Farsi)/Persian (PC)
               "Cp1112", // IBM Latvia, Lithuania
               "Cp1122", // IBM Estonia
               "Cp1123", // IBM Ukraine
               "Cp1124", // IBM AIX Ukraine
               "Cp1250", // Windows Eastern European
               "Cp1251", // Windows Cyrillic
               "Cp1252", // Windows Latin-1
               "Cp1253", // Windows Greek
               "Cp1254", // Windows Turkish
               "Cp1255", // Windows Hebrew
               "Cp1256", // Windows Arabic
               "Cp1257", // Windows Baltic
               "Cp1258", // Windows Vietnamese
               "Cp1381", // IBM OS/2, DOS People's Republic of China (PRC)
               "Cp1383", // IBM AIX People's Republic of China (PRC)
               "Cp273", // IBM Austria, Germany
               "Cp277", // IBM Denmark, Norway
               "Cp278", // IBM Finland, Sweden
               "Cp280", // IBM Italy
               "Cp284", // IBM Catalan/Spain, Spanish Latin America
               "Cp285", // IBM United Kingdom, Ireland
               "Cp297", // IBM France
               "Cp33722", // IBM-eucJP - Japanese (superset of 5050)
               "Cp420", // IBM Arabic
               "Cp424", // IBM Hebrew
               "Cp437", // MS-DOS United States, Australia, New Zealand, South Africa
               "Cp500", // EBCDIC 500V1
               "Cp737", // PC Greek
               "Cp775", // PC Baltic
               "Cp838", // IBM Thailand extended SBCS
               "Cp850", // MS-DOS Latin-1
               "Cp852", // MS-DOS Latin-2
               "Cp855", // IBM Cyrillic
               "Cp857", // IBM Turkish
               "Cp860", // MS-DOS Portuguese
               "Cp861", // MS-DOS Icelandic
               "Cp862", // PC Hebrew
               "Cp863", // MS-DOS Canadian French
               "Cp864", // PC Arabic
               "Cp865", // MS-DOS Nordic
               "Cp866", // MS-DOS Russian
               "Cp868", // MS-DOS Pakistan
               "Cp869", // IBM Modern Greek
               "Cp870", // IBM Multilingual Latin-2
               "Cp871", // IBM Iceland
               "Cp874", // IBM Thai
               "Cp875", // IBM Greek
               "Cp918", // IBM Pakistan(Urdu)
               "Cp921", // IBM Latvia, Lithuania (AIX, DOS)
               "Cp922", // IBM Estonia (AIX, DOS)
               "Cp930", // Japanese Katakana-Kanji mixed with 4370 UDC, superset of 5026
               "Cp933", // Korean Mixed with 1880 UDC, superset of 5029
               "Cp935", // Simplified Chinese Host mixed with 1880 UDC, superset of 5031
               "Cp937", // Traditional Chinese Host mixed with 6204 UDC, superset of 5033
               "Cp939", // Japanese Latin Kanji mixed with 4370 UDC, superset of 5035
               "Cp942", // Japanese (OS/2) superset of 932
               "Cp948", // OS/2 Chinese (Taiwan) superset of 938
               "Cp949", // PC Korean
               "Cp950", // PC Chinese (Hong Kong, Taiwan)
               "Cp964", // AIX Chinese (Taiwan)
               "Cp970", // AIX Korean
               "EUC_CN", // GB2312, EUC encoding, Simplified Chinese
               "EUC_JP", // JIS0201, 0208, 0212, EUC Encoding, Japanese
               "EUC_KR", // KS C 5601, EUC Encoding, Korean
               "EUC_TW", // CNS11643 (Plane 1-3), T. Chinese, EUC encoding
               "GBK", // GBK, Simplified Chinese
               "ISO2022CN", // ISO 2022 CN, Chinese
               "ISO2022CN_CNS", // CNS 11643 in ISO-2022-CN form, T. Chinese
               "ISO2022CN_GB", // GB 2312 in ISO-2022-CN form, S. Chinese
               "ISO2022JP", // JIS0201, 0208, 0212, ISO2022 Encoding, Japanese
               "ISO2022KR", // ISO 2022 KR, Korean
               "JIS0201", // JIS 0201, Japanese
               "JIS0208", // JIS 0208, Japanese
               "JIS0212", // JIS 0212, Japanese
               "KOI8_R", // KOI8-R, Russian
               "MS874", // Windows Thai
               "MacArabic", // Macintosh Arabic
               "MacCentralEurope", // Macintosh Latin-2
               "MacCroatian", // Macintosh Croatian
               "MacCyrillic", // Macintosh Cyrillic
               "MacDingbat", // Macintosh Dingbat
               "MacGreek", // Macintosh Greek
               "MacHebrew", // Macintosh Hebrew
               "MacIceland", // Macintosh Iceland
               "MacRoman", // Macintosh Roman
               "MacRomania", // Macintosh Romania
               "MacSymbol", // Macintosh Symbol
               "MacThai", // Macintosh Thai
               "MacTurkish", // Macintosh Turkish
               "MacUkraine", // Macintosh Ukraine
               "SJIS", // Shift-JIS, Japanese
               "UTF8", // UTF-8
               };

          /**
           * Test an encoding. The following tests are done:
           * <ol>
           * <li>the existence of the encoder
           * <li>the existence of the decoder
           * <li>each character which is defined in Unicode is encoded,
           * and then the result is decoded. The number of characters which
           * are not encoded, or are encoded into an empty sequence of octets,
           * or are encoded into a sequence which, once decoded, produces
           * a character different from both the original one and '?',
           * is counted.
           * <li>several long strings are encoded and then decoded, and checked
           * to be equal (apart from characters mapped into '?') to the original.
           * </ol>
           * The third and fourth steps are done only if the previous are successful.
           * In the last step, only characters which are encoded correctly are
           * used.
           *
           * @param enc name of the encoding
           */

          private static void test(String enc){
              System.err.println("------ test ------- " + enc);

              // test existence of decoder (bytes -> chars)

              boolean both = true;
              try {
                  byte[] bb = new byte[] {0};
                  String str = new String(bb,enc);
              } catch (UnsupportedEncodingException th){
                  System.err.println("decoder " + enc + " not available");
                  both = false;
              }

              // test existence of encoder (chars -> bytes)

              try {
                  byte[] bb = "abc".getBytes(enc);
              } catch (UnsupportedEncodingException th){
                  System.err.println("encoder " + enc + " not available");
                  both = false;
              }
              if (!both) return;

              // test mapping

              // remember which character is valid for the round-trip test

              boolean[] valid = new boolean[Character.MAX_VALUE+1];
              try {
                  int nrEmpty = 0;
                  int nrUnmapped = 0;
                  int nrNoBack = 0;
                  int nrDiffBack = 0;
                  for (int c = Character.MIN_VALUE; c <= Character.MAX_VALUE; c++){
                      if (!Character.isDefined((char)c)) continue;
                      valid[c] = true;
                      String s = String.valueOf((char)c);
                      byte[] bb = null;
                      try {
                          bb = s.getBytes(enc);
                          if (bb.length == 0){
                              nrEmpty++;
                              valid[c] = false;
                              continue;
                          }
                      } catch (InternalError tr){
                          nrUnmapped++;
                          valid[c] = false;
                          continue;
                      }
                      try {
                          String str = new String(bb,enc);
                          if (str.length() != 1){
                              nrNoBack++;
                              valid[c] = false;
                              continue;
                          }
                          if ((str.charAt(0) != (char)c) &&
                             (str.charAt(0) != '?')){
                              nrDiffBack++;
                              valid[c] = false;
                              continue;
                          }
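                      } catch (InternalError tr){
                          // decoding failed: count it and exclude the character
                          // from the round-trip test below
                          nrNoBack++;
                          valid[c] = false;
                      }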
                  }
                  if (nrUnmapped > 0){
                      System.err.println(enc + " has " + nrUnmapped + " unmapped characters");
                  }
                  if (nrEmpty > 0){
                      System.err.println(enc + " has " + nrEmpty + " empty mapped characters");
                  }
                  if (nrNoBack > 0){
                      System.err.println(enc + " does not convert back " + nrNoBack + " characters");
                  }
                  if (nrDiffBack > 0){
                      System.err.println(enc + " converts back " + nrDiffBack + " characters into a different one");
                  }
                  if (nrDiffBack > Character.MAX_VALUE / 2) return;
              } catch (Throwable th){
                  System.err.println("encoding " + enc + " mapping error " + th);
                  th.printStackTrace(System.err);
              }

              // test round-trip

              trip: for (int k = 0; k < 100; k++){
                  byte[] bb = null;
                  char[] ca = new char[10000];
                  Random r = new Random();
                  for (int i = 0; i < ca.length; i++){
                      do {
                          ca[i] = (char)r.nextInt(Character.MAX_VALUE);
                      } while (!valid[ca[i]]);
                  }
                  String old = String.valueOf(ca);
                  try {
                      bb = old.getBytes(enc);
                      if (bb == null){
                          System.err.println(enc + " empty encoding");
                          return;
                      }
                  } catch (InternalError th){
                      System.err.println(enc + " round-trip encoding error");
                      break trip;
                  } catch (UnsupportedEncodingException th){
                  }
                  try {
                      String str = new String(bb,enc);
                      if (!old.equals(str)){
                          if (old.length() != str.length()){
                              System.err.println("encoding " + enc +
                                  " round-trip " + old.length() +
                                  " back to " + str.length());
                              break trip;
                          }
                          for (int i = 0; i < ca.length && i < str.length(); i++){
                              if ((old.charAt(i) != str.charAt(i)) &&
                                  (str.charAt(i) != '?')){
                                  System.err.println(enc + " round-trip compare error");
                                  break trip;
                              }
                          }
                      }
                  } catch (InternalError th){
                      System.err.println(enc + " round-trip decoding error");
                      break trip;
                  } catch (UnsupportedEncodingException th){
                  }
              }
          }

          /**
           * Tests all encodings. On all encodings the tests defined above
           * are performed. Moreover, some specific tests are done on ISO2022CN
           * and ISO2022KR.
           */

          public static void main(String[] args){

              for (int i = 0; i < encodings.length; i++){
                  test(encodings[i]);
              }

              try {
                  byte[] bb = new byte[] {(byte)0x1b, (byte)')', (byte)'x'};
                  String str = new String(bb,"ISO2022CN");
              } catch (Throwable th){
                  System.err.println("ISO2022CN error " + th);
              }

              try {
                  byte[] bb = new byte[] {(byte)0x1b, (byte)')', (byte)'x'};
                  String str = new String(bb,"ISO2022KR");
              } catch (Throwable th){
                  System.err.println("ISO2022KR error " + th);
              }

          }
      }

      /*
      When run, it reports a considerable amount of errors.

      Feel free to use it, and include in your test suite if you
      like.
      */
      (Review ID: 98558)
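
The existence check and the "unmapped characters" count in the program above can also be approximated with `java.nio.charset`, which was added to the JDK after this report was filed. The following is a minimal sketch under that assumption; `ConverterProbe`, `isAvailable`, and `countUnmappable` are illustrative names, and skipping surrogate code units is a choice made here, not something the original test does.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.IllegalCharsetNameException;

// Hypothetical helper, not part of the original report.
class ConverterProbe {
    // True if the runtime knows a charset by this name.
    static boolean isAvailable(String name) {
        try {
            return Charset.isSupported(name);
        } catch (IllegalCharsetNameException e) {
            return false;   // syntactically illegal name
        }
    }

    // Number of defined, non-surrogate BMP characters the charset cannot
    // encode, or -1 if the charset is decode-only.
    static int countUnmappable(String name) {
        Charset cs = Charset.forName(name);
        if (!cs.canEncode()) return -1;
        CharsetEncoder encoder = cs.newEncoder();
        int unmappable = 0;
        for (int c = Character.MIN_VALUE; c <= Character.MAX_VALUE; c++) {
            char ch = (char) c;
            if (!Character.isDefined(ch) || Character.isSurrogate(ch)) continue;
            if (!encoder.canEncode(ch)) unmappable++;
        }
        return unmappable;
    }

    public static void main(String[] args) {
        String name = args.length > 0 ? args[0] : "US-ASCII";
        if (!isAvailable(name)) {
            System.err.println(name + " not available");
            return;
        }
        System.err.println(name + " has " + countUnmappable(name) + " unmapped characters");
    }
}
```

Unlike the round-trip approach, this asks the encoder directly and so does not depend on which substitute byte a converter happens to emit.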
      ======================================================================

      Name: krT82822 Date: 02/08/2000


      java version "1.2.2"
      HotSpot VM (1.0.1, mixed mode, build g)

      When using String to convert from native encodings to Unicode and back
      again, different encodings behave erratically when dealing with characters for
      which there is no direct match. Specifically, some encodings indicate a
      mismatch by mapping the character to '\u003F', '\u001A', or even no
      character (''). What is worse is that within a single encoding, multiple methods
      are used. In some cases, conversions throw undocumented exceptions. The worst
      behavior is when a conversion from Unicode to byte[] and back again does not
      generate an 'unknown' mapping or an exception, but maps to an entirely different
      character.
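
As an illustration of what a single, explicit 'no mapping' signal could look like, here is a minimal sketch using the `java.nio.charset` encoder API, which postdates this report. It is an assumption about how a caller might avoid the ambiguity described above, not a description of how the `String` conversions behave; `StrictEncode` and `encodeOrNull` are illustrative names.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;

// Hypothetical helper, not part of the original report.
class StrictEncode {
    // Returns the encoded bytes, or null if any character has no mapping
    // in the target charset (instead of silently substituting '?' or SUB).
    static byte[] encodeOrNull(String s, String charsetName) {
        try {
            ByteBuffer out = Charset.forName(charsetName).newEncoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .encode(CharBuffer.wrap(s));
            byte[] bytes = new byte[out.remaining()];
            out.get(bytes);
            return bytes;
        } catch (CharacterCodingException e) {
            return null;    // unmappable or malformed input was reported
        }
    }

    public static void main(String[] args) {
        System.out.println(encodeOrNull("abc", "US-ASCII") != null);
        System.out.println(encodeOrNull("\u00e9", "US-ASCII") != null);
    }
}
```

With `CodingErrorAction.REPORT`, data loss shows up as an exception at the call site rather than as a substitute character that has to be detected afterwards.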

      My general technique for identifying these bugs was to step through all the
      Unicode characters for every encoding, and document the results. For each
      character, I'd convert from Unicode to byte[], and then from byte[] back to Unicode.

      'EX1' indicates an error converting from unicode to byte[].
      'EX2' indicates an error converting from byte[] to unicode.
      'UN1' indicates a mapping to the '\u003F' character.
      'UN2' indicates a mapping to the '

              People

              Assignee:
              ilittlesunw Ian Little (Inactive)
              Reporter:
              kryansunw Kevin Ryan (Inactive)