JDK-8177951

Charset problem when the name of the sound device contains Chinese character.

    Details

    • Subcomponent:
    • Resolved In Build:
      b23
    • CPU:
      x86
    • OS:
      other

      Description

      FULL PRODUCT VERSION :
      java version "1.8.0_121"
      Java(TM) SE Runtime Environment (build 1.8.0_121-b13)
      Java HotSpot(TM) 64-Bit Server VM (build 25.121-b13, mixed mode)

      ADDITIONAL OS VERSION INFORMATION :
      Windows 10 64-bit build 15061
      Simplified Chinese version. System default charset is GBK.

      EXTRA RELEVANT SYSTEM CONFIGURATION :
      Realtek built-in sound card.

      A DESCRIPTION OF THE PROBLEM :
      I'm working on a program that uses the Java Sound API. When I enumerate the names of the sound devices, the Chinese characters in the names (encoded in the system default charset, GBK) come out garbled.
      For example:
      The name in Control Panel: "扬声器 (Realtek High Definition Audio)"
      The name that Mixer.getMixerInfo().getName() returns: "ÑïÉùÆ÷ (Realtek High Definition Audio)"
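
      To see exactly which characters a returned name contains, independent of console encoding, it can help to print its code points. The following is a minimal sketch; the class name NameDump and the helper codePoints are hypothetical and not part of the Java Sound API.

```java
class NameDump {
    // Render each code point of a string as U+XXXX, separated by spaces.
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        // First characters of the garbled name from the report, written
        // with escapes so the source file stays pure ASCII.
        System.out.println(codePoints("\u00D1\u00EF (R"));
    }
}
```

      In a real run, the argument would be the value of info.getName() for each Mixer.Info returned by AudioSystem.getMixerInfo(). Code points in the range U+0080 to U+00FF where Chinese characters were expected suggest that each native byte was widened to one character.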

      I don't have a Linux platform so I can't test under that. :(

      The cause is a mistake on Java's side: it treats each GBK byte as if it were a Unicode code point and encodes it as UTF-8. For example, the GBK code of the character '扬' is 0xD1EF (2 bytes) = 0b11010001 0b11101111.
      What happens when Java reads it? It encodes each byte as UTF-8 (see https://en.m.wikipedia.org/wiki/UTF-8#Description). The first byte is split into 0b11 and 0b010001; prefixing these with 0b110 (plus zero padding) and 0b10 gives 0b11000011 0b10010001 = 0xC391. Doing the same with the second byte gives 0b11000011 0b10101111 = 0xC3AF.
      So the problem is that the value Java encodes is not a Unicode code point but a GBK byte, which makes the original name hard to recover. This should be fixed in a future release.
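
      The byte arithmetic above can be reproduced in a few lines. This is only a sketch of the effect described in this report, not the JDK's actual code path: it assumes that each native byte is widened to one char, which is equivalent to decoding the bytes as ISO-8859-1. The class name MisencodingDemo and its methods are hypothetical.

```java
import java.nio.charset.StandardCharsets;

class MisencodingDemo {
    // Widen each raw byte to a char (equivalent to ISO-8859-1 decoding),
    // then encode the resulting string as UTF-8.
    static byte[] misencode(byte[] raw) {
        String widened = new String(raw, StandardCharsets.ISO_8859_1);
        return widened.getBytes(StandardCharsets.UTF_8);
    }

    // Format a byte array as upper-case hex without separators.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // GBK bytes 0xD1 0xEF, taken from the example in the report.
        byte[] gbk = {(byte) 0xD1, (byte) 0xEF};
        System.out.println(hex(misencode(gbk))); // prints C391C3AF
    }
}
```

      The output matches the values derived by hand: 0xD1 becomes 0xC391 and 0xEF becomes 0xC3AF.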

      P.S. Thanks to dram, who worked out how Java mis-encodes the name and how to work around it.

      STEPS TO FOLLOW TO REPRODUCE THE PROBLEM :
      Run the code below:
      Mixer.Info[] mi = AudioSystem.getMixerInfo();
      for (Mixer.Info info : mi) {
          System.out.println("info: " + info);
          Mixer m = AudioSystem.getMixer(info);
          System.out.println("mixer " + m);
          Line.Info[] sl = m.getSourceLineInfo();
          for (Line.Info info2 : sl) {
              System.out.println(" info: " + info2);
              Line line = AudioSystem.getLine(info2);
              if (line instanceof SourceDataLine) {
                  SourceDataLine source = (SourceDataLine) line;

                  DataLine.Info i = (DataLine.Info) source.getLineInfo();
                  for (AudioFormat format : i.getFormats()) {
                      System.out.println(" format: " + format);
                  }
              }
          }
          System.out.println("");
      }
      (from http://stackoverflow.com/questions/12863081/how-do-i-get-mixer-channels-layout-in-java)

      EXPECTED VERSUS ACTUAL BEHAVIOR :
      EXPECTED -
      on my computer:
      主声音驱动?
      扬声器 (Realtek High Definition Audio)
      Line 1 (Virtual Audio Cable)
      Line 2 (Virtual Audio Cable)

      P.S. The name on the first line is already truncated at the source, or it is truncated when it is read.
      ACTUAL -
      Ö÷ÉùÒôÇý¶¯³
      ÑïÉùÆ÷ (Realtek High Definition Audio)
      Line 1 (Virtual Audio Cable)
      Line 2 (Virtual Audio Cable)

      REPRODUCIBILITY :
      This bug can always be reproduced.

      ---------- BEGIN SOURCE ----------
      import javax.sound.sampled.*;
      import java.util.Arrays;

      import static javax.sound.sampled.AudioFormat.Encoding.PCM_SIGNED;

      public class SoundEnumerator {
          public static void main(String[] args) throws LineUnavailableException {
              Mixer.Info[] mis = AudioSystem.getMixerInfo();
              for (Mixer.Info mi : mis) {
                  Mixer m = AudioSystem.getMixer(mi);
                  if (isMixerUsable(m)) {
                      System.out.println(mi.getName());
                  }
              }
          }

          private static boolean isMixerUsable(Mixer m) throws LineUnavailableException {
              final int[] count = {0};
              Arrays.stream(m.getSourceLineInfo())
                      .filter((it) -> it instanceof SourceDataLine.Info)
                      .filter((it) -> {
                          try {
                              return m.getLine(it) instanceof SourceDataLine;
                          } catch (LineUnavailableException e) {
                              return false;
                          }
                      })
                      .forEach((it) -> Arrays.stream(((DataLine.Info) it).getFormats())
                              .filter((af) -> !af.isBigEndian())
                              .filter((af) -> af.getEncoding() == PCM_SIGNED)
                              .filter((af) -> af.getSampleSizeInBits() != 24)
                              .forEach((af) -> count[0]++));
              return count[0] != 0;
          }
      }

      ---------- END SOURCE ----------

      CUSTOMER SUBMITTED WORKAROUND :
      Use the method below to undo the wrong UTF-8 encoding and decode the recovered bytes with the system default charset.

      // Requires: java.io.ByteArrayOutputStream, java.nio.charset.Charset,
      // java.nio.charset.StandardCharsets
      private static final int LEAD_MASK = 0b11100000;
      private static final int TWO_BYTE_LEAD = 0b11000000;

      private static String deMessyCode(String messyCode) {
          ByteArrayOutputStream buf = new ByteArrayOutputStream(messyCode.length());
          byte[] utf8 = messyCode.getBytes(StandardCharsets.UTF_8);

          for (int i = 0; i < utf8.length; i++) {
              if ((utf8[i] & LEAD_MASK) == TWO_BYTE_LEAD && i + 1 < utf8.length) {
                  // Two-byte sequence 110xxxxx 10yyyyyy: keep the low 2 bits of the
                  // lead byte and the low 6 bits of the continuation byte to
                  // rebuild the original single byte.
                  buf.write(((utf8[i] & 0b11) << 6) | (utf8[i + 1] & 0b111111));
                  i++;
              } else {
                  buf.write(utf8[i]);
              }
          }

          // toByteArray() returns only the bytes actually written.
          return new String(buf.toByteArray(), Charset.forName(System.getProperty("file.encoding")));
      }

      The problem with this workaround is that you must know the system default encoding, and genuine characters that UTF-8 encodes as two bytes may also get "decoded" by this method.
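
      One way to reduce that risk is to repair a name only when every character fits in a single byte. The following is a sketch under that assumption, not a definitive fix: the class name GuardedRepair and the method repair are hypothetical, and it still cannot tell a garbled name from a genuine name made only of characters in the range U+0080 to U+00FF. On the system in this report the charset argument would presumably be Charset.forName("GBK"), assuming the runtime ships that charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class GuardedRepair {
    // Returns the repaired name, or the input unchanged if it contains any
    // char above U+00FF (such a string cannot be the result of widening
    // one byte to one char).
    static String repair(String name, Charset nativeCharset) {
        for (int i = 0; i < name.length(); i++) {
            if (name.charAt(i) > 0xFF) {
                return name;
            }
        }
        // ISO-8859-1 maps chars U+0000..U+00FF back to the same byte values.
        byte[] raw = name.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, nativeCharset);
    }

    public static void main(String[] args) {
        // "\u00C3\u00A9" is what the UTF-8 bytes C3 A9 look like when each
        // byte is widened to a char; UTF-8 stands in here for the native
        // charset so the example needs only standard charsets.
        String repaired = repair("\u00C3\u00A9", StandardCharsets.UTF_8);
        System.out.println(repaired.equals("\u00E9")); // prints true
    }
}
```

      Passing the charset as a parameter avoids relying on the file.encoding system property, at the cost of the caller having to know which charset the native names use.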

            People

            • Assignee:
              serb Sergey Bylokhov
              Reporter:
              webbuggrp Webbug Group