  1. JDK
  2. JDK-8124977

cmdline encoding challenges on Windows

    Details

    • Type: Bug
    • Status: Open
    • Priority: P3
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: tbd_major
    • Component/s: core-libs
    • Labels:
      None
    • OS:
      windows, windows_nt, windows_2008, windows_vista, windows_7, windows_2012, windows_8

      Description

      Motivated by the discussion here:
        http://stackoverflow.com/questions/11927518/java-unicode-utf-8-and-windows-command-prompt

      As well as this code:

      =====
      class Main {
        public static void main(String[] args) throws Exception {
          for (int i = 0; i < args.length; ++i) {
            if (i > 0) {
              System.out.print(' ');
            }
            System.out.print(args[i]);
          }
          System.out.println();
        }
      }
      =====
       
      Create a batch file containing the following text, saved as UTF-8 without a BOM, and run it from the command line. The resulting 'f.txt' does not contain the same characters that were passed as input.
       
      =========
      chcp 65001
      java Main 﨨狝 﨨狝 > f.txt
      ==========
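
To see what the JVM actually received, it can help to print each argument as code points rather than as glyphs, since ASCII-only output survives any console code page or redirection. The following is a minimal diagnostic sketch, not part of the original report; the class name ArgDump and its helper codePoints are hypothetical names chosen for illustration.

```java
import java.util.StringJoiner;

// Hypothetical diagnostic helper (not from the original report).
class ArgDump {
    // Format every code point of s as U+XXXX, separated by spaces.
    static String codePoints(String s) {
        StringJoiner out = new StringJoiner(" ");
        s.codePoints().forEach(cp -> out.add(String.format("U+%04X", cp)));
        return out.toString();
    }

    public static void main(String[] args) {
        // Output is pure ASCII, so it is unaffected by the active code page.
        for (int i = 0; i < args.length; ++i) {
            System.out.println("args[" + i + "] = " + codePoints(args[i]));
        }
    }
}
```

Running `java ArgDump` with the same arguments as the batch file above, in place of `java Main`, should show whether the characters were already replaced (for example by U+003F, a question mark) before the program wrote anything.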

      A good starting point on language issues in the Windows console is this post, and others elsewhere on the same blog: http://www.siao2.com/2010/10/07/10072032.aspx

      There are multiple areas involved in this problem.

      First is how the command arguments are passed to an app. PowerShell appears to pass them differently than cmd.exe. With cmd.exe, after calling chcp 65001, I see that the args are kept in wchar_t as UCS-2. With PowerShell and [Console]::OutputEncoding set to 437, 1252 and UTF-8, they appeared to be in char with UTF-8 encoding.
      NOTE: chcp is a command-line tool that calls SetConsoleOutputCP(). As far as I can see, a process should not call SetConsoleOutputCP().

      Second is how the command arguments are retrieved by an app:

      int main(int argc, char**argv)
      vs
      int wmain(int argc, wchar_t**argv)
      vs
      char* GetCommandLineA()
      vs
      wchar_t* GetCommandLineW()

      The JDK uses GetCommandLineA; it should use GetCommandLineW to support Unicode args. To preserve compatibility, this change should be controlled from the java command line.
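
The effect of retrieving arguments through a narrow (char) API can be imitated in plain Java by forcing a string through a single-byte charset and back. This is only an analogy under stated assumptions: ISO-8859-1 stands in for whatever ANSI code page is active, and Java substitutes '?' for unmappable characters, whereas the Windows conversion may pick a different replacement. The class name NarrowRoundTrip and the method throughNarrow are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration of a lossy narrow-character round trip.
class NarrowRoundTrip {
    // Encode s into the given single-byte charset and decode it again.
    // Characters the charset cannot represent come back as its replacement ('?').
    static String throughNarrow(String s, Charset narrow) {
        byte[] bytes = s.getBytes(narrow);
        return new String(bytes, narrow);
    }

    public static void main(String[] args) {
        Charset narrow = StandardCharsets.ISO_8859_1; // stand-in for an ANSI code page
        for (String arg : args) {
            String after = throughNarrow(arg, narrow);
            System.out.println(arg.equals(after) ? "preserved" : "lossy");
        }
    }
}
```

Once an argument has been narrowed this way, the original characters cannot be recovered later in the pipeline, which is why the retrieval step matters independently of how output is written.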

      The output streams (stdout, stderr) are another area. They are involved when using > or | to send the results to a file or to another process, and when writing to the console. This turns out to involve complex logic: WriteConsoleW for console output, WriteFile for > and |, and a final fallback of writing non-wide (char) text in the code page returned by GetConsoleOutputCP().
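
For the redirected case (> or |), the Java side of the problem can be sketched by wrapping the output stream in a PrintStream with an explicit encoding, so the bytes written do not depend on the JVM's default charset. This is a sketch under assumptions, not the JDK's behavior or a proposed fix: it chooses UTF-8 unconditionally, it does nothing about the WriteConsoleW path for an attached console, and the class name Utf8Out and its methods are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

// Hypothetical helper: always emit UTF-8 bytes, regardless of the default charset.
class Utf8Out {
    // Wrap a byte stream in an auto-flushing PrintStream that encodes text as UTF-8.
    static PrintStream utf8(OutputStream target) {
        try {
            return new PrintStream(target, true, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8.
            throw new AssertionError("UTF-8 is always supported", e);
        }
    }

    // Return the bytes that would reach a file or pipe for the given text.
    static byte[] encode(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        PrintStream out = utf8(buf);
        out.print(s);
        out.flush();
        return buf.toByteArray();
    }

    public static void main(String[] args) {
        // Echo the arguments as UTF-8 bytes on stdout.
        utf8(System.out).println(String.join(" ", args));
    }
}
```

This only helps if the arguments arrived intact in the first place; it addresses the output half of the problem and leaves argument retrieval untouched.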

      The last area is getting the consoles to display glyphs for the Unicode characters being tested. The font selected in the cmd and PowerShell windows must be Lucida Console or Consolas. Additional language packs must also be installed to get fallback fonts for the characters needed. Finally, using a console app (conemu, Console+, ..) should enable the proper display of Unicode glyphs for the cmd and PowerShell windows that it starts. PowerShell ISE worked when [Console]::OutputEncoding was set to UTF-8.
       

              People

              • Assignee:
                kshoop Kirk Shoop
                Reporter:
                kshoop Kirk Shoop