JDK-8187041

JEP 400: UTF-8 by Default

    Details

    • Type: JEP
    • Status: Candidate
    • Priority: P3
    • Resolution: Unresolved
    • Fix Version/s: None
    • Component/s: core-libs
    • Labels:
      None
    • Author:
      Alan Bateman
    • JEP Type:
      Feature
    • Exposure:
      Open
    • Subcomponent:
    • Scope:
      SE
    • Discussion:
      core dash libs dash dev at openjdk dot java dot net
    • Effort:
      XS
    • Duration:
      XS
    • JEP Number:
      400

      Description

      Summary

      Specify UTF-8 as the default charset for the Java SE APIs, so that APIs which depend on the default charset behave consistently across all JDK implementations and independently of the user’s operating system, locale, and configuration.

      Non-Goals

      It is not a goal to define new Java SE or JDK-specific APIs, although this effort may identify opportunities where new convenience methods might make existing APIs more approachable or easier to use.

      Motivation

      Several Java SE APIs allow a charset to be specified when reading and writing files and processing text. Supported charsets include US-ASCII, UTF-8, and ISO-8859-1. However, developers often overlook the choice of charset, so APIs are usually capable of functioning without one being specified. Typically, APIs will use the default charset in this case. The JDK chooses a charset to serve as the default charset, based on the operating system, locale, and other factors known at startup.

      Since the default charset is not the same everywhere, APIs that use the default charset pose many non-obvious hazards, even to experienced developers.

      Consider an application that creates a java.io.FileWriter with its one-argument constructor and then uses it to write some text to a file. The resulting file will contain a sequence of bytes encoded using the default charset of the JDK running the application. A second application, run on a different machine or by a different user on the same machine, creates a java.io.FileReader with its one-argument constructor and uses it to read the bytes in that file. The resulting text contains a sequence of characters decoded using the default charset of the JDK running the second application. If the default charset differs between the JDK of the first application and the JDK of the second application, then the resulting text may be silently corrupted or incomplete, since these APIs replace erroneous input rather than fail.
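
      The hazard described above can be reproduced without two machines. The following is a minimal sketch, not taken from the JEP: it stands in for the two differing default charsets by passing ISO-8859-1 to the writer and UTF-8 to the reader explicitly, and it uses in-memory streams instead of FileWriter and FileReader. The class and method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetMismatchDemo {

    // Encodes text the way a writer would if writerCharset were its default charset.
    static byte[] write(String text, Charset writerCharset) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (Writer w = new OutputStreamWriter(bytes, writerCharset)) {
            w.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Decodes bytes the way a reader would if readerCharset were its default charset.
    // Malformed input is replaced, not reported, so no exception signals the problem.
    static String read(byte[] data, Charset readerCharset) {
        StringBuilder sb = new StringBuilder();
        try (Reader r = new InputStreamReader(new ByteArrayInputStream(data), readerCharset)) {
            int c;
            while ((c = r.read()) != -1) {
                sb.append((char) c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String original = "caf\u00e9"; // "café"
        byte[] file = write(original, StandardCharsets.ISO_8859_1); // first application
        String decoded = read(file, StandardCharsets.UTF_8);        // second application
        System.out.println("round trip intact: " + original.equals(decoded));
        System.out.println("contains U+FFFD:   " + (decoded.indexOf('\uFFFD') >= 0));
    }
}
```

      The read completes normally even though the accented character has been lost, which is why this kind of corruption tends to go unnoticed.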

      Developers familiar with such hazards can use methods that take a charset argument explicitly. However, having to pass an argument prevents the methods from being used via method references (::) in Java 8-style streams.
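
      As an illustration of that trade-off, the sketch below decodes a list of byte arrays in a stream pipeline in two ways. The class and method names are hypothetical, and the String constructors are used only as a convenient example of a charset-less and a charset-taking variant of the same operation.

```java
import java.nio.charset.StandardCharsets;
import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;

class MethodRefCharsetDemo {

    // Concise method reference, but String(byte[]) decodes with the default charset.
    static List<String> decodeWithDefault(List<byte[]> chunks) {
        return chunks.stream()
                .map(String::new)
                .collect(Collectors.toList());
    }

    // Passing the charset explicitly requires a lambda instead of a method reference.
    static List<String> decodeAsUtf8(List<byte[]> chunks) {
        return chunks.stream()
                .map(b -> new String(b, StandardCharsets.UTF_8))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<byte[]> chunks =
                Collections.singletonList("caf\u00e9".getBytes(StandardCharsets.UTF_8));
        System.out.println(decodeWithDefault(chunks)); // depends on the default charset
        System.out.println(decodeAsUtf8(chunks));      // the same on every platform
    }
}
```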

      Sometimes developers attempt to set the default charset via the system property file.encoding, but this has never been supported and may not actually work, especially if modified after the Java virtual machine is initialized.

      Not all Java SE APIs defer to the JDK's choice of default charset. For example, the methods in java.nio.file.Files that read or write files without a Charset argument are specified to always use UTF-8. The fact that newer APIs default to using UTF-8 while older APIs default to using the default charset is a hazard for applications that use a mix of APIs.
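
      The following sketch contrasts the two families of API on the same text. It assumes a writable temporary directory; the class name, method names, and temp-file prefix are hypothetical. Files.newBufferedWriter(Path) is one of the java.nio.file.Files methods specified to use UTF-8, while FileWriter(File) uses the default charset.

```java
import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

class FilesUtf8Demo {

    // Writes text via Files.newBufferedWriter(Path) (always UTF-8) and returns the raw bytes.
    static byte[] writeWithFilesApi(String text) {
        try {
            Path tmp = Files.createTempFile("jep400-demo", ".txt");
            try {
                try (BufferedWriter w = Files.newBufferedWriter(tmp)) {
                    w.write(text);
                }
                return Files.readAllBytes(tmp);
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Writes text via FileWriter(File) (default charset) and returns the raw bytes.
    static byte[] writeWithFileWriter(String text) {
        try {
            Path tmp = Files.createTempFile("jep400-demo", ".txt");
            try {
                try (Writer w = new FileWriter(tmp.toFile())) {
                    w.write(text);
                }
                return Files.readAllBytes(tmp);
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String text = "caf\u00e9";
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        System.out.println("Files API wrote UTF-8:  "
                + Arrays.equals(utf8, writeWithFilesApi(text)));
        // The second result depends on the default charset of the running JDK.
        System.out.println("FileWriter wrote UTF-8: "
                + Arrays.equals(utf8, writeWithFileWriter(text)));
    }
}
```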

      The entire Java ecosystem would benefit if the default charset were specified to be the same everywhere: applications that are not concerned with portability will see little impact, while applications that embrace portability by specifying charsets will see no impact. Since UTF-8 is standard for the XML and JSON files processed by vast numbers of Java programs, and since Java's own APIs increasingly favor UTF-8, e.g., in the NIO API and for properties files, it makes sense to specify UTF-8 as the default charset.

      Description

      The default charset is currently determined when the Java virtual machine starts. On macOS, it is UTF-8 except in the POSIX C locale; on other platforms, it depends upon the user's locale and the default encoding. The method java.nio.charset.Charset.defaultCharset() exposes which charset was determined as the default. Several Java SE APIs use the default charset, including:

      • In the java.io package: InputStreamReader, FileReader, OutputStreamWriter, and FileWriter define constructors to create readers or writers that encode or decode using the default charset.

      • In the java.util package: Formatter and Scanner define constructors whose results use the default charset.

      • In the java.net package: URLEncoder and URLDecoder define deprecated methods that use the default charset.
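
      Each of the APIs listed above also has a variant that takes the charset explicitly. The sketch below shows a few of them next to Charset.defaultCharset(); the class and method names are hypothetical, and the selection of APIs is illustrative only.

```java
import java.io.ByteArrayInputStream;
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;

class ExplicitCharsetVariants {

    // java.util.Scanner: the two-argument constructor names the charset.
    static String firstToken(byte[] utf8Data) {
        try (Scanner s = new Scanner(new ByteArrayInputStream(utf8Data), "UTF-8")) {
            return s.next();
        }
    }

    // java.net.URLEncoder: the non-deprecated encode method names the charset.
    static String urlEncode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 support is mandatory on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // What the charset-less variants would use on this JDK.
        System.out.println("default charset: " + Charset.defaultCharset());

        byte[] data = "h\u00e9llo w\u00f6rld".getBytes(StandardCharsets.UTF_8);
        System.out.println(firstToken(data));
        System.out.println(urlEncode("\u00e9"));
    }
}
```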

      We propose to change the specification of Charset.defaultCharset() to say that the default charset is UTF-8 unless configured otherwise by an implementation-specific means. This explicit support for non-standard configuration means that Java programs may detect something other than UTF-8 as the default charset. The UTF-8 charset is specified by RFC 2279; the transformation format upon which it is based is specified in Amendment 2 of ISO 10646-1 and is also described in the Unicode Standard. It is not to be confused with "Modified UTF-8".
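
      To make the distinction from "Modified UTF-8" concrete, the sketch below compares the bytes produced by the UTF-8 charset with those produced by DataOutputStream.writeUTF, which writes a two-byte length followed by modified UTF-8. The class and method names are hypothetical. The two encodings differ for U+0000 and for supplementary characters.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class ModifiedUtf8Demo {

    // Standard UTF-8, as produced by the UTF-8 charset.
    static byte[] standardUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Modified UTF-8, as produced by DataOutputStream.writeUTF,
    // with the leading two-byte length prefix stripped.
    static byte[] modifiedUtf8(String s) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        byte[] all = bytes.toByteArray();
        return Arrays.copyOfRange(all, 2, all.length);
    }

    public static void main(String[] args) {
        String nul = "\u0000";                 // U+0000
        String supplementary = "\uD83D\uDE00"; // U+1F600, a surrogate pair in UTF-16
        System.out.println(standardUtf8(nul).length + " vs " + modifiedUtf8(nul).length);
        System.out.println(standardUtf8(supplementary).length
                + " vs " + modifiedUtf8(supplementary).length);
    }
}
```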

      We will update the specifications of all Java SE APIs that use the default charset, including those listed above, to cross-reference Charset.defaultCharset(). The choice of UTF-8 applies only to Java SE APIs and not to the Java language, which will continue to use UTF-16.

      There are four system properties related to the default charset:

      • file.encoding — the name of the default charset.

      • sun.stdout.encoding and sun.stderr.encoding — the names of the charsets used for the standard output (System.out) and error (System.err) streams, and in the java.io.Console API.

      • sun.jnu.encoding — the name of the charset used when encoding or decoding filename paths, as opposed to file contents. It is used in the implementation of java.nio.file classes and in particular in the native utility method JNU_NewStringPlatform. On macOS its value is "UTF-8"; on other platforms, it is typically the default charset.

      The values of these system properties can be set on the command line, although doing so has never been supported and often has no effect. To mitigate the compatibility impact of this JEP, we will revise the treatment of the system property file.encoding so that setting it on the command line is a supported means of configuring the default charset (as envisaged by the specification of Charset.defaultCharset()). This will be documented by an implementation note in System.getProperties, as follows:

      • If file.encoding is set to COMPAT (i.e., java -Dfile.encoding=COMPAT), then the default charset will be the charset chosen by the JDK's current algorithm, based on the user's operating system, locale, and other factors. (This may mean that the default charset is UTF-8.) The value of this property will be replaced with the name of that charset.

      • If file.encoding is set to UTF-8 (i.e., java -Dfile.encoding=UTF-8), then the default charset will be UTF-8. This no-op value is defined in order to preserve the behavior of existing command lines.

      • The treatment of values other than COMPAT and UTF-8 will not be specified. They are not supported, but if such a value worked before then it will likely continue to work.

      The other system properties (sun.stdout.encoding, sun.stderr.encoding, sun.jnu.encoding) will remain unspecified and unsupported.
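
      To see how these properties and the default charset relate on a given JDK, a small diagnostic along the following lines can help. It is a sketch only: the class name is hypothetical, and because all properties other than file.encoding are unspecified, any of them may be absent (reported here as "null").

```java
import java.nio.charset.Charset;
import java.util.LinkedHashMap;
import java.util.Map;

class EncodingReport {

    // Collects the default charset and the four related system properties.
    static Map<String, String> collect() {
        Map<String, String> report = new LinkedHashMap<>();
        report.put("Charset.defaultCharset()", Charset.defaultCharset().name());
        String[] keys = {
            "file.encoding",
            "sun.stdout.encoding",
            "sun.stderr.encoding",
            "sun.jnu.encoding"
        };
        for (String key : keys) {
            // Unset properties are recorded as the string "null".
            report.put(key, String.valueOf(System.getProperty(key)));
        }
        return report;
    }

    public static void main(String[] args) {
        collect().forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```

      Running it once without options and once with -Dfile.encoding=UTF-8 shows whether the command-line setting changes the default charset on the JDK in use.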

      Testing

      • Significant testing will be required to understand the extent of the compatibility impact of this change. Testing by developers or organizations with geographically diverse user populations will be needed.

      • Developers can check for issues with an existing JDK release by running with -Dfile.encoding=UTF-8 in advance of any early-access or JDK release with this change.

      • Some existing unit and regression tests may need to be updated.

      Alternatives

      • Preserve the status quo. This doesn't eliminate the hazards for new developers.

      • Deprecate all methods in the Java SE API that use the default charset. This would encourage developers to use constructors and methods that take a charset parameter, but the resulting code is more verbose.

      • Specify UTF-8 as the default charset without providing any means to change it. The compatibility impact of this change would be too high.

      Risks and Assumptions

      The risk of specifying the default charset as UTF-8 is that applications may behave incorrectly when processing data produced when the default charset was unspecified, and so may have been something other than UTF-8. However, this risk is not wholly new: applications that are inattentive to charsets (for example, by not passing an explicit charset to APIs) have always run the risk of incorrect behavior and data corruption.

      Fortunately, applications in many environments can expect very low risk from Java's choice of UTF-8:

      • On macOS, the default charset has been UTF-8 for several releases (except when configured to use the POSIX C locale).

      • In many Linux distributions (though not all), the default charset is UTF-8, so no change will be discernible in those environments.

      • Many server applications are already started with -Dfile.encoding=UTF-8, so they will not experience any change.

      In other environments, the risk of changing the default charset to UTF-8 after more than twenty years may be significant. We expect the main impact will be to users of Windows in Asian locales, and possibly some server environments in Asian and other locales. Possible scenarios include:

      • If an application that has been running for years with SJIS as the default charset is upgraded to a JDK release that uses UTF-8 as the default charset, then it will experience problems when reading files that are encoded in SJIS. In this case the application could be changed to specify SJIS when opening such files. If the code cannot be changed, then running with -Dfile.encoding=COMPAT will force the default charset to be SJIS until the application is updated or the file is converted to UTF-8.

      • In environments in which several JDK versions are in use, users might not be able to exchange file data. If, e.g., one user uses an older JDK release where SJIS is the default and another uses a newer JDK where UTF-8 is the default, then text files created by one user might not be readable by the other. In this case the user on the old JDK release could specify -Dfile.encoding=UTF-8 when starting applications, or the user on the new release could specify -Dfile.encoding=COMPAT.
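
      The two code-level remedies mentioned in the first scenario, specifying SJIS when opening legacy files and converting those files to UTF-8, might look like the following sketch. It assumes the running JDK supports the Shift_JIS charset, which is not one of the charsets every Java platform must provide; the class and method names are hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class LegacySjisSupport {

    // Assumption: Shift_JIS is available; Charset.forName throws otherwise.
    static final Charset SJIS = Charset.forName("Shift_JIS");

    // Decodes legacy bytes with an explicit charset, independent of the default.
    static String decodeLegacy(byte[] sjisBytes) {
        return new String(sjisBytes, SJIS);
    }

    // Re-encodes legacy bytes as UTF-8.
    static byte[] toUtf8(byte[] sjisBytes) {
        return decodeLegacy(sjisBytes).getBytes(StandardCharsets.UTF_8);
    }

    // Reads a legacy file by naming its charset explicitly.
    static String readLegacyFile(Path file) {
        try {
            return decodeLegacy(Files.readAllBytes(file));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // One-off conversion of a legacy file to UTF-8.
    static void convertFileToUtf8(Path source, Path target) {
        try {
            Files.write(target, toUtf8(Files.readAllBytes(source)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

      Either approach removes the dependency on the default charset, so the application behaves the same on old and new JDK releases.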

              People

              Assignee:
              Naoto Sato
              Reporter:
              Alan Bateman
              Owner:
              Naoto Sato
              Reviewed By:
              Alex Buckley, Brian Goetz