Details

    • Type: JEP
    • Status: Draft
    • Priority: P3
    • Resolution: Unresolved
    • Fix Version/s: None
    • Component/s: core-libs
    • Labels:
      None
    • Author:
      alanb
    • JEP Type:
      Feature
    • Exposure:
      Open
    • Subcomponent:
    • Scope:
      SE

      Description

      Summary

      Use UTF-8 as the Java virtual machine's default charset so that APIs that depend on the default charset behave consistently across all platforms.

      Goals

      The goal of this JEP is for APIs that use the default charset to behave consistently across platforms and not depend on the user's locale and configuration.

      Non-Goals

      It is not a goal of this JEP to define new Java SE or JDK-specific APIs, although the effort may identify opportunities where convenience methods might make existing APIs more approachable or easier to use.

      Motivation

      APIs that use the default charset are a hazard for developers who are new to the Java platform. They are also a bugbear for experienced developers. Consider an application that creates a java.io.FileWriter with its 1-arg constructor and uses it to write some text to a file. Writing the text encodes it into a sequence of bytes using the default charset. Another application, run on a different machine or by a different user on the same machine, creates a java.io.FileReader with its 1-arg constructor and uses it to read the text from the file. Reading the file decodes the bytes to a sequence of characters using the default charset. If the default charset is different when reading, then the resulting text may be silently corrupted or incomplete (these APIs replace erroneous input rather than fail).
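
      The mismatch described above can be reproduced without touching the default charset at all. The following is a minimal sketch: the class and method names are hypothetical, and ISO-8859-1 merely stands in for whatever non-UTF-8 default the writing machine might have.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class CharsetMismatchDemo {
    // Encodes text with one charset and decodes the bytes with another,
    // mimicking a writer and a reader whose default charsets differ.
    static String roundTrip(String text, Charset writerCharset, Charset readerCharset) {
        byte[] bytes = text.getBytes(writerCharset);
        // new String(byte[], Charset) replaces malformed input instead of failing.
        return new String(bytes, readerCharset);
    }

    public static void main(String[] args) {
        String text = "caf\u00e9"; // "cafe" with an acute accent on the final letter

        String same = roundTrip(text, StandardCharsets.UTF_8, StandardCharsets.UTF_8);
        String mixed = roundTrip(text, StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8);

        System.out.println("same charset preserved text:  " + same.equals(text));
        System.out.println("mixed charset preserved text: " + mixed.equals(text));
        System.out.println("mixed result contains U+FFFD: " + mixed.contains("\uFFFD"));
    }
}
```

      No exception is thrown in the mixed case, which is what makes the corruption silent.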

      Developers that are familiar with the hazard may choose to use methods that specify the charset (either by charset name or Charset) but the resulting code is more verbose. Furthermore, using APIs that specify the charset may inhibit the use of some Java Language features (Method References in particular). Sometimes developers attempt to set the default charset by means of the system property file.encoding but this has never been a supported mechanism (and may not actually be effective, especially when changed after the Java virtual machine has been initialized).
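
      As an illustration of the verbosity and method-reference point, the sketch below contrasts a constructor reference that relies on the default charset with the lambda needed once a charset is named. The class, field, and helper names are hypothetical and chosen only for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.function.Function;

final class ExplicitCharsetDemo {
    // Concise, but decodes with whatever the default charset happens to be.
    static final Function<InputStream, Reader> DEFAULT_DECODER = InputStreamReader::new;

    // Naming the charset is predictable, but the constructor reference no
    // longer fits the one-argument Function shape, so a lambda is needed.
    static final Function<InputStream, Reader> UTF8_DECODER =
            in -> new InputStreamReader(in, StandardCharsets.UTF_8);

    // Drains a reader into a String, wrapping I/O failures as unchecked.
    static String readAll(Reader reader) {
        try (Reader r = reader) {
            StringBuilder sb = new StringBuilder();
            char[] buf = new char[256];
            int n;
            while ((n = r.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] utf8 = "na\u00efve".getBytes(StandardCharsets.UTF_8);
        String decoded = readAll(UTF8_DECODER.apply(new ByteArrayInputStream(utf8)));
        System.out.println("explicit UTF-8 decode preserved text: " + decoded.equals("na\u00efve"));
    }
}
```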

      Description

      The default charset is currently determined when the Java virtual machine starts. On macOS it is UTF-8; on other platforms it depends on the user's locale and the default encoding. The determination of the default charset results in the creation of two JDK internal (and undocumented) system properties:

      • file.encoding - the value of this system property is the name of the default charset. The java.nio.charset.Charset.defaultCharset() API returns the Charset object for this charset.

      • sun.jnu.encoding - the value of this system property is the name of the charset used when encoding/decoding file paths (as opposed to file contents). It is also used in the JDK native code, JNU_NewStringPlatform in particular. On macOS its value is "UTF-8"; on other platforms it is typically the default charset.

      The value of these system properties can be overridden on the command line although doing so has never been supported.
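
      To see what a given environment currently reports, the default charset and the two properties can be printed side by side. This is a small sketch with a hypothetical class name; because both properties are JDK internal, their presence and values are implementation details and may be null on other implementations.

```java
import java.nio.charset.Charset;

final class DefaultCharsetInfo {
    // Builds a one-line summary of the default charset and the two
    // JDK-internal properties described above.
    static String describe() {
        return "default charset = " + Charset.defaultCharset().name()
                + ", file.encoding = " + System.getProperty("file.encoding")
                + ", sun.jnu.encoding = " + System.getProperty("sun.jnu.encoding");
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

      Running this under different locales, or on different operating systems, is a quick way to observe how much the values vary today.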

      The default charset is used by several Java SE APIs, e.g.

      • java.io package: InputStreamReader, FileReader, OutputStreamWriter, and FileWriter define constructors to create readers or writers that encode or decode using the default charset.
      • java.util package: Formatter and Scanner define constructors where the resulting objects use the default charset.
      • java.net package: URLEncoder and URLDecoder define methods that use the default charset (although these methods are deprecated).

      Note that the APIs in java.nio.file.Files do not use the default charset. The methods that read or write character streams without a Charset parameter are specified to use UTF-8 rather than the default charset. (Newer APIs using UTF-8 is arguably a hazard for applications that use a mix of both old and new APIs.)
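
      The UTF-8 behavior of java.nio.file.Files can be checked by writing text without a Charset argument and inspecting the raw bytes. This is a minimal sketch; the class name, method name, and temporary-file prefix are hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

final class FilesUtf8Demo {
    // Writes the lines with Files.write (no Charset argument) to a temporary
    // file and returns the bytes that ended up on disk.
    static byte[] writeAndReadBytes(List<String> lines) {
        try {
            Path tmp = Files.createTempFile("files-utf8-demo", ".txt");
            try {
                Files.write(tmp, lines); // specified to encode as UTF-8
                return Files.readAllBytes(tmp);
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] bytes = writeAndReadBytes(Arrays.asList("\u00e9"));
        // U+00E9 is the two-byte sequence C3 A9 in UTF-8.
        boolean utf8 = bytes.length >= 2
                && bytes[0] == (byte) 0xC3
                && bytes[1] == (byte) 0xA9;
        System.out.println("Files.write encoded as UTF-8: " + utf8);
    }
}
```

      The same file read back through a 1-arg FileReader would be decoded with the default charset, which is the mixed old/new API hazard noted above.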

      The specification of the Charset.defaultCharset() API will be changed to specify that the default charset is UTF-8 unless configured otherwise by an implementation specific means. All APIs, including those listed above, that use the default charset will link to Charset.defaultCharset() if they don't already do so.

      To mitigate the compatibility impact, the file.encoding property will be documented (in an implementation note) so that it can be set on the command line to the value "SYSTEM" (i.e. -Dfile.encoding=SYSTEM). When started with this value, the default charset will be determined based on the locale and default encoding, as in the long-standing behavior.

      In addition, the file.encoding property will also be documented to allow it to be set on the command line with the value "UTF-8", essentially a no-op.

      The system property sun.jnu.encoding and its value will be unchanged. It will remain undocumented.

      Testing

      Significant testing will be required to understand the extent of the compatibility impact. Testing from developers or organizations with geographically diverse user populations will be needed.

      Developers can check for issues with existing JDK releases by running with -Dfile.encoding=UTF-8 in advance of any early access or JDK release with the change.

      Some existing unit/regression tests may need to be updated.

      Alternatives

      • Keep the status quo: This doesn't eliminate the hazards for new developers.
      • Deprecate all methods in the Java SE API that use the default charset: This will encourage developers to use constructors and methods that take a charset parameter but the resulting code is more verbose.
      • Specify UTF-8 as the default charset without providing any means to change it: The compatibility impact of this is too high.

      Risks and Assumptions

      There is no risk in some environments:

      • The default charset on macOS has been UTF-8 for several releases.
      • The default charset in many (but not all) Linux environments is UTF-8 so these environments will not observe a change.
      • Many server applications are started with -Dfile.encoding=UTF-8 so they will also not observe any change.

      In other environments, the risk of changing the default charset to UTF-8 after 20+ years may be significant. We expect the main impact will be on users of Microsoft Windows in Asian locales, and possibly on some server environments in Asian and other locales.

      • Upgrading: e.g. an application has been running for years with SJIS as the default charset. When upgraded to a JDK release that uses UTF-8 as the default charset it experiences problems when reading files that are encoded as SJIS. For this example, the application could be changed to specify SJIS when opening the file. If the code cannot be changed then running with -Dfile.encoding=SYSTEM will force the default charset to be SJIS until the application is updated or the file converted to UTF-8.

      • Environments where there are several JDK versions in use, e.g. one user using an older JDK release where SJIS is the default charset, another where UTF-8 is the default charset.
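
      For the upgrade scenario above, changing the application to specify SJIS when opening the file might look like the following sketch. The class and method names are hypothetical, and the example assumes the runtime provides the Shift_JIS charset, which is not one of the charsets every Java implementation is required to support, so the demo checks for it first.

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;

final class ExplicitLegacyCharsetDemo {
    // Reads the whole file, decoding with the given charset rather than
    // whatever the default charset happens to be.
    static String read(Path file, Charset charset) {
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new FileInputStream(file.toFile()), charset))) {
            StringBuilder sb = new StringBuilder();
            int c;
            while ((c = reader.read()) != -1) {
                sb.append((char) c);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Writes text to a temporary file in the named charset, then reads it
    // back with the same charset named explicitly.
    static boolean roundTrips(String text, String charsetName) {
        Charset charset = Charset.forName(charsetName);
        try {
            Path tmp = Files.createTempFile("legacy-charset-demo", ".txt");
            try {
                Files.write(tmp, text.getBytes(charset));
                return read(tmp, charset).equals(text);
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String name = "Shift_JIS";
        if (Charset.isSupported(name)) {
            // The sample text is the word "Japanese" written in kanji.
            System.out.println("round trip ok: " + roundTrips("\u65e5\u672c\u8a9e", name));
        } else {
            System.out.println(name + " is not available in this runtime");
        }
    }
}
```

      Code written this way reads the file the same way regardless of which JDK release or default charset is in use, which also addresses the mixed-version environment in the last bullet.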

            People

            • Assignee:
              sherman Xueming Shen
              Reporter:
              alanb Alan Bateman
              Owner:
              Alan Bateman