Uploaded image for project: 'JDK'
  1. JDK
  2. JDK-8253952

Refine ZipOutputStream.putNextEntry() to recalculate ZipEntry's compressed size

    XMLWordPrintable

    Details

    • Type: Enhancement
    • Status: Resolved
    • Priority: P4
    • Resolution: Fixed
    • Affects Version/s: 8, 11, 16
    • Fix Version/s: 16
    • Component/s: core-libs

      Description

      ### Summary

      Work around wrong usage of `ZipOutputStream.putNextEntry()` in user code which can lead to the `ZipException "invalid entry compressed size"`.

      ### Motivation

      In general it is not safe to directly write a ZipEntry obtained from `ZipInputStream.getNextEntry()`, `ZipFile.entries()`, `ZipFile.getEntry()` or `ZipFile.stream()` with `ZipOutputStream.putNextEntry()` to a `ZipOutputStream` and then read the entries data from the `ZipInputStream` and write it to the `ZipOutputStream` as follows:
      ```
       ZipEntry entry;
       ZipInputStream zis = new ZipInputStream(...);
       ZipOutputStream zos = new ZipOutputStream(...);
       while((entry = zis.getNextEntry()) != null) {
           zos.putNextEntry(entry);
           zis.transferTo(zos);
       }
      ```
      The problem with this code is that the zip file format does not record the compression level used for deflation in its entries. In general, it doesn't even mandate a predefined compression ratio per compression level. Therefore the compressed size recorded in a `ZipEntry` read from a zip file might differ from the new compressed size produced by the receiving `ZipOutputStream`. Such a difference will result in a `ZipException` with the following message:
      ```
       java.util.zip.ZipException: invalid entry compressed size (expected 12 but got 7 bytes)
      ```
      The correct way of copying all entries from one zip file into another requires the creation of a new `ZipEntry` or at least resetting of the compressed size field. E.g.:
      ```
       while((entry = zis.getNextEntry()) != null) {
           ZipEntry newEntry = new ZipEntry(entry.getName());
           zos.putNextEntry(newEntry);
           zis.transferTo(zos);
       }
      ```
      or:
      ```
       while((entry = zis.getNextEntry()) != null) {
           entry.setCompressedSize(-1);
           zos.putNextEntry(entry);
           zis.transferTo(zos);
       }
      ```
      Unfortunately, there's a lot of user code out there which gets this wrong and uses the bad coding pattern described before. Searching for `"java.util.zip.ZipException: invalid entry compressed size (expected 12 but got 7 bytes)"` gives ~2500 hits (~100 on StackOverflow). It's also no hard to find plenty of instances of this anti-pattern on GitHub when doing a code search for `ZipEntry` and `putNextEntry()`. E.g. [Gradle 4.x wrapper task][1] is affected as well as the latest version of the [mockableAndroidJar task][2]. I've recently fixed two occurrences of this pattern in OpenJDK (see [JDK-8240333][3] and [JDK-8240235][4]) but there still exist more of them (e.g. [`test/jdk/java/util/zip/ZipFile/CopyJar.java`][5] which is there since 1999 :).

      ### Description

      So while this has clearly been a problem before, it apparently wasn't painful enough to trigger any action from the side of the JDK. However, recently quite some zlib forks with [superior deflate/inflate performance have evolved][6]. Using them with OpenJDK is quite straight-forward: one just has to configure the alternative implementations by setting `LD_LIBRARY_PATH` or `LD_PRELOAD` correspondingly. We've seen big saving by using these new zlib implementations for selected services in production and the only reason why we haven't enabled them by default until now is the problem I've just described. The reason why these new libraries uncover the described anti-pattern much more often is because their compression ratio is slightly different from that of the default zlib library. This can easily trigger a `ZipException` even if an application is not using a different compression levels but just a zip file created with another zlib version.

      I'd therefore like to propose the following workaround for the wrong `ZipOutputStream.putNextEntry()` usage in user code:

      - ignore the compressed size if it was implicitly determined from the zip file and not explicitly set by calling `ZipEntry.setCompressedSize()`.

      - Change the API-documentation of `ZipOutputStream.putNextEntry()` and `JarOutputStream.putNextEntry()` to explain the problem and why `putNextEntry()` will ignore the compressed size of a `ZipEntry` if that was set implicitely when reading that entry from a `ZipFile` or `ZipInputStream`.


      ### Technical Details

      A zip file consists of a stream of File Entries followed by a Central Directory (see [here for a more detailed specification][7]). Each File Entry is composed of a Local File Header (LFH) followed by the compressed Data and an optional Data Descriptor. The LFH contains the File Name and among other attributes the Compressed and Uncompressed size and CRC of the Data. In the case where the latter three attributes are not available at the time when the LFH is created, this fact will be recorded in a flag of the LFH and will trigger the creation of a Data Descriptor with the corresponding information right after the Data section. Finally, the Central Directory contains one Central Directory File Header (CDFH) for each entry of the zip archive. The CDFH is an extended version of the LFH and the ultimate reference for the contents of the zip archive. The redundancy between LFH and CDFH is a tribute to zip's long history when it was used to store archives on multiple floppy discs and the CDFH allowed to update the archive by only writing to the last disc which contained the Central Directory.

      `ZipEntries` read with `ZipInputStream.getNextEntry()` will initially only contain the information from the LFH. Only after the next entry was read (or after `ZipInputStream.closeEntry()` was called explicitly), will the previously read entry be updated with the data from the Data Descriptor. `ZipInputStream` doesn't inspect the Central Directory at all.

      On the other hand, `ZipFile` only queries the Central Directory for `ZipEntry` information so all `ZipEntries` returned by `ZipFile.entries()`, `ZipFile.getEntry()` and `ZipFile.stream()` will always instantly contain the full Compressed and Uncompressed Size and CRC information for each entry independently of the LFH contents.

      ### Risks and Assumptions

      If we choose to ignore the implicitly recorded compressed size in a `ZipEntry` read from a zip file when writing it to a `ZipOutputStream`, this will lead to zip files with incomplete information in the LFH and an additional Data Descriptor as described before. However, the result is still fully compatible to the zip file specification. It's also not unusual, because by default all new zip files created with `ZipOutputStream` will contain LFHs without Compressed and Uncompressed Size and CRC information and an additional Data Descriptor. Theoretically it is possible to create new zip files with `ZipOutputStream` class and Compressed and Uncompressed Size and CRC information in the LFH but that's complex and inefficient because it requires two steps. A first step to determine the crc and compressed size of the data and a second step to actually write the data to the `ZipOutputStream` (which will compress it a second time). This is because the current API offers no possibility to write already compressed data to a `ZipOutputStream`.

      Consequently, the only straight-forward way of creating zip files from Java which have all the data in the LFH and no Data Descriptor is by copying `ZipEntries` from an existing zip file with the buggy method described before. This incidentally worked more or less reliable for a long time but breaks miserably when using different zlib implementations. Ignoring the implicitly set compressed size of `ZipEntries` can easily fix this problem.

      I'm not aware of any tool which can not handle such files and if it exists it would have problems with the majority of Java created zip files anyway (e.g. all jar-files created with the jar tool have zip entries with incomplete LFH data and additional Data Descriptor).

      Ignoring the implicitly set compressed size of `ZipEntries` has no measurable performance impact and will increase the size of zip archives which used to have the complete file information in the LFH before by 16 bytes per entry. On the other hand it will give us the freedom to use whatever zip implementation we like :)


      [1]: https://github.com/gradle/gradle/blob/e76905e3a/subprojects/build-init/src/main/java/org/gradle/api/tasks/wrapper/Wrapper.java#L152-L158
      [2]: https://android.googlesource.com/platform/tools/base/+/refs/heads/master/build-system/builder/src/main/java/com/android/builder/testing/MockableJarGenerator.java#86
      [3]: https://bugs.openjdk.java.net/browse/JDK-8240333
      [4]: https://bugs.openjdk.java.net/browse/JDK-8240235
      [5]: https://github.com/openjdk/jdk/blob/master/test/jdk/java/util/zip/ZipFile/CopyJar.java
      [6]: https://github.com/simonis/zlib-bench/blob/master/Results.md
      [7]: https://en.wikipedia.org/wiki/Zip_(file_format)

        Attachments

          Issue Links

          There are no Sub-Tasks for this issue.

            Activity

              People

              Assignee:
              simonis Volker Simonis
              Reporter:
              simonis Volker Simonis
              Votes:
              0 Vote for this issue
              Watchers:
              7 Start watching this issue

                Dates

                Created:
                Updated:
                Resolved: