Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Truncation issue with GZIPInputStream when reading bgzip-compressed streams #1691

Open
jkmatila opened this issue Oct 16, 2023 · 2 comments
Open

Comments

@jkmatila
Copy link

Description of the issue

When reading bgzip-compressed data from a stream (e.g. over the network or from a pipe), htsjdk can stop early and not read the stream all the way to the end, processing partial data. This is due to using java.util.zip.GZIPInputStream which has a known bug JDK-7036144 affecting reading concatenated gzip streams when the complete size of the stream is not known beforehand.

The problem has been discussed earlier in the comment thread to broadinstitute/gatk issue #4224, but it seems that a part of the recommended fix, replacing the use of the buggy GZIPInputStream class, was not adopted. There also is a comment in SamInputResource.java which references this problem and the aforementioned JDK bug.

For example, GzipCompressorInputStream from Apache Commons Compress would be an alternative which does not suffer from this problem, as noted in its Javadoc.

Your environment

  • version of htsjdk: 4.0.2
  • version of Java: OpenJDK 17.0.8.1 (Eclipse Temurin-17.0.8.1+1), OpenJDK 21 (Eclipse Temurin-21+35)
  • which OS: Linux x86_64 (Ubuntu 22.04.3)

Steps to reproduce

Here is an example program which demonstrates one instance of the problem, VCFIteratorBuilder using the problematic GZIPInputStream class when reading a bgzipped VCF stream, and only part of the stream gets read. (Note: The code attempts to read a 200 MB file from the Web, although it in practice due to the bug it does not get very far.)

import java.io.IOException;
import java.io.InputStream;
import java.net.URL;

import htsjdk.variant.vcf.VCFIterator;
import htsjdk.variant.vcf.VCFIteratorBuilder;

public class Repro {

    public static void main(String[] args) throws IOException {
        URL url = new URL("https://ftp-trace.ncbi.nlm.nih.gov/1000genomes/ftp/release/20130502/ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz");
        int expectedNumRecords = 1103547;

        int numRecords = 0;
        try (InputStream in = url.openStream();
             VCFIterator vcfIterator = new VCFIteratorBuilder().open(in)) {
            while (vcfIterator.hasNext()) {
                vcfIterator.next();
                numRecords++;
                if (numRecords % 100000 == 0) {
                    System.out.println("Read " + numRecords + " records...");
                }
            }
        }

        if (numRecords != expectedNumRecords) {
            System.err.println("URL: " + url);
            System.err.println("Expected: " + expectedNumRecords + " records");
            System.err.println("Parsed: " + numRecords + " records");
            System.exit(1);
        }
    }

}

There are also other places where GZIPInputStream is used in htsjdk.

Expected behaviour

The compressed VCF stream should be read all the way to the end and all records parsed.

Actual behaviour

The code stops reading the stream at some point, not parsing all records. No error is returned and parsing appears successful but in fact produces only part of the expected data. The exact location where it stops probably depends on how the block boundaries happen to coincide with the buffer size. It appears that if you have a large enough input file, it will likely stop at some point before reaching the end.

On both OpenJDK 17 (Eclipse Temurin-17.0.8.1+1) and on OpenJDK 21 (Eclipse Temurin-21+35) using the official Eclipse Temurin Docker images the above code outputs the following:

URL: https://ftp-trace.ncbi.nlm.nih.gov/1000genomes/ftp/release/20130502/ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz
Expected: 1103547 records
Parsed: 7395 records
@lbergelson
Copy link
Member

Yeah, this bug is awful. We worked around it in GATK but never really fixed it in htsjdk. I decided to message the java development mailing list last week about it. I wish I had done that years ago, since someone responded really quickly. After 12 years or so since it was first reported, this bug is going to be fixed soon. I'm hoping/asking that it be back ported to java 17 or at least 21.

See openjdk/jdk#17113

@jfuerth
Copy link

jfuerth commented Aug 27, 2024

Thanks for championing this fix in the JDK! I hit this issue when using VCFIterator on JDK 22, and I can now stream this file when running on the latest build of JDK 23.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
3 participants