
CRAM random access very slow #1162

Open
eugenegardner opened this issue Jul 6, 2018 · 8 comments

@eugenegardner

Hello,

I've searched around a bit to see if there is a simple answer to my problem (like #722), but could not determine what performance hit people are accepting when using CRAM vs. BAM in htsjdk.

My current code looks something like:

SamReaderFactory factory = SamReaderFactory.makeDefault()
        .validationStringency(ValidationStringency.SILENT)
        .referenceSource(refSource);
SamReader mySam = factory.open(SamInputResource.of(localBamFile).index(localBamFileIndex));
SAMRecordIterator itr = mySam.queryContained(chr, start, end);
while (itr.hasNext()) {
   SAMRecord insert = itr.next();
   String readName = insert.getReadName();
   // Obviously do some other stuff here as well
}
itr.close();
mySam.close();

localBamFile can be either a CRAM or BAM file.

I am seeing terrible performance when doing this over several genomic coordinates. This does not appear to extend to a simple genome walker that goes from one end of the genome to another. Can anybody point me to what I am doing wrong or if this is a known issue with htsjdk? I have resorted to making an external call to samtools to get around this issue (and improve performance dramatically).
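(One mitigation for many-region queries, independent of which branch of htsjdk you run, is to sort and merge the regions first so each CRAM container is only decoded once; htsjdk has a similar helper, `QueryInterval.optimizeIntervals`, for its array-based `query` call. Below is a minimal stdlib-only sketch of the merge step — the `Interval` type here is illustrative and is not htsjdk's `QueryInterval`:)

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class IntervalMerge {
    // Simple interval on a single contig; illustrative only, not htsjdk API.
    record Interval(int start, int end) {}

    // Sort by start and merge overlapping intervals so that a subsequent
    // pass over the file visits (and decodes) each container at most once.
    static List<Interval> merge(List<Interval> intervals) {
        List<Interval> sorted = new ArrayList<>(intervals);
        sorted.sort(Comparator.comparingInt(Interval::start));
        List<Interval> out = new ArrayList<>();
        for (Interval iv : sorted) {
            if (!out.isEmpty() && iv.start() <= out.get(out.size() - 1).end()) {
                // Overlaps the previous interval: extend it instead of re-querying.
                Interval last = out.remove(out.size() - 1);
                out.add(new Interval(last.start(), Math.max(last.end(), iv.end())));
            } else {
                out.add(iv);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Three unsorted, partially overlapping regions collapse to two.
        System.out.println(merge(List.of(
                new Interval(500, 900), new Interval(100, 300), new Interval(250, 400))));
    }
}
```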

Thanks!

@cmnbroad cmnbroad self-assigned this Jul 9, 2018
@cmnbroad
Collaborator

@eugenegardner Thanks for the issue report. It's certainly the case that doing this on a CRAM is slower than a BAM, but I'm not sure if what you're experiencing is something beyond the perf issues we know about (see #850 and #851, as well as #1065, although I don't think the latter is affecting you based on the code snippet above). A couple of questions:

  • are you using .crai or .bai?
  • in the cases where performance is slow, do you know if the results you get are correct (i.e., do they match what you get from samtools)?
  • what implementation is backing the reference source - is it the ReferenceSource implementation built into htsjdk, or something else?
@eugenegardner
Author

eugenegardner commented Jul 10, 2018

Thanks for the reply @cmnbroad. I have read through the issues you referenced above:

#850: This very well might be a similar case, but I wonder about the performance increase mentioned in that issue compared to processing an identical BAM. They may be seeing a 2-2.5x improvement in a CRAM-to-CRAM comparison with their optimization, but I would imagine that CRAM-to-BAM performance would still favor BAM. I am experiencing 10-20x longer run times for CRAM vs. BAM.
#851: I have tried both bai and crai indices in the past. I have not done it with the more recent releases of htsjdk so I will go back and test again.
#1065: I do use queryMate() in my code in other locations, but have yet to run into this issue.

To be clear, I am also accessing various information about the reads I do retrieve (i.e. retrieving read ref sequence, flags, etc.) so I am unsure where the slow-down may be. I also have not tried to go into the htsjdk source code and see where the true slowdown is and was hoping that I was doing something fundamentally wrong. This is for a project that I am not really funded to work on and more of a pet project in which I want to properly implement CRAM (i.e. I don't have too much time to work on it...).

As for your questions that I didn't answer above:

in the cases when performance is slow, do you know if the results you get are correct (i.e., do they match what you get from samtools ?)

I am unsure and currently checking the results of my code. I would imagine so based off the limited testing I have done thus far.

what implementation is backing the reference source - is it the ReferenceSource implementation built in to htsjdk, or something else

I am using the ReferenceSource constructor as provided by htsjdk:

ReferenceSource refSource = new ReferenceSource(new File("hg19.fa"));

I have dug into the source code for IGV to see if there is something special going on there to enable rapid CRAM access, but was unable to identify anything quickly.

@cmnbroad
Collaborator

@eugenegardner In very general terms, I think reading a CRAM serially in htsjdk is somewhere in the range of 2x-3x slower than reading a BAM. I would expect random query access to be even worse relative to BAM, in part due to #850. Also FWIW, #1065 affects not only queryMate, but also queryAlignmentStart.

Branches with proposed fixes for both #850 and #1065 exist (my fix for #1065 is a bit out of date, and has no tests yet, but I need to rebase and finish it anyway). If you wanted, you could try running with one or the other to see if they help. In the meantime I'll try to see if we can get those merged soon.

@eugenegardner
Author

@cmnbroad, 2x-3x slower than reading a BAM is a very accurate representation of the slowdown, and it is what I see in my own code when I simply do:

SamReaderFactory factory = SamReaderFactory.makeDefault()
        .validationStringency(ValidationStringency.SILENT)
        .referenceSource(refSource);
SamReader mySam = factory.open(SamInputResource.of(localBamFile).index(localBamFileIndex));
SAMRecordIterator itr = mySam.iterator();

I will go ahead and grab #850 & #1065 and see if they make a big difference and report back.

@eugenegardner
Author

eugenegardner commented Jul 13, 2018

Hi @cmnbroad, I have tried the branch for #850 and have not seen an appreciable increase in speed. I have dug into the problem a bit more, and it appears that calling:

SAMRecord record = itr.next();

is taking the vast majority of the processing time (especially for the first read in the iterator). I don't quite understand why this is the case. Are reads decoded when this call is made, rather than at initialization of the underlying CRAMIterator? I am having a little trouble figuring out exactly where the holdup is because I am not too familiar with the htsjdk source code. Using debug mode, I can see that all the reads in a given container are loaded into the iterator very quickly (on the order of 148 ms) compared to the actual next() call.

@cmnbroad
Collaborator

cmnbroad commented Jul 13, 2018

@eugenegardner Thanks for trying out #850 - I was hoping it would help, but it's useful to know that it didn't in your case. I just started reviewing it yesterday, and it's going to need some changes, so in the interest of limiting the number of variables in the mix, I would suggest reverting to master for now if it's not helping.

Reading CRAMs is inherently bursty because the smallest addressable unit is the slice. In order to access one read, the entire slice is processed (HTSJDK is not particularly smart about this; I suspect some of this processing could be deferred and done lazily on demand, but for now it's done all at once). HTSJDK uses a default of 10,000 records per slice when writing CRAMs, but other implementations can use different values. So the processing that's done when you call next() depends on a lot of factors, such as whether the particular call crosses a slice/container boundary, which implementation created the CRAM you're reading, what parameters were used, and whether or not the iterator is using an index to satisfy a bounded query.
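(The burstiness described above can be made concrete with back-of-the-envelope arithmetic: if a slice holds 10,000 records and a bounded query only wants a handful of reads from it, nearly all of the decode work is amortized over very few returned reads. A stdlib-only sketch - the per-record decode cost is a made-up number purely for illustration, not a measured htsjdk figure:)

```java
public class SliceCostSketch {
    // Hypothetical per-record decode cost in microseconds; purely illustrative.
    static final double DECODE_US_PER_RECORD = 2.0;
    // HTSJDK's default records-per-slice when writing CRAM, per the comment above.
    static final int RECORDS_PER_SLICE = 10_000;

    // A bounded query that returns `wanted` reads still decodes the whole
    // slice containing them, so the cost per returned read explodes as
    // `wanted` shrinks.
    static double microsPerReturnedRead(int wanted) {
        double sliceCost = RECORDS_PER_SLICE * DECODE_US_PER_RECORD;
        return sliceCost / wanted;
    }

    public static void main(String[] args) {
        // A sequential scan uses every decoded record; a 10-read query pays
        // for all 10,000 records anyway.
        System.out.printf("sequential scan: %.0f us/read%n",
                microsPerReturnedRead(RECORDS_PER_SLICE));
        System.out.printf("10-read query:   %.0f us/read%n",
                microsPerReturnedRead(10));
    }
}
```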

@eugenegardner
Author

Hi folks,

Just wanted to say that I moved to v2.22.0, with all of its CRAM improvements, and noticed a fairly reasonable speed-up.

Previously, instead of relying on the old CRAM implementation, I had implemented a hack that made system calls to samtools (ugly, I know, but I didn't have a lot of time to do something better). This was reasonably fast, but I always intended to fix it. The new CRAM implementation in v2.22.0 is still slower than those system calls, but instead of being an order of magnitude slower (>10x), it now appears to be in the 1-2x range. I can't tell precisely what the difference is yet, as I haven't set up controlled tests on a private server, but it's pretty obvious the improvements make a big difference.

Thanks for doing all you do!

@cmnbroad
Collaborator

@eugenegardner That's great to hear - thanks for letting us know!
