CRAM random access very slow #1162
@eugenegardner Thanks for the issue report. It's certainly the case that doing this on a CRAM is slower than a BAM, but I'm not sure if what you're experiencing is something beyond the perf issues we know about (see #850 and #851, as well as #1065, although I don't think the latter is affecting you based on the code snippet above). A couple of questions:
Thanks for the reply @cmnbroad. I have read through the issues you referenced above. #850 seems like it could well be a similar case, but I wonder about the performance increase mentioned in that issue when compared to processing an identical BAM. They may be getting a 2-2.5x CRAM->CRAM improvement with their optimization, but I would imagine a CRAM vs. BAM comparison would still favor BAM; I am experiencing 10-20x longer times for CRAM than for BAM. To be clear, I am also accessing various information about the reads I retrieve (i.e. reference sequence, flags, etc.), so I am unsure where the slowdown may be. I also have not gone into the htsjdk source code to see where the true slowdown is, and was hoping that I was doing something fundamentally wrong. This is for a project that I am not really funded to work on, more of a pet project in which I want to properly support CRAM (i.e. I don't have too much time to work on it). As for your questions I didn't answer above:
I am unsure and am currently checking the results of my code. I would imagine so, based on the limited testing I have done thus far.
I am using the ReferenceSource constructor as provided by htsjdk:
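Something along these lines, i.e. constructing a ReferenceSource and passing it to SamReaderFactory (the FASTA path is a placeholder):

```java
import htsjdk.samtools.SamReader;
import htsjdk.samtools.SamReaderFactory;
import htsjdk.samtools.cram.ref.ReferenceSource;
import java.io.File;

// Reference source used for CRAM decoding; the FASTA path is a placeholder.
final ReferenceSource referenceSource =
        new ReferenceSource(new File("/path/to/reference.fasta"));

// localBamFile is the CRAM (or BAM) File being queried.
final SamReader reader = SamReaderFactory.makeDefault()
        .referenceSource(referenceSource)
        .open(localBamFile);
```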
I have dug into the source code for IGV to see if there is something special going on there to enable rapid CRAM access, but was unable to identify anything quickly.
@eugenegardner In very general terms, I think reading a CRAM serially in htsjdk is somewhere in the range of 2x-3x slower than reading a BAM. I would expect random query access to be even worse when compared to BAM, in part due to #850. Also FWIW, #1065 affects not only queryMate, but also queryAlignmentStart. Branches with proposed fixes for both #850 and #1065 exist (my fix for #1065 is a bit out of date and has no tests yet, but I need to rebase and finish it anyway). If you wanted, you could try running with one or the other to see if they help. In the meantime I'll try to see if we can get those merged soon.
@cmnbroad, 2x-3x slower than reading a BAM is a very accurate representation of the slowdown, and is what I see in my own code when I simply do:
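That is, a plain serial pass over the file, roughly like this (the paths are placeholders):

```java
import htsjdk.samtools.SAMRecord;
import htsjdk.samtools.SamReader;
import htsjdk.samtools.SamReaderFactory;
import htsjdk.samtools.cram.ref.ReferenceSource;
import java.io.File;

// Serial read over the whole file; the same loop is used for BAM and CRAM.
try (final SamReader reader = SamReaderFactory.makeDefault()
        .referenceSource(new ReferenceSource(new File("/path/to/reference.fasta")))
        .open(new File("/path/to/sample.cram"))) {
    for (final SAMRecord record : reader) {
        record.getContig();   // touch a few fields so records are actually decoded
        record.getFlags();
    }
}
```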
I will go ahead and grab #850 & #1065, see if they make a big difference, and report back.
Hi @cmnbroad, I have tried running with the branch for #850 and have not seen an appreciable increase in speed. I have dug into the problem a bit more and it appears that calling:
is taking the vast majority of the processing time (especially for the first read in the iterator). I don't quite understand why this is the case. Are reads decoded when this call is made, rather than at initialization of the underlying CRAMIterator? I am having a little trouble figuring out exactly where the holdup is because I am not too familiar with the htsjdk source code. I have used debug mode, and all the reads in a given container are loaded into the iterator very fast (on the order of 148 ms) compared to the actual call.
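A sketch of the kind of timing that shows this, assuming a SamReader opened with a ReferenceSource as in the earlier snippet (the contig and coordinates are placeholders):

```java
import htsjdk.samtools.SAMRecordIterator;

// Time the query call vs. the first advance of the iterator.
// 'reader' is a SamReader opened as in the earlier snippet; contig/coordinates are placeholders.
final long t0 = System.nanoTime();
final SAMRecordIterator iterator = reader.queryOverlapping("chr1", 1_000_000, 1_001_000);
final long t1 = System.nanoTime();
final boolean hasFirst = iterator.hasNext();   // first record of the query interval
final long t2 = System.nanoTime();
System.out.printf("query: %d ms, first hasNext(): %d ms%n",
        (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000);
iterator.close();
```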
@eugenegardner Thanks for trying out #850 - I was hoping it would help, but it's useful to know that it didn't in your case. I just started reviewing it yesterday, and it's going to need some changes, so in the interest of limiting the number of variables in the mix, I would suggest reverting to master for now if it's not helping. Reading CRAMs is inherently bursty because the smallest addressable unit is the container.
Hi folks, just wanted to say that I updated to v2.22.0 with all of its CRAM improvements and noticed a fairly reasonable speed-up. Previously, instead of relying on the old CRAM implementation, I had implemented a hack which made system calls to samtools (ugly, I know, but I didn't have a lot of time to do something better). This was reasonably fast, but I always intended to fix it. The new CRAM implementation in v2.22.0 is still slower than those system calls, but instead of an order of magnitude difference (>10x slower) it now appears to be in the 1-2x range. I can't tell precisely what the difference is yet, as I haven't set up controlled tests on a private server, but it's pretty obvious the improvements make a big difference. Thanks for doing all you do!
@eugenegardner That's great to hear - thanks for letting us know!
Hello,
I've searched around a bit to see if there is a simple answer to my problem (like #722), but could not determine what performance hit people are accepting when using CRAM vs. BAM in htsjdk.
My current code looks something like:
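In outline (the reference path, contig, and interval below are placeholders):

```java
import htsjdk.samtools.SAMRecord;
import htsjdk.samtools.SAMRecordIterator;
import htsjdk.samtools.SamReader;
import htsjdk.samtools.SamReaderFactory;
import htsjdk.samtools.cram.ref.ReferenceSource;
import java.io.File;

// Open the reader once, with a reference for CRAM decoding.
final SamReader reader = SamReaderFactory.makeDefault()
        .referenceSource(new ReferenceSource(new File("/path/to/reference.fasta")))
        .open(localBamFile);

// Repeated for each genomic coordinate of interest.
final SAMRecordIterator iterator = reader.queryOverlapping("chr1", 1_000_000, 1_001_000);
while (iterator.hasNext()) {
    final SAMRecord record = iterator.next();
    record.getFlags();           // inspect flags, reference sequence, etc.
    record.getReferenceName();
}
iterator.close();
```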
localBamFile can be either a CRAM or BAM file.
I am seeing terrible performance when doing this over several genomic coordinates. This does not appear to extend to a simple genome walker that goes from one end of the genome to the other. Can anybody point out what I am doing wrong, or whether this is a known issue with htsjdk? I have resorted to making an external call to samtools to get around this issue (which improves performance dramatically).
Thanks!