Why annotation differences between NCBI and UCSC? #147
Can you tell us what you did on the table browser and why you're using the
table browser at all?
If you want to get the FASTA file, the genome assembly page has a link to
the FASTA file and that's in the same format as NCBI. Isn't that a lot
easier than going through the table browser? Maybe we should make this link
more prominent?
thanks!
Max
On Mon, Oct 28, 2024 at 4:45 PM Anton Nekrutenko wrote:
Suppose I'm trying to obtain annotation data for Plasmodium vivax
(GCF_000002415.2). If I download data from NCBI datasets and do a simple
grep on sequence headers, I get this:
>NC_009906.1 Plasmodium vivax chromosome 1, whole genome shotgun sequence
>NC_009907.1 Plasmodium vivax chromosome 2, whole genome shotgun sequence
>NC_009908.2 Plasmodium vivax chromosome 3, whole genome shotgun sequence
>NC_009909.1 Plasmodium vivax chromosome 4, whole genome shotgun sequence
>NC_009910.1 Plasmodium vivax chromosome 5, whole genome shotgun sequence
>NC_009911.1 Plasmodium vivax chromosome 6, whole genome shotgun sequence
>NC_009912.1 Plasmodium vivax chromosome 7, whole genome shotgun sequence
>NC_009913.1 Plasmodium vivax chromosome 8, whole genome shotgun sequence
>NC_009914.1 Plasmodium vivax chromosome 9, whole genome shotgun sequence
>NC_009915.1 Plasmodium vivax chromosome 10, whole genome shotgun sequence
>NC_009916.1 Plasmodium vivax chromosome 11, whole genome shotgun sequence
>NC_009917.1 Plasmodium vivax chromosome 12, whole genome shotgun sequence
>NC_009918.1 Plasmodium vivax chromosome 13, whole genome shotgun sequence
>NC_009919.1 Plasmodium vivax chromosome 14, whole genome shotgun sequence
If I download the same data from the UCSC Table Browser for the
same accession, I get this:
>GCF_000002415.2_assembly_NC_007243.1 range=NC_007243.1:1-5990 5'pad=0 3'pad=0 strand=+ repeatMasking=none
>GCF_000002415.2_assembly_AAKM01000014.1 range=NC_009906.1:1-565852 5'pad=0 3'pad=0 strand=+ repeatMasking=none
>GCF_000002415.2_assembly_AAKM01000023.2 range=NC_009906.1:569853-830022 5'pad=0 3'pad=0 strand=+ repeatMasking=none
>GCF_000002415.2_assembly_AAKM01000024.1 range=NC_009907.1:1-162059 5'pad=0 3'pad=0 strand=+ repeatMasking=none
>GCF_000002415.2_assembly_AAKM01000013.1 range=NC_009907.1:165060-755035 5'pad=0 3'pad=0 strand=+ repeatMasking=none
>GCF_000002415.2_assembly_AAKM01000015.1 range=NC_009908.2:1-533272 5'pad=0 3'pad=0 strand=- repeatMasking=none
>GCF_000002415.2_assembly_AAKM01000020.1 range=NC_009908.2:539273-927448 5'pad=0 3'pad=0 strand=- repeatMasking=none
>GCF_000002415.2_assembly_AAKM01000045.1 range=NC_009908.2:928949-942913 5'pad=0 3'pad=0 strand=+ repeatMasking=none
>GCF_000002415.2_assembly_AAKM01000030.1 range=NC_009908.2:949914-985794 5'pad=0 3'pad=0 strand=+ repeatMasking=none
>GCF_000002415.2_assembly_AAKM01000057.1 range=NC_009908.2:989795-999975 5'pad=0 3'pad=0 strand=+ repeatMasking=none
>GCF_000002415.2_assembly_AAKM01000053.1 range=NC_009908.2:1000476-1011127 5'pad=0 3'pad=0 strand=+ repeatMasking=none
>GCF_000002415.2_assembly_AAKM01000037.2 range=NC_009909.1:1-15630 5'pad=0 3'pad=0 strand=+ repeatMasking=none
>GCF_000002415.2_assembly_AAKM01000019.1 range=NC_009909.1:17131-430684 5'pad=0 3'pad=0 strand=+ repeatMasking=none
>GCF_000002415.2_assembly_AAKM01000018.1 range=NC_009909.1:432485-876622 5'pad=0 3'pad=0 strand=- repeatMasking=none
>GCF_000002415.2_assembly_AAKM01000006.1 range=NC_009910.1:1-1370936 5'pad=0 3'pad=0 strand=+ repeatMasking=none
>GCF_000002415.2_assembly_AAKM01000021.1 range=NC_009911.1:1-329199 5'pad=0 3'pad=0 strand=+ repeatMasking=none
>GCF_000002415.2_assembly_AAKM01002769.1 range=NC_009911.1:332200-1033388 5'pad=0 3'pad=0 strand=+ repeatMasking=none
>GCF_000002415.2_assembly_AAKM01000007.1 range=NC_009912.1:1-1198945 5'pad=0 3'pad=0 strand=+ repeatMasking=none
>GCF_000002415.2_assembly_AAKM01002770.1 range=NC_009912.1:1199046-1497819 5'pad=0 3'pad=0 strand=- repeatMasking=none
>GCF_000002415.2_assembly_AAKM01000008.1 range=NC_009913.1:1-1165049 5'pad=0 3'pad=0 strand=+ repeatMasking=none
>GCF_000002415.2_assembly_AAKM01000017.1 range=NC_009913.1:1165250-1678596 5'pad=0 3'pad=0 strand=- repeatMasking=none
>GCF_000002415.2_assembly_AAKM01000005.1 range=NC_009914.1:1-1923364 5'pad=0 3'pad=0 strand=+ repeatMasking=none
>GCF_000002415.2_assembly_AAKM01000011.1 range=NC_009915.1:1-895497 5'pad=0 3'pad=0 strand=+ repeatMasking=none
>GCF_000002415.2_assembly_AAKM01000016.1 range=NC_009915.1:898298-1419739 5'pad=0 3'pad=0 strand=- repeatMasking=none
>GCF_000002415.2_assembly_AAKM01000003.1 range=NC_009916.1:1-2021996 5'pad=0 3'pad=0 strand=- repeatMasking=none
>GCF_000002415.2_assembly_AAKM01000027.1 range=NC_009916.1:2025997-2067354 5'pad=0 3'pad=0 strand=- repeatMasking=none
>GCF_000002415.2_assembly_AAKM01000009.1 range=NC_009917.1:1-1012632 5'pad=0 3'pad=0 strand=- repeatMasking=none
>GCF_000002415.2_assembly_AAKM01000004.1 range=NC_009917.1:1018633-3004884 5'pad=0 3'pad=0 strand=+ repeatMasking=none
>GCF_000002415.2_assembly_AAKM01000002.1 range=NC_009918.1:1-2031768 5'pad=0 3'pad=0 strand=+ repeatMasking=none
>GCF_000002415.2_assembly_AAKM01000001.1 range=NC_009919.1:1-2132794 5'pad=0 3'pad=0 strand=+ repeatMasking=none
>GCF_000002415.2_assembly_AAKM01000010.1 range=NC_009919.1:2135795-3120417 5'pad=0 3'pad=0 strand=+ repeatMasking=none
The specific problems are:
It looks like UCSC splits chromosomes on gaps:
In NCBI you have this:
>NC_009906.1 Plasmodium vivax chromosome 1, whole genome shotgun sequence
In UCSC you have this:
>GCF_000002415.2_assembly_AAKM01000014.1 range=NC_009906.1:1-565852 5'pad=0 3'pad=0 strand=+ repeatMasking=none
>GCF_000002415.2_assembly_AAKM01000023.2 range=NC_009906.1:569853-830022 5'pad=0 3'pad=0 strand=+ repeatMasking=none
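To check whether the pieces really do break at gaps, one can compare the `range=` fields of consecutive headers. The following is a minimal Python sketch, not anything UCSC or NCBI provides. The function name `gaps_between_pieces` is made up for illustration, and it assumes every header carries a `range=chrom:start-end` field in the form shown above. It reports the 1-based intervals of each chromosome that no piece covers.

```python
import re
from collections import defaultdict

# Assumed header layout (taken from the output quoted above):
#   >name range=CHROM:START-END 5'pad=0 3'pad=0 strand=+ repeatMasking=none
RANGE_RE = re.compile(r"range=([^\s:]+):(\d+)-(\d+)")


def gaps_between_pieces(headers):
    """Return {chrom: [(gap_start, gap_end), ...]} for uncovered intervals
    that lie between consecutive pieces of the same chromosome."""
    pieces = defaultdict(list)
    for header in headers:
        match = RANGE_RE.search(header)
        if match:  # headers without a range= field are ignored
            chrom, start, end = match.group(1), int(match.group(2)), int(match.group(3))
            pieces[chrom].append((start, end))

    gaps = {}
    for chrom, spans in pieces.items():
        spans.sort()
        gaps[chrom] = [
            (prev_end + 1, start - 1)
            for (_, prev_end), (start, _) in zip(spans, spans[1:])
            if start > prev_end + 1
        ]
    return gaps


example_headers = [
    ">GCF_000002415.2_assembly_AAKM01000014.1 range=NC_009906.1:1-565852 5'pad=0 3'pad=0 strand=+ repeatMasking=none",
    ">GCF_000002415.2_assembly_AAKM01000023.2 range=NC_009906.1:569853-830022 5'pad=0 3'pad=0 strand=+ repeatMasking=none",
]
example_gaps = gaps_between_pieces(example_headers)
```

For the two chromosome 1 headers above, the uncovered interval is 565853-569852, i.e. 4000 bases between the two pieces. Whether that interval is an assembly gap (a run of Ns) would still need to be confirmed against the NCBI sequence.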
UCSC injects additional stuff in FASTA headers
Things like >GCF_000002415.2_assembly_AAKM01000014.1 in FASTA headers
make them unusable. This is because GTFs obtained from UCSC list only the
accession:
NC_009906.1 hub_3894797_GCF_000002415.2_hub_3894797_ncbiRefSeq exon 8396 8646 0.000000 + . gene_id "XM_001613345.1"; transcript_id "XM_001613345.1";
NC_009906.1 hub_3894797_GCF_000002415.2_hub_3894797_ncbiRefSeq exon 8787 8854 0.000000 + . gene_id "XM_001613345.1"; transcript_id "XM_001613345.1";
If you are trying to do any joins (e.g., when prepping a snpEff database),
you run into the problem of comparing NC_009906.1 against GCF_000002415.2_assembly_AAKM01000014.1
range=NC_009906.1:1-565852 5'pad=0 3'pad=0 strand=+ repeatMasking=none
and the join fails, of course.
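The mismatch can be illustrated with a short Python sketch. The helper name `chrom_from_header` is hypothetical, and it assumes the two header layouts quoted in this issue: NCBI headers start with the accession, and Table Browser headers carry it inside a `range=` field. It derives a join key from a FASTA header that matches the first column of the GTF.

```python
import re

# Table Browser style: the chromosome accession sits inside range=CHROM:START-END
RANGE_RE = re.compile(r"range=([^\s:]+):\d+-\d+")


def chrom_from_header(header):
    """Return the chromosome accession to use as a join key.

    Falls back to the first whitespace-delimited token (NCBI style)
    when the header has no range= field.
    """
    match = RANGE_RE.search(header)
    if match:
        return match.group(1)
    return header.lstrip(">").split()[0]


ucsc_header = (">GCF_000002415.2_assembly_AAKM01000014.1 "
               "range=NC_009906.1:1-565852 5'pad=0 3'pad=0 strand=+ repeatMasking=none")
ncbi_header = ">NC_009906.1 Plasmodium vivax chromosome 1, whole genome shotgun sequence"
gtf_seqname = "NC_009906.1"  # first column of the GTF lines quoted above

naive_key = ucsc_header.lstrip(">").split()[0]   # what a plain join would compare
fixed_key = chrom_from_header(ucsc_header)
```

Here `naive_key` is `GCF_000002415.2_assembly_AAKM01000014.1`, which never equals the GTF sequence name, while `fixed_key` does. Note that renaming alone does not make the split pieces a usable reference: several pieces would share one name, and the GTF coordinates refer to the full chromosome, so an unsplit FASTA is still needed.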
Solutions
Can UCSC:
1. Not split the chromosomes and have them exactly as in NCBI
2. Not inject additional stuff into FASTA headers?
I'll demonstrate a use case on Thursday.
For the types of users we're targeting here, the table browser does not sound like a tool they should use. You want to give them pre-cooked workflows, something where they just have to push a button. I thought you wanted to hide sequence retrieval under some tool and not make them suffer from our table browser? Also, I doubt that we can change the TB sequence ID output format now; it's been like this for 20 years...
OK. But is splitting on gaps necessary?
The TB doesn't split on gaps, at least not by default; this is the first time I'm hearing of this. If you just output the FASTA for a plain track that covers all chromosomes, it shouldn't split the sequence. What exactly did you click on in the table browser?