Why annotation differences between NCBI and UCSC? #147

Open
nekrut opened this issue Oct 28, 2024 · 5 comments

@nekrut
Contributor

nekrut commented Oct 28, 2024

Suppose I'm trying to obtain annotation data for Plasmodium vivax (GCF_000002415.2). If I download data from NCBI datasets and do a simple grep on sequence headers, I get this:

>NC_009906.1 Plasmodium vivax chromosome 1, whole genome shotgun sequence
>NC_009907.1 Plasmodium vivax chromosome 2, whole genome shotgun sequence
>NC_009908.2 Plasmodium vivax chromosome 3, whole genome shotgun sequence
>NC_009909.1 Plasmodium vivax chromosome 4, whole genome shotgun sequence
>NC_009910.1 Plasmodium vivax chromosome 5, whole genome shotgun sequence
>NC_009911.1 Plasmodium vivax chromosome 6, whole genome shotgun sequence
>NC_009912.1 Plasmodium vivax chromosome 7, whole genome shotgun sequence
>NC_009913.1 Plasmodium vivax chromosome 8, whole genome shotgun sequence
>NC_009914.1 Plasmodium vivax chromosome 9, whole genome shotgun sequence
>NC_009915.1 Plasmodium vivax chromosome 10, whole genome shotgun sequence
>NC_009916.1 Plasmodium vivax chromosome 11, whole genome shotgun sequence
>NC_009917.1 Plasmodium vivax chromosome 12, whole genome shotgun sequence
>NC_009918.1 Plasmodium vivax chromosome 13, whole genome shotgun sequence
>NC_009919.1 Plasmodium vivax chromosome 14, whole genome shotgun sequence
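
The same header listing can be produced without grep. This is a minimal sketch, assuming the NCBI download has been unpacked to an uncompressed FASTA file; the function name `fasta_headers` and the temporary demo file are hypothetical and only there for illustration.

```python
import os
import tempfile


def fasta_headers(path):
    """Yield every FASTA header line (a line starting with '>') in the file."""
    with open(path, "r", encoding="utf-8") as handle:
        for line in handle:
            if line.startswith(">"):
                yield line.rstrip("\n")


# Tiny demo file standing in for the real download (sequence content is made up).
demo = (
    ">NC_009906.1 Plasmodium vivax chromosome 1, whole genome shotgun sequence\n"
    "ACGTACGT\n"
    ">NC_009907.1 Plasmodium vivax chromosome 2, whole genome shotgun sequence\n"
    "TTGACA\n"
)
with tempfile.NamedTemporaryFile("w", suffix=".fna", delete=False) as tmp:
    tmp.write(demo)
    demo_path = tmp.name

found = list(fasta_headers(demo_path))
os.remove(demo_path)
for header in found:
    print(header)
```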

If I download the same data from the UCSC Table Browser for the same accession, I get this:

>GCF_000002415.2_assembly_NC_007243.1 range=NC_007243.1:1-5990 5'pad=0 3'pad=0 strand=+ repeatMasking=none
>GCF_000002415.2_assembly_AAKM01000014.1 range=NC_009906.1:1-565852 5'pad=0 3'pad=0 strand=+ repeatMasking=none
>GCF_000002415.2_assembly_AAKM01000023.2 range=NC_009906.1:569853-830022 5'pad=0 3'pad=0 strand=+ repeatMasking=none
>GCF_000002415.2_assembly_AAKM01000024.1 range=NC_009907.1:1-162059 5'pad=0 3'pad=0 strand=+ repeatMasking=none
>GCF_000002415.2_assembly_AAKM01000013.1 range=NC_009907.1:165060-755035 5'pad=0 3'pad=0 strand=+ repeatMasking=none
>GCF_000002415.2_assembly_AAKM01000015.1 range=NC_009908.2:1-533272 5'pad=0 3'pad=0 strand=- repeatMasking=none
>GCF_000002415.2_assembly_AAKM01000020.1 range=NC_009908.2:539273-927448 5'pad=0 3'pad=0 strand=- repeatMasking=none
>GCF_000002415.2_assembly_AAKM01000045.1 range=NC_009908.2:928949-942913 5'pad=0 3'pad=0 strand=+ repeatMasking=none
>GCF_000002415.2_assembly_AAKM01000030.1 range=NC_009908.2:949914-985794 5'pad=0 3'pad=0 strand=+ repeatMasking=none
>GCF_000002415.2_assembly_AAKM01000057.1 range=NC_009908.2:989795-999975 5'pad=0 3'pad=0 strand=+ repeatMasking=none
>GCF_000002415.2_assembly_AAKM01000053.1 range=NC_009908.2:1000476-1011127 5'pad=0 3'pad=0 strand=+ repeatMasking=none
>GCF_000002415.2_assembly_AAKM01000037.2 range=NC_009909.1:1-15630 5'pad=0 3'pad=0 strand=+ repeatMasking=none
>GCF_000002415.2_assembly_AAKM01000019.1 range=NC_009909.1:17131-430684 5'pad=0 3'pad=0 strand=+ repeatMasking=none
>GCF_000002415.2_assembly_AAKM01000018.1 range=NC_009909.1:432485-876622 5'pad=0 3'pad=0 strand=- repeatMasking=none
>GCF_000002415.2_assembly_AAKM01000006.1 range=NC_009910.1:1-1370936 5'pad=0 3'pad=0 strand=+ repeatMasking=none
>GCF_000002415.2_assembly_AAKM01000021.1 range=NC_009911.1:1-329199 5'pad=0 3'pad=0 strand=+ repeatMasking=none
>GCF_000002415.2_assembly_AAKM01002769.1 range=NC_009911.1:332200-1033388 5'pad=0 3'pad=0 strand=+ repeatMasking=none
>GCF_000002415.2_assembly_AAKM01000007.1 range=NC_009912.1:1-1198945 5'pad=0 3'pad=0 strand=+ repeatMasking=none
>GCF_000002415.2_assembly_AAKM01002770.1 range=NC_009912.1:1199046-1497819 5'pad=0 3'pad=0 strand=- repeatMasking=none
>GCF_000002415.2_assembly_AAKM01000008.1 range=NC_009913.1:1-1165049 5'pad=0 3'pad=0 strand=+ repeatMasking=none
>GCF_000002415.2_assembly_AAKM01000017.1 range=NC_009913.1:1165250-1678596 5'pad=0 3'pad=0 strand=- repeatMasking=none
>GCF_000002415.2_assembly_AAKM01000005.1 range=NC_009914.1:1-1923364 5'pad=0 3'pad=0 strand=+ repeatMasking=none
>GCF_000002415.2_assembly_AAKM01000011.1 range=NC_009915.1:1-895497 5'pad=0 3'pad=0 strand=+ repeatMasking=none
>GCF_000002415.2_assembly_AAKM01000016.1 range=NC_009915.1:898298-1419739 5'pad=0 3'pad=0 strand=- repeatMasking=none
>GCF_000002415.2_assembly_AAKM01000003.1 range=NC_009916.1:1-2021996 5'pad=0 3'pad=0 strand=- repeatMasking=none
>GCF_000002415.2_assembly_AAKM01000027.1 range=NC_009916.1:2025997-2067354 5'pad=0 3'pad=0 strand=- repeatMasking=none
>GCF_000002415.2_assembly_AAKM01000009.1 range=NC_009917.1:1-1012632 5'pad=0 3'pad=0 strand=- repeatMasking=none
>GCF_000002415.2_assembly_AAKM01000004.1 range=NC_009917.1:1018633-3004884 5'pad=0 3'pad=0 strand=+ repeatMasking=none
>GCF_000002415.2_assembly_AAKM01000002.1 range=NC_009918.1:1-2031768 5'pad=0 3'pad=0 strand=+ repeatMasking=none
>GCF_000002415.2_assembly_AAKM01000001.1 range=NC_009919.1:1-2132794 5'pad=0 3'pad=0 strand=+ repeatMasking=none
>GCF_000002415.2_assembly_AAKM01000010.1 range=NC_009919.1:2135795-3120417 5'pad=0 3'pad=0 strand=+ repeatMasking=none

The specific problems are:

It looks like UCSC splits chromosomes on gaps:

In NCBI you have this:

>NC_009906.1 Plasmodium vivax chromosome 1, whole genome shotgun sequence

In UCSC you have this:

>GCF_000002415.2_assembly_AAKM01000014.1 range=NC_009906.1:1-565852 5'pad=0 3'pad=0 strand=+ repeatMasking=none
>GCF_000002415.2_assembly_AAKM01000023.2 range=NC_009906.1:569853-830022 5'pad=0 3'pad=0 strand=+ repeatMasking=none
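
To check that the pieces really are separated by gaps, the `range=` fields can be compared directly. The sketch below assumes the coordinates in `range=` are 1-based and inclusive (an assumption based on the ranges starting at 1); the function name `gaps_between_pieces` is hypothetical.

```python
import re
from collections import defaultdict

# Matches the "range=ACCESSION:START-END" field seen in the UCSC headers above.
RANGE_RE = re.compile(r"range=([^\s:]+):(\d+)-(\d+)")


def gaps_between_pieces(headers):
    """Return {accession: [gap_length, ...]} between consecutive pieces."""
    pieces = defaultdict(list)
    for header in headers:
        match = RANGE_RE.search(header)
        if match is None:
            continue  # not a Table Browser style header
        accession = match.group(1)
        pieces[accession].append((int(match.group(2)), int(match.group(3))))

    gaps = {}
    for accession, spans in pieces.items():
        spans.sort()
        # Assuming 1-based inclusive coordinates, the number of bases
        # skipped between two pieces is next_start - previous_end - 1.
        gaps[accession] = [nxt[0] - prev[1] - 1 for prev, nxt in zip(spans, spans[1:])]
    return gaps


headers = [
    ">GCF_000002415.2_assembly_AAKM01000014.1 range=NC_009906.1:1-565852 5'pad=0 3'pad=0 strand=+ repeatMasking=none",
    ">GCF_000002415.2_assembly_AAKM01000023.2 range=NC_009906.1:569853-830022 5'pad=0 3'pad=0 strand=+ repeatMasking=none",
]
result = gaps_between_pieces(headers)
print(result)  # {'NC_009906.1': [4000]}
```

For chromosome 1 this reports 4000 bases missing between the two pieces, which is consistent with a gap in the assembly rather than overlapping or adjacent records.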

UCSC injects additional stuff in FASTA headers

Things like >GCF_000002415.2_assembly_AAKM01000014.1 in FASTA headers make them unusable. This is because GTFs obtained from UCSC list only the accession:

NC_009906.1	hub_3894797_GCF_000002415.2_hub_3894797_ncbiRefSeq	exon	8396	8646	0.000000	+	.	gene_id "XM_001613345.1"; transcript_id "XM_001613345.1";
NC_009906.1	hub_3894797_GCF_000002415.2_hub_3894797_ncbiRefSeq	exon	8787	8854	0.000000	+	.	gene_id "XM_001613345.1"; transcript_id "XM_001613345.1";

If you try to do any joins (e.g., when prepping a snpEff database), you end up comparing NC_009906.1 against GCF_000002415.2_assembly_AAKM01000014.1 range=NC_009906.1:1-565852 5'pad=0 3'pad=0 strand=+ repeatMasking=none
and the join fails, of course.
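
A possible user-side workaround for the name mismatch is to take the accession from the `range=` field instead of the first token of the header. This is only a sketch under that assumption; the helper names `seqname_from_header` and `gtf_seqnames` are hypothetical, and this is not how snpEff or UCSC handle it.

```python
import re

# Accession inside the "range=ACCESSION:START-END" field of a Table Browser header.
RANGE_RE = re.compile(r"range=([^\s:]+):(\d+)-(\d+)")


def seqname_from_header(header):
    """Return the accession a FASTA header refers to.

    Uses the range= field when present (Table Browser style); otherwise
    falls back to the first whitespace-delimited token (NCBI style).
    """
    match = RANGE_RE.search(header)
    if match:
        return match.group(1)
    return header.lstrip(">").split()[0]


def gtf_seqnames(gtf_lines):
    """Collect the first column (seqname) of every non-comment GTF line."""
    return {
        line.split("\t", 1)[0]
        for line in gtf_lines
        if line.strip() and not line.startswith("#")
    }


ucsc_header = (
    ">GCF_000002415.2_assembly_AAKM01000014.1 range=NC_009906.1:1-565852 "
    "5'pad=0 3'pad=0 strand=+ repeatMasking=none"
)
gtf_line = (
    "NC_009906.1\thub_3894797_GCF_000002415.2_hub_3894797_ncbiRefSeq\texon\t"
    '8396\t8646\t0.000000\t+\t.\tgene_id "XM_001613345.1"; transcript_id "XM_001613345.1";'
)

names = gtf_seqnames([gtf_line])
raw_id = ucsc_header[1:].split()[0]
print(raw_id in names)                            # False: the naive join fails
print(seqname_from_header(ucsc_header) in names)  # True: names now agree
```

Note that this only reconciles the names. The GTF coordinates refer to the whole chromosome, so a piece that starts at position 569853 would still need its offset taken into account before features could be placed on it.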

Solutions

Can UCSC:

  1. Not split the chromosomes and have them exactly as in NCBI
  2. Not inject additional stuff into FASTA headers?
@maximilianh

maximilianh commented Oct 28, 2024 via email

Can you tell us what you did on the table browser and why you're using the table browser at all? If you want to get the FASTA file, the genome assembly page has a link to the FASTA file and that's in the same format as NCBI. Isn't that a lot easier than going through the table browser? Maybe we should make this link more prominent? thanks! Max

@nekrut
Contributor Author

nekrut commented Oct 28, 2024

I'll demonstrate a use case on Thursday.

@maximilianh

For the types of users we're targeting here, the table browser does not sound like a tool they should use. You want to give them pre-cooked workflows, something where they just have to push a button. I thought you wanted to hide sequence retrieval under some tool and not make them suffer from our table browser?

Also, I doubt that we can change the TB sequence ID output format now; it's been like this for 20 years...

@nekrut
Contributor Author

nekrut commented Oct 28, 2024

> For the types of users we're targeting here, the table browser does not sound like a tool they should use. You want to give them pre-cooked workflows, something where they just have to push a button. I thought you wanted to hide sequence retrieval under some tool and not make them suffer from our table browser?
>
> Also, I doubt that we can change the TB sequence ID output format now; it's been like this for 20 years...

OK. But is splitting on gaps necessary?

@maximilianh

The TB doesn't split on gaps, at least not by default; this is the first time I'm hearing of it. If you just output the FASTA for a plain track that covers all chromosomes, it shouldn't split the sequence. What exactly did you click on in the table browser?
