Refactor ContrastiveDataset and ContrastiveDistillationDataset #579

DemirTonchev · 2024-12-25T11:13:24Z

Refactors ContrastiveDataset and ContrastiveDistillationDataset to generate pairs lazily. The trainer code is also updated to create the dataset from an iterator. This changes how pairs are generated; see #578 for more motivation.

dataset = Dataset.from_generator(data_sampler.__iter__): this change makes it possible to work with arbitrarily large datasets (the trade-off is the on-disk cache managed by the Arrow dataset).
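For illustration, a minimal sketch of the lazy pair generation idea, not the code from this PR. The class name `LazyPairSampler` and the field names `sentence_1`, `sentence_2`, and `label` are hypothetical; the sketch only shows pairs being yielded one at a time from `__iter__` instead of being built as a full list in memory.

```python
import itertools
from typing import Dict, Iterator, List, Union


class LazyPairSampler:
    """Hypothetical stand-in for a contrastive pair sampler (not SetFit's class)."""

    def __init__(self, sentences: List[str], labels: List[int]) -> None:
        self.sentences = sentences
        self.labels = labels

    def __iter__(self) -> Iterator[Dict[str, Union[str, float]]]:
        # Each pair is produced on demand, so the number of pairs held in
        # memory at once stays constant even though the total number of
        # pairs grows quadratically with the number of sentences.
        for i, j in itertools.combinations(range(len(self.sentences)), 2):
            yield {
                "sentence_1": self.sentences[i],
                "sentence_2": self.sentences[j],
                # 1.0 for a positive pair (same label), 0.0 for a negative pair.
                "label": 1.0 if self.labels[i] == self.labels[j] else 0.0,
            }


sampler = LazyPairSampler(["a", "b", "c"], [0, 0, 1])
# Only the first two pairs are ever materialized here.
first_two = list(itertools.islice(sampler, 2))
```

With the `datasets` library installed, a bound method such as `sampler.__iter__` can be passed to `Dataset.from_generator`, as in the line above; the rows are then written to the Arrow cache on disk rather than kept in RAM.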

This also targets the ContrastiveDistillationDataset bug in #578

fixes: #578

DemirTonchev added 10 commits December 20, 2024 15:03

fixed to work with processing_class instead of tokenizer after transformers 4.45.2

cb4e803

refactor attempt for ContrastiveDataset so that it does not blow RAM with bigger dataset

6a69303

added Sampling strategy enum

09869e1

improved logic and fixed iterator pattern

82474dc

fix for negative samples formula

d01dc36

added multilabel support as in the original implementation

53eaace

ContrastiveDataset iterator refactor

3e3fa5f

trainer fixed to work with ContrastiveDataset iter method

2d5e29b

ContrastiveDistillationDataset iter refactor

1c905b1

typing fix

9cafa02
