Experiment
We wanted to understand which models and tasks would benefit most from the curation process. As a baseline for the experiments, we fine-tuned two LLMs of different sizes (Gemini Nano-1 with 1.8B parameters and Nano-2 with 3.25B parameters) on two tasks of different complexity (lower and higher, based on expert alignment) using crowdsourced labels. Each crowdsourced dataset has ~100K annotations and a strong class imbalance, with around 95% benign labels on average.
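For context, a class-imbalance figure like this comes from a simple label-distribution check. The sketch below is illustrative only: the datasets themselves are not public, and `labels` is a hypothetical stand-in for one crowdsourced annotation set.

```python
# Illustrative label-distribution check behind the class-imbalance figures.
# `labels` is a hypothetical list of crowdsourced annotations; the actual
# datasets are not public.
from collections import Counter

def label_balance(labels):
    """Return the fraction of each label in a list of annotations."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}

# Example with the reported ~95% benign skew:
labels = ["benign"] * 95 + ["positive"] * 5
print(label_balance(labels))  # {'benign': 0.95, 'positive': 0.05}
```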
Each of these four baseline conditions was compared with the corresponding curated condition, in which each model (Nano-1 and Nano-2) is fine-tuned over multiple rounds using the curation process described above. At each iteration, we selected a set of curated examples and used them for model evaluation and fine-tuning, as described above. All models plateaued before reaching parity with the experts' internal alignment, so we stopped at six iterations (~400 fine-tuning and ~250 evaluation samples) for the lower complexity task and five iterations (~250 fine-tuning and ~150 evaluation samples) for the higher complexity task. (Note that the lower complexity task had a greater variety of examples, which may explain the longer time needed to converge.) Both datasets ended with a final class balance of roughly 40% positive examples.
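To make the stopping logic concrete, here is a minimal sketch of one plausible shape for this loop. The curation, fine-tuning, and evaluation steps are passed in as callables; `select_curated_batch`, `fine_tune`, and `evaluate_alignment` are placeholder names, not the actual pipeline's API, and the plateau test is an assumed heuristic.

```python
from typing import Callable, List

def run_curation(
    model,
    select_curated_batch: Callable,  # returns (fine_tune_batch, eval_batch)
    fine_tune: Callable,             # fine-tunes on the accumulated set
    evaluate_alignment: Callable,    # model-vs-expert Cohen's kappa
    expert_ceiling: float,           # experts' internal pairwise kappa
    min_gain: float = 0.01,          # kappa gain below which we call it a plateau
    max_iters: int = 10,
):
    fine_tune_set: List = []
    eval_set: List = []
    prev_kappa = float("-inf")
    for _ in range(max_iters):
        # Pick the next batch of expert-curated examples.
        ft_batch, ev_batch = select_curated_batch(model)
        fine_tune_set += ft_batch
        eval_set += ev_batch
        # Fine-tune on everything curated so far, then measure alignment.
        model = fine_tune(model, fine_tune_set)
        kappa = evaluate_alignment(model, eval_set)
        # Stop at the expert ceiling or once gains plateau (as in the
        # six- and five-iteration runs above).
        if kappa >= expert_ceiling or kappa - prev_kappa < min_gain:
            break
        prev_kappa = kappa
    return model, fine_tune_set, eval_set
```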
The table below provides an overview of the scale and quality of the data used in each condition. Experts reached an average pairwise Cohen's kappa of .81 (lower complexity task) and .78 (higher complexity task) through the curation process. We consider these values the ceiling for model performance. To assess the quality of the crowdsourced data, we calculated the kappa alignment between the crowdsourced annotations and the experts on the full curated set: .59 (lower complexity) and .41 (higher complexity).
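These agreement numbers can be reproduced with standard tooling. Below is a minimal sketch using scikit-learn's `cohen_kappa_score`; the post does not specify which implementation was actually used.

```python
# Agreement metrics as reported above: average pairwise Cohen's kappa
# among experts, and crowd-vs-expert kappa on the full curated set.
from itertools import combinations
from sklearn.metrics import cohen_kappa_score

def average_pairwise_kappa(annotations_by_expert):
    """Mean Cohen's kappa over all pairs of experts.

    `annotations_by_expert`: one equal-length label list per expert,
    aligned on the same examples.
    """
    pairs = list(combinations(annotations_by_expert, 2))
    return sum(cohen_kappa_score(a, b) for a, b in pairs) / len(pairs)

# Crowd-vs-expert alignment is a single kappa on the curated set:
# cohen_kappa_score(crowd_labels, expert_labels)
```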