LLM experiments
A great thing about working in the ČSOB data science department is that we can use up to 10% of our time for learning and exploration 👨🎓 🕵♀️. That is exactly what my colleagues are now doing with text classification.
As usual, we have a real-world problem: over a hundred clients enter a claim every day, and the task is to classify the claims into categories for further processing. The guys approached it with an “old-school” feed-forward neural network, reaching a satisfying 83% accuracy. It has been successfully deployed for a few months now. (Thanks, IPA team, for doing the heavy lifting!)
The exploration starts with a simple question: can LLMs or SLMs (small language models, such as BERT) improve this performance? The aim is not to use off-the-shelf models, but to try techniques like fine-tuning, LoRA, and others we have been reading about. We have on-premise GPU power as well as AWS GPUs available, so we are able to run a lot of interesting experiments.
RoBERTa was a quick win over the old-school model: fine-tuning it is not GPU-resource greedy, and it got us to 86% accuracy.
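For context, the RoBERTa step boils down to standard sequence-classification fine-tuning. Below is a minimal sketch using the Hugging Face Trainer; the checkpoint name, dataset files, number of categories, and hyperparameters are illustrative assumptions, not our actual configuration.

```python
# Minimal RoBERTa fine-tuning sketch for claim classification.
# Assumptions: a CSV dataset with "text" and integer "label" columns,
# the base "roberta-base" checkpoint, and placeholder hyperparameters.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          DataCollatorWithPadding, Trainer, TrainingArguments)

MODEL_NAME = "roberta-base"   # assumption: generic base checkpoint
NUM_CLASSES = 10              # assumption: number of claim categories

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME, num_labels=NUM_CLASSES
)

# Hypothetical file names; in practice the claims come from an internal source.
dataset = load_dataset("csv", data_files={"train": "claims_train.csv",
                                          "validation": "claims_valid.csv"})

def tokenize(batch):
    # Truncate long claims; padding is handled dynamically by the collator.
    return tokenizer(batch["text"], truncation=True, max_length=256)

dataset = dataset.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="roberta-claims",
    num_train_epochs=3,
    per_device_train_batch_size=32,
    learning_rate=2e-5,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    data_collator=DataCollatorWithPadding(tokenizer),
)
trainer.train()
```

This kind of run fits comfortably on a single GPU, which is exactly why it was the quick win.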
On to the heavy guns. Playing around with Phi3, Nemo, and Llama 3 took some setup time and tens of hours of training, effectively blocking other projects on the GPUs, because these models are GPU greedy. Beating RoBERTa was a tough fight: Phi3 and Nemo couldn’t get past 86%. Finally, a few configurations of Llama 3 landed at 87%, a single percentage point (p.p.) better. I was surprised that a plain old encoder-only model was so hard to beat, but it is what it is - data science doesn’t play favorites.
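For the big decoder models, LoRA is what keeps fine-tuning feasible: instead of updating billions of weights, you train small low-rank adapters on top of them. A minimal sketch with the peft library is below; the Llama 3 checkpoint, rank, and target modules are assumptions for illustration, not the exact configurations we trained.

```python
# Minimal LoRA sketch: attach low-rank adapters to a Llama-style model
# used for sequence classification. Checkpoint name, NUM_CLASSES, rank,
# and target modules are illustrative assumptions.
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "meta-llama/Meta-Llama-3-8B"   # assumption: 8B base model
NUM_CLASSES = 10                            # assumption: number of claim categories

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token   # Llama has no pad token by default

model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME, num_labels=NUM_CLASSES, torch_dtype=torch.bfloat16
)
model.config.pad_token_id = tokenizer.pad_token_id

# LoRA config: only small adapter matrices on the attention projections
# are trained, which cuts memory and compute dramatically.
lora_config = LoraConfig(
    task_type="SEQ_CLS",
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a fraction of the weights are trainable

# From here, the Trainer setup mirrors the RoBERTa sketch above,
# just with smaller batch sizes and (much) longer training times.
```

Even with LoRA, the memory footprint and training time of these models dwarf the RoBERTa run, which is why they monopolized the GPUs.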
The guys have some more experiments planned (for example, what to do with the “Other” category 🤔), but this exploration has already raised important questions. Is the 1 p.p. gain worth running an LLM on GPUs (probably in the cloud, with some additional risks)? Are 3 p.p. enough to justify replacing a running, proven model? Should we drop vanilla neural networks altogether, because BERT is a better baseline to start with?
Looking forward to seeing more 10% projects with interesting results and learnings. 🎉