Weeknotes: Week 5 - Q1 - 2025
Well, we made it to February and, despite the best efforts of some, it hasn't ended yet. I suppose we should count our blessings. After I mentioned DeepSeek last week, the world kind of lost its mind over the Chinese model. NVIDIA's stock dropped then soared, open models were the worst then back to being the best, reasoning is the word of the day, Meta rushed to reverse engineer it, OpenAI/Microsoft claimed it stole their IP then offered a hosted version of it in Azure, and I’ve just had a bacon and egg sandwich. If you would like a really good, if long, read on what happened and why DeepSeek and their R1 and V3 models are interesting, then this essay by Jeffrey Emanuel has you covered.
Benchmarking Local LLMs
The release of DeepSeek-R1 got me thinking about updating my Australian agronomy benchmark for evaluating small/quantised models, so this week I gave it a bit of a shakedown. The previous benchmark question set was derived from the GRDC GrowNotes and Updates papers. It was good but, because of the way I’d built it, it contained a lot of poor questions as well as questions about GRDC projects that aren’t widely known. You could say that is a bonus for testing an understanding of Australian agronomy, but I don’t think it gave a good reflection of capability. So this week I built a synthetic dataset using Google's Gemini 2 model. The jury is still out on the quality of this (much reduced) dataset but, at 200 questions, it feels much more manageable. I’ve not fully interrogated the dataset, but from what I have seen it has merit.
Anyway. I’ve been using this in my benchmark. Some results are below. I’ve been thinking a bit more about what I mean by a small model, and I feel it needs a bit more definition. At the moment my rough, unscientific definition of a small model is anything under approximately 80B parameters, since my Mac Studio can't realistically run anything larger, but I haven’t really been able to find any guidance online.
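For what it's worth, the back-of-envelope sum behind a cut-off like that is just parameters times bits per weight. Here is a rough sketch in Python. The function names, the 1.2 overhead factor (a guess at headroom for the KV cache and runtime) and the 64 GB memory budget are all assumptions for illustration, not measurements from my machine.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights alone, in GB (1 GB = 1e9 bytes)."""
    # params_billion * 1e9 weights * (bits / 8) bytes each, divided by 1e9 bytes per GB
    return params_billion * bits_per_weight / 8


def fits_in_memory(params_billion: float, bits_per_weight: float,
                   memory_gb: float, overhead: float = 1.2) -> bool:
    """Rough check: do the weights, plus an assumed overhead factor, fit in memory_gb?"""
    return weight_memory_gb(params_billion, bits_per_weight) * overhead <= memory_gb


def max_params_billion(memory_gb: float, bits_per_weight: float,
                       overhead: float = 1.2) -> float:
    """Largest parameter count (in billions) that fits under the same assumptions."""
    return memory_gb / overhead * 8 / bits_per_weight


if __name__ == "__main__":
    budget_gb = 64  # hypothetical memory budget, not my actual machine
    for bits in (16, 8, 4):
        size = weight_memory_gb(80, bits)
        verdict = "fits" if fits_in_memory(80, bits, budget_gb) else "does not fit"
        print(f"80B at {bits}-bit: ~{size:.0f} GB of weights, {verdict} in {budget_gb} GB")
```

On those assumptions an 80B model needs roughly 160 GB of weights at 16-bit, 80 GB at 8-bit and 40 GB at 4-bit, which is why any "small model" threshold depends as much on the quantisation level as on the parameter count.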
Prior to Christmas I started looking for funding to put a bit more rigour into this benchmarking across various agricultural sectors. If you’d be interested in collaborating, please let me know.
Back to work
Next week is the first proper week back at work (and exercise) after the holidays. I've been doing things already, but now that my family are all back at school and work it's time to start digging in and knocking some things off the to-do list. One month in, 2025 is already shaping up to be a busy year.
Until next week...