As part of a larger project I thought I'd read a bit into AI rather than just pick a tool and not know what I was doing. Well, I learned a lot, which I'll share here, but there's probably still a lot more to learn. I've also found some references which I think are handy, or at least interesting to browse, so I've included them below. Let's discuss!

The number of parameters in a model is roughly like the number of neurons, a sort of brain-power size, though it isn't a direct comparison between products because their approaches differ. The size of the training dataset can make a big difference, and so can its make-up, as some models focus on carefully selected content (expertise?) over broader content (wider human culture?). The majority of the recent, accelerated growth in AI models seems to come from these two factors: the growing number of parameters and increasingly large training datasets. The number of tokens in a model can, I believe, be likened to its vocabulary size. Many models are trained on internet data available up to a certain date, the training data cut-off, so be aware of this if recent world information is important. Something else to be aware of: some interfaces and products say they will choose from different underlying models depending on the situation, in principle to optimise the outcome, but I imagine this might also depend on the developer's or user's subscription budget. So you may not always know what you're using. I've found you can try asking a model for more information about itself, but OpenAI for example will only tell you the training data cut-off date; it doesn't seem able (or willing) to provide all the details.
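If you're using the API directly rather than a chat app, here's a minimal sketch of that "ask the model about itself" idea, assuming the OpenAI Python SDK (openai >= 1.0) and an API key in your environment; the "gpt-4" alias and the prompt wording are just illustrative, and the model's self-description is only as reliable as the model itself.

```python
# A minimal sketch, assuming the OpenAI Python SDK is installed and
# OPENAI_API_KEY is set. The model alias and prompt are illustrative.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4",  # an alias; the response metadata reveals which snapshot answered
    messages=[
        {"role": "user", "content": "What is your training data cut-off date?"}
    ],
)

# The 'model' field on the response shows the resolved snapshot (e.g. a dated
# gpt-4 version), while the message content is the model's own self-description.
print("Resolved model:", response.model)
print("Model's answer:", response.choices[0].message.content)
```

The resolved model name in the response metadata is generally more trustworthy than the model's self-report, which is one way to notice when a product has quietly routed you to a different underlying model.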

OpenAI's GPT-4 still seems to be the go-to for industry comparison, consistently at or near the top with one of the most complex models (largest number of parameters), and it's perhaps the easiest to get started with (and/or pay for). Or perhaps that's OpenAI's marketing at work, as ChatGPT is a name lots of people recognise and many people, experts and non-experts alike, have tried. There are heaps of different benchmarks, but GPT-4 scores consistently high and is often the benchmark everyone aims to beat. There is definitely no single "right answer", though, as many AIs are designed to be stronger in different areas.

Try to be aware of an AI's capabilities, for example use this link to look up OpenAI model capabilities. Notice that if you just select GPT-4 on their web site, or perhaps in other apps, that is currently shorthand for gpt-4-0613, and this model has 8k context tokens (8×1024, or roughly 8,000) and uses training data available up to September 2021 (correct at the time of writing). A better alternative for long-form creativity could be a model with 32k context tokens. OpenAI says that 1 token is equivalent to about 0.75 words, so 8k tokens is about 6,144 words, which is roughly 22 pages of a novel (assuming an average of 280 words per page), and 32k, at four times larger, would be about 88 pages.

But what is context? An AI model is essentially "stateless", meaning it has no memory. Without you seeing it happen, the recent history of your conversation (both prompts and responses) is sent to the AI each time as the "context" for your next prompt. That's why you can come back to any chat conversation at a later time: the AI doesn't magically remember where you left off; the software stores the conversation and sends the full context to the AI again. So the maximum number of context tokens is essentially the maximum size of the AI's short-term memory. Depending on how each chat application works, as the context limit is reached it could either remove (forget) the oldest messages, or summarise (shorten) a section of the older conversation. If you're having a quick chat or writing a short poem this probably doesn't matter, but if you're getting the AI to help write a manuscript, having only "22 pages" of short-term memory could start to matter.
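To make the "stateless" point concrete, here is a minimal sketch of a chat loop that re-sends the whole history with every prompt and forgets the oldest messages once the estimated context approaches an 8k-token limit. It again assumes the OpenAI Python SDK; the 4-characters-per-token estimate and the drop-oldest trimming rule are simplifying assumptions, not how any particular chat app actually works.

```python
# A minimal sketch of a "stateless" chat loop: the model remembers nothing, so
# the client re-sends the whole conversation as context on every turn. Assumes
# the OpenAI Python SDK and OPENAI_API_KEY; the rough character-per-token
# estimate and the drop-oldest trimming rule are assumptions for illustration.
from openai import OpenAI

client = OpenAI()

CONTEXT_LIMIT_TOKENS = 8 * 1024   # e.g. gpt-4-0613's 8k context window
CHARS_PER_TOKEN = 4               # crude estimate (~0.75 words per token); real tokenisers vary

history = []                      # the entire "short-term memory" lives client-side


def estimated_tokens(messages):
    """Very rough token estimate from character counts (no real tokeniser)."""
    return sum(len(m["content"]) for m in messages) // CHARS_PER_TOKEN


def trim_to_fit(messages, limit=CONTEXT_LIMIT_TOKENS):
    """Forget the oldest messages once the estimated context exceeds the limit.
    (A real app might summarise older turns instead of dropping them.)"""
    while len(messages) > 1 and estimated_tokens(messages) > limit:
        messages.pop(0)
    return messages


def chat(user_prompt):
    history.append({"role": "user", "content": user_prompt})
    trim_to_fit(history)
    response = client.chat.completions.create(model="gpt-4", messages=history)
    reply = response.choices[0].message.content
    history.append({"role": "assistant", "content": reply})  # becomes context next turn
    return reply


print(chat("Let's outline chapter one of a mystery novel."))
print(chat("Now draft the opening paragraph."))  # the outline is re-sent as context
```

At roughly 0.75 words per token, the 8k budget above is the same 6,144 words (about 22 novel pages) mentioned earlier, which is exactly why long-form writing benefits from a 32k model.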

References:
A Comparative Analysis of Leading LLMs, updated Mar 2024. https://mindsdb.com/blog/navigating-the-llm-landscape-a-comparative-analysis-of-leading-large-language-models
A collection of model comparisons can be found here. https://lifearchitect.ai/models/
And you can even see some benchmarking results as people submit them, e.g. the LMSys Chatbot Arena Leaderboard. https://huggingface.co/collections/open-llm-leaderboard/the-big-benchmarks-collection-64faca6335a7fc7d4ffe974a
And the description of some benchmarks can be found here. https://www.vellum.ai/blog/llm-benchmarks-overview-limits-and-model-comparison

Responsible Use of AI Technology

Finally, some AI threads for others to follow, if interested, on the responsible development and use of AI. I think we still need to watch these things unfold, and see whether the discussions, policies, companies and the wider world converge or diverge.

In 2021 UNESCO adopted an agreement to support its 193 member states in implementing and reporting on a set of recommendations on the ethics of AI. Several companies (Microsoft, Mastercard, Lenovo, GSMA, INNIT, LG, Salesforce and Telefonica) signed an agreement with UNESCO at the Global Forum on the Ethics of AI in February 2024 to develop more ethical AI. While OpenAI has a major partnership with Microsoft, it doesn't appear to have a position on UNESCO's recommendations. In fact, a recent UNESCO study seems to point fingers at biases within OpenAI's older models and at their closed development approach. The results for the more recent ChatGPT model (see figure 1), however, showed a significant shift towards more positive and neutral content generation. I'm sure UNESCO is hoping studies such as this draw more attention to its recommendations and may even influence global AI benchmarking. So I think it's fair to say a lot of effort has gone into OpenAI's approach to safety, and the pressure isn't coming off anytime soon.

The final thread I'll mention is the environmental impact of the computing power needed to train these models. I think it's clear that while the AI buzz is booming this will be an issue. This article suggests that in 2019 the energy used to train one AI model was around 20 times what an average American uses in a year, and that the energy used keeps increasing for larger and larger models. Sound like a lot? Sure, but how many people then use that one model: millions? Once trained, an AI can be used over and over again, and for some significant tasks AI is actually being used to reduce the need for computational power (engineering and simulation, architecture and rendering). The buzz may be driving up speculative investment and growth, but eventually companies and models will only remain viable if their cost of growth stays lower than what people are willing to pay. The more a pre-trained model is re-used, by more people, and the more AI companies can reduce their training costs, the better. So the end point of this thread isn't clear to me. I don't think the market will tolerate infinite growth, and I think there's a chance the benefits could look similar to cloud computing, where the total size and growth is still very large, but the net effect for applications moving to the cloud has been an 87% decrease in energy consumption (see this article).