
Since the release of ChatGPT and the attention it has drawn to large language models, we have seen a rapid increase in new model releases and a rapidly evolving market around the use of LLMs. A model's suitability for a business context depends heavily on the use case at hand. In this blog post, we take a closer look at the currently most important models and compare them against enterprise-relevant criteria so that you can keep a better overview.

Large language models already existed before ChatGPT

With the introduction of ChatGPT, the world has been turned upside down – or so it seems. In fact, however, large language models existed even before the release of OpenAI's ChatGPT. Google and the University of Toronto laid the foundation for the transformation from "simple" natural language processing (NLP) to generative AI back in 2017 with their paper "Attention Is All You Need". It presents the transformer architecture, which uses an attention mechanism to capture relationships between words and to model dependencies over longer text sequences. This makes it possible to generate longer, coherent texts and even to hold entire dialogues between humans and AI systems.

In 2018, OpenAI introduced its first generative pre-trained transformer (GPT) based on the transformer architecture. Since then, language models have been continuously scaled up, both in terms of their training data and the size of the models themselves – from initially 117 million parameters (GPT-1) to the rumoured 1.76 trillion parameters of GPT-4.

Rights of use

After the release of ChatGPT in November 2022, the community gained momentum and numerous new models were introduced. A critical aspect when considering these models is their transparency. Although some models are labelled open source, the basis on which they were developed renders them unsuitable for commercial use. For example, some models use dialogues generated with ChatGPT as training data. This makes the use of these models legally difficult, because OpenAI's terms of use do not allow the API to be used to develop competing products. Other models build on Meta's leaked LLaMA, whose use is only permitted for research purposes. To use these LLaMA offshoots, their delta weights must be applied to the original LLaMA weights. The delta weights are released under the Apache 2.0 license, but the underlying Meta model is under a non-commercial license. This highlights the fact that there is often not one, but two licenses for a model: one for the inference code (often somewhat more permissive), and another for the model weights, which is often more restrictive (as in the case of LLaMA) and excludes commercial use. With this strategy, a company can claim the label "open source" (the code having been licensed accordingly), while at the same time preventing commercial use of its proprietary know-how (because the code is of no use without the model weights).
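The delta-weight mechanism can be sketched in a few lines. This is a minimal sketch, not the conversion script of any specific release (offshoots typically ship their own tooling); the tensor names and values below are purely illustrative:

```python
# Minimal sketch: reconstruct fine-tuned weights by adding published
# delta weights to the original base-model weights, tensor by tensor.
# Tensor names and shapes here are made up for illustration.
import numpy as np

def apply_delta(base_weights: dict, delta_weights: dict) -> dict:
    """Add each delta tensor to the matching base tensor."""
    if base_weights.keys() != delta_weights.keys():
        raise ValueError("base and delta checkpoints must contain the same tensors")
    return {name: base_weights[name] + delta_weights[name] for name in base_weights}

# Toy example with two "layers":
base = {"layer.0.weight": np.zeros((2, 2)), "layer.1.weight": np.ones((2, 2))}
delta = {"layer.0.weight": np.full((2, 2), 0.5), "layer.1.weight": np.full((2, 2), -1.0)}
merged = apply_delta(base, delta)
```

Distributing only the deltas keeps Meta's original weights out of the download, which is exactly why the offshoot can carry a permissive license while remaining unusable without the restrictively licensed base model.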

Recently, however, the licenses of some important models (and their weights) have been made more permissive. One example is Falcon, which initially allowed commercial use only in return for royalties, but was soon relicensed under Apache 2.0. The license of LLaMA-2, unlike that of LLaMA, also allows commercial, royalty-free use – but only to a limited extent: if a product exceeds the threshold of 700 million monthly active users, a separate usage permit from Meta is required.

Commercially available and usable alternatives to ChatGPT are desirable for the development of new LLM applications, as shown by the emergence of projects which pursue this open approach and therefore rely on their own architectures and on cleansed or crowd-sourced data sets. A prominent example is Dolly 2.0: after the first model, Dolly 1.0, had been trained on data generated with ChatGPT and was therefore barred from commercial use, Databricks had its own employees generate over 15,000 training records in a large crowd-sourcing initiative. Another example is the RedPajama project led by Together, which reproduced the original Meta LLaMA data sets and then trained its own models on them, releasing them under a permissive license.

Data security

Another critical aspect concerns the use – and thus the security – of the data with which LLMs interact. Where is enterprise-sensitive data stored for fine-tuning models or enriching prompts, and how might it be used further? The same question arises for (possibly even personal) data exchanged in chats. The use of OpenAI's models has been strongly questioned and criticized in this regard. In the meantime, OpenAI has adjusted this aspect so that users themselves can control, via an opt-in, whether their data is shared with OpenAI for the further development of AI models. However, OpenAI reserves the right to retain the data for 30 days to monitor possible misuse. Microsoft has the same arrangement for its Azure OpenAI service, which provides access to OpenAI models via REST APIs.
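To make the REST access concrete, the sketch below assembles (but deliberately does not send) a request to the Azure OpenAI chat-completions endpoint. The resource name, deployment name, key and API version shown are placeholders, not working credentials:

```python
# Sketch: assemble an Azure OpenAI chat-completions request.
# The endpoint pattern is https://{resource}.openai.azure.com/openai/
# deployments/{deployment}/chat/completions?api-version=...
import json

def build_chat_request(resource: str, deployment: str, api_key: str,
                       messages: list, api_version: str = "2023-05-15"):
    """Return URL, headers and JSON body for a chat-completions call."""
    url = (f"https://{resource}.openai.azure.com/openai/deployments/"
           f"{deployment}/chat/completions?api-version={api_version}")
    headers = {"api-key": api_key, "Content-Type": "application/json"}
    body = json.dumps({"messages": messages})
    return url, headers, body

url, headers, body = build_chat_request(
    "my-resource", "my-gpt35-deployment", "PLACEHOLDER-KEY",
    [{"role": "user", "content": "Hello"}])
```

The key point for compliance discussions is visible in the URL itself: the request goes to your own Azure resource in the region you selected, not to OpenAI's endpoint.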

In this debate, the German startup Aleph Alpha is clearly positioning itself around data security – it relies on its own data centre in Germany and attaches great importance to compliance with the General Data Protection Regulation (GDPR). In addition, the startup uses explainable-AI methods to trace AI output and can even suppress output that lacks a trusted source. MosaicML's platform for model training and deployment has adopted the same stance and promises not to use or store the data.

These topics are relevant if a company decides not to host the LLM itself, but to send its data to an external API for fine-tuning and user prompts. This is precisely why open-source models are attractive for enterprises: they can be deployed on the company's own infrastructure, guaranteeing control over its data.

Map of large language models

In the meantime, countless models have emerged on the market. On the basis of various criteria, we have compared established models of interest to enterprises in order to give you a general overview of the LLM landscape. Although we have included Meta's LLaMA as one of the first LLMs, all other models derived directly or indirectly from it have been left out of our overview table, as they are not suitable for commercial use.

Which model is suitable for your project depends heavily on the use case. If your enterprise handles sensitive data, you might choose a self-hosted model (e.g. via Hugging Face). If, on the other hand, your data is less sensitive, your prompts are simple and you want to get started quickly, you can access OpenAI directly. For those sceptical about a direct integration of OpenAI due to privacy concerns, Azure OpenAI is a compromise that makes it easier to meet compliance requirements by hosting the model in a selected Azure region, thereby avoiding an outflow of data to the USA. Because Microsoft is already established as a trustworthy, long-standing partner for many enterprises in the DACH region (Germany/Austria/Switzerland), it is often much easier for the responsible staff to opt for an Azure solution than to address the OpenAI API directly. Alternatively, you can choose Claude if your use case requires an extremely long context in order to pass a prompt enriched with data to the LLM. In terms of API availability and latency, Claude also performs significantly better than GPT-4, where temporary outages and long waits for output are not uncommon. Then again, LLaMA-2 might prove suitable if the focus is on ethically sound answers and maximum avoidance of harmful outputs. During the development of this chat model, Meta took measures such as supervised fine-tuning for safety, the use of a safety reward model and "red teaming" to produce highly dependable results.

No matter which model you choose, it has to perform properly. So far, the output generated by OpenAI's GPT-3.5 and GPT-4 has been considered reliable and good. Your use case determines which of the OpenAI models to use: GPT-3.5 is recommended for small queries or support in intermediate steps, while GPT-4 performs better for longer contexts, more creativity and a lower rate of hallucinations. However, the performance of an OpenAI model seems to vary over time and can even drop sharply on individual tasks, which is why the quality of the models in use should be monitored continuously. There are also performance differences between the various open-source models. This is where Hugging Face's Open LLM Leaderboard can point the way.
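In its simplest form, such monitoring can be a fixed regression set of prompts with automatic checks, re-run periodically against the model. In the sketch below, `call_model` is a placeholder for whichever API or self-hosted model you actually use; the stub at the end merely illustrates the mechanics:

```python
# Sketch: track LLM output quality over time by re-running a fixed
# regression set and recording the pass rate per evaluation run.
from typing import Callable

# Each entry: (prompt, automatic check on the model's answer).
REGRESSION_SET = [
    ("What is 2 + 2?", lambda out: "4" in out),
    ("Name the capital of France.", lambda out: "Paris" in out),
]

def evaluate(call_model: Callable[[str], str]) -> float:
    """Return the fraction of regression prompts the model answers correctly."""
    passed = sum(check(call_model(prompt)) for prompt, check in REGRESSION_SET)
    return passed / len(REGRESSION_SET)

# Stubbed model for illustration only:
def fake_model(prompt: str) -> str:
    return "The answer is 4." if "2 + 2" in prompt else "I don't know."

score = evaluate(fake_model)
```

Logging this score per run makes a gradual quality drop visible long before users complain; in production you would broaden the set and use stricter checks than substring matching.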

New models are being created at an astounding rate. For businesses, the launch of LLaMA-2 means that many more good open-source models are now becoming available, such as togethercomputer/LLaMA-2-7B-32K or stabilityai/StableBeluga2. Equally fascinating are the use cases associated with large language models.




How companies approach the topic of LLMs and identify suitable use cases can be read in this blog post.

Any questions?




Your Contact
Laura Weber
Laura loves to learn new technologies and thus get to know a broad spectrum of BI. In the data science environment, she is particularly fascinated by Natural Language Processing (NLP) and its use cases in different industries.
#DataEngineering #DataScience #MountainLiveBalance