After the release of ChatGPT in November 2022, the community gained momentum and numerous new models were introduced. A critical aspect to consider with these models is their transparency. Although some models are labelled open source, the basis of their development renders them unsuitable for commercial use. For example, some models use dialogues generated with ChatGPT as training data. This makes their use legally difficult, because OpenAI's terms of use do not allow the API to be used to develop competing products. Other models build on Meta's leaked LLaMA, whose use is only permitted for research. To use the LLaMA offshoots, their delta weights must be applied to the original LLaMA weights. The delta weights are released under the Apache 2.0 license, but the underlying Meta model is covered by a non-commercial license. This highlights the fact that there is often not one but two licenses for a model: one for the inference code (often somewhat more permissive), and another for the model weights, which is often more restrictive (as in the case of LLaMA) and excludes commercial use. With this strategy, a company can claim the label "open source" (since the code is so licensed) while still preventing commercial use of its proprietary know-how (because the code is of no use without the model weights).
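Applying delta weights amounts to an element-wise sum of the base checkpoint and the delta checkpoint, parameter by parameter. The following is a minimal sketch of that idea; real tools (such as FastChat's apply-delta script for Vicuna) operate on PyTorch state dicts, whereas here plain Python lists stand in for tensors and the parameter names are illustrative, not actual LLaMA layer names.

```python
def apply_delta(base_weights, delta_weights):
    """Merge a delta checkpoint into a base checkpoint.

    Both arguments map parameter names to (stand-in) weight vectors;
    the result is the element-wise sum base + delta per parameter.
    """
    if set(base_weights) != set(delta_weights):
        raise ValueError("base and delta checkpoints must share parameter names")
    return {
        name: [b + d for b, d in zip(base_weights[name], delta_weights[name])]
        for name in base_weights
    }

# Toy example: two "parameters" with tiny vectors standing in for tensors.
base = {"embed.weight": [0.1, -0.2], "lm_head.weight": [0.5, 0.0]}
delta = {"embed.weight": [0.05, 0.1], "lm_head.weight": [-0.5, 1.0]}

merged = apply_delta(base, delta)
print(merged["lm_head.weight"])
```

The licensing point is visible in the structure: the function (the "code") is trivially shareable, but without the actual `base_weights` of the restricted model, the permissively licensed delta is unusable on its own.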
Recently, however, licenses for important models (and their weights) have been made more permissive. One example is Falcon, which initially allowed commercial use only in return for royalties, but soon afterwards received an Apache 2.0 license. The license of LLaMA-2, unlike that of LLaMA, also allows commercial, royalty-free use - but only up to a limit. If a product exceeds 700 million monthly active users, a usage permit from Meta is required.
Commercially usable models as alternatives to ChatGPT are desirable for the development of new LLM applications, as shown by the emergence of projects that pursue this open approach and therefore rely on their own architectures and cleansed or crowd-sourced data sets. A prominent example is Dolly 2.0: after the first model, Dolly 1.0, had been trained on ChatGPT-generated data and was therefore barred from commercial use, Databricks had its own employees generate over 15,000 training records in a massive crowd-sourcing initiative. Another example is the RedPajama project by together.ai, which reproduced the original Meta LLaMA data sets, then trained models itself and released them under a permissive license.