A large language model (LLM) is designed to process and generate human language. Following the introductory look at LLMs in the first part of this series, this installment covers hardware requirements and pre-trained models.
Prof. Dr. Michael Stal has been working at Siemens Technology since 1991. His research focuses on software architectures for large, complex systems (distributed systems, cloud computing, IIoT), embedded systems, and artificial intelligence. He advises business units on software architecture questions and is responsible for the architectural training of senior software architects at Siemens.

Fasten your seat belts!
Essential hardware
A remark on hardware requirements, as an aside: as the details above suggest, both training and execution (inference) involve complex mathematical calculations on vectors and tensors. GPU cores and neural processing units specialize in exactly these operations. LLM developers therefore need powerful TPUs or GPUs with as much RAM as possible; with an ordinary CPU and little memory, you will not get beyond very small models. The weights and the computed activations of the LLM should fit into the GPU's memory. The former can amount to several hundred gigabytes, while the latter range from a few kilobytes to many megabytes. The DeepSeek-R1 model has 671 billion weights/tensor entries and therefore requires around 500 gigabytes of memory; OpenAI's models are said to require more than a terabyte in some cases. This calls for many NVIDIA H100 GPU accelerators. As a consequence, developers can only train and run comparatively small models on their local systems; with good hardware equipment, for example, models with up to 70 billion parameters.
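As a rough rule of thumb, the memory needed for the weights alone is the parameter count multiplied by the bytes per weight, which in turn depends on the numeric precision or quantization. A minimal back-of-the-envelope sketch of this estimate (it deliberately ignores activations, the KV cache, and framework overhead):

def estimated_weight_memory_gb(num_parameters: float, bytes_per_weight: float) -> float:
    # Weights only: parameter count times bytes per weight, in gigabytes
    return num_parameters * bytes_per_weight / 1e9

# DeepSeek-R1 with 671 billion parameters at different precisions
for precision, bytes_per_weight in [("FP16/BF16", 2.0), ("FP8/INT8", 1.0), ("4-bit", 0.5)]:
    print(f"671B at {precision}: ~{estimated_weight_memory_gb(671e9, bytes_per_weight):.0f} GB")

# A 70-billion-parameter model, the size still feasible on well-equipped local hardware
print(f"70B at FP16: ~{estimated_weight_memory_gb(70e9, 2.0):.0f} GB")

At 16-bit precision, even a 70-billion-parameter model needs roughly 140 gigabytes for its weights alone, which is why quantized formats are popular for running such models locally.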
Pre-trained models: the power of transfer learning
Pre-trained models are LLMs that their creators have trained on large datasets and that can subsequently be fine-tuned for specific tasks. These models serve as a starting point for other applications, which can reuse the patterns and relationships learned during pre-training (a short fine-tuning sketch follows the list below). Popular pre-trained models include:
- BERT (Bidirectional Encoder Representations from Transformers): a pre-trained model that uses a multi-layer bidirectional transformer encoder to generate contextual representations of the words in the input text.
- RoBERTa (Robustly Optimized BERT Pretraining Approach): a pre-trained model that uses a modified version of the BERT architecture and a different training objective.
- XLNet (Extreme Language Modeling): a pre-trained model that uses a combination of autoregressive and autoencoding techniques to generate contextual representations of the words in the input text.
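To make the transfer-learning idea concrete, here is a minimal sketch of fine-tuning a pre-trained BERT checkpoint for a downstream classification task with the Hugging Face transformers library; the two-example toy dataset and the binary sentiment labels are purely hypothetical.

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Start from a pre-trained checkpoint; only the small classification head is initialized from scratch
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Hypothetical toy dataset: (text, label) pairs for a binary sentiment task
texts = ["great article", "badly written"]
labels = torch.tensor([1, 0])

batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
for _ in range(3):  # a few passes over the toy data
    outputs = model(**batch, labels=labels)
    outputs.loss.backward()  # gradients also flow through the pre-trained layers
    optimizer.step()
    optimizer.zero_grad()

Because the encoder starts from the pre-trained weights, even a small task-specific dataset can be enough to adapt the model.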
Context window: the model's field of view
The context window refers to the amount of input text that the model can see at any given time. We already covered it in the first part of the series, but it deserves another mention here. Depending on the model architecture, the context window can be fixed or dynamic. A larger context window lets the model take more context into account, but it also increases the computational effort. Modern LLMs have context lengths ranging from a few thousand up to a few million tokens.
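In practice this means that input beyond the context window has to be truncated or split. A small sketch with the Hugging Face tokenizer, assuming the bert-base-uncased checkpoint used later in this post, whose context length is 512 tokens:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.model_max_length)  # 512 for BERT-base

long_text = "The context window limits how much text the model sees at once. " * 200

# Everything beyond the model's maximum context length is simply cut off here
encoded = tokenizer(long_text, truncation=True,
                    max_length=tokenizer.model_max_length,
                    return_tensors="pt")
print(encoded["input_ids"].shape)  # at most [1, 512]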
Masking: the model's eyes
Masking prevents the model from attending to certain parts of the input text. There are different types of masks (a short sketch follows the list), including:
- Padding masks prevent the model from attending to padding tokens, which are appended to the input text to bring it to a fixed length. Such tokens include, for example, the EOS token (end of sequence).
- Context masks prevent the model from attending to certain parts of the input text, such as future tokens in a sequence.
- Causal masks prevent the model from attending to future tokens in a sequence, which lets it generate text autoregressively.
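To illustrate, a minimal PyTorch sketch of how a padding mask and a causal mask can be constructed; the token IDs are made up, and real frameworks usually build these masks automatically:

import torch

# Two sequences of different lengths, padded to length 5 with token ID 0
input_ids = torch.tensor([[11, 12, 13,  0,  0],
                          [21, 22, 23, 24, 25]])

# Padding mask: 1 = real token, 0 = padding that attention must ignore
padding_mask = (input_ids != 0).long()
print(padding_mask)

# Causal mask: lower-triangular matrix; position i may only attend to
# positions <= i, hiding future tokens during autoregressive generation
seq_len = input_ids.size(1)
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
print(causal_mask)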
Initial processing of a query
As soon as you enter a query (prompt) into an LLM, the query passes through the following stages:
- Tokenization: The tokenizer breaks the user's prompt into tokens such as words or subwords.
- Embedding creation: Each token is converted into a numerical representation by passing it through an embedding layer.
- Positional encoding: The model combines the token embeddings of the prompt with a positional encoding to preserve their order.
The LLM then performs the following stages:
- Self-attention: The embeddings of the prompt pass through a self-attention layer to generate a contextualized representation.
- Cross-attention: The contextualized representation passes through a cross-attention layer to take external information into account, for example input text from an encoder or from other models.
- Feed-forward layers: The LLM then passes the output of the cross-attention layer through several feed-forward layers to transform the representation into a higher-level one. Their purpose: the layers add non-linearity and the capacity to learn complex patterns.
- Context window: The LLM creates the final output using a context window over the high-level representation. This allows the model to focus on a specific part of the query.
- Masking: Finally, the model masks the final output to prevent it from attending to certain parts of the query, such as the padding tokens.
The following code shows the step-by-step processing of a query, covering tokenization, embedding creation, positional encoding, self-attention, feed-forward layers, the context window, and masking; the comments note where BERT handles a stage implicitly (positional encoding inside the embedding layer) or not at all (cross-attention, which only exists in encoder-decoder models).
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Load the pre-trained model and the tokenizer
modell = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Define a query
abfrage = "Explain the history of Heise-Verlag."

# Tokenize the query; the attention mask distinguishes real tokens from padding
eingabe = tokenizer(abfrage, return_tensors="pt")

# Create the embeddings; in BERT the positional encoding is added inside
# this embedding layer, together with the token embeddings
embedded_abfrage = modell.bert.embeddings(eingabe["input_ids"])

# Turn the padding mask into the additive form the attention layers expect
erweiterte_maske = modell.get_extended_attention_mask(
    eingabe["attention_mask"], eingabe["input_ids"].shape
)

# Run the encoder: each layer applies self-attention followed by feed-forward
# sublayers; BERT has no cross-attention (that exists only in encoder-decoder
# models), and its context window is bounded by the maximum sequence length
encoder_ausgabe = modell.bert.encoder(embedded_abfrage,
                                      attention_mask=erweiterte_maske)
kontextualisierte_darstellung = encoder_ausgabe.last_hidden_state

# Project the contextualized representation onto the vocabulary
# with the masked-language-modeling head
logits = modell.cls(kontextualisierte_darstellung)

# Decode the most probable token per position as a crude "answer"
antwort = tokenizer.decode(logits.argmax(dim=-1)[0])
print(antwort)
Files created or modified during execution:
- model.pth: the trained model file
- tokenizer.json: the tokenizer configuration file
- query.txt: the input query file
- answer.txt: the output (answer) file
Note: This is a heavily simplified example. In practice, real applications with LLMs are considerably more complex.
In my next blog post, I will move on from LLMs to the various architecture types.
(RME)
