Base models in LLMs along with their size differences:
Here, size refers to the number of parameters, which roughly corresponds to the memory the model occupies.
General Structure of LLMs in terms of input:
The text that we pass in is called a prompt, and the space allocated to the prompt is called the context window; its size is typically around 1,000 words, but it differs from LLM to LLM. The result is called the completion, and it contains the prompt along with the generated answer. This entire step is called inference.
REFER TO THE TRANSFORMER BLOG.....
In simple words, the Transformer can be seen as follows:
PROMPTING:
The main idea of prompting is to refine the input so that you get the corresponding accurate output from the model. This is a very vast domain; let's look at one of the many methods available.
IN-CONTEXT LEARNING:
In the above example, we can see that the result is not as expected, so we next try the in-context learning method.
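To make the idea concrete, here is a minimal sketch (the task and example texts are made up for illustration) showing how a zero-shot prompt differs from a one-shot prompt, where we place a solved example inside the context window:

```python
# Zero-shot: the model only gets an instruction and the input.
zero_shot_prompt = """Classify this review: 'I loved this movie!'
Sentiment:"""

# One-shot (in-context learning): we prepend one solved example
# inside the context window so the model can imitate the pattern.
one_shot_prompt = """Classify this review: 'The plot was dull and predictable.'
Sentiment: Negative

Classify this review: 'I loved this movie!'
Sentiment:"""

print(one_shot_prompt)  # this full string is what gets sent to the LLM
```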
CONFIGURATION-INFERENCE PARAMETERS:
Here, max new tokens sets the maximum number of new tokens the model can generate while answering the prompt.
Below is a scenario in which, even after we increase the maximum number of new tokens to 200, the output still contains fewer words because the <EOS> token has appeared.
As shown in the image below, the top-k and top-p values are used to randomly sample the next output word. That is, while generating a sentence, these two parameters control how the next word is chosen from the candidate tokens.
Top-k ensures that the model picks from only the k words with the highest probability, which introduces randomness by giving the model k options for the next word. Top-p instead restricts the choice to the smallest set of words whose cumulative probability reaches p.
The next parameter that we look at is the temperature.
In simple words, the higher the temperature, the higher the randomness.
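As a hedged sketch of how these knobs are typically exposed, here is how they map onto the Hugging Face `generate` API (the model name and parameter values are just placeholders for illustration):

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "google/flan-t5-base"  # placeholder checkpoint for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

inputs = tokenizer("Summarize: the meeting was moved to Friday.", return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=200,   # upper bound on generated tokens; an <EOS> token can stop it earlier
    do_sample=True,       # enable random sampling instead of greedy decoding
    top_k=50,             # sample only from the 50 most probable tokens
    top_p=0.9,            # ...further restricted to the smallest set with cumulative prob >= 0.9
    temperature=0.7,      # <1 sharpens the distribution, >1 flattens it (more random)
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```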
These are the steps that we follow:
- Collecting data from different sources.
- Filtering the data to remove bias (e.g., gender bias, as explained in one of the earlier blogs). After this step, we are left with only 1-3% of all the data we collected.
- The LLM initially has to learn vector representations of all the tokens that remain. This process is often called encoding. In this step, the model tries to understand and analyze the patterns in the data.
Types of models that we have:
Encoder-only models/Autoencoding Models:
They randomly mask a token and predict the masked token based on the words on both sides. This helps the model build an understanding of words in a bidirectional way.
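As a small illustrative sketch (using a generic BERT checkpoint, not any specific model from this blog), the masked-token objective looks like this:

```python
from transformers import pipeline

# Autoencoding objective: predict the masked token from both left and right context.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for candidate in fill_mask("The weather is [MASK] today."):
    print(candidate["token_str"], round(candidate["score"], 3))
```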
Decoder Models/ Autoregressive models:
They try to predict the next word based on the previous words, so this is unidirectional. Larger decoder models exhibit zero-shot abilities.
Encoder-Decoder Models/ Sequence-to-Sequence Model:
It is believed that the larger the LLM, the better the model performs; here, size is measured in parameters. By parameters we mean the trainable weights. In a transformer architecture, these weights are used for the embeddings, the attention layers, the feed-forward networks, and so on. However, scaling up the parameters requires more computational power.
For training a 1B-parameter model, we need:
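As a rough back-of-the-envelope sketch (the per-parameter byte counts below are common rules of thumb, not exact figures):

```python
params = 1e9                      # 1B parameters

weight_bytes = 4 * params         # FP32 weights: 4 bytes per parameter -> ~4 GB
# Training additionally needs gradients, Adam optimizer states, and activations;
# a common rule of thumb is roughly 20 extra bytes per parameter.
training_bytes = (4 + 20) * params

print(f"weights only : {weight_bytes / 1e9:.0f} GB")
print(f"full training: {training_bytes / 1e9:.0f} GB")   # ~24 GB for a 1B-parameter model
```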
Quantization:
In quantization, we try to reduce the size, or memory, required to store a parameter or weight. The original representation is FP32, which can be reduced to FP16, BF16, or INT8; the popular choice is BF16. Formats like FP16 and INT8 reduce the range of values that can be stored compared to FP32, whereas BF16 truncates most of the fraction (mantissa) bits while keeping the same range as FP32. Because of this, it also speeds up computation without shrinking the range, so parameter values can still be large.
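A quick PyTorch sketch of the range-versus-precision point (the numbers printed are simply what `torch.finfo` reports for each dtype):

```python
import torch

x = torch.tensor(3.141592653589793, dtype=torch.float32)
print(x.element_size(), x.to(torch.bfloat16).element_size())  # 4 bytes vs 2 bytes per value

# FP16 shrinks the representable range; BF16 keeps roughly the FP32 range
# but with fewer mantissa (fraction) bits, i.e. less precision.
for dtype in (torch.float32, torch.float16, torch.bfloat16):
    info = torch.finfo(dtype)
    print(dtype, "max:", info.max)

print(x.to(torch.bfloat16))  # note the lost fraction digits
```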
Multi-GPU Computation:
DDP:
In DDP, our main aim is to distribute the data across multiple GPUs to enable parallel processing, and finally combine all the gradients and synchronize the final model across all the GPUs used. This implementation speeds up computation but does not address memory, since each GPU still needs to hold the entire model.
FSDP:
Here, we try to do the same thing as DDP, but we also shard the model across GPUs. This reduces the memory problem efficiently, but it reintroduces a time cost, since the GPUs need to communicate with each other to gather weights from one another during the forward and backward passes.
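A minimal sketch of how the two wrappers differ in PyTorch (assuming the script is launched with `torchrun`; the tiny model here is a placeholder, not a real LLM):

```python
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# Assumed launch: torchrun --nproc_per_node=<num_gpus> this_script.py
dist.init_process_group("nccl")
local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)

model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 1024)).cuda()

ddp_model = DDP(model)     # DDP: every GPU holds the full model; only gradients are synced
# fsdp_model = FSDP(model) # FSDP: parameters, gradients and optimizer states are sharded,
                           #       and gathered on demand during forward/backward passes
```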
Comparison of performance:
As you can see, for a while FSDP and DDP perform at the same level, but when the parameter count reaches 11.3B, DDP raises an out-of-memory error. This suggests it is preferable to choose FSDP regardless of model size.
The second plot shows how performance improves as the number of GPUs increases and then falls off due to the growing communication overhead.
Scaling Choices for Pre-training:
To achieve the goal we are aiming for, we have to consider the three factors shown in the image. While searching for the optimal values of these three parameters to maximize model performance, researchers found that each of them follows a power-law relationship with performance.
This means that as the compute budget increases, with the dataset size and parameter count held constant, model performance keeps improving, and the same pattern holds for the other two factors as well. This eventually led to the question: "What would be the ideal values for dataset size and parameter count, given the available compute?"
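A toy illustration of what a power-law relationship between test loss and compute looks like (the constants here are invented purely for illustration, not taken from any paper):

```python
# Power law: loss falls as a negative power of compute, L(C) = a * C**(-alpha).
a, alpha = 10.0, 0.05          # made-up constants purely for illustration

for compute in (1e18, 1e20, 1e22):   # compute budget in FLOPs
    loss = a * compute ** (-alpha)
    print(f"compute={compute:.0e} FLOPs -> loss={loss:.3f}")
```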
Chinchilla Paper:
The paper revealed two situations that may cause a model to not perform as expected.
After extended research, they found that
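The Chinchilla result is commonly summarized as a compute-optimal ratio of roughly 20 training tokens per parameter; a quick sketch of what that implies:

```python
# Rule of thumb from the Chinchilla paper: ~20 training tokens per parameter.
TOKENS_PER_PARAM = 20

for params in (1e9, 70e9, 175e9):
    optimal_tokens = TOKENS_PER_PARAM * params
    print(f"{params / 1e9:.0f}B params -> ~{optimal_tokens / 1e12:.2f}T training tokens")
```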
BloombergGPT (domain-specific training):
This is a scenario where you want to build a domain-specific application. For models like BloombergGPT, we often train from scratch. BloombergGPT is one application of the Chinchilla paper's findings.
Fine-Tuning a LLM:
Though we opted for in-context learning with prompts for a specific task, it is not very effective with smaller LLMs, and it also has the drawback of consuming a large share of the context window.

Instruction Fine-Tuning:
In instruction fine-tuning, we may not need a dataset as large as the one used to train an LLM from scratch. For instruction fine-tuning, we need a format/template to help the model understand and learn.
As shown in the above image, training happens by giving the model one of the prompts and evaluating its completion against the true completion. This is done by comparing the probability distribution of the predicted output with that of the true output, calculating the loss, and updating the parameters. This process is repeated over the training dataset, resulting in a newly trained LLM that performs better on the specific task we are training it for.
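A hedged sketch of that training step with a seq2seq model (the checkpoint and texts are placeholders; the key point is that the true completion is passed as `labels` so the cross-entropy loss can be computed and backpropagated):

```python
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "google/flan-t5-base"   # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
model.train()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

prompt = "Summarize the following conversation.\n\nA: Lunch at 1?\nB: Sure, see you then.\n\nSummary:"
target = "A and B agree to meet for lunch at 1."

inputs = tokenizer(prompt, return_tensors="pt")
labels = tokenizer(target, return_tensors="pt").input_ids

outputs = model(**inputs, labels=labels)   # cross-entropy between predicted and true tokens
outputs.loss.backward()                    # compute gradients
optimizer.step()                           # update (all) model weights
optimizer.zero_grad()
```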
Drawbacks of Fine-Tuning on a Single Task:
Fine-tuning on a single task eventually results in a phenomenon called catastrophic forgetting. Because the whole set of model weights changes during the instruction fine-tuning discussed above, there is a high chance that the model fails to perform other tasks that it used to perform well.
Multi-task Instruction Fine-Tuning:
The main purpose is to train the model on multiple tasks, resulting in a model that works well across tasks. However, the data required for this is large, since we need examples for each task.
FLAN-T5 is trained on the different sets of datasets shown above. The summarization template used for FLAN-T5 is given here:
Before:
Performance analysis parameters:
Rouge-2(Bi-grams):
Rouge-L:
It finds the longest common subsequence and evaluates the output based on it, as shown above. In the example, the longest common subsequences are "it is" and "cold outside".
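A small sketch of computing these scores with the `evaluate` library, reusing the "it is cold outside" example (the generated sentence is assumed for illustration):

```python
import evaluate

rouge = evaluate.load("rouge")
scores = rouge.compute(
    predictions=["it is very cold outside"],   # model output (assumed for illustration)
    references=["it is cold outside"],         # human reference
)
print(scores)   # rouge1 / rouge2 (bigram overlap) / rougeL (longest common subsequence)
```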
Drawbacks of Simple Rouge Score:
Evaluation Benchmarks:
PEFT:
Parameter-efficient fine-tuning (PEFT) is a family of memory- and compute-efficient fine-tuning techniques, with two broad approaches:
- The first approach is to keep around 80% of the model weights frozen and update only the remaining ~20% that help us improve on the required tasks.
- The second is to freeze the entire model and add a small number of extra trainable parameters to produce the required results.
One of the main advantages of using the PEFT method is shown below.
Types of PEFT Methods:
Reparameterization:
LORA:
LORA works as shown in the above image: the model weights at the self-attention layer are updated through low-rank matrices, as illustrated. A further in-depth explanation of LORA will be given in a separate blog.
The memory and training time saved by the LORA technique can be seen in the image below.
LORA enables adapting the model to different tasks, as shown below:
As described in the image, we can do this for multiple tasks and simply switch between the task-specific adapter matrices as needed.
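A minimal numerical sketch of the low-rank idea (the dimensions and rank are example values, not taken from any particular model): instead of updating the full weight matrix W, we learn two small matrices B and A whose product has the same shape as W.

```python
import torch

d, k, r = 512, 512, 8                      # example dimensions and rank

W = torch.randn(d, k)                      # frozen pretrained weight (262,144 params)
A = torch.randn(r, k) * 0.01               # trainable low-rank factor  (4,096 params)
B = torch.zeros(d, r)                      # trainable low-rank factor  (4,096 params)

W_adapted = W + B @ A                      # effective weight used at inference

full_params = W.numel()
lora_params = A.numel() + B.numel()
print(f"full update: {full_params:,} params, LORA update: {lora_params:,} params "
      f"({100 * lora_params / full_params:.1f}%)")
```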
Results of using LORA:
As shown in the above image, the fully fine-tuned model for summarization performs much better than the base model, but full fine-tuning comes with large GPU and memory costs. The LORA model, on the other hand, achieves results similar to full fine-tuning with far less memory and fewer GPU resources.
How to choose Rank:
This is still an active area of research; the results from one research paper are as follows:
As the rank increases, there is a significant rise in scores, but after a certain point the curve simply flattens out.
Additive Methods - Prompt Tuning:
This is a method in which we freeze the entire LLM's parameters and concentrate on the embedding stage to achieve the targeted results for our task. The core idea is to add some additional trainable tokens at the embedding stage alongside the original embeddings of the prompt at hand.
These tokens are not like regular tokens in English; they can take any form. In English, a word or token is a fixed entity in the vector space.
But these soft tokens can take any value and do not correspond to a particular location in that space.
As discussed earlier, in full fine-tuning the model weights are updated; in prompt tuning, only these added soft-prompt embeddings are trained.
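A conceptual sketch of soft prompts in PyTorch (the vocabulary size, embedding width, and number of virtual tokens are arbitrary choices for illustration): only the prepended soft-prompt vectors require gradients.

```python
import torch
import torch.nn as nn

vocab_size, d_model, num_virtual_tokens = 32000, 768, 20   # illustrative sizes

token_embedding = nn.Embedding(vocab_size, d_model)
token_embedding.weight.requires_grad_(False)               # frozen, like the rest of the LLM

# Trainable "soft" tokens: free vectors in embedding space, not tied to any vocabulary word.
soft_prompt = nn.Parameter(torch.randn(num_virtual_tokens, d_model) * 0.02)

input_ids = torch.randint(0, vocab_size, (1, 12))          # a tokenized prompt (dummy ids)
prompt_embeds = token_embedding(input_ids)                 # (1, 12, d_model)

# Prepend the soft prompt to the real prompt embeddings before feeding the transformer.
full_embeds = torch.cat([soft_prompt.unsqueeze(0), prompt_embeds], dim=1)
print(full_embeds.shape)   # (1, 20 + 12, 768) — only soft_prompt gets gradient updates
```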
Performance Analysis:
The plot indicates that although soft prompt tuning does not achieve the best performance with smaller models, it eventually performs on par with full fine-tuning once the models get larger.
RLHF (Reinforcement Learning from Human Feedback):
Because they are trained on real-world data, trained models can have the issues described above. Let's see a few instances where a model fails to perform.
So, for this reason, we fine-tune the model with human feedback to achieve the 3 Hs (helpful, honest, harmless) and avoid toxic responses.
Studies also show that models trained using RLHF perform considerably better.
RLHF not only helps minimize toxicity but also helps us achieve personalization in our model.
Structure of RLHF we are planning to implement:
The agent performing an action is the LLM. The action it takes is choosing a word, and the action space is the set of vocabulary tokens it can choose from. The environment is the context window of the LLM, and the current context is the state. The objective is to generate aligned (non-toxic) text. With this setup, we try to optimize the model parameters to get the maximum reward. Here the reward is decided by whether the word just generated is toxic or non-toxic, and toxicity is identified either through human feedback or by building a reward model trained on a set of toxic and non-toxic data.
Working of RLHF:
The first step is to give our model a set of sample prompts and produce multiple completions for each.
The second step is to evaluate these completions with human assistance. The same prompts are given to multiple labelers to avoid mistakes made by any single labeler. On the right, we can see how the third labeler made a wrong interpretation.
The next step is to prepare structured data for training the reward model.
As you can see from the above image, once the ranking is finalized, we create pairs of two completions: one marked as the preferred completion (denoted 1) and the other as the less appropriate completion (denoted 0). While generating these pairs, we should make sure the preferred completion is always in the first position so the reward model can be trained correctly.
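A small sketch of turning a ranking into pairwise training data, plus the pairwise loss a reward model is commonly trained with; the completions and reward scores are made up, and the loss form, -log σ(r_chosen − r_rejected), is the standard pairwise preference loss:

```python
import itertools
import torch
import torch.nn.functional as F

# Human ranking for one prompt: index 0 is best, last is worst (completions are made up).
ranked_completions = ["polite, helpful answer", "vague answer", "toxic answer"]

# Every ordered pair (better, worse) becomes one training example,
# with the preferred completion always in the first position.
pairs = [(ranked_completions[i], ranked_completions[j])
         for i, j in itertools.combinations(range(len(ranked_completions)), 2)]
print(pairs)

# Pairwise reward-model loss: push r_chosen above r_rejected.
r_chosen = torch.tensor([1.8])      # reward score for the preferred completion (dummy value)
r_rejected = torch.tensor([-0.4])   # reward score for the rejected completion (dummy value)
loss = -F.logsigmoid(r_chosen - r_rejected).mean()
print(loss.item())
```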
In this way, a language model is trained to act as a reward model, and the training process could look like the one below.
In the above image, the prediction from the LLM is sent to the reward model. The reward model in turn returns a reward value that is slightly on the low side, so the RL algorithm receives this score and updates the weights of the LLM to increase it.
The iterations continue this way until we reach a threshold value for the reward or the maximum number of iterations is reached.
As the name suggests, PPO (Proximal Policy Optimization) is an algorithm that updates the weights of the LLM in small steps to align the model with human preferences.
This can be broken down into two phases.
In phase one, we generate multiple completions for the prompts and pass these completions to the reward model to obtain reward scores.
Generate reward scores
As you can see in the image above, as the new token is generated, the estimated future reward rises to 1.23, but this differs from the actual value produced by the reward model.
The goal is to reduce the difference between the estimated and known rewards (the value loss). This helps make the value function more accurate for future use.
In phase two, we use the policy loss to update the weights of the LLM.
Here, π(a_t | s_t) indicates the probability of the next token a_t given the present state s_t. In the formula, the term in the denominator is that probability under the old version of the LLM, and the numerator is the probability under the updated version. The denominator is kept frozen because it is a fixed estimate from the previous model.
A_t is the advantage term, which estimates how much better or worse the current action is compared to the other possible actions at state s_t; that is, it indicates how good token a_t is compared to the other candidate tokens.
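A minimal sketch of the clipped PPO objective in PyTorch (the log-probabilities, advantages, reward values, and ε are dummy numbers; the clipping form is the standard PPO-clip surrogate):

```python
import torch
import torch.nn.functional as F

eps = 0.2                                          # clipping range (a typical default)

# Dummy per-token quantities for one batch of generated tokens.
logprob_new = torch.tensor([-1.1, -0.7, -2.3])     # log pi_theta(a_t | s_t), current LLM
logprob_old = torch.tensor([-1.3, -0.8, -2.0])     # log pi_theta_old(a_t | s_t), frozen copy
advantage = torch.tensor([0.5, 1.2, -0.3])         # A_t from rewards and the value function

ratio = torch.exp(logprob_new - logprob_old)       # pi_new / pi_old
unclipped = ratio * advantage
clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantage

policy_loss = -torch.min(unclipped, clipped).mean()   # maximize the surrogate => minimize its negative
value_loss = F.mse_loss(                              # phase-one value loss: estimated vs actual reward
    torch.tensor([0.9, 1.23, 0.4]), torch.tensor([1.0, 1.3, 0.5]))
print(policy_loss.item(), value_loss.item())
```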