7B Llama defeats the 540B “Google GPT”: MIT uses game theory to tune large models, with no extra training required

Original source: Qubits


Based on game theory, MIT proposes a new large model optimization strategy.

With it, the 7B-parameter Llama surpassed PaLM, the 540B “Google GPT”, on multiple datasets.

Moreover, the whole process requires no additional training of the model and consumes relatively little compute.

This optimization strategy based on game theory is called equilibrium ranking.

The research team recast the large model’s language decoding process as a regularized incomplete-information game.

The term breaks down into two parts, “regularization” and “incomplete-information game”, both of which are covered in the detailed explanation of the principle below.

Over the course of the game, the model continually refines the answers it produces, making the generated results more factual.

Experimental results show that the equilibrium ranking method is significantly better than other optimization methods, and even other models, on multiple test datasets.

So, how exactly does the equilibrium ranking method apply game theory to large models?

Letting the large model “play against itself”

As mentioned above, the researchers recast the language decoding process of large models as a “regularized incomplete-information game”.

The incomplete-information game is the core of the whole method, while regularization is a mechanism for avoiding errors; let’s look at the game first.

Specifically, they designed two modules, the generator (G) and the discriminator (D), which hold different information and play different roles.

The generator produces answers based on a “correctness parameter” randomly assigned by the environment (N); the discriminator is responsible only for judging whether the generator’s answer is correct, without access to that environmental parameter.

If the discriminator’s judgment matches the environmental parameter, both players are rewarded with 1 point; otherwise, neither scores.
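To make the payoff rule concrete, here is a minimal sketch in Python; the function name and the boolean encoding of the correctness parameter are illustrative assumptions, not the paper’s implementation.

```python
# Sketch of the consensus-game payoff described above (illustrative only):
# the environment draws a hidden correctness parameter, the generator answers
# conditioned on it, and the discriminator labels the answer correct or not.
# Both players score 1 only when the discriminator's verdict matches the
# environment's parameter.

def payoff(env_correct: bool, discriminator_says_correct: bool) -> tuple[int, int]:
    """Return (generator_reward, discriminator_reward)."""
    if discriminator_says_correct == env_correct:
        return 1, 1  # the verdict agrees with the hidden correctness parameter
    return 0, 0      # disagreement: neither side is rewarded

print(payoff(env_correct=True, discriminator_says_correct=True))   # (1, 1)
print(payoff(env_correct=True, discriminator_says_correct=False))  # (0, 0)
```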

Through repeated rounds of generation and discrimination, the model’s goal is to reach a Nash equilibrium.

Under a Nash equilibrium strategy profile, no player can increase their own payoff by unilaterally changing their strategy while the other players’ strategies remain unchanged.

For example, suppose Zhang San and Li Si are deciding together what to have for dinner. The options are hot pot and barbecue, and the other known conditions are as follows:

  • Zhang San’s satisfaction with hot pot is 2 points (likes it), and with barbecue 1 point (acceptable)
  • Li Si’s satisfaction with barbecue is 2 points, and with hot pot 1 point
  • Neither of them wants to eat alone, so satisfaction is 0 whenever they eat alone

There are then four possible combinations of choices, with the corresponding satisfaction scores (Zhang San, Li Si) as follows:

  • Both choose hot pot: (2, 1)
  • Both choose barbecue: (1, 2)
  • Zhang San chooses hot pot, Li Si chooses barbecue: (0, 0)
  • Zhang San chooses barbecue, Li Si chooses hot pot: (0, 0)

In this scenario, the best outcomes, and the two Nash equilibria, are the ones where both choose the same dish: once there, if either person unilaterally changes strategy, the satisfaction of both drops to 0.
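The same check can be done mechanically. The sketch below encodes the payoff table above and brute-forces the pure-strategy Nash equilibria; the dictionary layout is just one convenient encoding.

```python
# Payoff table for the dinner example: values are (Zhang San, Li Si) scores.
payoffs = {
    ("hot pot", "hot pot"):   (2, 1),
    ("barbecue", "barbecue"): (1, 2),
    ("hot pot", "barbecue"):  (0, 0),
    ("barbecue", "hot pot"):  (0, 0),
}
options = ["hot pot", "barbecue"]

def is_nash(zhang: str, li: str) -> bool:
    """A profile is a Nash equilibrium if neither player gains by deviating alone."""
    u_zhang, u_li = payoffs[(zhang, li)]
    zhang_cannot_improve = all(payoffs[(alt, li)][0] <= u_zhang for alt in options)
    li_cannot_improve    = all(payoffs[(zhang, alt)][1] <= u_li for alt in options)
    return zhang_cannot_improve and li_cannot_improve

print([pair for pair in payoffs if is_nash(*pair)])
# [('hot pot', 'hot pot'), ('barbecue', 'barbecue')]
```

Both same-dish outcomes are equilibria, and neither mixed outcome is, matching the reasoning above.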

Returning to the equilibrium ranking method: the generator and discriminator first initialize their strategies, the generator’s based on the question and the discriminator’s based on the candidate answers.


After initialization, the generator and discriminator play multiple rounds of the game, gradually updating their strategies until the iteration terminates.

At the end of each round, the difference between each player’s score (generator and discriminator separately) and the score of its optimal strategy is calculated; this difference is called the “regret”.

The iteration continues until the regret converges and the strategies approach a Nash equilibrium.
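As a rough illustration of this play-measure-update loop, here is a minimal regret-matching sketch on the dinner game above. Regret matching is a generic no-regret procedure used here only to show how strategies can be updated from accumulated regret; it is not the regularized dynamics the MIT team actually runs, and the round count is arbitrary.

```python
import random

# Dinner game payoffs: values are (Zhang San, Li Si) scores.
options = ["hot pot", "barbecue"]
payoffs = {("hot pot", "hot pot"): (2, 1), ("barbecue", "barbecue"): (1, 2),
           ("hot pot", "barbecue"): (0, 0), ("barbecue", "hot pot"): (0, 0)}

# Cumulative regret for each player and each action, initially zero.
regrets = [{o: 0.0 for o in options}, {o: 0.0 for o in options}]

def strategy(regret: dict) -> dict:
    """Play each action in proportion to its positive cumulative regret."""
    positive = {o: max(v, 0.0) for o, v in regret.items()}
    total = sum(positive.values())
    if total == 0:
        return {o: 1.0 / len(options) for o in options}
    return {o: positive[o] / total for o in options}

random.seed(0)
for _ in range(10_000):
    strats = [strategy(regrets[0]), strategy(regrets[1])]
    actions = [random.choices(options, weights=[s[o] for o in options])[0]
               for s in strats]
    for i in (0, 1):
        realized = payoffs[tuple(actions)][i]
        for o in options:
            # Regret of action o: how much better it would have scored against
            # the other player's realized choice in this round.
            counterfactual = (o, actions[1]) if i == 0 else (actions[0], o)
            regrets[i][o] += payoffs[counterfactual][i] - realized

print(strategy(regrets[0]), strategy(regrets[1]))  # strategies after many rounds
```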

Once the Nash equilibrium is reached, the strategies of the generator and discriminator are fixed; each then scores the candidate answers separately, and the candidates are ranked to select the best answer.

At the Nash equilibrium, the two scores should be consistent; if they disagree, the answer is eliminated.
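Once the equilibrium policies are available, the ranking step is a plain scoring pass. The sketch below assumes the two equilibrium policies are given as probability tables over three made-up candidate answers and combines them by summing log-probabilities; the combination rule, the candidates, and all numbers are illustrative assumptions, not the paper’s exact formula.

```python
import math

# Hypothetical equilibrium policies for one question (numbers are made up):
#   gen_policy[y]  -- probability the equilibrium generator produces y when
#                     asked for a *correct* answer
#   disc_policy[y] -- probability the equilibrium discriminator labels y correct
gen_policy  = {"Paris": 0.70, "Lyon": 0.20, "Berlin": 0.10}
disc_policy = {"Paris": 0.95, "Lyon": 0.40, "Berlin": 0.05}

def equilibrium_rank(candidates: list[str]) -> list[str]:
    """Rank candidates so that only answers both players favor score highly.

    Summing log-probabilities is one simple way to require agreement between
    the two equilibrium policies (an illustrative choice)."""
    scores = {y: math.log(gen_policy[y]) + math.log(disc_policy[y]) for y in candidates}
    return sorted(scores, key=scores.get, reverse=True)

print(equilibrium_rank(["Paris", "Lyon", "Berlin"]))  # ['Paris', 'Lyon', 'Berlin']
```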

However, since the generator and discriminator are scored on consistency with the environmental signal rather than on objective facts, simply pursuing a Nash equilibrium does not guarantee a reasonable answer.

To keep the two from agreeing on the same wrong answer, the researchers also introduced a regularization-based error correction mechanism.

The first part is to give the generator and discriminator prior strategies grounded in objective facts, rather than letting them initialize randomly.

These prior strategies act as the “gold standard” for the generator’s and discriminator’s policies, guiding the direction in which the strategies are optimized.

The second part is a KL penalty: whenever a new strategy is produced, its KL divergence (also known as relative entropy) from the initial strategy is calculated.

KL divergence measures how different the two distributions are: the higher the value, the further the new strategy has drifted from the initial one.

Assuming P(x) and Q(x) are two probability distributions over the random variable X, the KL divergence in the discrete and continuous cases is, respectively:

D_KL(P‖Q) = Σ_x P(x) log( P(x) / Q(x) )

D_KL(P‖Q) = ∫ P(x) log( P(x) / Q(x) ) dx

This divergence is added as a penalty to the objective used to generate the new strategy, which keeps the final generated result from drifting away from objective facts.

Concretely, the reward function U contains this KL divergence term, weighted by a penalty coefficient λ (> 0).

The larger the KL divergence, i.e., the greater the deviation from the objective facts, the more the model’s reward score is reduced.

In this way, when the generator and discriminator agree on a result that does not match the facts, that result will not receive a high score and will not become the final answer.
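A minimal sketch of such a KL-penalized reward is shown below. The function names, the simple “reward minus λ times KL” form, and the example distributions are illustrative assumptions rather than the paper’s exact per-player utilities.

```python
import math

def kl_divergence(p: dict, q: dict) -> float:
    """Discrete KL divergence D_KL(P || Q) = sum_x P(x) * log(P(x) / Q(x))."""
    return sum(p[x] * math.log(p[x] / q[x]) for x in p if p[x] > 0)

def regularized_reward(consistency_reward: float, policy: dict, prior: dict,
                       lam: float = 0.1) -> float:
    """Consistency reward minus a KL penalty toward the factual prior policy.

    The further the new policy drifts from the prior (larger KL), the more
    the reward is reduced, so agreement that contradicts the prior scores low.
    (Illustrative form only.)"""
    return consistency_reward - lam * kl_divergence(policy, prior)

prior   = {"Paris": 0.90, "Lyon": 0.08, "Berlin": 0.02}  # prior from the base model
honest  = {"Paris": 0.85, "Lyon": 0.10, "Berlin": 0.05}  # stays close to the prior
drifted = {"Paris": 0.05, "Lyon": 0.05, "Berlin": 0.90}  # internally consistent but far off

print(regularized_reward(1.0, honest, prior))   # ~0.998: penalty is tiny
print(regularized_reward(1.0, drifted, prior))  # ~0.67: heavily penalized
```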

With this strategy, the research team achieved excellent results with the 7B Llama at lower computational cost.

Some capabilities surpass the “Google GPT”

Overall, the Llama optimized with equilibrium ranking performed well on common-sense reasoning, reading comprehension, mathematics, and dialogue tasks.

On multiple-choice questions, the same Llama, once optimized with the equilibrium ranking method, ranks near the top on datasets such as MMLU.

On question answering, the 13B Llama optimized with the equilibrium ranking strategy achieved the best results on the TruthfulQA dataset, and the 7B version was not far behind first place.

Beyond text comprehension and reasoning, the model also reaches a high level in mathematics.

Among the many optimization methods applied to the 7B Llama, equilibrium ranking achieved the best score on the GSM8K test.

Equilibrium ranking is not only the best among the many Llama optimization methods, but the optimized model also surpasses other models.

On the Challenge subset of the ARC dataset and the High subset of the RACE dataset, Llama-7B with equilibrium ranking reached accuracies of 58.3% and 56.4%, respectively, significantly exceeding PaLM-540B’s 53.0% and 49.1%.

For further details, see the original paper.

Paper Address:
