Large model hallucination rates ranked: GPT-4 is lowest at 3%, Google PaLM as high as 27.2%

Original source: Heart of the Machine

Image source: Generated by Unbounded AI

Artificial intelligence is advancing rapidly, but problems abound. OpenAI’s new GPT vision API is a case in point: one moment people are marveling at how capable it is, the next they are complaining about its hallucination problem.

Hallucination has long been a fatal flaw of large models. Because their training datasets are vast and complex, they inevitably contain outdated or incorrect information, which puts output quality to a severe test. Heavily repeated information can also bias a model, which is another source of hallucination. But hallucinations are not intractable: careful curation and strict filtering of datasets during development, the construction of high-quality datasets, and optimization of model architecture and training methods can all mitigate the problem to some extent.

With so many large models in vogue, how well does each one keep hallucinations in check? Here is a leaderboard that makes the gaps plain.

The leaderboard is published by Vectara, an AI-focused platform. It was last updated on November 1, 2023, and Vectara says it will continue to re-run the hallucination evaluation as models are updated.

Project Address:

To build the leaderboard, Vectara studied factual consistency in summarization models across a variety of open-source datasets and trained a state-of-the-art model to detect hallucinations in LLM output. They then fed 1,000 short documents to each LLM through its public API and asked it to summarize each document using only the facts presented in that document. Of the 1,000 documents, only 831 were summarized by every model; the rest were rejected by at least one model due to content restrictions. Using these 831 documents, Vectara computed each model’s overall accuracy and hallucination rate. The rate at which each model refused to respond is reported in the “Answer Rate” column. None of the content sent to the models was illegal or unsafe, but it contained enough trigger words to set off some content filters. The documents come mainly from the CNN / Daily Mail corpus.
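As a rough illustration of the harness described above, here is a minimal sketch (not Vectara’s actual code) of how one model in the comparison might be queried. It assumes the OpenAI Python SDK (v1+); the prompt wording and model name are illustrative assumptions, not Vectara’s published prompt.

```python
# Sketch of the summarization step of the evaluation, assuming the
# OpenAI Python SDK >= 1.0. Prompt wording is an assumption.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def summarize(document: str, model: str = "gpt-4") -> str | None:
    """Ask the model to summarize using only facts stated in the document."""
    resp = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[{
            "role": "user",
            "content": (
                "Summarize the following passage, using only facts "
                "stated in the passage itself:\n\n" + document
            ),
        }],
    )
    return resp.choices[0].message.content

# Only documents that every model answered (831 of 1,000 here) are scored,
# so refusals lower the "Answer Rate" rather than the hallucination rate.
```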

It is important to note that Vectara evaluates summarization accuracy, not overall factual accuracy. This makes it possible to compare a model’s response against the information it was given; in other words, the output summary is judged on whether it is “factually consistent” with the source document. Since it is unknown what data each LLM was trained on, hallucination cannot be determined for any arbitrary question. Moreover, building a model that could judge whether an answer is a hallucination without a reference source would itself require solving the hallucination problem, and would mean training a model as large as or larger than the LLMs being evaluated. Vectara therefore chose to measure the hallucination rate on a summarization task, since this serves as a good proxy for a model’s overall factuality.
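To make the scoring step concrete, here is a minimal sketch of checking a summary for factual consistency against its source. It assumes the detector loads as a sentence-transformers CrossEncoder, which matches how Vectara’s publicly released hallucination evaluation model on Hugging Face was documented at launch; the threshold of 0.5 is an illustrative assumption.

```python
# Sketch of scoring one (source, summary) pair for factual consistency.
# Assumes the sentence-transformers CrossEncoder interface works for this
# model; newer releases of the detector may require a different loader.
from sentence_transformers import CrossEncoder

detector = CrossEncoder("vectara/hallucination_evaluation_model")

def consistency_score(source: str, summary: str) -> float:
    # Scores near 1.0 mean the summary is supported by the source;
    # scores near 0.0 mean it contradicts or invents facts.
    return float(detector.predict([[source, summary]])[0])

# A corpus-level hallucination rate is then the fraction of pairs that
# fall below a chosen consistency threshold, e.g.:
# rate = sum(consistency_score(d, s) < 0.5 for d, s in pairs) / len(pairs)
```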

Hallucination detection model address:

In addition, LLMs are increasingly used in RAG (Retrieval-Augmented Generation) pipelines to answer user queries, as in Bing Chat and Google’s chat integration. In a RAG system, the model is deployed as a summarizer of search results, so the leaderboard is also a good indicator of how accurate a model will be when used in a RAG setting. A sketch of this deployment pattern follows.
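The sketch below shows why summarization fidelity maps onto RAG quality: the model’s only job is to condense retrieved passages, so any fact not in them is a hallucination. The `search` retriever, model name, and prompt are hypothetical placeholders, not any particular product’s pipeline.

```python
# Hedged sketch of a RAG answer step, reusing the OpenAI SDK client
# from the earlier example. `search` is a hypothetical retriever.
from openai import OpenAI

client = OpenAI()

def rag_answer(query: str, search) -> str | None:
    passages = search(query, top_k=3)  # hypothetical retrieval function
    context = "\n\n".join(passages)
    resp = client.chat.completions.create(
        model="gpt-4",
        temperature=0,
        messages=[{
            "role": "user",
            "content": (
                "Answer the question using ONLY the passages below.\n\n"
                f"Passages:\n{context}\n\nQuestion: {query}"
            ),
        }],
    )
    return resp.choices[0].message.content
```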

Given GPT-4’s consistently excellent performance, it is no surprise that it has the lowest hallucination rate. Still, some netizens said they were surprised that GPT-3.5 and GPT-4 are not very far apart.

LLaMA 2 performs best after GPT-4 and GPT-3.5. Google’s large models, by contrast, fare poorly: some netizens noted that Google Bard often deflects from wrong answers with “I’m still training.”

A leaderboard like this gives us a more intuitive sense of each model’s strengths and weaknesses. OpenAI launched GPT-4 Turbo only a few days ago, and some netizens have already proposed adding it to the leaderboard.

It remains to be seen what the next ranking will look like, and whether there will be significant changes.

Reference Link:
