Over half of LLM-written news summaries have “significant issues”—BBC analysis


Here at Ars, we’ve done plenty of coverage of the errors and inaccuracies that LLMs often introduce into their responses. Now, the BBC is trying to quantify the scale of this confabulation problem, at least when it comes to summaries of its own news content.

In an extensive report published this week, the BBC analyzed how four popular large language models used or abused information from BBC articles when answering questions about the news. The analysis found inaccuracies, misquotes, and/or misrepresentations of BBC content in a significant proportion of the tests, supporting the news organization’s conclusion that “AI assistants cannot currently be relied upon to provide accurate news, and they risk misleading the audience.”

Where did you come up with that?

To assess the state of AI news summaries, BBC’s Responsible AI team gathered 100 news questions related to trending Google search topics from the last year (e.g., “How many Russians have died in Ukraine?” or “What is the latest on the independence referendum debate in Scotland?”). These questions were then put to ChatGPT-4o, Microsoft Copilot Pro, Google Gemini Standard, and Perplexity, with the added instruction to “use BBC News sources where possible.”

The 362 responses (excluding situations where an LLM refused to answer) were then reviewed by 45 BBC journalists who were experts on the subject in question. Those journalists were asked to look for issues (either “significant” or merely “some”) in the responses regarding accuracy, impartiality and editorialization, attribution, clarity, context, and fair representation of the sourced BBC article.

Is it good when over 30 percent of your product’s responses contain significant inaccuracies?

Credit: BBC

Fifty-one percent of responses were judged to have “significant issues” in at least one of these areas, the BBC found. Google Gemini fared the worst overall, with significant issues judged in just over 60 percent of responses, while Perplexity performed best, with just over 40 percent showing such issues.

Accuracy ended up being the biggest problem across all four LLMs, with significant issues identified in over 30 percent of responses (and considerably more falling into the “some issues” category). That includes one in five responses where the AI incorrectly reproduced “dates, numbers, and factual statements” that were erroneously attributed to BBC sources. And in 13 percent of cases where an LLM quoted from a BBC article directly (eight out of 62), the analysis found those quotes were “either altered from the original source or not present in the cited article.”
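For readers who want to check the arithmetic, the figures quoted above can be reproduced with a rough sketch like the one below. The variable names are ours, not the BBC's, and the calculation assumes each of the 100 questions was put to each of the four assistants exactly once, which would imply that roughly 38 responses were excluded as refusals. The BBC report itself should be treated as the authority on those counts.

```python
# Figures quoted in the article; all names here are illustrative only.
QUESTIONS = 100
ASSISTANTS = 4
REVIEWED_RESPONSES = 362
DIRECT_QUOTE_RESPONSES = 62
FLAWED_QUOTE_RESPONSES = 8


def percent(part: int, whole: int) -> int:
    """Return part/whole as a percentage rounded to the nearest whole number."""
    return round(100 * part / whole)


# Assumption: one response per question per assistant.
max_responses = QUESTIONS * ASSISTANTS
excluded_responses = max_responses - REVIEWED_RESPONSES

# Share of direct-quote responses where the quote was altered or missing.
quote_error_rate = percent(FLAWED_QUOTE_RESPONSES, DIRECT_QUOTE_RESPONSES)

print(f"Possible responses: {max_responses}")
print(f"Implied exclusions: {excluded_responses}")
print(f"Flawed direct quotes: {quote_error_rate} percent")
```

Eight flawed quotes out of 62 works out to just under 13 percent, which matches the rounded figure in the report.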


