Physics World, 4 September
Artificial intelligence: opportunities and challenges coexist, and data quality is key

Artificial intelligence (AI) is spreading at an unprecedented pace, and tools typified by generative AI (GenAI) are changing how we work and live. GenAI's power stems from vast quantities of data and advances in computing, letting it process and summarize large volumes of text efficiently and support complex problems such as climate modelling and drug discovery. AI is far from flawless, however: the quality of its output depends heavily on the quality of the input data and the integrity of the user. AI models may also suffer from information censorship, data bias and the "black box" problem, making their decision-making opaque. Users should therefore stay alert when using AI, protect their personal privacy, and carefully verify AI-generated content to keep the vicious cycle of "garbage in, garbage out" from taking hold.

💡 The spread and power of AI: artificial intelligence, and generative AI (GenAI) in particular, has become a powerful tool that can process and summarize large volumes of unstructured text and support fields such as scientific research. This rests on vast quantities of digitized data and major advances in computing power, which let AI complete tasks that were once unimaginable and bring users substantial gains in efficiency and cost savings.

⚠️ AI's two key ingredients: like any other data-analytics solution, how well an AI system works depends on two core factors: the quality of the input data, and the integrity of the user in ensuring the outputs are fit for purpose. Successful use of AI therefore rests on good data sources and responsible users.

🚫 AI's potential risks and challenges: AI systems carry risks, including but not limited to: 1. information censorship: models may hide or distort information because of constrained training data or deliberate intervention (as in the Tiananmen Square example); 2. model bias: social biases present in the training data can be inherited by AI, producing unfair outputs (as in the gender-bias cases in recruitment and credit approval); 3. the "black box" problem: AI decision-making is opaque, making it hard to understand how conclusions were reached, which can undermine trust in critical systems such as driver-assistance systems.

🔒 Advice for using AI with care: users should be cautious with AI tools. Many GenAI applications store user prompts and conversation histories, which may be used to train future models, so personal information such as medical or financial data should be shared sparingly. Keeping prompts non-specific (avoiding names or dates of birth) helps protect privacy. Users should also check AI-generated output carefully before making important decisions, and remember that AI can make mistakes, lest "garbage in, garbage out" escalate into "garbage in, garbage squared".

Artificial intelligence (AI) is fast becoming the new “Marmite”. Like the salty spread that polarizes taste-buds, you either love AI or you hate it. To some, AI is miraculous, to others it’s threatening or scary. But one thing is for sure – AI is here to stay, so we had better get used to it.

In many respects, AI is very similar to other data-analytics solutions in that how it works depends on two things. One is the quality of the input data. The other is the integrity of the user to ensure that the outputs are fit for purpose.

Previously a niche tool for specialists, AI is now widely available for general-purpose use, in particular through generative AI (GenAI) tools. Also known as large language models (LLMs), they're now accessible through, for example, OpenAI's ChatGPT, Microsoft Copilot, Anthropic's Claude, Adobe Firefly or Google Gemini.

GenAI has become possible thanks to the availability of vast quantities of digitized data and significant advances in computing power. Neural-network models of this size would in fact have been impossible without these two fundamental ingredients.

GenAI is incredibly powerful when it comes to searching and summarizing large volumes of unstructured text. It exploits unfathomable amounts of data and is getting better all the time, offering users significant benefits in terms of efficiency and labour saving.

Many people now use it routinely for writing meeting minutes, composing letters and e-mails, and summarizing the content of multiple documents. AI can also tackle complex problems that would be difficult for humans to solve, such as climate modelling, drug discovery and protein-structure prediction.

I’d also like to give a shout out to tools such as Microsoft Live Captions and Google Translate, which help people from different locations and cultures to communicate. But like all shiny new things, AI comes with caveats, which we should bear in mind when using such tools.

User beware

LLMs, by their very nature, have been trained on historical data. They can’t therefore tell you exactly what may happen in the future, or indeed what may have happened since the model was originally trained. Models can also be constrained in their answers.

Take the Chinese AI app DeepSeek. When the BBC asked it what had happened at Tiananmen Square in Beijing on 4 June 1989 – when Chinese troops cracked down on protestors – the chatbot's answer was suppressed. Now, this is a very obvious piece of information control, but subtler instances of censorship will be harder to spot.

We also need to be conscious of model bias. At least some of the training data will probably come from social media and public chat forums such as X, Facebook and Reddit. Trouble is, we can’t know all the nuances of the data that models have been trained on – or the inherent biases that may arise from this.

One example of unfair gender bias was when Amazon developed an AI recruiting tool. Trained on 10 years' worth of CVs – mostly from men – the tool was found to favour men. Thankfully, Amazon ditched it. But then there was Apple's gender-biased credit-card algorithm, which led to men being given higher credit limits than women with similar credit ratings.
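
A simple first check for this kind of group bias is to compare outcome rates across groups. The sketch below is hypothetical and not drawn from either case: it assumes you have a list of (group, decision) records from the tool under audit, and it computes each group's selection rate plus the "four-fifths" disparate-impact ratio sometimes used as a rule of thumb.

```python
from collections import defaultdict

# Hypothetical screening decisions: (group, 1 = shortlisted, 0 = rejected).
# In a real audit these records would come from the tool under test.
decisions = [
    ("men", 1), ("men", 1), ("men", 0), ("men", 1),
    ("women", 1), ("women", 0), ("women", 0), ("women", 0),
]

# Selection rate per group: shortlisted / total.
counts = defaultdict(lambda: [0, 0])  # group -> [shortlisted, total]
for group, outcome in decisions:
    counts[group][0] += outcome
    counts[group][1] += 1

rates = {group: s / t for group, (s, t) in counts.items()}
for group, rate in rates.items():
    print(f"{group}: selection rate {rate:.2f}")

# Disparate-impact ratio: lowest selection rate over highest.
# A common rule of thumb flags ratios below 0.8 (the "four-fifths rule").
ratio = min(rates.values()) / max(rates.values())
print(f"disparate-impact ratio: {ratio:.2f}" + (" (flagged)" if ratio < 0.8 else ""))
```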

Another problem with AI is that it sometimes acts as a black box, making it hard for us to understand how, why or on what grounds it arrived at a certain decision. Think about those CAPTCHA tests we have to take when accessing online accounts. They often present us with a street scene and ask us to select those parts of the image containing a traffic light.

The tests are designed to distinguish between humans and computers or bots – the expectation being that AI can't consistently recognize traffic lights. However, AI-based advanced driver-assistance systems (ADAS) presumably perform this function seamlessly on our roads. If not, surely drivers are being put at risk?

A colleague of mine, who drives an electric car that happens to share its name with a well-known physicist, confided that the ADAS in his car becomes unresponsive, especially at traffic lights with filter arrows or where there are multiple sets of lights. So what exactly is going on with ADAS? Does anyone know?

Caution needed

My message when it comes to AI is simple: be careful what you ask for. Many GenAI applications will store user prompts and conversation histories and will likely use this data for training future models. Once you enter your data, there's no guarantee it'll ever be deleted. So think carefully before sharing any personal data, such as medical or financial information. It also pays to keep prompts non-specific (avoid using your name or date of birth) so that they cannot be traced directly to you.
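
One low-tech way to follow that advice is to scrub obvious identifiers from a prompt before it leaves your machine. The sketch below is a minimal illustration, not a complete anonymizer: the redact_prompt helper and its patterns are my own hypothetical examples, and real personal data takes many more forms than names, dates and e-mail addresses.

```python
import re

# Hypothetical patterns for obvious identifiers; a real anonymizer would
# need far wider coverage (addresses, ID numbers, account details, ...).
PATTERNS = [
    (re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"), "[DATE]"),   # e.g. 04/06/1989
    (re.compile(r"\b\d{4}-\d{2}-\d{2}\b"), "[DATE]"),         # e.g. 1989-06-04
    (re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"), "[EMAIL]"),  # e-mail addresses
]

def redact_prompt(prompt: str, names: list[str]) -> str:
    """Replace known names and date/e-mail patterns before sending a prompt."""
    for name in names:  # names you know appear in your own prompts
        prompt = re.sub(re.escape(name), "[NAME]", prompt, flags=re.IGNORECASE)
    for pattern, placeholder in PATTERNS:
        prompt = pattern.sub(placeholder, prompt)
    return prompt

print(redact_prompt(
    "I'm Jane Doe, born 04/06/1989, reachable at jane@example.com. "
    "Summarize my blood-test results.",
    names=["Jane Doe"],
))
# -> I'm [NAME], born [DATE], reachable at [EMAIL]. Summarize my blood-test results.
```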

The democratization of AI is a great enabler, and it's easy for people to apply it without an in-depth understanding of what's going on under the hood. But we should be checking AI-generated output before we use it to make important decisions, and we should be careful about the personal information we divulge.

It’s easy to become complacent when we are not doing all the legwork. We are reminded under the terms of use that “AI can make mistakes”, but I wonder what will happen if models start consuming AI-generated erroneous data. Just as with other data-analytics problems, AI suffers from the old adage of “garbage in, garbage out”.
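
The adage is easy to demonstrate on a toy problem. The sketch below – a made-up illustration assuming scikit-learn and NumPy are installed, not a model of how LLMs actually degrade – trains the same simple classifier on progressively noisier labels and measures its accuracy on clean test data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# A synthetic two-class problem standing in for "good data".
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Flip a growing fraction of the training labels ("garbage in") and
# watch the accuracy on clean test data fall ("garbage out").
for noise in (0.0, 0.1, 0.3, 0.5):
    y_noisy = y_train.copy()
    flip = rng.random(len(y_noisy)) < noise
    y_noisy[flip] = 1 - y_noisy[flip]
    model = LogisticRegression(max_iter=1000).fit(X_train, y_noisy)
    print(f"label noise {noise:.0%}: test accuracy {model.score(X_test, y_test):.2f}")
```

At 50% label noise the training labels carry no signal at all, so the model can do no better than chance – a literal case of garbage in, garbage out.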

But sometimes I fear it’s even worse than that. We’ll need a collective vigilance to avoid AI being turned into “garbage in, garbage squared”.

The post Garbage in, garbage out: why the success of AI depends on good data appeared first on Physics World.
