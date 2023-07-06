A newly formed OpenAI team led by Ilya Sutskever and Jan Leike aims to develop methods to control a “superintelligence” whose goals are inconsistent with human values. Because such superintelligence could be realized as early as 2030, according to the company, OpenAI wants to develop the necessary control mechanisms over the next four years, the company writes in a blog post.

Advertisement

The ambitious goal of the new team is to create “the first automatic alignment researcher” with human-level capabilities – i.e. an AI that develops methods to control AIs. The work of the new team is intended to complement ongoing OpenAI projects , which aim to improve the safety of current models, and the company also wants to hire new people for this.

Accusation of “disaster PR”

Critics have long accused OpenAI of operating a kind of “disaster PR” with this and similar projects, which exaggerates the danger of generative AI in order to exaggerate the importance of its own work. In addition, the scenario is based on the thesis of an “existential threat” to mankind by breaking latest news (Artificial General Intelligence). The idea, which is closely linked to the so-called effective altruism, is quite popular among young Silicon Valley investors, but contains rather questionable ideas, such as the assessment that climate change is not an existential threat, but rather a runaway “superintelligence”. An overview of the positions of well-known AI researchers compiled by IEEE Spectrum shows how controversial the entire topic is.

Regardless of the question of how likely the development of a “superhuman” artificial intelligence is at all, and whether it then pursues its own “selfish” goals (and is hostile to humans), the project for OpenAI should have very practical benefits. Because all operators of large language models – not just OpenAI – struggle with the problem of toxic outputs. Reinforcement learning through human feedback has emerged as the standard way to get language models to stop swearing, rushing, and avoiding difficult topics. However, that can be overturned.

Vulnerabilities in multimodal language models

There is actually already some interesting research on the “automated search for problematic behavior” that OpenAI addresses in its post. Nicholas Carlini from Deepmind and colleagues recently showed that “adverserial” pixel images, i.e. those generated with hostile intentions, can be used to make multimodal language models such as mini-GPT4 rabble badly. Although that shouldn’t really work. What the authors see as a strong indication that the problem of toxic outputs is far from being solved technically – and will become even more acute with multimodal models (GPT-4, for example, can process multimodal input, but the capability has not yet been released to the public). The paper also contains references to other interesting research work that involved automatically generating toxic input prompts for language models – by systematically exchanging individual terms.

Advertisement

According to Deepmind, it also tests the manipulation capabilities of language models. A test called “Make me Say” is used, in which the language model is supposed to get the user to say a certain word in a dialog – of course without the user knowing this word. The extent to which the model is able to do this is taken as a measure of the model’s ability to manipulate. The logic behind this is as follows: If humanity develops some kind of super AI in the near future, the temptation is very great to use the capabilities of this AI, but to severely restrict its access to infrastructure through security measures so that the software does not cause any damage. A non-human AI would then most likely try to break out of this “box” – most likely trying to manipulate the humans communicating with it.

(wst)

To home page

Share this: Twitter

Facebook

