A threshold – Yudkowsky
The specific claim that there will be a turning point is a crucial one separating this worldview from others. Yudkowsky and those who share this belief tend to think that intelligence — in entities both artificial and biological — has a critical point: call it generalization, or coherence, or reflectivity, or whatever it is that separates humans from chimpanzees. Humans, possessing this ineffable quality, have built civilizations that wildly surpass anything any other species could do. AIs think faster than us, and unlike us they can copy themselves and adjust their own minds. Once they cross that threshold, they'll surpass us fast.
We have one chance
In practice this means we only get one chance to align a superintelligent AI — and if we can't do that in time, one response is to stop AI development entirely.
Open Philanthropy
The bigger problem might be that RLHF, and similar techniques, fundamentally teach AIs to say what we want to hear, not to do what we’d want them to do if we had full context on their decision-making.
we’re not training AI systems to do what we want, but to tell us what we want to hear.
Imagine the Catholic Church circa 1500 trying to train an AI. If the AI correctly reported that the Earth revolves around the Sun, it would be rated more negatively than if it said the opposite.
AIs trained this way will have every incentive to manipulate us, and to hack and falsify the mechanisms we use to monitor them.
It's crucial to detect whether your AI is actually aligned, and to understand what current AIs are capable of.
In practice this means building tools to catch AIs that try to deceive humans. Many of these tools involve figuring out what a model is "really thinking," whether by looking directly at its weights or by verifying certain mathematical properties of its behavior.
Second, the Open Philanthropy worldview isn't premised on the assumption that there will be a "hard takeoff" where AIs rapidly become superintelligent. Instead of a single intelligence switch that can be flipped on or off, they think that AIs will probably get gradually smarter.
They tend to be less concerned with raw intelligence than with the resources and information AIs have access to.
If superintelligent AIs outnumber humans, think faster than humans, and are deeply integrated into every aspect of the economy, an AI takeover seems plausible — even if they never become smarter than we are. This means that decisions about how AIs are deployed also have important implications for safety. The more control humans choose to retain over things like the supply chains that produce microchips, the harder it will be for AI to defeat us.
Optimistic
We'll solve alignment incidentally along the way to building commercially valuable AI systems.