AI alignment is a pivotal concept in the development of intelligence artificielle[1] systems. It refers to the process of ensuring that an AI system’s objectives are in harmony with human intentions or shared ethical values. This alignment is essential to mitigate the risk of unintended consequences or harmful side effects that may arise from misaligned AI systems. The challenges in AI alignment include specification gaming, reward hacking, and the potential for power-seeking behaviours. AI alignment also intersects with other critical areas in AI safety such as interpretability, robustness, and fairness. Addressing these challenges is crucial in the ongoing research and development of AI systems, especially as we progress towards creating advanced AI or artificial general intelligence (AGI). Ultimately, the goal of AI alignment is to create AI systems that are not only effective and efficient but also safe and ethically sound.
In the field of intelligence artificielle (AI), Alignement de l'IA research aims to steer AI systems toward a person's or group's intended goals, preferences, and ethical principles. An AI system is considered aligned if it advances its intended objectives. A misaligned AI system may pursue some objectives, but not the intended ones.
It is often challenging for AI designers to align an AI system due to the difficulty of specifying the full range of desired and undesired behaviors. To aid them, they often use simpler proxy goals, such as gaining human approval. But that approach can create loopholes, overlook necessary constraints, or reward the AI system for merely appearing aligned.
Misaligned AI systems can malfunction and cause harm. AI systems may find loopholes that allow them to accomplish their proxy goals efficiently but in unintended, sometimes harmful, ways (reward hacking). They may also develop unwanted instrumental strategies, such as seeking power or survival because such strategies help them achieve their final given goals. Furthermore, they may develop undesirable emergent goals that may be hard to detect before the system is deployed and encounters new situations and data distributions.
Today, these problems affect existing commercial systems such as language models, robots, autonomous vehicles, and social media recommendation engines. Some AI researchers argue that more capable future systems will be more severely affected, since these problems partially result from the systems being highly capable.
Many of the most-cited AI scientists, including Geoffrey Hinton, Yoshua Bengioet Stuart Russell, argue that AI is approaching human-like (AGI) and superhuman cognitive capabilities (ASI) and could endanger human civilization if misaligned.
AI alignment is a subfield of AI safety, the study of how to build safe AI systems. Other subfields of AI safety include robustness, monitoring, and capability control. Research challenges in alignment include instilling complex values in AI, developing honest AI, scalable oversight, auditing and interpreting AI models, and preventing emergent AI behaviors like power-seeking. Alignment research has connections to interpretability research, (adversarial) robustness, anomaly detection, calibrated uncertainty, formal verification, preference learning, safety-critical engineering, game theory, algorithmic fairnesset social sciences.