[Paper Review] Teaching AI to Handle Exceptions: Supervised Fine-Tuning with Human-Aligned Judgment

2025.06.29 · Web · by Anonymous
#LLM · #Fine-tuning · #AI Ethics · #Decision Making · #Human Alignment

Key Points

  1. This research examines how large language models (LLMs) handle exceptions in decision-making, finding that LLMs tend to adhere strictly to policies, diverging from flexible human judgment.
  2. Comparing ethical-framework prompting, chain-of-thought prompting, and supervised fine-tuning (SFT), the study demonstrates that SFT on human explanations significantly improves LLM alignment with human decision-making.
  3. This SFT method allows LLMs to learn the underlying reasons for decisions, enabling them to generalize to new scenarios and offering valuable insights for developing more reliable AI systems.

This paper investigates how Large Language Models (LLMs) handle exception processing within complex decision-making scenarios, specifically evaluating the alignment of AI judgment with human judgment. The core problem addressed is the observed divergence between LLM and human responses when confronted with exceptions, highlighting a potential reliability issue for AI in real-world applications.

The research methodology centers on comparing three distinct approaches to guide LLM behavior in handling exceptions:

  1. Ethical Framework Prompting: This approach involves prompting LLMs to generate responses by explicitly leveraging moral decision-making principles such as Deontology, Consequentialism, or Virtue Ethics. The goal is to steer the LLM's judgment towards ethically grounded decisions. Technically, this entails designing prompts that integrate these philosophical frameworks, instructing the LLM to consider or articulate its decisions based on these defined ethical guidelines.
  2. Chain-of-Thought (CoT) Prompting: This method encourages the LLM to articulate explicit reasoning steps before arriving at a final decision. By inducing a step-by-step thinking process, the aim is to foster improved judgment and enhance the transparency of the decision-making rationale. From a technical standpoint, CoT prompting involves appending specific instructions to the input prompt, such as "Let's think step by step," compelling the model to output intermediate thought processes that lead to its conclusion.
  3. Supervised Fine-Tuning (SFT): This technique focuses on enhancing LLM performance by fine-tuning the model using a dataset that incorporates not only human decisions but, crucially, human explanations for those decisions. The objective is to teach the model the underlying rationale and "how" decisions are made, rather than merely "what" the decision is. Technically, this SFT process involves training the LLM on data where each instance includes the scenario, the human's decision, and the accompanying human justification or reasoning. This allows the model to learn the nuances of human judgment and generalize based on the provided explanations.
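To make the three approaches concrete, here is a minimal sketch of how each might be constructed. All names (`SCENARIO`, the `build_*` helpers, the prompt wording, and the SFT example format) are illustrative assumptions, not taken from the paper:

```python
# Hypothetical sketch of the three approaches compared in the review.
# The scenario text, helper names, and data format are illustrative.

SCENARIO = (
    "Store policy: no refunds without a receipt. "
    "A customer requests a refund; the item is clearly defective, "
    "but the receipt was lost."
)

def build_ethical_prompt(scenario: str, framework: str) -> str:
    """Ethical-framework prompting: ask the model to decide
    under an explicit moral theory (e.g. Deontology)."""
    return (
        f"Decide the following case using {framework} reasoning.\n"
        f"Case: {scenario}\n"
        "Answer Yes or No, and justify the decision under that framework."
    )

def build_cot_prompt(scenario: str) -> str:
    """Chain-of-thought prompting: elicit intermediate reasoning
    steps before the final answer."""
    return (
        f"Case: {scenario}\n"
        "Let's think step by step before giving a final Yes/No answer."
    )

def build_sft_example(scenario: str, decision: str, explanation: str) -> dict:
    """SFT instance: the training target includes the human
    explanation, not just the Yes/No decision label."""
    return {
        "prompt": f"Case: {scenario}\nShould an exception be granted?",
        "completion": f"{decision}. Reasoning: {explanation}",
    }

example = build_sft_example(
    SCENARIO,
    decision="Yes",
    explanation="The policy's intent is to prevent fraud; a visibly "
                "defective item satisfies that intent without a receipt.",
)
```

The key design difference is visible in `build_sft_example`: because the completion carries the human's reasoning, fine-tuning on such pairs trains the model on *why* an exception is granted, which is what the paper credits for generalization to unseen scenarios.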

For experimental design, researchers generated a diverse set of exception scenarios varying in "exception strength" (or level) and policy regulations, all presented within realistic business contexts for both human participants and LLMs. The evaluation process involved comparing the performance of the LLMs across these three approaches against human judgment using several metrics:

  • Baseline Refusal Rate Measurement: This involved quantifying the LLM's inherent tendency to "refuse" or deviate from the general policy in exception cases, comparing it directly with human refusal rates to understand baseline differences in judgment.
  • Ethical Framework Impact Assessment: The study measured the refusal rate of LLMs when guided by ethical frameworks, assessing whether these principles significantly altered their decision-making behavior.
  • CoT Prompting vs. SFT Evaluation: The efficacy of CoT prompting and SFT in bridging the gap between LLM and human decision-making was evaluated. A key finding was the significant improvement observed with SFT, especially when human explanations were integrated into the fine-tuning data.
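The baseline refusal-rate comparison above can be sketched as follows. The toy data and the `alignment_gap` metric are illustrative assumptions for demonstration, not the paper's actual numbers:

```python
# Hypothetical sketch of the baseline refusal-rate measurement.
# Decision labels and example data are illustrative, not from the paper.

def refusal_rate(decisions: list[str]) -> float:
    """Fraction of exception cases where the general policy is
    enforced, i.e. the exception request is refused."""
    refusals = sum(1 for d in decisions if d == "refuse")
    return refusals / len(decisions)

# Toy data mirroring the reported pattern: the LLM refuses
# exceptions more often than humans on the same scenarios.
llm_decisions   = ["refuse", "refuse", "refuse", "grant"]
human_decisions = ["grant", "refuse", "grant", "grant"]

llm_rate = refusal_rate(llm_decisions)      # 0.75
human_rate = refusal_rate(human_decisions)  # 0.25
alignment_gap = abs(llm_rate - human_rate)  # 0.5
```

A smaller `alignment_gap` after an intervention (ethical-framework prompting, CoT, or SFT) would indicate that the model's exception handling has moved closer to human judgment, which is the comparison the three evaluation metrics above operationalize.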

The principal findings revealed that LLMs generally tend to adhere strictly to policies, which often results in a lack of flexibility compared to human decision-making. While ethical-framework prompting did not yield significant improvements, supervised fine-tuning proved effective in aligning LLMs more closely with human judgment.

This alignment was attributed to the model's ability to learn the rationale behind decisions, moving beyond simple binary classifications ("Yes/No") to understand the underlying reasons. A crucial discovery was that fine-tuning with human explanations, rather than just decision labels, empowered the model to generalize effectively to new and unseen scenarios.

This research provides valuable insights into modeling human thought processes within AI systems and suggests concrete methods for developing more reliable AI decision-making capabilities in real-world environments. Future research directions include exploring real-world applicability and analyzing AI responses in iterative conversational contexts.