Reinforcement Learning - Bandit Algorithms (Slot Machine Probability Simulation, Bull Market Stock Selection)
Key Points
- This paper introduces reinforcement learning concepts and the multi-armed bandit problem, highlighting the critical balance between exploration and exploitation in decision-making.
- It demonstrates methods for estimating slot probabilities using recursive Q-value updates and epsilon-greedy policies, comparing their performance in both static and dynamically changing environments.
- The principles of Q-learning are then applied to a real-world financial scenario, identifying KOSPI stocks with a high probability of significant price increases based on a learned Q-value threshold.
The paper explores the application of reinforcement learning (RL) and multi-armed bandit (MAB) algorithms, primarily focusing on the epsilon-greedy strategy, for problems ranging from simulated slot machine optimization to identifying potentially rising stocks in the KOSPI market.
The core methodology begins with an introduction to Reinforcement Learning, defining an agent that interacts with an environment, selects actions, receives rewards, and learns an optimal policy to maximize cumulative rewards. The Multi-Armed Bandit (MAB) problem is presented as a fundamental RL scenario, where an agent must choose among several slot machines, each with a different, unknown reward probability, to maximize total reward over time. This introduces the critical dilemma of exploration (trying new, potentially better options) versus exploitation (choosing the currently known best option).
Two primary methods for estimating slot machine probabilities (Q-values) are discussed:
- Experimental Probability Estimation (Sample Average): This involves repeatedly playing a slot machine and calculating the empirical average of rewards. For a sequence of rewards $R_1, R_2, \dots, R_n$, the estimated probability is $Q_n = \frac{R_1 + R_2 + \dots + R_n}{n}$.
- Recursive Update (Incremental Sample Average Update): This method updates the estimated Q-value incrementally with each new reward. If $Q_{n-1}$ is the estimated mean after $n-1$ trials and $R_n$ is the reward from the $n$-th trial, the new estimate is updated as:

$$Q_n = Q_{n-1} + \frac{1}{n}\left(R_n - Q_{n-1}\right)$$
This method is more computationally efficient as it does not require storing all past rewards. It is also noted to be a form of Exponential Moving Average (EMA) when considering a constant learning rate.
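The incremental rule above can be sketched in a few lines; this is a minimal illustration of the update, not the paper's exact code:

```python
# Incremental sample-average update: equivalent to the full empirical mean,
# but needs only the running estimate and a trial counter.
def update_q(q: float, n: int, reward: float) -> float:
    """Return the updated estimate after the n-th reward (n >= 1)."""
    return q + (reward - q) / n

# The running estimate matches the batch average of the rewards seen so far.
rewards = [1, 0, 0, 1, 1]
q = 0.0
for n, r in enumerate(rewards, start=1):
    q = update_q(q, n, r)
print(q)  # ≈ 0.6, same as sum(rewards) / len(rewards)
```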
The paper then details the implementation of an epsilon-greedy agent for the MAB problem. A Game class simulates slot machines, each having a randomly assigned success rate. An Agent class embodies the epsilon-greedy policy:
- With probability $\epsilon$ (exploration), the agent chooses a random action (slot machine).
- With probability $1 - \epsilon$ (exploitation), the agent chooses the action (slot machine) that currently has the highest estimated Q-value, i.e., $a = \operatorname{argmax}_a Q(a)$.
The update method uses the recursive update rule mentioned above, Qs[action] += (reward - Qs[action]) / Ns[action], where Ns[action] is the count of times the action has been selected. The agent's performance is evaluated over multiple runs and steps, calculating total_reward and rates (average reward per step). Comparisons are made by varying the value of $\epsilon$ (e.g., 0.01, 0.1, 0.3) to demonstrate the trade-off between exploration and exploitation.

A more advanced algorithm with a constant learning rate (alpha-constant update) is then introduced to handle environments where slot probabilities change dynamically. A Game2 class is used, in which self.rates (the slot success probabilities) are perturbed with small random noise after each play. An Agent2 class is introduced with a self.alpha attribute (the learning rate). Its update method implements the Q-learning-style update rule:

$$Q \leftarrow Q + \alpha \left(R - Q\right)$$
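The epsilon-greedy agent and the stationary slot-machine environment described above can be sketched as follows. Class and attribute names (Game, Agent, Qs, Ns, total_reward) follow the paper's description, but the bodies are an assumed minimal implementation:

```python
import random

class Game:
    """Stationary bandit: each slot wins with a fixed, hidden probability."""
    def __init__(self, n_slots=10, seed=0):
        rng = random.Random(seed)
        self.rates = [rng.random() for _ in range(n_slots)]

    def play(self, action):
        # Reward is 1 with the slot's hidden success rate, else 0.
        return 1 if random.random() < self.rates[action] else 0

class Agent:
    """Epsilon-greedy agent with sample-average (1/N) Q-value updates."""
    def __init__(self, epsilon, n_actions=10):
        self.epsilon = epsilon
        self.Qs = [0.0] * n_actions
        self.Ns = [0] * n_actions

    def get_action(self):
        if random.random() < self.epsilon:  # explore: random slot
            return random.randrange(len(self.Qs))
        # exploit: slot with the highest current estimate
        return max(range(len(self.Qs)), key=lambda a: self.Qs[a])

    def update(self, action, reward):
        self.Ns[action] += 1
        self.Qs[action] += (reward - self.Qs[action]) / self.Ns[action]

random.seed(0)
game, agent = Game(), Agent(epsilon=0.1)
total_reward = 0
for step in range(1000):
    a = agent.get_action()
    r = game.play(a)
    agent.update(a, r)
    total_reward += r
print(total_reward / 1000)  # average reward per step ("rates")
```

Re-running this loop with epsilon set to 0.01, 0.1, and 0.3 reproduces the exploration-exploitation comparison described above.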
This formula adjusts the Q-value by the difference between the observed reward $R$ and the current estimate $Q$, scaled by the constant learning rate $\alpha$. Because the step size does not shrink over time, recent rewards carry more weight, which makes the approach suitable for non-stationary environments. The paper compares the performance of the "sample average" agent and the "alpha constant update" agent.
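The comparison can be sketched with a single helper that switches between the 1/N step size and a constant alpha. The drift model (Gaussian noise scaled by 0.1) is an assumption for illustration; the paper's exact noise is not reproduced here:

```python
import random

def run(alpha=None, epsilon=0.1, n_slots=10, steps=10_000, seed=1):
    """Epsilon-greedy on a drifting bandit.
    alpha=None -> sample-average (1/N) update; otherwise constant step size."""
    rng = random.Random(seed)
    rates = [rng.random() for _ in range(n_slots)]
    Qs, Ns, total = [0.0] * n_slots, [0] * n_slots, 0
    for _ in range(steps):
        a = (rng.randrange(n_slots) if rng.random() < epsilon
             else max(range(n_slots), key=lambda i: Qs[i]))
        r = 1 if rng.random() < rates[a] else 0
        Ns[a] += 1
        step = alpha if alpha is not None else 1 / Ns[a]
        Qs[a] += step * (r - Qs[a])  # Q <- Q + step * (R - Q)
        # Non-stationary environment: drift every slot's success rate
        # (assumed Gaussian noise, clipped to [0, 1]).
        rates = [min(1.0, max(0.0, p + 0.1 * rng.gauss(0, 1))) for p in rates]
        total += r
    return total / steps

print(run(alpha=None), run(alpha=0.8))
```

Because the 1/N step size shrinks toward zero, the sample-average agent eventually stops adapting, so the constant-alpha agent typically earns a higher average reward once the rates drift.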
Finally, the paper applies the Q-learning concept to identify potentially rising stocks in the KOSPI market. For each KOSPI stock, daily closing prices are analyzed. A reward is defined as 1 if the next day's closing price increases by 5% or more compared to today's, and 0 otherwise. A Q-value is maintained for each stock, representing its estimated "value" or potential for positive returns. The Q-value for each stock is updated daily using the Q-learning rule:
$$Q \leftarrow Q + \alpha \left(R - Q\right)$$

where $\alpha$ is a constant learning rate and $R$ is the day's reward. A specific condition, $Q > 0.9$, is used to filter for high-potential stocks. Only when a stock's Q-value exceeds 0.9 (indicating a high expected reward) are its subsequent rewards counted toward a "reward rate" (count / total_count), which tracks the frequency of positive returns among high-Q stocks. Stocks whose final Q-value exceeds 0.9 are then identified as "rising stocks."
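The stock filter described above can be sketched as follows. The price series is synthetic, and alpha=0.1 is an assumed learning rate (the paper's exact value is not given in this extract):

```python
def rising_stock_scan(closes, alpha=0.1, threshold=0.9):
    """closes: list of daily closing prices for one stock.
    Returns (final_q, reward_rate). reward_rate counts how often the
    next-day +5% event occurred on days where Q already exceeded threshold."""
    q, count, total_count = 0.0, 0, 0
    for today, tomorrow in zip(closes, closes[1:]):
        # Reward is 1 if tomorrow's close is >= 5% above today's, else 0.
        reward = 1 if tomorrow >= today * 1.05 else 0
        if q > threshold:          # only high-Q days feed the reward rate
            total_count += 1
            count += reward
        q += alpha * (reward - q)  # Q <- Q + alpha * (R - Q)
    rate = count / total_count if total_count else 0.0
    return q, rate

# A stock that gains 6% every day drives Q toward 1 and is flagged as rising.
prices = [100 * 1.06 ** t for t in range(60)]
q, rate = rising_stock_scan(prices)
print(q > 0.9, rate)  # True 1.0
```

In the paper's setting this scan would run once per KOSPI ticker over its daily closes, and tickers whose final Q-value exceeds 0.9 would be reported as rising stocks.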