By Bofei Zhang, Jiayi Du, Yixuan Wang, Muyang Jin
Abstract
Delta hedging is an options strategy that uses delta to reduce the risk associated with price movements in the underlying asset, while minimizing trading costs, by taking offsetting long or short positions. We employed Deep Reinforcement Learning (DRL) to address this hedging problem in a realistic setting that includes discrete-time trading and a high level of market friction. First, we implemented a simulation environment with OpenAI Gym that simulates stock movements and option pricing. Second, we used multiple DRL methods, including Deep Q-Learning (DQL), DQL with Preserving Outputs Precisely while Adaptively Rescaling Targets (Pop-Art), and Proximal Policy Optimization (PPO), to build agents that learn how to hedge an option optimally. Third, we evaluated our agents against the baseline delta hedging policy in terms of cumulative reward, volatility, trading cost, and profit and loss (P&L) attribution. We show that PPO performs best among the DRL algorithms tested. Moreover, PPO has a significantly shorter training time and produces a more financially sensible policy than the other DRL methods.
Finance Background
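An option is a form of financial derivative that gives the buyer the right to buy or sell a number of corresponding underlying assets at a set price on a set future date. Our benchmark strategy, delta hedging, is derived from the celebrated Black-Scholes-Merton (BSM) model, under which the price of a European call follows the equation:
$$C(S_t, t) = S_t N(d_1) - K e^{-r(T-t)} N(d_2), \qquad d_1 = \frac{\ln(S_t / K) + (r + \sigma^2 / 2)(T - t)}{\sigma \sqrt{T - t}}, \qquad d_2 = d_1 - \sigma \sqrt{T - t}$$
where $S_t$ is the stock price, $K$ the strike, $r$ the risk-free rate, $\sigma$ the volatility, $T - t$ the time to maturity, and $N$ the standard normal cumulative distribution function. The delta of the call is $N(d_1)$.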
We will solve the problem in a more realistic setting where we perform discrete hedging with transaction cost in a market with friction.
Our environment simulates the stock movement and computes option prices according to BSM. At each time step, the agent observes the state, a three-dimensional vector (stock price, time to maturity, number of stocks held). The agent then receives a reward correlated with the wealth gain over that time step.
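The environment described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the class name `HedgingEnv`, the proportional cost parameter `cost_rate`, the default parameter values, and the choice of the raw one-step wealth change of a short call plus stock hedge as the reward are all assumptions made for the example. It uses only the Python standard library rather than OpenAI Gym, and mimics the `reset`/`step` pattern.

```python
import math
import random


def norm_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def bsm_call(s, k, tau, r, sigma):
    """Return (price, delta) of a European call with time to maturity tau."""
    if tau <= 0.0:
        return max(s - k, 0.0), (1.0 if s > k else 0.0)
    vol = sigma * math.sqrt(tau)
    d1 = (math.log(s / k) + (r + 0.5 * sigma ** 2) * tau) / vol
    d2 = d1 - vol
    price = s * norm_cdf(d1) - k * math.exp(-r * tau) * norm_cdf(d2)
    return price, norm_cdf(d1)


class HedgingEnv:
    """Hypothetical Gym-style environment: short one call, hedge with stock."""

    def __init__(self, s0=100.0, k=100.0, r=0.0, sigma=0.2,
                 maturity=0.25, n_steps=50, cost_rate=0.001, seed=0):
        self.s0, self.k, self.r, self.sigma = s0, k, r, sigma
        self.maturity, self.n_steps, self.cost_rate = maturity, n_steps, cost_rate
        self.rng = random.Random(seed)
        self.reset()

    def _tau(self):
        # Remaining time to maturity; exactly zero at the final step.
        return self.maturity * (1.0 - self.t / self.n_steps)

    def reset(self):
        self.s, self.t, self.holding = self.s0, 0, 0.0
        return (self.s, self._tau(), self.holding)

    def step(self, new_holding):
        dt = self.maturity / self.n_steps
        old_price, _ = bsm_call(self.s, self.k, self._tau(), self.r, self.sigma)
        # Assumed cost model: proportional to the traded notional.
        cost = self.cost_rate * abs(new_holding - self.holding) * self.s
        # One geometric Brownian motion step for the stock.
        z = self.rng.gauss(0.0, 1.0)
        new_s = self.s * math.exp((self.r - 0.5 * self.sigma ** 2) * dt
                                  + self.sigma * math.sqrt(dt) * z)
        self.t += 1
        new_price, _ = bsm_call(new_s, self.k, self._tau(), self.r, self.sigma)
        # Assumed reward: wealth change of the hedged short call, net of cost.
        reward = new_holding * (new_s - self.s) - (new_price - old_price) - cost
        self.s, self.holding = new_s, new_holding
        return (self.s, self._tau(), self.holding), reward, self.t >= self.n_steps


def run_delta_hedge(env):
    """Roll out the baseline policy: hold BSM delta shares at every step."""
    state, total, done = env.reset(), 0.0, False
    while not done:
        s, tau, _ = state
        _, delta = bsm_call(s, env.k, tau, env.r, env.sigma)
        state, reward, done = env.step(delta)
        total += reward
    return total
```

Calling `run_delta_hedge(HedgingEnv(seed=1))` returns the cumulative reward of one baseline episode. A DRL agent would replace the delta rule with its own choice of `new_holding`.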
DQL
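DQL uses an MLP to approximate the state-action value function, i.e. it learns a Q function. The optimal Q function obeys the Bellman equation:
$$Q(s_t, a_t) = \mathbb{E}\left[ r_t + \gamma \max_{a'} Q(s_{t+1}, a') \right]$$
where $r_t$ is the reward at step $t$ and $\gamma$ is the discount factor.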
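Updating the Q function then becomes a problem of minimizing the squared temporal difference error:
$$\delta_t = r_t + \gamma \max_{a'} Q_\theta(s_{t+1}, a') - Q_\theta(s_t, a_t), \qquad \min_\theta \; \mathbb{E}\left[ \delta_t^2 \right]$$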
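As a small illustration of the temporal difference error used in Q-learning, the sketch below computes it for a single transition. The function name `td_error` and the default discount factor are hypothetical choices for this example. A real DQL agent would obtain the Q-values from its MLP rather than from plain lists.

```python
def td_error(q_sa, reward, q_next, gamma=0.99, done=False):
    """One-step TD error for Q-learning.

    q_sa   : current estimate Q(s, a) for the action taken
    q_next : list of estimates Q(s', a') for every action in the next state
    done   : True if s' is terminal, in which case there is no bootstrap term
    """
    target = reward if done else reward + gamma * max(q_next)
    return target - q_sa


# The squared TD error is the per-sample loss that DQL minimizes.
loss = td_error(q_sa=1.0, reward=0.5, q_next=[0.2, 2.0], gamma=0.9) ** 2
```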
Reinforcement Learning Structure
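A policy π is learned by a Multilayer Perceptron (MLP) that chooses the next action to maximize reward. By definition, the state-action value function and the state value function are:
$$Q^{\pi}(s, a) = \mathbb{E}_{\pi}\left[ \sum_{k=0}^{\infty} \gamma^{k} r_{t+k} \,\middle|\, s_t = s, a_t = a \right]$$
$$V^{\pi}(s) = \mathbb{E}_{\pi}\left[ \sum_{k=0}^{\infty} \gamma^{k} r_{t+k} \,\middle|\, s_t = s \right]$$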
DQL + Pop-Art
Pop-Art is a method that adaptively normalizes the state-action value targets used in the learning updates, while preserving the unnormalized outputs of the network.
Algorithm 1 SGD on squared loss with Pop-Art
For a given differentiable function h_θ, initialize θ.
Initialize W = I, b = 0, Σ = I, and µ = 0.
while learning do
    Observe input X and target Y
    Use Y to compute new scale Σ_new and new shift µ_new
    W ← Σ_new^{-1} Σ W, b ← Σ_new^{-1} (Σ b + µ − µ_new)    (rescale W and b)
    Σ ← Σ_new, µ ← µ_new    (update scale and shift)
    h ← h_θ(X)    (store output of h_θ)
    J ← (∇h_{θ,1}(X), …, ∇h_{θ,m}(X))    (compute Jacobian of h_θ)
    δ ← W h + b − Σ^{-1} (Y − µ)    (compute normalized error)
    θ ← θ − α J W^T δ    (compute SGD update for θ)
    W ← W − α δ h^T    (compute SGD update for W)
    b ← b − α δ    (compute SGD update for b)
end while
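To make the rescaling step of Algorithm 1 concrete, here is a scalar sketch in which W, b, Σ, and µ are plain numbers and the feature h is given directly (the update of θ is left out). The class name `PopArtScalar` and the use of exponential moving averages with step size `beta` to track the scale and shift are assumptions for this example, since the algorithm leaves the choice of statistics open.

```python
import math


class PopArtScalar:
    """Scalar Pop-Art output layer: normalized output is w * h + b."""

    def __init__(self, beta=0.01):
        self.w, self.b = 1.0, 0.0
        self.mu, self.nu = 0.0, 1.0  # running first and second moments of targets
        self.beta = beta

    @property
    def sigma(self):
        # Scale derived from the running moments, floored for stability.
        return math.sqrt(max(self.nu - self.mu ** 2, 1e-8))

    def unnormalized(self, h):
        """Prediction on the original target scale."""
        return self.sigma * (self.w * h + self.b) + self.mu

    def update_stats(self, y):
        """Update scale and shift, then rescale w and b to preserve outputs."""
        old_mu, old_sigma = self.mu, self.sigma
        self.mu = (1.0 - self.beta) * self.mu + self.beta * y
        self.nu = (1.0 - self.beta) * self.nu + self.beta * y * y
        new_sigma = self.sigma
        self.w = old_sigma * self.w / new_sigma
        self.b = (old_sigma * self.b + old_mu - self.mu) / new_sigma

    def sgd_step(self, h, y, lr):
        """SGD step on the normalized squared error; returns the error used."""
        delta = self.w * h + self.b - (y - self.mu) / self.sigma
        self.w -= lr * delta * h
        self.b -= lr * delta
        return delta
```

The point of the rescaling is that `unnormalized(h)` returns the same value immediately before and after `update_stats`, even when a target of a very different magnitude arrives. Only the subsequent `sgd_step` changes the prediction.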
We empirically estimate the advantage function with Generalized Advantage Estimation (GAE):
$$\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$$
$$\hat{A}_t = \sum_{l=0}^{\infty} (\gamma \lambda)^{l} \, \delta_{t+l}$$
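For a finite episode, GAE is usually computed with a backward recursion over the one-step TD residuals. The sketch below is one such implementation. The function name `gae` and the default values of `gamma` and `lam` are illustrative assumptions, not the settings used in this project.

```python
def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation for one trajectory.

    rewards : list of rewards r_0 .. r_{T-1}
    values  : list of value estimates V(s_0) .. V(s_T); the last entry is the
              bootstrap value (0.0 if s_T is terminal)
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD residual
        running = delta + gamma * lam * running                 # discounted sum
        advantages[t] = running
    return advantages
```

With `lam=0` the estimate reduces to the one-step TD residual, and with `lam=1` it becomes the discounted return minus the value baseline.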
We integrate the trading cost into the reward, so that each trade is penalized by the cost it incurs.
PPO
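Proximal Policy Optimization improves a clipped surrogate objective via stochastic gradient ascent:
$$L^{CLIP}(\theta) = \mathbb{E}_t\left[ \min\left( r_t(\theta) \hat{A}_t, \; \operatorname{clip}\left(r_t(\theta), 1 - \epsilon, 1 + \epsilon\right) \hat{A}_t \right) \right], \qquad r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)}$$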
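We used GAE to estimate the advantage functions. Adding a value function regression loss and an entropy bonus gives the loss function:
$$L(\theta) = \mathbb{E}_t\left[ -L^{CLIP}_t(\theta) + c_1 L^{VF}_t(\theta) - c_2 \, S[\pi_\theta](s_t) \right]$$
where $L^{CLIP}_t$ is the clipped surrogate term, $S$ is the policy entropy, and $c_1$, $c_2$ are weighting coefficients.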
where the value function loss is given by
$$L^{VF}_t(\theta) = \left( V_\theta(s_t) - V_t^{targ} \right)^2$$
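The PPO loss described above combines three terms, which the following sketch evaluates for a small batch. It is illustrative only: the function name `ppo_loss`, the coefficient names `c1` and `c2`, and all default values are assumptions. A real implementation would compute these quantities on tensors so that gradients can flow to the policy and value networks.

```python
def ppo_loss(ratios, advantages, values, returns, entropies,
             clip_eps=0.2, c1=0.5, c2=0.01):
    """Scalar PPO loss (to be minimized) for a batch of samples.

    ratios     : pi_new(a|s) / pi_old(a|s) for each sample
    advantages : advantage estimates (e.g. from GAE)
    values     : value predictions V(s)
    returns    : value targets
    entropies  : policy entropy at each state
    """
    n = len(ratios)
    surrogate = 0.0
    for r, a in zip(ratios, advantages):
        clipped = min(max(r, 1.0 - clip_eps), 1.0 + clip_eps)
        surrogate += min(r * a, clipped * a)  # pessimistic (clipped) objective
    surrogate /= n
    value_loss = sum((v - g) ** 2 for v, g in zip(values, returns)) / n
    entropy = sum(entropies) / n
    # Maximize surrogate and entropy, minimize value error.
    return -surrogate + c1 * value_loss - c2 * entropy
```

The clipping removes any incentive to push the probability ratio outside the interval [1 − ε, 1 + ε], which is what keeps each policy update close to the previous policy.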
Conclusion and Discussion
Training time and Convergence
- With reward clipping, the reward of both DQL and PPO collapses if training runs too long. DQL with Pop-Art fixes this issue. In general, PPO converges faster than all other methods.
Cumulative Reward
- All DRL agents have a similar policy to baseline delta hedging.
Total Cost and Volatility
Non-cost case (cost multiplier = 0)
- All DRL agents find a better strategy than baseline delta hedging: their average realized volatility is much lower. It remains slightly above zero because, with discrete-time trading, the agents' positions drift somewhat off the hedge between rebalancing times.
High-cost case (cost multiplier = 5)
- Delta hedging trades too much, incurring higher cost than any of the DRL agents.
- All DRL agents realize a much lower average cost while maintaining the hedge, showing their ability to balance trading error against costs.
- Overall, PPO performs better on cost, with a lower average cost (54.87) and standard deviation (12.70) than both DQL and DQL with Pop-Art, at the expense of slightly higher volatility of total P&L. This reflects a more cost-conscious trade-off between costs and trading errors.
P&L
- The mean of the delta policy's P&L is significantly smaller than zero in both the non-cost and high-cost cases.
- All DRL agents outperform delta hedging, as the t-statistics of their P&L are much more often close to zero and insignificant.
- DQL with Pop-Art performs slightly better than the other DRL agents in the high-cost case, as it achieves significantly positive t-statistics of total P&L.
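The t-statistic referred to in these comparisons can be computed from the total P&L of a set of simulated episodes. The sketch below shows one standard way to do so, testing whether the mean P&L differs from zero. The function name `t_statistic` and the sample values are made up for illustration and are not results from this project.

```python
import math


def t_statistic(pnl):
    """One-sample t-statistic of the mean P&L against zero."""
    n = len(pnl)
    mean = sum(pnl) / n
    var = sum((x - mean) ** 2 for x in pnl) / (n - 1)  # unbiased sample variance
    return mean / math.sqrt(var / n)


# Hypothetical per-episode total P&L values.
example_t = t_statistic([1.0, 2.0, 3.0])
```

A t-statistic close to zero means the mean P&L is statistically indistinguishable from zero, while a large negative value indicates that the policy loses money on average.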
Policy
- All agents trade less when costs are imposed, and the number of random actions (individual dots deviating from the piecewise segments) decreases.
- PPO trades more conservatively than both DQL agents. Pop-Art has a visible effect on regular DQL: fewer random actions are made, and the decisions are closer to the benchmark decisions (dashed lines, derived from delta hedging).
Results
Acknowledgements
It is our pleasure to have had our wonderful advisor, Petter N. Kolm, leading and supporting us throughout the project.