Machine Learning in Empirical Asset Pricing: A Predictive Approach Based on High-Dimensional Data


🇮🇩 INDONESIAN VERSION

1. General Methodological Framework

This study treats empirical asset pricing as the problem of estimating the conditional expectation of returns, formally stated as:

\mathbb{E}\left(R_{i,t+1} \mid \mathcal{F}_t\right)

where:

  • R_{i,t+1} is the excess return of stock i in period t+1,
  • \mathcal{F}_t = \{X_{i,t}, Z_t\} is the information set available at time t,
  • X_{i,t} captures firm-specific characteristics,
  • Z_t represents aggregate macroeconomic and market conditions.

Unlike classical approaches that assume linearity and a structured set of risk factors, this study adopts a supervised learning framework in which the prediction function f(\cdot) is learned directly from the data:

R_{i,t+1} = f(X_{i,t}, Z_t) + \varepsilon_{i,t+1}

without imposing linearity or additivity ex ante.


2. Data Structure and Predictor Space

2.1 Predictor Vector

The high-dimensional predictor vector is constructed as:

W_{i,t} = \begin{bmatrix} X_{i,t} \\ X_{i,t} \otimes Z_t \\ D_i \end{bmatrix}

where:

  • X_{i,t} \in \mathbb{R}^{94} are the stock characteristics,
  • X_{i,t} \otimes Z_t are interactions between characteristics and aggregate macro variables,
  • D_i are industry dummies.

The total predictor dimension is

\dim(W_{i,t}) > 900,

which places this study in a high-dimensional regression setting.
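To make the construction concrete, the following is a minimal Python sketch of assembling W_{i,t} for a single firm-month. Only the 94 characteristics are stated in the text; the sizes of the macro vector and the industry dummies (8 and 74 here) are illustrative assumptions chosen so that the total exceeds 900.

```python
import numpy as np

def build_predictors(X, Z, D):
    """Stack characteristics, characteristic-macro interactions (X ⊗ Z),
    and industry dummies into one high-dimensional predictor vector."""
    interactions = np.kron(X, Z)              # all pairwise products X_j * Z_q
    return np.concatenate([X, interactions, D])

rng = np.random.default_rng(0)
X = rng.standard_normal(94)                   # firm characteristics X_{i,t}
Z = rng.standard_normal(8)                    # macro variables Z_t (count assumed)
D = np.zeros(74)                              # industry dummies D_i (count assumed)
D[3] = 1.0
W = build_predictors(X, Z, D)
print(W.shape)                                # (920,), i.e. dim(W_{i,t}) > 900
```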


3. Mathematical Model Specification

3.1 Baseline Linear Model (Benchmark)

Ordinary Least Squares (OLS)

R_{i,t+1} = \alpha + W_{i,t}'\beta + \varepsilon_{i,t+1}

OLS serves as the benchmark, but it breaks down out of sample when

\dim(W_{i,t}) \approx N \quad \text{or} \quad \dim(W_{i,t}) > N,

because the variance of the estimator increases sharply.


3.2 Penalized Linear Regression

Elastic Net

The model is estimated by solving

\min_{\beta} \sum_{i,t} \left(R_{i,t+1} - W_{i,t}'\beta\right)^2 + \lambda_1 \|\beta\|_1 + \lambda_2 \|\beta\|_2^2

where:

  • the \ell_1 penalty encourages sparsity,
  • the \ell_2 penalty stabilizes estimation when predictors are highly correlated, as illustrated in the sketch below.
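As a minimal illustration, the elastic net can be fit with scikit-learn. Note that sklearn folds \lambda_1 and \lambda_2 into alpha and l1_ratio; the values below are placeholders that would normally be tuned on a validation period, and the data are synthetic stand-ins for the stacked panel (W_{i,t}, R_{i,t+1}).

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
W = rng.standard_normal((5000, 920))      # stacked panel of predictors (synthetic)
R = rng.standard_normal(5000)             # next-period excess returns (synthetic)

# alpha and l1_ratio jointly play the role of (lambda_1, lambda_2); placeholder values
enet = ElasticNet(alpha=1e-3, l1_ratio=0.5, max_iter=10_000)
enet.fit(W, R)
R_hat = enet.predict(W)
n_selected = np.sum(enet.coef_ != 0)      # sparsity induced by the l1 penalty
```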

3.3 Dimension Reduction

Principal Component Regression (PCR)

  1. Decomposition:

W = U \Sigma V'

  2. Selection of the first K principal components:

\tilde{W} = U_K \Sigma_K

  3. Regression:

R_{t+1} = \tilde{W}_t \gamma + \varepsilon_{t+1}

Partial Least Squares (PLS)

PLS chooses weight vectors \omega_k that maximize

\text{Cov}(W\omega_k, R)^2,

so it targets return predictability rather than the variance of the predictors.
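Both reductions are available in scikit-learn. The sketch below fits PCR as a PCA-plus-regression pipeline and PLS via PLSRegression on synthetic data; the number of components K = 30 is an illustrative assumption that would normally be chosen on a validation sample.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(0)
W = rng.standard_normal((5000, 920))   # predictors (synthetic)
R = rng.standard_normal(5000)          # next-period returns (synthetic)
K = 30                                 # number of components (assumed)

# PCR: compress W into K principal components, then regress R on them
pcr = make_pipeline(StandardScaler(), PCA(n_components=K), LinearRegression())
pcr.fit(W, R)

# PLS: components chosen to maximize covariance with the return target
pls = PLSRegression(n_components=K)
pls.fit(W, R)
```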


3.4 Restricted Nonlinear Models

Generalized Linear Model with Splines

R_{i,t+1} = \alpha + \sum_{j=1}^{p} g_j(W_{i,t,j}) + \varepsilon_{i,t+1}

where:

  • g_j(\cdot) is a nonlinear spline function,
  • a group LASSO penalty is applied to select entire groups of basis functions.

The main limitation is that

\frac{\partial^2 R}{\partial W_j \partial W_k} = 0 \quad \forall j \neq k

(interactions are not modeled).
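A rough sketch of this additive specification expands each predictor into a spline basis and fits a sparse linear model on top. For brevity, a plain \ell_1 penalty stands in for the group LASSO (which would zero out all basis functions of a predictor jointly); that substitution is an assumption of the sketch, not the specification above.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import SplineTransformer
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
W = rng.standard_normal((5000, 50))    # subset of predictors (synthetic, for brevity)
R = rng.standard_normal(5000)          # next-period returns (synthetic)

# Each column W_j is expanded into spline basis functions g_j(W_j); the Lasso then
# selects basis terms. No cross-terms are created, so interactions stay excluded.
spline_glm = make_pipeline(
    SplineTransformer(degree=2, n_knots=4, include_bias=False),
    Lasso(alpha=1e-3, max_iter=10_000),
)
spline_glm.fit(W, R)
```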


3.5 Tree-Based Ensemble Models

Random Forest (RF)

\hat{f}(W) = \frac{1}{B} \sum_{b=1}^{B} T_b(W)

where:

  • T_b(\cdot) is the b-th decision tree,
  • each tree is grown on a bootstrap sample.

RF implicitly captures interactions of the form

W_j \times W_k

through its recursive splitting structure.


Gradient Boosted Trees (GBRT)

The additive model

f_M(W) = \sum_{m=1}^{M} \nu \, h_m(W)

is built sequentially, where each tree h_m approximates the negative gradient of the loss function and \nu is the shrinkage (learning rate).
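Both ensembles can be sketched with scikit-learn. The tree depths, numbers of trees, and learning rate below are placeholder assumptions that would normally be tuned by validation, and the data are again synthetic.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor

rng = np.random.default_rng(0)
W = rng.standard_normal((5000, 100))   # predictors (synthetic)
R = rng.standard_normal(5000)          # next-period returns (synthetic)

# Random forest: average of B trees, each grown on a bootstrap sample with a
# random subset of predictors considered at every split
rf = RandomForestRegressor(n_estimators=300, max_depth=6,
                           max_features="sqrt", n_jobs=-1, random_state=0)
rf.fit(W, R)

# Gradient boosting: shallow trees added one at a time, each fit to the negative
# gradient of the squared-error loss and shrunk by learning_rate (nu)
gbrt = GradientBoostingRegressor(n_estimators=500, learning_rate=0.05,
                                 max_depth=2, random_state=0)
gbrt.fit(W, R)
```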


3.6 Neural Networks

Feedforward Neural Network

\begin{aligned}
h^{(1)} &= \sigma(W^{(1)} W + b^{(1)}) \\
h^{(l)} &= \sigma(W^{(l)} h^{(l-1)} + b^{(l)}) \\
\hat{R}_{i,t+1} &= W^{(L)} h^{(L-1)} + b^{(L)}
\end{aligned}

where:

  • architectures with L = 1, \dots, 5 hidden layers,
  • \sigma(\cdot) is a nonlinear activation function.

Estimation is carried out by solving

\min_{\theta} \sum (R - \hat{R})^2 + \lambda \|\theta\|^2,

combined with early stopping to prevent overfitting.
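As a simple stand-in, scikit-learn's MLPRegressor implements a feedforward network with an \ell_2 weight penalty (alpha) and built-in early stopping on a held-out validation split; the three-layer architecture and hyperparameters below are illustrative assumptions.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
W = rng.standard_normal((5000, 920))   # predictors (synthetic)
R = rng.standard_normal(5000)          # next-period returns (synthetic)

# Three hidden layers with ReLU activations; alpha is the L2 penalty lambda*||theta||^2,
# and early_stopping=True halts training when the validation loss stops improving.
nn = MLPRegressor(hidden_layer_sizes=(64, 32, 16), activation="relu",
                  alpha=1e-4, early_stopping=True, max_iter=500, random_state=0)
nn.fit(W, R)
R_hat = nn.predict(W)
```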


4. Evaluation and Economic Implementation

Models are evaluated on:

  • out-of-sample R^2,
  • the Sharpe ratio of portfolios formed from the predictions,
  • long-short decile strategies and market timing (a computational sketch follows the list).
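The sketch below computes the out-of-sample R^2 against a naive zero forecast (a common convention for excess returns; an assumption here) and the equal-weight return spread between the top and bottom predicted deciles for one cross-section. The Sharpe ratio would then follow from the mean of such spreads over time divided by their standard deviation.

```python
import numpy as np

def oos_r2(r_true, r_pred):
    """Out-of-sample R^2 relative to a naive forecast of zero excess return."""
    return 1.0 - np.sum((r_true - r_pred) ** 2) / np.sum(r_true ** 2)

def decile_long_short(r_true, r_pred):
    """Equal-weight return from buying the top predicted decile and shorting the bottom."""
    lo, hi = np.quantile(r_pred, [0.1, 0.9])
    return r_true[r_pred >= hi].mean() - r_true[r_pred <= lo].mean()

# synthetic one-period illustration
rng = np.random.default_rng(0)
r_true = 0.10 * rng.standard_normal(3000)
r_pred = 0.3 * r_true + 0.05 * rng.standard_normal(3000)
print(oos_r2(r_true, r_pred), decile_long_short(r_true, r_pred))
```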

🇬🇧 ENGLISH VERSION

Methodology and Model Specification

Empirical Asset Pricing via Machine Learning: A High-Dimensional Predictive Framework


1. General Methodological Framework

This study formulates empirical asset pricing as a conditional expectation problem, expressed as:

\mathbb{E}(R_{i,t+1} \mid \mathcal{F}_t)

where:

  • R_{i,t+1} denotes excess stock returns,
  • \mathcal{F}_t = \{X_{i,t}, Z_t\} represents the information set,
  • X_{i,t} are firm-level characteristics,
  • Z_t captures aggregate macroeconomic conditions.

The predictive model is written as:

R_{i,t+1} = f(X_{i,t}, Z_t) + \varepsilon_{i,t+1}

without imposing linearity or additivity ex ante.


2. High-Dimensional Feature Space

The predictor vector is constructed as:

W_{i,t} = \begin{bmatrix} X_{i,t} \\ X_{i,t} \otimes Z_t \\ D_i \end{bmatrix}

resulting in more than 900 predictive signals, placing the analysis in a high-dimensional regression environment.


3. Mathematical Model Specifications

3.1 Linear Benchmark Model

Ordinary Least Squares

R_{i,t+1} = \alpha + W_{i,t}'\beta + \varepsilon_{i,t+1}

OLS serves as a baseline but suffers from poor out-of-sample performance under high dimensionality.


3.2 Penalized Linear Models

Elastic Net

\min_{\beta} \sum (R - W\beta)^2 + \lambda_1 \|\beta\|_1 + \lambda_2 \|\beta\|_2^2

balancing sparsity and stability.


3.3 Dimension Reduction

Principal Component Regression

W = U\Sigma V' \quad \Rightarrow \quad R = U_K \Sigma_K \gamma + \varepsilon

Partial Least Squares

PLS maximizes:

\text{Cov}(W\omega, R)^2

to extract predictive components.


3.4 Restricted Nonlinear Models

Spline-Based GLM

R_{i,t+1} = \alpha + \sum_j g_j(W_{i,t,j}) + \varepsilon

Nonlinear but additive, hence interaction-free.


3.5 Tree-Based Ensembles

Random Forest

\hat{f}(W) = \frac{1}{B} \sum_{b=1}^{B} T_b(W)

Gradient Boosting

f_M(W) = \sum_{m=1}^{M} \nu \, h_m(W)

capturing nonlinear interactions.


3.6 Neural Networks

\hat{R} = f(W; \theta)

with multi-layer nonlinear transformations estimated via regularized least squares and early stopping.


4. Methodological Contribution

The methodology demonstrates that predictive gains in asset pricing arise primarily from nonlinear interactions, not merely from expanding predictor sets.