Are More Data Always Better for Factor Analysis?

Methodological Notes on Factor Modeling, Estimation, and Forecasting


1. Conceptual Framework: Approximate Factor Model (AFM)

The paper is built on the approximate factor model, which is widely used in macroeconomics to summarize large datasets using a small number of latent factors.

Model specification

For each variable i = 1, \dots, N and time period t = 1, \dots, T:

X_{it} = \lambda_i^{0\prime} F_t^0 + e_{it}

or equivalently,

X_{it} = \chi_{it} + e_{it}

Interpretation of symbols

  • X_{it}: observed macroeconomic variable i at time t
  • F_t^0: r \times 1 vector of unobserved common factors
  • \lambda_i^0: r \times 1 vector of factor loadings for variable i
  • \chi_{it} = \lambda_i^{0\prime} F_t^0: common component shared across variables
  • e_{it}: idiosyncratic component, specific to variable i

Key modeling assumption

Unlike in a strict factor model, the idiosyncratic errors e_{it}:

  • may be heteroskedastic,
  • may exhibit weak cross-sectional correlation,
  • and may be serially correlated.

This flexibility motivates the term approximate factor model.
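As a concrete illustration, the minimal numpy sketch below simulates an AFM whose errors are heteroskedastic and serially correlated (weak cross-correlation could be added the same way); the dimensions, AR(1) coefficient, and variance range are illustrative choices, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
N, T, r = 100, 200, 2                 # illustrative panel dimensions

Lambda = rng.normal(size=(N, r))      # loadings lambda_i^0
F = rng.normal(size=(T, r))           # factors F_t^0

# Idiosyncratic errors: heteroskedastic across i (variable-specific
# scales) and serially correlated in t (AR(1) dynamics).
sigma = rng.uniform(0.5, 1.5, size=N)
rho = 0.3
shocks = rng.normal(size=(T, N)) * sigma
e = np.zeros((T, N))
e[0] = shocks[0]
for t in range(1, T):
    e[t] = rho * e[t - 1] + shocks[t]

X = F @ Lambda.T + e                  # X_{it} = lambda_i^{0'} F_t^0 + e_{it}
```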


2. Population Covariance Structure and Identification

Let:

  • X_t = (X_{1t}, \dots, X_{Nt})'
  • \Sigma_X = \text{Cov}(X_t)

Then:

\Sigma_X = \Sigma_\chi + \Omega

where:

  • \Sigma_\chi = \Lambda \Sigma_F \Lambda' is the covariance of the common component
  • \Omega = \text{Cov}(e_t) is the idiosyncratic covariance matrix

Identification logic

  • The matrix \Sigma_\chi has rank r
  • Its r nonzero eigenvalues diverge with N
  • Therefore, the first r eigenvectors of \Sigma_X asymptotically span the factor space

This property underpins estimation via Principal Components (PC).
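To see this numerically, a small simulation (with an assumed DGP, not the paper's data) shows the first r sample eigenvalues growing roughly in proportion to N while the (r+1)-th stays an order of magnitude smaller:

```python
import numpy as np

rng = np.random.default_rng(1)
T, r = 500, 2
for N in (50, 100, 200, 400):
    Lambda = rng.normal(size=(N, r))
    F = rng.normal(size=(T, r))
    X = F @ Lambda.T + rng.normal(size=(T, N))       # unit idiosyncratic noise
    eigvals = np.linalg.eigvalsh(X.T @ X / T)[::-1]  # descending order
    print(N, eigvals[:r].round(1), eigvals[r].round(2))
```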


3. Factor Estimation via Principal Components

Sample covariance

Let:

\widehat{\Sigma}_X = \frac{1}{T} \sum_{t=1}^T X_t X_t'

Denote:

  • v_j: eigenvector associated with the j-th largest eigenvalue of \widehat{\Sigma}_X

Estimated factors

\widehat{F}_{t,N}^{(j)} = \sqrt{\frac{1}{N}} \sum_{i=1}^N X_{it} v_{ij}

Stacking the first r components:

\widehat{F}_{t,N} = (\widehat{F}_{t,N}^{(1)}, \dots, \widehat{F}_{t,N}^{(r)})'

Estimated loadings

\widehat{\lambda}_{ij} = \sqrt{N} \, v_{ij}
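Combining the two formulas, a minimal sketch of the PC estimator is below; the function name pc_factors is mine, and the panel is assumed to be demeaned (in practice, standardized) beforehand.

```python
import numpy as np

def pc_factors(X, r):
    """PC estimator for a T x N panel X (assumed demeaned).

    Follows the normalization above:
    F_hat_t = (1/sqrt(N)) * sum_i X_it v_i,  lambda_hat_ij = sqrt(N) * v_ij.
    """
    T, N = X.shape
    Sigma_hat = X.T @ X / T                  # sample covariance, N x N
    _, eigvecs = np.linalg.eigh(Sigma_hat)   # eigenvalues in ascending order
    V = eigvecs[:, ::-1][:, :r]              # first r eigenvectors
    F_hat = X @ V / np.sqrt(N)               # T x r estimated factors
    Lambda_hat = np.sqrt(N) * V              # N x r estimated loadings
    return F_hat, Lambda_hat
```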

Important implication

PC implicitly assumes that:

  • idiosyncratic errors are weakly correlated
  • their covariance does not dominate the signal from common factors

When these assumptions fail, adding more variables may reduce factor quality.


4. Forecasting Framework: Diffusion Index Model

The paper evaluates factor quality using out-of-sample forecasting.

Baseline AR model

\widehat{y}_{t+1|t} = \widehat{\alpha}_0 + \sum_{j=1}^p \widehat{\gamma}_j y_{t-j+1}

Factor-augmented forecast

If true factors were observable:

\widehat{y}_{t+1|t} = \widehat{\beta}_0 + \widehat{\beta}_1' F_t^0 + \sum_{j=1}^p \widehat{\gamma}_j y_{t-j+1}

In practice, factors are replaced by estimates:

\widehat{y}_{t+1|t} = \widehat{\beta}_0 + \widehat{\beta}_1' \widehat{F}_{t,N} + \sum_{j=1}^p \widehat{\gamma}_j y_{t-j+1}
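A sketch of this feasible version as a least-squares regression follows; the function name is hypothetical, F_hat can come from the Section 3 sketch, and the lag ordering matches the equations above.

```python
import numpy as np

def diffusion_index_forecast(y, F_hat, p=1):
    """One-step-ahead forecast of y (1-D array) from estimated factors and p lags."""
    T = len(y)
    rows, target = [], y[p:]                 # targets are y_{t+1}
    for t in range(p - 1, T - 1):
        lags = [y[t - j] for j in range(p)]  # y_t, ..., y_{t-p+1}
        rows.append(np.r_[1.0, F_hat[t], lags])
    Z = np.asarray(rows)
    coef, *_ = np.linalg.lstsq(Z, target, rcond=None)
    z_T = np.r_[1.0, F_hat[-1], [y[-1 - j] for j in range(p)]]
    return z_T @ coef                        # y_hat_{T+1|T}
```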

Interpretation

Forecast accuracy now depends on:

  • the number of variables N,
  • the quality of factor estimation,
  • the structure of idiosyncratic errors.

5. Monte Carlo Design: When More Data Hurt

Data-generating process

X_{it} = \sum_{m=1}^r \lambda_{im} F_{mt} + e_{it}

Target variable:

y_{t+1} = \sum_{m=1}^r \beta_m F_{mt} + \varepsilon_{t+1}

Error structure (key innovation)

Variables are divided into three groups:

  1. Low-noise variables

e_{it} = \sigma_1 u_{it}

  2. High-noise variables

e_{it} = \sigma_2 u_{it}, \quad \sigma_2^2 \gg \sigma_1^2

  3. Cross-correlated variables

e_{it} = \sigma_3 \left( u_{it} + \sum_{j=1}^C \rho_{ij} u_{jt} \right)
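The following sketch generates the three groups; the group sizes, σ values, and the neighbour-based pattern for ρ_ij are illustrative assumptions, not the paper's calibration.

```python
import numpy as np

rng = np.random.default_rng(2)
T, n1, n2, n3, C = 200, 40, 40, 40, 5
u = rng.normal(size=(T, n1 + n2 + n3))       # underlying shocks u_it

sigma1, sigma2, sigma3 = 0.5, 3.0, 1.0       # sigma2^2 >> sigma1^2
e_low = sigma1 * u[:, :n1]                   # group 1: low noise
e_high = sigma2 * u[:, n1:n1 + n2]           # group 2: high noise

# Group 3: each series mixes its own shock with C neighbours' shocks.
u3 = u[:, n1 + n2:]
e_corr = np.empty_like(u3)
for i in range(n3):
    neighbours = [(i + j) % n3 for j in range(1, C + 1)]
    rho_ij = rng.uniform(0.3, 0.7, size=C)   # cross-correlation weights
    e_corr[:, i] = sigma3 * (u3[:, i] + u3[:, neighbours] @ rho_ij)

e = np.hstack([e_low, e_high, e_corr])       # full T x N error panel
```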

Key insight

Adding variables from groups (2) or (3):

  • inflates idiosyncratic covariance,
  • distorts eigenstructure,
  • and reduces forecasting performance.

6. Oversampling and Factor Dominance

When one factor is over-represented in the dataset:

  • it dominates the principal components,
  • other relevant factors become poorly estimated,
  • forecasts targeting those factors deteriorate.

This phenomenon is labeled oversampling bias.
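A stylized way to reproduce this in simulation: add an increasingly over-represented block that loads only on factor 1 and shares a block-level error shock, then measure how well the first two estimated PCs still track factor 2. All design choices here are illustrative, not the paper's experiment.

```python
import numpy as np

rng = np.random.default_rng(3)
T = 300
F = rng.normal(size=(T, 2))                    # two true factors

def r2_factor2(n_dup):
    """R^2 of true factor 2 on the first two estimated PCs."""
    Lam_bal = rng.normal(size=(40, 2))         # balanced block: both factors
    X_bal = F @ Lam_bal.T + rng.normal(size=(T, 40))
    b = rng.normal(size=(T, 1))                # shared block-level shock
    X_dup = F[:, :1] + b + rng.normal(size=(T, n_dup))  # factor-1-only block
    X = np.hstack([X_bal, X_dup])
    # same PC estimator as in Section 3
    V = np.linalg.eigh(X.T @ X / T)[1][:, ::-1][:, :2]
    F_hat = X @ V / np.sqrt(X.shape[1])
    beta, *_ = np.linalg.lstsq(F_hat, F[:, 1], rcond=None)
    resid = F[:, 1] - F_hat @ beta
    return 1 - resid.var() / F[:, 1].var()

for n_dup in (10, 100, 400):
    print(n_dup, round(r2_factor2(n_dup), 3))
```

In this stylized setup, as the block grows the leading eigenvectors increasingly span factor 1 and the block shock, and the R² for factor 2 collapses.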


7. Weighted Principal Components

To address these issues, the paper proposes weighted PC estimation.

Standard PC objective

V(k) = \frac{1}{NT} \sum_{i=1}^N \sum_{t=1}^T e_{it}^2

Weighted objective

W(k) = \frac{1}{NT} \sum_{i=1}^N w_{iT} \sum_{t=1}^T e_{it}^2

where:

  • w_{iT}: weight reflecting the informativeness or contamination of series i
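One way to operationalize this, under the assumption that minimizing the weighted objective amounts to running ordinary PC on a panel whose series are scaled by \sqrt{w_i}: a two-step sketch with inverse-residual-variance weights (the first of the practical rules below); the paper's own weighting schemes may differ.

```python
import numpy as np

def _pc(X, r):
    """Ordinary PC estimator from Section 3 for a T x N panel."""
    T, N = X.shape
    V = np.linalg.eigh(X.T @ X / T)[1][:, ::-1][:, :r]
    return X @ V / np.sqrt(N), np.sqrt(N) * V

def weighted_pc(X, r):
    """Two-step weighted PC with inverse-residual-variance weights."""
    F0, L0 = _pc(X, r)
    resid = X - F0 @ L0.T                  # e_hat from a preliminary PC fit
    w = 1.0 / resid.var(axis=0)            # w_i: down-weight noisy series
    Xw = X * np.sqrt(w)                    # scale series i by sqrt(w_i)
    Fw, Lw = _pc(Xw, r)                    # ordinary PC on the scaled panel
    return Fw, Lw / np.sqrt(w)[:, None]    # undo the scaling in the loadings
```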

Practical rules

  • down-weight variables with high residual variance,
  • drop variables with extreme residual correlations,
  • group variables by economic blocks (real, nominal, volatile).

Empirically, smaller, cleaner datasets often outperform large noisy panels.


8. Methodological Takeaways

This paper establishes that:

  1. Factor estimation quality depends on the structure of the data, not merely its size.
  2. Adding variables may:
    • weaken common components,
    • amplify correlated noise,
    • bias factor space estimation.
  3. Intelligent selection and weighting dominate brute-force data expansion.

More data are not always better data.