Are More Data Always Better for Factor Analysis?

Methodological Notes on Factor Modeling, Estimation, and Forecasting


1. Conceptual Framework: Approximate Factor Model (AFM)

The paper is built on the approximate factor model, which is widely used in macroeconomics to summarize large datasets using a small number of latent factors.

Model specification

For each variable i = 1, \dots, N and time period t = 1, \dots, T:

X_{it} = \lambda_i^{0\prime} F_t^0 + e_{it}

or equivalently,

X_{it} = \chi_{it} + e_{it}

Interpretation of symbols

  • X_{it}: observed macroeconomic variable i at time t
  • F_t^0: r \times 1 vector of unobserved common factors
  • \lambda_i^0: r \times 1 vector of factor loadings for variable i
  • \chi_{it} = \lambda_i^{0\prime} F_t^0: common component shared across variables
  • e_{it}: idiosyncratic component, specific to variable i

Key modeling assumption

Unlike in a strict factor model, the idiosyncratic errors e_{it}:

  • may be heteroskedastic,
  • may exhibit weak cross-sectional correlation,
  • and may be serially correlated.

This flexibility motivates the term approximate factor model.
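As a concrete illustration, the minimal numpy sketch below simulates an AFM whose errors are heteroskedastic and serially correlated (weak cross-correlation could be added the same way); the dimensions, AR(1) coefficient, and variance range are illustrative choices, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
N, T, r = 100, 200, 2                 # illustrative panel dimensions

Lambda = rng.normal(size=(N, r))      # loadings lambda_i^0
F = rng.normal(size=(T, r))           # factors F_t^0

# Idiosyncratic errors: heteroskedastic across i (variable-specific
# scales) and serially correlated in t (AR(1) dynamics).
sigma = rng.uniform(0.5, 1.5, size=N)
rho = 0.3
shocks = rng.normal(size=(T, N)) * sigma
e = np.zeros((T, N))
e[0] = shocks[0]
for t in range(1, T):
    e[t] = rho * e[t - 1] + shocks[t]

X = F @ Lambda.T + e                  # X_{it} = lambda_i^{0'} F_t^0 + e_{it}
```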


2. Population Covariance Structure and Identification

Let:

  • X_t = (X_{1t}, \dots, X_{Nt})'
  • \Sigma_X = \text{Cov}(X_t)

Then:

\Sigma_X = \Sigma_\chi + \Omega

where:

  • \Sigma_\chi = \Lambda \Sigma_F \Lambda' is the covariance of the common component
  • \Omega = \text{Cov}(e_t) is the idiosyncratic covariance matrix

Identification logic

  • The matrix \Sigma_\chi has rank r
  • Its r nonzero eigenvalues diverge with N
  • Therefore, the first r eigenvectors of \Sigma_X asymptotically span the factor space

This property underpins estimation via Principal Components (PC).
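To see this numerically, a small simulation (with an assumed DGP, not the paper's data) shows the first r sample eigenvalues growing roughly in proportion to N while the (r+1)-th stays an order of magnitude smaller:

```python
import numpy as np

rng = np.random.default_rng(1)
T, r = 500, 2
for N in (50, 100, 200, 400):
    Lambda = rng.normal(size=(N, r))
    F = rng.normal(size=(T, r))
    X = F @ Lambda.T + rng.normal(size=(T, N))       # unit idiosyncratic noise
    eigvals = np.linalg.eigvalsh(X.T @ X / T)[::-1]  # descending order
    print(N, eigvals[:r].round(1), eigvals[r].round(2))
```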


3. Factor Estimation via Principal Components

Sample covariance

Let:

\widehat{\Sigma}_X = \frac{1}{T} \sum_{t=1}^T X_t X_t'

Denote:

  • v_j: eigenvector associated with the j-th largest eigenvalue of \widehat{\Sigma}_X

Estimated factors

\widehat{F}_{t,N}^{(j)} = \sqrt{\frac{1}{N}} \sum_{i=1}^N X_{it} v_{ij}

Stacking the first r components:

\widehat{F}_{t,N} = (\widehat{F}_{t,N}^{(1)}, \dots, \widehat{F}_{t,N}^{(r)})'

Estimated loadings

\widehat{\lambda}_{ij} = \sqrt{N} \, v_{ij}
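Combining the two formulas, a minimal sketch of the PC estimator is below; the function name pc_factors is mine, and the panel is assumed to be demeaned (in practice, standardized) beforehand.

```python
import numpy as np

def pc_factors(X, r):
    """PC estimator for a T x N panel X (assumed demeaned).

    Follows the normalization above:
    F_hat_t = (1/sqrt(N)) * sum_i X_it v_i,  lambda_hat_ij = sqrt(N) * v_ij.
    """
    T, N = X.shape
    Sigma_hat = X.T @ X / T                  # sample covariance, N x N
    _, eigvecs = np.linalg.eigh(Sigma_hat)   # eigenvalues in ascending order
    V = eigvecs[:, ::-1][:, :r]              # first r eigenvectors
    F_hat = X @ V / np.sqrt(N)               # T x r estimated factors
    Lambda_hat = np.sqrt(N) * V              # N x r estimated loadings
    return F_hat, Lambda_hat
```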

Important implication

PC implicitly assumes that:

  • idiosyncratic errors are weakly correlated
  • their covariance does not dominate the signal from common factors

When these assumptions fail, adding more variables may reduce factor quality.


4. Forecasting Framework: Diffusion Index Model

The paper evaluates factor quality using out-of-sample forecasting.

Baseline AR model

\widehat{y}_{t+1|t} = \widehat{\alpha}_0 + \sum_{j=1}^p \widehat{\gamma}_j y_{t-j+1}

Factor-augmented forecast

If true factors were observable:

\widehat{y}_{t+1|t} = \widehat{\beta}_0 + \widehat{\beta}_1' F_t^0 + \sum_{j=1}^p \widehat{\gamma}_j y_{t-j+1}

In practice, factors are replaced by estimates:

\widehat{y}_{t+1|t} = \widehat{\beta}_0 + \widehat{\beta}_1' \widehat{F}_{t,N} + \sum_{j=1}^p \widehat{\gamma}_j y_{t-j+1}
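A sketch of this feasible version as a least-squares regression follows; the function name is hypothetical, F_hat can come from the Section 3 sketch, and the lag ordering matches the equations above.

```python
import numpy as np

def diffusion_index_forecast(y, F_hat, p=1):
    """One-step-ahead forecast of y (1-D array) from estimated factors and p lags."""
    T = len(y)
    rows, target = [], y[p:]                 # targets are y_{t+1}
    for t in range(p - 1, T - 1):
        lags = [y[t - j] for j in range(p)]  # y_t, ..., y_{t-p+1}
        rows.append(np.r_[1.0, F_hat[t], lags])
    Z = np.asarray(rows)
    coef, *_ = np.linalg.lstsq(Z, target, rcond=None)
    z_T = np.r_[1.0, F_hat[-1], [y[-1 - j] for j in range(p)]]
    return z_T @ coef                        # y_hat_{T+1|T}
```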

Interpretation

Forecast accuracy now depends on:

  • the number of variables N,
  • the quality of factor estimation,
  • the structure of idiosyncratic errors.

5. Monte Carlo Design: When More Data Hurt

Data-generating process

X_{it} = \sum_{m=1}^r \lambda_{im} F_{mt} + e_{it}

Target variable:

y_{t+1} = \sum_{m=1}^r \beta_m F_{mt} + \varepsilon_{t+1}

Error structure (key innovation)

Variables are divided into three groups:

  1. Low-noise variables

e_{it} = \sigma_1 u_{it}

  2. High-noise variables

e_{it} = \sigma_2 u_{it}, \quad \sigma_2^2 \gg \sigma_1^2

  3. Cross-correlated variables

e_{it} = \sigma_3 \left( u_{it} + \sum_{j=1}^C \rho_{ij} u_{jt} \right)
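The following sketch generates the three groups; the group sizes, σ values, and the neighbour-based pattern for ρ_ij are illustrative assumptions, not the paper's calibration.

```python
import numpy as np

rng = np.random.default_rng(2)
T, n1, n2, n3, C = 200, 40, 40, 40, 5
u = rng.normal(size=(T, n1 + n2 + n3))       # underlying shocks u_it

sigma1, sigma2, sigma3 = 0.5, 3.0, 1.0       # sigma2^2 >> sigma1^2
e_low = sigma1 * u[:, :n1]                   # group 1: low noise
e_high = sigma2 * u[:, n1:n1 + n2]           # group 2: high noise

# Group 3: each series mixes its own shock with C neighbours' shocks.
u3 = u[:, n1 + n2:]
e_corr = np.empty_like(u3)
for i in range(n3):
    neighbours = [(i + j) % n3 for j in range(1, C + 1)]
    rho_ij = rng.uniform(0.3, 0.7, size=C)   # cross-correlation weights
    e_corr[:, i] = sigma3 * (u3[:, i] + u3[:, neighbours] @ rho_ij)

e = np.hstack([e_low, e_high, e_corr])       # full T x N error panel
```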

Key insight

Adding variables from groups (2) or (3):

  • inflates idiosyncratic covariance,
  • distorts eigenstructure,
  • and reduces forecasting performance.

6. Oversampling and Factor Dominance

When one factor is over-represented in the dataset:

  • it dominates the principal components,
  • other relevant factors become poorly estimated,
  • forecasts targeting those factors deteriorate.

This phenomenon is labeled oversampling bias.
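A stylized way to reproduce this in simulation: add an increasingly over-represented block that loads only on factor 1 and shares a block-level error shock, then measure how well the first two estimated PCs still track factor 2. All design choices here are illustrative, not the paper's experiment.

```python
import numpy as np

rng = np.random.default_rng(3)
T = 300
F = rng.normal(size=(T, 2))                    # two true factors

def r2_factor2(n_dup):
    """R^2 of true factor 2 on the first two estimated PCs."""
    Lam_bal = rng.normal(size=(40, 2))         # balanced block: both factors
    X_bal = F @ Lam_bal.T + rng.normal(size=(T, 40))
    b = rng.normal(size=(T, 1))                # shared block-level shock
    X_dup = F[:, :1] + b + rng.normal(size=(T, n_dup))  # factor-1-only block
    X = np.hstack([X_bal, X_dup])
    # same PC estimator as in Section 3
    V = np.linalg.eigh(X.T @ X / T)[1][:, ::-1][:, :2]
    F_hat = X @ V / np.sqrt(X.shape[1])
    beta, *_ = np.linalg.lstsq(F_hat, F[:, 1], rcond=None)
    resid = F[:, 1] - F_hat @ beta
    return 1 - resid.var() / F[:, 1].var()

for n_dup in (10, 100, 400):
    print(n_dup, round(r2_factor2(n_dup), 3))
```

In this stylized setup, as the block grows the leading eigenvectors increasingly span factor 1 and the block shock, and the R² for factor 2 collapses.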


7. Weighted Principal Components

To address these issues, the paper proposes weighted PC estimation.

Standard PC objective

V(k) = \frac{1}{NT} \sum_{i=1}^N \sum_{t=1}^T e_{it}^2

Weighted objective

W(k) = \frac{1}{NT} \sum_{i=1}^N w_{iT} \sum_{t=1}^T e_{it}^2

where:

  • w_{iT}: weight reflecting the informativeness or contamination of series i
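One way to operationalize this, under the assumption that minimizing the weighted objective amounts to running ordinary PC on a panel whose series are scaled by \sqrt{w_i}: a two-step sketch with inverse-residual-variance weights (the first of the practical rules below); the paper's own weighting schemes may differ.

```python
import numpy as np

def _pc(X, r):
    """Ordinary PC estimator from Section 3 for a T x N panel."""
    T, N = X.shape
    V = np.linalg.eigh(X.T @ X / T)[1][:, ::-1][:, :r]
    return X @ V / np.sqrt(N), np.sqrt(N) * V

def weighted_pc(X, r):
    """Two-step weighted PC with inverse-residual-variance weights."""
    F0, L0 = _pc(X, r)
    resid = X - F0 @ L0.T                  # e_hat from a preliminary PC fit
    w = 1.0 / resid.var(axis=0)            # w_i: down-weight noisy series
    Xw = X * np.sqrt(w)                    # scale series i by sqrt(w_i)
    Fw, Lw = _pc(Xw, r)                    # ordinary PC on the scaled panel
    return Fw, Lw / np.sqrt(w)[:, None]    # undo the scaling in the loadings
```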

Practical rules

  • down-weight variables with high residual variance,
  • drop variables with extreme residual correlations,
  • group variables by economic blocks (real, nominal, volatile).

Empirically, smaller, cleaner datasets often outperform large noisy panels.


8. Methodological Takeaways

This paper establishes that:

  1. Factor estimation quality depends on the structure of the data, not merely its size.
  2. Adding variables may:
    • weaken common components,
    • amplify correlated noise,
    • bias factor space estimation.
  3. Intelligent selection and weighting dominate brute-force data expansion.

More data are not always better data.