One of the core tasks in quantitative investing is mining alpha factors: finding signals that predict asset returns. The traditional approach relies on researchers manually constructing factor expressions, or on automated search methods like Genetic Programming (GP) that brute-force combinations in the operator space. The former depends on human experience and intuition: low efficiency but high interpretability. The latter is efficient but produces deeply nested operator expressions that are nearly impossible for researchers to interpret.

AlphaGPT (paper) brings large language models into the factor mining pipeline, using an LLM as the factor “generator.” The follow-up work, AlphaGPT 2.0 (paper), further introduces a human-in-the-loop closed cycle.

The Problem with Traditional Factor Mining

A quantitative factor is essentially a mathematical expression. It takes market data as input (prices, volumes, financial metrics) and outputs a numerical score for ranking stocks or constructing portfolios. Classic examples: a momentum factor is simply the return over the past N days; a value factor might be the inverse of the P/E ratio.
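As a sketch, assuming daily panels indexed by date with one column per ticker (function names are illustrative), these two classic factors might look like:

```python
import pandas as pd

def momentum(close: pd.DataFrame, n: int = 20) -> pd.DataFrame:
    """N-day momentum: total return over the past n rows (dates x tickers)."""
    return close / close.shift(n) - 1

def inverse_pe(pe: pd.DataFrame) -> pd.DataFrame:
    """Value proxy: earnings yield, the inverse of the P/E ratio."""
    return 1.0 / pe
```

Each function maps a raw data panel to a score panel of the same shape, which is the common interface every factor in the pipeline shares.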

Simple factors have been thoroughly mined. Finding new effective factors typically requires more complex expressions — combining multiple base operators (rank, std, corr, delay, etc.). The search space grows exponentially, making manual exploration impractical.

Genetic Programming can search this space automatically, but faces two problems. First, search efficiency is low — massive computational resources are spent evaluating meaningless expressions. Second, the output factors lack interpretability. An expression like rank(corr(delay(close, 5), volume, 10)) might backtest well, but researchers can’t articulate the economic logic behind it. Factors without economic justification carry high overfitting risk.
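To make the opacity concrete, here is a sketch of what such an expression computes, with plausible (assumed) pandas implementations of the base operators over date-by-ticker panels; actual factor DSLs may define them differently:

```python
import pandas as pd

# Assumed semantics for the base operators mentioned above.
def delay(x: pd.DataFrame, d: int) -> pd.DataFrame:
    return x.shift(d)                      # value from d periods ago

def corr(x: pd.DataFrame, y: pd.DataFrame, w: int) -> pd.DataFrame:
    return x.rolling(w).corr(y)            # rolling column-wise correlation

def rank(x: pd.DataFrame) -> pd.DataFrame:
    return x.rank(axis=1, pct=True)        # cross-sectional percentile rank

def gp_factor(close: pd.DataFrame, volume: pd.DataFrame) -> pd.DataFrame:
    # The expression from the text: rank(corr(delay(close, 5), volume, 10))
    return rank(corr(delay(close, 5), volume, 10))
```

Mechanically it ranks stocks by how their 5-day-lagged price co-moves with volume over a 10-day window; whether that co-movement has any economic meaning is exactly the question GP cannot answer.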

AlphaGPT’s Core Design

AlphaGPT replaces or augments the factor generation step in the GP pipeline with an LLM. The process starts by describing the factor mining task to the LLM in natural language — available operators, data fields, and syntax rules for factor expressions. This effectively gives the LLM a “factor DSL” specification.
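A minimal sketch of such a specification prompt; the operator list, field list, and wording are illustrative assumptions, not taken from the paper:

```python
# Hypothetical "factor DSL" specification handed to the LLM.
OPERATORS = {
    "rank(x)": "cross-sectional percentile rank",
    "std(x, w)": "rolling standard deviation over w days",
    "corr(x, y, w)": "rolling correlation over w days",
    "delay(x, d)": "value of x from d days ago",
}
FIELDS = ["open", "high", "low", "close", "volume", "pe_ratio"]

def build_prompt(operators: dict, fields: list) -> str:
    op_lines = "\n".join(f"- {sig}: {desc}" for sig, desc in operators.items())
    return (
        "You generate alpha factor expressions for stock ranking.\n"
        f"Allowed operators:\n{op_lines}\n"
        f"Allowed data fields: {', '.join(fields)}\n"
        "Output one expression per line using only these operators and fields, "
        "and briefly state the economic rationale for each."
    )
```

Asking for the rationale alongside each expression is what later lets humans filter candidates by economic logic rather than by backtest numbers alone.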

The LLM then generates candidate factor expressions. It has seen vast amounts of financial literature and quantitative code during pretraining, giving it some intuition about what factors might work. Its output tends to be more structured than random search, and more interpretable.

Generated factors are backtested to compute metrics like IC (Information Coefficient — correlation between factor values and future returns) and IR (Information Ratio). The backtest results are fed back to the LLM, allowing it to adjust direction in the next generation round — creating a generate-evaluate-feedback iterative loop.
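A sketch of these two metrics, assuming a factor panel and a next-period-return panel aligned on dates x tickers (the rank-IC convention used here is one common choice):

```python
import pandas as pd

def daily_ic(factor: pd.DataFrame, fwd_ret: pd.DataFrame) -> pd.Series:
    """Per-date rank IC: Spearman correlation between factor values
    and next-period returns across the cross-section."""
    return factor.corrwith(fwd_ret, axis=1, method="spearman")

def ic_summary(factor: pd.DataFrame, fwd_ret: pd.DataFrame):
    """Mean IC, and IR as mean IC divided by its day-to-day volatility."""
    ic = daily_ic(factor, fwd_ret)
    return ic.mean(), ic.mean() / ic.std()
```

These scalar summaries are what gets serialized back into the LLM's context for the next generation round.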

The key insight: LLM-generated factors carry semantic meaning. They’re not blind operator combinations, but expressions constructed from an understanding of financial concepts. For example, the LLM might generate a “volume-weighted price deviation” factor — something with an inherently interpretable economic rationale.
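For instance, one plausible rendering of a "volume-weighted price deviation" factor (this concrete formula is my illustration, not the paper's):

```python
import pandas as pd

def vwap_deviation(close: pd.DataFrame, volume: pd.DataFrame, w: int = 10) -> pd.DataFrame:
    """Deviation of price from its rolling volume-weighted average price:
    positive when price trades above where most recent volume transacted."""
    vwap = (close * volume).rolling(w).sum() / volume.rolling(w).sum()
    return close / vwap - 1
```

Every term here maps to a stated economic idea (price relative to where volume transacted), which is what makes such a factor reviewable by a human.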

AlphaGPT 2.0: Human-AI Collaboration

AlphaGPT 2.0 introduces human researchers into the loop, creating a Human-in-the-Loop closed cycle:

+-------------------+
|  Research Ideas   |  <-- Human researcher provides direction
+-------------------+
         |
         v
+-------------------+
|    LLM Generates  |  <-- LLM generates candidate factors
|    Alpha Factors  |
+-------------------+
         |
         v
+-------------------+
|    Backtest &     |  <-- Automated backtest evaluation
|    Evaluation     |
+-------------------+
         |
         v
+-------------------+
|  Human Review &   |  <-- Researcher reviews, filters,
|  Feedback         |      provides new direction
+-------------------+
         |
         +-----------> Next iteration
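The cycle above can be sketched as a driver loop; the LLM call, backtester, and review step are stand-in callables, and all names are hypothetical:

```python
def mine_factors(llm_generate, backtest, human_review, direction, rounds=3):
    """One generate -> evaluate -> review pass per round; the reviewer
    both filters candidates and steers the next round's direction."""
    approved = []
    for _ in range(rounds):
        candidates = llm_generate(direction)              # LLM proposes factor expressions
        scored = [(f, backtest(f)) for f in candidates]   # automated backtest evaluation
        kept, direction = human_review(scored)            # researcher filters and redirects
        approved.extend(kept)
    return approved
```

The design choice worth noting is that the human review step returns two things: the surviving factors and an updated research direction, so qualitative judgment feeds directly into the next generation round.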

Researchers can intervene at several points: providing initial research directions (e.g., “explore the relationship between volume anomalies and short-term reversal”), selecting factors with plausible economic logic from candidates, and offering qualitative judgments on backtest results (e.g., “this factor’s outperformance in small caps might be a liquidity illusion”).

The value of this design lies in combining strengths from both sides: the LLM’s search efficiency and breadth, plus human researchers’ domain knowledge and judgment. Pure LLM approaches tend to produce many “statistically significant but economically meaningless” factors. Pure manual research is too slow. Human-AI collaboration strikes a balance.

Results and Limitations

Based on the paper’s experimental results, AlphaGPT produces factors with higher average IC and IR than traditional GP methods on the Chinese A-share market, with better interpretability.

But this direction has clear limitations.

The LLM’s financial knowledge comes from pretraining data, which is inherently lagging. Markets evolve dynamically — factor logic that worked last year may have decayed this year. LLMs can’t perceive changes in market microstructure the way human researchers can.

The search space definition — available operators, data fields — still requires human design. The LLM only searches within a given space; it won’t invent new operators or discover new data sources. True alpha innovation often comes from finding data nobody else has looked at, not from finding more complex combinations of the same data.

Additionally, LLM-generated factors may have high mutual correlation. Without proper deduplication and orthogonalization, combining these factors won’t provide additional information gain.
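A minimal sketch of such a deduplication step (greedy filtering by pairwise correlation; the 0.7 threshold is an arbitrary illustration):

```python
import pandas as pd

def dedup_factors(factor_values: pd.DataFrame, max_corr: float = 0.7) -> list:
    """Greedily keep a factor (one column of factor values) only if its
    absolute correlation with every already-kept factor stays below max_corr."""
    corr = factor_values.corr().abs()
    kept = []
    for name in factor_values.columns:
        if all(corr.loc[name, k] < max_corr for k in kept):
            kept.append(name)
    return kept
```

A fuller pipeline would orthogonalize survivors against each other (e.g. by regression residuals) rather than just dropping near-duplicates, but the filtering idea is the same.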

Implications for Quantitative Research

The direction AlphaGPT represents is essentially using LLMs as a quantitative researcher’s copilot. It won’t replace researchers, but can significantly accelerate hypothesis generation and preliminary validation. Researchers can focus on higher-value work: assessing economic logic, designing portfolio construction schemes, monitoring factor decay.

From a broader perspective, LLM applications in quantitative finance extend beyond factor mining. Sentiment analysis, event-driven signal extraction, automated research report summarization, code generation for backtesting — all of these directions are seeing active exploration. AlphaGPT’s contribution is providing a relatively complete framework for LLM-assisted factor mining, establishing a reference baseline for future work.

That said, alpha is a zero-sum game. When everyone is using LLMs to mine factors, how long these factors remain effective is a question worth watching.