Lesson 2

Data Source Breakdown: Signal Value in News, Social Media, and On-Chain Behavior

This lesson starts from the three elements of "signal—noise—validation" and systematically breaks down the distinct roles of news, social media, and on-chain data in narrative trading. It establishes a reusable layered data framework, laying the groundwork for subsequent structured scoring and strategy mapping.

I. Fundamental Differences Among Three Data Types: Facts, Opinions, and Behaviors

In practice, these three data sources can be understood as three types of “evidence”:

  • News and Announcements: Closer to “fact triggers”

Typical examples include regulatory statements, macro data, exchange announcements, project upgrades, funding and partnership disclosures, etc. Their value lies in providing identifiable time points and event boundaries, making them suitable as “narrative starting points.”

  • Social Media and Community Discussions: Closer to “emotion and attention proxies”

Typical examples include discussion volume, retweet structure, KOL concentration, sentiment polarity, topic clustering, etc. Their value lies in measuring the speed and crowding of narrative diffusion, suitable as “narrative intensity and risk temperature.”

  • On-Chain and Transaction Structure: Closer to “capital behavior evidence”

Typical examples include large transfers, exchange net inflows/outflows, stablecoin supply changes, derivatives open interest and funding rates, trade distribution, etc. Their value lies in verifying whether the narrative truly translates into capital action, suitable as the “realization validation layer.”

The key to narrative trading is not to rely solely on one data type but to let all three complement each other: news provides the starting point, social media provides the temperature, on-chain provides the validation.

II. News Data: Strong Trigger, Weak Persistence—Must Address “Expectation Gap”

News signals have clear event boundaries, which makes time-series research easier. Their common pitfalls are just as apparent:

  • Expectation gap: The market may have already priced in the event; when the news becomes public, price can move in the opposite direction.
  • Semantic ambiguity: The same statement can be interpreted as positive or negative depending on context.
  • Source quality varies: Long repost chains can distort or delay information.

Therefore, news data is better suited as the foundation for an “Event Calendar” and “Narrative Tag Library,” rather than as a direct high-frequency trading trigger.

In practice, news is usually processed into three types of tags:

  • Event type (regulatory/macro/project/security events, etc.)
  • Impact direction (upside risk/downside risk/structural uncertainty)
  • Impact level (global/sector/single asset)
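
The three tag dimensions above can be sketched as a small data structure. This is a minimal illustration, not a prescribed schema: the class names, enum members, and the `build_event_calendar` helper are assumptions made for this example, and the two headlines are placeholders.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from enum import Enum


class EventType(Enum):
    REGULATORY = "regulatory"
    MACRO = "macro"
    PROJECT = "project"
    SECURITY = "security"


class ImpactDirection(Enum):
    UPSIDE_RISK = "upside_risk"
    DOWNSIDE_RISK = "downside_risk"
    STRUCTURAL_UNCERTAINTY = "structural_uncertainty"


class ImpactLevel(Enum):
    GLOBAL = "global"
    SECTOR = "sector"
    SINGLE_ASSET = "single_asset"


@dataclass(frozen=True)
class NewsTag:
    published_at: datetime      # publication time, stored in UTC
    headline: str
    event_type: EventType
    direction: ImpactDirection
    level: ImpactLevel


def build_event_calendar(tags):
    """Return the tags ordered by publication time (hypothetical helper)."""
    return sorted(tags, key=lambda tag: tag.published_at)


# Placeholder items, invented for illustration only.
calendar = build_event_calendar([
    NewsTag(datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc),
            "Placeholder partnership announcement",
            EventType.PROJECT, ImpactDirection.UPSIDE_RISK,
            ImpactLevel.SINGLE_ASSET),
    NewsTag(datetime(2024, 1, 1, 8, 30, tzinfo=timezone.utc),
            "Placeholder regulatory statement",
            EventType.REGULATORY, ImpactDirection.STRUCTURAL_UNCERTAINTY,
            ImpactLevel.GLOBAL),
])
```

Keeping the tags as enumerations rather than free text makes the calendar easy to filter and count in later steps.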

III. Social Media Data: Strong Diffusion, High Noise—Must Address “Manipulation and Homogenization”

Social media data is especially relevant to narrative trading because it directly captures shifts in attention. However, its noise structure is more complex:

  • Homogenization and repetition: Many accounts repeat the same rhetoric; rising discussion volume doesn’t necessarily mean new information.
  • Manipulation and fake traffic: Bots, paid influencers, and coordinated hype create false popularity.
  • Sentiment polarization: Extreme sentiment often accompanies high volatility, so signals may appear as short-lived spikes.

Therefore, social media data is better used for generating “diffusion structure indicators,” rather than simple sentiment scores.

More valuable structural dimensions include:

  • Whether discussions spread from a few nodes to a broader user base;
  • Whether topics resonate across platforms;
  • Whether sentiment shifts from divergence to consensus (or vice versa).

These dimensions are closer to the formation process of capital behavior than simple positive/negative word frequency.
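
As a rough sketch of what a "diffusion structure indicator" might look like, the example below computes two simple measures over a window of posts: the share of posts coming from the most active accounts (a proxy for concentration in a few nodes) and the ratio of distinct texts to total posts (a proxy for homogenization). The function names, the `(author, text)` input format, and the sample posts are all assumptions for illustration; real pipelines would need much more careful deduplication and bot filtering.

```python
from collections import Counter


def top_k_author_share(posts, k=1):
    """Share of posts written by the k most active authors.

    posts: list of (author, text) tuples. Values near 1.0 mean the
    discussion is still concentrated in a few nodes.
    """
    counts = Counter(author for author, _ in posts)
    if not counts:
        return 0.0
    top = sum(count for _, count in counts.most_common(k))
    return top / len(posts)


def unique_text_ratio(posts):
    """Distinct normalized texts divided by total posts.

    Low values suggest many accounts repeating the same rhetoric.
    """
    if not posts:
        return 0.0
    normalized = {" ".join(text.lower().split()) for _, text in posts}
    return len(normalized) / len(posts)


# Invented sample windows: early chatter driven by one account,
# later chatter spread across different users with different content.
early_window = [
    ("kol_a", "Token X to the moon"),
    ("kol_a", "token x to the MOON"),
    ("kol_a", "Token X partnership confirmed"),
    ("user_b", "Token X to the moon"),
]
late_window = [
    ("user_b", "Reading the Token X partnership terms"),
    ("user_c", "Token X volume picking up on spot"),
    ("user_d", "Skeptical about the Token X timeline"),
    ("user_e", "Token X integration is live on testnet"),
]

print(top_k_author_share(early_window), unique_text_ratio(early_window))
print(top_k_author_share(late_window), unique_text_ratio(late_window))
```

In this toy data, raw post volume is identical in both windows, yet the two indicators separate a repeated single-source message from a broader and more varied discussion.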

IV. On-Chain Data: Strong Validation, Weak Explanation—Must Address “Causal Lag”

The biggest advantage of on-chain data is that it is verifiable and hard to forge in aggregate, which makes it suitable as the “realization layer” of narratives. The challenge lies in explaining the chain of causality:

  • The same on-chain phenomenon may correspond to multiple narratives. For example, rising exchange net inflow may signal sell pressure or market making/hedging activity.
  • Causal direction isn’t always clear. On-chain changes may lead or lag price, so they need to be read alongside a microstructure analysis of both derivatives and spot markets.

Therefore, on-chain data is better suited to answer “is capital actually moving” rather than “why price must rise.”

In narrative trading frameworks, on-chain indicators typically serve three validation tasks:

  • Whether sustained capital paths appear after a narrative emerges;
  • Whether abnormal concentration occurs during crowded narrative periods;
  • Whether structural transfers occur before or after sharp price movements.
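
One of the simplest on-chain checks, exchange net inflow, can be sketched as follows. The address labels and transfer records are invented for illustration, and the function assumes a trusted set of exchange-labeled addresses already exists. Note that the result only says whether capital moved toward or away from exchanges; it does not say why.

```python
def exchange_net_inflow(transfers, exchange_addresses):
    """Net amount moved into exchange-labeled addresses.

    transfers: iterable of (sender, receiver, amount).
    Transfers between two exchange addresses, or between two
    non-exchange addresses, do not change the net figure.
    """
    net = 0.0
    for sender, receiver, amount in transfers:
        sender_is_exchange = sender in exchange_addresses
        receiver_is_exchange = receiver in exchange_addresses
        if receiver_is_exchange and not sender_is_exchange:
            net += amount   # deposit into an exchange
        elif sender_is_exchange and not receiver_is_exchange:
            net -= amount   # withdrawal from an exchange
    return net


# Hypothetical labels and transfers.
EXCHANGE_ADDRESSES = {"ex_hot_1", "ex_cold_1"}
transfers = [
    ("wallet_a", "ex_hot_1", 100.0),   # deposit
    ("ex_hot_1", "wallet_b", 40.0),    # withdrawal
    ("ex_hot_1", "ex_cold_1", 500.0),  # internal shuffle, ignored
    ("wallet_c", "wallet_d", 10.0),    # unrelated transfer, ignored
]

print(exchange_net_inflow(transfers, EXCHANGE_ADDRESSES))
```

Excluding internal transfers matters here: counting the 500.0 hot-to-cold move as a flow would swamp the actual deposits and withdrawals.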

V. Organizing Three Data Types into an “Evidence Pyramid”

To reduce noise and improve actionable quality, a three-layer pyramid structure can be used:

  • Base layer: On-chain and transaction structure (hard evidence)

Used to verify whether narratives translate into capital behavior.

  • Middle layer: Social media diffusion and sentiment structure (soft evidence)

Used to measure narrative intensity, crowding, and persistence.

  • Top layer: News and key events (triggers)

Used to locate narrative starting points and update cadence.

The point of this structure is that any trading action should be backed by at least two layers of evidence that agree (“resonance”). Single-layer evidence, especially social media hype alone, is usually only worth monitoring and is not a stable strategy input.
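
The “at least two layers” rule can be expressed as a simple gate. This is only a sketch: the boolean flags stand in for whatever per-layer tests a real system would define, and the names `LayerEvidence` and `classify` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LayerEvidence:
    news_trigger: bool          # top layer: a tagged event exists
    social_diffusion: bool      # middle layer: diffusion is broadening
    onchain_confirmation: bool  # base layer: capital is actually moving


def classify(evidence, min_layers=2):
    """Return 'candidate' when enough layers agree, else 'observe'."""
    active_layers = sum([
        evidence.news_trigger,
        evidence.social_diffusion,
        evidence.onchain_confirmation,
    ])
    return "candidate" if active_layers >= min_layers else "observe"


hype_only = LayerEvidence(False, True, False)
news_plus_flows = LayerEvidence(True, False, True)

print(classify(hype_only))        # observe
print(classify(news_plus_flows))  # candidate
```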

VI. Time Alignment: The Most Underestimated Engineering Problem in Narrative Trading

The three data types have different time granularity: news arrives on a scale of minutes to hours, social media moves in second-level pulses, and on-chain data advances in block time.

If time alignment isn’t rigorous, “false correlation” easily arises:

  • Using future information to explain past price (time travel);
  • Treating delayed on-chain data as instant triggers (causal inversion).

In practice, a unified timeline needs to be established:

  • Event time (news publication time)
  • Discussion peak time (social media heat window)
  • Capital migration time (on-chain transfer confirmation and aggregation window)

Time alignment is a prerequisite for all subsequent scoring models, and it is the critical threshold that narrative research must clear before entering live trading.

Simple Example: How Time Misalignment Leads to Misjudgment

Scenario: A token releases positive news

Actual timeline (aligned)

  • 12:00|Event time: Project releases partnership news
  • 12:00–12:05|Social media diffusion: Discussion heats up; peak at 12:03
  • 12:02–12:15|On-chain capital: Funds start entering (with confirmation and data delay)
  • 12:01–12:08|Price reaction: Price begins rising

Common Error

Treating “data appearance time” as “actual occurrence time”

  • On-chain dashboard shows time: 12:10
  • Actual transaction occurred: 12:02–12:04

Misjudgment result: The dashboard timestamp (12:10) falls after the price move (12:01–12:08), so capital appears to enter only after price has risen, leading to the incorrect conclusion that on-chain flows were not a driving factor.

Time travel (using future to explain past)

  • Using social media heat peak at 12:03
  • Explaining price rise at 12:01

The problem: at 12:01 the 12:03 peak had not yet occurred, so using it introduces future information and distorts backtest results.
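
One common guard against this kind of look-ahead is a point-in-time lookup: when explaining a price at time t, only use the most recent observation stamped at or before t. The sketch below uses the standard-library `bisect` module; the function name `latest_known` and the heat values are invented for illustration, and the date is arbitrary.

```python
from bisect import bisect_right
from datetime import datetime


def latest_known(observations, as_of):
    """Most recent value with timestamp <= as_of, or None if none exists.

    observations: list of (timestamp, value) sorted by timestamp.
    """
    timestamps = [ts for ts, _ in observations]
    index = bisect_right(timestamps, as_of)
    return observations[index - 1][1] if index else None


# Hypothetical per-minute social heat readings around the event.
social_heat = [
    (datetime(2024, 1, 1, 12, 0), 10),
    (datetime(2024, 1, 1, 12, 1), 35),
    (datetime(2024, 1, 1, 12, 2), 70),
    (datetime(2024, 1, 1, 12, 3), 100),  # the peak
]

# Explaining the 12:01 price move may only use what was known at 12:01.
print(latest_known(social_heat, datetime(2024, 1, 1, 12, 1)))   # 35
print(latest_known(social_heat, datetime(2024, 1, 1, 11, 59)))  # None
```

The lookup returns 35 for 12:01, not the later peak of 100, which is exactly the information a live strategy would have had at that moment.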

Correct Approach

A unified timeline must be established with clear definitions for each time type:

  • News: Publication time (Event Time)
  • Social media: Heat formation interval (not just a single point)
  • On-chain: Actual occurrence time, traced back from the displayed time by removing block confirmation and indexing delay
  • Price: Trade execution (matching) time

Without time alignment, only superficial correlation can be obtained; real driving relationships can be identified only within a unified time framework. This is also the key premise for narrative trading to move from research into live trading.
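
To make the on-chain case concrete, the sketch below stores both timestamps for an observation and compares the offset from the price move under each. The numbers are taken from the scenario in this section (transaction at 12:02, dashboard at 12:10, price starting to move at 12:01); the class name, field names, and the calendar date are assumptions for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class OnChainObservation:
    block_time: datetime   # when the transfer actually occurred on-chain
    indexed_at: datetime   # when a dashboard or indexer displayed it

    @property
    def visibility_delay(self) -> timedelta:
        """Confirmation plus indexing delay seen by a data consumer."""
        return self.indexed_at - self.block_time


def offset_minutes(timestamp, reference):
    """Minutes by which `timestamp` follows `reference` (negative = leads)."""
    return (timestamp - reference).total_seconds() / 60


price_move_start = datetime(2024, 1, 1, 12, 1)
observation = OnChainObservation(
    block_time=datetime(2024, 1, 1, 12, 2),
    indexed_at=datetime(2024, 1, 1, 12, 10),
)

# Measured by occurrence time, flows start 1 minute after price begins to move.
print(offset_minutes(observation.block_time, price_move_start))
# Measured by dashboard time, the same flows seem to arrive 9 minutes later.
print(offset_minutes(observation.indexed_at, price_move_start))
print(observation.visibility_delay)
```

Keeping both fields is a deliberate choice in this sketch: occurrence time is the one to use when studying what drove the move, while display time reflects when a live system could actually have reacted.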

VII. Data Quality and Preemptive Risk Control: The “Admission Threshold” for Narrative Trading

Before modeling begins, it’s recommended to define data admission rules such as:

  • News source whitelist and cross-verification mechanism;
  • Social media account credibility layering and abnormal traffic filtering;
  • On-chain address tag library update frequency and tolerance for mislabeling.
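
As one possible shape for the first rule, the sketch below admits a news event only when it is reported by a minimum number of distinct whitelisted sources. The source names, event identifiers, and the threshold of two are assumptions for illustration; grouping reports into events is itself a nontrivial step that is taken as given here.

```python
from collections import defaultdict


def admit_news(reports, whitelist, min_sources=2):
    """Return event ids confirmed by enough distinct whitelisted sources.

    reports: iterable of (event_id, source) pairs.
    """
    sources_by_event = defaultdict(set)
    for event_id, source in reports:
        if source in whitelist:          # ignore non-whitelisted sources
            sources_by_event[event_id].add(source)
    return {
        event_id
        for event_id, sources in sources_by_event.items()
        if len(sources) >= min_sources   # cross-verification threshold
    }


# Hypothetical whitelist and reports.
WHITELIST = {"source_a", "source_b", "source_c"}
reports = [
    ("evt_1", "source_a"),
    ("evt_1", "source_b"),  # evt_1 is cross-verified
    ("evt_2", "source_a"),
    ("evt_2", "blog_x"),    # only one whitelisted source
    ("evt_3", "blog_x"),
    ("evt_3", "blog_y"),    # no whitelisted source at all
]

print(admit_news(reports, WHITELIST))
```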

Stacking more data sources without admission rules only amplifies overfitting risk.

The long-term competitiveness of narrative trading depends far more on whether data governance is engineered into the pipeline than on how sophisticated the indicators look.

VIII. Lesson Summary

This lesson established the core division of labor within the data layer:

  • News provides event triggers and narrative starting points;
  • Social media depicts attention diffusion and sentiment temperature;
  • On-chain verifies capital paths and behavioral realization.

At the same time, this lesson proposed two engineering principles—“evidence pyramid” and “time alignment”—as boundary conditions for subsequent structured modeling.

The next lesson will cover methodology essentials: narrative tags, sentiment scoring, and event mapping—focusing on how to transform unstructured text and on-chain behavior into computable, backtestable, monitorable indicator systems.

Disclaimer
* Crypto investment involves significant risks. Please proceed with caution. The course is not intended as investment advice.
* The course is created by the author who has joined Gate Learn. Any opinion shared by the author does not represent Gate Learn.