Yahoo Comment Extraction Strategy
Use a two-layer strategy for resilience: parse the inline state JSON first, and fall back to DOM anchors when that fails.
Primary Strategy: Inline State JSON
Parse window.__PRELOADED_STATE__ and extract:
- total comment count (parent comments plus replies)
- parent-only comment count
- user comments with IDs, reaction counts, reply totals
- pagination URL fields
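As a minimal sketch of the first step, the inline state can be located with a regular expression and decoded with the standard library. The function name extract_preloaded_state is hypothetical, and the sketch assumes the page assigns a plain JSON object literal to window.__PRELOADED_STATE__. The key paths for counts, comments, and pagination fields inside the state are not shown because they depend on the actual payload.

```python
import json
import re
from typing import Any, Optional

# Matches only the assignment prefix; the JSON value itself is decoded separately.
_STATE_PREFIX = re.compile(r"window\.__PRELOADED_STATE__\s*=\s*")


def extract_preloaded_state(html: str) -> Optional[dict[str, Any]]:
    """Return the inline state object, or None if it is absent or malformed."""
    match = _STATE_PREFIX.search(html)
    if match is None:
        return None
    try:
        # raw_decode parses exactly one JSON value starting at the given index
        # and ignores whatever follows it (a trailing semicolon, </script>, ...).
        state, _end = json.JSONDecoder().raw_decode(html, match.end())
    except json.JSONDecodeError:
        return None
    return state if isinstance(state, dict) else None
```

A None result is the signal to switch to the DOM fallback.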
Reply Strategy
Follow each comment permalink (/profile/news/comments/...) and parse the nested replies from that page's state payload.
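The reply walk might look like the sketch below. It is illustrative only: the helper names permalink_url and iter_replies are hypothetical, and the "replies" key is an assumed shape for the nested data, not a confirmed field of the real payload. Fetching the permalink page is left out, and the site origin is passed in rather than hard-coded.

```python
from typing import Any, Iterator
from urllib.parse import urljoin


def permalink_url(base_url: str, path: str) -> str:
    """Resolve a relative comment permalink against the site origin."""
    return urljoin(base_url, path)


def iter_replies(
    node: dict[str, Any], depth: int = 0
) -> Iterator[tuple[int, dict[str, Any]]]:
    """Yield (depth, reply) pairs in depth-first order.

    Assumption: each node keeps its children in a list under "replies".
    A missing or null "replies" value is treated as no children.
    """
    for reply in node.get("replies") or []:
        yield depth + 1, reply
        yield from iter_replies(reply, depth + 1)
```

Yielding the depth alongside each reply keeps the nesting recoverable even if the replies are later stored as a flat list.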
Fallback Strategy: DOM Anchors
Use stable selectors when state parsing fails:
- a[href*="/profile/news/comments/"] for user comment permalinks
- time a[href*="/profile/commentator/"][href*="/comments/"] for commentator entries
- button labels for reaction/reply counts
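A minimal sketch of the first selector, using only the standard library: it approximates a[href*="/profile/news/comments/"] with a substring check on each anchor's href. The class and function names are hypothetical. The commentator selector and the button labels would need a parser with real CSS selector support or additional state tracking, and are not covered here.

```python
from html.parser import HTMLParser

COMMENT_PATH = "/profile/news/comments/"


class PermalinkCollector(HTMLParser):
    """Collect hrefs of <a> tags whose href contains COMMENT_PATH."""

    def __init__(self) -> None:
        super().__init__()
        self.permalinks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        # attrs is a list of (name, value) pairs; value may be None.
        href = dict(attrs).get("href") or ""
        # Skip duplicates so each comment is followed only once.
        if COMMENT_PATH in href and href not in self.permalinks:
            self.permalinks.append(href)


def find_comment_permalinks(html: str) -> list[str]:
    """Return unique comment permalinks in document order."""
    collector = PermalinkCollector()
    collector.feed(html)
    collector.close()
    return collector.permalinks
```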
Data Model
Persist normalized fields:
- comment_id, author, content, likes, replies_count
- reactions_empathized, reactions_understood, reactions_questioning
- replies JSON
Also persist article-level metrics (total_comments_count, total_parent_comments_count, scraped_comments_count).
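One way to hold the per-comment fields is a dataclass backed by a SQLite table, sketched below. The storage choice, the column types, the replies_json column name, and the save_comment helper are all assumptions made for illustration. Only the field names come from the list above. Article-level metrics are omitted and could live in a separate table keyed by article.

```python
import json
import sqlite3
from dataclasses import dataclass, field


@dataclass
class Comment:
    comment_id: str
    author: str
    content: str
    likes: int = 0
    replies_count: int = 0
    reactions_empathized: int = 0
    reactions_understood: int = 0
    reactions_questioning: int = 0
    replies: list = field(default_factory=list)  # nested replies, stored as JSON


SCHEMA = """
CREATE TABLE IF NOT EXISTS comments (
    comment_id TEXT PRIMARY KEY,
    author TEXT,
    content TEXT,
    likes INTEGER,
    replies_count INTEGER,
    reactions_empathized INTEGER,
    reactions_understood INTEGER,
    reactions_questioning INTEGER,
    replies_json TEXT
)
"""


def save_comment(conn: sqlite3.Connection, c: Comment) -> None:
    """Insert the comment, or overwrite the row with the same comment_id."""
    conn.execute(SCHEMA)
    conn.execute(
        "INSERT OR REPLACE INTO comments VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)",
        (
            c.comment_id, c.author, c.content, c.likes, c.replies_count,
            c.reactions_empathized, c.reactions_understood,
            c.reactions_questioning,
            json.dumps(c.replies, ensure_ascii=False),
        ),
    )
    conn.commit()
```

Keying on comment_id makes repeated scrapes of the same article idempotent: a re-scraped comment updates its row instead of creating a duplicate.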