Extracting Article Content from Yahoo News
Article content on Yahoo News is extracted by trying several CSS selectors in order and falling back to joined paragraph text when none of them match.
Content Selectors
Try selectors in order of preference:
const contentSelectors = [
  '.article_body', // Primary selector
  '.sc-fLlhyt', // Alternative class
  'article .highLightSearchTarget', // Highlighted content
  '[class*="article"] [class*="body"]', // Dynamic class
  'article p', // Last resort: querySelector returns only the first paragraph
];
Extraction Logic
// page.evaluate runs its callback in the browser, where Node-side variables
// are not in scope, so the selector list is passed in as an argument.
const articleData = await page.evaluate((selectors) => {
  let content = '';
  // Try each selector in order and keep the first substantial match
  for (const selector of selectors) {
    const text = document.querySelector(selector)?.textContent?.trim() ?? '';
    if (text.length > 100) {
      content = text;
      break;
    }
  }
  // Fallback: concatenate all paragraphs
  if (!content) {
    const paragraphs = Array.from(document.querySelectorAll('article p, .article p'));
    content = paragraphs.map(p => p.textContent?.trim()).filter(Boolean).join('\n\n');
  }
  return { content };
}, contentSelectors);
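The selection rule above can be exercised without a browser by separating it from the DOM lookup. The following is a minimal sketch, not the scraper's actual code: `pickContent`, `Lookup`, and `fakePage` are hypothetical names introduced for illustration, and the 100-character threshold simply mirrors the check used above.

```typescript
// Hypothetical helper: choose the first selector whose text is long enough.
// `lookup` stands in for document.querySelector(selector)?.textContent.
type Lookup = (selector: string) => string | null | undefined;

function pickContent(
  selectors: readonly string[],
  lookup: Lookup,
  minLength = 100,
): string {
  for (const selector of selectors) {
    const text = (lookup(selector) ?? '').trim();
    // Strictly longer than minLength, matching the `> 100` check above
    if (text.length > minLength) {
      return text;
    }
  }
  return ''; // empty string signals the caller to use the paragraph fallback
}

// Usage with a fake page: only the second selector has enough text.
const fakePage: Record<string, string> = {
  '.article_body': 'Too short',
  '.sc-fLlhyt': 'x'.repeat(150),
};
const picked = pickContent(['.article_body', '.sc-fLlhyt'], (s) => fakePage[s]);
console.log(picked.length); // 150
```

Keeping the rule in a pure function like this makes it possible to unit test the fallback order without launching a browser.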
Extract Images
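// Runs in the browser context (for example inside page.evaluate)
const images: string[] = [];
// The type argument is needed: a compound selector is typed as Element, which has no `src`
const imgElements = document.querySelectorAll<HTMLImageElement>('article img, .article img');
imgElements.forEach(img => {
  const src = img.src;
  if (src && !src.includes('logo') && !src.includes('icon')) {
    images.push(src);
  }
});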
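The same filter can also be applied to a plain list of URLs after they have been returned from the page. This is a sketch under assumptions: `filterArticleImages` is a hypothetical helper, and the de-duplication and case-insensitive matching are additions for illustration rather than behavior of the code above.

```typescript
// Hypothetical helper: drop likely non-content images and duplicates.
function filterArticleImages(srcs: readonly string[]): string[] {
  const seen = new Set<string>();
  const result: string[] = [];
  for (const src of srcs) {
    if (!src) continue; // skip empty src values
    const lower = src.toLowerCase();
    // Same substring heuristic as above, made case-insensitive (an assumption)
    if (lower.includes('logo') || lower.includes('icon')) continue;
    if (seen.has(src)) continue; // keep the first occurrence only
    seen.add(src);
    result.push(src);
  }
  return result;
}

const sampleSrcs = [
  'https://example.com/photo1.jpg',
  'https://example.com/site-Logo.png',
  'https://example.com/photo1.jpg',
  'https://example.com/icons/share.svg',
];
// Only photo1.jpg survives: the logo, the icon, and the duplicate are dropped.
console.log(filterArticleImages(sampleSrcs));
```

A substring check is a blunt heuristic: a legitimate image whose URL happens to contain "icon" (for instance inside the word "silicon") would also be dropped.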
Extract Metadata
// Category
const categoryElement = document.querySelector('.category, [class*="category"]');
const category = categoryElement?.textContent?.trim() || '';
// Publish date
const dateElement = document.querySelector('time, .date, [class*="date"]');
const publishedDate = dateElement?.getAttribute('datetime') ||
  dateElement?.textContent?.trim() || '';
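Because `publishedDate` may come from either a `datetime` attribute or free-form text, it can help to normalize it before storing. The sketch below assumes a hypothetical helper named `normalizePublishedDate`; it relies on the built-in `Date` parser, which only guarantees consistent results for ISO 8601 style strings, so other formats may parse differently across runtimes.

```typescript
// Hypothetical helper: return an ISO 8601 timestamp, or null if unparseable.
function normalizePublishedDate(raw: string): string | null {
  const trimmed = raw.trim();
  if (!trimmed) return null;
  const parsed = new Date(trimmed);
  // An invalid Date has a NaN timestamp
  return Number.isNaN(parsed.getTime()) ? null : parsed.toISOString();
}

console.log(normalizePublishedDate('2024-01-15T09:30:00Z')); // 2024-01-15T09:30:00.000Z
console.log(normalizePublishedDate('')); // null
```

Returning `null` instead of an empty string keeps "no date found" distinguishable from a valid value in downstream code.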
Validation
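if (!articleData.content) {
  throw new Error('Could not extract article content');
}
Always check that content was extracted before processing the article further. An empty result means that none of the selectors, nor the paragraph fallback, matched the page.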
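A slightly stricter check can include the article URL and the extracted length in the error message, which makes failures easier to trace. This is only a sketch: `ArticleData`, `assertHasContent`, and the `minLength` parameter are hypothetical names chosen for illustration, and the example URL is a placeholder.

```typescript
// Hypothetical shape of the object returned by the extraction step.
interface ArticleData {
  content: string;
}

// Hypothetical helper: throw if the extracted content is missing or too short.
function assertHasContent(data: ArticleData, url: string, minLength = 1): void {
  const length = data.content.trim().length;
  if (length < minLength) {
    throw new Error(
      `Could not extract article content from ${url} (got ${length} characters)`,
    );
  }
}

// Usage: whitespace-only content is treated as a failed extraction.
try {
  assertHasContent({ content: '   ' }, 'https://example.com/article');
} catch (err) {
  console.error((err as Error).message);
}
```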