How can I retrieve the main image of a blog post/news article?
I have made a news aggregator, Newzupp, which I want to modify. Right now I simply display the titles of the news stories and link them to their URLs.
I am planning to make it more graphical by using images + titles instead of plain titles. I want to know how I can get the main image of each article (somewhat similar to Google News).
One way I can think of is to strip all the images from the page and display the one that points to the same article. But I do not think that will be efficient. Is there any other way of doing this?
I have found a solution to it.
- Fetch the contents of the url [html/xml]
- Scrape the content using hpricot
- Find all elements with tag "img"
- Do some research to find which of them is the main display image. [Like 6th image in case of Wired.com's rss feed]
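For illustration, the steps above could be sketched roughly as follows. This is a dependency-free approximation: it assumes the HTML has already been fetched into a string, and it uses a simple regex scan in place of Hpricot. The names `image_sources`, `main_image_for` and the `MAIN_IMAGE_INDEX` table are made up for this example; only the "6th image for Wired.com" rule comes from my own research above.

```ruby
# Returns the src of every <img> tag in document order.
# Simplified regex scan; a real scraper would use an HTML parser.
def image_sources(html)
  html.scan(/<img\b[^>]*>/i).filter_map do |tag|
    # Match a quoted src attribute, but not data-src or similar.
    m = tag.match(/(?<![\w-])src\s*=\s*(?:"([^"]*)"|'([^']*)')/i)
    m && (m[1] || m[2])
  end
end

# Hypothetical per-site rules found by manual research:
# site => 0-based index of the main image (5 means "6th image").
MAIN_IMAGE_INDEX = { "wired.com" => 5 }.freeze

# Returns the main image for a known site, or nil if there is no rule
# for that site or the page has too few images.
def main_image_for(site, html)
  index = MAIN_IMAGE_INDEX[site]
  index && image_sources(html)[index]
end
```

The weak point is obvious from the table: every site needs its own hand-maintained rule, and the rule breaks whenever the site changes its layout.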
I still think this is highly inefficient. I would like to know how services like Google News scrape the sites/blogs and display relevant images.
Perhaps you could filter or sort by image size or by position in the DOM hierarchy (e.g. nearest the top of the body, or immediately after an h1 tag).
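A minimal sketch of that ranking idea, assuming you have already collected each image's dimensions and its order of appearance in the page. The `ImageCandidate` struct, the `pick_main_image` helper and the 100px threshold are all assumptions made for the example, not anything a library provides.

```ruby
# Hypothetical record for an <img> found on the page. width/height are
# assumed to come from the tag's attributes or from the downloaded file;
# position is the image's 0-based order of appearance in the document.
ImageCandidate = Struct.new(:src, :width, :height, :position)

# Assumed threshold: anything smaller is probably an icon or tracking pixel.
MIN_SIDE = 100

# Picks the largest image by area; ties go to the one nearest the top.
# Returns nil when no candidate is big enough.
def pick_main_image(candidates)
  candidates
    .select { |c| c.width.to_i >= MIN_SIDE && c.height.to_i >= MIN_SIDE }
    .max_by { |c| [c.width.to_i * c.height.to_i, -c.position] }
end
```

Images with missing dimensions are treated as zero-sized and dropped, so in practice you may need to download those files to measure them.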
What about a blacklist of advert hosts, whose images you would ignore?
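Something along these lines, as a rough sketch. The two hosts in `AD_HOSTS` are only sample entries (a usable list would be far longer and need upkeep), and `ad_image?` is a name invented for the example.

```ruby
require "uri"

# Sample blacklist entries only; a real list would be much longer.
AD_HOSTS = %w[doubleclick.net googlesyndication.com].freeze

# True when the image URL is served from a blacklisted host or one of
# its subdomains.
def ad_image?(src)
  host = URI.parse(src).host
  return false if host.nil? # relative URL: served by the article's own site

  host = host.downcase
  AD_HOSTS.any? { |ad| host == ad || host.end_with?(".#{ad}") }
rescue URI::InvalidURIError
  true # assumption: an unparseable URL is not worth displaying
end
```

You would then drop the flagged ones with something like `srcs.reject { |s| ad_image?(s) }`.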
Since, generally speaking, adverts are hosted elsewhere while story-related images are hosted within the same domain, perhaps you could filter the page for those images that have the same base URL as the site itself.
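A possible sketch of that filter, using only Ruby's standard `uri` library. The function name `same_site_images` is made up for the example, and treating subdomains (such as a `cdn.` host) as "the same site" is an assumption you may want to tighten or loosen.

```ruby
require "uri"

# Keeps only the image URLs served from the article's own host or one of
# its subdomains. Relative URLs are resolved against the page URL first.
def same_site_images(page_url, image_srcs)
  # Drop a leading "www." so www.example.com and cdn.example.com both match.
  page_host = URI.parse(page_url).host.to_s.downcase.sub(/\Awww\./, "")

  image_srcs.select do |src|
    begin
      host = URI.join(page_url, src).host.to_s.downcase
      host == page_host || host.end_with?(".#{page_host}")
    rescue URI::Error
      false # skip anything that cannot be parsed as a URL
    end
  end
end
```

Note that this will miss sites that serve story images from an unrelated CDN domain, so it works better as one signal among several than as the only test.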
Why not just convert all the scraped images (using Hpricot/Nokogiri) to square thumbnails (using RMagick or something like it, or just resizing them on the server side) and group them in one DIV just below the topic body? You can then use a lightbox with a slideshow to show the actual images only when the user clicks on them. That way the page looks more graphical without spoiling the look of your site. Finding the most relevant image is tricky.
You could also try looking for Open Graph meta tags on the pages. Most news sites use the og:image property to specify the main image of an article.
<meta property="og:image" content="http://ia.media-imdb.com/images/rock.jpg" />
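Reading that tag could look roughly like the sketch below. To stay dependency-free it scans the HTML string with regular expressions, which is a simplification; with Nokogiri you would more likely use a CSS selector such as `meta[property="og:image"]`. The helper name `og_image` is made up for the example.

```ruby
# Returns the og:image URL declared in an HTML string, or nil if none.
# Simplified regex scan; assumes attribute values are quoted.
def og_image(html)
  html.scan(/<meta\b[^>]*>/i).each do |tag|
    # Collect attribute name/value pairs (double or single quoted).
    attrs = tag.scan(/([\w:-]+)\s*=\s*(?:"([^"]*)"|'([^']*)')/)
               .to_h { |name, dq, sq| [name.downcase, dq || sq] }
    return attrs["content"] if attrs["property"] == "og:image" && attrs["content"]
  end
  nil
end
```

When this returns nil (the page has no og:image), you can fall back to one of the heuristics from the other answers.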