What to detect in a URL's HTML as a thumbnail?

I am creating a PHP thumbnail app for links. I fetch the HTML content of the URL, repair it, and traverse it to find anything that would be suitable as a thumbnail for that URL.

The first option is of course checking for OG (Open Graph) tags. Let's put OG aside, because searching for og:image in the name or property attribute of <meta> tags is easy and isn't what this question is about.

However, what if there is no OG source? I guess I would check the contents of all classes and ids, but for what?

What strings should I search for (logo, thumb, ...?), and in what priority order?

Or is there any non-external PHP API providing this functionality?

EDIT

Important: the question has been misunderstood. It is NOT about how to traverse the DOM tree or how to find the <img>. It is about what to consider when searching for it: which class names, ids and so on, and in what priority order.

Answers


I'm not sure exactly how Facebook does it (try the Facebook docs or a web search), but here's something to get you started.

First of all, have a fallback check for the old style:

<link rel="image_src" href="/myimage.jpg"/>
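A minimal sketch of that fallback check, using PHP's built-in DOM extension rather than any particular scraping library. The function name `findImageSrcLink` is hypothetical, and the sketch assumes the `rel` value is written exactly as `image_src`:

```php
<?php
// Hypothetical helper: returns the href of <link rel="image_src">, or null if absent.
function findImageSrcLink(string $html): ?string
{
    if (trim($html) === '') {
        return null; // DOMDocument::loadHTML() rejects empty input
    }

    $doc = new DOMDocument();
    // Real-world markup is often malformed; collect parser warnings instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // Assumption: exact, case-sensitive match on rel="image_src".
    $nodes = $xpath->query('//link[@rel="image_src"]/@href');
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }

    $href = trim($nodes->item(0)->nodeValue);
    return $href !== '' ? $href : null;
}
```

The returned value may be a relative path such as `/myimage.jpg`, so it would still need to be resolved against the page URL before fetching.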

If that fails, then you need to select an appropriate image. You could get really fancy and do Google-style scraping that tries to put things into context, such as looking for images inside the main content frame only (found by checking other URLs on the same website and identifying the common layout template). But to start with, you could try:

  1. Get all image tags and parse out the src attribute
  2. Purge any sources which aren't unique (might indicate icons like social icons)
  3. Fetch all images to a temp directory
  4. Purge any images whose size is not indicative of a featured image (e.g. anything smaller than 300px, perhaps; you'd have to experiment with the threshold)
  5. Purge any images whose aspect ratio is wildly outside that of an expected featured image

Optionally, before step 3, you could try removing any images which are in close proximity to another image in the source code, which could identify things like image navigation menus.
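The purge steps above (2, 4 and 5) can be sketched roughly as follows. This is only an illustration: the function name `filterThumbnailCandidates`, the 300px minimum and the 3:1 ratio limit are assumptions to tune, and the image dimensions are assumed to have been collected already (for example with `getimagesize()` on the downloaded files). Sorting the survivors by area is an extra assumption, not part of the steps.

```php
<?php
/**
 * Hypothetical filter for thumbnail candidates.
 *
 * @param array $images   List of ['src' => string, 'width' => int, 'height' => int]
 * @param int   $minSide  Assumed minimum side length in pixels
 * @param float $maxRatio Assumed maximum long-side / short-side ratio
 * @return array Surviving candidates, largest area first
 */
function filterThumbnailCandidates(array $images, int $minSide = 300, float $maxRatio = 3.0): array
{
    // Step 2: count how often each source occurs; repeats are likely icons.
    $counts = array_count_values(array_column($images, 'src'));

    $kept = [];
    foreach ($images as $img) {
        if ($counts[$img['src']] > 1) {
            continue; // not unique
        }

        $short = min($img['width'], $img['height']);
        $long  = max($img['width'], $img['height']);

        // Step 4: too small to be a featured image (also guards against division by zero).
        if ($short <= 0 || $short < $minSide) {
            continue;
        }

        // Step 5: aspect ratio too extreme, e.g. banners or separators.
        if ($long / $short > $maxRatio) {
            continue;
        }

        $kept[] = $img;
    }

    // Assumption: prefer the largest remaining image.
    usort($kept, static function (array $a, array $b): int {
        return ($b['width'] * $b['height']) <=> ($a['width'] * $a['height']);
    });

    return $kept;
}
```

The first element of the returned array, if any, would be the best guess under these assumptions.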

Anything more than that would probably require a contextual understanding of the webpage being scraped (which is probably what Facebook does). An image followed by several paragraphs, for example, could indicate a featured article image.

On top of all of that, you could make it a factory class into which you can plug additional parsers for specific sites. You could then build more specific parsers for common website layouts, such as WordPress and other CMSs, where 90% of the time you could reasonably expect to identify at least the main content area of the website to narrow your search (if not the exact image of an article, provided the template isn't too customised).
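One possible shape for such a pluggable design, as a sketch only (PHP 7.4+). The names `ThumbnailParser`, `ThumbnailFinder`, `WordPressParser` and `FirstImageParser` are hypothetical. The WordPress parser assumes the theme outputs the featured image with the `wp-post-image` class, which is common but not guaranteed for customised templates.

```php
<?php
// Hypothetical contract: each parser returns an image URL or null if it finds nothing.
interface ThumbnailParser
{
    public function findThumbnail(string $url, string $html): ?string;
}

// Shared helper: evaluates an XPath query and returns the first match's value, or null.
function firstXPathValue(string $html, string $query): ?string
{
    if (trim($html) === '') {
        return null;
    }
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // tolerate malformed markup
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $nodes = (new DOMXPath($doc))->query($query);
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    $value = trim($nodes->item(0)->nodeValue);
    return $value !== '' ? $value : null;
}

// Assumption: WordPress themes usually mark the featured image with "wp-post-image".
final class WordPressParser implements ThumbnailParser
{
    public function findThumbnail(string $url, string $html): ?string
    {
        return firstXPathValue(
            $html,
            '//img[contains(concat(" ", normalize-space(@class), " "), " wp-post-image ")]/@src'
        );
    }
}

// Generic last resort: the first <img> with a src attribute.
final class FirstImageParser implements ThumbnailParser
{
    public function findThumbnail(string $url, string $html): ?string
    {
        return firstXPathValue($html, '//img/@src');
    }
}

// Tries the registered parsers in order and returns the first hit.
final class ThumbnailFinder
{
    /** @var ThumbnailParser[] */
    private array $parsers = [];

    public function register(ThumbnailParser $parser): void
    {
        $this->parsers[] = $parser;
    }

    public function find(string $url, string $html): ?string
    {
        foreach ($this->parsers as $parser) {
            $result = $parser->findThumbnail($url, $html);
            if ($result !== null) {
                return $result;
            }
        }
        return null;
    }
}
```

Registering the most specific parsers first and the generic one last gives a simple priority order; site-specific parsers could also inspect `$url` and return null for hosts they don't recognise.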


You can use simple_html_dom. You can do the work as shown below by searching for different types of tags (img, OG tags, etc.):

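<?php
// simple_html_dom.php must be available on the include path.
include_once('simple_html_dom.php');

$url = '';          // URL of the page to be crawled
$images = array();  // collected image sources

$html = file_get_html($url);
if ($html !== false) { // file_get_html() returns false on failure
    foreach ($html->find('img') as $img) { // 'img' is one option; other selectors work too
        $src = $img->getAttribute('src');
        if (!empty($src)) {
            $images[] = $src;
        }
    }
}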

EDIT: I have shown how to crawl an HTML page and find tags such as img. However, the main problem here is how to choose the right images. I have given only img as an option, but as I said, you can use other tags as well.

