Image Detection and Extraction

Images are a vital element of any publication. They add meaning and enrich the content they represent, but just as important, they also increase engagement.

Detecting images correctly when crawling articles matters to us because images matter to our partners and to the content they produce.

Any image Marfeel detects is delivered from the publisher's servers. Marfeel doesn't host images on its own servers or CDNs and doesn't modify images to optimize their file size.

In addition, if a publisher uses the srcset attribute to specify multiple image sources for different media conditions, Marfeel's crawlers will honour it. This is how publishers should handle images, and it is our preference because it reduces the number of bytes each reader has to download.
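To make the multiple-source markup concrete, here is a minimal sketch that reads the candidates out of an HTML srcset attribute using Python's standard html.parser module. It is not Marfeel's crawler code: the sample markup, the publisher.example URLs, and the names SrcsetParser and srcset_candidates are all assumptions made for illustration, and the comma split is a simplification that would mis-handle URLs containing commas.

```python
from html.parser import HTMLParser


class SrcsetParser(HTMLParser):
    """Collects (url, descriptor) candidates from every <img srcset="...">."""

    def __init__(self):
        super().__init__()
        self.candidates = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        srcset = dict(attrs).get("srcset") or ""
        # Simplification: split on commas; real URLs may contain commas.
        for entry in srcset.split(","):
            parts = entry.split()
            if parts:
                # A candidate is a URL plus an optional width/density descriptor.
                descriptor = parts[1] if len(parts) > 1 else ""
                self.candidates.append((parts[0], descriptor))


def srcset_candidates(html):
    """Return every srcset candidate found in the given HTML string."""
    parser = SrcsetParser()
    parser.feed(html)
    return parser.candidates


# Hypothetical publisher markup; the URLs are placeholders.
SAMPLE = (
    '<img src="https://publisher.example/photo-800.jpg" '
    'srcset="https://publisher.example/photo-400.jpg 400w, '
    'https://publisher.example/photo-800.jpg 800w" '
    'sizes="(max-width: 600px) 400px, 800px" alt="Example photo">'
)

print(srcset_candidates(SAMPLE))
```

Each candidate pairs an image URL with a width descriptor, so a small screen can be served the 400-pixel file instead of the 800-pixel one.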

According to our extraction policy, images must be present in the HTML so that our crawler can identify and extract them. We do, however, offer a couple of solutions to clients whose images are not coded in HTML.

HTML Image Solutions

If a client is not using <img> tags, we require a helper element so that our crawler can get the image URL. When images would otherwise be undetectable to Marfeel, there are two options:

1. Add images to an article

In this solution, the client adds the images that are loaded with JavaScript to the HTML as well, and then hides them with CSS.

To do this, a <div data-src="http://example.jpg"> element is added to the client's HTML to identify the image to our crawler. This <div> is then hidden with CSS.
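As a rough illustration of this option, the sketch below pairs a sample article fragment with a small parser that reads the data-src value from the hidden helper element. The class name crawler-image, the surrounding markup, and the Python names DataSrcParser and find_helper_images are hypothetical, and the parser only approximates what a crawler might do; it is not Marfeel's implementation.

```python
from html.parser import HTMLParser

# Hypothetical article fragment: the visible image is injected by JavaScript,
# so a hidden helper <div> carries the image URL for the crawler.
ARTICLE_HTML = """
<style>.crawler-image { display: none; }</style>
<article>
  <div class="crawler-image" data-src="http://example.jpg"></div>
  <p>Article text.</p>
</article>
"""


class DataSrcParser(HTMLParser):
    """Collects the data-src value of every <div> that has one."""

    def __init__(self):
        super().__init__()
        self.image_urls = []

    def handle_starttag(self, tag, attrs):
        url = dict(attrs).get("data-src")
        if tag == "div" and url:
            self.image_urls.append(url)


def find_helper_images(html):
    """Return the image URLs exposed through helper <div data-src> elements."""
    parser = DataSrcParser()
    parser.feed(html)
    return parser.image_urls


print(find_helper_images(ARTICLE_HTML))
```

Because the helper element is hidden with display: none, readers never see it, while the URL stays available in the page source.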

2. Produce specific HTML for Marfeel's crawler

A filter can be implemented so that HTML images are served only when the request comes from our crawler.

In this case, an <img> tag is created for each of the client's images and served only to our crawler, which identifies itself with the User-Agent "Marfeel-crawler".

The benefit of this option is that it doesn't affect anything a regular visitor sees.
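One way such a filter might look on the server side is sketched below. It assumes the User-Agent header simply contains the string "Marfeel-crawler" and that a case-insensitive substring match is enough. The function names is_marfeel_crawler and render_images are hypothetical, and a real integration would live in the client's own template or middleware layer.

```python
from html import escape

CRAWLER_TOKEN = "Marfeel-crawler"  # User-Agent value the crawler identifies as


def is_marfeel_crawler(user_agent):
    """Assumption: a case-insensitive substring match is sufficient."""
    return CRAWLER_TOKEN.lower() in (user_agent or "").lower()


def render_images(image_urls, user_agent):
    """Return <img> markup for the crawler and an empty string for everyone else."""
    if not is_marfeel_crawler(user_agent):
        return ""
    # Escape each URL so it is safe inside a double-quoted attribute.
    return "\n".join(
        f'<img src="{escape(url, quote=True)}">' for url in image_urls
    )


# Placeholder URL for illustration only.
urls = ["https://publisher.example/photo.jpg"]
print(render_images(urls, "Marfeel-crawler"))  # markup for the crawler
print(render_images(urls, "Mozilla/5.0"))      # nothing for regular visitors
```

Keeping the check in one small function makes it easy to adjust if the exact User-Agent string differs from the assumption made here.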