Boilerpipe

Marfeel uses a bot to determine which content on a publisher's site is extracted and Marfeelized.

Boilerpipe is the collection of algorithms that breaks down a publisher's article on their desktop website and identifies the content to be replicated in the customer's Marfeel Progressive WebApp (PWA).

Each page of a customer's desktop site contains HTML with different UI elements and behaviors. Boilerpipe parses all the components of this HTML and determines which ones Marfeel requires. It then stores these components on Marfeel's servers, where they are modified and mirrored in that customer's Marfeel PWA with the optimized UX that drives engagement and maximizes ad revenue.

The components that Boilerpipe is designed to identify and store are those integral to the article itself, such as the text and the relevant media.

Boilerpipe treats UI elements such as desktop advertising, sharing bars, and any latent content as extraneous, because these are the components and behaviors that the Marfeel solution itself optimizes. These optimizations, substantiated through data analysis, aim to maximize engagement and, in turn, the revenue generated per visit.

The process Boilerpipe follows can be broken down into the following steps:

1 - MarfeelExtractor

The MarfeelExtractor in Boilerpipe scans the HTML for text only. It disregards markup such as image code and other tags.
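
The text-only scan described above can be illustrated with a minimal sketch. It uses Python's standard html.parser rather than Marfeel's own code, and the class name TextBlockCollector and the sample markup are hypothetical. It keeps visible text blocks and ignores tags, images, and script or style content.

```python
from html.parser import HTMLParser


class TextBlockCollector(HTMLParser):
    """Collects visible text blocks and ignores all markup (hypothetical helper)."""

    SKIPPED = {"script", "style"}  # content of these tags is never article text

    def __init__(self) -> None:
        super().__init__()
        self.blocks: list[str] = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Collapse whitespace and drop empty runs between tags.
        text = " ".join(data.split())
        if text and not self._skip_depth:
            self.blocks.append(text)


sample = """
<h1>City council approves new bike lanes</h1>
<img src="/img/lead.jpg" alt="Bike lane">
<script>trackPageView();</script>
<p>The council voted on Tuesday.</p>
"""

collector = TextBlockCollector()
collector.feed(sample)
text_blocks = collector.blocks
```

Here text_blocks holds only the heading and the paragraph; the image tag and the script body are left out.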

The first operation it performs is to find the title of the page or article by searching for a string of text that is similar to the page URL. 
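
As a rough illustration of matching text against the URL, the sketch below compares each candidate text block with the last segment of the URL path and keeps the closest one. The similarity measure (difflib.SequenceMatcher) and the function names slug_words and guess_title are assumptions made for this example, not Marfeel's actual implementation.

```python
import re
from difflib import SequenceMatcher
from urllib.parse import urlparse


def slug_words(url: str) -> str:
    """Return the last path segment of the URL as space-separated lowercase words."""
    path = urlparse(url).path.rstrip("/")
    slug = path.rsplit("/", 1)[-1].lower()
    slug = re.sub(r"\.[a-z0-9]+$", "", slug)  # drop a file extension such as .html
    return " ".join(re.findall(r"[a-z0-9]+", slug))


def guess_title(url: str, candidates: list[str]) -> str:
    """Pick the candidate text block most similar to the URL slug."""
    target = slug_words(url)

    def score(text: str) -> float:
        normalized = " ".join(re.findall(r"[a-z0-9]+", text.lower()))
        return SequenceMatcher(None, target, normalized).ratio()

    return max(candidates, key=score)


candidates = [
    "Subscribe to our newsletter",
    "City council approves new bike lanes",
    "The council voted on Tuesday to add twelve kilometres of bike lanes.",
]
title = guess_title(
    "https://example.com/news/city-council-approves-new-bike-lanes.html",
    candidates,
)
```

In this example the second candidate matches the slug word for word, so it is selected as the title.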

To determine the remainder of the text to extract, Boilerpipe relies on several heuristics. For example, to identify the first paragraph of the article, it looks for the largest block of text close to the title.
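
The first-paragraph heuristic could be sketched as follows. This assumes that "close to the title" means within a small number of text blocks after it; the window size of 3 and the function name first_paragraph are illustrative choices, not values taken from Boilerpipe.

```python
def first_paragraph(blocks: list[str], title_index: int, window: int = 3) -> str:
    """Return the longest text block among the `window` blocks that follow the title."""
    nearby = blocks[title_index + 1 : title_index + 1 + window]
    if not nearby:
        raise ValueError("no text blocks follow the title")
    return max(nearby, key=len)


blocks = [
    "Home / News",
    "City council approves new bike lanes",  # the title, at index 1
    "By Jane Doe",
    "The council voted on Tuesday to add twelve kilometres of protected bike lanes downtown.",
    "Share this article",
    "Related stories: a much longer teaser that sits far below the title and should "
    "not be picked even though it is the longest block of text on the whole page.",
]
lead = first_paragraph(blocks, title_index=1)
```

Limiting the search to a window is what keeps the long teaser at the bottom of the page from being mistaken for the opening paragraph.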

2 - HTMLDocumentProcessor

Next, Boilerpipe runs the following detectors, which identify specific elements in the HTML code to be extracted for the customer's Marfeel PWA:

ImageDocumentSAXProcessor

Identifies all the media in a publisher's article that the Marfeel platform supports, and records the order in which each item appears on the page.
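
A minimal sketch of order-preserving media detection, using Python's event-driven html.parser as a stand-in for a SAX-style processor. The set of supported tags and the class name MediaCollector are assumptions; the real list of media supported by the Marfeel platform is not given here.

```python
from html.parser import HTMLParser

# Assumed set of media tags, for illustration only.
MEDIA_TAGS = {"img", "video", "iframe"}


class MediaCollector(HTMLParser):
    """Records (tag, src) pairs in the order the parser encounters them."""

    def __init__(self) -> None:
        super().__init__()
        self.media: list[tuple[str, str]] = []

    def handle_starttag(self, tag, attrs):
        if tag in MEDIA_TAGS:
            src = dict(attrs).get("src")
            if src:
                self.media.append((tag, src))


article = """
<article>
  <img src="/img/lead.jpg">
  <p>The council voted on Tuesday.</p>
  <iframe src="https://video.example.com/embed/1"></iframe>
  <img src="/img/second.jpg" />
</article>
"""

media_collector = MediaCollector()
media_collector.feed(article)
media = media_collector.media
```

Because the parser fires events as it reads the document from top to bottom, the resulting list keeps the media in page order without any extra sorting.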

HTMLDocumentSAXProcessor

Marfeel uses its own HTML code for a publisher's Marfeelized mobile site. This code is entirely different from the customer's original code, in order to create a more responsive mobile site that promotes speed and smoothness and engages the reader.

The HTMLDocumentSAXProcessor is responsible for replacing the code of all the elements that Boilerpipe has identified for extraction.

For every new piece of code replacing the publisher's original desktop HTML, Marfeel inserts two types of code: a detector and a replacer.
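
One way to picture the detector and replacer pairing is the sketch below. It assumes that a detector decides whether an element matches and that a replacer returns the new representation of that element. The Rule class, the dictionary shape of an element, and the lazy-loading image rule are all hypothetical and only serve to show the pattern.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical element shape: {"tag": str, "attrs": dict}


@dataclass
class Rule:
    detector: Callable[[dict], bool]  # does this rule apply to the element?
    replacer: Callable[[dict], dict]  # what the element becomes if it does


def apply_rules(elements: list[dict], rules: list[Rule]) -> list[dict]:
    """Replace each element with the output of the first rule whose detector matches."""
    result = []
    for element in elements:
        for rule in rules:
            if rule.detector(element):
                element = rule.replacer(element)
                break
        result.append(element)
    return result


# Illustrative rule: rewrite desktop <img> tags as lazy-loaded images.
lazy_image = Rule(
    detector=lambda el: el["tag"] == "img",
    replacer=lambda el: {
        "tag": "img",
        "attrs": {"data-src": el["attrs"]["src"], "loading": "lazy"},
    },
)

original = [
    {"tag": "img", "attrs": {"src": "/img/lead.jpg"}},
    {"tag": "p", "attrs": {}},
]
rewritten = apply_rules(original, [lazy_image])
```

Keeping detection and replacement as separate callables means a new kind of element can be supported by adding a rule, without changing the loop that applies them.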

DocumentModifiers

DocumentModifiers are responsible for implementing the features that boost engagement in the Marfeel solution.

For example, images within the same container are detected, extracted, and modified into collapsed galleries in the publisher's Marfeel PWA to optimize the UI.
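
The gallery example can be sketched as a grouping step. This assumes each detected image is already tagged with an identifier for its container; the function name collapse_galleries and the output dictionaries are hypothetical, and the real modifier may use different criteria.

```python
from itertools import groupby


def collapse_galleries(images: list[tuple[str, str]]) -> list[dict]:
    """Turn runs of consecutive images that share a container into one gallery.

    `images` is a list of (container_id, src) pairs in document order.
    """
    result = []
    for _container, group in groupby(images, key=lambda item: item[0]):
        sources = [src for _, src in group]
        if len(sources) > 1:
            result.append({"type": "gallery", "images": sources})
        else:
            result.append({"type": "image", "src": sources[0]})
    return result


images = [
    ("hero", "/img/lead.jpg"),
    ("slideshow", "/img/1.jpg"),
    ("slideshow", "/img/2.jpg"),
    ("slideshow", "/img/3.jpg"),
]
layout = collapse_galleries(images)
```

In this example the lone hero image stays a single image, while the three slideshow images are merged into one gallery entry.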

There are various types of DocumentModifiers with different features. Each is defined in the customer's definition.json.