Article Details Metadata
Marfeel uses Boilerpipe to detect and extract a publisher's metadata for the article details. The customer's page is scraped tag by tag to identify the metadata, extract it, and store it in the article details.
In Boilerpipe, metadata extraction is handled by a detector. Whenever the scraper encounters a script, Marfeel scans it to identify and parse the relevant elements according to the heuristics in place.
When a metadataProvider is detected, Marfeel extracts the information and stores it at the beginning of an article.
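The flow described above can be sketched in plain Java. This is only an illustration under stated assumptions: the `ScriptDetector` interface, the `SectionDetector` heuristic, and the `scan` method are hypothetical names invented for this sketch and are not part of Marfeel's or Boilerpipe's API.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MetadataScanSketch {

    // Hypothetical contract: each detector has a name and parses one script's text.
    interface ScriptDetector {
        String getName();
        Map<String, String> detect(String scriptContent);
    }

    // Illustrative heuristic only: matches assignments such as  section = "sports"
    static final class SectionDetector implements ScriptDetector {
        private static final Pattern SECTION =
                Pattern.compile("\\bsection\\s*=\\s*['\"](.*?)['\"]");

        @Override
        public String getName() {
            return "section";
        }

        @Override
        public Map<String, String> detect(String scriptContent) {
            Map<String, String> found = new LinkedHashMap<>();
            Matcher matcher = SECTION.matcher(scriptContent);
            if (matcher.find()) {
                found.put("section", matcher.group(1));
            }
            return found;
        }
    }

    // Runs every detector over every script and groups the hits by detector name.
    static Map<String, Map<String, String>> scan(List<String> scripts,
                                                 List<ScriptDetector> detectors) {
        Map<String, Map<String, String>> metadata = new LinkedHashMap<>();
        for (String script : scripts) {
            for (ScriptDetector detector : detectors) {
                Map<String, String> found = detector.detect(script);
                if (!found.isEmpty()) {
                    metadata.computeIfAbsent(detector.getName(), k -> new LinkedHashMap<>())
                            .putAll(found);
                }
            }
        }
        return metadata;
    }

    public static void main(String[] args) {
        // Hypothetical script bodies, as they might be collected from a page.
        List<String> scripts = Arrays.asList(
                "console.log('no metadata here');",
                "var section = \"sports\";");
        List<ScriptDetector> detectors =
                Arrays.<ScriptDetector>asList(new SectionDetector());
        System.out.println(scan(scripts, detectors));
    }
}
```

Scripts that match no heuristic contribute nothing, so only detectors that actually find something appear in the result.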
Example
The following is an example of extracting custom dimensions for Google Analytics:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DefaultCustomDimensionDetector extends AbstractCustomDimensionMetadataDetector {

    // Matches calls such as ga('set', 'dimension1', 'value') or _gaTracker("set", "dimension2", "value").
    // (?:ga|_gaTracker) is a non-capturing alternation; square brackets would match a single character only.
    private static final Pattern SCRIPT_INFO_ELEMENT = Pattern.compile("(?:ga|_gaTracker)\\(['\"]set['\"],\\s?['\"](dimension\\d+)['\"],\\s?['\"](.*?)['\"]", Pattern.CASE_INSENSITIVE);

    @Override
    protected CustomDimension getCustomDimensions(String content) {
        CustomDimension customDimension = new CustomDimension();
        Matcher matcher = SCRIPT_INFO_ELEMENT.matcher(content);
        // Group 1 is the dimension name, group 2 is its value.
        while (matcher.find()) {
            customDimension.add(matcher.group(1), matcher.group(2));
        }
        return customDimension;
    }

    @Override
    public String getName() {
        return "defaultGACustomDimension";
    }
}
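To see what the expression captures, the following standalone sketch runs it against a sample script outside of Marfeel. It assumes the pattern is meant to match either the `ga` or the `_gaTracker` function name, so it uses the non-capturing alternation `(?:ga|_gaTracker)`. The class name `CustomDimensionRegexDemo`, the `extract` helper, and the sample script are hypothetical, and a plain `Map` stands in for `CustomDimension`.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CustomDimensionRegexDemo {

    // Tracker name as a non-capturing alternation, then the 'set' call
    // with the dimension name (group 1) and its value (group 2).
    private static final Pattern SCRIPT_INFO_ELEMENT = Pattern.compile(
            "(?:ga|_gaTracker)\\(['\"]set['\"],\\s?['\"](dimension\\d+)['\"],\\s?['\"](.*?)['\"]",
            Pattern.CASE_INSENSITIVE);

    // Collects every dimension name/value pair found in the script text.
    static Map<String, String> extract(String scriptContent) {
        Map<String, String> dimensions = new LinkedHashMap<>();
        Matcher matcher = SCRIPT_INFO_ELEMENT.matcher(scriptContent);
        while (matcher.find()) {
            dimensions.put(matcher.group(1), matcher.group(2));
        }
        return dimensions;
    }

    public static void main(String[] args) {
        // Hypothetical inline script as it might appear on a publisher's page.
        String script = "ga('create', 'UA-XXXXX-Y', 'auto');\n"
                + "ga('set', 'dimension1', 'sports');\n"
                + "_gaTracker(\"set\", \"dimension2\", \"John Doe\");\n"
                + "ga('send', 'pageview');";
        // prints {dimension1=sports, dimension2=John Doe}
        System.out.println(extract(script));
    }
}
```

Only the `set` calls that name a `dimensionN` key are captured; the `create` and `send` calls are ignored.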