Extraction Flags

The following flags control extraction and can be defined in a partner's definition.json.

allowJavascriptLoad 

Description: Use when section mosaic content is loaded with JavaScript. Add the three flags shown in the example one at a time, testing after each addition. If extraction still fails, add the next flag, until all three are enabled together.

Location: Configuration

Values: True/False

Example:

"allowJavascriptLoad":"true",
"alibabaWaitPageOpen":"true",
"allowExternalJavascriptLoad":"true",

appendRelevantTagsInfo

Description: This flag stores the relevantTags as if they were a meta extracted with a metadataProvider, by putting the relevant tags in customDimensionsMeta. When enabled, the HTMLDocumentProcessor.java sanitizes HTML.

Location: Configuration

Values: True/False

Example:

"appendRelevantTagsInfo" : BOOLEAN

articlePathLastParts

Description: When an article path is short, this flag defines the minimum number of words that the last part of the path must contain for the page to be considered an article.

Location: Configuration

Values: String (setting a numerical value)

Example:

"articlePathLastParts": "1"

articlePathParts

Description: Defines the patterns to use to check whether a page is an article or not.

Location: Configuration

Values: Pattern

Example:

"articlePathParts":"2"

blacklistedUrlPatterns

Description: Blacklists content whose URL matches the given patterns.

Location: Configuration

Values: String

Example:

"blacklistedUrlPatterns": "/*/loteria-grossa-cap-any.shtml**"

boilerEnableSecureConnections

Description: Enables secure connections in Boilerpipe for articles. This flag is required if the customer's site is served strictly over HTTPS.

Location: Configuration

Values: True/False

Example:

 "boilerEnableSecureConnections": "true"


boilerpipeFetcher

Description: Adds a custom RSS fetcher for the boilerpipe.

Location: Configuration

Values: String

Example:

"boilerpipeFetcher": "tenantRssFetcher"

boilerpipeIgnoreInlineImageDimensions

Description: When enabled, returns IGNORE_INLINE_DIMENSIONS_MEDIA_FILTER in the media filter within the image document SAX processor.

Location: Configuration

Values: True/False

Example:

"boilerpipeIgnoreInlineImageDimensions":"true"

cleanerFetcherBlacklist

Description: Defines the blacklist for the cleaner fetcher if it's selected as the boilerpipe fetcher.

Location: Configuration

Values: String

Example:

"cleanerFetcherBlacklist":".aside-inner, .block.comments"

copyFirstRowOnTableSplit

Description: When enabled, the first row of the table (rather than the header) is treated as table content.

Location: Configuration

Values: True/False

Example:

"copyFirstRowOnTableSplit": true

cronRefresh

Description: Defines how often sections are reloaded, as a cron expression such as one generated with CronMaker.

Location: Configuration

Values: String (cron expression)

Example:

"cronRefresh":"0 0/3 * 1/1 * ? *"

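The example value uses "?" in the day-of-week position, which suggests a Quartz-style cron expression with seven fields (seconds, minutes, hours, day of month, month, day of week, and an optional year). Assuming that layout, the following Python sketch labels each field of the example. The helper name describe_cron is hypothetical and is not part of Marfeel's code.

```python
# Field names of a Quartz-style cron expression (assumed layout; year is optional).
QUARTZ_FIELDS = [
    "seconds", "minutes", "hours",
    "day_of_month", "month", "day_of_week", "year",
]


def describe_cron(expression: str) -> dict:
    """Map each whitespace-separated part of the expression to its field name."""
    parts = expression.split()
    if not 6 <= len(parts) <= 7:
        raise ValueError("expected 6 or 7 fields, got %d" % len(parts))
    return dict(zip(QUARTZ_FIELDS, parts))


fields = describe_cron("0 0/3 * 1/1 * ? *")
print(fields["seconds"], fields["minutes"])
```

Read this way, the example reloads the section at second 0 of every third minute.
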
deactivated

Description: When set to true, this flag deactivates the Marfeel version and enables the classic version. It is mostly used by Marfeel's Monetization for internal investigations.

Values: True/False

Example:

"deactivated":true

defaultTopMediaMediaSelectorStrategy

Description: Selects the Top Media based on a specified option. The available options are included in MediaSelector.java in Gutenberg.

Location: Features

Values: String. The possible values for this feature are the following:

  • DETAIL_OR_HINT - This is the default value. With this strategy Marfeel first tries to extract the image from the customer's article details, before moving on to the section mosaic.
  • FORCE_DETAIL - The image is extracted from the article details.
  • FORCE_HINT - The image is extracted from the section mosaic.
  • META_OR_DETAILS - The meta image is extracted. If not there, the image is extracted from the article details.
  • HINT_OR_META - The image is extracted from the section mosaic. If not there, the meta image is extracted.
  • DETAIL_OR_HINT_OR_META - The image is extracted first from the article details, then the mosaic, and then the meta.

Example:

"defaultTopMediaMediaSelectorStrategy":"DETAIL_OR_HINT"

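The fallback order described by the strategy values above can be sketched in Python as follows. This is only an illustration of the documented order, not the logic in MediaSelector.java: the source names "detail", "hint", and "meta", the candidates dictionary, and the function select_top_media are all assumptions made for the example.

```python
from typing import Dict, Optional

# Lookup order per strategy, transcribed from the descriptions above.
# "detail" = article details, "hint" = section mosaic, "meta" = meta image.
STRATEGY_ORDER = {
    "DETAIL_OR_HINT": ["detail", "hint"],
    "FORCE_DETAIL": ["detail"],
    "FORCE_HINT": ["hint"],
    "META_OR_DETAILS": ["meta", "detail"],
    "HINT_OR_META": ["hint", "meta"],
    "DETAIL_OR_HINT_OR_META": ["detail", "hint", "meta"],
}


def select_top_media(strategy: str, candidates: Dict[str, str]) -> Optional[str]:
    """Return the first available image following the strategy's order."""
    for source in STRATEGY_ORDER[strategy]:
        image = candidates.get(source)
        if image:
            return image
    return None


# Only the section mosaic and the meta tag provide an image here.
found = {"hint": "mosaic.jpg", "meta": "og.jpg"}
print(select_top_media("DETAIL_OR_HINT", found))
print(select_top_media("FORCE_DETAIL", found))
```
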
defaultMediaSelectorStrategy

Description: Defines how the image used in the section mosaic is selected.

Location: Features

Values: String. 

detailItemsProcessor

Description: Use when a webpage is slow or there is a lot of content to extract. Throttling the bandwidth makes the process more robust.

Location: Configuration

Values: String (throttledDetailItemsProcessor)

Example:

"detailItemsProcessor":"throttledDetailItemsProcessor"

disableAMPCacheForImages

Description: If set to true, the src of the image will be AMP_CACHE_URL_imageURL where the AMP_CACHE_URL is https://cdn.amproject.org/i/. 

Location: Features

Values: True/False

Example:

 "disableAMPCacheForImages": true

disableMultipageTitleSelectorForFirstPage

Description: Disables the multipage title selector for the first page of multipage articles. Add this flag in the configuration module of the tenant's definition.json.

Location: Configuration

Values: True/False

Example:

"disableMultipageTitleSelectorForFirstPage": "true"

disablePhantomDiskCache

Description: When enabled, disables the disk cache when using PhantomJS.

Location: Configuration

Values: True/False

Example:

"disablePhantomDiskCache": "true"

disableProxyScripts

Description: When set to true, scripts do not go through the cache.

Location: Features

Values: True/False

Example:

 "disableProxyScripts": true

disableSectionValidation

Description: When enabled, section validation is disabled, so a given section is not invalidated.

Location: Configuration

Values: True/False

Example:

"disableSectionValidation": "true"

 

enableUnsecureMedia

Description: By default, Marfeel ensures that all media is loaded in HTTPS with src.replace(/^http:/, 'https:'). If this flag is set to true, Marfeel leaves HTTP instead.

Location: Features 

Values: True/False 

Example:

"enableUnsecureMedia": true

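As a rough Python equivalent of the src.replace(/^http:/, 'https:') expression quoted above, the sketch below shows the effect of the flag on a media URL. The function name normalize_media_src and its enable_unsecure_media parameter are hypothetical and only mirror the flag for illustration.

```python
import re


def normalize_media_src(src: str, enable_unsecure_media: bool = False) -> str:
    """Upgrade a leading http: scheme to https: unless the flag is enabled."""
    if enable_unsecure_media:
        # Flag enabled: the original URL is left untouched.
        return src
    # Only a scheme at the very start of the URL is rewritten.
    return re.sub(r"^http:", "https:", src)


print(normalize_media_src("http://example.com/video.mp4"))
print(normalize_media_src("http://example.com/video.mp4", enable_unsecure_media=True))
```
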
 

extractImagesFromNoScript

Description: Enables the extraction of images located inside noscript tags.

Location: Configuration

Values: True/False

Example:

 "extractImagesFromNoScript": "true"

 

extractionQueryParams

Description: This flag adds parameters to the extraction query.

Location: Configuration

Values: String 

Example:

"extractionQueryParams": "mrfCacheBuster={timestamp}"

fbInstantUseTagAsKicker

Description: For Facebook Instant Articles, uses the given meta tag (for example, "article:section") as the kicker in header.jsp.

Location: Features 

Values: String

Example: 

 "fbInstantUseTagAsKicker": "article:section"

fbInstantUseTagAsSubtitle

Description: For Facebook Instant Articles, sets the subtitle to the value taken from the specified meta.

Location: Features

Values: String

Example:

"fbInstantUseTagAsSubtitle": "og:title"

feedRipper

Description: Replaces the whiteCollar source with an RSS feed.

URIs are added as usual in definition.json and must be in XML format. By default, avoid using RSS feeds on GoLives.

Location: Configuration

Values: String

Example: 

"feedRipper":"rssRipper"

galleryBlackList 

Description: Prevents an image from being processed as an image, so that it is treated as part of the article's textual content instead. This is especially useful for images under tags such as buttons.

Location: Configuration

Values: String (query selector)

Example:

"galleryBlackList":".author img,[src*='gravatar']"

greedyWhitelist

Description: Whitelists all the children of the elements being whitelisted as well.

Location: Configuration

Values: True/False

Example:

"greedyWhitelist":"true"

ignoreSSLErrors

Description: When enabled, this flag ignores SSL errors on the PhantomJS command.

Location: Configuration

Values: True/False

Example:

"ignoreSSLErrors":true

ignoreWidgetItemsTablet

Description: In the L (tablet) version, this flag removes WidgetItems (items).

Location: Features

Values: True/False

Example:

 "ignoreWidgetItemsTablet": true

 

imageResizer 

Description: This flag removes the mrf-detailsMedia and mrf-rDetailsMedia classes from an image and adds mrf-noResizeImage.

Location: Configuration 

Values: String (query selector)

Example:

"imageResizer":".journalist-photo"

imageRulerSizeAttribute 

Description: Defines the attributes of the <img> tag that Marfeel inspects to extract an image's width and height, so that images from the publisher's desktop site are sized correctly and rendered crisply in the Marfeel version. By default, Marfeel inspects data-width and data-height. Add this flag under configuration in the tenant's definition.json when the tenant uses custom size attributes for images.

Location: Configuration 

Values: String 

Example:

"imageRulerSizeAttribute": "data-width,data-height"

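To make the role of the two attribute names concrete, the following Python sketch reads the width and height of each <img> tag from the attributes named in the flag value. It uses only the standard html.parser module. The class ImageSizeReader is hypothetical and does not reflect how Marfeel's extractor is implemented.

```python
from html.parser import HTMLParser


class ImageSizeReader(HTMLParser):
    """Collect (width, height) pairs from the configured <img> attributes."""

    def __init__(self, size_attributes: str):
        super().__init__()
        # The flag value is "widthAttribute,heightAttribute".
        names = [name.strip() for name in size_attributes.split(",")]
        self.width_attr, self.height_attr = names
        self.sizes = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attributes = dict(attrs)
        width = attributes.get(self.width_attr)
        height = attributes.get(self.height_attr)
        # Images that lack either attribute are skipped in this sketch.
        if width and height:
            self.sizes.append((int(width), int(height)))


reader = ImageSizeReader("data-width,data-height")
reader.feed('<img src="a.jpg" data-width="640" data-height="360"><img src="b.jpg">')
print(reader.sizes)
```
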

imageSrcAttribute 

Description: Sometimes Marfeel needs to read image links from an attribute other than src, because src holds no link at extraction time (for example, on some sliders that use lazy loading).

Location: Configuration

Values: String (attribute name)

Example:

"imageSrcAttribute": "href"

includeParentHrefOnDetailsGallery

Description: If set to true, this flag adds the "data-parent-link" attribute to images. It is used in GalleriesDetector.java.

Location: Configuration

Values: True/False

Example:

"includeParentHrefOnDetailsGallery":"true"

inlineRelatedArticlesStrategy

Description: Defines the section from which inline related articles are selected.

Location: Features

Values: String. The value can be the name of any section for that tenant. The default is the section in which the article is located.

Example: If all the inline related articles should come from the politics section, it will resemble the following:

"inlineRelatedArticlesSections" : "politics"

itemExtractorType

Description: Chooses between premium (paid content) and boilerpipe extractors.

Location: Configuration

Values: String 

Example:

"itemExtractorType":"cincodiasGalleryExtractor"

jsoupImageSrcAttribute

Description: Same as imageSrcAttribute, except that it is used when extracting with jsoup instead of whiteCollar.

Location: Configuration

Values: String 

Example:

"jsoupImageSrcAttribute": "src"

maxConcurrentExtractionRequests

Description: Defines the maximum number of concurrent extractions of article details, in order to throttle the extraction.

Location: Configuration 

Values: Numerical 

Example:

"maxConcurrentExtractionRequests":1

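The general idea of capping concurrent requests can be sketched in Python with a bounded thread pool. This is an illustration of the throttling technique only, not Marfeel's implementation. The function extract_all and its extract callback are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def extract_all(urls: List[str], extract: Callable[[str], str],
                max_concurrent_requests: int = 1) -> List[str]:
    """Run extract(url) for every URL, never exceeding the concurrency cap."""
    # The pool size is the cap: at most this many extractions run at once.
    with ThreadPoolExecutor(max_workers=max_concurrent_requests) as pool:
        # map() returns results in the same order as the input URLs.
        return list(pool.map(extract, urls))


# A stand-in extractor; a real one would fetch and parse the article.
results = extract_all(["/a", "/b", "/c"], lambda url: "details of " + url, 1)
print(results)
```

With a cap of 1, as in the example value, the extractions run strictly one after another.
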
maximumNagiosAlert

Description: Defines the maximum Nagios alert level (either Warning or Critical).

Location: Configuration

Values: String (WARNING or CRITICAL)

Example:

 "maximumNagiosAlert":"WARNING"

metaDataDetector

Description: Defines a string of metadata providers separated by commas. In previous versions, the custom metadata extractors were implemented in Gutenberg and included using this tag, but currently Marfeel implements them in the tenant folder using Nashorn, to avoid changing Gutenberg.

Location: Configuration

Values: String

Example:

 "metaDataDetector":"defaultGACustomVariables"

minImageSize

Description: Defines the minimum height and width used to filter images to keep in the boilerpipe (MinSizeFilter.java).

Location: Configuration

Values: Numerical 

Example:

"minImageSize":"75"

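A minimum-size filter of this kind can be sketched in Python as shown below. The sketch assumes that an image is kept only when both its width and its height reach the threshold. The actual rule in MinSizeFilter.java may differ, and the function name and the image dictionaries are hypothetical.

```python
from typing import Dict, List


def filter_by_min_size(images: List[Dict], min_image_size: str) -> List[Dict]:
    """Keep images whose width and height both reach the minimum (assumed rule)."""
    minimum = int(min_image_size)  # the flag value is a string such as "75"
    return [
        image for image in images
        if image["width"] >= minimum and image["height"] >= minimum
    ]


images = [
    {"src": "photo.jpg", "width": 640, "height": 360},
    {"src": "icon.png", "width": 32, "height": 32},
    {"src": "banner.png", "width": 300, "height": 50},
]
kept = filter_by_min_size(images, "75")
print([image["src"] for image in kept])
```
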
 

minWordsToConsiderFar

Description: Defines the minimum number of words required for an image in the article body that is used as top media to also be displayed within the body of the text.

Location: Features

Values: Numerical

Example:

"minWordsToConsiderFar": "300"

 

multipageBackwardsGenerator

Description: Creates multipage articles, starting from the last page to the first one.

Location: Configuration

Values: String (query selector for the multipages)

Example:

"multipageBackwardsGenerator": ".carousel.slide .item"

 

multipageBackwardsUriGenerator

Description: Defines the backwards URI generator to use. The value must match one of the implemented generators.

Location: Configuration 

Values: String

Example:

"multipageBackwardsUriGenerator": "bolavipSlidesUriGenerator"

 

multipageGenerator

Description: Defines the query selector used by the multipage generator for the tenant.

Location: Configuration

Values: String (query selector)

Example:

"multipageGenerator":".md-item-media,.swiper-slide"

 

multipageTitleSelector

Description: Defines the query selector for the multipage title.

Location: Configuration

Values: String (query selector)

Example:

"multipageTitleSelector": ".titleRanking"

multipageUriGenerator 

Description: By default, Marfeel uses IDUriGenerator as the multipage generator. This flag defines a different URI generator according to the string entered. The selection of the generator is completed in Gutenberg, in the UriGeneratorFactory.java class.

Location: Configuration

Values: String

Example:

"multipageUriGenerator":"pageIndexUriGenerator"

nextArticlesInverseOrder

Description: Changes the default order of native ads and next articles (the default behavior is to display native ads and then next articles). When enabled, the flag displays next articles first and then native ads.

Location: Features

Values: True/False

Example:

"nextArticlesInverseOrder": true

nextArticlesStrategy

Description: Defines how the next articles are selected and filtered. 

Location: Configuration

Values: String. The possible values are the following:

  • NO_FILTER
  • NO_WIDGET
  • HAS_DETAILS
  • VALID_ITEM
  • HAS_VALID_ITEMS
  • WIDGET_ITEM

Example: If the nextArticles were to only use specific widget items, it would resemble the following:

"nextArticlesStrategy" : "WIDGET_ITEM,envivoIframe"

nextPageLimit

Description: Defines the maximum number of next pages to be extracted. 

Location: Configuration

Values: Numerical

Example:

"nextPageLimit":"100"

 

nextPageUriBlacklist

Description: Defines a blacklist for URIs.

Location: Configuration

Values: String

Example: 

"nextPageUriBlacklist":"/ad"

notSelectableImages

Description: Defines the images not to be used as Top Media (for example images used in a photoSlider or avatars for authors).

Location: Configuration

Values: String (Class) 

Example:

"notSelectableImages":".rslides img"


quartzInvalidation

Description: When set to false, disables the invalidation scheduler (scheduleSectionInvalidationTasks).

Location: Configuration

Values: True/False

Example:

 "quartzInvalidation":"true"

requiresCompass

Description: Compiles the tenant's styles using Compass instead of SASS. Use it when an image tag cannot be processed.

Location: Configuration

Values: True/False

Example: 

"requiresCompass":true

respectTopMediaRatio

Description: Forces Top Media to have the same ratio as the original image.

Location: Features

Values: True/False

Example:

"respectTopMediaRatio": true

sanitizeContent

Description: When enabled, the HTMLDocumentProcessor.java sanitizes HTML.

Location: Configuration

Values: True/False

Example:

"sanitizeContent": "true"

useLegacyAlibaba

Description: When enabled, the old Alibaba version is used.

Location: Configuration

Values: True/False

Example:

 "useLegacyAlibaba": true

useSniVerifier

Description: This flag enables Server Name Indication (SNI) verification (that is, it uses the HTMLfetcher SNI verification).

Location: Configuration

Values: True/False

Example: 

 "useSniVerifier": true

 

userAgent 

Description: Used to browse the site's HTML as rendered on the specified device.

Location: Configuration

Values: String 

Possible options for whiteCollarUserAgent: "mobile" or "marfeel". If this flag isn't added, the crawler uses the default Chrome user agent: "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36".

Example: 

 "whiteCollarUserAgent":"mobile",
 "boilerpipeUserAgent":"Mozilla/5.0 (iPhone; CPU iPhone OS 8_0 like Mac OS X) AppleWebKit/600.1.3 (KHTML, like Gecko) Version/8.0 Mobile/12A4345d Safari/600.1.4"

userInterface

Description: Sets the configuration options from both XP and Gutenberg. 

Location: Configuration

Values: Object

Example: 

"userInterface":{
    "lang":"en",
    "themeName":"blogAds",
    "resourcesHost":"http://bc.marfeel.com",
    "googleAnalytics":"UA-12345678-1",
    "adex":"true",
    "features":{
        // check "features" entry
    }
}

resourcesHost sets the domain. Be sure to follow the guidelines below:

widgets

Description: Defines the widgets to be used.

Location: Features

Values: String

Example: 

"widgets":"mostRead"

validArticleQueryParams

Description: Some Marfeel customers have articles whose URLs are built with query parameters. To replicate these articles on the customer's Marfeel PWA, this flag must be added to definition.json with the query parameters that identify a valid article.

Location: Configuration

Values: String

Example: 

 "validArticleQueryParams":"&aid=,&MAID=,&MFID="