Extraction Flags
The following are all the flags that can be defined in a partner's definition.json regarding extraction.
showCategoriesInDetails
Description: When true, the tenant shows categories inside article details pages.
Location: Configuration
Values: true, false
Example:
"showCategoriesInDetails":"true"
detailsLinksTarget
Description: Sets the target of all article details links to the value of this flag.
Location: Configuration
Values: _self, _blank,...
Example:
"detailsLinksTarget":"_self"
queryParamsWhitelist
Description: Whitelists query params for articles that use them and need to be extracted by boiler. For example, in https://www.planet.fr/magazine-hilarant-ils-racontent-leurs-pires-anecdotes-dans-un-aeroport.1675963.6553.html?page=0%2C2 the query param is page.
Location: Configuration
Values: the query param itself
Example:
"queryParamsWhitelist": "page"
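As a rough illustration of what whitelisting a query param means, the sketch below keeps only whitelisted params in a URL and drops the rest. It assumes that non-whitelisted params are stripped before extraction, which the description above does not state. The function name keep_whitelisted_params and the example.com URL are hypothetical and used only for this example.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def keep_whitelisted_params(url, whitelist):
    """Return url with every query param removed except those in whitelist.

    Assumption: this mirrors how a whitelist could be applied; it is not
    taken from the actual extractor.
    """
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key in whitelist
    ]
    # urlencode re-escapes values, so "0,2" becomes "0%2C2" again.
    return urlunsplit(parts._replace(query=urlencode(kept)))


if __name__ == "__main__":
    print(keep_whitelisted_params(
        "https://example.com/a.html?page=0%2C2&utm_source=x", {"page"}
    ))
```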
allowJavascriptLoad
Description: For section mosaic content loaded with JavaScript, add these flags one at a time and check the result after each. If one is unsuccessful, add the next, until all three are used simultaneously.
Location: Configuration
Values: True/False
Example:
"allowJavascriptLoad":"true",
"alibabaWaitPageOpen":"true",
"allowExternalJavascriptLoad":"true",
appendRelevantTagsInfo
Description: Stores the relevantTags as if they were a meta extracted with a metadataProvider, by putting them in customDimensionsMeta. When enabled, the HTMLDocumentProcessor.java sanitizes HTML.
Location: Configuration
Values: True/False
Example:
"appendRelevantTagsInfo" : BOOLEAN
articlePathLastParts
Description: For short article paths, defines the minimum number of words in the last part of the path, which is used to check whether the page is an article.
Location: Configuration
Values: String (setting a numerical value)
Example:
"articlePathLastParts": "1"
articlePathParts
Description: Defines the patterns to use to check whether a page is an article or not.
Location: Configuration
Values: Pattern
Example:
"articlePathParts":"2"
blacklistedUrlPatterns
Description: Blacklists content based on URL patterns. IMPORTANT: Only the path is considered, not the domain or the protocol. Keep in mind that this string is a PATTERN.
Location: Configuration
Values: String
Example:
"blacklistedUrlPatterns": "/*/loteria-grossa-cap-any.shtml**"
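The path-only rule can be sketched as follows. The wildcard semantics here are an assumption made for illustration: "*" is taken to match within one path segment and "**" to match any characters. The actual pattern syntax is not documented above. The names pattern_to_regex and is_blacklisted, and the example hosts, are hypothetical.

```python
import re
from urllib.parse import urlsplit


def pattern_to_regex(pattern):
    """Translate a path pattern into a compiled regex.

    Assumption: "**" matches anything, "*" matches inside one path segment.
    """
    pieces = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**", i):
            pieces.append(".*")
            i += 2
        elif pattern[i] == "*":
            pieces.append("[^/]*")
            i += 1
        else:
            pieces.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(pieces))


def is_blacklisted(url, pattern):
    """Match the pattern against the URL path only, ignoring protocol and domain."""
    path = urlsplit(url).path
    return pattern_to_regex(pattern).fullmatch(path) is not None


if __name__ == "__main__":
    pattern = "/*/loteria-grossa-cap-any.shtml**"
    print(is_blacklisted("https://www.example.com/sorteos/loteria-grossa-cap-any.shtml", pattern))
```

Because only the path is compared, the same pattern applies regardless of whether the page is served over HTTP or HTTPS, or from a different subdomain.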
boilerEnableSecureConnections
Description: Enables secure connections on the Boilerpipe for articles. This flag must be used if the customer is strictly on the HTTPS protocol.
Location: Configuration
Values: True/False
Example:
"boilerEnableSecureConnections": "true"
boilerpipeFetcher
Description: Adds a custom RSS fetcher for the boilerpipe.
Location: Configuration
Values: String
Example:
"boilerpipeFetcher": "tenantRssFetcher"
boilerpipeIgnoreInlineImageDimensions
Description: When enabled, returns IGNORE_INLINE_DIMENSIONS_MEDIA_FILTER in the media filter within the image document SAX processor.
Location: Configuration
Values: True/False
Example:
"boilerpipeIgnoreInlineImageDimensions":"true"
cleanerFetcherBlacklist
Description: Defines the blacklist for the cleaner fetcher if it's selected as the boilerpipe fetcher.
Location: Configuration
Values: String
Example:
"cleanerFetcherBlacklist":".aside-inner, .block.comments"
copyFirstRowOnTableSplit
Description: When enabled, considers the first row of the table - and not the header - as table content.
Location: Configuration
Values: True/False
Example:
"copyFirstRowOnTableSplit": true
cronRefresh
Description: Defines the frequency of section reloads according to cronmaker.
Location: Configuration
Values: String (cron expression)
Example:
"cronRefresh":"0 0/3 * 1/1 * ? *"
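To make the example expression easier to read, the sketch below labels each field. It assumes the Quartz-style field order (seconds, minutes, hours, day of month, month, day of week, optional year), which is what cronmaker produces; under that reading, "0/3" in the minutes field means every 3 minutes starting at minute 0. The helper label_cron_fields is hypothetical.

```python
# Assumed Quartz-style field order; the year field is optional.
QUARTZ_FIELDS = (
    "seconds",
    "minutes",
    "hours",
    "day_of_month",
    "month",
    "day_of_week",
    "year",
)


def label_cron_fields(expression):
    """Split a cron expression on whitespace and pair each part with a field name."""
    parts = expression.split()
    if len(parts) not in (6, 7):
        raise ValueError("expected 6 or 7 space-separated fields")
    return dict(zip(QUARTZ_FIELDS, parts))


if __name__ == "__main__":
    for name, value in label_cron_fields("0 0/3 * 1/1 * ? *").items():
        print(f"{name}: {value}")
```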
deactivated
Description: If this flag is set to true, it deactivates the Marfeel version and enables the classic version. It is mostly used by Marfeel's Monetization for internal investigations.
Values: True/False
Example:
"deactivated":true
defaultTopMediaMediaSelectorStrategy
Description: Selects the Top Media based on a specified option. The available options are included in MediaSelector.java in Gutenberg.
Location: Features
Values: String. The possible values for this feature are the following:
- DETAIL_OR_HINT - This is the default value. With this strategy Marfeel first tries to extract the image from the customer's article details, before moving on to the section mosaic.
- FORCE_DETAIL - The image is extracted from the article details.
- FORCE_HINT - The image is extracted from the section mosaic.
- META_OR_DETAILS - The meta image is extracted. If not there, the image is extracted from the article details.
- HINT_OR_META - The image is extracted from the section mosaic. If not there, the meta image is extracted.
- DETAIL_OR_HINT_OR_META - The image is extracted first from the article details, then the mosaic, and then the meta.
Example:
"defaultTopMediaMediaSelectorStrategy":"DETAIL_OR_HINT"
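The strategies listed above can be read as ordered fallback chains. The sketch below is an illustration of that reading only, not the MediaSelector.java implementation: "detail" stands for the article details, "hint" for the section mosaic, and "meta" for the meta image. The names STRATEGY_ORDER and select_top_media are hypothetical.

```python
# Each strategy maps to the sources to try, in order (assumed from the list above).
STRATEGY_ORDER = {
    "DETAIL_OR_HINT": ("detail", "hint"),
    "FORCE_DETAIL": ("detail",),
    "FORCE_HINT": ("hint",),
    "META_OR_DETAILS": ("meta", "detail"),
    "HINT_OR_META": ("hint", "meta"),
    "DETAIL_OR_HINT_OR_META": ("detail", "hint", "meta"),
}


def select_top_media(strategy, candidates):
    """Return the first available image for the strategy, or None.

    candidates maps a source name ("detail", "hint", "meta") to an image URL.
    """
    for source in STRATEGY_ORDER[strategy]:
        image = candidates.get(source)
        if image:
            return image
    return None


if __name__ == "__main__":
    found = {"hint": "mosaic.jpg", "meta": "og.jpg"}
    print(select_top_media("DETAIL_OR_HINT", found))
```

With a FORCE_ strategy there is no fallback, so a missing image in the forced source yields no Top Media in this sketch.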
defaultMediaSelectorStrategy
Description: Defines how the image used in the section mosaic is selected.
Location: Features
Values: String.
detailItemsProcessor
Description: Used when a webpage is slow or there is a lot of content to extract. Throttling the bandwidth makes the process more persistent.
Location: Configuration
Values: String (throttledDetailItemsProcessor)
Example:
"detailItemsProcessor":"throttledDetailItemsProcessor"
disableAMPCacheForImages
Description: If set to true, the src of the image will be AMP_CACHE_URL_imageURL where the AMP_CACHE_URL is https://cdn.ampproject.org/i/.
Location: Features
Values: True/False
Example:
"disableAMPCacheForImages": true
disableMultipageTitleSelectorForFirstPage
Description: When enabled, disables the title selector for the first page of multipage articles.
Location: Configuration
Values: True/False
Example:
"disableMultipageTitleSelectorForFirstPage": "true"
disablePhantomDiskCache
Description: When enabled, disables the disk cache when using PhantomJS.
Location: Configuration
Values: True/False
Example:
"disablePhantomDiskCache": "true"
disableProxyScripts
Description: When set to true, scripts do not go through the cache.
Location: Features
Values: True/False
Example:
"disableProxyScripts": true
disableSectionValidation
Description: When enabled, this flag does not invalidate a given section.
Location: Configuration
Values: True/False
Example:
"disableSectionValidation": "true"
dynamicItemContentConfiguration
Description: When enabled, this flag extracts the specified content block from the DOM of the client. Later it can be consumed from any JSP that you specify.
Location: Configuration
Values: String of selector,widgetName pairs. Multiple pairs are separated by ;
Example:
"dynamicItemContentConfiguration": ".generic-widget > .discounts,dynamicContentWidget;.news-related,newsRelatedWidget"
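The value format shown in the example can be sketched as a small parser. This is only a reading of the example above, under the assumption that the widget name follows the last comma in each pair; it is not the code that consumes the flag. The function name parse_dynamic_item_content is hypothetical.

```python
def parse_dynamic_item_content(value):
    """Split the flag value into (selector, widget_name) pairs.

    Pairs are separated by ";". Within a pair, the text after the last
    comma is taken as the widget name (an assumption based on the example).
    """
    pairs = []
    for entry in value.split(";"):
        entry = entry.strip()
        if not entry:
            continue
        selector, widget_name = entry.rsplit(",", 1)
        pairs.append((selector.strip(), widget_name.strip()))
    return pairs


if __name__ == "__main__":
    flag = ".generic-widget > .discounts,dynamicContentWidget;.news-related,newsRelatedWidget"
    for selector, widget_name in parse_dynamic_item_content(flag):
        print(selector, "->", widget_name)
```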
enableUnsecureMedia
Description: By default, Marfeel ensures that all media is loaded in HTTPS with src.replace(/^http:/, 'https:'). If this flag is set to true, Marfeel leaves HTTP instead.
Location: Features
Values: True/False
Example:
"enableUnsecureMedia": true
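For readers less familiar with the JavaScript expression in the description, here is an equivalent sketch in Python. The function normalize_media_src and its enable_unsecure_media parameter are hypothetical names that stand in for the flag; the regular expression is the same one quoted above.

```python
import re


def normalize_media_src(src, enable_unsecure_media=False):
    """Upgrade a leading "http:" to "https:" unless the flag is enabled."""
    if enable_unsecure_media:
        # Flag set: leave the URL untouched, including plain HTTP.
        return src
    # Same effect as src.replace(/^http:/, 'https:') in JavaScript.
    return re.sub(r"^http:", "https:", src)


if __name__ == "__main__":
    print(normalize_media_src("http://example.com/photo.jpg"))
    print(normalize_media_src("http://example.com/photo.jpg", enable_unsecure_media=True))
```

Only a scheme at the very start of the string is rewritten, so an "http:" that appears later in the URL, for example inside a query string, is left alone.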
extractImagesFromNoScript
Description: Enables the extraction of images located inside noscript tags.
Location: Configuration
Values: True/False
Example:
"extractImagesFromNoScript": "true"
extractionQueryParams
Description: This flag adds parameters to the extraction query.
Location: Configuration
Values: String
Example:
"extractionQueryParams": "mrfCacheBuster={timestamp}"
mrfCacheBuster
Description: This flag adds mrfCacheBuster=${actualTimestamp} query parameter to the extraction query.
Location: Configuration
Values: boolean
Example:
"mrfCacheBuster": true
Note: mrfCacheBuster is a simplified way of using extractionQueryParams. Use mrfCacheBuster instead of extractionQueryParams when the timestamp param is included only to avoid cache issues.
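The effect of a cache-busting parameter on the extraction URL can be sketched as below. The timestamp format is an assumption (milliseconds since the epoch); the documentation above does not specify it. The function add_cache_buster and the example.com URLs are hypothetical.

```python
import time
from urllib.parse import urlsplit, urlunsplit


def add_cache_buster(url, timestamp=None):
    """Append mrfCacheBuster=<timestamp> to the URL's query string.

    Assumption: the timestamp is epoch milliseconds when not supplied.
    """
    if timestamp is None:
        timestamp = int(time.time() * 1000)
    parts = urlsplit(url)
    param = f"mrfCacheBuster={timestamp}"
    # Keep any existing query params and add the new one at the end.
    query = f"{parts.query}&{param}" if parts.query else param
    return urlunsplit(parts._replace(query=query))


if __name__ == "__main__":
    print(add_cache_buster("https://example.com/article.html?page=2"))
```

Since the value changes on every request, intermediate caches treat each extraction URL as new, which is the reason for including the timestamp.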
fbInstantUseTagAsKicker
Description: Defines the given tag as kicker in the header.jsp for Facebook Instant Articles, like "article:section"
Location: Features
Values: String
Example:
"fbInstantUseTagAsKicker": "article:section"
fbInstantUseTagAsSubtitle
Description: For Facebook Instant Articles, sets the subtitle to the one taken from the specified meta.
Location: Features
Values: String
Example:
"fbInstantUseTagAsSubtitle": "og:title"
feedRipper
Description: Replaces the whiteCollar source with an RSS feed.
URIs are added as usual in the definition.json. The URI must be in XML format. By default, avoid using RSS feeds on GoLives.
Location: Configuration
Values: String
Example:
"feedRipper":"rssRipper"
galleryBlackList
Description: Prevents an image from being processed as an image, so that it is treated as part of the article's textual content instead. This is especially useful for images under tags such as buttons.
Location: Configuration
Values: String (Class)
Example:
"galleryBlackList":".author img,[src*='gravatar']"
greedyWhitelist
Description: Whitelists all the children of the elements being whitelisted as well.
Location: Configuration
Values: True/False
Example:
"greedyWhitelist":"true"
ignoreSSLErrors
Description: When enabled, this flag ignores SSL errors on the PhantomJS command.
Location: Configuration
Values: True/False
Example:
"ignoreSSLErrors":true
ignoreWidgetItemsTablet
Description: When enabled, removes WidgetItems (items) in the L version.
Location: Features
Values: True/False
Example:
"ignoreWidgetItemsTablet": true
imageResizer
Description: This flag removes the mrf-detailsMedia and mrf-rDetailsMedia classes from an image and adds mrf-noResizeImage.
Location: Configuration
Values: String (query selector)
Example:
"imageResizer":".journalist-photo"
imageRulerSizeAttribute
Description: Defines the attributes to inspect within the <img> tag to extract the image's width and height, so that Marfeel extracts the correct image sizes from a publisher's desktop site and presents crisp images in the Marfeel version. By default, Marfeel inspects the image's data-width and data-height. Use this flag when the tenant uses custom size attributes for images.
Location: Configuration
Values: String
Example:
"imageRulerSizeAttribute": "data-width,data-height"
imageSrcAttribute
Description: Sometimes Marfeel needs to get image links from attributes other than src because there are no links at extraction time (for example, on some sliders using lazy loading).
Location: Configuration
Values: String (attribute name)
Example:
"imageSrcAttribute": "href"
imageSrcSetAttribute
Description: Used when the client has invalid srcset links but valid links inside another attribute, such as data-srcset.
Location: Configuration
Values: String (attribute name)
Example:
"imageSrcSetAttribute": "data-lazy-srcset"
includeParentHrefOnDetailsGallery
Description: If set to true, this flag adds the "data-parent-link" attribute to images. The attribute is used in GalleriesDetector.java.
Location: Configuration
Values: True/False
Example:
"includeParentHrefOnDetailsGallery":"true"
inlineRelatedArticlesStrategy
Description: Defines the section from which the inline related articles are selected.
Location: Features
Values: String. The value can be the name of any section for that tenant. The default value is the section in which the article is located.
Example: If all the inline related articles should come from the politics section, it will resemble the following:
"inlineRelatedArticlesSections" : "politics"
itemExtractorType
Description: Chooses between premium (paid content) and boilerpipe extractors.
Location: Configuration
Values: String
Example:
"itemExtractorType":"cincodiasGalleryExtractor"
jsoupImageSrcAttribute
Description: Same as the imageSrcAttribute, except it is used when extracting with jsoup instead of the whiteCollar.
Location: Configuration
Values: String
Example:
"jsoupImageSrcAttribute": "src"
maxConcurrentExtractionRequests
Description: Defines the maximum number of concurrent extractions of article details, in order to throttle the extraction.
Location: Configuration
Values: Numerical
Example:
"maxConcurrentExtractionRequests":1
maximumNagiosAlert
Description: Defines the maximum Nagios alert (either Warning or Critical).
Location: Configuration
Values: String (WARNING or CRITICAL)
Example:
"maximumNagiosAlert":"WARNING"
metaDataDetector
Description: Defines a string of metadata providers separated by commas. In previous versions, the custom metadata extractors were implemented in Gutenberg and included using this tag, but currently Marfeel implements them in the tenant folder using Nashorn, to avoid changing Gutenberg.
Location: Configuration
Values: String
Example:
"metaDataDetector":"defaultGACustomVariables"
minImageSize
Description: Defines the minimum height and width used to filter images to keep in the boilerpipe (MinSizeFilter.java).
Location: Configuration
Values: Numerical
Example:
"minImageSize":"75"
minWordsToConsiderFar
Description: The minimum number of words required for an image in the article body that is used as Top Media to also be displayed (duplicated) within the body of the text.
Location: Features
Values: Numerical
Example:
"minWordsToConsiderFar": "300"
multipageBackwardsGenerator
Description: Creates multipage articles, starting from the last page to the first one.
Location: Configuration
Values: String (query selector for the multipages)
Example:
"multipageBackwardsGenerator": ".carousel.slide .item"
multipageBackwardsUriGenerator
Description: Defines the URI generator to use. The value must match one of the implemented generators.
Location: Configuration
Values: String
Example:
"multipageBackwardsUriGenerator": "bolavipSlidesUriGenerator"
multipageGenerator
Description: Defines the query selector multipage generator for the tenant.
Location: Configuration
Values: String (query selector)
Example:
"multipageGenerator":".md-item-media,.swiper-slide"
multipageTitleSelector
Description: Defines the query selector for the multipage title.
Location: Configuration
Values: String (query selector)
Example:
"multipageTitleSelector": ".titleRanking"
multipageUriGenerator
Description: By default, Marfeel uses IDUriGenerator as the multipage generator. This flag defines a different URI generator according to the string entered. The selection of the generator is completed in Gutenberg, in the UriGeneratorFactory.java class.
Location: Configuration
Values: String
Example:
"multipageUriGenerator":"pageIndexUriGenerator"
nextArticlesInverseOrder
Description: Changes the default order of native ads and next articles (the default behavior is to display native ads and then next articles). When enabled, the flag displays next articles first and then native ads.
Location: Features
Values: True/False
Example:
"nextArticlesInverseOrder": true
nextArticlesStrategy
Description: Defines how the next articles are selected and filtered.
Location: Configuration
Values: String:
- NO_FILTER
- NO_WIDGET
- HAS_DETAILS
- VALID_ITEM
- HAS_VALID_ITEMS
- WIDGET_ITEM
Example: If the nextArticles were to only use specific widget items, it would resemble the following:
"nextArticlesStrategy" : "WIDGET_ITEM,envivoIframe"
nextPageLimit
Description: Defines the maximum number of next pages to be extracted.
Location: Configuration
Values: Numerical
Example:
"nextPageLimit":"100"
nextPageUriBlacklist
Description: Defines a blacklist for URIs.
Location: Configuration
Values: String
Example:
"nextPageUriBlacklist":"/ad"
notSelectableImages
Description: Defines the images not to be used as Top Media (for example images used in a photoSlider or avatars for authors).
Location: Configuration
Values: String (Class)
Example:
"notSelectableImages":".rslides img"
quartzInvalidation
Description: When set to false, disables the invalidation scheduler (scheduleSectionInvalidationTasks).
Location: Configuration
Values: True/False
Example:
"quartzInvalidation":"true"
requiresCompass
Description: Compiles a tenant's styles using Compass instead of SASS. Use it when an image tag cannot be processed.
Location: Configuration
Values: True/False
Example:
"requiresCompass":true
respectTopMediaRatio
Description: Forces Top Media to have the same ratio as the original image.
Location: Features
Values: True/False
Example:
"respectTopMediaRatio": true
sanitizeContent
Description: When enabled, the HTMLDocumentProcessor.java sanitizes HTML.
Location: Configuration
Values: True/False
Example:
"sanitizeContent": "true"
showClassicVersionInHeader
Description: When enabled, the classic version icon is shown at the top right of the header, replacing the timestamp.
Location: Features
Values: True/False (default: false)
Example:
"showClassicVersionInHeader": "true"
useLegacyAlibaba
Description: When enabled, the old Alibaba version is used.
Location: Configuration
Values: True/False
Example:
"useLegacyAlibaba": true
useSniVerifier
Description: This flag enables server name indication verifications (that is, it uses the HTMLfetcher SNI verification).
Location: Configuration
Values: True/False
Example:
"useSniVerifier": true
userAgent
Description: Used to browse the site's HTML as rendered on the specified device.
Location: Configuration
Values: String
Possible options for whiteCollarUserAgent: "mobile" or "marfeel". If this flag isn't added, the crawler uses the default Chrome one: "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36"
Example:
"whiteCollarUserAgent":"mobile",
"boilerpipeUserAgent":"Mozilla/5.0 (iPhone; CPU iPhone OS 8_0 like Mac OS X) AppleWebKit/600.1.3 (KHTML, like Gecko) Version/8.0 Mobile/12A4345d Safari/600.1.4"
userInterface
Description: Sets the configuration options from both XP and Gutenberg.
Location: Configuration
Values: Object
Example:
"userInterface":{
"lang":"en",
"themeName":"blogAds",
"resourcesHost":"http://bc.marfeel.com",
"googleAnalytics":"UA-12345678-1",
"adex":"true",
"features":{
// check "features" entry
}
}
resourcesHost sets the domain. Be sure to follow these guidelines:
- http://bc.marfeel.com/ for XP
- http://b.marfeel.com/ for Alice.
widgets
Description: Defines the widgets to be used.
Location: Features
Values: String
Example:
"widgets":"mostRead"
validArticleQueryParams
Description: Some Marfeel customers have articles that are built with query parameters. To replicate these articles on the customer's Marfeel PWA, this flag must be set in definition.json with the query parameters that identify a valid article.
Location: Configuration
Values: String
Example:
"validArticleQueryParams":"&aid=,&MAID=,&MFID="
imageCaptionFromAttributes
Description: Specifies the attribute name from the HTML element to be used for the image caption.
Location: Configuration
Values: String
Example:
"imageCaptionFromAttributes":"data-source-name"