Gutenberg

Environment Setup

In the Gutenberg project, open 'Edit Configurations' and set the following VM options:

-Duser.timezone=UTC
-DJDBC_CONNECTION_STRING=127.0.0.1:27017
-DPARAM1=/Users/userName/MarfeelXP/Tenants/vhosts\&/Users/userName/AliceTenants/Tenants/vhosts\&/Users/userName/DemoTenants/Tenants/vhosts
-DPARAM2=devOf
-Dspring.profiles.active=inMemoryQuartz,inMemoryJDBC,unsecure,testEmail,panoramix,alice
-Xms512m
-Xmx512m
-XX:PermSize=256m
-XX:MaxPermSize=512m

Remember to replace the user name in these paths with your own.
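A quick way to confirm that the -D options above actually reach the JVM is to read them back as system properties. The sketch below is a standalone helper, not part of Gutenberg; the class and method names are hypothetical, and only the property names come from the list above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Gutenberg: lists which expected -D options
// are not visible to the JVM as system properties.
final class VmOptionsCheck {

    // Property names taken from the VM options listed above.
    static final List<String> REQUIRED = Arrays.asList(
            "user.timezone",
            "JDBC_CONNECTION_STRING",
            "PARAM1",
            "PARAM2",
            "spring.profiles.active");

    // Returns the names from 'required' that are unset or blank.
    static List<String> missing(List<String> required) {
        List<String> result = new ArrayList<String>();
        for (String name : required) {
            String value = System.getProperty(name);
            if (value == null || value.trim().isEmpty()) {
                result.add(name);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> missing = missing(REQUIRED);
        if (missing.isEmpty()) {
            System.out.println("All expected VM options are set.");
        } else {
            System.out.println("Missing VM options: " + missing);
        }
    }
}
```

Run it with the same VM options as the Gutenberg configuration; any name it prints was not passed to the JVM.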

Freud

The analysis runs against the latest definition.json. Launch it once whiteCollar and definition.json are filled in.

http://localhost/hub/freud/TenantName/index.json?action=analyze

Open the link and log in; the URL changes to "http://localhost/hub/freud/TenantName/index/last.json". The report first shows status: "IN_PROGRESS". Wait until the analysis completes, then reload the page to see status: "DONE".

To analyze a single item instead: http://localhost/hub/freud/TenantName/index/item.json?action=analyze&uri=ARTICLE_URI

The report includes the following checks, in this order. A check that does not appear found nothing.

  • Extraction Errors
  • HTML Templates
  • Articles missing Title
  • Table Elements detector
  • Iframes
  • Video Providers
  • Article URI length: checks whether the URI is too short
  • ImageRatio detector: detects images with an uncommon ratio
  • Short articles: compares the number of shortArticles with the number of extractedArticles; if the percentage is above 35%, the check is added to the report
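The short-articles rule in the list above is simple arithmetic, so it can be sketched in a few lines. This is an illustration only: the class and method names are hypothetical and do not come from the Freud source, and treating "above 35%" as a strict comparison is an assumption.

```java
// Hypothetical sketch of the short-articles rule; names are illustrative and
// do not come from the Freud source.
final class ShortArticlesCheck {

    // Threshold stated in the report description: more than 35% short articles.
    static final double THRESHOLD_PERCENT = 35.0;

    // Percentage of extracted articles that are short; 0 when nothing was extracted.
    static double shortPercentage(int shortArticles, int extractedArticles) {
        if (extractedArticles <= 0) {
            return 0.0;
        }
        return 100.0 * shortArticles / extractedArticles;
    }

    // True when the percentage is strictly above the threshold (assumed strict).
    static boolean shouldReport(int shortArticles, int extractedArticles) {
        return shortPercentage(shortArticles, extractedArticles) > THRESHOLD_PERCENT;
    }

    public static void main(String[] args) {
        System.out.println(shouldReport(40, 100)); // 40% is above 35%: prints true
        System.out.println(shouldReport(35, 100)); // exactly 35% is not above: prints false
    }
}
```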

Iframe Element

IframeElement.java defines three types of iframes: Videos, ExternalMedia and Generic. Place the iframe in the group that best describes it.

  • Duplicate an existing line in the appropriate section and adapt its RegEx to the iframe's [src].
  • If it is of the 'initGenericMediaProviders()' type: param1 is the beginning of the iframe's URL; param2 is the name of the class that will be set on the Tenant's iframe.
  • Restart Tomcat.
  • Check that the iframe now appears.
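For the generic case, the two parameters described above amount to a lookup from the beginning of an iframe URL to a class name. The sketch below shows that idea in isolation. It does not reproduce IframeElement.java or initGenericMediaProviders(); the class, its methods and the example values are all hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: this does not reproduce IframeElement.java. It only shows
// the idea of mapping the beginning of an iframe URL to a class name.
final class GenericMediaProvidersSketch {

    // Insertion order is kept so that the first registered prefix wins.
    private final Map<String, String> providers = new LinkedHashMap<String, String>();

    // param1: beginning of the iframe URL; param2: class name to set on the iframe.
    void register(String urlPrefix, String className) {
        providers.put(urlPrefix, className);
    }

    // Returns the class name of the first registered prefix that matches, or null.
    String classNameFor(String src) {
        if (src == null) {
            return null;
        }
        for (Map.Entry<String, String> entry : providers.entrySet()) {
            if (src.startsWith(entry.getKey())) {
                return entry.getValue();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        GenericMediaProvidersSketch sketch = new GenericMediaProvidersSketch();
        // Example values are invented for illustration.
        sketch.register("https://maps.example.com/embed", "mapEmbed");
        System.out.println(sketch.classNameFor("https://maps.example.com/embed?q=bcn")); // prints mapEmbed
        System.out.println(sketch.classNameFor("https://other.example.com/widget"));     // prints null
    }
}
```

Trying the prefix against a few real iframe URLs this way, before restarting Tomcat, can save a restart cycle when the prefix is too narrow.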

Run Tests

  • GenerateTestFixtures.java:

    • saveBoilerpipeFixture(URL,URL without http://)
    • Right click on the GenerateTestFixtures tab -> Run
    • Do not commit this changed file to GitHub; revert it.
  • ImageDocumentSAXProcessorTestDataSet.java:

  • ImageDocumentSAXProcessorParameterizedTest.java:

    • Right click on ImageDocumentSAXProcessorParameterizedTest tab -> Run
  • Run Boilerpipe Tests

    • Right click on the MarfeelBoilerpipe folder (in Project) and Run all Tests

IML, XML & .idea files in Gutenberg

Files in the .idea folder of Gutenberg and MarfeelXP may be deleted after updates. To restore them, run in the terminal:

cd $MARFEELXP_HOME/Jinks/bin
chmod +x mrf-ideaconfig

Move to the Gutenberg folder:

mrf-ideaconfig -i

Move to the MarfeelXP folder:

mrf-ideaconfig -i

Next Page

  1. Find, in the page source code, the tag, id, class or URI pattern of the NextPage's .
  2. In HostNextPageDetector.java, add your new next-page detector as: NEXT_PAGE_DETECTORS.put("", new NextPageDetector[]{new ()});
  3. Find an existing detector whose logic matches yours, copy it, rename it and adapt its code to your needs. If it uses a RegEx, check the pattern.
  4. Check the sample article's extraction with ItemInvalidate: localhost//index/item.html?invalidate=3&uri=http://www.example.com/article

  5. Once the article's pages are extracted successfully, run the tests:

    GenerateTestFixtures.java:

    1. Modify the saveBoilerpipeFixture(,) line
    2. Right Click - Run
    3. Repeat the process for all the pages of the article
    4. Move Unstaged files to Default in Git Changes.
    5. Delete boilerpipe-files of pages 2, 3, 4...

    HTMLDocumentProcessorMultipagedDataSet.java: Add a line with the URL, description and folder name

    HTMLDocumentProcessorMultipageParameterizedTest.java: Right click - Execute Test
    MarfeelBoilerpipe folder (com.marfeel.boilerpipe): Right click - Run all Tests

    Do not commit the changes to GenerateTestFixtures.java.
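Step 3 asks you to check the RegEx pattern. One way to do that before touching Gutenberg is a throwaway class that runs the pattern against a few sample URIs. The sketch below assumes pages are addressed as .../page/N, which will differ per site. The pattern, class and method names are hypothetical and are not taken from HostNextPageDetector.java or the NextPageDetector type.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch for trying out a next-page URI pattern in isolation.
// It does not implement Gutenberg's NextPageDetector.
final class NextPagePatternSketch {

    // Assumed URI shape: http://www.example.com/article/page/2
    static final Pattern PAGE_URI = Pattern.compile("^(.+?)/page/(\\d+)/?$");

    // Returns the page number encoded in the URI, or 1 when there is no page suffix.
    static int pageNumber(String uri) {
        Matcher m = PAGE_URI.matcher(uri);
        return m.matches() ? Integer.parseInt(m.group(2)) : 1;
    }

    // Builds the URI of the following page under the same assumed pattern.
    static String nextPageUri(String uri) {
        Matcher m = PAGE_URI.matcher(uri);
        if (m.matches()) {
            return m.group(1) + "/page/" + (Integer.parseInt(m.group(2)) + 1);
        }
        // First page: drop a trailing slash, then append the page suffix.
        String base = uri.endsWith("/") ? uri.substring(0, uri.length() - 1) : uri;
        return base + "/page/2";
    }

    public static void main(String[] args) {
        System.out.println(pageNumber("http://www.example.com/article/page/2"));  // prints 2
        System.out.println(nextPageUri("http://www.example.com/article/page/2")); // prints http://www.example.com/article/page/3
        System.out.println(nextPageUri("http://www.example.com/article"));        // prints http://www.example.com/article/page/2
    }
}
```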

Start Environment

1 - Update Project, from the top menu: VCS -> Git -> Pull.
2 - Update Maven, from right side menu, Maven Projects -> Reimport All Maven Projects.
3 - In the terminal, from the Gutenberg folder, run:

mvn clean package -Dmaven.test.skip=true

4 - Launch Tomcat Server, from the top menu: Run -> Debug 'Tomcat 8'

Video Provider

Example PR: https://github.com/Marfeel/Gutenberg/pull/1686

  1. Find the video's source in the page source code.
  2. Place the video's iframe or HTML5 code in an .html file under XP/Tenants/vhosts/marfeel and check that it displays at: localhost.marfeel.com/statics/marfeel/file.html
  3. In DefinitionVideoDetector.java, add the new VideoProvider. If the src comes from a tag's attribute, add it to VIDEO_ATTR_DETECTORS.
  4. Copy an existing VideoProvider's files and customize them to your needs in: /Gutenberg/MarfeelBoilerpipe/src/main/java/com/marfeel/boilerpipe/filters/heuristics/video/impl/
  5. Restart Gutenberg, extract articles and check that the video is displayed in ItemInvalidate: localhost/TenantName/index/item.html?invalidate=3&uri=SampleArticle
  6. Add "videoProviders":"YourProvider" to the configuration in the Tenant's definition.json and extract articles. Once the article's pages are extracted successfully, run the tests:

GenerateTestFixtures.java:

  1. Modify the saveBoilerpipeFixture(,) line
  2. Right Click - Run

ImageDocumentSAXProcessorTestDataSet.java:

  1. Generate JSON: http://localhost/hub/item.json?invalidate=3&uri=
  2. Insert Test (Copy Previous - Paste & Adapt)
  3. Check detailsMedia in the generated JSON file and add a line with new Image(), new IframeVideo() or new GenericMedia() { Article as in GenerateTestFixtures.java, domain, ...}
  4. Fill new String[][]{{"customTagActions", ""}, {"videoProviders", "videoProviderName"}}

ImageDocumentSAXProcessorParameterizedTest.java:

  1. Right click - Run

MarfeelBoilerpipe folder:

  1. Right click - Run all Tests
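The ItemInvalidate checks used in this guide pass the article URI as a query parameter. If the article URI itself contains characters such as ? or &, an unencoded URL can be split at the wrong place by a standard query parser. The sketch below builds the URL with the uri value percent-encoded. Whether Gutenberg requires this encoding is not stated in this guide, so treat it as an assumption; the class and method names are hypothetical.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper for building an ItemInvalidate URL; not part of Gutenberg.
final class ItemInvalidateUrl {

    // Builds localhost/<tenant>/index/item.html?invalidate=3&uri=<encoded article URI>.
    static String build(String tenantName, String articleUri) {
        try {
            return "http://localhost/" + tenantName
                    + "/index/item.html?invalidate=3&uri="
                    + URLEncoder.encode(articleUri, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // The tenant name and article URI are placeholders.
        System.out.println(build("TenantName", "http://www.example.com/article?id=1&page=2"));
    }
}
```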