Adds a duration: meta word for videos, a=chris

Chris Pollett [2019-06-28]
Filename
src/configs/PublicHelpPages.php
src/data/public_default.db
src/library/CrawlConstants.php
src/library/PhraseParser.php
diff --git a/src/configs/PublicHelpPages.php b/src/configs/PublicHelpPages.php
index a52f53313..503acf070 100644
--- a/src/configs/PublicHelpPages.php
+++ b/src/configs/PublicHelpPages.php
@@ -2187,8 +2187,7 @@ changing the 04 above to 03, 02, 01 varies the group of cities. Most of the data
  Language: English
  Category: weather
  Channel: /<pre(?:.+?)>([^<]+)/m
- Item: /
-/
+ Item: /\n/
  Title: /^(.+?)\s\s\s+/
  Description: /\s\s\s+(.+?)$/
  Link: http://www.weather.gov/
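
Roughly, Yioop applies these patterns in stages: the Channel regex isolates the region of the page holding the items, the Item regex splits that region into individual items, and the Title and Description regexes are then matched against each item. Below is a minimal sketch of that pipeline in PHP, combining the patterns from the weather example above (the parse_regex_feed function is hypothetical, not Yioop's actual implementation):
<pre>
// Illustrative sketch only: combines the Channel, Item, Title, and
// Description patterns from the weather example above.
function parse_regex_feed($page)
{
    $items = [];
    // Channel pattern captures the text block containing the items
    if (!preg_match('/<pre(?:.+?)>([^<]+)/m', $page, $channel)) {
        return $items;
    }
    // Item pattern (here /\n/) splits that block, one item per line
    foreach (preg_split('/\n/', $channel[1]) as $line) {
        if (preg_match('/^(.+?)\s\s\s+/', $line, $title) &&
            preg_match('/\s\s\s+(.+?)$/', $line, $description)) {
            $items[] = ["title" => $title[1],
                "description" => $description[1],
                "link" => "http://www.weather.gov/"];
        }
    }
    return $items;
}
</pre>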
@@ -2202,16 +2201,15 @@ fixed site by directly entering a URL in the Link field.

 Not all feeds use the same tag to specify the image associated with a news item. The Image XPath allows you to specify, relative to a news item (either RSS or HTML), where an image thumbnail exists. If a site does not use such a thumbnail, one can prefix the path with ^ to give the path, relative to the root of the whole file, to where a thumbnail for the news source exists. Yioop automatically removes escaping from RSS containing escaped HTML when computing this. For example, the following works for the feed:
 <pre>
-  http://feeds.wired.com/wired/index
- //description/div[contains(@class,
-    "rss_thumbnail")]/img/@src
+  https://feeds.wired.com/wired/index
+  //description/div[contains(@class, "rss_thumbnail")]/img/@src
 </pre>
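
As a rough illustration of how such an Image XPath could be evaluated against a single news item, here is a sketch using PHP's DOMDocument and DOMXPath (the $item_description variable and the unescaping step are assumptions for the example; Yioop's actual handling may differ):
<pre>
// Sketch: find the thumbnail URL inside one RSS item's description.
// html_entity_decode mirrors the remark above that escaping is
// removed from RSS containing escaped HTML before evaluation.
$description = html_entity_decode($item_description);
$dom = new DOMDocument();
@$dom->loadHTML("<description>$description</description>");
$xpath = new DOMXPath($dom);
$nodes = $xpath->query(
    '//description/div[contains(@class, "rss_thumbnail")]/img/@src');
$thumb_url = ($nodes->length > 0) ? $nodes->item(0)->nodeValue : "";
</pre>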

 <br />

 A '''Feed Podcast source''' is an RSS or Atom source where each item contains a link to a podcast or video podcast. For example,
  http://feed.cnet.com/feed/podcast/all/hd.xml
-The '''Alternative Link Tag''' field is used to say the xpath within the feed item to the link for the audio or video file. For the CNet example, this is:
+The '''Alternative Link Tag''' field is used to give the XPath within the feed item to the link for the audio or video file. For the CNet example, this is:
  enclosure
 If it is blank, the default link tag is used. When run, the media updater job checks if any items in the feed are new. If so, it downloads them to the wiki resource folder of the wiki page provided in the '''Wiki Destination''' field. This page is given in the format GroupName@PageName. If you give just PageName, the Public group is assumed. The '''Expires''' field controls how long a feed item is kept before it is deleted.
 For example, if we wanted to download the popular Ted talk podcasts into the Ted subfolder of the resource folder of the Example Podcast wiki page of the Public group, where we have podcasts expire after 1 month, we could do:
@@ -2231,25 +2229,22 @@ Yioop supports the downloading of single video or audio file sources, as well as
 <br />

 A '''Scrape podcast source''' is like a '''Feed Podcast source''', but where one has an HTML or XML page which has a periodically updated link to a video or audio source. For example, it might be an evening news web site.
-The URL field should be the page with the periodically updated link. The '''Aux Url XPaths''' field, if not blank, should be a sequence of xpaths or regexes one per line. The first line will be applied to the page to obtain a next url to download. The next line's xpath or regex is applied to this file and so on. The final url generated should be to the HTML or XML page that contains the media source for that day. Finally, on the page for the given day, '''Download XPath''' should be the xpath of the url of the video or audio file to download.
-If a regex is used rather than an xpath, then the first capture group of the regex should give the url. A regex can be followed by json| to indicate the first capture group should be converted to a json object. To reference a path of through sub-objects of this object to a url. As an example, consider the following, which at some point, could download the Nightly News  Scrape Podcast to a wiki group:
+The URL field should be the page with the periodically updated link. The '''Aux Url XPaths''' field, if not blank, should be a sequence of XPaths or regexes, one per line. The first line will be applied to the page to obtain the next URL to download. The next line's XPath or regex is applied to the downloaded file, and so on. The final URL generated should be to the HTML or XML page that contains the media source for that day. Finally, on the page for the given day, '''Download XPath''' should be the XPath of the URL of the video or audio file to download.
+If a regex is used rather than an XPath, then the first capture group of the regex should give the URL. A regex can be followed by json| to indicate that the first capture group should be converted to a JSON object; json| can, in turn, be followed by a |-separated sequence of field names giving a path through sub-objects of this object to a URL. As an example, consider the following, which, at some point, could download the Daily News Scrape Podcast to a wiki group:

  Type: Scrape Podcast
- Name: Nightly News Podcast
- URL: https://www.somenetwork.com/nightly-news
+ Name: Daily News Podcast
+ URL: https://www.somenetwork.com/daily-news
  Language: English
  Aux Url XPaths:
- /(https\:\/\/cdn.somenetwork.com\/nightly-news-netcast\/video\/nightly-[^\"]+)\"/
- /window\.\_\_data\s*\=\s*([^
-]+\}\;)/json|video|current|0|publicUrl
- Download Xpath: //video[contains(@height,'540')]
+ /(https\:\/\/cdn.somenetwork.com\/daily-news\/video\/daily-[^\"]+)\"/
+ /window\.\_\_data\s*\=\s*([^\n]+\}\;)/json|video|current|0|publicUrl
+ Download XPath: //video[contains(@height,'540')]
  Wiki Destination: My Private Group@Podcasts/%Y-%m-%d.mp4

-The initial page to be download will be: https://www.somenetwork.com/nightly-news. On this page, we will use the first Aux Path to find a string in the page that matches /(https\:\/\/www.somenetwork.com\/nightly-news-netcast\/video\/nightly-[^\"]+)\"/. The contents matching between the parentheses is the first capture group and will be the next url to download. SO for example, one might get a url:
- https://cdn.somenetwork.com/nightly-news-netcast/video/nightly-safghdsjfg
-This url is then downloaded and a string matching  the pattern /window\.\_\_data\s*\=\s*([^
-]+\}\;)/ is found. The capture group portion of this string consists of what matches ([^
-]+\}\;) is then converted to a JSON object, becausee of the json| in the Aux Url XPath. From this JSON object, we look at the video field, then the current subfields, its 0 subfield, and finally, the publicUrl field. This is the url we download next. Lastly, the download Xpath is then used to actually get the final video link from this downloaded page.
+The initial page to be downloaded will be: https://www.somenetwork.com/daily-news. On this page, we will use the first Aux Url XPath to find a string in the page that matches /(https\:\/\/cdn.somenetwork.com\/daily-news\/video\/daily-[^\"]+)\"/. The contents matching between the parentheses form the first capture group and will be the next URL to download. So, for example, one might get the URL:
+ https://cdn.somenetwork.com/daily-news/video/daily-safghdsjfg
+This URL is then downloaded and a string matching the pattern /window\.\_\_data\s*\=\s*([^\n]+\}\;)/ is found. The capture group portion of this string, consisting of what matches ([^\n]+\}\;), is then converted to a JSON object, because of the json| in the Aux Url XPath. From this JSON object, we look at its video field, then that field's current subfield, then its 0 subfield, and finally, the publicUrl field. This is the URL we download next. Lastly, the '''Download XPath''' is then used to actually get the final video link from this downloaded page.
 Once this video is downloaded, it is stored in the Podcasts page's resource folder of the My Private Group wiki group in a file with a name in the format: %Y-%m-%d.mp4.
 EOD;
 $help_pages["en-US"]["Monetization"] = <<< EOD
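
The json| chain walked through above amounts to: apply the regex to the downloaded page, decode the first capture group as JSON, then follow the |-separated field names down to a URL. A minimal sketch of that mechanism under those assumptions ($page is presumed to hold the downloaded HTML; this is not Yioop's actual code):
<pre>
// Sketch of processing the second Aux Url XPath from the example:
// regex capture -> JSON decode -> walk video|current|0|publicUrl.
$aux = '/window\.\_\_data\s*\=\s*([^\n]+\}\;)/json|video|current|0|publicUrl';
list($pattern, $path) = explode('/json|', $aux);
if (preg_match($pattern . '/', $page, $match)) {
    $data = json_decode(rtrim($match[1], ';'), true);
    // follow the field names: video, then current, then 0, then publicUrl
    foreach (explode('|', $path) as $field) {
        $data = $data[$field];
    }
    $next_url = $data; // URL of the page holding the day's video
}
</pre>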
diff --git a/src/data/public_default.db b/src/data/public_default.db
index 6bdb59c5f..1b1f36b6a 100644
Binary files a/src/data/public_default.db and b/src/data/public_default.db differ
diff --git a/src/library/CrawlConstants.php b/src/library/CrawlConstants.php
index ebeb5dc09..52768208e 100755
--- a/src/library/CrawlConstants.php
+++ b/src/library/CrawlConstants.php
@@ -234,4 +234,5 @@ interface CrawlConstants
     const CHANNEL = 'eb';
     const THUMB_URL = 'ec';
     const IS_VR = 'ed';
+    const DURATION = 'ee';
 }
diff --git a/src/library/PhraseParser.php b/src/library/PhraseParser.php
index dea65a263..da892cb27 100755
--- a/src/library/PhraseParser.php
+++ b/src/library/PhraseParser.php
@@ -54,8 +54,8 @@ class PhraseParser
      * @var array
      */
     public static $meta_words_list = ['\-', 'class:', 'class-score:', 'code:',
-        'date:', 'dns:', 'elink:', 'filetype:', 'guid:', 'host:', 'i:',
-        'info:', 'index:', 'ip:', 'link:', 'modified:',
+        'date:', 'dns:', 'duration:', 'elink:', 'filetype:', 'guid:', 'host:',
+        'i:', 'info:', 'index:', 'ip:', 'link:', 'modified:',
         'lang:', 'media:', 'location:', 'numlinks:', 'os:',
         'path:', 'robot:', 'safe:', 'server:', 'site:', 'size:',
         'time:', 'u:', 'version:','weight:', 'w:'
@@ -1102,6 +1102,22 @@ class PhraseParser
         $meta_ids[] = 'media:all';
         if (!empty($site[CrawlConstants::IS_VIDEO])) {
             $meta_ids[] = "media:video";
+            if (!empty($site[CrawlConstants::DURATION])) {
+                $durations = [ 60 => "one-minute", 300 => "five-minute",
+                    600 => "ten-minute", 900 => 'fifteen-minute',
+                    1800 => 'half-hour', 3600 => 'hour', 7200 => 'two-hour'
+                ];
+                $duration = intval($site[CrawlConstants::DURATION]);
+                if ($duration > 0) {
+                    foreach ($durations as $time => $time_words) {
+                        if ($duration > $time) {
+                            $meta_ids[] = "duration:$time_words-plus";
+                        } else {
+                            $meta_ids[] = "duration:$time_words-minus";
+                        }
+                    }
+                }
+            }
         } else if (stripos($site[CrawlConstants::TYPE],
             "image") !== false) {
             if (!empty($site[CrawlConstants::WIDTH]) &&
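
The net effect of the PhraseParser change above is that a video's duration is compared against each fixed threshold, and one plus or minus meta word is injected per threshold. For instance, a 25 minute (1500 second) video would receive:
 duration:one-minute-plus duration:five-minute-plus duration:ten-minute-plus
 duration:fifteen-minute-plus duration:half-hour-minus duration:hour-minus
 duration:two-hour-minus
so a query like media:video duration:ten-minute-plus duration:hour-minus could then be used to find videos between ten minutes and an hour long.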