Hi @adbar, thanks for this awesome library.
While porting this library to Go, I noticed there are two Mediacloud tests that might be wrong:
"https://www.baltimoresun.com/opinion/columnists/zurawik/bs-ed-zontv-media-year-20201223-cnvrlhkhnrbihcxx6wxcxt2b7y-story.html#ed=rss_www.baltimoresun.com/arcio/rss/category/latest/": {
"file": "1805697156.html",
"date": "2020-12-23"
},
"https://elbaladtv.net/%d8%aa%d8%b1%d9%83%d9%89-%d8%a2%d9%84-%d8%a7%d9%84%d8%b4%d9%8a%d8%ae-%d8%a8%d8%b9%d8%af-%d8%a5%d8%b5%d8%a7%d8%a8%d8%a9-%d9%8a%d8%b3%d8%b1%d8%a7-%d8%a8%d9%83%d9%88%d8%b1%d9%88%d9%86%d8%a7-%d9%8a%d8%a7/": {
"file": "1806793639.html",
"date": "2020-12-25"
},
For baltimoresun, its JSON-LD contains the following snippet:
{
// ... omitted
"articleSection": "zurawik",
"dateCreated": "2020-12-22T01:06:41.361Z",
"datePublished": "2020-12-23T15:42:33.814Z",
"dateModified": "2020-12-23T15:42:34.197Z",
// ... omitted
}
From that snippet we can see that its creation date is 2020-12-22. Since we want the original date, I think the expected value should be that one instead of 2020-12-23?
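To make the comparison concrete, here is a minimal Go sketch that picks the earliest of the three JSON-LD timestamps. The `earliestDate` helper is a hypothetical name used only for illustration; it is not how the library or the Go port actually selects a date, and it assumes all inputs are valid RFC 3339 timestamps.

```go
package main

import (
	"fmt"
	"time"
)

// earliestDate is a hypothetical helper (not part of any library).
// It parses each RFC 3339 timestamp and returns the earliest one
// formatted as YYYY-MM-DD in UTC.
func earliestDate(timestamps ...string) (string, error) {
	if len(timestamps) == 0 {
		return "", fmt.Errorf("no timestamps given")
	}
	var earliest time.Time
	for i, ts := range timestamps {
		// time.Parse accepts fractional seconds even though the
		// RFC3339 layout does not list them explicitly.
		t, err := time.Parse(time.RFC3339, ts)
		if err != nil {
			return "", fmt.Errorf("parse %q: %w", ts, err)
		}
		if i == 0 || t.Before(earliest) {
			earliest = t
		}
	}
	return earliest.UTC().Format("2006-01-02"), nil
}

func main() {
	date, err := earliestDate(
		"2020-12-22T01:06:41.361Z", // dateCreated
		"2020-12-23T15:42:33.814Z", // datePublished
		"2020-12-23T15:42:34.197Z", // dateModified
	)
	if err != nil {
		panic(err)
	}
	fmt.Println(date) // prints 2020-12-22
}
```

With the values from the snippet, `dateCreated` is the earliest of the three, which is why 2020-12-22 looks like the better expected value when the original date is requested.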
For elbaladtv.net, its JSON-LD contains the following snippet:
{
"@type": "WebPage",
"@id": "https://elbaladtv.net/%d8%aa%d8%b1%d9%83%d9%89-%d8%a2%d9%84-%d8%a7%d9%84%d8%b4%d9%8a%d8%ae-%d8%a8%d8%b9%d8%af-%d8%a5%d8%b5%d8%a7%d8%a8%d8%a9-%d9%8a%d8%b3%d8%b1%d8%a7-%d8%a8%d9%83%d9%88%d8%b1%d9%88%d9%86%d8%a7-%d9%8a%d8%a7/#webpage",
"url": "https://elbaladtv.net/%d8%aa%d8%b1%d9%83%d9%89-%d8%a2%d9%84-%d8%a7%d9%84%d8%b4%d9%8a%d8%ae-%d8%a8%d8%b9%d8%af-%d8%a5%d8%b5%d8%a7%d8%a8%d8%a9-%d9%8a%d8%b3%d8%b1%d8%a7-%d8%a8%d9%83%d9%88%d8%b1%d9%88%d9%86%d8%a7-%d9%8a%d8%a7/",
"name": "\u062a\u0631\u0643\u0649 \u0622\u0644 \u0627\u0644\u0634\u064a\u062e \u0628\u0639\u062f \u0625\u0635\u0627\u0628\u0629 \u064a\u0633\u0631\u0627 \u0628\u0643\u0648\u0631\u0648\u0646\u0627: \u064a\u0627\u0631\u0628 \u064a\u0631\u0641\u0639 \u0639\u0646\u0643 - \u0642\u0646\u0627\u0629 \u0635\u062f\u0649 \u0627\u0644\u0628\u0644\u062f",
"datePublished": "2020-12-25T01:59:50+02:00",
"dateModified": "2020-12-25T01:59:50+02:00",
"isPartOf": { "@id": "https://elbaladtv.net/#website" },
"primaryImageOfPage": {
"@id": "https://elbaladtv.net/%d8%aa%d8%b1%d9%83%d9%89-%d8%a2%d9%84-%d8%a7%d9%84%d8%b4%d9%8a%d8%ae-%d8%a8%d8%b9%d8%af-%d8%a5%d8%b5%d8%a7%d8%a8%d8%a9-%d9%8a%d8%b3%d8%b1%d8%a7-%d8%a8%d9%83%d9%88%d8%b1%d9%88%d9%86%d8%a7-%d9%8a%d8%a7/#primaryImage"
},
"inLanguage": "ar"
}
It also contains the following meta tag:
<meta property="article:published_time" content="2020-12-24T23:59:50+00:00">
From those two, we can see that the published times in the JSON-LD and the meta tag are actually the same instant, except that the former is expressed in UTC+2 while the latter is in UTC+0.
So, for the extraction result I think we should use 2020-12-24, since it uses UTC time instead of local time.
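The following Go sketch checks that claim. It parses both timestamps, confirms they denote the same instant, and shows how the calendar date changes depending on whether the original offset or UTC is used. The helper names `dates` and `sameInstant` are hypothetical and only serve this example; whether the extractor should normalize to UTC is exactly the open question here.

```go
package main

import (
	"fmt"
	"time"
)

const dayLayout = "2006-01-02"

// dates is a hypothetical helper. It returns the calendar date of ts
// as written (in its own UTC offset) and after conversion to UTC.
func dates(ts string) (asWritten, inUTC string, err error) {
	t, err := time.Parse(time.RFC3339, ts)
	if err != nil {
		return "", "", err
	}
	return t.Format(dayLayout), t.UTC().Format(dayLayout), nil
}

// sameInstant is a hypothetical helper. It reports whether two
// RFC 3339 timestamps refer to the same moment, ignoring the offset
// they are written in.
func sameInstant(a, b string) (bool, error) {
	ta, err := time.Parse(time.RFC3339, a)
	if err != nil {
		return false, err
	}
	tb, err := time.Parse(time.RFC3339, b)
	if err != nil {
		return false, err
	}
	return ta.Equal(tb), nil
}

func main() {
	jsonLD := "2020-12-25T01:59:50+02:00"  // datePublished in JSON-LD
	metaTag := "2020-12-24T23:59:50+00:00" // article:published_time

	same, err := sameInstant(jsonLD, metaTag)
	if err != nil {
		panic(err)
	}
	fmt.Println(same) // prints true

	asWritten, inUTC, err := dates(jsonLD)
	if err != nil {
		panic(err)
	}
	fmt.Println(asWritten, inUTC) // prints 2020-12-25 2020-12-24
}
```

So the current expected value of 2020-12-25 matches the JSON-LD timestamp only when its +02:00 offset is kept, while converting to UTC gives 2020-12-24, the same date as the meta tag.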