[mod] json_engine: add content_html_to_text and title_html_to_text

Some JSON APIs return HTML in either the title or the content.
This commit adds two new parameters to the json_engine:
content_html_to_text and title_html_to_text, False by default.

If True, searx.utils.html_to_text removes the HTML tags from the corresponding field.
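
A minimal sketch of the behaviour described above, not the actual json_engine code. It assumes a hypothetical helper strip_html standing in for searx.utils.html_to_text (built on the standard library so the snippet runs on its own) and a hypothetical build_result function; only the two flag names come from this commit.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes and ignores the tags around them."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(value):
    # Hypothetical stand-in for searx.utils.html_to_text.
    parser = _TextExtractor()
    parser.feed(value)
    parser.close()
    # Collapse runs of whitespace left behind by removed tags.
    return " ".join("".join(parser.parts).split())


def build_result(url, title, content,
                 title_html_to_text=False, content_html_to_text=False):
    # Hypothetical function: each flag is False by default, so engines
    # that do not set it keep their current output unchanged.
    if title_html_to_text:
        title = strip_html(title)
    if content_html_to_text:
        content = strip_html(content)
    return {"url": url, "title": title, "content": content}
```

With only title_html_to_text enabled, the title is converted to plain text while the content is passed through untouched, which is why each field has its own flag.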

Update the crossref, openairedatasets and openairepublications engines to use these parameters.
Alexandre Flament 2021-02-10 16:40:03 +01:00
parent 436d366448
commit ff84a1af35
2 changed files with 19 additions and 5 deletions

@@ -267,7 +267,9 @@ engines:
search_url : https://search.crossref.org/dois?q={query}&page={pageno}
url_query : doi
title_query : title
title_html_to_text: True
content_query : fullCitation
content_html_to_text: True
categories : science
shortcut : cr
about:
@@ -757,6 +759,7 @@ engines:
url_query : metadata/oaf:entity/oaf:result/children/instance/webresource/url/$
title_query : metadata/oaf:entity/oaf:result/title/$
content_query : metadata/oaf:entity/oaf:result/description/$
content_html_to_text: True
categories : science
shortcut : oad
timeout: 5.0
@@ -776,6 +779,7 @@ engines:
url_query : metadata/oaf:entity/oaf:result/children/instance/webresource/url/$
title_query : metadata/oaf:entity/oaf:result/title/$
content_query : metadata/oaf:entity/oaf:result/description/$
content_html_to_text: True
categories : science
shortcut : oap
timeout: 5.0