One reason for the often seen CAPTCHA of the startpage requests are the
incomplete requests SearXNG sends to startpage.com. To avoid CAPTCHA we need to
send a well formed HTTP POST request with a cookie, we need to form a request
that is identical to the request build by startpage.com itself:
- in the cookie the **region** is selected
- in the POST arguments the **language** is selected
Based on the *engine_properties* boilerplate, SearXNG's startpage engine now
implements a `_fetch_engine_properties()` function to fetch regions & languages
from startpage.com.
This patch is a complete new implementation of the request() function, reversed
engineered from the startpage.com page. The new implementation adds
- time-range support
- save-search support
to the startpage engine which has been missed in the past.
Closes: https://github.com/searxng/searxng/issues/1081
Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
Most engines response best results if a region is selected, most often a
language is also in the properties of a engine and sometimes the language
argument is just the language of the UI. Most often choosing a language has a
minor effect on the result list.
To summarize:
Some engines have language codes (e.g. `ca`) in their properties, some have
region codes (e.g. `ca-ES`), some have regions and languages in their properties
and other engine do not have any language or region support.
In the past we generalized *language* over all kind of engines without taking
into mind that most engines gave best result when there is a region selected.
This *language-centric* view in SearXNG is misleading when we need
region-codes to parameterize engine request!
This patch replaces the *language-centric* view by a "language / region" view.
Conclusions:
With regions we can't say any longer that a engine supports *this or that*
language, by example: when the user selects 'zh' and a engine supports only
region codes like 'zh-TW' or 'zh-CN' we do not what results the user expects /
similar with 'en' or 'fr when the engine needs a region tag.
- Since it is unclear what the user expects by his language selection, we can't
assert a property that says: "supports_selected_language"
The feature is replaced in the UI by the wider sense of "language_support",
what stands for:
The engine has some kind of language support, either
by a region tag or by a language tag.
- A list of "supported_languages" does not make sense when there are regions
responsible for the result of an engine.
The "supported_languages" has been removed from the /config URL
- The `has_language` test in the `searx/search/checker/impl.py` has been removed
since it does not cover engines with region support.
If there is a need for such a test we can implement new tests after all
engines with language (region) support has been moved to the *supported
properites* scheme (see searxng_extra/update/update_languages.py)
Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
This patch adds the boilerplate code, needed to fetch properties from engines.
In the past we only fetched *languages* but some engines need *regions* to
parameterize the engine request.
To fit into our *fetch language* procedures the boilerplate is implemented in
the `searxng_extra/update/update_languages.py` and the *engine_properties* are
stored along in the `searx/data/engines_languages.json`.
This implementation is downward compatible to the `_fetch_fetch_languages()`
infrastructure we have. If there comes the day we have all
`_fetch_fetch_languages()` implementations moved to `_fetch_engine_properties()`
implementations, we can rename the files and scripts.
The new type `engine_properties` is a dictionary with keys `languages` and
`regions`. The values are dictionaries to map from SearXNG's language & region
to option values the engine does use::
engine_properties = {
'type' : 'engine_properties', # <-- !!!
'regions': {
# 'ca-ES' : <engine's region name>
},
'languages': {
# 'ca' : <engine's language name>
},
}
Similar to the `supported_languages`, in the engine the properties are available
under the name `supported_properties`.
Initial we start with languages & regions, but in a wider sense the type is
named *engine properties*. Engines can store in whatever options they need and
may be in the future there is a need to fetch additional or complete different
properties.
Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
- fix the issue of fetching more the 7000 *languages*
- improve the request function and filter by language & country
- implement time_range_support & safesearch
- add more fields to the response from dailymotion (allow_embed, length)
- better clean up of HTML tags in the 'content' field.
This is more or less a complete rework based on the '/videos' API from [1].
This patch cleans up the language list in SearXNG that has been polluted by the
ISO-639-3 2 and 3 letter codes from dailymotion languages which have never been
used.
[1] https://developers.dailymotion.com/tools/
Closes: https://github.com/searxng/searxng/issues/1065
Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
There are several reasons why we should prefer markdown-it-py over mistletoe:
- Get identical rendering results in SearXNG's `/info` pages and the SearXNG's
project documentation which is build by Sphinx-doc.
In the Sphinx-doc we use the MyST parser to render Markdown and the MyST
parser itself is built on top of the markdown-it-py package.
- markdown-it-py has a typographer that supports *replacements*
and *smartquotes* (e.g. em-dash, copyright, ellipsis, ...) [1]
- markdown-it-py is much more flexible compared to mistletoe [2]
- markdown-it-py is the fastest CommonMark compliant parser in python [3]
[1] https://markdown-it-py.readthedocs.io/en/latest/using.html#typographic-components
[2] https://markdown-it-py.readthedocs.io/en/latest/plugins.html
[3] https://markdown-it-py.readthedocs.io/en/latest/other.html#performance
Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>