Many things have been changed since last review of this engine. This patch fix
xpath selectors, implements suggestion and is a complete review / rewrite of the
engine.
Signed-off-by: Markus Heiser <markus@darmarit.de>
When initing engines a "SearxEngineResponseException" is logged very verbose,
including full traceback information:
ERROR:searx.engines:yggtorrent engine: Fail to initialize
Traceback (most recent call last):
File "share/searx/searx/engines/__init__.py", line 293, in engine_init
init_fn(get_engine_from_settings(engine_name))
File "share/searx/searx/engines/yggtorrent.py", line 42, in init
resp = http_get(url, allow_redirects=False)
File "share/searx/searx/poolrequests.py", line 197, in get
return request('get', url, **kwargs)
File "share/searx/searx/poolrequests.py", line 190, in request
raise_for_httperror(response)
File "share/searx/searx/raise_for_httperror.py", line 60, in raise_for_httperror
raise_for_captcha(resp)
File "share/searx/searx/raise_for_httperror.py", line 43, in raise_for_captcha
raise_for_cloudflare_captcha(resp)
File "share/searx/searx/raise_for_httperror.py", line 30, in raise_for_cloudflare_captcha
raise SearxEngineCaptchaException(message='Cloudflare CAPTCHA', suspended_time=3600 * 24 * 15)
searx.exceptions.SearxEngineCaptchaException: Cloudflare CAPTCHA, suspended_time=1296000
For SearxEngineResponseException this is not needed. Those types of exceptions
can be a normal use case. E.g. for CAPTCHA errors like shown in the example
above. It should be enough to log a warning for such issues:
WARNING:searx.engines:yggtorrent engine: Fail to initialize // Cloudflare CAPTCHA, suspended_time=1296000
closes: #2612
Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
The old xpath configuration for google scholar did not work and is replaced by a
python implementation.
Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
- unittest2 is a backport of the new features added to the unittest testing
framework in Python 2.7
- unittest2 was only needed in py2 and can be dropped now
Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
Bing has a list of regions that it supports and some of these regions
may have more than one possible language.
In some cases, like Switzerland, these languages are always shown as
options, so there is no issue. But in other cases, like Andorra, Bing
will only show one language at the time, either the region's default or
the request's language if the latter is supported by that region.
For example, if the HTTP request is in French, Andorra will appear as
fr-AD but if the same page is requested in any other language Andorra
will appear as ca-AD.
This is specially a problem when Bing assumes that the request is in
English because it overrides enough language codes to make several major
languages like Arabic dissappear from the languages.py file.
To avoid that issue, I set the Accept-Language header to a language
that's only supported in one region to hopefully avoid these overrides.
use a sparql request on wikidata to get the list of currencies.
currencies.json contains the translation for all supported searx languages.
Supersede #993
At the moment videos without a description are not shown - setting
default content to "" fixes this.
Another current bug is that thumbnails are not displayed. This is caused
by a double slash in the url. For this every trailing slash is now
stripped (for backwards compatibility) and the API response is correctly
parsed.
* searx understand "!ddg !g time" as : send "!g time" to DDG
* !g a DDG bang for Google: DDG return a HTTP redirect to Google
This commit adds a the allows_redirect param not to follow HTTP redirect.
The DDG engine returns a empty result as before without HTTP redirect.
on some queries (like an IT error message), wikipedia returns an HTTP error 400.
this commit returns an empty result instead of showing an error to the user.
Some JSON API returns HTML in either in the HTML or the content.
This commit adds two new parameters to the json_engine:
content_html_to_text and title_html_to_text, False by default.
If True, then the searx.utils.html_to_text removes the HTML tags.
Update crossref, openairedatasets and openairepublications engines
The duckduckgo engine requires an additional request after the results have been sent.
This commit makes sure that the second request uses the same HTTPAdapter
= the same IP address, and the same proxy.
The new version of MetaGer needs to reload the reults (into a iframe) with a
unique tag (see HTML response below).
Implementing a dedicated metager-engine for searx makes no sense to me. The
great days of MetaGer seems to be ended. I remember the good old days this
project started in the 90's of the last century. But in the last few years it
becomes more and more crap. As the name suggested, MetaGer was made for
germans in the first place. They have added a english and spain translation but
the i18n is very poor compared to what searx offers.
It's a pity, lets drop MetaGer.
This is the first response, the id (b82679980656899ba5a17ffd02a56846) is unique
for each query:
$ curl "https://metager.org/meta/meta.ger3?eingabe=foo&submit-query=&focus=web"
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<link rel="stylesheet" href="/index.css?id=b82679980656899ba5a17ffd02a56846">
<script src="/index.js?id=b82679980656899ba5a17ffd02a56846"></script>
<title>foo - MetaGer</title>
<meta name="viewport" content="width=device-width, initial-scale=1, maximum-scale=1" />
</head>
<body>
<iframe id="mg-framed" src="https://metager.org/meta/meta.ger3?eingabe=foo&submit-query=&focus=web&mgv=b82679980656899ba5a17ffd02a56846" autofocus="true" onload="this.contentWindow.focus();"></iframe>
</body>
Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
Some of our interface locales include uppercase country codes,
which are separated by `_` instead of the more common `-`.
Also, a browser's `Accept-Language` header could be in lowercase.
This commit attempts to normalize those cases so a browser's
language+country codes can better match with our locales.
This solution assumes that our UI locales have nothing more than
language and optionally country. If we ever add a script specific
locale like `zh-Hant-TW` this would have to change to accomodate
that, but the idea would be pretty much the same as this fix.
The language_support variable is set to True by default,
and set to False in only 5 engines.
Except the documentation and the /config URL, this variable is not used.
This commit remove the variable definition in the engines, and
set value according to supported_languages length: False when the length is 0,
True otherwise.
Close#2485
Avoid SearxEngineXPathException errors when parsing non valid results::
.//div[@class="yuRUbf"]//a/@href index 0 not found
Traceback (most recent call last):
File "./searx/engines/google.py", line 274, in response
url = eval_xpath_getindex(result, href_xpath, 0)
File "./searx/searx/utils.py", line 608, in eval_xpath_getindex
raise SearxEngineXPathException(xpath_spec, 'index ' + str(index) + ' not found')
searx.exceptions.SearxEngineXPathException: .//div[@class="yuRUbf"]//a/@href index 0 not found
Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
BTW: fix indentation by 2 spaces
The additional tests has been commented out in the google engines to not release
any CAPTCHA issues.
Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
BTW: make the engines ready for search.checker:
- replace eval_xpath by eval_xpath_getindex and eval_xpath_list
- google_images: remove outer try/except block
Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
The 'video.html' template from the 'oscar' design supports replacement
for *author* and *length*. Google-videos does not have an author, alternatively
the publisher info from is used for the *author*.
Hint: these replacements are not supported by the 'simple' design.
Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
This revise is based on the methods developed in the revise of the google engine
(see commit 410c2f9).
Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
This revise is based on the methods developed in the revise of the google engine
(see commit 410c2f9).
Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
the query "time" is convinient because most of the search engine will return some results,
but some engines in the general category will return documentation about the HTML tags <time> or <input type="time">
Removes module searx/brand.py and creates a namespace at searx.brand.
This patch is a first 'proof of concept'. Later we can decide to remove the
brand namespace entirely or not.
Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
Without this commit the module searx checks the secret_key value.
With this commit, make docs, utils/standalone_searx.py,
utils/fetch_firefox_version.py works without SEARX_DEBUG=1
For reference see https://github.com/searx/searx/pull/2386
from_bang is True when the user query contains a bang.
In this case the category is also set to 'none'.
from_bang only usage was in searx.webadapter.parse_specific :
if from_bang is True, then the EngineRef category is ignored and force to 'none'.
This commit also removes the searx.webadapter.parse_sepecific function.
see searx.search.processors.abstract.EngineProcessor
First the method searx call the get_params method.
If the return value is not None, then the searx call the method search.
check HTTP response:
* detect some comme CAPTCHA challenge (no solving). In this case the engine is suspended for long a time.
* otherwise raise HTTPError as before
the check is done in poolrequests.py (was before in search.py).
update qwant, wikipedia, wikidata to use raise_for_httperror instead of raise_for_status
According to
820b468bfe/searx/engines/__init__.py (L87-L88)
an engine can have no category at all.
Without this commit, searx raise an exception in searx/results.py
Note: in this case, the engine is not shown in the preferences.
before commit 58d72f2, category was not set in xpath.py,
so searx/engines/__init__py was setting the category to ['general']
the commit 58d72f2 set the category to [] which is not replaced by searx/engines/__init__.py
consequence: the mojeek engine is hidden in the preferences.
this commit revert the xpath.py change.
close#2368
Add a new parameter "raise_for_status", set by default to True.
When True, any HTTP status code >= 300 raise an exception ( #2332 )
When False, the engine can manage the HTTP status code by itself.
- strip html tags and superfluous quotation marks from content
- remove not needed cookie from request
- remove superfluous imports
Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
Error pattern::
Engines cannot retrieve results:
digg (unexpected crash time data '2020-10-16T14:09:55Z' does not match format '%Y-%m-%d %H:%M:%S')
Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
recoll is a local search engine based on Xapian:
http://www.lesbonscomptes.com/recoll/
By itself recoll does not offer web or API access,
this can be achieved using recoll-webui:
https://framagit.org/medoc92/recollwebui.git
This engine uses a custom 'files' result template
set `base_url` to the location where recoll-webui can be reached
set `dl_prefix` to a location where the file hierarchy as indexed by recoll can be reached
set `search_dir` to the part of the indexed file hierarchy to be searched, use an empty string to search the entire search domain
This change is backward compatible with the existing configurations.
If a settings.yml loaded from an user defined location (SEARX_SETTINGS_PATH or /etc/searx/settings.yml),
then this settings can relied on the default settings.yml with this option:
user_default_settings:True
Devian's request and response forms has been changed.
- fixed title
- fixed time_range_dict to 'popular-*-***'
- use image from <noscript> if exists
- drop obsolete "http to https, remove domain sharding"
- use query URL https://www.deviantart.com/search/deviations?page=5&q=foo
- add searx/engines/deviantart.py to pylint check (test.pylint)
Error pattern::
There DEBUG:searx:result: invalid title: {'url': 'https://www.deviantart.com/ ...
Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
use
from searx.engines.duckduckgo import _fetch_supported_languages, supported_languages_url # NOQA
so it is possible to easily remove all unused import using autoflake:
autoflake --in-place --recursive --remove-all-unused-imports searx tests
* URL / : the index page displayed the selected or the default category.
* URL / : when the q parameter is set using the URL, the redirect includes the URL query.
* URL /search : an empty query doesn't raise an exception.
This makes it easier to separately handle search and index requests
from a web server or from a reverse proxy.
If a request to index contains a query, a permanent redirect HTTP response
is returned. This should give some level of backwards compatibility
for users that have set a searx instance in their browser's search bar.
Xpath engine and results template changed to account for the fact that
archive.org doesn't cache .onions, though some onion engines migth have
their own cache.
Disabled by default. Can be enabled by setting the SOCKS proxies to
wherever Tor is listening and setting using_tor_proxy as True.
Requires Tor and updating packages.
To avoid manually adding the timeout on each engine, you can set
extra_proxy_timeout to account for Tor's (or whatever proxy used) extra
time.
- remove paging support: a "vqd" parameter is required between each request. This parameter is uniq for each request
- update the URL (no redirect), use the POST method
- language support: works if there is no more than request per minute, otherwise it is ignored !
* Fix "?q=test&engines=wikipedia": fix exception
* Fix "?q=test&engines=wikipedia&categories=images": now the engines from images category are included.
* Fix parse_timeout: make sure a value is always returned
* Various typing fixes (searx.webadapter, searx.search.SearchQuery)
When the user add searx as a search engine, the browser loads the /opensearch.xml URL without the cookies.
Without the query parameters, the user preferences are ignored (method and autocomplete).
In addition, opensearch.xml is modified to support automatic updates,
see https://developer.mozilla.org/en-US/docs/Web/OpenSearch
Always call initialize engines except on the first run of werkzeug with the reload feature.
the reload feature is activated when:
* searx_debug is True (SEARX_DEBUG environment variable or settings.yml)
* FLASK_APP=searx/webapp.py FLASK_ENV=development flask run (see https://flask.palletsprojects.com/en/1.1.x/cli/ )
Fix SEARX_DEBUG=0 make docs
docs/admin/engines.rst : engines are initialized
See https://github.com/searx/searx/issues/2204#issuecomment-701373438
Since 1. October 2020 google has changed the 'class' attribute of the HTML
result page.
Fix the xpath expressions and ignore <div class="g" ../> sections which do not
match to title's xpath expression.
Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
requests 2.24.0 uses the ssl module except if it doesn't support SNI, in this case searx fallbacks to pyopenssl.
searx logs a critical message and exit if the ssl modules doesn't support SNI and pyOpenSSL is not installed.
searx logs a critical message and exit if the ssl version is older than 1.0.2.
in requirements.txt, pyopenssl is still required to install searx as a fallback.
was previously a Dict with two or three keys: name, category, from_bang
make clear that this is a engine reference (see tests/unit/test_search.py for example)
all variables using this class are renamed accordingly.
* Log each call to get_locale: display the URL, the locale and the source (browser, preferences, form).
* Rename _get_browser_language to _get_browser_or_settings_language to match the actual code.
AJAX requests send the X-Requested-With HTTP header,
so searx.webapp.autocompleter returns the results with the expected data format.
Related to #2127Close#2203
A new "base" engine called command is introduced. It is the foundation for all command line engines for now.
You can use this engine to create your own command line engine.
Add some engines (commented out to make sure no one enables anything accidentally):
* git grep: This engine lets you grep in the searx repo.
* locate: If locate is installed and initialized, you can search on the FS.
* find: You can find files with a specific name from where you started searx.
* pattern search in files: This engine utilizes the command fgrep.
* regex search in files: This engine runs `grep` to find a file based on its contents.