Some JSON API returns HTML in either in the HTML or the content.
This commit adds two new parameters to the json_engine:
content_html_to_text and title_html_to_text, False by default.
If True, then the searx.utils.html_to_text removes the HTML tags.
Update crossref, openairedatasets and openairepublications engines
The duckduckgo engine requires an additional request after the results have been sent.
This commit makes sure that the second request uses the same HTTPAdapter
= the same IP address, and the same proxy.
The new version of MetaGer needs to reload the reults (into a iframe) with a
unique tag (see HTML response below).
Implementing a dedicated metager-engine for searx makes no sense to me. The
great days of MetaGer seems to be ended. I remember the good old days this
project started in the 90's of the last century. But in the last few years it
becomes more and more crap. As the name suggested, MetaGer was made for
germans in the first place. They have added a english and spain translation but
the i18n is very poor compared to what searx offers.
It's a pity, lets drop MetaGer.
This is the first response, the id (b82679980656899ba5a17ffd02a56846) is unique
for each query:
$ curl "https://metager.org/meta/meta.ger3?eingabe=foo&submit-query=&focus=web"
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<link rel="stylesheet" href="/index.css?id=b82679980656899ba5a17ffd02a56846">
<script src="/index.js?id=b82679980656899ba5a17ffd02a56846"></script>
<title>foo - MetaGer</title>
<meta name="viewport" content="width=device-width, initial-scale=1, maximum-scale=1" />
</head>
<body>
<iframe id="mg-framed" src="https://metager.org/meta/meta.ger3?eingabe=foo&submit-query=&focus=web&mgv=b82679980656899ba5a17ffd02a56846" autofocus="true" onload="this.contentWindow.focus();"></iframe>
</body>
Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
Some of our interface locales include uppercase country codes,
which are separated by `_` instead of the more common `-`.
Also, a browser's `Accept-Language` header could be in lowercase.
This commit attempts to normalize those cases so a browser's
language+country codes can better match with our locales.
This solution assumes that our UI locales have nothing more than
language and optionally country. If we ever add a script specific
locale like `zh-Hant-TW` this would have to change to accomodate
that, but the idea would be pretty much the same as this fix.
The language_support variable is set to True by default,
and set to False in only 5 engines.
Except the documentation and the /config URL, this variable is not used.
This commit remove the variable definition in the engines, and
set value according to supported_languages length: False when the length is 0,
True otherwise.
Close#2485
Avoid SearxEngineXPathException errors when parsing non valid results::
.//div[@class="yuRUbf"]//a/@href index 0 not found
Traceback (most recent call last):
File "./searx/engines/google.py", line 274, in response
url = eval_xpath_getindex(result, href_xpath, 0)
File "./searx/searx/utils.py", line 608, in eval_xpath_getindex
raise SearxEngineXPathException(xpath_spec, 'index ' + str(index) + ' not found')
searx.exceptions.SearxEngineXPathException: .//div[@class="yuRUbf"]//a/@href index 0 not found
Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
BTW: fix indentation by 2 spaces
The additional tests has been commented out in the google engines to not release
any CAPTCHA issues.
Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
BTW: make the engines ready for search.checker:
- replace eval_xpath by eval_xpath_getindex and eval_xpath_list
- google_images: remove outer try/except block
Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
The 'video.html' template from the 'oscar' design supports replacement
for *author* and *length*. Google-videos does not have an author, alternatively
the publisher info from is used for the *author*.
Hint: these replacements are not supported by the 'simple' design.
Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
This revise is based on the methods developed in the revise of the google engine
(see commit 410c2f9).
Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
This revise is based on the methods developed in the revise of the google engine
(see commit 410c2f9).
Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
the query "time" is convinient because most of the search engine will return some results,
but some engines in the general category will return documentation about the HTML tags <time> or <input type="time">
Removes module searx/brand.py and creates a namespace at searx.brand.
This patch is a first 'proof of concept'. Later we can decide to remove the
brand namespace entirely or not.
Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
Without this commit the module searx checks the secret_key value.
With this commit, make docs, utils/standalone_searx.py,
utils/fetch_firefox_version.py works without SEARX_DEBUG=1
For reference see https://github.com/searx/searx/pull/2386
from_bang is True when the user query contains a bang.
In this case the category is also set to 'none'.
from_bang only usage was in searx.webadapter.parse_specific :
if from_bang is True, then the EngineRef category is ignored and force to 'none'.
This commit also removes the searx.webadapter.parse_sepecific function.
see searx.search.processors.abstract.EngineProcessor
First the method searx call the get_params method.
If the return value is not None, then the searx call the method search.
check HTTP response:
* detect some comme CAPTCHA challenge (no solving). In this case the engine is suspended for long a time.
* otherwise raise HTTPError as before
the check is done in poolrequests.py (was before in search.py).
update qwant, wikipedia, wikidata to use raise_for_httperror instead of raise_for_status
According to
820b468bfe/searx/engines/__init__.py (L87-L88)
an engine can have no category at all.
Without this commit, searx raise an exception in searx/results.py
Note: in this case, the engine is not shown in the preferences.
before commit 58d72f2, category was not set in xpath.py,
so searx/engines/__init__py was setting the category to ['general']
the commit 58d72f2 set the category to [] which is not replaced by searx/engines/__init__.py
consequence: the mojeek engine is hidden in the preferences.
this commit revert the xpath.py change.
close#2368
Add a new parameter "raise_for_status", set by default to True.
When True, any HTTP status code >= 300 raise an exception ( #2332 )
When False, the engine can manage the HTTP status code by itself.
- strip html tags and superfluous quotation marks from content
- remove not needed cookie from request
- remove superfluous imports
Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
Error pattern::
Engines cannot retrieve results:
digg (unexpected crash time data '2020-10-16T14:09:55Z' does not match format '%Y-%m-%d %H:%M:%S')
Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
recoll is a local search engine based on Xapian:
http://www.lesbonscomptes.com/recoll/
By itself recoll does not offer web or API access,
this can be achieved using recoll-webui:
https://framagit.org/medoc92/recollwebui.git
This engine uses a custom 'files' result template
set `base_url` to the location where recoll-webui can be reached
set `dl_prefix` to a location where the file hierarchy as indexed by recoll can be reached
set `search_dir` to the part of the indexed file hierarchy to be searched, use an empty string to search the entire search domain
This change is backward compatible with the existing configurations.
If a settings.yml loaded from an user defined location (SEARX_SETTINGS_PATH or /etc/searx/settings.yml),
then this settings can relied on the default settings.yml with this option:
user_default_settings:True
Devian's request and response forms has been changed.
- fixed title
- fixed time_range_dict to 'popular-*-***'
- use image from <noscript> if exists
- drop obsolete "http to https, remove domain sharding"
- use query URL https://www.deviantart.com/search/deviations?page=5&q=foo
- add searx/engines/deviantart.py to pylint check (test.pylint)
Error pattern::
There DEBUG:searx:result: invalid title: {'url': 'https://www.deviantart.com/ ...
Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
use
from searx.engines.duckduckgo import _fetch_supported_languages, supported_languages_url # NOQA
so it is possible to easily remove all unused import using autoflake:
autoflake --in-place --recursive --remove-all-unused-imports searx tests
* URL / : the index page displayed the selected or the default category.
* URL / : when the q parameter is set using the URL, the redirect includes the URL query.
* URL /search : an empty query doesn't raise an exception.
This makes it easier to separately handle search and index requests
from a web server or from a reverse proxy.
If a request to index contains a query, a permanent redirect HTTP response
is returned. This should give some level of backwards compatibility
for users that have set a searx instance in their browser's search bar.
Xpath engine and results template changed to account for the fact that
archive.org doesn't cache .onions, though some onion engines migth have
their own cache.
Disabled by default. Can be enabled by setting the SOCKS proxies to
wherever Tor is listening and setting using_tor_proxy as True.
Requires Tor and updating packages.
To avoid manually adding the timeout on each engine, you can set
extra_proxy_timeout to account for Tor's (or whatever proxy used) extra
time.
- remove paging support: a "vqd" parameter is required between each request. This parameter is uniq for each request
- update the URL (no redirect), use the POST method
- language support: works if there is no more than request per minute, otherwise it is ignored !
* Fix "?q=test&engines=wikipedia": fix exception
* Fix "?q=test&engines=wikipedia&categories=images": now the engines from images category are included.
* Fix parse_timeout: make sure a value is always returned
* Various typing fixes (searx.webadapter, searx.search.SearchQuery)
When the user add searx as a search engine, the browser loads the /opensearch.xml URL without the cookies.
Without the query parameters, the user preferences are ignored (method and autocomplete).
In addition, opensearch.xml is modified to support automatic updates,
see https://developer.mozilla.org/en-US/docs/Web/OpenSearch
Always call initialize engines except on the first run of werkzeug with the reload feature.
the reload feature is activated when:
* searx_debug is True (SEARX_DEBUG environment variable or settings.yml)
* FLASK_APP=searx/webapp.py FLASK_ENV=development flask run (see https://flask.palletsprojects.com/en/1.1.x/cli/ )
Fix SEARX_DEBUG=0 make docs
docs/admin/engines.rst : engines are initialized
See https://github.com/searx/searx/issues/2204#issuecomment-701373438
Since 1. October 2020 google has changed the 'class' attribute of the HTML
result page.
Fix the xpath expressions and ignore <div class="g" ../> sections which do not
match to title's xpath expression.
Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
requests 2.24.0 uses the ssl module except if it doesn't support SNI, in this case searx fallbacks to pyopenssl.
searx logs a critical message and exit if the ssl modules doesn't support SNI and pyOpenSSL is not installed.
searx logs a critical message and exit if the ssl version is older than 1.0.2.
in requirements.txt, pyopenssl is still required to install searx as a fallback.
was previously a Dict with two or three keys: name, category, from_bang
make clear that this is a engine reference (see tests/unit/test_search.py for example)
all variables using this class are renamed accordingly.
* Log each call to get_locale: display the URL, the locale and the source (browser, preferences, form).
* Rename _get_browser_language to _get_browser_or_settings_language to match the actual code.
AJAX requests send the X-Requested-With HTTP header,
so searx.webapp.autocompleter returns the results with the expected data format.
Related to #2127Close#2203
A new "base" engine called command is introduced. It is the foundation for all command line engines for now.
You can use this engine to create your own command line engine.
Add some engines (commented out to make sure no one enables anything accidentally):
* git grep: This engine lets you grep in the searx repo.
* locate: If locate is installed and initialized, you can search on the FS.
* find: You can find files with a specific name from where you started searx.
* pattern search in files: This engine utilizes the command fgrep.
* regex search in files: This engine runs `grep` to find a file based on its contents.
and some other exceptions:
* KeyboardInterrupt
* SystemExit
* RuntimeError
* SystemError
* ImportError: an engine with an unmet dependency will stop everything.
Sending queries through POST, while better for privacy, breaks functionality
with certain extensions (e.g. Firefox containers). Since Firefox does
not send cookies when requesting `/opensearch.xml`, users cannot easily
switch to GET on the client side unless they make a custom search
engine. This commit allows admins to modify the default method on their
side so they can set it to GET if needed.
Sending query params over GET seems to be the only way to be able to
enable autocomplete in the browser. This commit adds the necessary URL
formatting to opensearch.xml. In order to identify queries coming from
the URL bar (rather than an AJAX request), which requires a different
JSON format and MIME type, the request headers are checked for
"X-Requested-With: XMLHttpRequest" which is added by jQuery request.
- enabling HTTPS for sci-hub.tw by default
- making sci-hub the default DOI resolver as it has the largest collection of scientific articles.
- replaced doai.io with dissem.in, as it redirects to this new domain.
Co-authored-by: Aurora of Earth <auroraofearth@ya.ru>
* Made first attempt at the bangs redirects plugin.
* It redirects. But in a messy way via javascript.
* First version with custom plugin
* Added a help page and a operator to see all the bangs available.
* Changed to .format because of support
* Changed to .format because of support
* Removed : in params
* Fixed path to json file and changed bang operator
* Changed bang operator back to &
* Made first attempt at the bangs redirects plugin.
* It redirects. But in a messy way via javascript.
* First version with custom plugin
* Added a help page and a operator to see all the bangs available.
* Changed to .format because of support
* Changed to .format because of support
* Removed : in params
* Fixed path to json file and changed bang operator
* Changed bang operator back to &
* Refactored getting search query. Also changed bang operator to ! and is now working.
* Removed prints
* Removed temporary bangs_redirect.js file. Updated plugin documentation
* Added unit test for the bangs plugin
* Fixed a unit test and added 2 more for bangs plugin
* Changed back to default settings.yml
* Added myself to AUTHORS.rst
* Refacored working of custom plugin.
* Refactored _get_bangs_data from list to dict to improve search speed.
* Decoupled bangs plugin from webserver with redirect_url
* Refactored bangs unit tests
* Fixed unit test bangs. Removed dubbel parsing in bangs.py
* Removed a dumb print statement
* Refactored bangs plugin to core engine.
* Removed bangs plugin.
* Refactored external bangs unit tests from plugin to core.
* Removed custom_results/bangs documentation from plugins.rst
* Added newline in settings.yml so the PR stays clean.
* Changed searx/plugins/__init__.py back to the old file
* Removed newline search.py
* Refactored get_external_bang_operator from utils to external_bang.py
* Removed unnecessary import form test_plugins.py
* Removed _parseExternalBang and _isExternalBang from query.py
* Removed get_external_bang_operator since it was not necessary
* Simplified external_bang.py
* Simplified external_bang.py
* Moved external_bangs unit tests to test_webapp.py. Fixed return in search with external_bang
* Refactored query parsing to unicode to support python2
* Refactored query parsing to unicode to support python2
* Refactored bangs plugin to core engine.
* Refactored search parameter to search_query in external_bang.py
Previously only image/jpeg was not proxied.
This commit don't proxify all MIME types starting with "image/".
This is a quick fix for the PR #1985 : the google_image engine can returns some data URL.