# SPDX-License-Identifier: AGPL-3.0-or-later
# lint: pylint
# pylint: disable=missing-function-docstring
"""The XPath engine is a *generic* engine with which it is possible to configure
engines in the settings.

Here is a simple example of an XPath engine configured in the
:ref:`settings engine` section; for further reading see :ref:`engines-dev`.

.. code:: yaml

  - name : bitbucket
    engine : xpath
    paging : True
    search_url : https://bitbucket.org/repo/all/{pageno}?name={query}
    url_xpath : //article[@class="repo-summary"]//a[@class="repo-link"]/@href
    title_xpath : //article[@class="repo-summary"]//a[@class="repo-link"]
    content_xpath : //article[@class="repo-summary"]/p
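
The ``{pageno}`` replacement is calculated as ``(pageno - 1) * page_size +
first_page_num``.  As a sketch, a *hypothetical* site (``example.org`` and its
XPath selectors are assumptions made for illustration only) that expects a
result offset starting at 0 with 10 results per page might be configured like
this, so that page 1 requests offset 0, page 2 offset 10, and so on:

.. code:: yaml

  - name : example offset
    engine : xpath
    paging : True
    page_size : 10
    first_page_num : 0
    search_url : https://example.org/search?q={query}&start={pageno}
    url_xpath : //div[@class="result"]//a/@href
    title_xpath : //div[@class="result"]//a
    content_xpath : //div[@class="result"]/p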

"""

from urllib.parse import urlencode

from lxml import html
from searx.utils import extract_text, extract_url, eval_xpath, eval_xpath_list
from searx import logger

logger = logger.getChild('XPath engine')

search_url = None
"""
Search URL of the engine, replacements are:

``{query}``:
  Search terms from user.

``{pageno}``:
  Page number, if the engine supports paging (see :py:obj:`paging`).

"""

soft_max_redirects = 0
'''Maximum redirects, soft limit.  Record an error but don't stop the engine.'''

results_xpath = ''
'''XPath selector for the list of result items'''

url_xpath = None
'''XPath selector of result's ``url``.'''

content_xpath = None
'''XPath selector of result's ``content``.'''

title_xpath = None
'''XPath selector of result's ``title``.'''

thumbnail_xpath = False
'''XPath selector of result's ``img_src``.'''

suggestion_xpath = ''
'''XPath selector of result's ``suggestion``.'''

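cached_xpath = ''
'''XPath selector of result's cached-page reference.  Its text is appended to
:py:obj:`cached_url` to build the ``cached_url`` of the result.'''

cached_url = ''
'''URL prefix prepended to the text selected by :py:obj:`cached_xpath`.'''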

paging = False
'''Engine supports paging [True or False].'''

page_size = 1
'''Number of results on each page.  Only needed if the site expects a result
offset instead of a page number.'''

first_page_num = 1
'''Number of the first page (usually 0 or 1).'''

def request(query, params):
    '''Build request parameters (see :ref:`engine request`).

    '''
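    # urlencode() yields 'q=<quoted query>'; strip the leading 'q=' to keep
    # only the URL-quoted search terms
    query = urlencode({'q': query})[2:]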

    fargs = {'query': query}
    if paging and '{pageno}' in search_url:
        fargs['pageno'] = (params['pageno'] - 1) * page_size + first_page_num

    params['url'] = search_url.format(**fargs)
    params['query'] = query
    params['soft_max_redirects'] = soft_max_redirects
    logger.debug("query_url --> %s", params['url'])

    return params

def response(resp):
    '''Scrape *results* from the response (see :ref:`engine results`).

    '''
    results = []
    dom = html.fromstring(resp.text)
    is_onion = 'onions' in categories  # pylint: disable=undefined-variable

    if results_xpath:
        for result in eval_xpath_list(dom, results_xpath):

            url = extract_url(eval_xpath_list(result, url_xpath, min_len=1), search_url)
            title = extract_text(eval_xpath_list(result, title_xpath, min_len=1))
            content = extract_text(eval_xpath_list(result, content_xpath, min_len=1))
            tmp_result = {'url': url, 'title': title, 'content': content}

            # add thumbnail if available
            if thumbnail_xpath:
                thumbnail_xpath_result = eval_xpath_list(result, thumbnail_xpath)
                if len(thumbnail_xpath_result) > 0:
                    tmp_result['img_src'] = extract_url(thumbnail_xpath_result, search_url)

            # add alternative cached url if available
            if cached_xpath:
                tmp_result['cached_url'] = (
                    cached_url
                    + extract_text(eval_xpath_list(result, cached_xpath, min_len=1))
                )

            if is_onion:
                tmp_result['is_onion'] = True

            results.append(tmp_result)

    else:
        if cached_xpath:
            for url, title, content, cached in zip(
                (extract_url(x, search_url) for
                 x in eval_xpath_list(dom, url_xpath)),
                map(extract_text, eval_xpath_list(dom, title_xpath)),
                map(extract_text, eval_xpath_list(dom, content_xpath)),
                map(extract_text, eval_xpath_list(dom, cached_xpath))
            ):
                results.append({
                    'url': url,
                    'title': title,
                    'content': content,
                    'cached_url': cached_url + cached, 'is_onion': is_onion
                })
        else:
            for url, title, content in zip(
                (extract_url(x, search_url) for
                 x in eval_xpath_list(dom, url_xpath)),
                map(extract_text, eval_xpath_list(dom, title_xpath)),
                map(extract_text, eval_xpath_list(dom, content_xpath))
            ):
                results.append({
                    'url': url,
                    'title': title,
                    'content': content,
                    'is_onion': is_onion
                })

    if suggestion_xpath:
        for suggestion in eval_xpath(dom, suggestion_xpath):
            results.append({'suggestion': extract_text(suggestion)})

    logger.debug("found %s results", len(results))
    return results