Utility functions for the engines
- searx.utils.convert_str_to_int(number_str: str) → int
Convert number_str to an int, or return 0 if number_str is not a number.
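The fallback to 0 can be pictured with a minimal sketch. It is not the SearXNG source: the name convert_str_to_int_sketch is hypothetical, and it assumes that "not a number" means anything Python's int() rejects.

```python
def convert_str_to_int_sketch(number_str: str) -> int:
    """Hypothetical stand-in: return the parsed int, or 0 on failure."""
    try:
        # Assumption: whatever int() accepts counts as a number.
        return int(number_str)
    except ValueError:
        return 0


print(convert_str_to_int_sketch("42"))
print(convert_str_to_int_sketch("not a number"))
```

Returning 0 instead of raising lets an engine parse counters from scraped markup without wrapping every call in its own error handling.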
- searx.utils.detect_language(text: str, threshold: float = 0.3, only_search_languages: bool = False) → str | None
Detect the language of the text parameter.
- Parameters:
text (str) – The string whose language is to be detected.
threshold (float) – Minimum probability a label must have to be returned. A value of 0.3 returns only labels with a probability of at least 0.3.
only_search_languages (bool) – If True, returns only supported SearXNG search languages. See searx.languages.
- Return type:
str, None
- Returns:
The detected language code or None. See below.
- Raises:
ValueError – If text is not a string.
Language detection uses a fork of the fastText library (python fasttext). The language identification model is distributed by fastText.
The language identification model supports the following language codes (ISO-639-3):
af als am an ar arz as ast av az azb ba bar bcl be bg bh bn bo bpy br bs bxr ca cbk ce ceb ckb co cs cv cy da de diq dsb dty dv el eml en eo es et eu fa fi fr frr fy ga gd gl gn gom gu gv he hi hif hr hsb ht hu hy ia id ie ilo io is it ja jbo jv ka kk km kn ko krc ku kv kw ky la lb lez li lmo lo lrc lt lv mai mg mhr min mk ml mn mr mrj ms mt mwl my myv mzn nah nap nds ne new nl nn no oc or os pa pam pfl pl pms pnb ps pt qu rm ro ru rue sa sah sc scn sco sd sh si sk sl so sq sr su sv sw ta te tg th tk tl tr tt tyv ug uk ur uz vec vep vi vls vo wa war wuu xal xmf yi yo yue zh
By using only_search_languages=True the language identification model is harmonized with SearXNG’s language (locale) model. General conditions of SearXNG’s locale model are:
- SearXNG’s locale of a query is passed to searx.locales.get_engine_locale to get a language and/or region code that is used by an engine.
- Most of SearXNG’s engines do not support all the languages of the language identification model, and there is also a discrepancy in the handling of ISO-639-3 (fasttext) and ISO-639-2 (SearXNG). Furthermore, in SearXNG locales like zh-TW (zh-CN) are mapped to zh_Hant (zh_Hans), while the language identification model reduces both to zh.
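The two filters described above (the probability threshold and the restriction to search languages) can be sketched without loading a model. Everything here is illustrative: filter_prediction_sketch and SEARCH_CODES_SUBSET are hypothetical names, the input is assumed to be a single top prediction with a fastText-style label such as '__label__en', and the code set is a small subset chosen for the example.

```python
from typing import Optional

# Illustrative subset only; the documented set is searx.utils.SEARCH_LANGUAGE_CODES.
SEARCH_CODES_SUBSET = frozenset({"de", "en", "fr", "zh"})


def filter_prediction_sketch(
    label: str,
    probability: float,
    threshold: float = 0.3,
    only_search_languages: bool = False,
) -> Optional[str]:
    """Apply both filters to one (label, probability) prediction."""
    if probability < threshold:
        # Below the threshold: no language is reported.
        return None
    # Assumption: labels carry a '__label__' prefix in front of the code.
    code = label.replace("__label__", "", 1)
    if only_search_languages and code not in SEARCH_CODES_SUBSET:
        # Detected, but not a language the search side is assumed to support.
        return None
    return code


print(filter_prediction_sketch("__label__als", 0.9))
print(filter_prediction_sketch("__label__als", 0.9, only_search_languages=True))
```

With only_search_languages=True a confident detection of a language outside the set yields None, so callers have to handle None in both cases.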
- searx.utils.dict_subset(dictionary: MutableMapping, properties: Set[str]) → Dict
Extract a subset of a dict
- Examples:
>>> dict_subset({'A': 'a', 'B': 'b', 'C': 'c'}, ['A', 'C'])
{'A': 'a', 'C': 'c'}
>>> dict_subset({'A': 'a', 'B': 'b', 'C': 'c'}, ['A', 'D'])
{'A': 'a'}
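The behaviour shown in the examples, where requested keys that are missing from the dictionary are silently skipped, can be reproduced with a dict comprehension. This is a sketch under that reading of the examples; dict_subset_sketch is a hypothetical name, not the library function.

```python
from typing import Any, Dict, Iterable, Mapping


def dict_subset_sketch(dictionary: Mapping[str, Any], properties: Iterable[str]) -> Dict[str, Any]:
    """Keep only the requested keys that actually exist in the mapping."""
    return {key: dictionary[key] for key in properties if key in dictionary}


print(dict_subset_sketch({"A": "a", "B": "b", "C": "c"}, ["A", "D"]))
```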
- searx.utils.ecma_unescape(string: str) → str
Python implementation of the JavaScript unescape function.
https://www.ecma-international.org/ecma-262/6.0/#sec-unescape-string
https://developer.mozilla.org/fr/docs/Web/JavaScript/Reference/Objets_globaux/unescape
- Examples:
>>> ecma_unescape('%u5409')
'吉'
>>> ecma_unescape('%20')
' '
>>> ecma_unescape('%F3')
'ó'
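A minimal sketch of the two escape forms in the examples, %uXXXX and %XX, is given here. It is an illustration using a single regular expression, not the SearXNG implementation; ecma_unescape_sketch is a hypothetical name.

```python
import re

# One pattern for both forms, so text produced by one replacement is never
# re-scanned as the start of another escape sequence.
_ESCAPE = re.compile(r"%u([0-9a-fA-F]{4})|%([0-9a-fA-F]{2})")


def ecma_unescape_sketch(string: str) -> str:
    """Replace %uXXXX and %XX sequences with the corresponding character."""

    def _decode(match: "re.Match[str]") -> str:
        hex_digits = match.group(1) or match.group(2)
        return chr(int(hex_digits, 16))

    return _ESCAPE.sub(_decode, string)


print(ecma_unescape_sketch("%u5409"))
print(repr(ecma_unescape_sketch("a%20b")))
```

Sequences that do not match either form, such as a lone "%", are left untouched in this sketch.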
- searx.utils.eval_xpath(element: ElementBase, xpath_spec: str | XPath)
Equivalent to element.xpath(xpath_str), but xpath_str is compiled only once. See https://lxml.de/xpathxslt.html#xpath-return-values
- Args:
element (ElementBase): lxml element to apply the XPath to.
xpath_spec (str|lxml.etree.XPath): XPath as a str or lxml.etree.XPath
- Returns:
result (bool, float, list, str): Results.
- Raises:
TypeError: Raise when xpath_spec is neither a str nor a lxml.etree.XPath
SearxXPathSyntaxException: Raise when there is a syntax error in the XPath
SearxEngineXPathException: Raise when the XPath can’t be evaluated.
- searx.utils.eval_xpath_getindex(elements: lxml.etree.ElementBase, xpath_spec: str | lxml.etree.XPath, index: int, default=<searx.utils._NotSetClass object>)
Call eval_xpath_list, then get one element using the index parameter. If the index does not exist, raise an exception if default is not set; otherwise return the default value (which can be None).
- Args:
elements (ElementBase): lxml element to apply the xpath.
xpath_spec (str|lxml.etree.XPath): XPath as a str or lxml.etree.XPath.
index (int): index to get
default (Object, optional): Value returned if the index doesn’t exist.
- Raises:
TypeError: Raise when xpath_spec is neither a str nor a lxml.etree.XPath
SearxXPathSyntaxException: Raise when there is a syntax error in the XPath
SearxEngineXPathException: Raise when the index is not found. Also see eval_xpath.
- Returns:
result (bool, float, list, str): Results.
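Because None is a legitimate default here, "no default given" has to be represented by a dedicated sentinel object, which is what the _NotSetClass default in the signature suggests. The following sketch shows that pattern on a plain list. The names _NotSet, _NOTSET and get_index_sketch are hypothetical, and it re-raises IndexError where the documented function raises SearxEngineXPathException.

```python
class _NotSet:
    """Hypothetical sentinel type meaning 'the caller passed no default'."""


_NOTSET = _NotSet()


def get_index_sketch(results: list, index: int, default=_NOTSET):
    """Return results[index]; fall back to default only if one was given."""
    try:
        return results[index]
    except IndexError:
        if default is _NOTSET:
            # No default supplied: surface the error to the caller.
            raise
        return default


print(get_index_sketch(["first", "second"], 1))
print(get_index_sketch([], 0, default=None))
```

Comparing with `is` against a private sentinel keeps default=None distinguishable from an omitted argument.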
- searx.utils.eval_xpath_list(element: ElementBase, xpath_spec: str | XPath, min_len: int | None = None)
Same as eval_xpath, but also checks that the result is a list.
- Args:
element (ElementBase): lxml element to apply the XPath to.
xpath_spec (str|lxml.etree.XPath): XPath as a str or lxml.etree.XPath
min_len (int, optional): Minimum length the result list must have. Defaults to None.
- Raises:
TypeError: Raise when xpath_spec is neither a str nor a lxml.etree.XPath
SearxXPathSyntaxException: Raise when there is a syntax error in the XPath
SearxEngineXPathException: Raise when the result is not a list
- Returns:
result (bool, float, list, str): Results.
- searx.utils.extract_text(xpath_results, allow_none: bool = False) → str | None
Extract text from an lxml result:
- if xpath_results is a list, extract the text from each result and concatenate the results
- if xpath_results is an XML element, extract all of its text nodes (text_content() method from lxml)
- if xpath_results is a string, it is already text and no extraction is needed
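The three cases above amount to a dispatch on the type of the argument. The sketch below illustrates that dispatch without lxml: FakeElement is a hypothetical stand-in that only models text_content(), extract_text_sketch is a hypothetical name, and the allow_none parameter is left out.

```python
class FakeElement:
    """Hypothetical stand-in for an lxml element; only text_content() exists."""

    def __init__(self, text: str) -> None:
        self._text = text

    def text_content(self) -> str:
        return self._text


def _collect(xpath_results) -> str:
    if isinstance(xpath_results, list):
        # List: extract each item and concatenate the pieces.
        return "".join(_collect(item) for item in xpath_results)
    if isinstance(xpath_results, str):
        # String: already text.
        return xpath_results
    # Element-like object: take all of its text nodes.
    return xpath_results.text_content()


def extract_text_sketch(xpath_results) -> str:
    """Collect the text and trim surrounding whitespace (an assumption)."""
    return _collect(xpath_results).strip()


print(extract_text_sketch([FakeElement("Hello "), "world"]))
```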
- searx.utils.extract_url(xpath_results, base_url) → str
Extract and normalize URL from lxml Element
- Args:
xpath_results (Union[List[html.HtmlElement], html.HtmlElement]): lxml Element(s)
base_url (str): Base URL
- Example:
>>> def f(s, search_url):
...     return searx.utils.extract_url(html.fromstring(s), search_url)
>>> f('<span id="42">https://example.com</span>', 'http://example.com/')
'https://example.com/'
>>> f('https://example.com', 'http://example.com/')
'https://example.com/'
>>> f('//example.com', 'http://example.com/')
'http://example.com/'
>>> f('//example.com', 'https://example.com/')
'https://example.com/'
>>> f('/path?a=1', 'https://example.com')
'https://example.com/path?a=1'
>>> f('', 'https://example.com')
raise lxml.etree.ParserError
>>> searx.utils.extract_url([], 'https://example.com')
raise ValueError
- Raises:
ValueError
lxml.etree.ParserError
- Returns:
str: normalized URL
- searx.utils.gen_useragent(os_string: str | None = None) → str
Return a random browser User Agent
See searx/data/useragents.json
- searx.utils.get_engine_from_settings(name: str) → Dict
Return the configuration of the engine with the given name from settings.yml
- searx.utils.get_torrent_size(filesize: str, filesize_multiplier: str) → int | None
- Args:
filesize (str): size
filesize_multiplier (str): TB, GB, …. TiB, GiB…
- Returns:
int: number of bytes
- Example:
>>> get_torrent_size('5', 'GB')
5368709120
>>> get_torrent_size('3.14', 'MiB')
3140000
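The two documented results imply that 'GB' is treated as 1024³ bytes and 'MiB' as 1000² bytes, which is the reverse of the usual SI/IEC convention. The sketch below reproduces those two results and extends the same pattern to the other units purely as an assumption. torrent_size_sketch and the multiplier table are hypothetical, and so is the choice to return None for unknown units or unparsable sizes.

```python
from typing import Optional

# Assumption: extrapolated from the two documented examples only.
_MULTIPLIERS = {
    "TB": 1024**4, "GB": 1024**3, "MB": 1024**2, "KB": 1024,
    "TiB": 1000**4, "GiB": 1000**3, "MiB": 1000**2, "KiB": 1000,
}


def torrent_size_sketch(filesize: str, filesize_multiplier: str) -> Optional[int]:
    """Convert a size string plus unit into a number of bytes."""
    multiplier = _MULTIPLIERS.get(filesize_multiplier)
    if multiplier is None:
        return None  # unknown unit (sketch-only behaviour)
    try:
        return int(float(filesize) * multiplier)
    except ValueError:
        return None  # filesize was not a number


print(torrent_size_sketch("5", "GB"))
print(torrent_size_sketch("3.14", "MiB"))
```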
- searx.utils.get_xpath(xpath_spec: str | XPath) → XPath
Return cached compiled XPath
There is no thread lock. In the worst case, xpath_str is compiled more than once.
- Args:
xpath_spec (str|lxml.etree.XPath): XPath as a str or lxml.etree.XPath
- Returns:
result (lxml.etree.XPath): The compiled XPath.
- Raises:
TypeError: Raise when xpath_spec is neither a str nor a lxml.etree.XPath
SearxXPathSyntaxException: Raise when there is a syntax error in the XPath
- searx.utils.html_to_text(html_str: str) → str
Extract text from an HTML string
- Args:
html_str (str): string HTML
- Returns:
str: extracted text
- Examples:
>>> html_to_text('Example <span id="42">#2</span>')
'Example #2'
>>> html_to_text('<style>.span { color: red; }</style><span>Example</span>')
'Example'
>>> html_to_text(r'regexp: (?<![a-zA-Z]')
'regexp: (?<![a-zA-Z]'
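The first two examples, tags dropped and the content of style elements ignored, can be approximated with the standard library's html.parser. This is a rough sketch and not the SearXNG implementation: the names are hypothetical, skipping script as well as style is an assumption, and malformed input such as the third example is not handled.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collect text nodes, ignoring the content of style/script elements."""

    SKIPPED = {"style", "script"}  # assumption: script is skipped as well

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def html_to_text_sketch(html_str: str) -> str:
    collector = _TextCollector()
    collector.feed(html_str)
    collector.close()  # flush any buffered text
    # Collapse runs of whitespace into single spaces.
    return " ".join("".join(collector.parts).split())


print(html_to_text_sketch('Example <span id="42">#2</span>'))
```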
- searx.utils.int_or_zero(num: List[str] | str) → int
Convert num to int or 0. num can be either a str or a list. If num is a list, the first element is converted to int (or 0 is returned if the list is empty). If num is a str, see convert_str_to_int.
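The list and str cases can be sketched as follows. int_or_zero_sketch is a hypothetical name, and the string conversion assumes that anything int() rejects maps to 0.

```python
from typing import List, Union


def int_or_zero_sketch(num: Union[List[str], str]) -> int:
    """Convert a str, or the first element of a list, to int; else 0."""
    if isinstance(num, list):
        if not num:
            return 0  # empty list
        num = num[0]  # only the first element is considered
    try:
        return int(num)
    except ValueError:
        return 0


print(int_or_zero_sketch(["12", "34"]))
print(int_or_zero_sketch([]))
```

Accepting a list directly is convenient for XPath results, which are usually lists even when a single value is expected.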
- searx.utils.is_valid_lang(lang) → Tuple[bool, str, str] | None
Return the language code and name if lang describes a language.
- Examples:
>>> is_valid_lang('zz')
None
>>> is_valid_lang('uk')
(True, 'uk', 'ukrainian')
>>> is_valid_lang(b'uk')
(True, 'uk', 'ukrainian')
>>> is_valid_lang('en')
(True, 'en', 'english')
>>> searx.utils.is_valid_lang('Español')
(True, 'es', 'spanish')
>>> searx.utils.is_valid_lang('Spanish')
(True, 'es', 'spanish')
- searx.utils.js_variable_to_python(js_variable)
Convert a JavaScript variable into JSON and then load the value.
It does not handle every case, but it is good enough for now. chompjs has a better implementation.
- searx.utils.markdown_to_text(markdown_str: str) → str
Extract text from a Markdown string
- Args:
markdown_str (str): string Markdown
- Returns:
str: extracted text
- Examples:
>>> markdown_to_text('[example](https://example.com)')
'example'
>>> markdown_to_text('## Headline')
'Headline'
- searx.utils.normalize_url(url: str, base_url: str) → str
Normalize URL: add protocol, join URL with base_url, add trailing slash if there is no path
- Args:
url (str): Relative URL
base_url (str): Base URL, it must be an absolute URL.
- Example:
>>> normalize_url('https://example.com', 'http://example.com/')
'https://example.com/'
>>> normalize_url('//example.com', 'http://example.com/')
'http://example.com/'
>>> normalize_url('//example.com', 'https://example.com/')
'https://example.com/'
>>> normalize_url('/path?a=1', 'https://example.com')
'https://example.com/path?a=1'
>>> normalize_url('', 'https://example.com')
'https://example.com/'
>>> normalize_url('/test', '/path')
raise ValueError
- Raises:
lxml.etree.ParserError
- Returns:
str: normalized URL
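The three steps named in the description (add the protocol, join with base_url, add a trailing slash when the path is empty) can be sketched with urllib.parse. This is an illustration that reproduces the documented examples, not the SearXNG source; normalize_url_sketch is a hypothetical name.

```python
from urllib.parse import urljoin, urlparse, urlunparse


def normalize_url_sketch(url: str, base_url: str) -> str:
    """Resolve url against base_url and ensure a non-empty path."""
    # urljoin fills in the scheme for '//host' URLs and the scheme and host
    # for relative paths; absolute URLs pass through unchanged.
    parsed = urlparse(urljoin(base_url, url))
    if not parsed.netloc:
        # Happens when base_url is not absolute, e.g. '/path'.
        raise ValueError("cannot build an absolute URL")
    if not parsed.path:
        parsed = parsed._replace(path="/")
    return urlunparse(parsed)


print(normalize_url_sketch("//example.com", "http://example.com/"))
print(normalize_url_sketch("/path?a=1", "https://example.com"))
```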
- searx.utils.SEARCH_LANGUAGE_CODES = frozenset({'af', 'ar', 'be', 'bg', 'ca', 'cs', 'da', 'de', 'el', 'en', 'es', 'et', 'fa', 'fi', 'fr', 'he', 'hi', 'hr', 'hu', 'id', 'is', 'it', 'ja', 'ko', 'lt', 'lv', 'nb', 'nl', 'pl', 'pt', 'ro', 'ru', 'sk', 'sl', 'sr', 'sv', 'th', 'tr', 'uk', 'vi', 'zh'})
Languages supported by most SearXNG engines (searx.sxng_locales.sxng_locales).