Bot Detection

X-Forwarded-For

Attention

A correct setup of the HTTP request headers X-Forwarded-For and X-Real-IP is essential to be able to assign a request to an IP correctly:

searx.botdetection.get_real_ip(request: Request) → str

Returns real IP of the request. Since not all proxies set all the HTTP headers and incoming headers can be faked it may happen that the IP cannot be determined correctly.

This function tries to get the remote IP in the order listed below. Additionally, some tests are done, and if inconsistencies or errors are detected, they are logged.

The remote IP of the request is taken from (first match):
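A first-match lookup of this kind can be sketched as follows. This is a minimal illustration, not the actual get_real_ip implementation. The order (X-Forwarded-For, then X-Real-IP, then the peer address of the connection), the choice of the left-most X-Forwarded-For entry, and the function name real_ip_sketch are all assumptions made for the example.

```python
from ipaddress import ip_address


def real_ip_sketch(headers, remote_addr):
    """Return the first usable client IP (assumed order, hypothetical helper).

    headers is a plain dict here, so header names are case-sensitive.
    """
    forwarded_for = headers.get("X-Forwarded-For", "")
    if forwarded_for:
        # The header may hold a comma separated chain of proxies;
        # taking the left-most entry is an assumption of this sketch.
        candidate = forwarded_for.split(",")[0].strip()
    else:
        candidate = headers.get("X-Real-IP", "").strip() or remote_addr
    # Validate and normalise; raises ValueError if the value is not an IP.
    return str(ip_address(candidate))
```

Since incoming headers can be faked, a lookup like this is only trustworthy when the reverse proxy in front of the application overwrites these headers itself.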

Limiter

Bot protection / IP rate limitation. The intention of rate limitation is to limit suspicious requests from an IP. The motivation behind this is the fact that SearXNG passes through requests from bots and is thus classified as a bot itself. As a result, the SearXNG engine then receives a CAPTCHA or is blocked by the search engine (the origin) in some other way.

To avoid being blocked, the requests from bots to SearXNG must also be blocked; this is the task of the limiter. To perform this task, the limiter uses the methods from searx.botdetection.

To enable the limiter, activate:

server:
  ...
  limiter: true  # rate limit the number of requests on the instance, block some bots

and set the redis url connection. Check the value; it depends on your redis DB (see redis:), for example:

redis:
  url: unix:///usr/local/searxng-redis/run/redis.sock?db=0

searx.botdetection.limiter.LIMITER_CFG = PosixPath('/etc/searxng/limiter.toml')

Local limiter configuration.

searx.botdetection.limiter.LIMITER_CFG_SCHEMA = PosixPath('/home/runner/work/searxng/searxng/searx/botdetection/limiter.toml')

Base configuration (schema) of the botdetection.

Method ip_lists

The ip_lists method implements IP block- and pass-lists.

[botdetection.ip_lists]

pass_ip = [
  '140.238.172.132', # IPv4 of check.searx.space
  '192.168.0.0/16',  # IPv4 private network
  'fe80::/10'        # IPv6 linklocal
]
block_ip = [
  '93.184.216.34', # IPv4 of example.org
  '257.1.1.1',     # invalid IP --> will be ignored, logged in ERROR class
]

searx.botdetection.ip_lists.block_ip(real_ip: IPv4Address | IPv6Address, cfg: config.Config) → Tuple[bool, str]

Checks whether the IP is contained in (is a subnet of) one of the members of the botdetection.ip_lists.block_ip list.

searx.botdetection.ip_lists.pass_ip(real_ip: IPv4Address | IPv6Address, cfg: config.Config) → Tuple[bool, str]

Checks whether the IP is contained in (is a subnet of) one of the members of the botdetection.ip_lists.pass_ip list.
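The membership test behind both lists can be illustrated with the ipaddress module from the Python standard library. This is a sketch only: the helper name matching_member is hypothetical, and skipping invalid entries silently (instead of logging them) is a simplification made for the example.

```python
from ipaddress import ip_address, ip_network


def matching_member(real_ip, members):
    """Return the first list member that contains real_ip, else None."""
    addr = ip_address(real_ip)
    for member in members:
        try:
            # A single address such as '93.184.216.34' becomes a /32 network.
            net = ip_network(member, strict=False)
        except ValueError:
            # Invalid entries such as '257.1.1.1' are skipped in this sketch.
            continue
        # An IPv4 address is never "in" an IPv6 network and vice versa.
        if addr in net:
            return member
    return None
```

For example, matching_member('192.168.1.20', ['140.238.172.132', '192.168.0.0/16', 'fe80::/10']) returns '192.168.0.0/16'.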

searx.botdetection.ip_lists.SEARXNG_ORG = ['140.238.172.132', '2603:c022:0:4900::/56']

Passlist of IPs from the SearXNG organization, e.g. check.searx.space.

Rate limit

Method ip_limit

The ip_limit method counts requests from an IP in sliding windows. If there are too many requests in a sliding window, the request is evaluated as a bot request. This method requires a redis DB and needs an HTTP X-Forwarded-For header. To protect privacy, only the hash value of an IP is stored in the redis DB, and for a maximum of 10 minutes.
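The idea of counting hashed IPs in a sliding window can be sketched in memory, without redis. The class name SlidingWindowCounter, the salted SHA-256 hash, and the secret value are assumptions for illustration. The real method stores its counters in the redis DB.

```python
import hashlib
from collections import defaultdict, deque


class SlidingWindowCounter:
    """In-memory stand-in for a redis-backed sliding window (hypothetical)."""

    def __init__(self, window_sec, secret=b"hypothetical-secret"):
        self.window_sec = window_sec
        self.secret = secret
        self._hits = defaultdict(deque)  # hashed IP -> request timestamps

    def _key(self, ip):
        # Only a hash of the IP is kept, never the IP itself.
        return hashlib.sha256(self.secret + ip.encode()).hexdigest()

    def incr(self, ip, now):
        """Record a request at time now and return the count in the window."""
        hits = self._hits[self._key(ip)]
        # Drop timestamps that have fallen out of the window.
        while hits and hits[0] <= now - self.window_sec:
            hits.popleft()
        hits.append(now)
        return len(hits)
```

A request would be rated as a bot request once the value returned by incr exceeds the maximum configured for that window.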

The link_token method can be used to investigate whether a request is suspicious. To activate the link_token method in the ip_limit method add the following to your /etc/searxng/limiter.toml:

[botdetection.ip_limit]
link_token = true

If the link_token method is activated and a request is suspicious, the request rates are reduced to the limits for suspicious requests:

  • BURST_MAX_SUSPICIOUS

  • LONG_MAX_SUSPICIOUS

To intercept bots that get their IPs from a range of IPs, there is a SUSPICIOUS_IP_WINDOW. In this window the suspicious IPs are stored for a longer time. IPs stored in this sliding window have a maximum of SUSPICIOUS_IP_MAX accesses before they are blocked. As soon as the IP makes a request that is not suspicious, the sliding window for this IP is dropped.

searx.botdetection.ip_limit.API_MAX = 4

Maximum requests from one IP in the API_WONDOW

searx.botdetection.ip_limit.API_WONDOW = 3600

Time (sec) before sliding window for API requests (format != html) expires.

searx.botdetection.ip_limit.BURST_MAX = 15

Maximum requests from one IP in the BURST_WINDOW

searx.botdetection.ip_limit.BURST_MAX_SUSPICIOUS = 2

Maximum of suspicious requests from one IP in the BURST_WINDOW

searx.botdetection.ip_limit.BURST_WINDOW = 20

Time (sec) before sliding window for burst requests expires.

searx.botdetection.ip_limit.LONG_MAX = 150

Maximum requests from one IP in the LONG_WINDOW

searx.botdetection.ip_limit.LONG_MAX_SUSPICIOUS = 10

Maximum suspicious requests from one IP in the LONG_WINDOW

searx.botdetection.ip_limit.LONG_WINDOW = 600

Time (sec) before the longer sliding window expires.

searx.botdetection.ip_limit.SUSPICIOUS_IP_MAX = 3

Maximum requests from one suspicious IP in the SUSPICIOUS_IP_WINDOW.

searx.botdetection.ip_limit.SUSPICIOUS_IP_WINDOW = 2592000

Time (sec) before sliding window for one suspicious IP expires.
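How these constants could interact is sketched below. The constant values are the ones documented above. The function exceeds_limits and its arguments are hypothetical, and the actual order and combination of checks in ip_limit may differ.

```python
BURST_MAX = 15
BURST_MAX_SUSPICIOUS = 2
LONG_MAX = 150
LONG_MAX_SUSPICIOUS = 10


def exceeds_limits(burst_count, long_count, suspicious):
    """Return True if either sliding window is over its maximum.

    burst_count and long_count are the request counts of one IP in the
    BURST_WINDOW and the LONG_WINDOW. A suspicious request is measured
    against the reduced *_SUSPICIOUS maxima.
    """
    burst_max = BURST_MAX_SUSPICIOUS if suspicious else BURST_MAX
    long_max = LONG_MAX_SUSPICIOUS if suspicious else LONG_MAX
    return burst_count > burst_max or long_count > long_max
```

With these values, three requests within the burst window are tolerated from an ordinary client, but the same three requests exceed the limit once the client is rated as suspicious.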

Generates a hashed key that corresponds (more or less) to a web browser session in a network.

Returns current token. If there is no currently active token a new token is generated randomly and stored in the redis DB.

Checks whether a valid ping exists for this (client) network; if not, this request is rated as suspicious. If a valid ping exists and argument renew is True, the expire time of this ping is reset to PING_LIVE_TIME.

This function is called by a request to URL /client<token>.css. If the token is valid, a PING_KEY for the client is stored in the DB. The expire time of this ping-key is PING_LIVE_TIME.

Prefix of all ping-keys generated by get_ping_key

Lifetime (sec) of the ping-key from a client (request)

Key for which the current token is stored in the DB

Lifetime (sec) of limiter’s CSS token.
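The interplay of token, ping and expiry described above can be sketched with an in-memory store in place of the redis DB. Everything here is illustrative: ExpiringStore, the key strings "token" and "ping:", and the function signatures are hypothetical, and the lifetimes are passed in as arguments rather than taken from the real constants.

```python
import secrets


class ExpiringStore:
    """Tiny key/value store with expiry, standing in for redis (hypothetical)."""

    def __init__(self):
        self._data = {}  # key -> (value, expires_at)

    def set(self, key, value, ttl, now):
        self._data[key] = (value, now + ttl)

    def get(self, key, now):
        item = self._data.get(key)
        if item is None or item[1] <= now:
            self._data.pop(key, None)  # expired or missing
            return None
        return item[0]


def get_token(store, now, token_live_time):
    """Return the current token, generating a random one if none is active."""
    token = store.get("token", now)
    if token is None:
        token = secrets.token_urlsafe(16)
        store.set("token", token, token_live_time, now)
    return token


def ping(store, client_key, token, now, token_live_time, ping_live_time):
    """Store a ping for the client if the presented token is valid."""
    if token == get_token(store, now, token_live_time):
        store.set("ping:" + client_key, 1, ping_live_time, now)


def is_suspicious(store, client_key, now, ping_live_time, renew=False):
    """A client without a valid ping is rated as suspicious."""
    if store.get("ping:" + client_key, now) is None:
        return True
    if renew:
        store.set("ping:" + client_key, 1, ping_live_time, now)
    return False
```

A browser that loads the token CSS triggers ping and is no longer rated as suspicious until the ping expires. A client that never fetches the stylesheet stays suspicious.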

Probe HTTP headers

Method http_accept

The http_accept method evaluates a request as the request of a bot if the Accept header ..

  • does not contain text/html

Method http_accept_encoding

The http_accept_encoding method evaluates a request as the request of a bot if the Accept-Encoding header ..

  • contains neither gzip nor deflate (both values are missing)

Method http_accept_language

The http_accept_language method evaluates a request as the request of a bot if the Accept-Language header is unset.

Method http_connection

The http_connection method evaluates a request as the request of a bot if the Connection header is set to close.
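The four header probes described so far can be sketched together in one function. This is a simplified illustration, not the code of the individual methods: probe_headers is a hypothetical helper, headers is a plain dict with exact header names, and quality values such as gzip;q=0.8 are not handled.

```python
def probe_headers(headers):
    """Return the names of the headers that make a request look like a bot."""
    reasons = []

    # http_accept: the Accept header does not contain text/html
    if "text/html" not in headers.get("Accept", ""):
        reasons.append("Accept")

    # http_accept_encoding: neither gzip nor deflate is offered
    encodings = [e.strip() for e in headers.get("Accept-Encoding", "").split(",")]
    if "gzip" not in encodings and "deflate" not in encodings:
        reasons.append("Accept-Encoding")

    # http_accept_language: the header is unset (or empty)
    if not headers.get("Accept-Language", "").strip():
        reasons.append("Accept-Language")

    # http_connection: the header is set to close
    if headers.get("Connection", "").strip().lower() == "close":
        reasons.append("Connection")

    return reasons
```

An empty list means none of these probes fired. A typical browser request passes all four, while a bare HTTP client that sends no headers fails the first three.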

Method http_user_agent

The http_user_agent method evaluates a request as the request of a bot if the User-Agent header is unset or matches the regular expression USER_AGENT.

searx.botdetection.http_user_agent.USER_AGENT = '(unknown|[Cc][Uu][Rr][Ll]|[wW]get|Scrapy|splash|JavaFX|FeedFetcher|python-requests|Go-http-client|Java|Jakarta|okhttp|HttpClient|Jersey|Python|libwww-perl|Ruby|SynHttpClient|UniversalFeedParser|Googlebot|GoogleImageProxy|bingbot|Baiduspider|yacybot|YandexMobileBot|YandexBot|Yahoo! Slurp|MJ12bot|AhrefsBot|archive.org_bot|msnbot|MJ12bot|SeznamBot|linkdexbot|Netvibes|SMTBot|zgrab|James BOT|Sogou|Abonti|Pixray|Spinn3r|SemrushBot|Exabot|ZmEu|BLEXBot|bitlybot|Mozilla/5\\.0\\ \\(compatible;\\ Farside/0\\.1\\.0;\\ \\+https://farside\\.link\\)|.*PetalBot.*)'

Regular expression that matches the User-Agent of known bots.
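A check of this kind can be sketched with a shortened pattern. USER_AGENT_SUBSET contains only a few alternatives from the expression above, and is_bot_user_agent is a hypothetical helper. Searching anywhere in the header value (re.search) is an assumption of this sketch; the real method may anchor the match differently.

```python
import re

# Shortened, illustrative subset of the USER_AGENT expression.
USER_AGENT_SUBSET = re.compile(
    r"(unknown|[Cc][Uu][Rr][Ll]|[wW]get|python-requests|Googlebot)"
)


def is_bot_user_agent(user_agent):
    """Return True if the User-Agent is unset or matches a known bot."""
    if not user_agent:
        return True
    return USER_AGENT_SUBSET.search(user_agent) is not None
```

For example, is_bot_user_agent('curl/8.4.0') is True, while a regular Firefox User-Agent string is not matched by this subset.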