Bot Detection¶
X-Forwarded-For¶
Attention
A correct setup of the HTTP request headers X-Forwarded-For and X-Real-IP is essential to be able to assign a request to an IP correctly.
- searx.botdetection.get_real_ip(request: Request) → str¶
Returns real IP of the request. Since not all proxies set all the HTTP headers and incoming headers can be faked it may happen that the IP cannot be determined correctly.
This function tries to get the remote IP in the order listed below. In addition, some tests are done, and any inconsistencies or errors that are detected are logged.
The remote IP of the request is taken from (first match):
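A minimal sketch of such a first-match lookup, not the SearXNG implementation. It assumes the order X-Forwarded-For, then X-Real-IP, then the socket peer address, and it assumes that the left-most X-Forwarded-For entry is the client. The names `pick_real_ip`, `_parse_ip` and the `"0.0.0.0"` fallback are hypothetical; the helper works on a plain mapping so that it runs without Flask.

```python
import ipaddress
import logging
from typing import Mapping, Optional

logger = logging.getLogger("botdetection.sketch")


def _parse_ip(value: Optional[str]) -> Optional[str]:
    """Return the normalized IP string, or None if the value is missing or invalid."""
    if not value:
        return None
    try:
        return str(ipaddress.ip_address(value.strip()))
    except ValueError:
        logger.error("invalid IP in request data: %r", value)
        return None


def pick_real_ip(headers: Mapping[str, str], remote_addr: Optional[str]) -> str:
    """Hypothetical helper: first match of X-Forwarded-For, X-Real-IP, remote_addr.

    `headers` is a plain mapping here, so the lookup is case-sensitive.
    """
    forwarded_for = headers.get("X-Forwarded-For", "")
    # Assumption: the left-most entry is the client as seen by the first proxy.
    first_hop = forwarded_for.split(",")[0] if forwarded_for else None

    xff = _parse_ip(first_hop)
    xri = _parse_ip(headers.get("X-Real-IP"))
    peer = _parse_ip(remote_addr)

    # Both headers can be faked, so a mismatch is only logged, not resolved.
    if xff and xri and xff != xri:
        logger.warning("X-Forwarded-For (%s) and X-Real-IP (%s) differ", xff, xri)

    # Assumption: fall back to 0.0.0.0 if nothing usable was found.
    return xff or xri or peer or "0.0.0.0"
```

Because incoming headers can be faked, a result of this kind is only trustworthy when the reverse proxy in front of the application overwrites both headers.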
Limiter¶
Bot protection / IP rate limitation. The intention of rate limitation is to limit suspicious requests from an IP. The motivation behind this is the fact that SearXNG passes through requests from bots and is thus classified as a bot itself. As a result, the SearXNG engine then receives a CAPTCHA or is blocked by the search engine (the origin) in some other way.
To avoid being blocked, requests from bots to SearXNG must themselves be blocked; this is the task of the limiter. To perform this task, the limiter uses the methods from searx.botdetection.
To enable the limiter, activate:
server:
  ...
  limiter: true  # rate limit the number of requests on the instance, block some bots
and set the redis-url connection. Check the value, as it depends on your redis DB (see redis:), for example:
redis:
  url: unix:///usr/local/searxng-redis/run/redis.sock?db=0
- searx.botdetection.limiter.LIMITER_CFG = PosixPath('/etc/searxng/limiter.toml')¶
Local limiter configuration.
- searx.botdetection.limiter.LIMITER_CFG_SCHEMA = PosixPath('/home/runner/work/searxng/searxng/searx/botdetection/limiter.toml')¶
Base configuration (schema) of the botdetection.
Method ip_lists¶
The ip_lists method implements IP block- and pass-lists.
[botdetection.ip_lists]
pass_ip = [
'140.238.172.132', # IPv4 of check.searx.space
'192.168.0.0/16', # IPv4 private network
'fe80::/10' # IPv6 linklocal
]
block_ip = [
'93.184.216.34', # IPv4 of example.org
'257.1.1.1', # invalid IP --> will be ignored, logged in ERROR class
]
- searx.botdetection.ip_lists.block_ip(real_ip: IPv4Address | IPv6Address, cfg: config.Config) → Tuple[bool, str]¶
Checks whether the IP is a member of one of the networks in the botdetection.ip_lists.block_ip list.
- searx.botdetection.ip_lists.pass_ip(real_ip: IPv4Address | IPv6Address, cfg: config.Config) → Tuple[bool, str]¶
Checks whether the IP is a member of one of the networks in the botdetection.ip_lists.pass_ip list.
- searx.botdetection.ip_lists.SEARXNG_ORG = ['140.238.172.132', '2603:c022:0:4900::/56']¶
Passlist of IPs from the SearXNG organization, e.g. check.searx.space.
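The membership test behind both lists can be sketched with the standard `ipaddress` module. This is an illustration only: `ip_in_list` is a hypothetical helper that takes the list entries directly instead of a `config.Config` object, and the wording of the returned messages is made up.

```python
import ipaddress
import logging
from typing import Iterable, Tuple, Union

logger = logging.getLogger("botdetection.sketch")
IPAddress = Union[ipaddress.IPv4Address, ipaddress.IPv6Address]


def ip_in_list(real_ip: IPAddress, entries: Iterable[str], list_name: str) -> Tuple[bool, str]:
    """Hypothetical helper: return (True, reason) if real_ip is covered by an entry."""
    for entry in entries:
        try:
            # strict=False also accepts entries whose host bits are set.
            network = ipaddress.ip_network(entry, strict=False)
        except ValueError:
            # Invalid entries such as '257.1.1.1' are logged and skipped.
            logger.error("invalid IP %r in %s is ignored", entry, list_name)
            continue
        # Only compare addresses and networks of the same IP version.
        if real_ip.version == network.version and real_ip in network:
            return True, f"IP matches {entry} in {list_name}."
    return False, f"IP is not a member of an item in {list_name}."
```

A single address such as '93.184.216.34' is parsed as a /32 network, so plain IPs and CIDR ranges go through the same code path.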
Rate limit¶
Method ip_limit¶
The ip_limit method counts requests from an IP in sliding windows. If there are too many requests in a sliding window, the request is evaluated as a bot request. This method requires a redis DB and needs an HTTP X-Forwarded-For header. To protect privacy, only the hash value of an IP is stored in the redis DB, and only for a maximum of 10 minutes.
The link_token method can be used to investigate whether a request is suspicious. To activate the link_token method in the ip_limit method, add the following to your /etc/searxng/limiter.toml:
[botdetection.ip_limit]
link_token = true
If the link_token method is activated and a request is suspicious, the request rates are reduced to BURST_MAX_SUSPICIOUS and LONG_MAX_SUSPICIOUS.
To intercept bots that get their IPs from a range of IPs, there is a SUSPICIOUS_IP_WINDOW. In this window the suspicious IPs are stored for a longer time. IPs stored in this sliding window have a maximum of SUSPICIOUS_IP_MAX accesses before they are blocked. As soon as the IP makes a request that is not suspicious, the sliding window for this IP is dropped.
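The counting idea can be sketched without redis. The following is an in-memory stand-in, not the limiter's code: the class `SlidingWindowCounter`, its method names and the keyed SHA-256 hash are assumptions chosen for illustration. It records a timestamp per request, forgets timestamps that have left the window, and stores only a hash of the IP.

```python
import hashlib
import time
from collections import defaultdict, deque
from typing import Deque, Dict, Optional


class SlidingWindowCounter:
    """In-memory stand-in for the redis-backed counters (illustration only)."""

    def __init__(self, secret: str = "hypothetical-secret") -> None:
        self._secret = secret
        self._hits: Dict[str, Deque[float]] = defaultdict(deque)

    def _key(self, name: str, ip: str) -> str:
        # Only a keyed hash of the IP is kept, never the IP itself.
        digest = hashlib.sha256(f"{self._secret}:{ip}".encode()).hexdigest()
        return f"{name}:{digest}"

    def incr(self, name: str, ip: str, window: int, now: Optional[float] = None) -> int:
        """Record one request and return the number of requests within `window` seconds."""
        now = time.time() if now is None else now
        hits = self._hits[self._key(name, ip)]
        hits.append(now)
        # Drop timestamps that have left the window.
        while hits and hits[0] <= now - window:
            hits.popleft()
        return len(hits)

    def drop(self, name: str, ip: str) -> None:
        """Forget a window, e.g. after a non-suspicious request from a suspicious IP."""
        self._hits.pop(self._key(name, ip), None)
```

With a 20 second window, sixteen requests within sixteen seconds yield a count of 16, while a single request much later starts again at 1.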
- searx.botdetection.ip_limit.API_MAX = 4¶
Maximum requests from one IP in the API_WONDOW.
- searx.botdetection.ip_limit.API_WONDOW = 3600¶
Time (sec) before sliding window for API requests (format != html) expires.
- searx.botdetection.ip_limit.BURST_MAX = 15¶
Maximum requests from one IP in the BURST_WINDOW.
- searx.botdetection.ip_limit.BURST_MAX_SUSPICIOUS = 2¶
Maximum of suspicious requests from one IP in the BURST_WINDOW.
- searx.botdetection.ip_limit.BURST_WINDOW = 20¶
Time (sec) before sliding window for burst requests expires.
- searx.botdetection.ip_limit.LONG_MAX = 150¶
Maximum requests from one IP in the LONG_WINDOW.
- searx.botdetection.ip_limit.LONG_MAX_SUSPICIOUS = 10¶
Maximum suspicious requests from one IP in the LONG_WINDOW.
- searx.botdetection.ip_limit.LONG_WINDOW = 600¶
Time (sec) before the longer sliding window expires.
- searx.botdetection.ip_limit.SUSPICIOUS_IP_MAX = 3¶
Maximum requests from one suspicious IP in the SUSPICIOUS_IP_WINDOW.
- searx.botdetection.ip_limit.SUSPICIOUS_IP_WINDOW = 2592000¶
Time (sec) before sliding window for one suspicious IP expires.
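How the constants listed above could interact is sketched below. The numbers are the documented values; the function `too_many_requests`, its parameters, the strict `>` comparison and the returned messages are assumptions for illustration, and the counts are expected to come from whatever sliding-window counter is in use.

```python
from typing import Optional

# Documented values from searx.botdetection.ip_limit
API_MAX = 4
BURST_MAX = 15
BURST_MAX_SUSPICIOUS = 2
LONG_MAX = 150
LONG_MAX_SUSPICIOUS = 10


def too_many_requests(
    burst_count: int,
    long_count: int,
    api_count: int = 0,
    is_api: bool = False,
    suspicious: bool = False,
) -> Optional[str]:
    """Hypothetical decision helper: return a reason string, or None if within limits."""
    # API requests (format != html) have their own, much lower budget.
    if is_api and api_count > API_MAX:
        return "too many API requests"

    # Suspicious requests are measured against the reduced limits.
    burst_max = BURST_MAX_SUSPICIOUS if suspicious else BURST_MAX
    long_max = LONG_MAX_SUSPICIOUS if suspicious else LONG_MAX

    if burst_count > burst_max:
        return "too many requests in BURST_WINDOW"
    if long_count > long_max:
        return "too many requests in LONG_WINDOW"
    return None
```

Under these assumptions, three requests within 20 seconds pass for a normal client but are rejected once the client is rated as suspicious.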
Method link_token¶
The link_token method evaluates a request as suspicious if the URL /client<token>.css is not requested by the client. Because the URL contains a random component (the token), a bot cannot send a ping by requesting a static URL.
Note
This method requires a redis DB and needs an HTTP X-Forwarded-For header.
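To use this method, a Flask URL route needs to be added:
from flask import Response, request
from searx.botdetection import link_token

# app is the existing flask.Flask application object
@app.route('/client<token>.css', methods=['GET', 'POST'])
def client_token(token=None):
    link_token.ping(request, token)
    return Response('', mimetype='text/css')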
In the Flask HTML template, a stylesheet link is also needed (the value of link_token comes from get_token):
<link rel="stylesheet"
      href="{{ url_for('client_token', token=link_token) }}"
      type="text/css" />
- searx.botdetection.link_token.get_ping_key(network: IPv4Network | IPv6Network, request: flask.Request) → str¶
Generates a hashed key that fits (more or less) to a web browser session in a network.
- searx.botdetection.link_token.get_token() → str¶
Returns current token. If there is no currently active token a new token is generated randomly and stored in the redis DB.
- searx.botdetection.link_token.is_suspicious(network: IPv4Network | IPv6Network, request: flask.Request, renew: bool = False)¶
Checks whether a valid ping exists for this (client) network; if not, this request is rated as suspicious. If a valid ping exists and argument renew is True, the expire time of this ping is reset to PING_LIVE_TIME.
- searx.botdetection.link_token.ping(request: Request, token: str)¶
This function is called by a request to URL /client<token>.css. If token is valid, a PING_KEY for the client is stored in the DB. The expire time of this ping-key is PING_LIVE_TIME.
- searx.botdetection.link_token.PING_KEY = 'SearXNG_limiter.ping'¶
Prefix of all ping-keys generated by get_ping_key.
- searx.botdetection.link_token.PING_LIVE_TIME = 3600¶
Lifetime (sec) of the ping-key from a client (request).
- searx.botdetection.link_token.TOKEN_KEY = 'SearXNG_limiter.token'¶
Key for which the current token is stored in the DB
- searx.botdetection.link_token.TOKEN_LIVE_TIME = 600¶
Lifetime (sec) of the limiter’s CSS token.
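The interplay of get_token, ping and is_suspicious can be sketched with an in-memory store in place of the redis DB. This is an illustration under assumptions, not the module's code: the class `TokenStore`, the explicit `now` parameter and the use of a caller-supplied `ping_key` string (instead of deriving it from network and request) are hypothetical; only the two lifetimes are the documented values.

```python
import secrets
import time
from typing import Dict, Optional, Tuple

PING_LIVE_TIME = 3600  # documented value, seconds
TOKEN_LIVE_TIME = 600  # documented value, seconds


class TokenStore:
    """In-memory stand-in for the redis DB (illustration only)."""

    def __init__(self) -> None:
        self._token: Optional[Tuple[str, float]] = None  # (token, expires_at)
        self._pings: Dict[str, float] = {}  # ping key -> expires_at

    def get_token(self, now: Optional[float] = None) -> str:
        """Return the current token, generating a new random one if none is active."""
        now = time.time() if now is None else now
        if self._token is None or self._token[1] <= now:
            self._token = (secrets.token_urlsafe(16), now + TOKEN_LIVE_TIME)
        return self._token[0]

    def ping(self, ping_key: str, token: str, now: Optional[float] = None) -> None:
        """Store a ping for the client if the token from the CSS URL is valid."""
        now = time.time() if now is None else now
        current = self.get_token(now)
        if secrets.compare_digest(token.encode(), current.encode()):
            self._pings[ping_key] = now + PING_LIVE_TIME

    def is_suspicious(self, ping_key: str, renew: bool = False, now: Optional[float] = None) -> bool:
        """A client without a valid (unexpired) ping is rated as suspicious."""
        now = time.time() if now is None else now
        expires_at = self._pings.get(ping_key)
        if expires_at is None or expires_at <= now:
            return True
        if renew:
            self._pings[ping_key] = now + PING_LIVE_TIME
        return False
```

A browser that renders the result page fetches the stylesheet and thereby pings with the current token; a bot that only requests the search URL never does, so it stays suspicious.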
Probe HTTP headers¶
Method http_accept¶
The http_accept method evaluates a request as the request of a bot if the Accept header does not contain text/html.
Method http_accept_encoding¶
The http_accept_encoding method evaluates a request as the request of a bot if the Accept-Encoding header does not contain gzip AND deflate (that is, if both values are missing).
Method http_accept_language¶
The http_accept_language method evaluates a request as the request of a bot if the Accept-Language header is unset.
Method http_connection¶
The http_connection method evaluates a request as the request of a bot if the Connection header is set to close.
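The four header probes described so far can be sketched as one function. This is an illustration, not the code of the individual methods: `probe_headers` is a hypothetical helper that works on a plain, case-sensitive mapping, returns the name of the first probe that fails, and treats an empty Accept-Language value like an unset one.

```python
from typing import Mapping, Optional


def probe_headers(headers: Mapping[str, str]) -> Optional[str]:
    """Hypothetical helper: return the name of the first failing probe, or None."""
    # http_accept: browsers ask for text/html when loading a page.
    if "text/html" not in headers.get("Accept", ""):
        return "http_accept"

    # http_accept_encoding: fails only if gzip AND deflate are both missing.
    encodings = [
        item.split(";")[0].strip()  # drop q-values such as "gzip;q=1.0"
        for item in headers.get("Accept-Encoding", "").split(",")
    ]
    if "gzip" not in encodings and "deflate" not in encodings:
        return "http_accept_encoding"

    # http_accept_language: the header must be set.
    if not headers.get("Accept-Language", "").strip():
        return "http_accept_language"

    # http_connection: an explicit "close" is rated as bot behaviour.
    if headers.get("Connection", "").strip().lower() == "close":
        return "http_connection"

    return None
```

A typical browser request passes all four probes, while a bare HTTP client that sends only `Accept: */*` already fails the first one.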
Method http_user_agent¶
The http_user_agent method evaluates a request as the request of a bot if the User-Agent header is unset or matches the regular expression USER_AGENT.
- searx.botdetection.http_user_agent.USER_AGENT = '(unknown|[Cc][Uu][Rr][Ll]|[wW]get|Scrapy|splash|JavaFX|FeedFetcher|python-requests|Go-http-client|Java|Jakarta|okhttp|HttpClient|Jersey|Python|libwww-perl|Ruby|SynHttpClient|UniversalFeedParser|Googlebot|GoogleImageProxy|bingbot|Baiduspider|yacybot|YandexMobileBot|YandexBot|Yahoo! Slurp|MJ12bot|AhrefsBot|archive.org_bot|msnbot|MJ12bot|SeznamBot|linkdexbot|Netvibes|SMTBot|zgrab|James BOT|Sogou|Abonti|Pixray|Spinn3r|SemrushBot|Exabot|ZmEu|BLEXBot|bitlybot|Mozilla/5\\.0\\ \\(compatible;\\ Farside/0\\.1\\.0;\\ \\+https://farside\\.link\\)|.*PetalBot.*)'¶
Regular expression that matches the User-Agent of known bots.
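A short sketch of how such an expression can be applied. It uses a shortened, illustrative subset of USER_AGENT, and it assumes that the expression is applied with `re.match`, i.e. anchored at the start of the header value; `USER_AGENT_SUBSET` and `is_bot_user_agent` are hypothetical names.

```python
import re
from typing import Optional

# Shortened, illustrative subset of the USER_AGENT expression.
USER_AGENT_SUBSET = (
    r"(unknown|[Cc][Uu][Rr][Ll]|[wW]get|Scrapy|python-requests"
    r"|Go-http-client|Googlebot|bingbot|.*PetalBot.*)"
)
_pattern = re.compile(USER_AGENT_SUBSET)


def is_bot_user_agent(user_agent: Optional[str]) -> bool:
    """Hypothetical helper: True if the header is unset/empty or looks like a known bot."""
    if not user_agent:
        return True
    # Assumption: re.match, so alternatives only match at the start of the value,
    # except those that begin with ".*" such as ".*PetalBot.*".
    return _pattern.match(user_agent) is not None
```

Under the `re.match` assumption, alternatives without a leading `.*` only catch clients whose User-Agent starts with that name; with `re.search` they would match anywhere in the value.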