bing: extract the original URL instead of Bing's tracking URL

Bing returns URLs like
```
https://www.bing.com/ck/a?!&&p=7b6f95ee4bc34febe56210eec479fa7a84a991257e9773fda5e753ff482f9068JmltdHM9MTY1MzE1MDgzNSZpZ3VpZD0yYTZkNWQ4Yi05MDcwLTRkOGEtYWRmNi1jNWI2M2Y1NjJlOGQmaW5zaWQ9NTE1NA&ptn=3&fclid=ce60cbc5-d923-11ec-b22d-0e153102d4e8&u=a1aHR0cHM6Ly9kb2NzLnNlYXJ4bmcub3JnLw&ntb=1
```
to track clicks. Looking at the HTML source, I found that Bing stores the original
URL in the "cite" element. Let's use that instead of the tracking URL.
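
For illustration only: in the sample URL above, the `u` parameter looks like the prefix `a1` followed by the base64url-encoded original URL. This is an observation about that one sample, not a documented Bing format, and it is not the approach this commit takes. A minimal sketch, assuming that layout holds (`decode_bing_tracking_url` is a hypothetical helper name, and the sample URL is shortened):

```python
import base64
from urllib.parse import parse_qs, urlsplit


def decode_bing_tracking_url(tracking_url):
    """Return the URL hidden in the `u` parameter, or None if it is absent.

    Assumption: `u` is "a1" followed by unpadded base64url data, as seen
    in the sample URL from the commit message.
    """
    params = parse_qs(urlsplit(tracking_url).query)
    values = params.get("u")
    if not values or not values[0].startswith("a1"):
        return None
    encoded = values[0][2:]
    # Restore the "=" padding that base64 decoding requires.
    encoded += "=" * (-len(encoded) % 4)
    return base64.urlsafe_b64decode(encoded).decode("utf-8")


# Shortened version of the tracking URL quoted above.
sample = "https://www.bing.com/ck/a?!&&ptn=3&u=a1aHR0cHM6Ly9kb2NzLnNlYXJ4bmcub3JnLw&ntb=1"
print(decode_bing_tracking_url(sample))
```

Reading the "cite" element avoids depending on this encoding, which Bing could change at any time.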
Denis Shaposhnikov 2022-05-21 18:33:44 +02:00
parent 61535a4c20
commit 1ceacf5fe9


@@ -96,7 +96,10 @@ def response(resp):
     for result in eval_xpath(dom, '//li[@class="b_algo"]'):
         link = eval_xpath(result, './/h2/a')[0]
-        url = link.attrib.get('href')
+        # url = link.attrib.get('href')
+        # href attr is encoded by bing and directs back to bing for tracking
+        # instead of original. Lets extract original URL.
+        url = extract_text(eval_xpath(result, './/div[@class="b_attribution"]/cite'))
         title = extract_text(link)
         content = extract_text(eval_xpath(result, './/p'))