* [enh] engines: rework bing engine
Only Bing-Web has been reworked.
Some features now require JavaScript (paging and time-range results).
Cookies no longer work, parameters such as `cc`, `ui`, ... alter the results.
The engine only appears to use the locale from `Accept-Language` header properly.
The rest of Bing's child engines (Bing-Image, Bing-Video, ...) seem to benefit
from using `mkt` param in conjunction with the `Accept-Language` header
override, although Bing-Web does not (?)
* [enh] explicit mkt
* [fix] engines: bing_videos.py
https://github.com/searxng/searxng/pull/5793#pullrequestreview-3881883250
Remove |safe filter from 6 template locations where data from external
search engine APIs was rendered as raw HTML without sanitization. Jinja2
autoescape now properly escapes these fields.
The |safe filter was originally added in commit 213041adc (March 2021)
by copying the pattern from result.title|safe and result.content|safe.
However, title and content are pre-escaped via escape() in webapp.py
lines 704-706 before highlight_content() adds trusted <span> tags for
search term highlighting. The metadata, info.value, link.url_label,
repository, and filename fields never go through any escaping and flow
directly from external API responses to the template.
Affected templates and their untrusted data sources:
- macros.html: result.metadata from DuckDuckGo, Reuters, Presearch,
Podcast Index, Fyyd, bpb, moviepilot, mediawiki, and others
- paper.html: result.metadata from academic search engines
- map.html: info.value and link.url_label from OpenStreetMap
user-contributed extratags
- code.html: result.repository and result.filename from GitHub API
Example exploit: a search engine API returning
metadata='<img src=x onerror=alert(document.cookie)>' would execute
arbitrary JavaScript in every user's browser viewing that result.
Since about a month, the website just says "temporarily unavailable", so it's safe to assume that it's just no longer working
Related:
- https://github.com/searxng/searxng/pull/3798
Google recently changed the DOM structure for mobile-centric responses, causing the `google_videos` engine to return zero results and the main `google` engine to drop the majority of its results (due to missing snippets or failed URL parsing). These changes restore the functionality and improve the result count for both engines.
This patch updates the parsing logic for both the `google` and `google_videos` engines to handle the modern HTML structure returned by Google when using GSA (Google Search App) User-Agents.
**Specific changes include:**
* **Google Videos (`gov`)**:
* Updated title XPath to support `role="heading"`.
* Improved URL extraction to correctly decode Google redirectors (`/url?q=...`) using `unquote`.
* Added support for the `WRu9Cd` class to capture publication metadata (author/date).
* Broadened thumbnail search and added a fallback to YouTube's `hqdefault.jpg`.
* **Google Web**:
* Relaxed the strict snippet (`content`) requirement. Valid results are no longer discarded if a snippet is missing in the mobile UI.
* Hardened URL extraction to handle both direct and redirected URLs safely.
* Improved thumbnail extraction by searching the entire result block.
Removes the `fasttext-predict` dependency and the language detection code.
If a user now selects `auto` for the search language, the detected language now
falls back directly to the `Accept-Language` header sent by the browser (which was already the fallback when fasttext returned no result).
- fasttext's [language detection is unreliable](https://github.com/searxng/searxng/issues/4195) for some languages, especially short search queries, and in particular for queries containing proper names which is a common case.
- `fasttext-predict` consumes [significant memory](https://github.com/searxng/searxng/pull/1969#issuecomment-1345366676) without offering users much real value.
- the upstream fasttext project was archived by Meta in 2024
- users already have two better alternatives: the `Accept-Language` header and the search-syntax language prefix (e.g. `:fr` or `:de`).
Related: https://github.com/searxng/searxng/issues/4195
Closes: https://github.com/searxng/searxng/issues/5790
The online engines emulate a request as it would come from a web browser, which
is why the HTTP headers in the default settings should also be set the way a
standard web browser would set them.
Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
The default settings for the suspend times were previously 24 hours and 3 hours,
respectively. Based on my experience, these defaults are too high; most engines
handle suspend times of 3 minutes or 1 hour (captcha) without any problems.
Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
Unit test fails::
Traceback (most recent call last):
File "/share/searxng/local/py3/lib/python3.10/site-packages/parameterized/parameterized.py", line 620, in standalone_func
return func(*(a + p.args), **p.kwargs, **kw)
File "/share/searxng/tests/unit/test_locales.py", line 121, in test_locale_optimized_territory
self.assertEqual(locales.match_locale(locale, locale_list), expected_locale)
AssertionError: 'fr-CH' != 'fr-BE'
- fr-CH
+ fr-BE
With the `babel` update from 2.17.0 to 2.18.0 the population DB has been
updated (the test was implemented for the old values).
Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>