40 Commits

Author SHA1 Message Date
Mike Fährmann
e006d26c8e Revert "use f-strings when building 'pattern'"
revert d7c97d5a97.
2025-12-20 22:07:37 +01:00
Mike Fährmann
968597a302 yield 3-tuples for Message.Directory
adapt tuples to the same length and semantics as other messages
2025-12-05 21:39:52 +01:00
Mike Fährmann
d7c97d5a97 use f-strings when building 'pattern'
2025-10-20 21:23:11 +02:00
Mike Fährmann
9bf76c1352 replace 'util.re()' with 'text.re()'
remove unnecessary 'util' imports
2025-10-20 17:44:58 +02:00
Mike Fährmann
6c71b279b6 [dt] update 'parse_datetime' calls with one argument
2025-10-17 22:49:41 +02:00
Mike Fährmann
085616e0a8 [dt] replace 'text.parse_datetime()' & 'text.parse_timestamp()'
2025-10-17 17:43:06 +02:00
Mike Fährmann
e5c91d33ec [blogger] fix video extraction (#7892)
2025-07-30 16:45:29 +02:00
Mike Fährmann
b0580aba86 update 'match.lastindex' usage
2025-06-18 20:24:13 +02:00
Mike Fährmann
41191bb60a 'match.group(N)' -> 'match[N]' (#7671)
2.5x faster
2025-06-18 13:05:58 +02:00
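A minimal sketch of the equivalence behind this change: indexing a match object (supported since Python 3.6) returns the same value as calling `.group()`. The pattern and URL below are illustrative assumptions, not taken from the extractor, and the sketch makes no attempt to reproduce the speed figure quoted in the commit.

```python
import re

# Hypothetical pattern and URL, for illustration only.
pattern = re.compile(r"(?:https?://)?([\w-]+)\.blogspot\.com/(\d+)")
match = pattern.match("https://example.blogspot.com/2025")

# Old style: explicit method call.
subdomain_old = match.group(1)

# New style: subscript access, same result.
subdomain_new = match[1]
year = match[2]
```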
Mike Fährmann
e08ec7e083 update copyright notices
2025-06-13 00:03:41 +02:00
Mike Fährmann
56ea27c474 [blogger] move original/s0 URL code into a separate function
2025-06-12 17:07:56 +02:00
Mike Fährmann
b5c88b3d3e replace standard library 're' uses with 'util.re()'
2025-06-06 13:24:52 +02:00
Mike Fährmann
88ba85d285 [blogger] use default API key when 'api-key' is empty
… and not just when 'api-key' is not set.
2024-11-20 16:02:16 +01:00
Mike Fährmann
3194bcbccc [blogger] remove 'micmicidol.club'
2024-10-10 14:23:58 +02:00
Wiiplay123
6eb62f2140 Combine lh*(-**).googleusercontent.com URL regex into one line.
Co-authored-by: Mike Fährmann <mike_faehrmann@web.de>
2024-01-20 15:53:11 -06:00
Wiiplay123
a6fed628dd [blogger] Fix lh*.googleusercontent.com forward slash bug, add support for lh*-**.googleusercontent.com
Some URLs use "lh(number)-(locale).googleusercontent.com" format, so I added support for those.

Also, "lh(number).googleusercontent.com" formats were broken because the regex was looking for a second forward slash.

Examples:
lh7.googleusercontent.com
lh7-us.googleusercontent.com
2024-01-20 15:07:52 -06:00
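The two host formats named in this commit can be covered by one expression. The sketch below is an assumption about what such a combined pattern might look like; it is not the regex used in the extractor, and `host_re` is a hypothetical name.

```python
import re

# Assumed pattern: "lh" + digits, then an optional "-<locale>" part.
host_re = re.compile(r"lh\d+(?:-\w+)?\.googleusercontent\.com$")

hosts = [
    "lh7.googleusercontent.com",      # plain numbered host
    "lh7-us.googleusercontent.com",   # numbered host with locale suffix
    "example.googleusercontent.com",  # unrelated subdomain, should not match
]
matches = [bool(host_re.match(host)) for host in hosts]
```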
Mike Fährmann
e17a48fe56 [blogger] inherit from BaseExtractor
- support www.micmicidol.club (#4759)
2023-11-21 16:52:25 +01:00
Mike Fährmann
27ec653991 fix bug in test_init and update example URLs
2023-09-14 13:27:03 +02:00
Mike Fährmann
a453335a9f remove test results in extractor modules
and add generic example URLs
2023-09-11 16:30:55 +02:00
Mike Fährmann
a383eca7f6 decouple extractor initialization
Introduce an 'initialize()' function that does the actual init
(session, cookies, config options) and can be called separately from
the constructor __init__().

This makes it possible, for example, to adjust config access inside
a Job before most of it has already happened in 'extractor.find()'.
2023-07-25 22:16:16 +02:00
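The split between construction and initialization described in this commit can be sketched as follows. The class below is a hypothetical stand-in, not the project's actual `Extractor`; its attribute names and the `config` argument are assumptions made for illustration.

```python
class Extractor:
    """Hypothetical stand-in showing a two-phase setup."""

    def __init__(self, url):
        # Cheap setup only: store the URL, touch no config or network.
        self.url = url
        self.config = {}
        self.session = None
        self.initialized = False

    def initialize(self, config=None):
        # Deferred setup (session, cookies, config options).
        self.config = dict(config or {})
        self.session = object()  # placeholder for a real HTTP session
        self.initialized = True


extr = Extractor("https://example.org/")

# A job can adjust configuration here, before initialization reads it.
job_config = {"api-key": "hypothetical-key"}
extr.initialize(job_config)
```

Because the constructor no longer reads configuration, the caller decides when that happens.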
Mike Fährmann
0ad59c92b1 [blogger] download files from 'lh*.googleusercontent.com' (4070)
2023-05-28 19:58:20 +02:00
enduser420
bbb1e34c34 [blogger] update sub regex
2023-04-03 12:43:58 +05:30
Mike Fährmann
dd884b02ee replace json.loads with direct calls to JSONDecoder.decode
2023-02-09 15:22:00 +01:00
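A minimal sketch of the substitution named in this commit, using only the standard library: a single `json.JSONDecoder` instance is created once and its bound `decode` method is called directly. The names `_decoder` and `json_loads` and the sample payload are assumptions for illustration.

```python
import json

# One shared decoder; its bound method replaces json.loads() calls.
_decoder = json.JSONDecoder()
json_loads = _decoder.decode

# Hypothetical payload, for illustration only.
payload = '{"kind": "blogger#post", "id": "123"}'
data = json_loads(payload)
```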
Mike Fährmann
b0cb4a1b9c replace 'text.extract()' with 'text.extr()' where possible
2022-11-05 01:14:09 +01:00
Mike Fährmann
d699310fdf [blogger] add 'label' or 'query' metadata fields (#2930)
for '/search/label/…' or '/search?q=…' URLs
2022-09-20 11:37:39 +02:00
Mike Fährmann
eef50c1f28 [blogger] split 'search' extractor (#2930)
2022-09-19 21:01:21 +02:00
Mike Fährmann
5038893cdd [blogger] emit metadata for posts without files (#2789)
2022-07-29 13:38:39 +02:00
Mike Fährmann
c6a9bab019 update extractor test results
2022-07-12 15:49:22 +02:00
Mike Fährmann
698f35215e [blogger] support new image domain (fixes #2204)
2022-01-20 23:13:07 +01:00
Vrihub
96fcff182c generic extractor (#735)
* Generic extractor, see issue #683

* Fix failed test_names test, no subcategory needed

* Prefix directory_fmt with "generic"

* Relax regex (would break some urls)

* Flake8 compliance

* pattern: don't require a scheme

This fixes a bug when we force the generic extractor on urls without a
scheme (that are allowed by all other extractors).

* Fix using g: and r: on urls without http(s) scheme

Almost all extractors accept urls without an initial http(s) scheme.

Many extractors also allow for generic subdomains in their "pattern"
variable; some of them implement this with the regex character class
"[^.]+" (everything but a dot).

This leads to a problem when the extractor is given a url starting
with g: or r: (to force using the generic or recursive extractor)
and without the http(s) scheme: e.g. with "r:foobar.tumblr.com"
the "r:" is wrongly considered part of the subdomain.

This commit fixes the bug, replacing the too generic "[^.]+" with the
more specific "[\w-]+" (letters, digits and "-", the only characters
allowed in domain names), which is already used by some extractors.

* Relax imageurl_pattern_ext: allow relative urls

* First round of small suggested changes

* Support image urls starting with "//"

* self.baseurl: remove trailing slash

* Relax regexp (didn't catch some image urls)

* Some fixes and cleanup

* Fix domain pattern; option to enable extractor

Fixed the domain section for "pattern", to pass "test_add" and
"test_add_module" tests.
Added the "enabled" configuration option (default False) to enable the
generic extractor. Using "g(eneric):URL" forces using the extractor.
2021-12-29 22:39:29 +01:00
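The "r:" prefix problem described in this commit can be reproduced with two small patterns. Both are simplified assumptions modelled on the character classes quoted in the message, not the actual extractor patterns.

```python
import re

url = "r:foobar.tumblr.com"

# Too generic: "[^.]+" accepts any character except a dot, including ":".
loose = re.compile(r"(?:https?://)?([^.]+)\.tumblr\.com")

# More specific: only letters, digits, "_" and "-" may form the subdomain.
strict = re.compile(r"(?:https?://)?([\w-]+)\.tumblr\.com")

loose_match = loose.match(url)    # matches, capturing "r:foobar"
strict_match = strict.match(url)  # no match: ":" is not allowed

# Without the prefix, the strict pattern still accepts the URL.
plain_match = strict.match("foobar.tumblr.com")
```

With the stricter class, the prefixed URL no longer matches a site-specific pattern, so the prefix can be handled separately.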
Mike Fährmann
bd08ee2859 remove most 'yield Message.Version' statements
only leave them in oauth.py as noop results
2021-08-16 03:10:48 +02:00
Mike Fährmann
968d3e8465 remove '&' from URL patterns
'/?&#' -> '/?#' and '?&#' -> '?#'

According to https://www.ietf.org/rfc/rfc3986.txt, URLs are
"organized hierarchically" by using "the slash ("/"), question
mark ("?"), and number sign ("#") characters to delimit components"
2020-10-22 23:31:25 +02:00
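The effect of dropping '&' from a negated character class can be sketched as below. The host, path, and both patterns are hypothetical; only the '/?&#' and '/?#' classes come from the commit message.

```python
import re

# Hypothetical URL whose path segment contains an ampersand.
url = "https://example.org/user/tom&jerry?page=2#top"

# Old class: "&" wrongly ends the path segment.
old = re.compile(r"https?://example\.org/user/([^/?&#]+)")

# New class: only "/", "?" and "#" delimit URL components.
new = re.compile(r"https?://example\.org/user/([^/?#]+)")

old_name = old.match(url)[1]
new_name = new.match(url)[1]
```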
Mike Fährmann
6491db3eaf [blogger] handle URLs with specified width/height (closes #1061)
get highest quality for images with
/wXXX-hXXX/ instead of the usual /sXXX/
2020-10-15 15:14:18 +02:00
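One way to handle both size formats is a single substitution, sketched below. The regex, the helper name `original_url`, and the sample URLs are assumptions; the "/s0/" target is taken from the "original/s0 URL" wording of a later commit in this log, and the extractor's real code may differ.

```python
import re

# Assumed pattern: either "/sXXX/" or "/wXXX-hXXX/" as a path segment.
size_re = re.compile(r"/(?:s\d+|w\d+-h\d+)/")


def original_url(url):
    """Replace the first size segment with "/s0/" (hypothetical helper)."""
    return size_re.sub("/s0/", url, count=1)


by_size = original_url("https://1.bp.blogspot.com/abc/s1600/image.jpg")
by_width_height = original_url("https://1.bp.blogspot.com/abc/w400-h300/image.jpg")
```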
Mike Fährmann
2b88c90f6f [blogger] add search extractor (#925)
2020-08-06 19:43:39 +02:00
Mike Fährmann
aa64149583 [blogger] support searching posts by labels (closes #925)
2020-08-04 22:49:37 +02:00
Mike Fährmann
453f3bc519 [blogger] improve error messages for missing posts/blogs (#903)
2020-07-22 23:51:48 +02:00
Mike Fährmann
d6a480682f update test results
2020-05-02 21:13:00 +02:00
Mike Fährmann
4e361b3008 add tests for specific datetime values
2020-02-23 16:48:30 +01:00
Mike Fährmann
6703b8a86b [blogger] implement video extraction (closes #587)
2020-01-24 23:37:23 +01:00
Mike Fährmann
109718a5e3 [blogger] add blog and post extractors (closes #364)
2019-10-26 14:15:55 +02:00