21 Commits

Author SHA1 Message Date
Mike Fährmann
a383eca7f6 decouple extractor initialization
Introduce an 'initialize()' function that does the actual init
(session, cookies, config options) and can be called separately from
the constructor __init__().

This makes it possible, for example, to adjust config access inside a Job
before most of it has already happened when calling 'extractor.find()'.
2023-07-25 22:16:16 +02:00
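
The split described in a383eca7f6 can be pictured roughly as follows. This is a minimal sketch, not the project's code: the class `SketchExtractor`, its `config` dict, the `initialized` flag, and the placeholder URL are hypothetical names chosen for illustration.

```python
class SketchExtractor:
    """Hypothetical extractor; all names here are illustrative only."""

    def __init__(self, url):
        # The constructor only records its arguments; no setup happens here.
        self.url = url
        self.config = {}
        self.session = None
        self.initialized = False

    def initialize(self):
        # The actual setup (session, cookies, config options) runs here,
        # so a caller can adjust 'config' between construction and this call.
        self.session = {"cookies": dict(self.config.get("cookies", {}))}
        self.initialized = True


extractor = SketchExtractor("https://example.blogspot.com/")
extractor.config["cookies"] = {"key": "value"}  # adjusted before init
extractor.initialize()
```

Because construction no longer triggers setup, the caller decides when the setup runs and can change settings beforehand.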
Mike Fährmann
0ad59c92b1 [blogger] download files from 'lh*.googleusercontent.com' (#4070) 2023-05-28 19:58:20 +02:00
enduser420
bbb1e34c34 [blogger] update sub regex 2023-04-03 12:43:58 +05:30
Mike Fährmann
dd884b02ee replace json.loads with direct calls to JSONDecoder.decode 2023-02-09 15:22:00 +01:00
Mike Fährmann
b0cb4a1b9c replace 'text.extract()' with 'text.extr()' where possible 2022-11-05 01:14:09 +01:00
Mike Fährmann
d699310fdf [blogger] add 'label' or 'query' metadata fields (#2930)
for '/search/label/…' or '/search?q=…' URLs
2022-09-20 11:37:39 +02:00
Mike Fährmann
eef50c1f28 [blogger] split 'search' extractor (#2930) 2022-09-19 21:01:21 +02:00
Mike Fährmann
5038893cdd [blogger] emit metadata for posts without files (#2789) 2022-07-29 13:38:39 +02:00
Mike Fährmann
c6a9bab019 update extractor test results 2022-07-12 15:49:22 +02:00
Mike Fährmann
698f35215e [blogger] support new image domain (fixes #2204) 2022-01-20 23:13:07 +01:00
Vrihub
96fcff182c generic extractor (#735)
* Generic extractor, see issue #683

* Fix failed test_names test, no subcategory needed

* Prefix directory_fmt with "generic"

* Relax regex (would break some urls)

* Flake8 compliance

* pattern: don't require a scheme

This fixes a bug when we force the generic extractor on urls without a
scheme (that are allowed by all other extractors).

* Fix using g: and r: on urls without http(s) scheme

Almost all extractors accept urls without an initial http(s) scheme.

Many extractors also allow for generic subdomains in their "pattern"
variable; some of them implement this with the regex character class
"[^.]+" (everything but a dot).

This leads to a problem when the extractor is given a url starting
with g: or r: (to force using the generic or recursive extractor)
and without the http(s) scheme: e.g. with "r:foobar.tumblr.com"
the "r:" is wrongly considered part of the subdomain.

This commit fixes the bug, replacing the too generic "[^.]+" with the
more specific "[\w-]+" (letters, digits and "-", the only characters
allowed in domain names), which is already used by some extractors.

* Relax imageurl_pattern_ext: allow relative urls

* First round of small suggested changes

* Support image urls starting with "//"

* self.baseurl: remove trailing slash

* Relax regexp (didn't catch some image urls)

* Some fixes and cleanup

* Fix domain pattern; option to enable extractor

Fixed the domain section for "pattern", to pass "test_add" and
"test_add_module" tests.
Added the "enabled" configuration option (default False) to enable the
generic extractor. Using "g(eneric):URL" forces using the extractor.
2021-12-29 22:39:29 +01:00
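
The subdomain problem described in 96fcff182c can be reproduced with two small regular expressions. The patterns below are hypothetical and only modelled on the commit message; they are not copied from any extractor.

```python
import re

# Hypothetical patterns for illustration only.
loose = re.compile(r"(?:https?://)?([^.]+)\.tumblr\.com")
strict = re.compile(r"(?:https?://)?([\w-]+)\.tumblr\.com")

url = "r:foobar.tumblr.com"

loose_match = loose.match(url)
strict_match = strict.match(url)

# "[^.]+" swallows the "r:" prefix as part of the subdomain.
print(loose_match.group(1))   # r:foobar

# "[\w-]+" cannot cross the ":", so the prefixed URL does not match at all.
print(strict_match)           # None

# Without the prefix, the strict pattern still matches as expected.
plain_match = strict.match("foobar.tumblr.com")
print(plain_match.group(1))   # foobar
```

With the stricter class, a URL carrying a "g:" or "r:" prefix is no longer matched by a site-specific pattern that would misread the prefix as part of the subdomain.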
Mike Fährmann
bd08ee2859 remove most 'yield Message.Version' statements
only leave them in oauth.py as noop results
2021-08-16 03:10:48 +02:00
Mike Fährmann
968d3e8465 remove '&' from URL patterns
'/?&#' -> '/?#' and '?&#' -> '?#'

According to https://www.ietf.org/rfc/rfc3986.txt, URLs are
"organized hierarchically" by using "the slash ("/"), question
mark ("?"), and number sign ("#") characters to delimit components"
2020-10-22 23:31:25 +02:00
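
As a rough illustration of the character class change in 968d3e8465, the sketch below uses a hypothetical pattern with the placeholder domain "example.org". After the host, a URL either ends or continues with one of the three component delimiters "/", "?" or "#".

```python
import re

# Hypothetical pattern; "example.org" is a placeholder domain.
pattern = re.compile(r"(?:https?://)?example\.org(?:$|[/?#])")

urls = (
    "example.org",
    "example.org/path",
    "example.org?q=1",
    "example.org#top",
    "example.org&x=1",   # "&" does not delimit a URL component
    "example.organic",   # different host, must not match
)

# Map each URL to whether the pattern matches it from the start.
results = {url: bool(pattern.match(url)) for url in urls}

for url, matched in results.items():
    print(url, matched)
```

The first four URLs match; the last two do not.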
Mike Fährmann
6491db3eaf [blogger] handle URLs with specified width/height (closes #1061)
get highest quality for images with
/wXXX-hXXX/ instead of the usual /sXXX/
2020-10-15 15:14:18 +02:00
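
The size handling described in 6491db3eaf could look something like the sketch below. The helper `full_size`, the sample URLs, and the use of "s0" as the replacement segment are assumptions made for illustration; the commit message only states that both /sXXX/ and /wXXX-hXXX/ segments are handled.

```python
import re

# Matches either a "/sXXX/" or a "/wXXX-hXXX/" path segment.
size_segment = re.compile(r"/(?:s\d+|w\d+-h\d+)/")


def full_size(url):
    """Replace the first size segment with "/s0/" (assumed full-size marker)."""
    return size_segment.sub("/s0/", url, count=1)


# Placeholder URLs, not taken from a real blog.
by_size = full_size("https://1.bp.blogspot.com/-abc/s320/photo.jpg")
by_width_height = full_size("https://1.bp.blogspot.com/-abc/w640-h480/photo.jpg")

print(by_size)
print(by_width_height)
```

Both calls produce the same URL, since the two segment styles are rewritten to the same replacement.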
Mike Fährmann
2b88c90f6f [blogger] add search extractor (#925) 2020-08-06 19:43:39 +02:00
Mike Fährmann
aa64149583 [blogger] support searching posts by labels (closes #925) 2020-08-04 22:49:37 +02:00
Mike Fährmann
453f3bc519 [blogger] improve error messages for missing posts/blogs (#903) 2020-07-22 23:51:48 +02:00
Mike Fährmann
d6a480682f update test results 2020-05-02 21:13:00 +02:00
Mike Fährmann
4e361b3008 add tests for specific datetime values 2020-02-23 16:48:30 +01:00
Mike Fährmann
6703b8a86b [blogger] implement video extraction (closes #587) 2020-01-24 23:37:23 +01:00
Mike Fährmann
109718a5e3 [blogger] add blog and post extractors (closes #364) 2019-10-26 14:15:55 +02:00