generic extractor (#735)

* Generic extractor, see issue #683

* Fix failed test_names test, no subcategory needed

* Prefix directory_fmt with "generic"

* Relax regex (would break some urls)

* Flake8 compliance

* pattern: don't require a scheme

This fixes a bug that occurs when the generic extractor is forced on
URLs without a scheme (which almost all other extractors accept).

* Fix using g: and r: on urls without http(s) scheme

Almost all extractors accept urls without an initial http(s) scheme.

Many extractors also allow for generic subdomains in their "pattern"
variable; some of them implement this with the regex character class
"[^.]+" (everything but a dot).

This leads to a problem when the extractor is given a url starting
with g: or r: (to force using the generic or recursive extractor)
and without the http(s) scheme: e.g. with "r:foobar.tumblr.com"
the "r:" is wrongly considered part of the subdomain.

This commit fixes the bug, replacing the too generic "[^.]+" with the
more specific "[\w-]+" (word characters, i.e. letters, digits and "_",
plus "-"; this covers every character allowed in domain names), which
is already used by some extractors.
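
The subdomain problem described above can be reproduced in isolation.
The sketch below uses two simplified, hypothetical patterns modeled on
the subdomain part of an extractor "pattern" (they are not copied from
any real extractor) and shows that "[^.]+" swallows the "r:" prefix
while "[\w-]+" rejects it:

```python
import re

# Hypothetical, simplified patterns; only the subdomain class differs.
OLD = re.compile(r"(?:https?://)?((?:[^.]+\.)?tumblr\.com)")
NEW = re.compile(r"(?:https?://)?((?:[\w-]+\.)?tumblr\.com)")

url = "r:foobar.tumblr.com"

# "[^.]+" accepts ":" and therefore treats "r:foobar" as the subdomain.
old_match = OLD.match(url)

# "[\w-]+" stops at ":", so the prefixed URL is not claimed at all.
new_match = NEW.match(url)

# A URL without the prefix is still matched by the new pattern.
plain_match = NEW.match("foobar.tumblr.com")

print(old_match.group(1) if old_match else None)
print(new_match)
print(plain_match.group(1) if plain_match else None)
```

With the old character class, the site extractor claims the URL before
the "r:" prefix can be handled; with the new one, the prefixed URL is
left for the extractor that handles the prefix.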

* Relax imageurl_pattern_ext: allow relative urls

* First round of small suggested changes

* Support image urls starting with "//"

* self.baseurl: remove trailing slash

* Relax regexp (didn't catch some image urls)

* Some fixes and cleanup

* Fix domain pattern; option to enable extractor

Fixed the domain section for "pattern", to pass "test_add" and
"test_add_module" tests.
Added the "enabled" configuration option (default False) to enable the
generic extractor. Using "g(eneric):URL" forces using the extractor.
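
The interaction between the "enabled" option and the forcing prefix
can be pictured roughly as follows. This is only a sketch under stated
assumptions: the names FORCE_PREFIX and use_generic are hypothetical
and are not taken from the extractor's actual code.

```python
import re

# Hypothetical prefix pattern: "g:" or "generic:" at the start of the input.
FORCE_PREFIX = re.compile(r"^g(?:eneric)?:", re.IGNORECASE)


def use_generic(url, enabled=False):
    """Return (should_use_generic, url_without_prefix).

    A forcing prefix always selects the generic extractor; otherwise
    the decision falls back to the "enabled" option (default False).
    """
    match = FORCE_PREFIX.match(url)
    if match:
        # Strip the prefix so the remaining text is the real URL.
        return True, url[match.end():]
    return enabled, url


print(use_generic("g:example.org/page"))
print(use_generic("generic:https://example.org"))
print(use_generic("example.org"))
print(use_generic("example.org", enabled=True))
```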
Vrihub
2021-12-29 22:39:29 +01:00
committed by GitHub
parent 4376b39a2b
commit 96fcff182c
16 changed files with 229 additions and 20 deletions

@@ -21,8 +21,8 @@ class PhotobucketAlbumExtractor(Extractor):
     directory_fmt = ("{category}", "{username}", "{location}")
     filename_fmt = "{offset:>03}{pictureId:?_//}_{titleOrFilename}.{extension}"
     archive_fmt = "{id}"
-    pattern = (r"(?:https?://)?((?:[^.]+\.)?photobucket\.com)"
-               r"/user/[^/?#]+/library(?:/[^?#]*)?")
+    pattern = (r"(?:https?://)?((?:[\w-]+\.)?photobucket\.com)"
+               r"/user/[^/?&#]+/library(?:/[^?&#]*)?")
     test = (
         ("https://s369.photobucket.com/user/CrpyLrkr/library", {
             "pattern": r"https?://[oi]+\d+.photobucket.com/albums/oo139/",
@@ -109,9 +109,9 @@ class PhotobucketImageExtractor(Extractor):
     directory_fmt = ("{category}", "{username}")
     filename_fmt = "{pictureId:?/_/}{titleOrFilename}.{extension}"
     archive_fmt = "{username}_{id}"
-    pattern = (r"(?:https?://)?(?:[^.]+\.)?photobucket\.com"
-               r"(?:/gallery/user/([^/?#]+)/media/([^/?#]+)"
-               r"|/user/([^/?#]+)/media/[^?#]+\.html)")
+    pattern = (r"(?:https?://)?(?:[\w-]+\.)?photobucket\.com"
+               r"(?:/gallery/user/([^/?&#]+)/media/([^/?&#]+)"
+               r"|/user/([^/?&#]+)/media/[^?&#]+\.html)")
     test = (
         (("https://s271.photobucket.com/user/lakerfanryan"
           "/media/Untitled-3-1.jpg.html"), {