49 Commits

Author SHA1 Message Date
Mike Fährmann
a383eca7f6 decouple extractor initialization
Introduce an 'initialize()' function that does the actual init
(session, cookies, config options) and can be called separately from
the constructor __init__().

This makes it possible, for example, to adjust config access inside a Job
before most of it has already happened when calling 'extractor.find()'.
2023-07-25 22:16:16 +02:00
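
The split described in this commit can be sketched roughly as follows. This is a minimal illustration and not the project's actual code: the class layout, the `config` dictionary, and the `session` dictionary are assumptions made for the example.

```python
class Extractor:
    """Hypothetical, simplified extractor (not the project's real class)."""

    def __init__(self, url):
        # The constructor only stores cheap state; no session is built yet.
        self.url = url
        self.config = {"user-agent": "default"}  # assumed config storage
        self.session = None

    def initialize(self):
        # The actual init step reads the config and builds the session.
        # It can be called separately from __init__().
        self.session = {"User-Agent": self.config["user-agent"]}


# A caller (for example a job) can adjust settings between the two steps.
extr = Extractor("https://example.org/gallery/1")
extr.config["user-agent"] = "custom"
extr.initialize()
print(extr.session)
```

Because the session is only built in `initialize()`, the adjusted value is the one that ends up being used.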
Mike Fährmann
a996d936d2 [imagefap] fix pagination (#3013) 2023-07-18 17:56:33 +02:00
Mike Fährmann
2dfd4a3de2 [imagefap] extract 'categories' metadata and fix empty 'tags' 2023-04-17 14:49:50 +02:00
Mike Fährmann
02ec5bb8e5 [imagefap] extract 'description' metadata (#3905) 2023-04-16 17:02:16 +02:00
Mike Fährmann
dd884b02ee replace json.loads with direct calls to JSONDecoder.decode 2023-02-09 15:22:00 +01:00
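
As a rough illustration of this change: a `json.JSONDecoder` instance can be created once and its `decode` method reused, which skips the keyword-argument handling that `json.loads` performs on every call. The name `json_loads` below is chosen for the example and is an assumption.

```python
import json

# Create one decoder up front and reuse its bound decode() method.
json_loads = json.JSONDecoder().decode  # name chosen for illustration

data = json_loads('{"gallery_id": 123, "tags": ["a", "b"]}')
print(data["gallery_id"], data["tags"])
```

For plain calls without extra keyword arguments, the result should match what `json.loads` returns.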
Mike Fährmann
137a395ae0 [imagefap] fix infinite pagination loop (#3594) 2023-01-31 19:21:43 +01:00
Mike Fährmann
3c708ade8f [imagefap] fix metadata extraction 2023-01-31 15:38:55 +01:00
Mike Fährmann
17e24eacf0 [imagefap] update 'gallery' URLs (#3595) 2023-01-31 15:33:35 +01:00
Mike Fährmann
4833ec323e [imagefap] add 'folder' extractor (#3504) 2023-01-08 16:57:31 +01:00
Mike Fährmann
cbaeee9533 [imagefap] warn about redirects to '/human-verification' (#1140) 2023-01-07 13:04:42 +01:00
Mike Fährmann
435de1329a [imagefap] use default delay between requests (#1140) 2023-01-07 12:59:09 +01:00
Mike Fährmann
b0cb4a1b9c replace 'text.extract()' with 'text.extr()' where possible 2022-11-05 01:14:09 +01:00
Mike Fährmann
bc9d291c13 [imagefap] fix and improve folder extraction (#3013) 2022-10-08 15:41:39 +02:00
Mike Fährmann
55fca5fe4b [imagefap] fix and improve gallery pagination (#3013) 2022-10-08 15:41:39 +02:00
Mike Fährmann
c6a9bab019 update extractor test results 2022-07-12 15:49:22 +02:00
Mike Fährmann
47a780942c update extractor test results 2021-09-03 19:36:12 +02:00
Mike Fährmann
bd08ee2859 remove most 'yield Message.Version' statements
only leave them in oauth.py as noop results
2021-08-16 03:10:48 +02:00
Mike Fährmann
968d3e8465 remove '&' from URL patterns
'/?&#' -> '/?#' and '?&#' -> '?#'

According to https://www.ietf.org/rfc/rfc3986.txt, URLs are
"organized hierarchically" by using "the slash ("/"), question
mark ("?"), and number sign ("#") characters to delimit components"
2020-10-22 23:31:25 +02:00
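
The effect of dropping '&' can be shown with a small sketch. It assumes the character sets appear inside negated character classes such as `[^/?&#]+`; the URL and both patterns are made up for the example.

```python
import re

# Hypothetical patterns: the old one also treats '&' as a delimiter.
old = re.compile(r"https?://example\.org/gallery/([^/?&#]+)")
new = re.compile(r"https?://example\.org/gallery/([^/?#]+)")

url = "https://example.org/gallery/a&b?page=2#top"

print(old.match(url).group(1))  # stops at '&'
print(new.match(url).group(1))  # stops at '?', the next RFC 3986 delimiter
```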
Mike Fährmann
2ecf1efb16 update extractor test results
- tumblr: remove deleted post
- jaiminisbox: replace removed manga/chapters
- smugmug: one inconsequential field got removed
2020-07-18 15:12:28 +02:00
Mike Fährmann
1afb91363c [imagefap] generalize URL patterns and add tests (#552) 2020-01-02 14:26:18 +01:00
Xope Totec
f701e9f33a Handle beta.imagefap.com URLs (#552) 2020-01-02 14:22:00 +01:00
Mike Fährmann
dcaa3d01bd [imagefap] adapt to new image URL format 2019-11-30 23:48:02 +01:00
Mike Fährmann
108963d138 [imagefap] include Referer headers 2019-06-24 21:31:29 +02:00
Mike Fährmann
5530871b5a change results of text.nameext_from_url()
Instead of getting a complete 'filename' from a URL and splitting that
into 'name' and 'extension', the new approach gets rid of the complete
version and renames 'name' to 'filename'. (Using anything other than
{extension} for a filename extension doesn't really work anyway)

Example: "https://example.org/path/filename.ext"

before:
- filename : filename.ext
- name     : filename
- extension: ext

now:
- filename : filename
- extension: ext
2019-02-14 16:07:17 +01:00
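
The new result layout can be sketched with a stand-in function. This is not the library's implementation of text.nameext_from_url(); the function name `nameext_sketch` and its handling of edge cases are assumptions for illustration.

```python
from urllib.parse import unquote, urlsplit


def nameext_sketch(url, data=None):
    """Hypothetical stand-in that fills 'filename' and 'extension'."""
    if data is None:
        data = {}
    # Take the last path segment, ignoring query string and fragment.
    name = unquote(urlsplit(url).path.rpartition("/")[2])
    stem, _, ext = name.rpartition(".")
    if stem:
        data["filename"] = stem
        data["extension"] = ext.lower()
    else:
        # No usable dot: the whole segment is the filename.
        data["filename"] = name
        data["extension"] = ""
    return data


print(nameext_sketch("https://example.org/path/filename.ext"))
```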
Mike Fährmann
61741d7333 provide type information for Queue messages
Child extractors are now directly constructed with Extractor.from_url()
if the extractor class is known beforehand, instead of using
extractor.find() and searching through all possible extractor classes.
2019-02-12 21:32:32 +01:00
Mike Fährmann
4b1880fa5e propagate 'match' to base extractor constructor 2019-02-11 13:31:10 +01:00
Mike Fährmann
6284731107 simplify extractor constants
- single strings for URL patterns
- tuples instead of lists for 'directory_fmt' and 'test'
- single-tuple tests where applicable
2019-02-08 13:45:40 +01:00
Mike Fährmann
34bab080ae rewrite URL patterns to use only 1 per extractor 2019-02-08 12:03:10 +01:00
Mike Fährmann
7f6a0be982 adjust some tests 2018-11-15 22:50:04 +01:00
Mike Fährmann
c69150f715 [imagefap] fix extraction
also adds tags to gallery-metadata and converts suitable values to int
2018-10-20 18:32:25 +02:00
Mike Fährmann
34b556922d update/restore tests 2018-08-23 15:47:40 +02:00
Mike Fährmann
188e956c4e [imagefap] use HTTPS + update test results 2018-06-30 19:40:46 +02:00
Mike Fährmann
cc36f88586 rename safe_int to parse_int; move parse_* to text module 2018-04-20 14:53:21 +02:00
Mike Fährmann
34873dbd90 set 'archive_fmt' values
These are going to be used to create a unique ID for each image.
2018-02-01 15:30:49 +01:00
Mike Fährmann
035ef655f1 [imagefap] update unit tests
old gallery/image has been deleted
2017-10-27 12:22:16 +02:00
Mike Fährmann
81a7788b40 replace space characters in unit test URLs 2017-10-23 17:00:53 +02:00
Mike Fährmann
26a866e7d8 implement (sub)category-transfer between extractors (#41)
ImageFap- and all Manga-Extractors will transfer their (sub)category
values to other extractors instantiated by them, which will in turn
allow those to use options set for their parents.

Example:
ImagefapGalleryExtractors will use options set under
extractor.imagefap.user, if (and only if) they have been instantiated by
an ImagefapUserExtractor; and options from extractor.imagefap.gallery
otherwise.
2017-09-26 21:05:11 +02:00
Mike Fährmann
9fc1d0c901 implement and use 'util.safe_int()'
same as Python's 'int()', except it doesn't raise any exceptions and
accepts a default value
2017-09-24 15:59:25 +02:00
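
A helper with the behavior described here might look like the sketch below. The exact signature and the set of caught exceptions are assumptions, not taken from the project.

```python
def safe_int(value, default=0):
    """Like int(), but return 'default' instead of raising."""
    try:
        return int(value)
    except (ValueError, TypeError, OverflowError):
        # ValueError: non-numeric string; TypeError: None and similar;
        # OverflowError: float infinity.
        return default


print(safe_int("42"), safe_int("abc"), safe_int(None, -1))
```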
Mike Fährmann
0dedbe759c enable '--chapter-filter'
The same filter infrastructure that can be applied to image URLs now
also works for manga chapters and other delegated URLs.

TODO: actually provide any metadata (currently only deviantart and
imagefap are supported).
2017-09-12 16:19:00 +02:00
Mike Fährmann
6f30cf4c64 change keyword names to valid Python identifiers
This commit mostly replaces all minus-signs ('-') in keyword names with
underscores ('_') to allow them to be used in filter-expressions. For
example 'gallery-id' got renamed to 'gallery_id'.

(It is theoretically possible to access any variable, regardless of its
name, with 'locals()["NAME"]', but that seems a bit too convoluted if
just 'NAME' could be enough)
2017-09-10 22:20:47 +02:00
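
Why the rename matters for filter expressions can be seen in a small sketch. It assumes filters are evaluated as Python expressions against the keyword dictionary; the helper names are hypothetical, and `eval()` is used only for illustration.

```python
def matches(expression, metadata):
    # NOTE: eval() on untrusted input is unsafe; it is used here only to
    # show how names inside an expression are resolved.
    return bool(eval(expression, {"__builtins__": {}}, metadata))


def fails_with_name_error(expression, metadata):
    try:
        matches(expression, metadata)
    except NameError:
        return True
    return False


new_style = {"gallery_id": 42}
old_style = {"gallery-id": 42}

# 'gallery_id' is a valid identifier and resolves directly.
print(matches("gallery_id > 10", new_style))
# 'gallery-id' parses as 'gallery - id', so the lookup of 'gallery' fails.
print(fails_with_name_error("gallery-id > 10", old_style))
```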
Mike Fährmann
43e3bb24ae [imagefap] don't rely on image-count
(fixes #9)
2017-03-09 20:34:39 +01:00
Mike Fährmann
94e10f249a code adjustments according to pep8 nr2 2017-02-01 00:53:19 +01:00
Mike Fährmann
56d810c896 update keyword hashes for tests 2016-09-25 17:28:46 +02:00
Mike Fährmann
19c2d4ff6f remove explicit (sub)category keywords 2016-09-25 14:22:07 +02:00
Mike Fährmann
d7e168799d consistent extractor naming scheme + docstrings 2016-09-12 10:34:31 +02:00
Mike Fährmann
fa14ef17ea [imagefap] deal with long filenames 2016-08-11 15:50:32 +02:00
Mike Fährmann
5a5b47e77a [imagefap] add user extractor 2016-08-10 12:54:18 +02:00
Mike Fährmann
d9c5b7a102 [imagefap] add single-image extractor 2016-08-10 10:27:32 +02:00
Mike Fährmann
dac796879a [imagefap] add extractor 2016-08-09 14:05:12 +02:00