133 Commits

Author SHA1 Message Date
Mike Fährmann
0d406c8daf [common] restrict values used in 'generate_extractors()' 2020-12-11 13:46:47 +01:00
Mike Fährmann
8ca7f54750 rename '_request_…' variables
- remove '_' at the beginning
- _request_last -> request_timestamp
2020-12-05 00:09:15 +01:00
Mike Fährmann
c57a918f4a [e621] implement delay via '_request_interval_min' 2020-11-25 00:19:32 +01:00
Mike Fährmann
1e3dd7330e merge SharedConfigMixin functionality into Extractor 2020-11-17 00:34:07 +01:00
Mike Fährmann
198c33ec36 also collect post processors from 'basecategory' entries
(fixes #1084)
2020-10-27 19:56:48 +01:00
Mike Fährmann
1e313d5b84 implement 'sleep-request' option 2020-09-20 20:28:17 +02:00
Mike Fährmann
055c32e0f7 precompute extractor config paths 2020-09-14 22:06:54 +02:00
Mike Fährmann
231dd4c800 accumulate postprocessor objects (#994)
Instead of one 'postprocessors' setting overwriting all others lower
in the hierarchy, all postprocessors along the config path will now
get collected into one big list.

For example '--mtime-from-date' will therefore no longer cause
other postprocessor settings in a config file to get ignored.
2020-09-14 21:51:55 +02:00
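The accumulation described in this commit message can be sketched roughly as follows. This is an illustrative sketch only, not the project's actual code: the function name `collect_postprocessors`, the config layout, and the ordering (general entries first) are assumptions made for the example.

```python
def collect_postprocessors(config, path):
    """Gather 'postprocessors' entries from every level along `path`.

    Hypothetical helper: instead of letting the most specific level
    overwrite the others, all lists found on the way down are merged.
    """
    collected = []
    node = config
    for key in path:
        node = node.get(key)
        if not isinstance(node, dict):
            break
        collected.extend(node.get("postprocessors", ()))
    return collected


# Assumed example config; the postprocessor entries are placeholders.
config = {
    "extractor": {
        "postprocessors": [{"name": "mtime"}],
        "pixiv": {
            "postprocessors": [{"name": "zip"}],
        },
    },
}

print(collect_postprocessors(config, ["extractor", "pixiv"]))
```

With an overwrite-based lookup only one of the two lists would survive; here both entries end up in the result.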
Mike Fährmann
f6fd449b59 reduce wait time growth rate from exponential to linear
Waiting for 2**N seconds after each error grows too fast.
Simply waiting N seconds seems far more reasonable.
2020-09-06 22:38:25 +02:00
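To make the difference in growth rate concrete, the two schemes named in this commit message can be compared side by side. The function names are made up for the comparison; only the formulas `2**N` and `N` come from the message.

```python
def wait_exponential(n):
    """Previous scheme per the commit message: 2**N seconds after error N."""
    return 2 ** n


def wait_linear(n):
    """New scheme per the commit message: N seconds after error N."""
    return n


# Print both wait times for the first ten consecutive errors.
for n in range(1, 11):
    print(n, wait_exponential(n), wait_linear(n))
```

After the tenth consecutive error the exponential scheme waits 1024 seconds, the linear one 10 seconds.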
Mike Fährmann
2c9766b29f fix UnboundLocalError in Extractor.request()
introduced in d6a271d
2020-08-05 21:52:04 +02:00
Mike Fährmann
d6a271d2c7 add 'response' objects to 'HttpError's 2020-07-30 18:23:26 +02:00
Mike Fährmann
53cc498d9c improve config lookup when there are multiple possible locations
This specifically applies to all Mastodon extractors and all
extractors with a 'basecategory', i.e. 'booru', 'foolslide', etc.

Values inside those general config locations wouldn't be recognized
when a value with the same name was set on the 'extractor' level.

For example 'extractor.mastodon.directory' should be used over
'extractor.directory' when both are set, but this was impossible
with the previous implementation.

(fixes #843)
2020-06-21 00:07:10 +02:00
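The lookup order described in this commit message might look like the sketch below: candidate config locations are tried from most specific to most general, and the first one that defines the key wins. The helper names `find_node` and `lookup` and the config layout are assumptions for illustration, not the project's real interface.

```python
def find_node(config, path):
    """Walk `path` through nested dicts; return None if any part is missing."""
    node = config
    for part in path:
        if not isinstance(node, dict) or part not in node:
            return None
        node = node[part]
    return node if isinstance(node, dict) else None


def lookup(config, paths, key, default=None):
    """Return `key` from the first location in `paths` that defines it."""
    for path in paths:
        node = find_node(config, path)
        if node is not None and key in node:
            return node[key]
    return default


# Assumed example config with the same key set on two levels.
config = {
    "extractor": {
        "directory": ["general"],
        "mastodon": {"directory": ["mastodon"]},
    },
}

# Most specific location first, general 'extractor' level last.
paths = [("extractor", "mastodon"), ("extractor",)]
print(lookup(config, paths, "directory"))
```

Here 'extractor.mastodon.directory' takes precedence over 'extractor.directory', which is the behavior the message asks for.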
Mike Fährmann
1ae1df0d27 update '--write-pages' (#737)
- fix infinite recursion for responses with multiple entries in
  'history'
- hide values of Set-Cookie headers
- only write the response content by default
  (use '-o write-pages=all' to also include HTTP headers)
2020-06-18 15:07:30 +02:00
Mike Fährmann
15c3d29062 move dump_response() into a separate function (#737) 2020-05-25 22:21:58 +02:00
Mike Fährmann
a363da4b43 include redirects and headers in --write-pages dumps (#737) 2020-05-25 22:21:57 +02:00
Mike Fährmann
3201fe3521 add global SENTINEL object 2020-05-19 22:32:53 +02:00
Mike Fährmann
f8f95e68a7 improve '--write-pages' (#737)
- move code into its own function
- add enumeration index to filenames
- dump responses regardless of status code
2020-05-12 20:40:25 +02:00
Vrihub
4cc761c730 Implement --write-pages option (#736)
* Implement --write-pages option

* Fix long lines

* Fix file mode to binary

* Fix pattern for Windows compatibility
2020-05-12 14:25:21 +02:00
Mike Fährmann
5d7ca76885 retry Cloudflare challenges 2020-04-24 22:47:27 +02:00
Mike Fährmann
d02f7c1118 improve Extractor.wait()
- allow 'until' to be a datetime object
- do "time calculations" with UTC timestamps
- set a default 'reason'
2020-04-05 21:23:05 +02:00
Mike Fährmann
2a4f227e08 warn about expired cookies 2020-02-25 00:34:42 +01:00
Mike Fährmann
56f1c96168 implement 'parent-directory' option (#551) 2020-01-29 18:32:37 +01:00
Mike Fährmann
2a9be48511 improve util.load/save_cookiestxt() and add tests
- take a file object as argument instead of a filename
- accept whitespace before comments ("   # comment")
- map expiration "0" to None and not the number 0
2020-01-25 23:02:15 +01:00
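A minimal sketch of a cookies.txt reader with the properties listed in this commit message (file object as input, leading whitespace before comments, expiration "0" mapped to None). The function name `parse_cookiestxt`, the dict-based result, and the handling of the '#HttpOnly_' prefix are assumptions for the example and do not reproduce the project's util functions.

```python
import io


def parse_cookiestxt(fp):
    """Parse Netscape-style cookies.txt lines from an open text file object."""
    cookies = []
    for line in fp:
        line = line.lstrip()  # accept "   # comment"
        if line.startswith("#HttpOnly_"):
            # Assumed handling: treat the prefix as part of a regular entry.
            line = line[len("#HttpOnly_"):]
        if not line or line.startswith("#"):
            continue  # blank line or comment
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) != 7:
            continue  # not a well-formed cookie entry
        domain, subdomains, path, secure, expires, name, value = fields
        cookies.append({
            "domain": domain,
            "path": path,
            "secure": secure == "TRUE",
            # "0" (or an empty field) means no expiration, so use None
            "expires": int(expires) if expires and expires != "0" else None,
            "name": name,
            "value": value,
        })
    return cookies


text = (
    ".example.org\tTRUE\t/\tFALSE\t0\tsession\tabc\n"
    "   # indented comment\n"
    "#HttpOnly_.example.org\tTRUE\t/\tTRUE\t1700000000\tid\txyz\n"
)
cookies = parse_cookiestxt(io.StringIO(text))
print(cookies)
```

Taking a file object rather than a filename lets the same function be exercised with `io.StringIO` in tests, as done here.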
Mike Fährmann
c1a6862863 implement functions to load/save cookies.txt files (closes #586)
The methods of the standard library's MozillaCookieJar have
several shortcomings (#HttpOnly_ cookies, 0 expiration timestamps, etc.)
and require construction of an ultimately pointless CookieJar object.
2020-01-21 21:59:36 +01:00
Mike Fährmann
bd5ce9855c allow GalleryExtractors to set URL-independent extensions 2020-01-14 11:53:32 +01:00
Mike Fährmann
3811fd8a25 fix time formatting for Python 3.4 and 3.5
'datetime.time.isoformat()' only has an optional 'timespec' argument
since Python 3.6.
2020-01-05 00:47:10 +01:00
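For context, one version-independent way to get second-precision output is `strftime()`, which does not need the 'timespec' argument. Whether the commit uses this approach is not stated in the message; the snippet only shows that the two calls agree on Python 3.6 and later.

```python
import datetime

t = datetime.time(12, 34, 56, 789000)

# Works on older Python 3 versions as well: format the fields explicitly.
portable = t.strftime("%H:%M:%S")

# Requires Python 3.6+, where 'timespec' was added to isoformat().
modern = t.isoformat(timespec="seconds")

print(portable, modern)
```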
Mike Fährmann
569747a78d implement extractor.wait() 2020-01-04 23:42:07 +01:00
Mike Fährmann
ce54b8c04c let extractors opt-out of cookie option usage
useful to avoid sending unnecessary cookies when all authentication
is done through OAuth tokens
2020-01-01 21:12:37 +01:00
Mike Fährmann
d3e44e899d raise NotFoundErrors for 404 responses in GalleryExtractors 2019-12-13 18:42:04 +01:00
Mike Fährmann
a4dd8b3dab improve _check_cookies()
Only loop over all cookies once instead of calling
cookiejar._find() for each cookie name.
2019-12-13 15:51:20 +01:00
Mike Fährmann
15f9bb3d14 add option to disable pyOpenSSL usage (#508)
(pyOpenSSL is now disabled by default)
2019-12-08 21:21:00 +01:00
Mike Fährmann
e17907ee2a change default value of 'cookies-update' to 'true' 2019-12-05 23:43:49 +01:00
Mike Fährmann
e2710702d4 fix Cloudflare bypass 2019-12-01 01:07:24 +01:00
Mike Fährmann
ae09f87602 improve SharedConfigMixin config lookups 2019-11-25 18:31:38 +01:00
Mike Fährmann
f5604492c3 update interface of config functions 2019-11-24 00:42:28 +01:00
Mike Fährmann
d45fabb79d match user profile handling on deviantart and newgrounds 2019-11-22 23:20:21 +01:00
Mike Fährmann
1a197d2195 store the original cookiejar as Extractor._cookiejar 2019-11-05 21:53:22 +01:00
Mike Fährmann
de83ae4576 make 'method' argument of Extractor.request keyword-only 2019-11-05 17:28:09 +01:00
Mike Fährmann
d44f790e81 adjust output for HTTP status related errors 2019-10-27 23:55:02 +01:00
Mike Fährmann
389d2d7e38 implement 'cookies-update' option (#445) 2019-10-19 15:23:55 +02:00
Mike Fährmann
1693d97bd3 update extractor class hierarchies
- let the GalleryExtractor class inherit directly from Extractor
- make ChapterExtractor a subclass of GalleryExtractor
- change enumeration field names of GalleryExtractors to 'num'
2019-10-16 18:15:29 +02:00
Mike Fährmann
f4bc75e854 fix rate limit handling for OAuth APIs (#368) 2019-08-03 13:43:00 +02:00
Mike Fährmann
21991acc49 add 'ciphers' option; update default User-Agent 2019-07-19 17:14:40 +02:00
Mike Fährmann
84f4d3bc0b replace urllib3's default cipher list with Firefox's (#342)
Avoids Cloudflare CAPTCHAs on both Linux and Windows without
pyOpenSSL installed.
2019-07-18 19:42:13 +02:00
Mike Fährmann
09f37fde39 [reddit] move date-min/-max handling into Extractor class 2019-07-16 22:54:39 +02:00
Mike Fährmann
56c7a66a4a detect Cloudflare CAPTCHAs and update cipher list 2019-07-10 15:18:20 +02:00
Mike Fährmann
fdec59f8e2 replace extractor.request() 'expect' argument
with
- 'fatal': allow 4xx status codes
- 'notfound': raise NotFoundError on 404
2019-07-05 00:42:16 +02:00
Mike Fährmann
69205df68d allow '-1' for infinite retries (#300) 2019-06-30 23:10:47 +02:00
Mike Fährmann
f7b5c4c3e7 use values of 'retries' options correctly
The RE-tries option now specifies exactly that: the maximum number of
times a failed HTTP request is re-tried. For example a value of 2 will now
correctly stop after 3 attempts: the initial one + 2 re-tries.

The maximum wait-time now also caps at 30min and increases exponentially
for both extractor.request() and downloader.http.download().
2019-06-30 23:10:18 +02:00
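The retry semantics described in this commit message (initial attempt plus at most 'retries' re-tries, with an exponentially growing wait capped at 30 minutes) can be sketched as below. The function name `call_with_retries`, the caught exception type, and the base of 2 are assumptions; a later commit in this log (f6fd449b59) changes the growth to linear.

```python
import time

MAX_WAIT = 30 * 60  # 30 minute cap, as stated in the commit message


def call_with_retries(func, retries=2, sleep=time.sleep):
    """Call `func` once, then re-try up to `retries` more times on failure."""
    attempt = 0
    while True:
        try:
            return func()
        except OSError:
            if attempt >= retries:
                raise  # initial attempt + `retries` re-tries are used up
            attempt += 1
            # Exponential growth with a cap (base 2 is an assumption).
            sleep(min(2 ** attempt, MAX_WAIT))


calls = []
waits = []


def always_failing():
    calls.append(1)
    raise OSError("simulated network error")


try:
    # Record the wait times instead of actually sleeping.
    call_with_retries(always_failing, retries=2, sleep=waits.append)
except OSError:
    pass

print(len(calls), waits)
```

With `retries=2` the function is called three times in total, matching the "initial one + 2 re-tries" wording of the message.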
Mike Fährmann
399e8e965a also update urllib3's cipher list for versions >= 1.25 2019-05-21 23:02:20 +02:00