165 Commits

Author SHA1 Message Date
Mike Fährmann
29db716a63 implement 'datetime_to_timestamp()'
and rename 'to_timestamp()'
to the more descriptive 'datetime_to_timestamp_string()'
2022-03-23 22:36:01 +01:00
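The split described in commit 29db716a63 can be illustrated with a minimal sketch. Only the two function names come from the commit message; the bodies, the `EPOCH` and `SECOND` constants, and the naive-UTC assumption are illustrative guesses, not the project's actual code.

```python
from datetime import datetime, timedelta

# Assumed helper constants (hypothetical, not taken from the commit).
EPOCH = datetime(1970, 1, 1)
SECOND = timedelta(seconds=1)


def datetime_to_timestamp(dt):
    """Convert a naive UTC datetime to a numeric Unix timestamp."""
    return (dt - EPOCH) / SECOND


def datetime_to_timestamp_string(dt):
    """Convert a naive UTC datetime to a Unix timestamp string.

    Returns an empty string if 'dt' is not a datetime (assumed behavior).
    """
    try:
        return str((dt - EPOCH) // SECOND)
    except Exception:
        return ""
```

Keeping a numeric variant next to the string variant lets callers do arithmetic on timestamps without parsing a string first.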
Mike Fährmann
500a479026 fix a third(!) bug in _check_cookies() (#2372)
turns out tests are worthless if you get em wrong ...
2022-03-18 19:52:37 +01:00
Mike Fährmann
47cf05c4ab refactor proxy handling code (#2357)
- allow gallery-dl proxy settings to overwrite environment proxies
- allow specifying different proxies for data extraction and download
  - add 'downloader.proxy' option
    '-o extractor.proxy=PROXY_URL -o downloader.proxy=null'
    now has the same effect as youtube-dl's '--geo-verification-proxy'
2022-03-10 23:55:35 +01:00
Mike Fährmann
bddcec49f1 implement 'text.root_from_url()'
use domain from input URL for kemono
2022-03-01 03:09:57 +01:00
Mike Fährmann
f5b2b9333f fix another bug in _check_cookies() (#2160)
regression introduced in ed317bfc

Added a couple of tests to hopefully catch such bugs
before they land in a release.
2022-02-16 22:58:57 +01:00
Mike Fährmann
ed317bfcf1 warn about cookies expiring in less than 24 hours
requires an expiration timestamp,
so this only works with cookies from a cookies.txt file
2022-02-13 23:00:49 +01:00
Mike Fährmann
b4f8e15a1f allow BaseExtractors to use the domain of the matched URL 2022-02-10 01:38:50 +01:00
Mike Fährmann
f58364f6a8 update Firefox cipher list 2022-02-01 02:33:01 +01:00
Mike Fährmann
7e6981dda6 rename 'disabletls12' to 'tls12'
and let config options override any default settings
2022-02-01 01:37:03 +01:00
Mike Fährmann
bb3e182562 overhaul session initialization
- share adapter & connection pool across sessions with the same
  ssl options, ssl ciphers, and source address
- simplify browser emulation to just a list of headers and ciphers
2022-01-31 23:12:08 +01:00
Robert Pendell
4c651f6252 [patreon] Disable TLS 1.2 by default (#2249)
Disables TLS 1.2 on Patreon by default.
2022-01-30 23:30:44 +01:00
Robert Pendell
392cf079f7 Add ability to disable TLS 1.2 (#2243)
Fixes Patreon Cloudflare issues by allowing only TLS v1.3 or higher to establish HTTPS connections.

TLS 1.2 can now be disabled on a per-host or global basis. Add 'disabletls12' as a config option either under extractor.(host) or directly under extractor. The option is false by default.

Example:
        "patreon":
        {
            "disabletls12": true,
            "cookies": {
                "session_id": "X"
            }
        }
2022-01-30 22:14:43 +01:00
Mike Fährmann
de754590e0 add --source-address command-line option (closes #2206) 2022-01-21 17:07:56 +01:00
Mike Fährmann
6f2e0c9c3d fix cookie checks for patreon, fanbox, fantia
The changes in 9a255344 caused a warning about missing cookies to be
displayed even if those cookies were present, because _check_cookies()
did not account for an empty cookiedomain.
2022-01-01 17:55:58 +01:00
Mike Fährmann
ad30653b17 allow running a BaseExtractor for any URL
by prefixing it with '<base-category>:'

For example:
  shopify:https://partakefoods.com/products/crunchy-cookie-variety-pack
  gelbooru_v01:https://5naf.booru.org/index.php?page=post&s=view&id=46963

Available base categories are:
  mastodon, shopify, moebooru, gelbooru_v01, gelbooru_v02,
  reactor, foolslide, foolfuuka, philomena
2021-12-15 00:32:17 +01:00
Mike Fährmann
dad2875a3e fix calculating retry sleep times (fixes #1990) 2021-10-29 23:53:48 +02:00
Mike Fährmann
e69ee41f25 implement 'page-reverse' option (#1854) 2021-09-23 18:02:19 +02:00
Mike Fährmann
c9e6693530 allow specifying a minimum/maximum for 'sleep-*' options (#1835)
for example '"sleep-request": [5.0, 10.0]' to wait between 5 and 10
seconds between each HTTP request
2021-09-14 17:40:05 +02:00
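How a 'sleep-*' value that is either a single number or a [min, max] pair might be resolved is sketched below. The function name `resolve_sleep` is hypothetical, and drawing uniformly from the range is an assumption; the commit message only states that the wait falls between the two bounds.

```python
import random


def resolve_sleep(value):
    """Return the number of seconds to sleep for a 'sleep-*' value.

    'value' is either a single number or a [min, max] pair.
    Hypothetical helper; uniform sampling is an assumption.
    """
    if isinstance(value, (list, tuple)):
        low, high = value
        return random.uniform(float(low), float(high))
    return float(value)
```

With this sketch, `resolve_sleep([5.0, 10.0])` yields a value between 5 and 10 seconds, while `resolve_sleep(2)` always yields 2.0.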
Mike Fährmann
2ff2974353 [common] update default argument handling in Extractor.request()
more lines of code, but slightly less execution time
2021-09-11 01:26:11 +02:00
Mike Fährmann
d79bcb6236 allow extractors to register a 'finalize()' method 2021-09-07 21:15:30 +02:00
Mike Fährmann
bb6a130942 automatically set required DDoS-GUARD cookies (#1779)
for kemono.party and seiso.party
2021-08-16 17:40:29 +02:00
Mike Fährmann
bd08ee2859 remove most 'yield Message.Version' statements
only leave them in oauth.py as noop results
2021-08-16 03:10:48 +02:00
Mike Fährmann
9cb5ea5eda update default User-Agent headers 2021-08-14 04:01:41 +02:00
Mike Fährmann
0179581340 add 'T' format string conversion (#1646)
to convert 'date'/datetime to timestamp
2021-06-25 22:35:45 +02:00
Mike Fährmann
94faf8c85a add type check before applying 'browser' option (fixes #1358) 2021-03-06 18:15:32 +01:00
Mike Fährmann
6cfc9613fe update some code in Extractor constructor
- combine '_init_headers' and '_emulate_browser' functionality
  into new '_init_session'
- add 'headers' and 'ciphers' options
2021-03-03 23:13:29 +01:00
Mike Fährmann
29ea54dc41 [patreon] use '"browser": "firefox"' by default (#1117) 2021-02-27 16:26:42 +01:00
Mike Fährmann
cf5fa75d4c add 'browser' option (#1117)
- change default user agent to Firefox ESR 78 on Windows 10
- remove 'ciphers' option
2021-02-26 13:41:27 +01:00
Mike Fährmann
e1a12761d7 strip '/' from instance root URLs 2021-02-17 23:07:17 +01:00
Mike Fährmann
d656892670 remove cloudflare.py
The old IUAM challenge doesn't get used anymore, i.e. code to bypass it
is pointless, and the 'is_...()' checks are simple enough to directly
include them in 'extractor.request()'.
2021-02-15 23:17:02 +01:00
Mike Fährmann
88fae99811 remove 'generate_extractors()' 2021-01-28 01:04:50 +01:00
Mike Fährmann
745a114c61 [common] implement BaseExtractor class
Should be used when the same extractor logic applies to different
instances/domains of several sites, e.g. FoolFuuka, Shopify, etc.

This will replace the functionality of 'generate_extractors()' in
a more efficient way, by condensing everything into 1 class and not
dynamically generating an extractor class for each instance.
2021-01-26 03:48:02 +01:00
Mike Fährmann
0d406c8daf [common] restrict values used in 'generate_extractors()' 2020-12-11 13:46:47 +01:00
Mike Fährmann
8ca7f54750 rename '_request_…' variables
- remove '_' at the beginning
- _request_last -> request_timestamp
2020-12-05 00:09:15 +01:00
Mike Fährmann
c57a918f4a [e621] implement delay via '_request_interval_min' 2020-11-25 00:19:32 +01:00
Mike Fährmann
1e3dd7330e merge SharedConfigMixin functionality into Extractor 2020-11-17 00:34:07 +01:00
Mike Fährmann
198c33ec36 also collect post processors from 'basecategory' entries
(fixes #1084)
2020-10-27 19:56:48 +01:00
Mike Fährmann
1e313d5b84 implement 'sleep-request' option 2020-09-20 20:28:17 +02:00
Mike Fährmann
055c32e0f7 precompute extractor config paths 2020-09-14 22:06:54 +02:00
Mike Fährmann
231dd4c800 accumulate postprocessor objects (#994)
Instead of one 'postprocessors' setting overwriting all others lower
in the hierarchy, all postprocessors along the config path will now
get collected into one big list.

For example '--mtime-from-date' will therefore no longer cause
other postprocessor settings in a config file to get ignored.
2020-09-14 21:51:55 +02:00
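The accumulation described in commit 231dd4c800 can be sketched as a walk along the config path that gathers every 'postprocessors' list it finds. The function name, the nested-dict layout, and the most-specific-first ordering are assumptions made for illustration only.

```python
def collect_postprocessors(config, path):
    """Gather 'postprocessors' entries from every level along 'path'.

    Hypothetical sketch: 'config' is a nested dict and 'path' a
    sequence of keys such as ("extractor", "pixiv", "user").
    """
    levels = [config]
    node = config
    for key in path:
        node = node.get(key)
        if not isinstance(node, dict):
            break  # path ends here; deeper levels do not exist
        levels.append(node)

    collected = []
    # Assumed ordering: most specific level first.
    for level in reversed(levels):
        collected.extend(level.get("postprocessors", ()))
    return collected
```

Under this scheme a postprocessor set at the 'extractor' level is kept even when a more specific level defines its own list, which is the behavior the commit message describes.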
Mike Fährmann
f6fd449b59 reduce wait time growth rate from exponential to linear
Waiting for 2**N seconds after each error grows too fast.
Simply waiting N seconds seems far more reasonable.
2020-09-06 22:38:25 +02:00
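The difference between the two growth rates named in commit f6fd449b59 is easy to see in a small comparison. Both function names are hypothetical, and counting errors from N = 1 is an assumption.

```python
def exponential_wait(tries):
    """Old scheme: wait 2**N seconds after the N-th error."""
    return 2 ** tries


def linear_wait(tries):
    """New scheme: wait N seconds after the N-th error."""
    return tries


# Compare both schemes for the first five consecutive errors.
exponential = [exponential_wait(n) for n in range(1, 6)]
linear = [linear_wait(n) for n in range(1, 6)]
```

After five consecutive errors the exponential scheme has waited 62 seconds in total, the linear one only 15.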
Mike Fährmann
2c9766b29f fix UnboundLocalError in Extractor.request()
introduced in d6a271d
2020-08-05 21:52:04 +02:00
Mike Fährmann
d6a271d2c7 add 'response' objects to 'HttpError's 2020-07-30 18:23:26 +02:00
Mike Fährmann
53cc498d9c improve config lookup when there are multiple possible locations
This specifically applies to all Mastodon extractors and all
extractors with a 'basecategory', i.e. 'booru', 'foolslide', etc.

Values inside those general config locations wouldn't be recognized
when a value with the same name was set on the 'extractor' level.

For example 'extractor.mastodon.directory' should be used over
'extractor.directory' when both are set, but this was impossible
with the previous implementation.

(fixes #843)
2020-06-21 00:07:10 +02:00
Mike Fährmann
1ae1df0d27 update '--write-pages' (#737)
- fix infinite recursion for responses with multiple entries in
  'history'
- hide values of Set-Cookie headers
- only write the response content by default
  (use '-o write-pages=all' to also include HTTP headers)
2020-06-18 15:07:30 +02:00
Mike Fährmann
15c3d29062 move dump_response() into a separate function (#737) 2020-05-25 22:21:58 +02:00
Mike Fährmann
a363da4b43 include redirects and headers in --write-pages dumps (#737) 2020-05-25 22:21:57 +02:00
Mike Fährmann
3201fe3521 add global SENTINEL object 2020-05-19 22:32:53 +02:00
Mike Fährmann
f8f95e68a7 improve '--write-pages' (#737)
- move code into its own function
- add enumeration index to filenames
- dump responses regardless of status code
2020-05-12 20:40:25 +02:00
Vrihub
4cc761c730 Implement --write-pages option (#736)
* Implement --write-pages option

* Fix long lines

* Fix file mode to binary

* Fix pattern for Windows compatibility
2020-05-12 14:25:21 +02:00