Mike Fährmann
bddcec49f1
implement 'text.root_from_url()'
...
use domain from input URL for kemono
2022-03-01 03:09:57 +01:00
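For illustration, a minimal sketch of what a helper like 'text.root_from_url()' might do, assuming it returns the scheme plus domain of an input URL. The body and the 'scheme' parameter are assumptions made for this example, not the code from this commit.

```python
def root_from_url(url, scheme="https://"):
    """Return scheme and domain of 'url' (illustrative sketch)."""
    # Assumption: URLs without an explicit scheme get the default one.
    if not url.startswith(("https://", "http://")):
        url = scheme + url
    # Index 8 lies past the "//" of both "http://" and "https://",
    # so the next "/" found marks the end of the domain.
    end = url.find("/", 8)
    return url if end < 0 else url[:end]
```

With such a helper, an extractor could derive its root URL from whatever domain the user supplied instead of a hard-coded one.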
Mike Fährmann
f5b2b9333f
fix another bug in _check_cookies ( #2160 )
...
regression introduced in ed317bfc
Added a couple of tests to hopefully catch such bugs
before they land in a release.
2022-02-16 22:58:57 +01:00
Mike Fährmann
ed317bfcf1
warn about cookies expiring in less than 24 hours
...
requires an expiration timestamp,
so this only works with cookies from a cookies.txt file
2022-02-13 23:00:49 +01:00
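The expiry warning described above can be sketched as a check against the cookie's expiration timestamp. 'cookie_status()' and its return values are hypothetical names chosen for this example; the actual check lives in the project's cookie handling and may differ.

```python
import time

DAY = 24 * 60 * 60  # seconds

def cookie_status(expires, now=None):
    """Classify a cookie by its expiration timestamp (hypothetical helper)."""
    if expires is None:
        # No timestamp available, e.g. a cookie set directly in a config file
        return "unknown"
    now = time.time() if now is None else now
    remaining = expires - now
    if remaining <= 0:
        return "expired"
    if remaining < DAY:
        return "expiring"  # would trigger the "less than 24 hours" warning
    return "ok"
```

Cookies without an expiration timestamp fall into the "unknown" branch, which matches the note that the warning only works for cookies loaded from a cookies.txt file.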
Mike Fährmann
b4f8e15a1f
allow BaseExtractors to use the domain of the matched URL
2022-02-10 01:38:50 +01:00
Mike Fährmann
f58364f6a8
update Firefox cipher list
2022-02-01 02:33:01 +01:00
Mike Fährmann
7e6981dda6
rename 'disabletls12' to 'tls12'
...
and let config options override any default settings
2022-02-01 01:37:03 +01:00
Mike Fährmann
bb3e182562
overhaul session initialization
...
- share adapter & connection pool across sessions with the same
ssl options, ssl ciphers, and source address
- simplify browser emulation to just a list of headers and ciphers
2022-01-31 23:12:08 +01:00
Robert Pendell
4c651f6252
[patreon] Disable TLS 1.2 by default ( #2249 )
...
Disables TLS 1.2 on Patreon by default.
2022-01-30 23:30:44 +01:00
Robert Pendell
392cf079f7
Add ability to disable TLS 1.2 ( #2243 )
...
Fixes Patreon Cloudflare issues by only allowing TLS v1.3 or higher to establish HTTPS connections.
TLS 1.2 can now be disabled on a per-host or global basis: add disabletls12 as a config option either under extractor.(host) or directly under extractor. The option is false by default.
Example:
    "patreon":
    {
        "disabletls12": true,
        "cookies": {
            "session_id": "X"
        }
    }
2022-01-30 22:14:43 +01:00
Mike Fährmann
de754590e0
add --source-address command-line option ( closes #2206 )
2022-01-21 17:07:56 +01:00
Mike Fährmann
6f2e0c9c3d
fix cookie checks for patreon, fanbox, fantia
...
The changes in 9a255344 caused a warning about missing cookies to be
displayed even if those cookies were present, because _check_cookies()
did not account for an empty cookiedomain.
2022-01-01 17:55:58 +01:00
Mike Fährmann
ad30653b17
allow running a BaseExtractor for any URL
...
by prefixing it with '<base-category>:'
For example:
shopify:https://partakefoods.com/products/crunchy-cookie-variety-pack
gelbooru_v01:https://5naf.booru.org/index.php?page=post&s=view&id=46963
Available base categories are:
mastodon, shopify, moebooru, gelbooru_v01, gelbooru_v02,
reactor, foolslide, foolfuuka, philomena
2021-12-15 00:32:17 +01:00
Mike Fährmann
dad2875a3e
fix calculating retry sleep times ( fixes #1990 )
2021-10-29 23:53:48 +02:00
Mike Fährmann
e69ee41f25
implement 'page-reverse' option ( #1854 )
2021-09-23 18:02:19 +02:00
Mike Fährmann
c9e6693530
allow specifying a minimum/maximum for 'sleep-*' options ( #1835 )
...
for example '"sleep-request": [5.0, 10.0]' to wait
5 to 10 seconds between HTTP requests
2021-09-14 17:40:05 +02:00
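A rough sketch of how a 'sleep-*' value could be resolved into a wait time, assuming a two-element list means "pick a random duration in this range" and a single number means a fixed delay. 'sleep_duration()' is a hypothetical name, and the uniform distribution is an assumption.

```python
import random

def sleep_duration(value):
    """Return the number of seconds to wait for a 'sleep-*' style value."""
    if isinstance(value, (list, tuple)):
        # [minimum, maximum]: choose a random duration within the range
        low, high = value
        return random.uniform(low, high)
    # single number: fixed delay
    return float(value)
```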
Mike Fährmann
2ff2974353
[common] update default argument handling in Extractor.request()
...
more lines of code, but slightly less execution time
2021-09-11 01:26:11 +02:00
Mike Fährmann
d79bcb6236
allow extractors to register a 'finalize()' method
2021-09-07 21:15:30 +02:00
Mike Fährmann
bb6a130942
automatically set required DDoS-GUARD cookies ( #1779 )
...
for kemono.party and seiso.party
2021-08-16 17:40:29 +02:00
Mike Fährmann
bd08ee2859
remove most 'yield Message.Version' statements
...
only leave them in oauth.py as noop results
2021-08-16 03:10:48 +02:00
Mike Fährmann
9cb5ea5eda
update default User-Agent headers
2021-08-14 04:01:41 +02:00
Mike Fährmann
0179581340
add 'T' format string conversion ( #1646 )
...
to convert 'date'/datetime to timestamp
2021-06-25 22:35:45 +02:00
Mike Fährmann
94faf8c85a
add type check before applying 'browser' option ( fixes #1358 )
2021-03-06 18:15:32 +01:00
Mike Fährmann
6cfc9613fe
update some code in Extractor constructor
...
- combine '_init_headers' and '_emulate_browser' functionality
into new '_init_session'
- add 'headers' and 'ciphers' options
2021-03-03 23:13:29 +01:00
Mike Fährmann
29ea54dc41
[patreon] use '"browser": "firefox"' by default ( #1117 )
2021-02-27 16:26:42 +01:00
Mike Fährmann
cf5fa75d4c
add 'browser' option ( #1117 )
...
- change default user agent to Firefox ESR 78 on Windows 10
- remove 'ciphers' option
2021-02-26 13:41:27 +01:00
Mike Fährmann
e1a12761d7
strip '/' from instance root URLs
2021-02-17 23:07:17 +01:00
Mike Fährmann
d656892670
remove cloudflare.py
...
The old IUAM challenge doesn't get used anymore, i.e. code to bypass it
is pointless, and the 'is_...()' checks are simple enough to directly
include them in 'extractor.request()'.
2021-02-15 23:17:02 +01:00
Mike Fährmann
88fae99811
remove 'generate_extractors()'
2021-01-28 01:04:50 +01:00
Mike Fährmann
745a114c61
[common] implement BaseExtractor class
...
Should be used when the same extractor logic applies to different
instances/domains of several sites, e.g. FoolFuuka, Shopify, etc.
This will replace the functionality of 'generate_extractors()' in
a more efficient way, by condensing everything into 1 class and not
dynamically generating an extractor class for each instance.
2021-01-26 03:48:02 +01:00
Mike Fährmann
0d406c8daf
[common] restrict values used in 'generate_extractors()'
2020-12-11 13:46:47 +01:00
Mike Fährmann
8ca7f54750
rename '_request_…' variables
...
- remove '_' at the beginning
- _request_last -> request_timestamp
2020-12-05 00:09:15 +01:00
Mike Fährmann
c57a918f4a
[e621] implement delay via '_request_interval_min'
2020-11-25 00:19:32 +01:00
Mike Fährmann
1e3dd7330e
merge SharedConfigMixin functionality into Extractor
2020-11-17 00:34:07 +01:00
Mike Fährmann
198c33ec36
also collect post processors from 'basecategory' entries
...
(fixes #1084 )
2020-10-27 19:56:48 +01:00
Mike Fährmann
1e313d5b84
implement 'sleep-request' option
2020-09-20 20:28:17 +02:00
Mike Fährmann
055c32e0f7
precompute extractor config paths
2020-09-14 22:06:54 +02:00
Mike Fährmann
231dd4c800
accumulate postprocessor objects ( #994 )
...
Instead of one 'postprocessors' setting overwriting all others lower
in the hierarchy, all postprocessors along the config path will now
get collected into one big list.
For example '--mtime-from-date' will therefore no longer cause
other postprocessor settings in a config file to get ignored.
2020-09-14 21:51:55 +02:00
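The accumulation described above can be sketched as a walk along the config path that gathers every 'postprocessors' list it encounters. 'collect_postprocessors()', the dict layout, and the outermost-first ordering are assumptions made for this example.

```python
def collect_postprocessors(config, path):
    """Gather every 'postprocessors' list along 'path' (illustrative)."""
    node = config
    # The top level itself may define postprocessors, too.
    collected = list(node.get("postprocessors", ()))
    for key in path:
        node = node.get(key)
        if not isinstance(node, dict):
            break  # path does not exist any deeper
        collected.extend(node.get("postprocessors", ()))
    return collected
```

Under this scheme, a general setting and a site-specific setting both contribute entries instead of one replacing the other.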
Mike Fährmann
f6fd449b59
reduce wait time growth rate from exponential to linear
...
Waiting for 2**N seconds after each error grows too fast.
Simply waiting N seconds seems far more reasonable.
2020-09-06 22:38:25 +02:00
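To make the difference concrete, the following sketch compares both schedules. 'wait_times()' is a hypothetical helper, and counting N from 1 is an assumption.

```python
def wait_times(retries, linear=True):
    """Seconds to wait after the 1st, 2nd, ... failed attempt."""
    return [n if linear else 2 ** n for n in range(1, retries + 1)]

print(wait_times(5, linear=False))  # exponential: [2, 4, 8, 16, 32]
print(wait_times(5))                # linear:      [1, 2, 3, 4, 5]
```

After five errors the exponential schedule has waited 62 seconds in total, the linear one 15.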
Mike Fährmann
2c9766b29f
fix UnboundLocalError in Extractor.request()
...
introduced in d6a271d
2020-08-05 21:52:04 +02:00
Mike Fährmann
d6a271d2c7
add 'response' objects to 'HttpError's
2020-07-30 18:23:26 +02:00
Mike Fährmann
53cc498d9c
improve config lookup when there are multiple possible locations
...
This specifically applies to all Mastodon extractors and all
extractors with a 'basecategory', i.e. 'booru', 'foolslide', etc.
Values inside those general config locations wouldn't be recognized
when a value with the same name was set on the 'extractor' level.
For example 'extractor.mastodon.directory' should be used over
'extractor.directory' when both are set, but this was impossible
with the previous implementation.
(fixes #843 )
2020-06-21 00:07:10 +02:00
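One way to picture the improved lookup is to try the candidate config locations in order of specificity and return the first match. 'lookup()' and the dict layout below are assumptions for illustration, not the project's config API.

```python
def lookup(config, paths, key, default=None):
    """Return the first value for 'key' found under any of 'paths'."""
    for path in paths:
        node = config
        for part in path:
            node = node.get(part) if isinstance(node, dict) else None
            if node is None:
                break  # this location does not exist
        else:
            # Path resolved completely; check it for the key.
            if isinstance(node, dict) and key in node:
                return node[key]
    return default

config = {
    "extractor": {
        "directory": ["general"],
        "mastodon": {"directory": ["mastodon"]},
    }
}
# Most specific location first, general 'extractor' level last.
paths = [("extractor", "mastodon"), ("extractor",)]
```

Here 'lookup(config, paths, "directory")' picks the value under 'extractor.mastodon' even though 'extractor.directory' is also set.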
Mike Fährmann
1ae1df0d27
update '--write-pages' ( #737 )
...
- fix infinite recursion for responses with multiple entries in
'history'
- hide values of Set-Cookie headers
- only write the response content by default
(use '-o write-pages=all' to also include HTTP headers)
2020-06-18 15:07:30 +02:00
Mike Fährmann
15c3d29062
move dump_response() into a separate function ( #737 )
2020-05-25 22:21:58 +02:00
Mike Fährmann
a363da4b43
include redirects and headers in --write-pages dumps ( #737 )
2020-05-25 22:21:57 +02:00
Mike Fährmann
3201fe3521
add global SENTINEL object
2020-05-19 22:32:53 +02:00
Mike Fährmann
f8f95e68a7
improve '--write-pages' ( #737 )
...
- move code into its own function
- add enumeration index to filenames
- dump responses regardless of status code
2020-05-12 20:40:25 +02:00
Vrihub
4cc761c730
Implement --write-pages option ( #736 )
...
* Implement --write-pages option
* Fix long lines
* Fix file mode to binary
* Fix pattern for Windows compatibility
2020-05-12 14:25:21 +02:00
Mike Fährmann
5d7ca76885
retry Cloudflare challenges
2020-04-24 22:47:27 +02:00
Mike Fährmann
d02f7c1118
improve Extractor.wait()
...
- allow 'until' to be a datetime object
- do "time calculations" with UTC timestamps
- set a default 'reason'
2020-04-05 21:23:05 +02:00
Mike Fährmann
2a4f227e08
warn about expired cookies
2020-02-25 00:34:42 +01:00