322 Commits

Author SHA1 Message Date
Mike Fährmann
53cc498d9c improve config lookup when there are multiple possible locations
This specifically applies to all Mastodon extractors and all
extractors with a 'basecategory', i.e. 'booru', 'foolslide', etc.

Values inside those general config locations wouldn't be recognized
when a value with the same name was set on the 'extractor' level.

For example 'extractor.mastodon.directory' should be used over
'extractor.directory' when both are set, but this was impossible
with the previous implementation.

(fixes #843)
2020-06-21 00:07:10 +02:00
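
The lookup order described in 53cc498d9c can be sketched as follows. This is a minimal illustration, not the project's actual implementation: the function name `lookup`, the instance name "mastodon.social", and the list of search paths are assumptions made for the example. Only the two option names 'extractor.mastodon.directory' and 'extractor.directory' come from the commit message.

```python
def lookup(config, paths, key, default=None):
    """Return the value of 'key' from the first path that defines it.

    'paths' is ordered from most specific to most general
    (hypothetical helper, not the project's API).
    """
    for path in paths:
        node = config
        try:
            for part in path:
                node = node[part]
            return node[key]
        except (KeyError, TypeError):
            # this location does not exist or does not define 'key'
            continue
    return default


config = {
    "extractor": {
        "directory": ["{category}"],
        "filename": "{id}.{extension}",
        "mastodon": {"directory": ["mastodon", "{instance}"]},
    }
}

# Assumed search order: instance, then basecategory, then 'extractor' level.
paths = [
    ("extractor", "mastodon.social"),
    ("extractor", "mastodon"),
    ("extractor",),
]

directory = lookup(config, paths, "directory")  # found in extractor.mastodon
filename = lookup(config, paths, "filename")    # falls back to extractor level
```

Searching the more specific locations first is what lets 'extractor.mastodon.directory' win over 'extractor.directory' when both are set.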
Mike Fährmann
1ae1df0d27 update '--write-pages' (#737)
- fix infinite recursion for responses with multiple entries in
  'history'
- hide values of Set-Cookie headers
- only write the response content by default
  (use '-o write-pages=all' to also include HTTP headers)
2020-06-18 15:07:30 +02:00
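
Hiding the values of Set-Cookie headers, as mentioned in 1ae1df0d27, could look roughly like this. The helper name `mask_set_cookie` and the "***" placeholder are assumptions for illustration. The actual dump format is not specified in the commit message.

```python
def mask_set_cookie(value):
    """Replace the cookie value in a Set-Cookie header with '***'.

    Cookie name and attributes (Path, HttpOnly, ...) are kept.
    Hypothetical helper; the placeholder text is an assumption.
    """
    name, sep, rest = value.partition("=")
    if not sep:
        # no 'name=value' pair present; nothing to hide
        return value
    _, attr_sep, attrs = rest.partition(";")
    return name + "=***" + attr_sep + attrs


masked = mask_set_cookie("session=abc123; Path=/; HttpOnly")
```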
Mike Fährmann
15c3d29062 move dump_response() into a separate function (#737) 2020-05-25 22:21:58 +02:00
Mike Fährmann
a363da4b43 include redirects and headers in --write-pages dumps (#737) 2020-05-25 22:21:57 +02:00
Mike Fährmann
3201fe3521 add global SENTINEL object 2020-05-19 22:32:53 +02:00
Mike Fährmann
f8f95e68a7 improve '--write-pages' (#737)
- move code into its own function
- add enumeration index to filenames
- dump responses regardless of status code
2020-05-12 20:40:25 +02:00
Vrihub
4cc761c730 Implement --write-pages option (#736)
* Implement --write-pages option

* Fix long lines

* Fix file mode to binary

* Fix pattern for Windows compatibility
2020-05-12 14:25:21 +02:00
Mike Fährmann
5d7ca76885 retry Cloudflare challenges 2020-04-24 22:47:27 +02:00
Mike Fährmann
d02f7c1118 improve Extractor.wait()
- allow 'until' to be a datetime object
- do "time calculations" with UTC timestamps
- set a default 'reason'
2020-04-05 21:23:05 +02:00
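
A sketch of the "until" handling from d02f7c1118, assuming that naive datetime objects are meant as UTC. The function name `seconds_until` and its `now` parameter are hypothetical and only serve the example. The real Extractor.wait() may differ.

```python
import datetime
import time


def seconds_until(until, now=None):
    """Return the non-negative number of seconds left until 'until'.

    'until' may be a Unix timestamp or a datetime object;
    naive datetimes are treated as UTC (an assumption of this sketch).
    """
    if isinstance(until, datetime.datetime):
        if until.tzinfo is None:
            until = until.replace(tzinfo=datetime.timezone.utc)
        until = until.timestamp()
    if now is None:
        now = time.time()  # seconds since the Unix epoch (UTC-based)
    return max(0.0, float(until) - now)


# 1060 seconds after the epoch, given three different ways
as_number = seconds_until(1060, now=1000.0)
as_naive = seconds_until(datetime.datetime(1970, 1, 1, 0, 17, 40), now=1000.0)
as_aware = seconds_until(
    datetime.datetime.fromtimestamp(1060, datetime.timezone.utc), now=1000.0)
```

Converting everything to a UTC timestamp first keeps the subtraction independent of the local time zone.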
Mike Fährmann
2a4f227e08 warn about expired cookies 2020-02-25 00:34:42 +01:00
Mike Fährmann
56f1c96168 implement 'parent-directory' option (#551) 2020-01-29 18:32:37 +01:00
Mike Fährmann
2a9be48511 improve util.load/save_cookiestxt() and add tests
- take a file object as argument instead of a filename
- accept whitespace before comments ("   # comment")
- map expiration "0" to None and not the number 0
2020-01-25 23:02:15 +01:00
Mike Fährmann
c1a6862863 implement functions to load/save cookies.txt files (closes #586)
The methods of the standard library's MozillaCookieJar have
several shortcomings (#HttpOnly_ cookies, 0 expiration timestamps, etc.)
and require construction of an ultimately pointless CookieJar object.
2020-01-21 21:59:36 +01:00
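
The parsing rules named in the two commits above (file object as argument, whitespace before comments, "#HttpOnly_" prefixes, expiration "0" mapped to None) can be sketched like this. The function name `parse_cookiestxt` and the dict-based result are assumptions. This is not the signature of util.load_cookiestxt().

```python
import io

HTTPONLY_PREFIX = "#HttpOnly_"


def parse_cookiestxt(fp):
    """Parse a Netscape-format cookies.txt from a text file object.

    Returns a list of dicts (hypothetical result format).
    """
    cookies = []
    for line in fp:
        line = line.lstrip()  # allow whitespace before comments
        if line.startswith(HTTPONLY_PREFIX):
            line = line[len(HTTPONLY_PREFIX):]
        if not line or line[0] in ("#", "$"):
            continue  # empty line or comment
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) != 7:
            continue  # not a valid cookie line
        domain, _flag, path, secure, expires, name, value = fields
        cookies.append({
            "domain": domain,
            "path": path,
            "secure": secure == "TRUE",
            # "0" (or empty) means a session cookie: no expiration
            "expires": None if expires in ("", "0") else int(expires),
            "name": name,
            "value": value,
        })
    return cookies


text = (
    "# Netscape HTTP Cookie File\n"
    "   # indented comment\n"
    "\n"
    "#HttpOnly_.example.org\tTRUE\t/\tTRUE\t0\tsid\tabc\n"
    ".example.org\tTRUE\t/\tFALSE\t1600000000\tlang\ten\n"
)
cookies = parse_cookiestxt(io.StringIO(text))
```

Taking a file object instead of a filename is what makes the `io.StringIO` usage above possible, which is convenient for tests.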
Mike Fährmann
bd5ce9855c allow GalleryExtractors to set URL-independent extensions 2020-01-14 11:53:32 +01:00
Mike Fährmann
3811fd8a25 fix time formatting for Python 3.4 and 3.5
'datetime.time.isoformat()' only has an optional 'timespec' argument
since Python 3.6.
2020-01-05 00:47:10 +01:00
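
One version-independent way to get a time string without microseconds, shown only as an illustration of the 'timespec' limitation noted in 3811fd8a25. The commit message does not say which approach was used, and `format_time` is a hypothetical name.

```python
import datetime


def format_time(t):
    """Format a datetime.time as HH:MM:SS without using 'timespec'."""
    # isoformat() omits the fractional part when microsecond == 0
    return t.replace(microsecond=0).isoformat()


formatted = format_time(datetime.time(12, 34, 56, 789))
```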
Mike Fährmann
569747a78d implement extractor.wait() 2020-01-04 23:42:07 +01:00
Mike Fährmann
ce54b8c04c let extractors opt-out of cookie option usage
useful to avoid sending unnecessary cookies when all authentication
is done through OAuth tokens
2020-01-01 21:12:37 +01:00
Mike Fährmann
d3e44e899d raise NotFoundErrors for 404 responses in GalleryExtractors 2019-12-13 18:42:04 +01:00
Mike Fährmann
a4dd8b3dab improve _check_cookies()
Only loop over all cookies once instead of calling
cookiejar._find() for each cookie name.
2019-12-13 15:51:20 +01:00
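
The single-pass idea from a4dd8b3dab, sketched under assumptions: `check_cookies` and the `Cookie` tuple are stand-ins for this example, and the real _check_cookies() may take domain and expiration into account as well.

```python
from collections import namedtuple

# Minimal stand-in for cookie objects that expose a 'name' attribute
Cookie = namedtuple("Cookie", ["name", "value"])


def check_cookies(cookies, names):
    """Return True if every name in 'names' occurs among 'cookies'.

    Iterates over the cookies once instead of searching per name.
    """
    remaining = set(names)
    if not remaining:
        return True
    for cookie in cookies:
        remaining.discard(cookie.name)
        if not remaining:
            return True
    return False


jar = [Cookie("a", "1"), Cookie("b", "2"), Cookie("c", "3")]
has_both = check_cookies(jar, ("a", "c"))
has_missing = check_cookies(jar, ("a", "x"))
```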
Mike Fährmann
15f9bb3d14 add option to disable pyOpenSSL usage (#508)
(pyOpenSSL is now disabled by default)
2019-12-08 21:21:00 +01:00
Mike Fährmann
e17907ee2a change default value of 'cookies-update' to 'true' 2019-12-05 23:43:49 +01:00
Mike Fährmann
e2710702d4 fix Cloudflare bypass 2019-12-01 01:07:24 +01:00
Mike Fährmann
ae09f87602 improve SharedConfigMixin config lookups 2019-11-25 18:31:38 +01:00
Mike Fährmann
f5604492c3 update interface of config functions 2019-11-24 00:42:28 +01:00
Mike Fährmann
d45fabb79d match user profile handling on deviantart and newgrounds 2019-11-22 23:20:21 +01:00
Mike Fährmann
1a197d2195 store the original cookiejar as Extractor._cookiejar 2019-11-05 21:53:22 +01:00
Mike Fährmann
de83ae4576 make 'method' argument of Extractor.request keyword-only 2019-11-05 17:28:09 +01:00
Mike Fährmann
d44f790e81 adjust output for HTTP status related errors 2019-10-27 23:55:02 +01:00
Mike Fährmann
389d2d7e38 implement 'cookies-update' option (#445) 2019-10-19 15:23:55 +02:00
Mike Fährmann
1693d97bd3 update extractor class hierarchies
- let the GalleryExtractor class inherit directly from Extractor
- make ChapterExtractor a subclass of GalleryExtractor
- change enumeration field names of GalleryExtractors to 'num'
2019-10-16 18:15:29 +02:00
Mike Fährmann
f4bc75e854 fix rate limit handling for OAuth APIs (#368) 2019-08-03 13:43:00 +02:00
Mike Fährmann
21991acc49 add 'ciphers' option; update default User-Agent 2019-07-19 17:14:40 +02:00
Mike Fährmann
84f4d3bc0b replace urllib3's default cipher list with Firefox's (#342)
Avoids Cloudflare CAPTCHAs on both Linux and Windows without
pyOpenSSL installed.
2019-07-18 19:42:13 +02:00
Mike Fährmann
09f37fde39 [reddit] move date-min/-max handling into Extractor class 2019-07-16 22:54:39 +02:00
Mike Fährmann
56c7a66a4a detect Cloudflare CAPTCHAs and update cipher list 2019-07-10 15:18:20 +02:00
Mike Fährmann
fdec59f8e2 replace extractor.request() 'expect' argument
with
- 'fatal': allow 4xx status codes
- 'notfound': raise NotFoundError on 404
2019-07-05 00:42:16 +02:00
Mike Fährmann
69205df68d allow '-1' for infinite retries (#300) 2019-06-30 23:10:47 +02:00
Mike Fährmann
f7b5c4c3e7 use values of 'retries' options correctly
The RE-tries option now specifies exactly that: the maximum number of times
a failed HTTP request is re-tried. For example, a value of 2 will now
correctly stop after 3 attempts: the initial one + 2 re-tries.

The maximum wait-time now also caps at 30min and increases exponentially
for both extractor.request() and downloader.http.download().
2019-06-30 23:10:18 +02:00
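
The retry semantics from f7b5c4c3e7 (initial attempt plus N re-tries, exponentially growing wait capped at 30 minutes) and the '-1' value from 69205df68d can be sketched as below. The names `backoff` and `request_with_retries`, the use of ConnectionError as the failure type, and the exact formula 2 ** (tries - 1) are assumptions of this sketch, not taken from the commit.

```python
import time


def backoff(tries, cap=1800):
    """Seconds to wait before re-try number 'tries' (1-based), max 30 min."""
    return min(2 ** (tries - 1), cap)


def request_with_retries(send, retries=2, sleep=time.sleep):
    """Call 'send' until it succeeds or 'retries' re-tries are used up.

    retries=2 means at most 3 attempts; retries=-1 retries forever.
    """
    tries = 0
    while True:
        try:
            return send()
        except ConnectionError:
            if 0 <= retries <= tries:
                raise  # all re-tries exhausted
        tries += 1
        sleep(backoff(tries))


# Demonstration with a request that always fails and a fake 'sleep'
attempts = []
waits = []


def always_fail():
    attempts.append(1)
    raise ConnectionError("simulated failure")


try:
    request_with_retries(always_fail, retries=2, sleep=waits.append)
except ConnectionError:
    pass
```

With retries=2 the demonstration makes three attempts and waits twice in between, matching the "initial one + 2 re-tries" wording of the commit.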
Mike Fährmann
399e8e965a also update urllib3's cipher list for versions >= 1.25 2019-05-21 23:02:20 +02:00
Mike Fährmann
c02f12ce2f avoid Cloudflare CAPTCHAs for OpenSSL < 1.1.1
see https://github.com/Anorov/cloudflare-scrape/pull/242
2019-05-15 12:25:20 +02:00
Mike Fährmann
5fd94c6b83 import urllib3 from requests.packages 2019-05-04 22:28:07 +02:00
Mike Fährmann
35f343206c update default SSL cipher list in urllib3 < 1.25
Cloudflare now also checks the client's SSL/TLS cipher capabilities and
produces a 403: Forbidden response with CAPTCHA if they are insufficient.

This commit replaces the default cipher list in urllib3 < 1.25 with the
one from 1.25 (1), which doesn't cause problems as long as the client
platform actually supports these ciphers. On some platforms (tested with
Python 3.4 on Linux and Python 3.7 on an outdated Windows 7 VM) it is
necessary to install pyOpenSSL to get everything to work.

Explicitly setting a minimum/maximum version for urllib3 is also no
longer necessary and installing gallery-dl will therefore not pull an
incompatible urllib3 version (#229)

Fixes the "403: Forbidden" error on Artstation (#227)

(1) 0cedb3b0f1
2019-05-03 22:40:04 +02:00
Mike Fährmann
e25ebc4bff don't disable certificate checks anymore
Executables generated with PyInstaller auto-include the root certificate
file and certificate checks now work out-of-the-box.
2019-04-17 13:27:19 +02:00
Mike Fährmann
49a6522c38 ensure consistent headers and params ordering
Necessary to avoid being labeled a bot and getting a CAPTCHA response
after solving a Cloudflare challenge.
2019-04-09 10:52:27 +02:00
Mike Fährmann
f612284d24 cache cfclearance cookies 2019-03-14 16:14:29 +01:00
Mike Fährmann
591a07f20c small code changes and cleanups 2019-03-13 22:03:02 +01:00
Mike Fährmann
6dae6bee37 automatically detect and bypass cloudflare challenge pages
TODO: cache and re-apply cfclearance cookies
2019-03-10 15:31:33 +01:00
Mike Fährmann
4ca4631bad simplify auto-disabling certificate verification
if no certificate bundle is found
2019-03-08 16:34:01 +01:00
Mike Fährmann
09d872a2b1 generalize extractor creation code 2019-03-07 22:55:26 +01:00
Mike Fährmann
3595cd582f use GalleryExtractor as common base class 2019-03-01 14:13:16 +01:00