40 Commits

Author SHA1 Message Date
Mike Fährmann
a383eca7f6 decouple extractor initialization
Introduce an 'initialize()' function that does the actual init
(session, cookies, config options) and can be called separately from
the constructor __init__().

This makes it possible, for example, to adjust config access inside a
Job before it takes place. Previously, most of it had already happened
by the time 'extractor.find()' was called.
2023-07-25 22:16:16 +02:00
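
Note on a383eca7f6: a minimal sketch of the decoupling described above, not the project's actual code. The class layout, the 'config' dict, the dict-based session stand-in, and the '_initialized' flag are all assumptions made for illustration. Only the idea of a separate 'initialize()' step comes from the commit message.

```python
class Extractor:
    """Hypothetical extractor with deferred initialization."""

    def __init__(self, url):
        # The constructor only records cheap state; nothing is read
        # from the configuration yet.
        self.url = url
        self.config = {}
        self.session = None
        self.user_agent = None
        self._initialized = False

    def initialize(self):
        """Perform the actual setup; calling it twice is harmless."""
        if self._initialized:
            return
        self.session = {"cookies": {}}  # stand-in for a real HTTP session
        self.user_agent = self.config.get("user-agent", "default-agent")
        self._initialized = True


extractor = Extractor("https://example.org/comic/1")
# A job can still change options here, because none have been read yet.
extractor.config["user-agent"] = "custom-agent"
extractor.initialize()
```

Because the options are read inside 'initialize()' rather than in the constructor, the adjustment made between the two calls takes effect.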
Mike Fährmann
c6a9bab019 update extractor test results 2022-07-12 15:49:22 +02:00
Mike Fährmann
6c0fa2f258 [readcomiconline] update 2022-06-05 21:40:08 +02:00
Mike Fährmann
310fee99d5 [readcomiconline] remove automatic 'browser' setting (#2625) 2022-05-27 13:44:28 +02:00
Mike Fährmann
82c1cc130b [readcomiconline] update deobfuscation code (#2481) 2022-05-17 10:52:45 +02:00
Mike Fährmann
12bd9ba33a [readcomiconline] add 'quality' option (#2467) 2022-04-15 18:10:37 +02:00
Mike Fährmann
60ad46ddcc [readcomiconline] unobfuscate image URLs (#2481) 2022-04-15 18:04:09 +02:00
Mike Fährmann
2133f1d77f [readcomiconline] change domain to 'readcomiconline.li'
(closes #1517)
2021-05-01 16:41:16 +02:00
Mike Fährmann
fc15930266 [readcomiconline] download high quality image versions
(fixes #1347)
2021-02-28 01:11:32 +01:00
Mike Fährmann
968d3e8465 remove '&' from URL patterns
'/?&#' -> '/?#' and '?&#' -> '?#'

According to https://www.ietf.org/rfc/rfc3986.txt, URLs are
"organized hierarchically" by using "the slash ("/"), question
mark ("?"), and number sign ("#") characters to delimit components"
2020-10-22 23:31:25 +02:00
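
Note on 968d3e8465: a small sketch of what the change means for a path segment that contains '&'. The two patterns and the example URL are made up for illustration and are not taken from any extractor.

```python
import re

# Hypothetical patterns: the old character class also stops at '&',
# the new one stops only at the RFC 3986 delimiters '/', '?' and '#'.
OLD = re.compile(r"https://example\.org/Comic/([^/?&#]+)")
NEW = re.compile(r"https://example\.org/Comic/([^/?#]+)")

url = "https://example.org/Comic/Tom&Jerry?id=1#top"

old_name = OLD.match(url).group(1)  # "Tom"
new_name = NEW.match(url).group(1)  # "Tom&Jerry"
```

With '&' removed from the class, an ampersand inside a path segment is matched as ordinary text instead of cutting the segment short.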
Mike Fährmann
c874071f5a [kissmanga] remove module 2020-10-04 22:46:41 +02:00
Mike Fährmann
4465a3ea68 [kissmanga][readcomiconline] add 'captcha' option (#279)
to configure how to handle CAPTCHA page redirects:
- either interactively wait for the user to solve the CAPTCHA
- or raise StopExtraction like before
2019-05-27 22:24:48 +02:00
Mike Fährmann
48233f00c0 [readcomiconline] detect 'AreYouHuman' redirects (#279) 2019-05-26 15:58:37 +02:00
Mike Fährmann
6dae6bee37 automatically detect and bypass cloudflare challenge pages
TODO: cache and re-apply cfclearance cookies
2019-03-10 15:31:33 +01:00
Mike Fährmann
5530871b5a change results of text.nameext_from_url()
Instead of getting a complete 'filename' from a URL and splitting that
into 'name' and 'extension', the new approach drops the complete
version and renames 'name' to 'filename'. (Using anything other than
{extension} for a filename extension doesn't really work anyway.)

Example: "https://example.org/path/filename.ext"

before:
- filename : filename.ext
- name     : filename
- extension: ext

now:
- filename : filename
- extension: ext
2019-02-14 16:07:17 +01:00
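
Note on 5530871b5a: the sketch below reproduces the "now" result for the example URL. It is an illustrative reimplementation under assumptions, not the body of text.nameext_from_url(). The function name 'nameext_sketch' and its handling of names without an extension are hypothetical.

```python
from urllib.parse import unquote, urlsplit


def nameext_sketch(url):
    """Return 'filename' (without extension) and 'extension' for a URL."""
    path = urlsplit(url).path
    basename = unquote(path.rpartition("/")[2])
    name, dot, ext = basename.rpartition(".")
    if dot and name:
        return {"filename": name, "extension": ext}
    # No usable dot: treat the whole basename as the filename.
    return {"filename": basename, "extension": ""}


result = nameext_sketch("https://example.org/path/filename.ext")
# result == {"filename": "filename", "extension": "ext"}
```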
Mike Fährmann
32edf4fc7b add '_extractor' info to manga extractor results 2019-02-13 13:23:36 +01:00
Mike Fährmann
580baef72c change Chapter and MangaExtractor classes
- unify and simplify constructors
- rename get_metadata and get_images to just metadata() and images()
- rename self.url to chapter_url and manga_url
2019-02-11 18:38:47 +01:00
Mike Fährmann
4b1880fa5e propagate 'match' to base extractor constructor 2019-02-11 13:31:10 +01:00
Mike Fährmann
6284731107 simplify extractor constants
- single strings for URL patterns
- tuples instead of lists for 'directory_fmt' and 'test'
- single-tuple tests where applicable
2019-02-08 13:45:40 +01:00
Mike Fährmann
6126615698 update URLs for supportedsites.rst 2019-01-30 16:18:22 +01:00
Mike Fährmann
259123732f [readcomiconline] improve comic-page parsing 2018-12-30 13:19:23 +01:00
Mike Fährmann
1c6b9ba322 [readcomiconline] use HTTPS 2018-12-09 14:54:55 +01:00
Mike Fährmann
1d43cbbf52 [gelbooru] tag-splitting for non-api mode 2018-07-06 15:24:19 +02:00
Mike Fährmann
cc36f88586 rename safe_int to parse_int; move parse_* to text module 2018-04-20 14:53:21 +02:00
Mike Fährmann
d11fcf4804 smaller changes and fixes
- fix the cloudflare challenge result if the last decimal places
  are zero (JS's toFixed() removes trailing zeroes)
- fix downloading of kissmanga chapter-pages hosted on blogspot
  (accessing blogspot with "kissmanga.com" as referrer yields a 401)
- disable certificate validation for 'mangahere' tests
- update flickr test result
2018-04-06 15:30:09 +02:00
Mike Fährmann
179bcdd349 adjust archive-ids 2018-02-13 04:50:45 +01:00
Mike Fährmann
3cec533c28 Merge branch 'archive' 2018-02-12 18:07:58 +01:00
Mike Fährmann
5b3c34aa96 use generic chapter-extractor in more modules 2018-02-07 12:36:39 +01:00
Mike Fährmann
34873dbd90 set 'archive_fmt' values
These are going to be used to create a unique ID for each image.
2018-02-01 15:30:49 +01:00
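
Note on 34873dbd90: one way such a format string could yield a per-image ID, sketched under assumptions. The field names 'issue_id' and 'page', the format string itself, and the category prefix are hypothetical. The commit message only says that 'archive_fmt' values are used to create a unique ID.

```python
# Hypothetical format string and metadata for a single image.
archive_fmt = "{issue_id}_{page}"
metadata = {"issue_id": 12345, "page": 7}

# Prefixing a category keeps IDs from different sites apart (assumption).
archive_id = "readcomiconline" + archive_fmt.format_map(metadata)
```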
Mike Fährmann
e6814aebe2 add 'extractor.*.user-agent' config option 2017-11-15 14:01:33 +01:00
Mike Fährmann
68a0a7579c fix/improve some regular expressions 2017-10-09 22:37:50 +02:00
Mike Fährmann
885bd4cbe2 [readcomiconline] extract comic metadata 2017-09-18 19:18:24 +02:00
Mike Fährmann
92a11528d1 smaller changes 2017-06-28 09:42:49 +02:00
Mike Fährmann
f226417420 simplify code by using a MangaExtractor base class 2017-05-20 11:27:43 +02:00
Mike Fährmann
f537ad5f2f [kissmanga] re-enable module 2017-04-05 12:16:23 +02:00
Mike Fährmann
94e10f249a code adjustments according to pep8 nr2 2017-02-01 00:53:19 +01:00
Mike Fährmann
40dbea7ed2 rewrite parts of the cloudflare bypass system 2016-12-16 13:28:36 +01:00
Mike Fährmann
2449825d53 [kissmanga] solve cloudflare challenge on demand 2016-11-23 12:48:44 +01:00
Mike Fährmann
9e3788175e implement decorator for cloudflare bypass
this method for enabling and caching a cloudflare bypass for a
requests.session object allows for different cache-timeouts for
different domains
2016-11-20 18:05:49 +01:00
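
Note on 9e3788175e: a rough sketch of a caching decorator with per-domain timeouts, assuming a dict-based stand-in for the session and a made-up 'solve_challenge' function. It illustrates the general technique only and is not the implementation from this commit.

```python
import functools
import time

_cache = {}  # domain -> (expiry timestamp, cookies)


def cached_bypass(timeouts, default=3600):
    """Cache the decorated function's cookies per domain.

    'timeouts' maps a domain to its cache lifetime in seconds.
    """
    def decorator(solve):
        @functools.wraps(solve)
        def wrapper(session, domain):
            now = time.time()
            entry = _cache.get(domain)
            if entry is not None and entry[0] > now:
                cookies = entry[1]  # still valid: reuse
            else:
                cookies = solve(session, domain)
                lifetime = timeouts.get(domain, default)
                _cache[domain] = (now + lifetime, cookies)
            session["cookies"].update(cookies)
            return cookies
        return wrapper
    return decorator


calls = []


@cached_bypass({"example.org": 60})
def solve_challenge(session, domain):
    # Stand-in for the real challenge solver.
    calls.append(domain)
    return {"clearance": "token-for-" + domain}


session = {"cookies": {}}
solve_challenge(session, "example.org")
solve_challenge(session, "example.org")  # served from the cache
```

The second call reuses the cached cookies, so the solver runs only once within the 60-second window set for that domain.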
Mike Fährmann
b634ace39e [readcomiconline] add comic-issue and comic extractor 2016-11-14 18:29:45 +01:00