1155 Commits

Author SHA1 Message Date
Mike Fährmann
8bf3cdd82b implement logging options
Standard logging to stderr, logfiles, and unsupported URL files (which
are now handled through the logging module) can now be configured by
setting their respective option keys (log, logfile, unsupportedfile)
to a dict and specifying the following options:

- format:
    format string for logging messages
    available keys: see [1]
    default: "[{name}][{levelname}] {message}"
- format-date:
    format string for {asctime} fields in logging messages
    available keys: see [2]
    default: "%Y-%m-%d %H:%M:%S"
- level:
    the lowercase name of the lowest level at which messages are logged;
    available levels are debug, info, warning, error, exception
    default: "info"
- path:
    path of the file to be written to
- mode:
    'mode' argument when opening the specified file
    can be either "w" to truncate the file or "a" to append to it (see [3])

If 'output.log', '.logfile', or '.unsupportedfile' is a string, it is
interpreted as before: as the file path
(or, for '.log', as the format string).

[1] https://docs.python.org/3/library/logging.html#logrecord-attributes
[2] https://docs.python.org/3/library/time.html#time.strftime
[3] https://docs.python.org/3/library/functions.html#open
2018-05-01 17:54:52 +02:00
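A minimal sketch of how the options listed in commit 8bf3cdd82b might map onto Python's standard logging module. The helper name `build_handler`, the `LEVELS` table, and the default file mode "w" are assumptions made for illustration; they are not taken from the project's code.

```python
import logging

# Defaults quoted from the commit message.
DEFAULTS = {
    "format": "[{name}][{levelname}] {message}",
    "format-date": "%Y-%m-%d %H:%M:%S",
    "level": "info",
}

# The commit also lists "exception"; it is left out here because the
# standard library defines no level constant of that name.
LEVELS = {
    "debug": logging.DEBUG,
    "info": logging.INFO,
    "warning": logging.WARNING,
    "error": logging.ERROR,
}


def build_handler(opts, stream=None):
    """Create a logging handler from an options dict (hypothetical helper)."""
    if isinstance(opts, str):
        # For logfile-like options a plain string is a file path.
        opts = {"path": opts}
    opts = {**DEFAULTS, **opts}

    if "path" in opts:
        # Assumption: truncate ("w") unless "mode" says otherwise.
        handler = logging.FileHandler(opts["path"], mode=opts.get("mode", "w"))
    else:
        handler = logging.StreamHandler(stream)  # stderr when stream is None

    # style="{" selects str.format()-style fields such as {message}.
    handler.setFormatter(
        logging.Formatter(opts["format"], opts["format-date"], style="{"))
    handler.setLevel(LEVELS.get(opts["level"], logging.INFO))
    return handler
```

Passing `{"level": "warning"}` would then suppress debug and info messages on that handler while keeping the default format.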
Mike Fährmann
95392554ee use text.urljoin() 2018-04-26 17:00:26 +02:00
Mike Fährmann
2721417dd8 Merge branch 'master' into 1.4-dev 2018-04-24 11:33:02 +02:00
Mike Fährmann
c6d5154fc3 fix flake8 errors, ignore W504
pycodestyle 2.4.0 enforces some new style guidelines
2018-04-24 11:25:32 +02:00
Mike Fährmann
2d17a9e07f improve extractor.request()
- better retry behavior
- exponential back-off
- removed 'allow_empty' argument
2018-04-23 18:45:59 +02:00
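Retrying with exponential back-off, as mentioned in commit 2d17a9e07f, can be sketched roughly as follows. The function name, its parameters, and the choice of OSError as the retryable error are assumptions for illustration and do not reflect the actual extractor.request() signature.

```python
import time


def request_with_retries(send, retries=4, base_delay=1.0, sleep=time.sleep):
    """Call send() until it succeeds or the retry budget is used up.

    'send' is any zero-argument callable that returns a result or
    raises OSError; every name here is hypothetical.
    """
    tries = 0
    while True:
        try:
            return send()
        except OSError:
            tries += 1
            if tries > retries:
                raise  # give up and let the caller see the last error
            # Exponential back-off: base_delay, then 2x, 4x, 8x, ...
            sleep(base_delay * 2 ** (tries - 1))
```

The `sleep` parameter exists only so that the waiting can be replaced in tests.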
Mike Fährmann
80521ae1f6 [deviantart] improve API error handling
The previous implementation would retry requests with 4xx status codes
in an infinite loop, which is especially a problem when querying
non-existent users or groups. These are now properly handled with a
NotFoundError exception.
2018-04-23 10:10:43 +02:00
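The idea behind commit 80521ae1f6, treating client errors as final instead of retrying them, might look like the sketch below. The `classify_status` helper is hypothetical, and the rule that every 4xx code ends in NotFoundError is a simplifying assumption.

```python
class NotFoundError(Exception):
    """Stand-in for the NotFoundError exception named in the commit."""


def classify_status(status_code):
    """Decide how to react to an API response code (hypothetical helper)."""
    if status_code < 400:
        return "ok"
    if status_code >= 500:
        return "retry"  # server-side problems may be temporary
    # Assumption: a 4xx answer will not change on retry, so stop here
    # instead of looping forever.
    raise NotFoundError("request failed with HTTP {}".format(status_code))
```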
Mike Fährmann
e54b43be08 [mangadex] add title info for chapter extractors 2018-04-22 16:20:04 +02:00
Mike Fährmann
f471161920 Merge branch 'master' into 1.4-dev 2018-04-21 12:15:40 +02:00
Mike Fährmann
a2020c736e release version 1.3.4 2018-04-20 18:42:09 +02:00
Mike Fährmann
eb37fbf0e8 [hentaifoundry] improve extractor
- use common base class
- better pagination
- respect '.../page/<num>'
- implement skip() / --range support
- get YII_CSRF_TOKEN from cookies
2018-04-20 18:26:23 +02:00
Mike Fährmann
80bead739d [oauth] require custom client-* values for pinterest 2018-04-20 15:31:05 +02:00
Mike Fährmann
cc36f88586 rename safe_int to parse_int; move parse_* to text module 2018-04-20 14:53:21 +02:00
Mike Fährmann
ff643793bd improve and document cloudflare bypass code 2018-04-19 21:32:10 +02:00
Mike Fährmann
10cc59f3b5 fix extractor names 2018-04-18 18:12:57 +02:00
Mike Fährmann
b1325d4d2c fix extractor docstrings 2018-04-18 18:03:43 +02:00
Mike Fährmann
df7e18399e [luscious] fix image order 2018-04-17 17:32:21 +02:00
Mike Fährmann
d10579edb5 [pinterest] improve PinterestAPI code; remove OAuth mentions
on another note: access_tokens have been set to only allow for
10 requests per hour (from 200 yesterday)
2018-04-17 17:12:42 +02:00
Mike Fährmann
4bd182c107 [pinterest] implement oauth:pinterest (#83)
Pinterest access tokens are rate limited at 200 requests per
hour (or maybe per 2 or 3 hours?) so having just one access token
for all users isn't going to work in the long run.
2018-04-16 20:03:28 +02:00
Mike Fährmann
9651f3fce0 [pinterest] improve error messages (#83) 2018-04-16 19:36:54 +02:00
Mike Fährmann
dbe250f7e5 [pinterest] update access_token (#83) 2018-04-16 09:46:45 +02:00
Mike Fährmann
dd49127408 [spectrumnexus] remove module
Site stopped hosting manga scans (http://view.thespectrum.net/)
2018-04-16 09:45:07 +02:00
Mike Fährmann
5c487300ee improve 'parse_query()' and add tests
- another irrelevant micro-optimization !
- use urllib.parse.parse_qsl directly instead of parse_qs, which
  just packs the results of parse_qsl in a different data structure
- reduced memory requirements since no additional dict and lists are
  created
2018-04-15 19:05:29 +02:00
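To illustrate the point of commit 5c487300ee: urllib.parse.parse_qs returns a dict of lists, while parse_qsl returns a flat list of (key, value) pairs, so a flat dict can be built from the latter without the intermediate lists. The sketch below assumes that the first value wins when a key repeats; that rule is an assumption, not something the commit message states.

```python
import urllib.parse


def parse_query(qs):
    """Parse a query string into a flat dict of strings."""
    result = {}
    for key, value in urllib.parse.parse_qsl(qs):
        # Assumption: keep the first value when a key appears twice.
        result.setdefault(key, value)
    return result
```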
Mike Fährmann
728c64a3fb [tumblr] rename 'offset' to 'num' and adjust formats
Trying to somehow emulate Tumblr filenames is a bad idea ...
2018-04-15 18:58:32 +02:00
Mike Fährmann
4ffa94f634 remove 'shorten_path()' and 'shorten_filename()' 2018-04-15 18:44:13 +02:00
Mike Fährmann
27eab4e467 rewrite text tests and improve functions
- test more edge cases
- consistently return an empty string for invalid arguments
- remove the ungreedy-flag in 'remove_html()'
2018-04-15 18:13:46 +02:00
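Returning an empty string for invalid arguments, as described in commit 27eab4e467, could be done along these lines. This is a sketch, not the project's text.remove_html(); the regular expression and the whitespace handling are assumptions.

```python
import re


def remove_html(txt):
    """Strip HTML tags and collapse whitespace; "" for non-string input."""
    try:
        # Replace each tag with a space, then normalize runs of whitespace.
        return " ".join(re.sub(r"<[^>]+>", " ", txt).split())
    except TypeError:
        # re.sub() raises TypeError for None, ints, and other non-strings.
        return ""
```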
Mike Fährmann
e3f2bd4087 add tests for 'text.clean_xml()' and improve it 2018-04-14 22:07:01 +02:00
Mike Fährmann
6d8b191ea7 improve 'parse_query()' and add tests
- another irrelevant micro-optimization !
- use urllib.parse.parse_qsl directly instead of parse_qs, which
  just packs the results of parse_qsl in a different data structure
- reduced memory requirements since no additional dict and lists are
  created
2018-04-13 19:21:32 +02:00
Mike Fährmann
51ea699083 add 'abort()' as function to filter expressions
calling 'abort()' in a filter aborts the current extractor run
in a cleaner way than using something like 1/0, which
causes an error message to be printed
2018-04-12 17:07:12 +02:00
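One way an 'abort()' function could be exposed to filter expressions, as commit 51ea699083 describes, is sketched below. The exception name `StopExtraction` and the `build_filter` helper are hypothetical; the commit message does not say how the project implements this.

```python
class StopExtraction(Exception):
    """Hypothetical exception that signals a clean stop."""


def abort():
    """Made available inside filter expressions."""
    raise StopExtraction()


def build_filter(expression):
    """Compile a Python expression into a predicate over metadata dicts."""
    code = compile(expression, "<filter>", "eval")

    def check(metadata):
        # Metadata keys are visible as variable names in the expression.
        return bool(eval(code, {"abort": abort}, metadata))

    return check
```

With the expression `num <= 3 or abort()`, the first three items pass the filter and the fourth raises `StopExtraction`, which the caller can catch without printing an error the way a ZeroDivisionError from `1/0` would.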
Mike Fährmann
6bd857a319 [tumblr] handle rate limits / 429 errors
- wait for the hourly limit to reset
- abort upon exceeding the daily limit (it doesn't seem useful to
  potentially wait for several hours)
2018-04-12 16:25:20 +02:00
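The policy from commit 6bd857a319 (wait out the hourly limit, abort on the daily limit) can be sketched as a small decision function. All names below are hypothetical, and the two inputs are assumed to have been read from the API response already.

```python
import time


class DailyLimitExceeded(Exception):
    """Hypothetical exception for an exhausted daily quota."""


def handle_rate_limit(daily_remaining, hourly_reset, sleep=time.sleep):
    """React to an HTTP 429 response.

    daily_remaining: requests left in the daily quota (assumed input)
    hourly_reset:    seconds until the hourly quota resets (assumed input)
    """
    if daily_remaining <= 0:
        # Waiting could take several hours, so stop instead.
        raise DailyLimitExceeded("daily API rate limit exceeded")
    sleep(hourly_reset)
    return hourly_reset
```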
Mike Fährmann
7073ab7707 [komikcast] update regex to only match manga pages
The 'readerarea' section now includes some (shady) external
JavaScript file, which got matched as well.
2018-04-11 15:48:17 +02:00
Mike Fährmann
a1fa4b43b0 Revert "[tumblr] add option to sort photosets by upload order"
This reverts commit 4a26ae32df.
2018-04-09 16:08:08 +02:00
Mike Fährmann
48a83a89e9 [loveisover] remove module
archive.loveisover.me was shut down on 2018-03-29;
https://www.archiveteam.org/index.php?title=4chan#archive.loveisover.me
2018-04-09 16:05:15 +02:00
Mike Fährmann
564e12ca8f replace 'imgyt' with 'imxto'
https://img.yt/ wasn't available for a couple of days, but has now
re-emerged as https://imx.to/ with a new web-interface.
Links to older images still work (see tests).
2018-04-09 15:53:20 +02:00
Mike Fährmann
1b80fa82a9 [imgur] update URL pattern and tests 2018-04-08 21:06:21 +02:00
Mike Fährmann
4a26ae32df [tumblr] add option to sort photosets by upload order 2018-04-07 15:57:55 +02:00
Mike Fährmann
6b72be8ee6 [tumblr] add 'hash' keyword
'hash' is the middle part of the filename in a tumblr image URL.
For example an image with '.../tumblr_p6tgemp1NZ1wgha4yo1_250.png' as
its URL would have 'p6tgemp1NZ1wgha4yo1' as hash.
2018-04-07 15:54:30 +02:00
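Extracting the 'hash' described in commit 6b72be8ee6 could be done with a regular expression like the one below. The pattern is an assumption based only on the single example URL in the commit message and may not cover every Tumblr URL shape.

```python
import re


def tumblr_hash(url):
    """Return the middle part of a Tumblr image filename, or ""."""
    # Assumed shape: .../tumblr_<hash>_<size>.<extension>
    match = re.search(r"/tumblr_([^/_]+)_\d+\.\w+$", url)
    return match.group(1) if match else ""


print(tumblr_hash(".../tumblr_p6tgemp1NZ1wgha4yo1_250.png"))
# prints: p6tgemp1NZ1wgha4yo1
```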
Mike Fährmann
ffc0c67701 release version 1.3.3 2018-04-06 15:45:45 +02:00
Mike Fährmann
d11fcf4804 smaller changes and fixes
- fix the cloudflare challenge result if the last decimal places
  are zero (JS's toFixed() removes trailing zeroes)
- fix downloading of kissmanga chapter-pages hosted on blogspot
  (accessing blogspot with "kissmanga.com" as referrer yields a 401)
- disable certificate validation for 'mangahere' tests
- update flickr test result
2018-04-06 15:30:09 +02:00
Mike Fährmann
f6c95dccf9 [cloudflare] fix bypass procedure
Cloudflare challenges, at least for kissmanga and readcomiconline,
now use slightly different JavaScript expressions.

Instead of a single value per expression, they now have a numerator
and a denominator of a fractional value, which in the end gets
truncated to 10 decimal places.
2018-04-05 20:28:04 +02:00
Mike Fährmann
759ba26fb0 [luscious] proper image order for picture albums
... and (try) to start with the first image instead of somewhere
in the middle of an album.
2018-04-05 18:12:01 +02:00
Mike Fährmann
68e9fbee16 [tumblr] check all 4 keys/secrets before using OAuth
it was possible to cause a crash by setting api-key or -secret to null.
(this commit also slightly improves the blog-cache implementation)
2018-04-05 15:42:23 +02:00
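The check described in commit 68e9fbee16 amounts to requiring all four values before taking the OAuth path. In this sketch only "api-key" and "api-secret" are named after the commit message; the other two key names and the helper itself are assumptions.

```python
# "access-token" and "access-token-secret" are assumed names.
OAUTH_KEYS = ("api-key", "api-secret", "access-token", "access-token-secret")


def can_use_oauth(config):
    """True only if every OAuth value is present and not null/empty."""
    return all(config.get(key) for key in OAUTH_KEYS)
```

A config with "api-secret" set to null then falls back to the non-OAuth path instead of crashing.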
Mike Fährmann
4810d446bb remove the obsolete safeprint() and error() functions
- safeprint() was used to print values which might have caused a
  UnicodeEncodeError, but that is no longer necessary (0381ae5)
- errors are now handled via logging output (f94e370)
2018-04-05 13:10:33 +02:00
Mike Fährmann
0381ae5318 replace error handlers for stdout and co.
Python 3.5 and lower raise a UnicodeEncodeError when trying to print
unencodable characters with an encoding other than 'utf-8'.
Setting their error handlers to 'replace' should help.
2018-04-04 17:30:42 +02:00
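One way to give a text stream a 'replace' error handler, in the spirit of commit 0381ae5318, is to re-wrap its underlying binary buffer. The helper name is hypothetical, and this is a sketch rather than the project's actual code.

```python
import io


def with_replace_errors(stream):
    """Return a new text wrapper around stream's buffer that substitutes
    '?' for characters the encoding cannot represent."""
    return io.TextIOWrapper(
        stream.buffer,
        encoding=stream.encoding,
        errors="replace",
        line_buffering=stream.line_buffering,
    )


# Possible usage (not executed here):
#   sys.stdout = with_replace_errors(sys.stdout)
#   sys.stderr = with_replace_errors(sys.stderr)
```

On Python 3.7 and later, `sys.stdout.reconfigure(errors="replace")` achieves the same without creating a new wrapper.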
Mike Fährmann
f8168c693e [tumblr] avoid calls to '/blog/.../info'
The same information returned by the 'blog/.../info' API endpoint
is also included in the result of every 'blog/.../posts' call.
2018-04-04 14:15:24 +02:00
Mike Fährmann
64d7c85b55 [exhentai] improve metadata
- add 'width', 'height' and 'size' (in bytes) for each image
- change the former 'size' and 'size_units' into 'gallery_size'
2018-04-03 18:59:53 +02:00
Mike Fährmann
64b22e0fc1 [pawoo] update URL pattern
adds support for 'https://pawoo.net/@.../media'
2018-04-02 13:00:59 +02:00
Mike Fährmann
7b562907c3 [nijie] add favorites extractor
adds support for 'https://nijie.info/user_like_illust_view.php?id=...'
2018-03-31 18:54:25 +02:00
Mike Fährmann
445db75955 [nijie] improve extraction and metadata
- add 'title' and 'description'
- split 'artist_id' into 'user_id' and 'artist_id'
  - 'user_id' is the ID of the user from which the image entry
    originates
  - 'artist_id' is the ID of the actual image artist
- improve pagination and URL patterns
2018-03-31 18:48:41 +02:00
Mike Fährmann
a112e3f2a0 [nijie] add doujin extractor
adds support for "https://nijie.info/members_dojin.php?id=<artist_id>"
2018-03-31 18:17:41 +02:00
Mike Fährmann
f39153b6e9 [nhentai] add extractor for search results 2018-03-28 17:21:44 +02:00