Commit Graph

169 Commits

Author SHA1 Message Date
Mike Fährmann
27eab4e467 rewrite text tests and improve functions
- test more edge cases
- consistently return an empty string for invalid arguments
- remove the ungreedy-flag in 'remove_html()'
2018-04-15 18:13:46 +02:00
Mike Fährmann
e3f2bd4087 add tests for 'text.clean_xml()' and improve it 2018-04-14 22:07:01 +02:00
Mike Fährmann
6d8b191ea7 improve 'parse_query()' and add tests
- another irrelevant micro-optimization !
- use urllib.parse.parse_qsl directly instead of parse_qs, which
  just packs the results of parse_qsl in a different data structure
- reduced memory requirements since no additional dict and lists are
  created
2018-04-13 19:21:32 +02:00
Mike Fährmann
51ea699083 add 'abort()' as function to filter expressions
calling 'abort()' in a filter aborts the current extractor run
in a cleaner way than using something like 1/0, which
causes an error message to be printed
2018-04-12 17:07:12 +02:00
Mike Fährmann
48a83a89e9 [loveisover] remove module
archive.loveisover.me was shut down on 2018-03-29;
https://www.archiveteam.org/index.php?title=4chan#archive.loveisover.me
2018-04-09 16:05:15 +02:00
Mike Fährmann
564e12ca8f replace 'imgyt' with 'imxto'
https://img.yt/ wasn't available for a couple of days, but has now
re-emerged as https://imx.to/ with a new web-interface.
Links to older images still work (see tests).
2018-04-09 15:53:20 +02:00
Mike Fährmann
d11fcf4804 smaller changes and fixes
- fix the cloudflare challenge result if the last decimal places
  are zero (JS`s toFixed() removes trailing zeroes)
- fix downloading of kissmanga chapter-pages hosted on blogspot
  (accessing blogspot with "kissmanga.com" as referrer yields a 401)
- disable certificate validation for 'mangahere' tests
- update flickr test result
2018-04-06 15:30:09 +02:00
Mike Fährmann
759ba26fb0 [luscious] proper image order for picture albums
... and (try) to start with the first image instead of somewhere
in the middle of an album.
2018-04-05 18:12:01 +02:00
Mike Fährmann
0381ae5318 replace error handlers for stdout and co.
Python3.5 and lower throw an UnicodeEncodeError when trying to print
not-encodable characters when not using 'utf-8' as encoding.
Setting their error handlers to 'replace' should help.
2018-04-04 17:30:42 +02:00
Mike Fährmann
64d7c85b55 [exhentai] improve metadata
- add 'width', 'height' and 'size' (in bytes) for each image
- change the former 'size' and 'size_units' into 'gallery_size'
2018-04-03 18:59:53 +02:00
Mike Fährmann
a112e3f2a0 [nijie] add doujin extractor
adds support for "https://nijie.info/members_dojin.php?id=<artist_id>"
2018-03-31 18:17:41 +02:00
Mike Fährmann
299ae24996 [test] add a few downloader tests 2018-03-25 15:10:25 +02:00
Mike Fährmann
dd314279fb [test] add unit tests for extractor module functions 2018-03-25 11:49:42 +02:00
Mike Fährmann
f5c6a2d7f5 [nhentai] use API to get gallery info 2018-03-21 12:58:41 +01:00
Mike Fährmann
8ef790de12 update .travis.yml
- restrict builds to master branch and release tags
- implement 'core' and 'results' test categories
2018-03-19 17:57:32 +01:00
Mike Fährmann
557cb94f81 [deviantart] use proper exponential backoff on API errors
... and use separate API credentials for unit tests.
2018-03-15 16:01:42 +01:00
Mike Fährmann
b69cc94f0e [util] implement bencode() 2018-03-14 13:17:34 +01:00
Mike Fährmann
4d74749496 [tests] rework filters for extractor tests
CI incompatible tests will now only be skipped if tests are run in
a CI environment.
2018-03-13 13:11:10 +01:00
Mike Fährmann
32bbd12f08 update extractor tests 2018-03-08 18:04:34 +01:00
Mike Fährmann
749fbbfa6c [mangadex] add chapter- and manga-extractor 2018-03-05 18:37:21 +01:00
Mike Fährmann
5008e105ee update archive IDs
... to behave in a more straightforward way when dealing with
bookmarks/favourites/etc.

specific IDs are now grouped by their owner, album-id, ... to
allow for duplicates when it would be expected.
2018-03-01 18:20:50 +01:00
Mike Fährmann
2fad0b1f1b add 'U' conversion for format strings to unquote their content
(#74)
2018-02-25 21:57:59 +01:00
Mike Fährmann
8f338347b6 [imagehosts] cleanup
removed
- chronos.to      - unable to resolve hostname
- coreimg.net     - same
- imgmaid.net     - same
- hosturimage.com - everything returns 404
- imageontime.org - redirects to some shady site
- imgupload.yt    - cloudflare error 522, host down
- img4ever.net    - read timeout
2018-02-23 01:05:42 +01:00
Mike Fährmann
e1e0668ca8 add option to set default replacement field value
Missing or undefined keywords will now be replaced with the value
set for 'keywords-default'. The default is Python's 'None', which
is equivalent to setting this option to JSON's 'null'.
2018-02-23 00:59:20 +01:00
Mike Fährmann
ac3da8115e [util] don't add text: URLs to list of downloaded URLs 2018-02-20 18:14:27 +01:00
Mike Fährmann
89440382ad [tumblr] use separate API key for unit tests 2018-02-19 16:54:37 +01:00
Mike Fährmann
b50bdbf3d7 change config specifiers in input file format
Instead of a dictionary/object, input file options are now specified
by a 'key=value' pair starting with '-' for options only applying to
the next URL or '-G' for Global options applying to all following URLs.

See the docstring of parse_inputfile() for details.

Example option specifiers:

- filename = "{id}.{extension}"
- extractor.pixiv.user.directory = ["Pixiv Users", "{user[id]}"]
-spaces="are_optional"
-G keywords = {"global": "option"}
2018-02-16 03:10:41 +01:00
Mike Fährmann
be3ea4425d test archive-id creation and uniqueness 2018-02-12 23:02:09 +01:00
Mike Fährmann
b73b8b4f50 add OAuth unittests 2018-02-12 17:07:07 +01:00
Mike Fährmann
f5f2d29f56 [nijie] fix dojin extraction
- correctly extract artist_id
- set extension to "jpg" if it was empty and let filetype checks do
  the rest
2018-02-09 22:06:26 +01:00
Mike Fährmann
7a412f5c32 implement generic manga-chapter extractor 2018-02-04 22:02:04 +01:00
Mike Fährmann
aa38eab2be allow not-defined fields in format strings
... and replace them with "None", for now
2018-02-03 22:28:41 +01:00
Mike Fährmann
619387cbb1 update extractor unittest results 2018-01-28 18:29:05 +01:00
Mike Fährmann
f94e3706a8 use logging module for error messages during downloads 2018-01-26 18:11:13 +01:00
Mike Fährmann
0dd48d644f update test results
nothing broke, but things got updated or changed
2018-01-23 21:38:29 +01:00
Mike Fährmann
1e93955170 [batoto] remove module
Site officially shut down on 2018.01.18
2018-01-23 21:37:32 +01:00
Mike Fährmann
f10ffc0839 update extractor blacklist to also allow classes 2018-01-14 18:47:22 +01:00
Mike Fährmann
35e09869d1 [mangapark] fix image URLs and use HTTPS 2018-01-12 14:59:49 +01:00
Mike Fährmann
4edb25346e [slideshare] support mobile URLs (closes #67) 2018-01-10 14:15:00 +01:00
Mike Fährmann
b33efc99a4 [idolcomplex] add support for idol.sankakucomplex.com 2018-01-09 17:54:37 +01:00
Mike Fährmann
1a70857a12 update extractor-unittest capabilities
- "count" can now be a string defining a comparison in the form of
  '<operator> <value>', for example: '> 12' or '!= 1'. If its value
  is not a string, it is assumed to be a concrete integer as before.

- "keyword" can now be a dictionary defining tests for individual keys.
  These tests can either be a type, a concrete value or a regex
  starting with "re:". Dictionaries can be stacked inside each other.
  Optional keys can be indicated with a "?" before its name.

  For example:
      "keyword:" {
          "image_id": int,
          "gallery_id", 123,
          "name": "re:pattern",
          "user": {
              "id": 321,
          },
          "?optional": None,
      }
2017-12-30 19:05:37 +01:00
Mike Fährmann
28cd78aae0 [kissmanga] extend chapter-string regex (closes #58) 2017-12-24 22:53:10 +01:00
Mike Fährmann
fc7d165c97 [deviantart] add support for OAuth2 authentication
Some user galleries [*] require you to be either logged in or
authenticated via OAuth2 to access their deviations.

[*] e.g. https://polinaegorussia.deviantart.com/gallery/

--------------

known issue:
A deviantart 'refresh_token' can only be used once and gets updated
whenever it is used to request a new 'access_token', so storing its
initial value in a config file and reusing it again and again is not
possible.
2017-12-18 01:16:46 +01:00
Mike Fährmann
0a9a07a6e1 [slideshare] improve metadata; flake8
- added 'views' and 'published' keywords
- fixed longer titles and descriptions
2017-12-13 21:16:49 +01:00
Mike Fährmann
291369eab2 various smaller changes/additions 2017-12-06 21:45:56 +01:00
Mike Fährmann
300346ecdf [mangazuki] remove extractors
This site has been in "rebuild"-mode for a fairly long time and the
current extractor code isn't going to work for the new version either.
2017-12-04 13:36:04 +01:00
Mike Fährmann
93482a1f88 implement 'util.advance()' 2017-12-03 01:38:24 +01:00
Mike Fährmann
a718c6c6cd implement 'util.parse_bytes()' 2017-12-02 01:24:49 +01:00
Mike Fährmann
214972bc9a [gelbooru] use manual extraction
... to compensate for their disabled API.
(https://gelbooru.com/index.php?page=forum&s=view&id=3875)

This also adds an extractor for image-pools.
2017-11-29 20:48:17 +01:00
Mike Fährmann
b14de6ffc2 [tumblr] small improvements
- don't transform inline GIF URLs
- set 'type' parameter for API calls if there is only
  one post type selected
2017-11-24 16:51:07 +01:00