65 Commits

Author SHA1 Message Date
Mike Fährmann
3ecb512722 send Referer headers by default 2023-09-19 00:02:04 +02:00
Mike Fährmann
a453335a9f remove test results in extractor modules
and add generic example URLs
2023-09-11 16:30:55 +02:00
Mike Fährmann
a383eca7f6 decouple extractor initialization
Introduce an 'initialize()' function that does the actual init
(session, cookies, config options) and can be called separately from
the constructor __init__().

This makes it possible, for example, to adjust config access inside a
Job before it takes place, whereas previously most of it had already
happened when calling 'extractor.find()'.
2023-07-25 22:16:16 +02:00
Mike Fährmann
d97b8c2fba consistent cookie-related names
- rename every cookie variable or method to 'cookies_*'
- simplify '.session.cookies' to just '.cookies'
- more consistent 'login()' structure
2023-07-22 01:20:50 +02:00
Mike Fährmann
b0cb4a1b9c replace 'text.extract()' with 'text.extr()' where possible 2022-11-05 01:14:09 +01:00
Mike Fährmann
3b369ce3d1 [nijie] add 'followed' extractor (#3048) 2022-10-14 14:59:18 +02:00
Mike Fährmann
c4a62a48ae [nijie] add 'feed' extractor (#3048) 2022-10-14 12:03:00 +02:00
Mike Fährmann
636d03df95 [nijie] reduce cache maxage to 90 days 2022-08-27 21:57:45 +02:00
Mike Fährmann
241e82e18d [horne] add support for horne.red (#2700) 2022-06-25 16:52:16 +02:00
Mike Fährmann
d11e2191ae [nijie] support /history_nuita.php listings (closes #2541) 2022-05-02 09:03:34 +02:00
Mike Fährmann
1f9a0e2fd8 update extractor test results 2022-04-18 17:24:00 +02:00
Mike Fährmann
bd08ee2859 remove most 'yield Message.Version' statements
only leave them in oauth.py as noop results
2021-08-16 03:10:48 +02:00
Mike Fährmann
b58e605dc7 raise error when required username or password is missing
do not try to login as 'None' (#1192)
2020-12-22 14:40:18 +01:00
Mike Fährmann
6514312126 [nijie] add 'include' option (closes #1018) 2020-09-25 18:18:35 +02:00
Mike Fährmann
e62c209ca0 [nijie] fix 'date' parsing 2019-11-30 23:08:21 +01:00
Mike Fährmann
94dbdbf506 [nijie] change default filename format
… to be consistent with Pixiv filenames
2019-11-04 20:47:38 +01:00
Mike Fährmann
1faec285d1 [nijie] further improvements (closes #423)
- provide a 'user_name' metadata field
  - usually the same as 'artist_id', except for favorite downloads
- extract the whole description text and properly escape HTML entities
- fixed an issue with titles or tags containing double quotes
2019-09-27 23:14:32 +02:00
Mike Fährmann
20eb6c401f [nijie] improvements and fixes (#423)
- ignore unavailable image pages
- more metadata fields: artist_name, date, tags
- rename 'index' to 'num'
- improved code structure
2019-09-26 21:45:01 +02:00
Mike Fährmann
12da6bd0c9 [simplyhentai] fix/improve extraction 2019-07-06 20:25:53 +02:00
Mike Fährmann
fdec59f8e2 replace extractor.request() 'expect' argument
with
- 'fatal': allow 4xx status codes
- 'notfound': raise NotFoundError on 404
2019-07-05 00:42:16 +02:00
Mike Fährmann
b89f0d8d3c update extractor result tests 2019-07-01 20:02:47 +02:00
Mike Fährmann
a2af2d2965 adjust cache maxage values 2019-03-14 22:21:49 +01:00
Mike Fährmann
5530871b5a change results of text.nameext_from_url()
Instead of getting a complete 'filename' from a URL and splitting that
into 'name' and 'extension', the new approach gets rid of the complete
version and renames 'name' to 'filename'. (Using anything other than
{extension} for a filename extension doesn't really work anyway.)

Example: "https://example.org/path/filename.ext"

before:
- filename : filename.ext
- name     : filename
- extension: ext

now:
- filename : filename
- extension: ext
2019-02-14 16:07:17 +01:00
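
The before/after result described in this commit can be illustrated with a minimal sketch. The helper name `split_name_ext` is hypothetical and this is not the project's actual `text.nameext_from_url()` code; it only reproduces the "now" behavior listed above using the standard library.

```python
import os.path
from urllib.parse import unquote, urlsplit


def split_name_ext(url):
    """Hypothetical helper; not the project's real implementation."""
    # Take the last path component, ignoring any query string or fragment.
    name = unquote(urlsplit(url).path.rpartition("/")[2])
    # Split off the extension; 'filename' no longer includes it.
    filename, ext = os.path.splitext(name)
    return {"filename": filename, "extension": ext[1:]}


result = split_name_ext("https://example.org/path/filename.ext")
print(result)
```

A format string would then combine the two fields itself, for example `"{filename}.{extension}"`.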
Mike Fährmann
4b1880fa5e propagate 'match' to base extractor constructor 2019-02-11 13:31:10 +01:00
Mike Fährmann
6284731107 simplify extractor constants
- single strings for URL patterns
- tuples instead of lists for 'directory_fmt' and 'test'
- single-tuple tests where applicable
2019-02-08 13:45:40 +01:00
Mike Fährmann
00dc37ccbf replace AsynchronousMixin Extractor with a Mixin 2019-02-04 14:21:19 +01:00
Mike Fährmann
dd358b4564 improve cookie handling during logins 2019-01-30 17:09:32 +01:00
Mike Fährmann
173add6935 [nijie] fix artist_id extraction
view_popup.php pages for older images or dojins either have the
artist_id value at a different place or not at all.
2018-07-10 12:30:53 +02:00
Mike Fährmann
017188d268 improve extractor.request()
Replace the 'fatal' parameter with 'expect', which is a list/range
of HTTP status codes >= 400 that should also be accepted.
2018-06-18 16:29:56 +02:00
Mike Fährmann
2d17a9e07f improve extractor.request()
- better retry behavior
- exponential back-off
- removed 'allow_empty' argument
2018-04-23 18:45:59 +02:00
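
The commit message does not show how the retry logic is written. As a generic sketch of retrying with exponential back-off, assuming a hypothetical `send` callable that raises `ConnectionError` on failure and a made-up `request_with_retries` wrapper:

```python
import time


def request_with_retries(send, retries=4, base_delay=1.0, sleep=time.sleep):
    """Call send() until it succeeds, doubling the wait after each failure."""
    for attempt in range(retries + 1):
        try:
            return send()
        except ConnectionError:
            if attempt == retries:
                raise  # out of retries: propagate the last error
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...


# Demonstration with a callable that fails twice before succeeding.
# The delays are recorded instead of actually sleeping.
delays = []
calls = {"count": 0}


def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"


result = request_with_retries(flaky, sleep=delays.append)
print(result, delays)
```

Passing the sleep function as a parameter is only a convenience for the demonstration; the parameter names and defaults are assumptions, not the project's values.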
Mike Fährmann
cc36f88586 rename safe_int to parse_int; move parse_* to text module 2018-04-20 14:53:21 +02:00
Mike Fährmann
7b562907c3 [nijie] add favorites extractor
adds support for 'https://nijie.info/user_like_illust_view.php?id=...'
2018-03-31 18:54:25 +02:00
Mike Fährmann
445db75955 [nijie] improve extraction and metadata
- add 'title' and 'description'
- split 'artist_id' into 'user_id' and 'artist_id'
  - 'user_id' is the ID of the user the image entry originates from
  - 'artist_id' is the ID of the actual image artist
- improve pagination and URL patterns
2018-03-31 18:48:41 +02:00
Mike Fährmann
a112e3f2a0 [nijie] add doujin extractor
adds support for "https://nijie.info/members_dojin.php?id=<artist_id>"
2018-03-31 18:17:41 +02:00
Mike Fährmann
3cec533c28 Merge branch 'archive' 2018-02-12 18:07:58 +01:00
Mike Fährmann
f5f2d29f56 [nijie] fix dojin extraction
- correctly extract artist_id
- set extension to "jpg" if it was empty and let filetype checks do
  the rest
2018-02-09 22:06:26 +01:00
Mike Fährmann
34873dbd90 set 'archive_fmt' values
These are going to be used to create a unique ID for each image.
2018-02-01 15:30:49 +01:00
Mike Fährmann
9c138dfc1f [common] detect empty HTTP response bodies 2017-09-26 16:49:58 +02:00
Mike Fährmann
6f30cf4c64 change keyword names to valid Python identifiers
This commit mostly replaces all minus-signs ('-') in keyword names with
underscores ('_') to allow them to be used in filter-expressions. For
example 'gallery-id' got renamed to 'gallery_id'.

(It is theoretically possible to access any variable, regardless of its
name, with 'locals()["NAME"]', but that seems a bit too convoluted if
just 'NAME' could be enough)
2017-09-10 22:20:47 +02:00
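
Why the rename matters for filter expressions can be shown with a small sketch. The `evaluate_filter` function is a hypothetical stand-in for the project's filter handling, which is not shown in this log; it simply evaluates an expression with the keywords available as plain names.

```python
def evaluate_filter(expression, keywords):
    """Evaluate an expression with the keywords available as plain names."""
    # Illustration only: eval() must never be given untrusted expressions.
    return bool(eval(expression, {"__builtins__": {}}, dict(keywords)))


old = {"gallery-id": 12345}
# Replace minus signs with underscores, as the commit describes.
new = {key.replace("-", "_"): value for key, value in old.items()}

# 'gallery-id' is not a valid identifier: in an expression it would be
# parsed as the subtraction 'gallery - id'.
print("gallery-id".isidentifier(), "gallery_id".isidentifier())
print(evaluate_filter("gallery_id > 100", new))
```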
Mike Fährmann
915a0137de improve 'extractor.request'
- add 'fatal' argument
- improve internal logic and flow
- raise known exception on error
- update exception hierarchy
2017-08-05 16:11:46 +02:00
Mike Fährmann
7aa9fa796a code cleanup and fixes 2017-07-25 14:59:41 +02:00
Mike Fährmann
808f67ba7d use 'cookiedomain' for cookies set by object-config-values
otherwise these cookies would not be picked up by the
_check_cookies() method.
2017-07-22 15:43:35 +02:00
Mike Fährmann
0610ae5000 skip login if cookies are present 2017-07-17 10:33:36 +02:00
Mike Fährmann
d3b04076f7 add .netrc support (#22)
Use the '--netrc' cmdline option or set the 'netrc' config option
to 'true' to enable the use of .netrc authentication data.

The 'machine' names for the .netrc info are the lowercase extractor
names (or categories): batoto, exhentai, nijie, pixiv, seiga.
2017-06-24 12:17:26 +02:00
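
A minimal sketch of looking up credentials by extractor category with Python's standard `netrc` module follows. The `lookup_credentials` helper and the placeholder credentials are assumptions for illustration; the commit does not show how the project reads the file.

```python
import netrc
import os
import tempfile


def lookup_credentials(path, category):
    """Return (username, password) for an extractor category, or None."""
    auth = netrc.netrc(path).authenticators(category)
    if auth is None:
        return None
    login, _account, password = auth
    return login, password


# Write a throwaway .netrc-style file with placeholder credentials.
with tempfile.NamedTemporaryFile("w", suffix=".netrc", delete=False) as fp:
    fp.write("machine nijie login exampleuser password examplepass\n")

try:
    found = lookup_credentials(fp.name, "nijie")
    missing = lookup_credentials(fp.name, "pixiv")
finally:
    os.unlink(fp.name)

print(found, missing)
```

The 'machine' entry is the lowercase category name, matching the list in the commit message.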
Mike Fährmann
4b967fa189 implement and use extractor.config() method 2017-04-25 17:12:48 +02:00
Mike Fährmann
298d7c45f7 [nijie] support multi-page image listings 2017-04-02 11:43:23 +02:00
Mike Fährmann
1d46be545c add login notifications 2017-03-17 09:42:59 +01:00
Mike Fährmann
94e10f249a code adjustments according to pep8 nr2 2017-02-01 00:53:19 +01:00
Mike Fährmann
4a8d74973c adjust login methods to a specific style 2017-01-08 17:33:25 +01:00
Mike Fährmann
7952b8d18d add a few tests expecting exceptions 2016-12-30 01:46:42 +01:00