97 Commits

Author SHA1 Message Date
Mike Fährmann
523ebc9b0b Fix serialization of 'datetime' objects in '--write-metadata'
Simplified universal serialization support in json.dump() can be achieved
by passing 'default=str', which was already the case in DataJob.run()
for -j/--dump-json, but not for the 'metadata' post-processor.

This commit introduces util.dump_json() that (more or less) unifies the
JSON output procedure of both --write-metadata and --dump-json.

(#251, #252)
2019-05-09 16:49:22 +02:00
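A minimal sketch of the 'default=str' technique described in the commit above. The helper below is an illustration only; its name mirrors util.dump_json(), but the signature and the 'indent'/'sort_keys' settings are assumptions, not the project's actual code.

```python
import datetime
import io
import json


def dump_json(obj, fp, indent=4):
    """Illustrative helper: write obj as JSON to fp.

    Values the json module cannot encode natively (such as datetime
    objects) are converted with str() via the 'default' hook.
    """
    json.dump(obj, fp, indent=indent, sort_keys=True, default=str)
    fp.write("\n")


# Sample metadata containing a datetime value (made-up data)
metadata = {"id": 1, "date": datetime.datetime(2019, 5, 9, 16, 49, 22)}
buffer = io.StringIO()
dump_json(metadata, buffer)
output = buffer.getvalue()
```

Without 'default=str', json.dump() raises a TypeError for the datetime value.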
Mike Fährmann
b09a8184ca move TestJob into test module; test _extractor values 2019-02-17 18:18:31 +01:00
Mike Fährmann
ae353ed3b0 provide "extractor" and "job" keys for logging output
This allows for stuff like "{extractor.url}" and "{extractor.category}"
in logging format strings.
Accessing 'extractor' and 'job' in any way will return "None" if those
fields aren't defined, i.e. in general logging messages.
2019-02-14 11:09:58 +01:00
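One way the behaviour described above could be sketched with the standard logging module. The class names _Missing and ContextFormatter are hypothetical and not taken from the project; the sketch only shows how "{extractor.category}" can resolve to "None" when a record carries no extractor.

```python
import logging
import types


class _Missing:
    """Placeholder whose attributes all evaluate to None (hypothetical)."""

    def __getattr__(self, name):
        return None

    def __str__(self):
        return "None"


class ContextFormatter(logging.Formatter):
    """Formatter that tolerates records without 'extractor'/'job'."""

    def format(self, record):
        for key in ("extractor", "job"):
            if not hasattr(record, key):
                setattr(record, key, _Missing())
        return super().format(record)


formatter = ContextFormatter("{extractor.category}: {message}", style="{")

# A general log record without any extractor attached
plain = logging.LogRecord("demo", logging.INFO, "demo.py", 1, "hello", None, None)

# A record that carries an extractor-like object (made-up attributes)
tagged = logging.LogRecord("demo", logging.INFO, "demo.py", 1, "hello", None, None)
tagged.extractor = types.SimpleNamespace(category="tumblr", url="https://example.org/")
```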
Mike Fährmann
89ee8cd7e4 filter "private" kwdict entries 2019-02-13 13:22:11 +01:00
Mike Fährmann
61741d7333 provide type information for Queue messages
Child extractors are now directly constructed with Extractor.from_url()
if the extractor class is known beforehand, instead of using
extractor.find() and searching through all possible extractor classes.
2019-02-12 21:32:32 +01:00
Mike Fährmann
277b52101a add 'category-transfer' option
[ci skip]
2019-01-19 20:28:19 +01:00
Mike Fährmann
5f38ac9609 [postprocessor:exec] add a better error message (#155) 2019-01-13 13:59:11 +01:00
Mike Fährmann
0225d90078 add exception name and traceback for OSErrors 2018-12-04 19:24:50 +01:00
Mike Fährmann
fb53b5dd55 fix control+c during -j and range tests 2018-11-25 18:54:05 +01:00
Mike Fährmann
13cb270326 set target directory before postprocessor init (fixes #126) 2018-11-21 22:21:26 +01:00
Mike Fährmann
b828473aa3 retry HTTP requests for more exception classes 2018-11-19 15:49:13 +01:00
Mike Fährmann
c47482b110 smaller changes, missing docs, etc.
- make 'netrc' extractor-specific
- rename 'downloader.enable' to 'enabled'
- document 'downloader.ytdl.format'
- consistent newlines in configuration.rst
2018-11-16 18:18:07 +01:00
Mike Fährmann
3c25fa2dad update build_testresult_db.py script 2018-11-15 22:58:14 +01:00
Mike Fährmann
8ef84a6823 add option to enable/disable specific downloader modules
... and write URLs with no (active) downloader to unsupported-file
2018-11-13 18:06:36 +01:00
Mike Fährmann
d3d7f01543 add 'prepare()' step for post-processors
This allows post-processors to modify the destination path before
checking if a file already exists.
2018-10-18 22:32:03 +02:00
Mike Fährmann
6ed629f2b6 allow specifying number of skips before abort/exit (closes #115)
In addition to 'abort' and 'exit', it is now possible to specify
'abort:N' and 'exit:N' (where N is any integer) as value for 'skip'
to abort/exit after consecutively skipping N downloads.
2018-10-13 17:21:55 +02:00
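A small sketch of how an 'abort:N' or 'exit:N' value could be parsed and applied. SkipCounter is a hypothetical name, and treating a bare 'abort' or 'exit' as N = 1 is an assumption.

```python
class SkipCounter:
    """Track consecutive skips for a 'skip' value like 'abort:3'."""

    def __init__(self, value):
        # 'abort:3' -> action 'abort', limit 3; 'exit' -> action 'exit', limit 1
        self.action, _, count = value.partition(":")
        self.limit = int(count) if count else 1
        self.consecutive = 0

    def skipped(self):
        """Record a skipped download; return True once the limit is reached."""
        self.consecutive += 1
        return self.consecutive >= self.limit

    def downloaded(self):
        """A successful download resets the run of consecutive skips."""
        self.consecutive = 0


counter = SkipCounter("abort:3")
```

Because the count is reset by every successful download, only an uninterrupted run of N skips triggers the action.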
Mike Fährmann
48a8717a7c add 'output.num-to-str' option
... to convert any numeric values to string when outputting them as JSON
(during '--dump-json' or otherwise)
2018-10-08 20:28:54 +02:00
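The conversion could look roughly like the following. This is a sketch rather than the project's code; numbers_to_str is a hypothetical name, and leaving bool values untouched is an assumption.

```python
def numbers_to_str(obj):
    """Recursively convert int and float values to strings.

    Assumption: bool values are left as they are, even though bool is
    a subclass of int.
    """
    if isinstance(obj, bool):
        return obj
    if isinstance(obj, (int, float)):
        return str(obj)
    if isinstance(obj, dict):
        return {key: numbers_to_str(value) for key, value in obj.items()}
    if isinstance(obj, list):
        return [numbers_to_str(value) for value in obj]
    return obj


# Made-up metadata for illustration
sample = {"id": 123, "score": 4.5, "tags": ["a", 7], "nsfw": False}
converted = numbers_to_str(sample)
```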
Mike Fährmann
0514d6a0ae make --filter and --range config-file options
The functionality of --(chapter-)filter and --(chapter-)range are now
also exposed as the following config-file options:

- extractor.*.image-filter
- extractor.*.image-range
- extractor.*.chapter-filter
- extractor.*.chapter-range

TODO: update configuration.rst
2018-10-07 21:39:56 +02:00
Mike Fährmann
4a348990f4 adjust value resolution for retries/timeout/verify options
This change introduces 'extractor.*.retries/timeout/verify' options
as a general way to set these values for all HTTP requests.

'downloader.http.retries/timeout/verify' is a way to override these
options for file downloads only and will fall back to 'extractor.*.…'
values if they haven't been explicitly set.

Also: downloader classes now take an extractor object as first argument
instead of a requests.session.
2018-10-07 21:13:39 +02:00
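The fallback order described above can be sketched as a two-step lookup. The flat config layout and the downloader_option() helper are simplifications made up for this example; the per-category resolution behind 'extractor.*' is not modelled.

```python
_UNSET = object()  # sentinel: distinguishes "not set" from None/False

# Made-up configuration for illustration
config = {
    "extractor": {"retries": 5, "timeout": 30.0},
    "downloader": {"http": {"timeout": 120.0}},
}


def downloader_option(config, key, default=None):
    """Resolve an option for file downloads.

    Prefer downloader.http.<key>; fall back to extractor.<key> when the
    downloader value has not been set explicitly.
    """
    value = config.get("downloader", {}).get("http", {}).get(key, _UNSET)
    if value is _UNSET:
        value = config.get("extractor", {}).get(key, default)
    return value


timeout = downloader_option(config, "timeout")
retries = downloader_option(config, "retries")
verify = downloader_option(config, "verify", True)
```

Here 'timeout' comes from the downloader section, 'retries' falls back to the extractor section, and 'verify' uses the supplied default because neither section sets it.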
Mike Fährmann
ca6ac4db6a fix 'content' tests 2018-10-05 21:10:33 +02:00
Mike Fährmann
188876d814 implement youtube-dl downloader module
URLs starting with 'ytdl:' will now be handled by youtube-dl.
There is probably a lot to fix and improve, but the basic use case
works.

TODO:
- format selection and ytdl options in general
- better filename/path handling
- ytdl support for "unsupported URLs"
- ...
2018-10-05 18:05:11 +02:00
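Choosing a downloader by URL prefix might be sketched like this. The function name and the set of available schemes are assumptions for illustration, not the project's implementation.

```python
def downloader_name(url, available=("http", "https", "ytdl")):
    """Return the scheme-like prefix of url if a downloader exists for it.

    Returns None when the URL has no prefix or the prefix is unknown.
    """
    scheme, sep, _ = url.partition(":")
    if sep and scheme in available:
        return scheme
    return None


name = downloader_name("ytdl:https://example.org/video")
```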
Mike Fährmann
8c8da11bb8 do not create directory structures when using '-s' 2018-09-21 17:55:04 +02:00
Mike Fährmann
41249f3ead improve extractor.get_downloader() 2018-09-05 18:17:16 +02:00
Mike Fährmann
712b58a93b [postprocessor] add black-/whitelist options
Each post-processor config dict now supports a list of extractor
categories for which it should/shouldn't be active.

For example:
"postprocessors": [
    {"name": "classify",
     "whitelist": ["tumblr", "deviantart"],
     ...
    }
]
2018-09-03 14:53:43 +02:00
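A sketch of how such a list could be evaluated. Only the "whitelist" key appears in the example above; the "blacklist" key name, the precedence of the whitelist, and the is_active() helper are assumptions.

```python
def is_active(pp_config, category):
    """Decide whether a post-processor applies to an extractor category.

    Assumptions: a whitelist, if present, takes precedence over a
    blacklist; without either list the post-processor is always active.
    """
    whitelist = pp_config.get("whitelist")
    if whitelist is not None:
        return category in whitelist
    blacklist = pp_config.get("blacklist")
    if blacklist is not None:
        return category not in blacklist
    return True


classify = {"name": "classify", "whitelist": ["tumblr", "deviantart"]}
zipper = {"name": "zip", "blacklist": ["tumblr"]}
```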
Mike Fährmann
4313c95bc9 improve error message for OAuth2 authentication 2018-08-11 23:54:25 +02:00
Mike Fährmann
973cf98e88 fix download skip for files without extension 2018-06-27 17:16:07 +02:00
Mike Fährmann
2403c405e3 Merge branch 'postprocessor' 2018-06-08 17:43:11 +02:00
Mike Fährmann
baccf8a958 improve postprocessor handling
- add pathfmt argument for __init__()
- add finalization step
- add option to keep or delete zipped files
2018-06-08 17:39:02 +02:00
Mike Fährmann
7646bdbcfd improve postprocessor initialization code 2018-06-07 22:29:54 +02:00
Mike Fährmann
821535b458 adjust PathFormat class 2018-06-06 20:17:17 +02:00
Mike Fährmann
2df1a15fb8 add '-s/--simulate' to run data extraction without download
Useful for quick testing (even though -g and -j kind of do the same)
and to fill a download archive without actually downloading the files.

-s does the same as the default behaviour, except downloading stuff.
Maybe it should get a more fitting name, as it does actually write to
disk (cache, archive)?
2018-05-25 16:07:18 +02:00
Mike Fährmann
76c32d58e5 [postprocessor] initial code 2018-05-22 14:59:22 +02:00
Mike Fährmann
8bf3cdd82b implement logging options
Standard logging to stderr, logfiles, and unsupported URL files (which
are now handled through the logging module) can now be configured by
setting their respective option keys (log, logfile, unsupportedfile)
to a dict and specifying the following options:

- format:
    format string for logging messages
    available keys: see [1]
    default: "[{name}][{levelname}] {message}"
- format-date:
    format string for {asctime} fields in logging messages
    available keys: see [2]
    default: "%Y-%m-%d %H:%M:%S"
- level:
    the lowercase level name at or above which the logger emits messages;
    available levels are debug, info, warning, error, exception
    default: "info"
- path:
    path of the file to be written to
- mode:
    'mode' argument when opening the specified file
    can be either "w" to truncate the file or "a" to append to it (see [3])

If 'output.log', '.logfile', or '.unsupportedfile' is a string, it will
be interpreted, as it has been, as the filepath
(or as format string for .log)

[1] https://docs.python.org/3/library/logging.html#logrecord-attributes
[2] https://docs.python.org/3/library/time.html#time.strftime
[3] https://docs.python.org/3/library/functions.html#open
2018-05-01 17:54:52 +02:00
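The options above map fairly directly onto the standard logging module. The sketch below is one possible reading, not the project's code: make_handler() is a hypothetical helper, the default mode "w" is an assumption, and the "exception" level is left out because the standard library defines no constant of that name.

```python
import logging
import sys

# Lowercase level names mapped to stdlib constants ("exception" omitted)
LEVELS = {
    "debug": logging.DEBUG,
    "info": logging.INFO,
    "warning": logging.WARNING,
    "error": logging.ERROR,
}


def make_handler(opts):
    """Build a logging handler from an options dict."""
    path = opts.get("path")
    if path:
        # Assumption: truncate ("w") unless a mode is given
        handler = logging.FileHandler(path, mode=opts.get("mode", "w"))
    else:
        handler = logging.StreamHandler(sys.stderr)
    handler.setFormatter(logging.Formatter(
        fmt=opts.get("format", "[{name}][{levelname}] {message}"),
        datefmt=opts.get("format-date", "%Y-%m-%d %H:%M:%S"),
        style="{",  # "{"-style format strings, as in the defaults above
    ))
    handler.setLevel(LEVELS[opts.get("level", "info")])
    return handler


default_handler = make_handler({})
custom_handler = make_handler({"format": "{levelname}: {message}",
                               "level": "warning"})
```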
Mike Fährmann
9fb82e6b43 apply expand_path() to archive paths 2018-03-08 18:06:39 +01:00
Mike Fährmann
f970a8f13c fix adding keys to download archive when using skip=false 2018-02-13 23:45:30 +01:00
Mike Fährmann
be3ea4425d test archive-id creation and uniqueness 2018-02-12 23:02:09 +01:00
Mike Fährmann
3cec533c28 Merge branch 'archive' 2018-02-12 18:07:58 +01:00
Mike Fährmann
4d2fadfb6f restore skip actions with download archive 2018-02-12 16:56:45 +01:00
Mike Fährmann
7f7c16ae37 add option to specify additional key-value pairs 2018-02-08 23:10:58 +01:00
Mike Fährmann
8c3b713362 rework DownloadJob.handle_url(); include archive functionality
todo:
"abort" and "exit" skip modes if download is skipped because of archive
2018-02-01 20:49:41 +01:00
Mike Fährmann
db7f04dd97 emit log messages on download failure
and when retrying with fallback URLs
2018-01-28 18:44:10 +01:00
Mike Fährmann
27fce6f600 fix UrlJob behavior 2018-01-23 15:42:26 +01:00
Mike Fährmann
b837420291 fix minor urllist issues 2018-01-19 22:54:15 +01:00
Mike Fährmann
9d69401391 initial support for multiple URLs per image 2018-01-17 22:08:19 +01:00
Mike Fährmann
6174a5c4ef [download] adjust filename extension on filetype mismatch
(closes #63)
2018-01-17 18:37:06 +01:00
Mike Fährmann
1a70857a12 update extractor-unittest capabilities
- "count" can now be a string defining a comparison in the form of
  '<operator> <value>', for example: '> 12' or '!= 1'. If its value
  is not a string, it is assumed to be a concrete integer as before.

- "keyword" can now be a dictionary defining tests for individual keys.
  These tests can either be a type, a concrete value or a regex
  starting with "re:". Dictionaries can be stacked inside each other.
  Optional keys can be indicated with a "?" before their names.

  For example:
      "keyword": {
          "image_id": int,
          "gallery_id": 123,
          "name": "re:pattern",
          "user": {
              "id": 321,
          },
          "?optional": None,
      }
2017-12-30 19:05:37 +01:00
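The two checks described above could be implemented roughly as follows. This is an illustrative sketch; check_count() and check_keywords() are hypothetical names, and using re.search() for "re:" patterns is an assumption.

```python
import operator
import re

_OPERATORS = {
    "<": operator.lt, "<=": operator.le,
    ">": operator.gt, ">=": operator.ge,
    "==": operator.eq, "!=": operator.ne,
}


def check_count(count, expected):
    """Compare count against an integer or an '<operator> <value>' string."""
    if isinstance(expected, str):
        op, _, value = expected.partition(" ")
        return _OPERATORS[op](count, int(value))
    return count == expected


def check_keywords(kwdict, tests):
    """Check kwdict against per-key tests: type, value, regex, or dict."""
    for key, test in tests.items():
        if key.startswith("?"):
            key = key[1:]
            if key not in kwdict:
                continue  # optional key is absent: nothing to check
        if key not in kwdict:
            return False
        value = kwdict[key]
        if isinstance(test, dict):
            if not isinstance(value, dict) or not check_keywords(value, test):
                return False
        elif isinstance(test, type):
            if not isinstance(value, test):
                return False
        elif isinstance(test, str) and test.startswith("re:"):
            if not isinstance(value, str) or not re.search(test[3:], value):
                return False
        elif value != test:
            return False
    return True


# Made-up extractor result and a test definition in the style shown above
data = {"image_id": 7, "gallery_id": 123, "name": "a pattern here",
        "user": {"id": 321}}
tests = {"image_id": int, "gallery_id": 123, "name": "re:pattern",
         "user": {"id": 321}, "?optional": None}
```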
Mike Fährmann
88bb0798fd delay initialization of PathFormat objects
This allows the DeviantArt group-check to be moved inside the
Extractor.items() method which in turn allows for better exception
handling.

As a new general rule:
Never raise exceptions during extractor initialization.
2017-12-29 22:15:57 +01:00
Mike Fährmann
9d73ed4772 fix issue with using 'skip()' when a filter is present
calling skip() skips over unfiltered items and does not apply
the filter expression to them, which is not what should happen
2017-12-27 22:09:10 +01:00
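The difference between the two orderings can be shown with plain iterators. The function names below are made up for this illustration and do not come from the project.

```python
import itertools


def skip_then_filter(items, start, predicate):
    """Problematic order: skipped items never see the filter."""
    return filter(predicate, itertools.islice(items, start, None))


def filter_then_skip(items, start, predicate):
    """Intended order: only items that pass the filter count as skipped."""
    return itertools.islice(filter(predicate, items), start, None)


def is_even(n):
    return n % 2 == 0


wrong = list(skip_then_filter(range(1, 11), 2, is_even))
right = list(filter_then_skip(range(1, 11), 2, is_even))
```

With a filter that keeps even numbers and a skip of 2, the first variant drops the items 1 and 2 regardless of the filter, while the second drops the first two matching items, 2 and 4.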
Mike Fährmann
291369eab2 various smaller changes/additions 2017-12-06 21:45:56 +01:00
Mike Fährmann
4fb6803fa6 add option to sleep before each download 2017-12-04 17:33:10 +01:00