Commit Graph

109 Commits

Author SHA1 Message Date
Mike Fährmann
90e4c645ba [formatter] allow multiple "special" format specifiers (#595)
It is now, for example, possible to specify multiple replacement
operations per format replacement field: {name:Ra/b/Rc/d/}
2020-02-16 21:47:08 +01:00
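A minimal sketch of how chained replacement specifiers such as Ra/b/Rc/d/ could be applied left to right. `apply_replacements` is a hypothetical helper for illustration, not the project's formatter, and it only understands R<old>/<new>/ operations.

```python
def apply_replacements(value, spec):
    """Apply every R<old>/<new>/ operation found in 'spec', in order."""
    pos = 0
    while pos < len(spec):
        if spec[pos] != "R":
            raise ValueError("unsupported specifier at position %d" % pos)
        # locate the two slashes that delimit <old> and <new>
        old_end = spec.index("/", pos + 1)
        new_end = spec.index("/", old_end + 1)
        old = spec[pos + 1:old_end]
        new = spec[old_end + 1:new_end]
        value = value.replace(old, new)
        # continue with the next operation, if any
        pos = new_end + 1
    return value


print(apply_replacements("banana", "Ra/b/Rn/d/"))  # bbdbdb
```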
Mike Fährmann
219c4cc78c [formatter] allow for numeric list and string indices 2020-02-15 22:46:22 +01:00
Mike Fährmann
7d1da614d9 [formatter] implement field name alternatives (#525)
The format string '{a|b|c}' will now try to use the value from 'a' and
fall back to 'b' and 'c' if accessing a field raises an exception or
if its value is None.
2020-02-15 17:58:21 +01:00
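The fallback behaviour described for '{a|b|c}' can be sketched roughly as follows. `first_available` is a hypothetical name, and returning None when every alternative fails is an assumption; the commit message does not say what happens in that case.

```python
def first_available(kwdict, field):
    """Return the first usable value among the '|'-separated names."""
    for name in field.split("|"):
        try:
            value = kwdict[name]
        except Exception:
            # accessing the field raised an exception: try the next name
            continue
        if value is not None:
            return value
    return None  # assumption: no alternative produced a value


print(first_available({"a": None, "c": 3}, "a|b|c"))  # 3
```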
Mike Fährmann
56f1c96168 implement 'parent-directory' option (#551) 2020-01-29 18:32:37 +01:00
Mike Fährmann
2a9be48511 improve util.load/save_cookiestxt() and add tests
- take a file object as argument instead of a filename
- accept whitespace before comments ("   # comment")
- map expiration "0" to None and not the number 0
2020-01-25 23:02:15 +01:00
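A minimal sketch of a cookies.txt reader with the properties listed above: it takes a file object, tolerates whitespace before comments, and maps an expiration of "0" to None. `parse_cookiestxt` and the dictionary layout are assumptions made for illustration, not the signature of util.load_cookiestxt(). The seven tab-separated columns follow the usual Netscape cookies.txt layout, and the handling of the #HttpOnly_ prefix is likewise an assumption.

```python
import io


def parse_cookiestxt(fp):
    """Parse Netscape-style cookies.txt lines from a file object."""
    cookies = []
    for line in fp:
        line = line.lstrip()  # accept "   # comment"
        if line.startswith("#HttpOnly_"):
            # assumption: treat the prefix as a marker, not as a comment
            line = line[len("#HttpOnly_"):]
        if not line or line.startswith("#"):
            continue
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) != 7:
            continue  # skip malformed lines
        domain, _flag, path, secure, expires, name, value = fields
        cookies.append({
            "domain": domain,
            "path": path,
            "secure": secure == "TRUE",
            # "0" (or an empty field) means "no expiration", not timestamp 0
            "expires": None if expires in ("", "0") else int(expires),
            "name": name,
            "value": value,
        })
    return cookies


sample = io.StringIO(
    "# Netscape HTTP Cookie File\n"
    "   # indented comment\n"
    ".example.org\tTRUE\t/\tFALSE\t0\tsession\tabc\n"
)
print(parse_cookiestxt(sample)[0]["expires"])  # None
```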
Mike Fährmann
c1a6862863 implement functions to load/save cookies.txt files (closes #586)
The methods of the standard library's MozillaCookieJar have
several shortcomings (#HttpOnly_ cookies, 0 expiration timestamps, etc.)
and require construction of an ultimately pointless CookieJar object.
2020-01-21 21:59:36 +01:00
Mike Fährmann
760b9b4db4 add remove_file() and remove_directory() helpers
these functions call os.unlink() or os.rmdir()
while catching and suppressing potential OSErrors
2020-01-18 00:21:26 +01:00
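Based on the description above, the two helpers could look roughly like this. The bodies are a sketch written from the commit message, not copied from the repository.

```python
import os


def remove_file(path):
    """Delete a file; ignore any OSError (e.g. the file does not exist)."""
    try:
        os.unlink(path)
    except OSError:
        pass


def remove_directory(path):
    """Delete an empty directory; ignore any OSError (e.g. not empty)."""
    try:
        os.rmdir(path)
    except OSError:
        pass
```

Callers can then clean up unconditionally without wrapping each call in their own try-except block.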
Mike Fährmann
b2d542ad40 improve PathFormat._enum_file()
open only one try-except block for the whole loop,
instead of one for each iteration in os.path.exists()
2020-01-18 00:21:25 +01:00
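The idea of a single try-except block around the whole loop can be sketched as follows. `enumerate_path` and the "<stem>.<index><extension>" naming are assumptions for illustration. os.stat() raises an OSError for a missing path, so one handler ends the loop, whereas os.path.exists() catches that error internally on every call.

```python
import os


def enumerate_path(stem, extension):
    """Return the first path of the form stem[.N]extension that is unused."""
    index = 0
    path = stem + extension
    try:
        while True:
            os.stat(path)  # raises OSError once 'path' does not exist
            index += 1
            path = "{}.{}{}".format(stem, index, extension)
    except OSError:
        # note: other OSErrors (e.g. permission problems) also end the loop
        pass
    return path
```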
Mike Fährmann
025f6e3398 add fallback for missing WITHOUT ROWID support (#553) 2020-01-03 22:58:28 +01:00
Mike Fährmann
58391d492d cache archive keys generated in __contains__() (#524)
This avoids writing a different key to the archive than the one that
was checked before the file download.
2019-12-20 16:43:08 +01:00
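A rough sketch of the caching idea, assuming an SQLite-backed archive. `ArchiveSketch`, its table layout, and the `add()` method are hypothetical and only illustrate why the key computed during the membership test is reused when writing.

```python
import sqlite3


class ArchiveSketch:
    """Toy download archive that remembers the key it last checked."""

    def __init__(self, path, key_format):
        self.con = sqlite3.connect(path)
        self.con.execute(
            "CREATE TABLE IF NOT EXISTS archive (entry PRIMARY KEY)")
        self.key_format = key_format
        self._last_key = None

    def __contains__(self, kwdict):
        # build the key once and cache it for a later add()
        key = self._last_key = self.key_format.format_map(kwdict)
        row = self.con.execute(
            "SELECT 1 FROM archive WHERE entry=? LIMIT 1", (key,)).fetchone()
        return row is not None

    def add(self, kwdict):
        # prefer the cached key, even if 'kwdict' changed in the meantime
        key = self._last_key or self.key_format.format_map(kwdict)
        self.con.execute("INSERT OR IGNORE INTO archive VALUES (?)", (key,))
        self.con.commit()
```

With this shape, metadata that changes between the check and the write (for example during the download) cannot cause a key mismatch.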
Mike Fährmann
0f1538af78 split filename formatting into its own function 2019-11-29 22:32:07 +01:00
Mike Fährmann
3fc1e12949 [postprocessor:metadata] filter private entries
i.e. keys starting with an underscore
2019-11-21 16:58:44 +01:00
Mike Fährmann
d5e3910270 adjust 'util.raises()' 2019-10-28 15:06:17 +01:00
Mike Fährmann
c887493a80 overhaul exception stuff 2019-10-27 23:53:37 +01:00
Mike Fährmann
776e9e073f close archive on job completion (#417) 2019-09-10 22:43:51 +02:00
Mike Fährmann
0ce98169b8 improve path generation
- fix 'abspath()' results for Python <3.7 (closes #402)
  - 'abspath()' in Python 3.7+ removes trailing path separators
  - in Python <3.7 it doesn't
- filter empty path segments
2019-08-28 23:25:18 +02:00
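One way to make the result independent of how 'abspath()' treats trailing separators is to normalize explicitly. The sketch below strips trailing separators from the base path, filters empty segments, and appends exactly one separator. `build_directory` is a hypothetical helper, not the project's code.

```python
import os


def build_directory(base, segments):
    """Join 'base' and the non-empty 'segments'; end with one separator."""
    sep = os.sep
    parts = [base.rstrip(sep)]  # same result with or without trailing sep
    parts.extend(seg for seg in segments if seg)  # drop empty segments
    return sep.join(parts) + sep
```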
Mike Fährmann
3284c62f22 ensure PathFormat.directory ends with a path separator
... plus some other small optimizations
2019-08-20 00:25:13 +02:00
Mike Fährmann
e77a656437 optimize directory path generation
- use str.join() instead of os.path.join()
  (less "features", but 10x as fast)
- cache directory formatters
- detect and optimize field access for 1-element format strings
2019-08-19 15:56:20 +02:00
Mike Fährmann
454bf1ebf9 preserve enumeration index after 'set_extension()' (#306) 2019-08-16 23:12:33 +02:00
Mike Fährmann
f5039b897f replace DownloadArchive.check() with __contains__()
Interestingly enough, 'a in obj' is slightly faster than
'obj.check(a)' and is also nicer to look at, I think.
2019-08-16 23:12:32 +02:00
Mike Fährmann
5a210991b6 Remove control characters from filesystem paths
- add 'path-remove' option to specify the set of characters that
  should be removed
- rename 'restrict-filenames' to 'path-restrict'
- #348, #380
2019-08-16 23:12:16 +02:00
Mike Fährmann
0bb873757a update PathFormat class
- change 'has_extension' from a simple flag/bool to a field that
  contains the original filename extension
- rename 'keywords' to 'kwdict' and some other stuff as well
- inline 'adjust_path()'
- put enumeration index before filename extension (#306)
2019-08-12 21:40:37 +02:00
Mike Fährmann
8dc42bb178 implement 'enumerate' for 'extractor.skip' (#306)
[ci skip]
2019-08-08 18:37:54 +02:00
Mike Fährmann
b1bea8aaeb add 'restrict-filenames' option (#348) 2019-07-23 17:41:24 +02:00
Mike Fährmann
7b77ecc35a fix paths for files without extension (#220) 2019-07-15 16:39:03 +02:00
Mike Fährmann
16c582aaf9 implement 'mtime' post-processor (#332)
This can set a file's modification time according to a UNIX timestamp
or a datetime object from its metadata.
2019-07-14 22:39:17 +02:00
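A minimal sketch of setting a modification time from either kind of value, using os.utime(). `set_mtime` is a hypothetical helper; treating naive datetime objects as UTC and setting the access time to the same value are assumptions.

```python
import os
from datetime import datetime, timezone


def set_mtime(path, mtime):
    """Set the modification time of 'path' from a timestamp or datetime."""
    if isinstance(mtime, datetime):
        if mtime.tzinfo is None:
            # assumption: naive datetime values are in UTC
            mtime = mtime.replace(tzinfo=timezone.utc)
        mtime = mtime.timestamp()
    # os.utime() expects an (access time, modification time) pair
    os.utime(path, (mtime, mtime))
```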
Mike Fährmann
40da44b17f Merge branch 'v1.9.0' 2019-06-29 15:39:52 +02:00
Mike Fährmann
95b1e4c3c0 implement R<old>/<new>/ format option (#318) 2019-06-23 22:45:44 +02:00
Mike Fährmann
f4ba98771d use Last-Modified header to set file modification time
(#236, #277)
2019-06-19 23:16:32 +02:00
Mike Fährmann
523ebc9b0b Fix serialization of 'datetime' objects in '--write-metadata'
Simplified universal serialization support in json.dump() can be achieved
by passing 'default=str', which was already the case in DataJob.run()
for -j/--dump-json, but not for the 'metadata' post-processor.

This commit introduces util.dump_json() that (more or less) unifies the
JSON output procedure of both --write-metadata and --dump-json.

(#251, #252)
2019-05-09 16:49:22 +02:00
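The 'default=str' approach can be illustrated with a short sketch. `dump_json_sketch` is a hypothetical stand-in; the actual signature and remaining arguments of util.dump_json() are not given in the commit message, so the indent and sort_keys settings here are assumptions.

```python
import io
import json
from datetime import datetime


def dump_json_sketch(obj, fp):
    """Write 'obj' as JSON; values json cannot encode are passed to str()."""
    json.dump(obj, fp, default=str, indent=4, sort_keys=True)
    fp.write("\n")


buffer = io.StringIO()
dump_json_sketch({"date": datetime(2019, 5, 9, 16, 49, 22)}, buffer)
print(buffer.getvalue())
```

Without 'default=str', json.dump() raises a TypeError for the datetime value.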
Mike Fährmann
23baecb29e fix 'CONVERSIONS' variable name 2019-03-05 22:50:56 +01:00
Mike Fährmann
105097ddcf add 'S' conversion options for format string fields
Same as 's' (convert to string), but has a better, human-readable
conversion for lists.
2019-03-04 21:13:34 +01:00
Mike Fährmann
148b8f15d0 update tests for util.py 2019-02-14 11:15:19 +01:00
Mike Fährmann
ae353ed3b0 provide "extractor" and "job" keys for logging output
This allows for stuff like "{extractor.url}" and "{extractor.category}"
in logging format strings.
Accessing 'extractor' and 'job' in any way will return "None" if those
fields aren't defined, i.e. in general logging messages.
2019-02-14 11:09:58 +01:00
Mike Fährmann
79c01ec7ae implement J<separator>/ format option
J joins list elements by calling <separator>.join(list):

Example:
{f:J - /} -> "a - b - c" (if "f" is ["a", "b", "c"])
2019-01-17 17:01:58 +01:00
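The J option can be sketched as a small function that extracts the separator from the specifier and joins the list with it. `apply_join` is a hypothetical helper and assumes the specifier has exactly the form J<separator>/.

```python
def apply_join(value, spec):
    """Join the list 'value' using the separator from a 'J<separator>/' spec."""
    if not spec.startswith("J"):
        raise ValueError("not a J specifier")
    separator = spec[1:spec.index("/")]
    return separator.join(value)


print(apply_join(["a", "b", "c"], "J - /"))  # a - b - c
```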
Mike Fährmann
c5d4f558c9 allow missing field access keys in format strings (#136) 2018-12-22 13:54:14 +01:00
Mike Fährmann
d3d7f01543 add 'prepare()' step for post-processors
This allows post-processors to modify the destination path before
checking if a file already exists.
2018-10-18 22:32:03 +02:00
Mike Fährmann
6ed629f2b6 allow specifying number of skips before abort/exit (closes #115)
In addition to 'abort' and 'exit', it is now possible to specify
'abort:N' and 'exit:N' (where N is any integer) as value for 'skip'
to abort/exit after consecutively skipping N downloads.
2018-10-13 17:21:55 +02:00
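A sketch of how 'abort:N' and 'exit:N' values might be tracked. `SkipTracker` is hypothetical; treating a plain 'abort' or 'exit' as a limit of 1 and resetting the counter after a successful download are assumptions that follow from the word "consecutively".

```python
class SkipTracker:
    """Count consecutive skips and report when the limit is reached."""

    def __init__(self, value):
        # "abort:3" -> action "abort", limit 3; "abort" -> limit 1
        action, _, count = value.partition(":")
        self.action = action
        self.limit = int(count) if count else 1
        self.consecutive = 0

    def skipped(self):
        """Register a skipped download; return True if the job should stop."""
        self.consecutive += 1
        return self.consecutive >= self.limit

    def downloaded(self):
        """Register a completed download, which breaks the streak."""
        self.consecutive = 0
```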
Mike Fährmann
48a8717a7c add 'output.num-to-str' option
... to convert any numeric values to string when outputting them as JSON
(during '--dump-json' or otherwise)
2018-10-08 20:28:54 +02:00
Mike Fährmann
0514d6a0ae make --filter and --range config-file options
The functionality of --(chapter-)filter and --(chapter-)range are now
also exposed as the following config-file options:

- extractor.*.image-filter
- extractor.*.image-range
- extractor.*.chapter-filter
- extractor.*.chapter-range

TODO: update configuration.rst
2018-10-07 21:39:56 +02:00
Mike Fährmann
590c0b3ad5 re-implement and improve filename formatter
A format string now gets parsed only once instead of re-parsing it each
time it is applied to a set of data.

The initial parsing makes directory path creation about 2x slower
than before, since each format string there is used only once,
but building a filename, the more common operation, is at least 2x
faster. The "directory slowness" cancels out at about 5 filenames, and
everything above that is significantly faster.
2018-08-25 10:45:14 +02:00
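The parse-once idea can be illustrated with the standard library's string.Formatter.parse(). `CompiledFormat` is a hypothetical class, and this sketch ignores format specifiers and conversions, which the real formatter supports.

```python
import string


class CompiledFormat:
    """Parse a format string once; apply it to many dictionaries."""

    def __init__(self, format_string):
        self.parts = []  # (is_field, text) pairs in output order
        parsed = string.Formatter().parse(format_string)
        for literal, field, _spec, _conversion in parsed:
            if literal:
                self.parts.append((False, literal))
            if field is not None:
                self.parts.append((True, field))

    def format_map(self, kwdict):
        # no parsing happens here, only lookups and one join
        return "".join(
            str(kwdict[text]) if is_field else text
            for is_field, text in self.parts
        )


fmt = CompiledFormat("{id}_{name}.jpg")
print(fmt.format_map({"id": 1, "name": "x"}))  # 1_x.jpg
```

The parsing cost is paid once in the constructor, so it pays off only when the same object formats several sets of data.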
Mike Fährmann
c83fc62abc prioritize archive over disk access (#87) 2018-07-30 17:48:23 +02:00
Mike Fährmann
e0dd8dff5f implement L<maxlen>/<replacement>/ format option
The L option allows for the contents of a format field to be replaced
with <replacement> if its length is greater than <maxlen>.

Example:
{f:L5/too long/} -> "foo"      (if "f" is "foo")
                 -> "too long" (if "f" is "foobar")

(#92) (#94)
2018-07-29 13:52:07 +02:00
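The example above can be reproduced with a short sketch. `apply_maxlen` is a hypothetical helper and assumes the specifier has exactly the form L<maxlen>/<replacement>/.

```python
def apply_maxlen(value, spec):
    """Return <replacement> if 'value' is longer than <maxlen>."""
    if not spec.startswith("L"):
        raise ValueError("not an L specifier")
    # "L5/too long/" -> maxlen "5", replacement "too long"
    maxlen, replacement = spec[1:].split("/")[:2]
    return replacement if len(value) > int(maxlen) else value


print(apply_maxlen("foo", "L5/too long/"))     # foo
print(apply_maxlen("foobar", "L5/too long/"))  # too long
```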
Mike Fährmann
8fe9056b16 implement string slicing for format strings
It is now possible to slice string (or list) values of format string
replacement fields with the same syntax as in regular Python code.

"{digits}"       -> "0123456789"
"{digits[2:-2]}" -> "234567"
"{digits[:5]}"   -> "01234"

The optional third parameter (step) has been left out to simplify things.
2018-07-14 09:53:15 +02:00
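A minimal sketch of turning the bracketed part of a replacement field into a Python slice object, without the step parameter. `parse_slice` is a hypothetical helper and assumes the text always contains a colon.

```python
def parse_slice(text):
    """Convert "[start:stop]" into a slice; missing bounds become None."""
    start, _, stop = text.strip("[]").partition(":")
    return slice(
        int(start) if start else None,
        int(stop) if stop else None,
    )


digits = "0123456789"
print(digits[parse_slice("[2:-2]")])  # 234567
print(digits[parse_slice("[:5]")])    # 01234
```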
Mike Fährmann
a9e276bc37 reset delete-flag
Since 'PathFormat' objects are being reused, setting `delete`
to True once caused all files downloaded afterwards to be deleted as well.
2018-06-20 18:12:59 +02:00
Mike Fährmann
baccf8a958 improve postprocessor handling
- add pathfmt argument for __init__()
- add finalization step
- add option to keep or delete zipped files
2018-06-08 17:39:02 +02:00
Mike Fährmann
7646bdbcfd improve postprocessor initialization code 2018-06-07 22:29:54 +02:00
Mike Fährmann
821535b458 adjust PathFormat class 2018-06-06 20:17:17 +02:00
Mike Fährmann
6a31ada9e3 re-implement OAuth1.0 code
OAuth support for SmugMug needs some additional features
(auth-rebuild on redirect, query parameters in URL, ...)
and fixing this in the old code wouldn't work all that well.
2018-05-10 18:47:05 +02:00
Mike Fährmann
69a5e6ddb3 Merge branch 'master' into 1.4-dev 2018-05-04 10:19:02 +02:00