74 Commits

Author SHA1 Message Date
Mike Fährmann
c978fe18d4 [text] add 'extract_urls()' helper 2026-02-07 21:47:17 +01:00
Mike Fährmann
37aa7337dc [text] reject long filename extensions (#8491)
fixes regression introduced in 3252ead7c7
ref bc868e7bb8
2025-11-01 10:35:33 +01:00
Mike Fährmann
c8fc790028 merge branch 'dt': move datetime utils into separate module
- use 'datetime.fromisoformat()' when possible (#7671)
- return a datetime-compatible object for invalid datetimes
  (instead of a 'str' value)
2025-10-20 09:30:05 +02:00
Mike Fährmann
085616e0a8 [dt] replace 'text.parse_datetime()' & 'text.parse_timestamp()' 2025-10-17 17:43:06 +02:00
Mike Fährmann
17156ab7a2 [text] implement 'nameext_from_name()' 2025-10-15 11:14:49 +02:00
Mike Fährmann
724ae3661b [text] add 'empty' argument to 'parse_query()' (#8377)
enables including query parameters without value
2025-10-09 12:10:23 +02:00
Mike Fährmann
7bb4053396 [text] add 'sanitize_whitespace()' 2025-07-19 20:49:48 +02:00
Mike Fährmann
c08833aed9 [util] move 're' functions to text.py 2025-06-23 20:05:20 +02:00
Mike Fährmann
8f79ec67f4 [text] add 'build_query()' 2025-06-18 20:49:12 +02:00
Mike Fährmann
41191bb60a 'match.group(N)' -> 'match[N]' (#7671)
2.5x faster
2025-06-18 13:05:58 +02:00
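The change above relies on `re.Match` objects supporting subscript access, which is available since Python 3.6. A minimal illustration of the two equivalent spellings follows; the pattern and helper names are made up for this example and are not taken from the repository.

```python
import re

# Hypothetical pattern, chosen only for illustration.
PATTERN = re.compile(r"(\w+)@(\w+)")


def groups_via_method(text):
    """Read capture groups through the 'group()' method."""
    match = PATTERN.match(text)
    return match.group(1), match.group(2)


def groups_via_index(text):
    """Read the same capture groups through subscript access."""
    match = PATTERN.match(text)
    return match[1], match[2]
```

Both helpers return the same tuple for any input the pattern matches; only the access syntax differs.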
Mike Fährmann
6d928f3805 remove some pre-3.8 workarounds (#7671) 2025-06-17 12:56:47 +02:00
Mike Fährmann
e84df260c0 [util] generalize 'build_duration_func' 2025-06-08 20:01:16 +02:00
Mike Fährmann
fe39b7d8c8 [text] slightly improve performance of 'extract' functions
by using 'None' instead of '0' as default 'pos' value
this only saves a few nanoseconds per call, but still
2025-05-23 17:53:28 +02:00
Mike Fährmann
f3ed15573a [text] add 'rextr()' 2025-05-23 17:28:58 +02:00
Mike Fährmann
04464b6cf0 [text] add second argument to 'parse_query_list()' (#7138)
return only values whose name is in 'as_list' as a list
2025-03-10 09:36:50 +01:00
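The behaviour described above could look roughly like the following sketch. It is not the repository's implementation: the function name, the handling of repeated non-list names (first value wins) and the skipping of parameters without '=' are assumptions made for this example.

```python
from urllib.parse import unquote


def parse_query_list_sketch(qs, as_list=()):
    """Parse a query string; collect names listed in 'as_list' into lists."""
    result = {}
    for part in qs.split("&"):
        name, eq, value = part.partition("=")
        if not eq:
            # Assumption: parameters without '=' are ignored.
            continue
        name = unquote(name.replace("+", " "))
        value = unquote(value.replace("+", " "))
        if name in as_list:
            # Names in 'as_list' accumulate every value in order.
            result.setdefault(name, []).append(value)
        elif name not in result:
            # Assumption: other names keep only their first value.
            result[name] = value
    return result
```

For example, calling it with `"tag=a&tag=b&page=1&page=2"` and `as_list=("tag",)` yields a list for `tag` and a plain string for `page`.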
Mike Fährmann
db19990a82 [text] allow calling 'extract_iter' with invalid arguments 2025-03-02 10:44:06 +01:00
Mike Fährmann
b03ee3c4c4 [text] implement 'parse_query_list()' 2024-10-01 20:28:30 +02:00
Mike Fährmann
9f49cf16e8 [text] implement 'parse_query()' without using 'urllib.parse.parse_qsl'
doesn't support bytes anymore, but is twice as fast
2024-10-01 20:28:11 +02:00
Mike Fährmann
2c7a0c3ca8 add alternatives for deprecated utc datetime functions 2024-09-19 20:47:05 +02:00
Mike Fährmann
5227bb6b1d [text] catch general Exceptions 2024-04-13 18:51:40 +02:00
Mike Fährmann
76581c13f7 handle URLs without '/' after their TLD (#5252) 2024-02-29 15:05:46 +01:00
Mike Fährmann
05255f5be0 add 'default' argument to 'text.extr()' 2022-11-09 11:00:32 +01:00
Mike Fährmann
eb33e6cf2d add 'text.extr()'
a stripped-down version of text.extract() that
- always returns a string (like 'extract_from')
- only returns a string
- does not deal with 'pos' arguments
- is ~20% faster
2022-11-04 21:37:36 +01:00
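A stripped-down extraction helper with the properties listed above might be sketched as follows. The name `extr_sketch` and the exact error handling are assumptions for illustration, not the code from this commit.

```python
def extr_sketch(txt, begin, end):
    """Return the text between 'begin' and 'end', or '' if either is missing."""
    try:
        # Start right after the first occurrence of 'begin'.
        first = txt.index(begin) + len(begin)
        # Stop at the first occurrence of 'end' after that point.
        return txt[first:txt.index(end, first)]
    except ValueError:
        # Always return a string, never a (value, pos) tuple or None.
        return ""
```

Because it takes no 'pos' argument and returns no position, a call site is a single expression such as `extr_sketch(page, 'href="', '"')`.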
Mike Fährmann
67bad04dda [formatter] add 'g' conversion to sluGify a string (#2410) 2022-08-26 17:57:17 +02:00
Mike Fährmann
bddcec49f1 implement 'text.root_from_url()'
use domain from input URL for kemono
2022-03-01 03:09:57 +01:00
Mike Fährmann
bc0e853d30 combine KeyError & IndexError to common base class LookupError 2022-02-11 00:42:49 +01:00
Mike Fährmann
bc868e7bb8 consider apparently long extensions as part of the filename (#1516)
2021-05-02 21:15:50 +02:00
Mike Fährmann
387fe415d5 unescape items in text.split_html() 2021-03-29 02:12:29 +02:00
Mike Fährmann
78fd63b8f0 remove 'text.clean_xml()'
was not used anywhere
2021-03-28 04:05:16 +02:00
Mike Fährmann
8553b218d9 replace calls to 'os.path.splitext()' with 'str.rpartition()'
Makes the functions that used it more than twice as fast
and we can get rid of an import as well.
2021-03-28 04:01:27 +02:00
Mike Fährmann
a09f42f6b3 improve filename_from_url() performance
Manually extracting the part between the last '/' and '?' instead of
relying on the standard library's 'urllib.parse.urlsplit()' makes the
function roughly 4 times as fast.

urlsplit() : 3.64 secs per 1,000,000 iterations
partition(): 0.87 secs per 1,000,000 iterations
2020-10-23 00:14:06 +02:00
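The two approaches compared in this commit message can be sketched side by side. Both function names are hypothetical, and the code is a simplified illustration, not the repository's implementation; in particular, the partition variant makes no attempt to handle fragments or URLs without a path.

```python
from urllib.parse import urlsplit


def filename_via_urlsplit(url):
    """Baseline: parse the whole URL, then take the last path segment."""
    return urlsplit(url).path.rpartition("/")[2]


def filename_via_partition(url):
    """Drop everything from the first '?', then keep what follows the last '/'."""
    return url.partition("?")[0].rpartition("/")[2]
```

For ordinary URLs with a path and an optional query string, both return the same filename; the second avoids building a full `SplitResult`.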
Mike Fährmann
37d71f6e09 strip microseconds in text.parse_datetime() 2020-06-17 21:40:16 +02:00
Mike Fährmann
6294e2c540 add 'text.ensure_http_scheme()' 2020-05-19 22:32:53 +02:00
Mike Fährmann
a0f4c295c0 add optional 'utcoffset' argument to 'parse_datetime()' 2020-04-11 02:05:00 +02:00
Mike Fährmann
f6c5edb76b pre-compile regex pattern for remove_html() and split_html() 2020-03-13 23:31:54 +01:00
Mike Fährmann
b1bea8aaeb add 'restrict-filenames' option (#348) 2019-07-23 17:41:24 +02:00
Mike Fährmann
1740086d8a add 'repl' and 'sep' arguments to text.replace_html() 2019-07-17 14:48:24 +02:00
Mike Fährmann
b171befa87 implement 'parse_unicode_escapes()' 2019-06-16 21:47:24 +02:00
Mike Fährmann
2b1999476e implement 'text.rextract()' 2019-05-28 21:03:41 +02:00
Mike Fährmann
2316e0ed3d fix strptime workaround from b0e85a4
Don't return a modified version of 'date_time' if strptime fails.
2019-05-25 23:22:26 +02:00
Mike Fährmann
b0e85a42e3 apply workaround from 4736912 in parse_datetime() itself 2019-05-09 21:53:17 +02:00
Mike Fährmann
d09864b581 implement text.parse_datetime() 2019-05-08 15:43:59 +02:00
Mike Fährmann
6264a46212 use 'utcfromtimestamp()'
'fromtimestamp()' converts its results to the local timezone and causes
problems when running tests on a different machine.
2019-04-21 16:22:53 +02:00
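The timezone issue described above can be illustrated with a small sketch. Since `datetime.utcfromtimestamp()` is deprecated as of Python 3.12, this example uses the timezone-aware equivalent and then drops the tzinfo; the helper name is hypothetical.

```python
from datetime import datetime, timezone


def utc_naive_from_timestamp(ts):
    """Convert a Unix timestamp to a naive datetime expressed in UTC.

    Unlike a bare 'datetime.fromtimestamp(ts)', the result does not
    depend on the local timezone of the machine running the code.
    """
    return datetime.fromtimestamp(ts, timezone.utc).replace(tzinfo=None)
```

A test that compares against a fixed expected value, such as the epoch itself, then passes regardless of where it runs.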
Mike Fährmann
d670de0344 implement 'text.parse_timestamp()' 2019-04-21 15:28:27 +02:00
Mike Fährmann
21a7e395a7 implement convenience wrapper for text.extract functionality 2019-04-19 22:30:11 +02:00
Mike Fährmann
8f249f1d54 improve text.extract_iter() performance
by roughly 40% through
- inlining code
- pre-calculating reused values
- entering a try-except block only once
2019-04-18 23:37:17 +02:00
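The three optimizations listed above can be sketched in a generator like the one below. This is an illustrative reconstruction under assumptions (name, signature, and error handling are not taken from the commit), meant only to show the shape of the technique.

```python
def extract_iter_sketch(txt, begin, end, pos=0):
    """Yield every substring of 'txt' enclosed by 'begin' and 'end'."""
    try:
        # Pre-calculate values that are reused on every iteration.
        index = txt.index
        lbeg = len(begin)
        lend = len(end)
        # The loop lives inside a single try block, so the block is
        # entered once instead of once per extracted value.
        while True:
            first = index(begin, pos) + lbeg
            last = index(end, first)
            pos = last + lend
            yield txt[first:last]
    except ValueError:
        # 'str.index' raises ValueError when no further match exists.
        return
```

The search logic is written inline in the loop body instead of calling a separate single-value extraction helper for each match.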
Mike Fährmann
5530871b5a change results of text.nameext_from_url()
Instead of getting a complete 'filename' from a URL and splitting that
into 'name' and 'extension', the new approach gets rid of the complete
version and renames 'name' to 'filename'. (Using anything other than
{extension} for a filename extension doesn't really work anyway)

Example: "https://example.org/path/filename.ext"

before:
- filename : filename.ext
- name     : filename
- extension: ext

now:
- filename : filename
- extension: ext
2019-02-14 16:07:17 +01:00
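The "now" layout from the example above can be reproduced with a short sketch. The function name and the treatment of names without an extension are assumptions for illustration; the actual function may differ in details such as unquoting or extension validation.

```python
def nameext_sketch(url):
    """Split the last URL path segment into 'filename' and 'extension'."""
    # Take the part between the last '/' and the first '?'.
    name = url.partition("?")[0].rpartition("/")[2]
    filename, _, ext = name.rpartition(".")
    if not filename:
        # Assumption: without a usable '.', the whole name is the filename.
        return {"filename": name, "extension": ""}
    return {"filename": filename, "extension": ext}
```

Called with `"https://example.org/path/filename.ext"`, it returns only the two keys shown under "now", with no separate 'name' entry.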
Mike Fährmann
e1d3e9a926 add 'ext_from_url' to text.py 2019-01-31 12:23:25 +01:00
Mike Fährmann
2d2953a5bf add 'text.parse_float()' + cleanup in text.py 2019-01-29 16:46:21 +01:00
Mike Fährmann
ae9a37a528 implement text.split_html() 2018-05-27 15:00:41 +02:00