59 Commits

Author SHA1 Message Date
Mike Fährmann
53cdfaac37 [common] add reference to 'exception' module to Extractor class
- remove 'exception' imports
- replace with 'self.exc'
2026-02-15 10:57:22 +01:00
Mike Fährmann
51d9fd2f4d [behance] export GraphQL queries 2026-02-01 19:13:38 +01:00
Mike Fährmann
968597a302 yield 3-tuples for Message.Directory
adapt tuples to the same length and semantics as other messages
2025-12-05 21:39:52 +01:00
Mike Fährmann
085616e0a8 [dt] replace 'text.parse_datetime()' & 'text.parse_timestamp()' 2025-10-17 17:43:06 +02:00
Mike Fährmann
aa6c2dcbac [behance] provide 'creator[name]' metadata (#7885) 2025-07-24 15:30:51 +02:00
Mike Fährmann
849a5b191f [behance] use '"browser": "firefox"' by default (#7803 #7877) 2025-07-24 09:31:12 +02:00
Mike Fährmann
a097a373a9 simplify if statements by using walrus operators (#7671) 2025-07-22 20:57:54 +02:00
Mike Fährmann
3810555bbd do not use 'append = list.append' 2025-06-30 11:42:44 +02:00
Mike Fährmann
ef12882ff7 [behance] fix '403 Forbidden' error (#7710)
update internal cookies
2025-06-29 21:36:12 +02:00
Mike Fährmann
f2a72d8d1e replace 'request(…).json()' with 'request_json(…)' 2025-06-29 17:50:19 +02:00
Mike Fährmann
9dbe33b6de replace old %-formatted and .format(…) strings with f-strings (#7671)
mostly using flynt
https://github.com/ikamensh/flynt
2025-06-29 17:50:19 +02:00
Mike Fährmann
41191bb60a 'match.group(N)' -> 'match[N]' (#7671)
2.5x faster
2025-06-18 13:05:58 +02:00
Mike Fährmann
e08ec7e083 update copyright notices 2025-06-13 00:03:41 +02:00
Mike Fährmann
d7d99d5606 [behance] fix '403 Forbidden' errors 2025-06-05 14:25:07 +02:00
Mike Fährmann
1824267447 [dl:ytdl] implement explicit HLS/DASH handling
add '_ytdl_manifest' to specify a manifest type to process
2024-10-16 15:16:21 +02:00
Mike Fährmann
6e7da6310c [behance] fix video extraction (#5965)
a lot slower than before since each video now requires an extra HTTP
request and 'sleep-request' is set to 2s-4s by default.

it now also requires ytdl.
2024-08-10 11:06:54 +02:00
Mike Fährmann
9783d95585 [behance] fix "KeyError: 'fields'" (#5965) 2024-08-08 16:29:56 +02:00
Mike Fährmann
36a64a3aa7 [behance] fix image extraction (#5873) 2024-07-21 10:54:12 +02:00
Mike Fährmann
07cb584231 [behance] add 'modules' option (#4799) 2023-11-17 22:54:38 +01:00
Mike Fährmann
6a753d9ff3 [behance] support 'text' modules (#4799) 2023-11-17 22:54:38 +01:00
Mike Fährmann
fd8f58ad76 [behance] unescape embed URLs (#4742) 2023-10-30 13:38:49 +01:00
Mike Fährmann
3ecb512722 send Referer headers by default 2023-09-19 00:02:04 +02:00
Mike Fährmann
6ae92da57e Merge branch 'tests' 2023-09-13 21:34:28 +02:00
Mike Fährmann
32da3c70d3 [behance] handle videos without 'renditions' (#4523) 2023-09-12 22:00:04 +02:00
Mike Fährmann
a453335a9f remove test results in extractor modules
and add generic example URLs
2023-09-11 16:30:55 +02:00
Mike Fährmann
6482f9453b [behance] fix cookie usage (#4417) 2023-08-18 14:48:20 +02:00
Mike Fährmann
d34195b41d [behance] fix and update 'user' extractor (#4417) 2023-08-17 16:06:35 +02:00
Mike Fährmann
4d3cf709da [behance] add 'date' metadata field (#4417) 2023-08-17 15:33:47 +02:00
Mike Fährmann
c689cd9720 [behance] show error for mature content (#4417) 2023-08-17 15:31:37 +02:00
Mike Fährmann
15d7c5a199 [behance] 'items()' -> 'values()'
we only need 'size', 'name' is unnecessary
2023-04-30 13:53:51 +02:00
Mike Fährmann
0fb580135d [behance] fix extraction (#3980) 2023-04-29 16:18:35 +02:00
Mike Fährmann
dd884b02ee replace json.loads with direct calls to JSONDecoder.decode 2023-02-09 15:22:00 +01:00
Mike Fährmann
3a0450adbf [behance] use default delay between requests (#2507) 2023-01-07 14:49:26 +01:00
Mike Fährmann
b0cb4a1b9c replace 'text.extract()' with 'text.extr()' where possible 2022-11-05 01:14:09 +01:00
Mike Fährmann
dee0d22561 update extractor test results 2022-02-06 21:39:24 +01:00
Mike Fährmann
bd08ee2859 remove most 'yield Message.Version' statements
only leave them in oauth.py as noop results
2021-08-16 03:10:48 +02:00
Mike Fährmann
f9096584ab [behance] fix 'collection' extraction 2021-08-10 00:48:31 +02:00
Mike Fährmann
6b2bce3b7d [behance] support 'video' modules (closes #1282)
(requires youtube-dl to download from m3u8 manifests)
2021-01-29 21:30:14 +01:00
Mike Fährmann
968d3e8465 remove '&' from URL patterns
'/?&#' -> '/?#' and '?&#' -> '?#'

According to https://www.ietf.org/rfc/rfc3986.txt, URLs are
"organized hierarchically" by using "the slash ("/"), question
mark ("?"), and number sign ("#") characters to delimit components"
2020-10-22 23:31:25 +02:00
Mike Fährmann
ddd6840509 [behance] fix 'collection' extraction 2020-10-11 18:15:41 +02:00
Mike Fährmann
a3fa45bbb1 [behance] get images from 'media_collection' modules 2019-11-27 01:04:33 +01:00
Mike Fährmann
1b9bf4fc6e [behance] fix 'tags' extraction 2019-10-03 17:36:02 +02:00
Mike Fährmann
3969f9cbbd [behance] fix collection extraction 2019-07-27 14:26:40 +02:00
Mike Fährmann
61741d7333 provide type information for Queue messages
Child extractors are now directly constructed with Extractor.from_url()
if the extractor class is known beforehand, instead of using
extractor.find() and searching through all possible extractor classes.
2019-02-12 21:32:32 +01:00
Mike Fährmann
4b1880fa5e propagate 'match' to base extractor constructor 2019-02-11 13:31:10 +01:00
Mike Fährmann
6284731107 simplify extractor constants
- single strings for URL patterns
- tuples instead of lists for 'directory_fmt' and 'test'
- single-tuple tests where applicable
2019-02-08 13:45:40 +01:00
Mike Fährmann
1c1367ec5b [behance] fix empty docstring 2019-02-02 14:41:05 +01:00
Mike Fährmann
45e529ab91 [behance] fix extraction
HTML structure for gallery pages changed quite a bit, so it is now using
the embedded JSON data. This changes a lot of metadata field names, but
'gallery_id', 'title', and 'user' are still provided for backwards
compatibility.

The internal API endpoint for user galleries also changed its data
structure, but nothing too major.
2019-01-31 14:33:23 +01:00
Mike Fährmann
2d2953a5bf add 'text.parse_float()' + cleanup in text.py 2019-01-29 16:46:21 +01:00
Mike Fährmann
9b8ac12eed [behance] enable 'categorytransfer' for collections (#157) 2019-01-19 20:02:20 +01:00
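
Several of the commits listed above name Python idioms without showing them: 'match[N]' instead of 'match.group(N)', walrus operators in if statements, f-strings, and direct calls to JSONDecoder.decode. The following is a minimal sketch of those idioms, not code taken from the repository. The pattern, URL, and variable names are hypothetical and chosen only for illustration.

```python
import json
import re

# Hypothetical pattern and URL, for illustration only; the character
# class uses '/?#' as component delimiters, as in commit 968d3e8465.
pattern = re.compile(r"/gallery/(\d+)/([^/?#]+)")
url = "https://www.example.org/gallery/12345/sample-title"

# Walrus operator: bind the match object and test it in one expression.
if match := pattern.search(url):
    gallery_id = match[1]  # same value as match.group(1)
    title = match[2]       # same value as match.group(2)
else:
    gallery_id = title = None

# f-string in place of %-formatting or str.format().
summary = f"{gallery_id}: {title}"

# Direct call to a JSONDecoder's decode() instead of json.loads().
decode = json.JSONDecoder().decode
data = decode('{"size": 3}')
```

Indexing a match object requires Python 3.6 or later, and the walrus operator requires Python 3.8 or later.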