
56 Commits

Author SHA1 Message Date
Mike Fährmann
41191bb60a 'match.group(N)' -> 'match[N]' (#7671)
2.5x faster
2025-06-18 13:05:58 +02:00
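The rewrite in 41191bb60a relies on `re.Match` objects supporting subscript access, which Python provides since 3.6. A minimal sketch of the two equivalent spellings, using a hypothetical pattern and URL that are not taken from the repository:

```python
import re

# Hypothetical pattern and URL, for illustration only.
pattern = re.compile(r"https?://([^/]+)/(\w+)/thread/(\d+)")
match = pattern.match("https://archive.example.org/a/thread/12345")

# Old spelling: explicit method call.
board_old = match.group(2)

# New spelling: subscript access returns the same capture group.
board_new = match[2]
thread = match[3]
```

Both forms return the same strings; the commit message only claims a speed difference.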
Mike Fährmann
0a3fac2dfe merge #7664: [archivedmoe] redirect URL fixes (#7652) 2025-06-15 10:03:34 +02:00
Mike Fährmann
b245218c1d [archivedmoe] reword some comments and variable names 2025-06-15 10:00:45 +02:00
NecRaul
6668acf91e [archivedmoe] Sort boards alphabetically 2025-06-13 19:29:47 +04:00
NecRaul
3a4e19d284 [archivedmoe] Simplify board extraction from url 2025-06-13 18:44:02 +04:00
NecRaul
a7aa18a8c1 [archivedmoe] remove unnecessary logging 2025-06-13 18:28:21 +04:00
NecRaul
8b2adeb41e [archivedmoe] simplify board URL redirection logic 2025-06-13 18:26:39 +04:00
NecRaul
05081dea2e Lint with flake8 2025-06-13 17:56:43 +04:00
NecRaul
223fe960a0 [archivedmoe] redirect URL changes (again)
Redirects to warosu.org instead of 4chan's cdn for certain boards
Redirects to archive.4plebs.org instead of 4chan's cdn for /tg/
Slices the filename only if it's redirecting to certain archives
2025-06-13 17:43:16 +04:00
Mike Fährmann
e08ec7e083 update copyright notices 2025-06-13 00:03:41 +02:00
Mike Fährmann
811b665e33 remove @staticmethod decorators
There might have been a time when calling a static method was faster
than a regular method, but that is no longer the case. According to
micro-benchmarks, it is 70% slower in CPython 3.13 and it also makes
executing the code of a class definition slower.
2025-06-12 22:50:52 +02:00
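The kind of micro-benchmark mentioned in 811b665e33 could look roughly like the sketch below. The class and function names are hypothetical, and the measured ratio depends on the interpreter version and machine, so no particular result is asserted here.

```python
import timeit


class WithStatic:
    @staticmethod
    def double(x):
        return x * 2


class WithRegular:
    def double(self, x):
        return x * 2


def measure(obj, number=200_000):
    """Return the seconds needed for `number` calls of obj.double(21)."""
    return timeit.timeit(lambda: obj.double(21), number=number)


static_seconds = measure(WithStatic())
regular_seconds = measure(WithRegular())
```

Comparing `static_seconds` with `regular_seconds` on the target interpreter shows whether the decorator costs anything there.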
NecRaul
e3df99dbb9 Apply mikf's diff regarding Archived.moe
Moved (and refactored) code into remote()
Added a check for fixup_timestamp
2025-06-11 21:51:03 +04:00
NecRaul
4370654532 Simplify remote_media_link assignment 2025-06-11 04:49:21 +04:00
NecRaul
cb74d0f2f3 Lint with flake8 2025-06-11 04:46:18 +04:00
NecRaul
96bb2b1630 Fix Archived.moe redirection issue
Unless the board is /b/ (in which case redirection works fine),
remove characters from the filename portion of the URL until it
is 13 characters long (epoch millis).
2025-06-11 04:42:03 +04:00
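The trimming described in 96bb2b1630 might be sketched as follows. The helper name, the example URL, and the choice to drop characters from the end of the filename are assumptions made for illustration; the commit message does not say which characters are removed, and this is not the code from the repository.

```python
def trim_to_millis(url, board):
    """Shorten the filename stem of `url` to 13 characters, except on /b/."""
    if board == "b":
        # Redirection already works for /b/, so leave the URL untouched.
        return url

    path, _, name = url.rpartition("/")
    stem, dot, ext = name.rpartition(".")
    if not dot:
        # No extension present: the whole name is the stem.
        stem, ext = name, ""

    # Assumption: keep the leading 13 characters (epoch milliseconds).
    stem = stem[:13]
    return f"{path}/{stem}{dot}{ext}"
```

For example, a 16-digit name such as `1749600000123456.jpg` would become `1749600000123.jpg` on any board other than /b/.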
Mike Fährmann
23c4bc8ac5 [b4k] keep support for previous 'arch.b4k.co' domain 2025-02-09 11:11:38 +01:00
NecRaul
dae82f1519 [b4k] update domain to arch.b4k.dev 2025-02-09 01:28:23 +04:00
Mike Fährmann
36883e458e use 'v[0] == "c"' instead of 'v.startswith("c")' 2024-10-15 08:24:06 +02:00
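The two expressions named in 36883e458e agree for non-empty strings but differ on an empty one, which is worth keeping in mind when making this kind of substitution. A small sketch with hypothetical helper names:

```python
def starts_with_c_method(v):
    # Safe for any string, including the empty string.
    return v.startswith("c")


def starts_with_c_index(v):
    # Raises IndexError when v is empty, so the caller must rule that out.
    return v[0] == "c"
```

The indexed form is only a drop-in replacement where the value is known to be non-empty.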
Mike Fährmann
64948f2c09 [foolfuuka] improve 'board' pattern & support pages (#5408) 2024-04-01 22:31:25 +02:00
Mike Fährmann
1f7101d606 [archivedmoe] fix thebarchive webm URLs (#5116) 2024-01-27 00:24:41 +01:00
Mike Fährmann
1f9b16a70b replace static 'sleep-request' defaults with dynamic ones 2023-12-18 22:06:26 +01:00
Mike Fährmann
3ecb512722 send Referer headers by default 2023-09-19 00:02:04 +02:00
Mike Fährmann
a453335a9f remove test results in extractor modules
and add generic example URLs
2023-09-11 16:30:55 +02:00
Mike Fährmann
a383eca7f6 decouple extractor initialization
Introduce an 'initialize()' function that does the actual init
(session, cookies, config options) and can be called separately from
the constructor __init__().

This makes it possible, for example, to adjust config access inside a Job
before most of that access has already happened in 'extractor.find()'.
2023-07-25 22:16:16 +02:00
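The split described in a383eca7f6 can be illustrated with a two-phase object. The class below is a hypothetical stand-in, not the project's Extractor; only the method name `initialize()` comes from the commit message, and the option key is just an example.

```python
class SketchExtractor:
    """Hypothetical stand-in showing construction and setup as two steps."""

    def __init__(self, url):
        # Cheap step: only record the arguments.
        self.url = url
        self.options = {}
        self.initialized = False

    def initialize(self, config=None):
        # Deferred step: session, cookies, and config options would be
        # set up here in a real extractor.
        self.options = dict(config or {})
        self.initialized = True


extractor = SketchExtractor("https://example.org/a/thread/1")

# The caller can adjust configuration after construction
# and before the deferred setup runs.
config = {"sleep-request": 1.0}
extractor.initialize(config)
```

Because the constructor no longer reads configuration, code that creates the object has a window in which to change settings before they are consumed.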
Mike Fährmann
a08fdfac6e [foolfuuka] add 'archive.palanq.win' 2023-05-02 19:58:55 +02:00
Mike Fährmann
1870df8b23 [foolfuuka] remove 'tokyochronos.net' 2023-05-02 19:25:50 +02:00
Mike Fährmann
ef4e2d8178 [foolfuuka] remove 'archive.alice.al' 2023-05-02 19:23:26 +02:00
Mike Fährmann
b0cb4a1b9c replace 'text.extract()' with 'text.extr()' where possible 2022-11-05 01:14:09 +01:00
Mike Fährmann
7e385ed63e [foolfuuka] update domains
- remove nyafuu
- add rozenarcana (https://archive.alice.al/)
- add tokyochronos (https://www.tokyochronos.net)
2022-08-26 17:57:17 +02:00
Mike Fährmann
2dc57637cf [foolfuuka] remove archive.wakarimasen.moe 2022-07-10 23:13:49 +02:00
Mike Fährmann
bd6ec5c352 [foolfuuka] match 4chan filenames (#2577)
introduce two new metadata fields:
- filename_media: original filename of file uploaded to 4chan
- timestamp_ms  : timestamp with millisecond precision (tim)
2022-05-15 14:39:54 +02:00
Mike Fährmann
d26da3b9e5 add pre-generated 'pattern' for supported BaseExtractor sites 2022-05-09 22:20:09 +02:00
Mike Fährmann
dee0d22561 update extractor test results 2022-02-06 21:39:24 +01:00
Mike Fährmann
275543b2d2 update extractor test results 2021-11-27 19:26:44 +01:00
Mike Fährmann
211de95dd0 update extractor test results 2021-11-01 02:58:53 +01:00
Mike Fährmann
c04f7ab139 [foolfuuka] add 'gallery' extractor (#1785) 2021-08-21 22:46:23 +02:00
Mike Fährmann
21c2da454f update extractor test results 2021-07-04 22:00:32 +02:00
Mike Fährmann
407627ec86 [foolfuuka] support 'archive.wakarimasen.moe' (closes #1595) 2021-06-02 15:45:43 +02:00
Mike Fährmann
532ac79fb0 update extractor test results 2021-05-21 02:28:53 +02:00
Mike Fährmann
671a95cae5 [foolfuuka] use BaseExtractor 2021-01-26 18:48:37 +01:00
Mike Fährmann
e9a75e27d9 [foolfuuka] stop search when results are exhausted (#1174) 2021-01-17 22:48:21 +01:00
Mike Fährmann
56b460dcea [foolfuuka] add 'search' extractors (#1174) 2021-01-02 02:34:06 +01:00
Mike Fährmann
fb64183d53 [foolfuuka] add 'board' extractors (closes #1044) 2021-01-01 19:33:35 +01:00
Mike Fährmann
1e3dd7330e merge SharedConfigMixin functionality into Extractor 2020-11-17 00:34:07 +01:00
Mike Fährmann
f5b7ae01c1 update extractor test results 2020-09-15 18:07:08 +02:00
Mike Fährmann
82f7f4172a update test results 2020-01-01 16:05:38 +01:00
Mike Fährmann
41a3169c67 [foolfuuka] use '{extension}' in default filename format 2019-11-28 23:12:48 +01:00
Mike Fährmann
2a3bd4e3c7 rename extractor classes starting with a digit 2019-11-02 20:42:09 +01:00
Mike Fährmann
8de5866fd2 [twitter] replace unit test URLs
https://twitter.com/PicturesEarth was deleted
2019-05-09 10:17:55 +02:00
Mike Fährmann
591a07f20c small code changes and cleanups 2019-03-13 22:03:02 +01:00