32 Commits

Author SHA1 Message Date
Mike Fährmann
a383eca7f6 decouple extractor initialization
Introduce an 'initialize()' function that does the actual init
(session, cookies, config options) and can be called separately from
the constructor __init__().

This allows a Job, for example, to adjust config access before most of it
has taken place, which previously happened as soon as 'extractor.find()' was called.
2023-07-25 22:16:16 +02:00
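The split between constructor and 'initialize()' described in this commit can be sketched roughly as follows. This is an illustration only, not gallery-dl's actual code: the class name `SketchExtractor`, its attributes, and the plain-dict `config` and `session` are hypothetical stand-ins.

```python
class SketchExtractor:
    """Hypothetical extractor; not gallery-dl's real Extractor class."""

    def __init__(self, url):
        # The constructor only records cheap state.
        self.url = url
        self.config = {}
        self.session = None
        self.initialized = False

    def initialize(self):
        """Run the actual setup (session, cookies, options) on demand."""
        if self.initialized:
            return
        # Stand-in for real session and cookie setup.
        self.session = {"cookies": dict(self.config.get("cookies", {}))}
        self.initialized = True


extr = SketchExtractor("https://example.org/post/1")
# A caller (for example a job object) can still adjust config here ...
extr.config["cookies"] = {"sid": "abc"}
# ... before the actual initialization runs.
extr.initialize()
```

Because the constructor no longer reads the configuration, whatever is changed between construction and `initialize()` is what the setup step sees.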
Mike Fährmann
dd884b02ee replace json.loads with direct calls to JSONDecoder.decode 2023-02-09 15:22:00 +01:00
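A minimal sketch of the kind of replacement this commit title describes, using only the standard library. The variable name `_decode` and the sample input are assumptions made for illustration; the commit does not state its motivation or show the affected call sites.

```python
import json

# Create one decoder and keep a reference to its bound decode() method,
# instead of calling json.loads() each time.
_decode = json.JSONDecoder().decode

data = _decode('{"tags": ["a", "b"], "id": 1}')
```

For plain input without extra keyword arguments, `_decode(s)` returns the same result as `json.loads(s)`.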
Mike Fährmann
b0cb4a1b9c replace 'text.extract()' with 'text.extr()' where possible 2022-11-05 01:14:09 +01:00
Mike Fährmann
ea8113ff36 [reactor] match 'best', 'new', 'all' URLs (#3073) 2022-10-19 10:52:33 +02:00
Mike Fährmann
d26da3b9e5 add pre-generated 'pattern' for supported BaseExtractor sites 2022-05-09 22:20:09 +02:00
Mike Fährmann
addb72e1bb [reactor] support thatpervert.com (closes #2029) 2021-11-26 18:58:07 +01:00
Mike Fährmann
d8d9502e1e [reactor] inherit from BaseExtractor 2021-11-26 18:58:07 +01:00
Mike Fährmann
bd08ee2859 remove most 'yield Message.Version' statements
only leave them in oauth.py as noop results
2021-08-16 03:10:48 +02:00
Nyasume
fa6af46756 Added ability to download GIFs instead of mp4 from Luscious and Reactor (#1701) 2021-08-12 15:12:42 +02:00
Mike Fährmann
21c2da454f update extractor test results 2021-07-04 22:00:32 +02:00
Mike Fährmann
2c60c7d798 [reactor] skip deleted/empty posts 2021-05-21 16:14:09 +02:00
Mike Fährmann
bae874f370 replace 'wait-min/-max' with 'sleep-request'
on exhentai, idolcomplex, reactor
2021-03-02 22:55:45 +01:00
Mike Fährmann
3df527ee2c update extractor test results 2021-02-27 21:01:29 +01:00
Mike Fährmann
65ca923b4e fix 'whitelist' option for BaseExtractor instances 2021-02-15 21:58:33 +01:00
Mike Fährmann
912eea29bc update extractor test results 2020-12-27 17:41:08 +01:00
Mike Fährmann
1e3dd7330e merge SharedConfigMixin functionality into Extractor 2020-11-17 00:34:07 +01:00
Mike Fährmann
968d3e8465 remove '&' from URL patterns
'/?&#' -> '/?#' and '?&#' -> '?#'

According to https://www.ietf.org/rfc/rfc3986.txt, URLs are
"organized hierarchically" by using "the slash ("/"), question
mark ("?"), and number sign ("#") characters to delimit components"
2020-10-22 23:31:25 +02:00
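The effect of dropping '&' from such character classes can be sketched with a made-up pattern. The host `example.org` and the `/post/` path layout are hypothetical and not taken from any real extractor.

```python
import re

# Hypothetical pattern: a path segment ends at '/', '?' or '#',
# the component delimiters named in RFC 3986.
pattern = re.compile(r"https?://example\.org/post/([^/?#]+)")

# The query string and fragment are not part of the captured segment.
post_id = pattern.match(
    "https://example.org/post/123?page=2&sort=new#top").group(1)

# An '&' inside the path segment no longer cuts the match short.
with_amp = pattern.match("https://example.org/post/a&b").group(1)
```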
dawidsowa
43b156fb40 [reactor] match URLs without subdomain (#1053) 2020-10-11 18:15:06 +02:00
Mike Fährmann
7619152988 [reactor] sort 'tags'
to ensure a consistent order for test results
2020-08-15 18:22:31 +02:00
Mike Fährmann
c50d60a53d [reactor] fix image URLs 2019-08-16 14:07:22 +02:00
Mike Fährmann
b1db194c14 [reactor] update and improve
- split 'tags' into a list
- parse 'date' into a datetime object
- fix webm/mp4 URLs
2019-05-09 23:24:49 +02:00
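The first two points above can be sketched as follows. The raw input formats are assumptions: the commit does not say how tags or dates appear on the site, so the comma-separated string and the timestamp layout below are purely illustrative.

```python
from datetime import datetime

# Assumed raw values; the site's real formats are not given in the commit.
raw_tags = "b, a ,c"
raw_date = "2019-05-09T21:24:49"

# Split 'tags' into a list of trimmed strings.
tags = [tag.strip() for tag in raw_tags.split(",")]

# Parse 'date' into a datetime object.
date = datetime.strptime(raw_date, "%Y-%m-%dT%H:%M:%S")
```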
Mike Fährmann
0f02e85961 [reactor] use "/full/" URLs (closes #210)
Putting a "/full/" in image URLs potentially gives higher resolution
and better quality.
2019-03-30 22:14:57 +01:00
Mike Fährmann
5530871b5a change results of text.nameext_from_url()
Instead of getting a complete 'filename' from a URL and splitting that
into 'name' and 'extension', the new approach gets rid of the complete
version and renames 'name' to 'filename'. (Using anything other than
{extension} for a filename extension doesn't really work anyway)

Example: "https://example.org/path/filename.ext"

before:
- filename : filename.ext
- name     : filename
- extension: ext

now:
- filename : filename
- extension: ext
2019-02-14 16:07:17 +01:00
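The "now" behaviour from this commit message can be approximated with the standard library. The helper `split_name_ext` is hypothetical and is not the real `text.nameext_from_url()`; it only reproduces the key layout shown in the example.

```python
import posixpath
from urllib.parse import urlsplit


def split_name_ext(url):
    """Hypothetical helper: return 'filename' (without extension) and 'extension'."""
    path = urlsplit(url).path
    name, ext = posixpath.splitext(posixpath.basename(path))
    # Drop the leading dot that splitext() keeps on the extension.
    return {"filename": name, "extension": ext[1:]}


result = split_name_ext("https://example.org/path/filename.ext")
```

Here `result` holds `filename` and `extension` only, with no separate `name` key.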
Mike Fährmann
2e516a1e3e store the full original URL in Extractor.url 2019-02-12 18:46:48 +01:00
Mike Fährmann
4b1880fa5e propagate 'match' to base extractor constructor 2019-02-11 13:31:10 +01:00
Mike Fährmann
6284731107 simplify extractor constants
- single strings for URL patterns
- tuples instead of lists for 'directory_fmt' and 'test'
- single-tuple tests where applicable
2019-02-08 13:45:40 +01:00
Mike Fährmann
050bc1aa4a [reactor] simplify tests
Some posts have, for whatever reason, a slightly different text
formatting the first time they are accessed that day
compared to any further time.
2019-02-05 10:37:44 +01:00
Mike Fährmann
4d656a81ca replace SharedConfigExtractor class with a Mixin 2019-02-04 13:46:02 +01:00
Mike Fährmann
1734a6c879 [reactor] detect "circular" redirects (#148) 2019-01-09 14:59:15 +01:00
Mike Fährmann
e53cdfd6a8 update build_supportedsites.py 2019-01-09 14:58:35 +01:00
Mike Fährmann
e95b24f056 [reactor] add wait-min & -max options (#148) 2019-01-07 18:04:47 +01:00
Mike Fährmann
8e01cf0ef8 [reactor] generalize extractors (#148)
- support *.reactor.cc domains
- combine joyreactor and pornreactor modules
2019-01-07 17:06:47 +01:00