57 Commits

Author SHA1 Message Date
Mike Fährmann
a383eca7f6 decouple extractor initialization
Introduce an 'initialize()' function that does the actual init
(session, cookies, config options) and can be called separately from
the constructor __init__().

This makes it possible, for example, to adjust config access inside a Job
before most of it has already happened when calling 'extractor.find()'.
2023-07-25 22:16:16 +02:00
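A minimal sketch of the pattern this commit describes, assuming nothing about the real classes: `SketchExtractor` and `SketchJob` are hypothetical names, and the attributes they set are illustrative only. The constructor stores just the URL, and the deferred `initialize()` does the work that depends on configuration.

```python
class SketchExtractor:
    """Hypothetical extractor; not the project's actual class."""

    def __init__(self, url):
        # Cheap construction: nothing here reads configuration.
        self.url = url
        self.initialized = False
        self.config = None
        self.cookies = None

    def initialize(self, config=None):
        # The actual init step (config options, cookies) happens here,
        # so a caller can adjust the config before it is read.
        self.config = dict(config or {})
        self.cookies = {}
        self.initialized = True


class SketchJob:
    """Hypothetical job that tweaks config before initializing."""

    def __init__(self, extractor, overrides):
        self.extractor = extractor
        self.overrides = overrides

    def run(self):
        self.extractor.initialize(self.overrides)
        return self.extractor.config


extractor = SketchExtractor("https://example.org/art/view/1")
was_initialized_early = extractor.initialized
job_config = SketchJob(extractor, {"sleep-request": 1.0}).run()
```

Because construction no longer touches the config, the job gets a window between creating the extractor and initializing it.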
Mike Fährmann
d97b8c2fba consistent cookie-related names
- rename every cookie variable or method to 'cookies_*'
- simplify '.session.cookies' to just '.cookies'
- more consistent 'login()' structure
2023-07-22 01:20:50 +02:00
Mike Fährmann
a16d7c59cb [newgrounds] access 'response.text' only once 2023-07-04 21:49:57 +02:00
FrostTheFox
9576652fa5 extract & pass auth token for newgrounds 2023-07-04 02:35:48 -04:00
Mike Fährmann
c698c3de44 [newgrounds] add default delay between requests (#4046) 2023-05-11 16:04:37 +02:00
Mike Fährmann
c9a7345228 [newgrounds] prevent archive ID overlap (#3681)
add an 'i' and 'a' prefix to image and audio files
(/art/view/, /audio/listen/)
since their numeric ID may conflict with movies and other media
2023-03-06 15:03:49 +01:00
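The prefixing idea can be illustrated as follows. This is only a sketch: the `archive_id` helper, the `PREFIXES` table, and the media kind names are assumptions made for the example, not the extractor's real code.

```python
# Hypothetical mapping from media kind to archive ID prefix.
PREFIXES = {"art": "i", "audio": "a"}


def archive_id(kind, numeric_id):
    """Build an archive key; kinds without a prefix keep the bare ID."""
    return f"{PREFIXES.get(kind, '')}{numeric_id}"


# The same numeric ID no longer collides across media kinds.
ids = {archive_id(kind, 12345) for kind in ("art", "audio", "movie")}
```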
Mike Fährmann
dd884b02ee replace json.loads with direct calls to JSONDecoder.decode 2023-02-09 15:22:00 +01:00
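For context, a short sketch of what such a replacement looks like with the standard library. The name `json_decode` is an assumption for this example. Binding the `decode` method of one `JSONDecoder` instance skips the argument handling that `json.loads` performs on every call, while returning the same result for plain string input.

```python
import json

# Bind the decoder method once and reuse it for every payload.
json_decode = json.JSONDecoder().decode

payload = '{"id": 42, "tags": ["a", "b"]}'
data = json_decode(payload)
same_as_loads = data == json.loads(payload)
```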
Mike Fährmann
b0cb4a1b9c replace 'text.extract()' with 'text.extr()' where possible 2022-11-05 01:14:09 +01:00
Mike Fährmann
ff532d6c3c [newgrounds] extract 'type' metadata 2022-09-24 20:29:43 +02:00
Mike Fährmann
0393e59535 [newgrounds] add 'games' extractor (#2955) 2022-09-24 12:34:37 +02:00
Mike Fährmann
c794777600 [newgrounds] prevent exception on empty results (#2727) 2022-07-03 11:44:46 +02:00
Mike Fährmann
37453a9528 [newgrounds] only login if necessary (#2715) 2022-06-29 11:46:07 +02:00
Mike Fährmann
c1768972c2 [newgrounds] update and fix pagination (#2456) 2022-04-07 15:38:41 +02:00
Mike Fährmann
a53cfc845e [newgrounds] warn about age-restricted posts (#2456) 2022-03-30 16:18:33 +02:00
Mike Fährmann
281a5b3b28 [newgrounds] fix video descriptions (#2328) 2022-03-14 08:38:20 +01:00
Mike Fährmann
d71c173150 [newgrounds] strip incomplete HTML tag from '_comment' (#2328) 2022-02-23 21:42:28 +01:00
Mike Fährmann
cf58048bd4 [newgrounds] add 'post_url' metadata field (#2328) 2022-02-23 00:00:23 +01:00
Mike Fährmann
4acc31bd9f [newgrounds] set suitabilities filter before starting a search 2022-01-11 23:50:29 +01:00
Mike Fährmann
37beb1298e [newgrounds] add 'search' extractor (closes #2161) 2022-01-06 19:32:39 +01:00
Vrihub
96fcff182c generic extractor (#735)
* Generic extractor, see issue #683

* Fix failed test_names test, no subcategory needed

* Prefix directory_fmt with "generic"

* Relax regex (would break some urls)

* Flake8 compliance

* pattern: don't require a scheme

This fixes a bug when we force the generic extractor on urls without a
scheme (that are allowed by all other extractors).

* Fix using g: and r: on urls without http(s) scheme

Almost all extractors accept urls without an initial http(s) scheme.

Many extractors also allow for generic subdomains in their "pattern"
variable; some of them implement this with the regex character class
"[^.]+" (everything but a dot).

This leads to a problem when the extractor is given a url starting
with g: or r: (to force using the generic or recursive extractor)
and without the http(s) scheme: e.g. with "r:foobar.tumblr.com"
the "r:" is wrongly considered part of the subdomain.

This commit fixes the bug, replacing the too generic "[^.]+" with the
more specific "[\w-]+" (letters, digits and "-", the only characters
allowed in domain names), which is already used by some extractors.

* Relax imageurl_pattern_ext: allow relative urls

* First round of small suggested changes

* Support image urls starting with "//"

* self.baseurl: remove trailing slash

* Relax regexp (didn't catch some image urls)

* Some fixes and cleanup

* Fix domain pattern; option to enable extractor

Fixed the domain section for "pattern", to pass "test_add" and
"test_add_module" tests.
Added the "enabled" configuration option (default False) to enable the
generic extractor. Using "g(eneric):URL" forces using the extractor.
2021-12-29 22:39:29 +01:00
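The subdomain problem described in the "Fix using g: and r:" item above can be reproduced with two simplified patterns. Both patterns are assumptions written for this example and are much shorter than any real extractor pattern.

```python
import re

# Simplified, hypothetical patterns; the scheme is optional in both.
loose = re.compile(r"(?:https?://)?([^.]+)\.tumblr\.com")
strict = re.compile(r"(?:https?://)?([\w-]+)\.tumblr\.com")

forced = "r:foobar.tumblr.com"

# "[^.]+" accepts ':' and swallows the "r:" prefix into the subdomain.
loose_subdomain = loose.match(forced).group(1)

# "[\w-]+" cannot cross ':', so the prefixed URL does not match at all.
strict_match = strict.match(forced)

# A URL without the prefix still matches the stricter pattern.
plain_subdomain = strict.match("foobar.tumblr.com").group(1)
```

With the stricter class, the prefixed URL falls through to whatever handles the `r:` prefix instead of being claimed with a bogus subdomain.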
Mike Fährmann
7a0da4f93f [newgrounds] add 'format' option (closes #1729) 2021-07-29 19:11:20 +02:00
Mike Fährmann
223a4e79cd [newgrounds] fix using 'category-transfer' (#1274) 2021-07-29 15:54:04 +02:00
Mike Fährmann
36f281330a [newgrounds] fix flash file extraction (closes #1257)
… and add a 'flash' option to choose between flash and video formats.
2021-01-19 17:48:14 +01:00
Mike Fährmann
c14c5d82d6 [newgrounds] use generator for fallback URLs 2020-10-23 00:39:19 +02:00
Mike Fährmann
968d3e8465 remove '&' from URL patterns
'/?&#' -> '/?#' and '?&#' -> '?#'

According to https://www.ietf.org/rfc/rfc3986.txt, URLs are
"organized hierarchically" by using "the slash ("/"), question
mark ("?"), and number sign ("#") characters to delimit components"
2020-10-22 23:31:25 +02:00
Mike Fährmann
3f2ba629ea [newgrounds] provide fallback URLs for video downloads (#1042) 2020-10-16 01:16:12 +02:00
Mike Fährmann
5b844a72b7 [newgrounds] handle embeds without scheme (#1033) 2020-10-15 15:13:54 +02:00
Mike Fährmann
c5e3971b18 [newgrounds] extract image embeds (closes #1033) 2020-10-11 18:15:40 +02:00
Mike Fährmann
f9c1684af7 [newgrounds] restore original video URLs (#1042) 2020-10-07 22:53:53 +02:00
Mike Fährmann
5b927c15df [newgrounds] fix video extraction (closes #1042) 2020-10-01 20:14:16 +02:00
Mike Fährmann
e17d4f44f6 [newgrounds] fix favorites extraction 2020-07-13 23:08:45 +02:00
Mike Fährmann
6294e2c540 add 'text.ensure_http_scheme()' 2020-05-19 22:32:53 +02:00
Mike Fährmann
c56a751dae [newgrounds] fix URLs produced by 'following' extractors (#684) 2020-04-28 21:33:37 +02:00
Mike Fährmann
9b194520db [newgrounds] add 'following' extractor (closes #684) 2020-04-17 22:17:43 +02:00
Mike Fährmann
ae2a33243b [newgrounds] catch general Exceptions 2020-03-18 02:17:43 +01:00
Mike Fährmann
87d4f83597 [newgrounds] make post extraction nonfatal 2020-03-10 01:49:59 +01:00
Mike Fährmann
823fbeaae6 [newgrounds] add 'favorite' extractor (#394) 2020-03-10 01:07:09 +01:00
Mike Fährmann
4e361b3008 add tests for specific datetime values 2020-02-23 16:48:30 +01:00
Mike Fährmann
5ad92fc196 [newgrounds] fix tags metadata extraction 2020-01-01 16:06:58 +01:00
Mike Fährmann
42b9633c7e update test results 2019-11-26 23:27:15 +01:00
Mike Fährmann
d45fabb79d match user profile handling on deviantart and newgrounds 2019-11-22 23:20:21 +01:00
Mike Fährmann
b1f0609de5 [newgrounds] rewrite (#394)
- restructure extractor hierarchy
- extract more metadata
- extract videos without youtube-dl
- be more resilient to errors

TODO:
- favorites
- games, but that might be near impossible for non-flash titles
2019-11-18 21:13:33 +01:00
Mike Fährmann
3ece3976ae [newgrounds] implement login support (#394) 2019-11-16 23:45:32 +01:00
Mike Fährmann
3a07c06865 [newgrounds] update
- create directory per post
- rename variables and methods
2019-11-14 23:17:14 +01:00
Mike Fährmann
a732e9c430 [instagram] update query hashes and headers 2019-08-10 14:13:08 +02:00
Mike Fährmann
1133b7fcbd [smugmug] update unit tests
The account used for tests before has been deleted.
2019-07-19 17:16:24 +02:00
Mike Fährmann
04b8d0894a [newgrounds] improve metadata extraction 2019-07-08 17:53:55 +02:00
Mike Fährmann
b89f0d8d3c update extractor result tests 2019-07-01 20:02:47 +02:00
Mike Fährmann
74c7304c6b [newgrounds] extract 'date', 'favorites', and 'score' 2019-05-08 18:09:17 +02:00
Mike Fährmann
5530871b5a change results of text.nameext_from_url()
Instead of getting a complete 'filename' from a URL and splitting that
into 'name' and 'extension', the new approach gets rid of the complete
version and renames 'name' to 'filename'. (Using anything other than
{extension} for a filename extension doesn't really work anyway)

Example: "https://example.org/path/filename.ext"

before:
- filename : filename.ext
- name     : filename
- extension: ext

now:
- filename : filename
- extension: ext
2019-02-14 16:07:17 +01:00
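The "now" result shape can be sketched with the standard library. `nameext_sketch` is a hypothetical stand-in written for this example; it is not the implementation of text.nameext_from_url() and may differ from it in edge cases.

```python
from urllib.parse import unquote, urlsplit


def nameext_sketch(url):
    """Return 'filename' (without extension) and 'extension' for a URL."""
    # Take the last path segment, ignoring query string and fragment.
    name = unquote(urlsplit(url).path.rpartition("/")[2])
    filename, dot, extension = name.rpartition(".")
    if dot:
        return {"filename": filename, "extension": extension}
    # No dot in the last segment: the whole segment is the filename.
    return {"filename": name, "extension": ""}


result = nameext_sketch("https://example.org/path/filename.ext")
```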