53 Commits

Author SHA1 Message Date
Mike Fährmann
2be54be692 [subscribestar] merge 'user-tag' into regular 'user' extractor (#8737) 2025-12-23 18:58:25 +01:00
Mike Fährmann
7669a1f13a [subscribestar:user-tag] update 'pattern' 2025-12-22 11:43:30 +01:00
Mike Fährmann
b5a7540619 Merge branch 'op+': use '+' for 2-element string concatenations 2025-12-22 11:34:21 +01:00
Mike Fährmann
00c6821a3f replace 2-element f-strings with simple '+' concatenations
Python's 'ast' module and its 'NodeVisitor' class
were incredibly helpful in identifying these
2025-12-22 11:26:04 +01:00
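Commit 00c6821a3f credits `ast.NodeVisitor` for locating the f-strings to replace. A minimal sketch of how such a search might look, assuming a "2-element f-string" means a `JoinedStr` node with exactly two parts and no conversion or format spec. The class name `TwoElementFStrings` and the sample source are hypothetical and not taken from the repository:

```python
import ast


class TwoElementFStrings(ast.NodeVisitor):
    """Collect line numbers of f-strings that consist of exactly two parts."""

    def __init__(self):
        self.hits = []

    def visit_JoinedStr(self, node):
        # A part is "simple" if it is literal text or a plain '{expr}'
        # without a conversion ('!r') or a format spec (':>8').
        simple = all(
            isinstance(part, ast.Constant) or (
                isinstance(part, ast.FormattedValue)
                and part.conversion == -1
                and part.format_spec is None
            )
            for part in node.values
        )
        if simple and len(node.values) == 2:
            self.hits.append(node.lineno)
        # Keep walking so nested f-strings are inspected as well.
        self.generic_visit(node)


# Hypothetical input: only line 2 is a candidate for 'root + "/posts"'.
SOURCE = '''
url = f"{root}/posts"
name = f"{user}_{num}"
size = f"{value:>8}!"
'''

finder = TwoElementFStrings()
finder.visit(ast.parse(SOURCE))
print(finder.hits)
```

Each reported line could then be rewritten by hand or by a follow-up script, e.g. `f"{root}/posts"` becomes `root + "/posts"`, provided the expression is already a string.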
Mike Fährmann
609e19273d [subscribestar] add 'user-tag' extractor (#8737) 2025-12-21 22:14:17 +01:00
Mike Fährmann
e006d26c8e Revert "use f-strings when building 'pattern'"
revert d7c97d5a97.
2025-12-20 22:07:37 +01:00
Mike Fährmann
968597a302 yield 3-tuples for Message.Directory
adapt tuples to the same length and semantics as other messages
2025-12-05 21:39:52 +01:00
Mike Fährmann
d7c97d5a97 use f-strings when building 'pattern' 2025-10-20 21:23:11 +02:00
Mike Fährmann
9bf76c1352 replace 'util.re()' with 'text.re()'
remove unnecessary 'util' imports
2025-10-20 17:44:58 +02:00
Mike Fährmann
c8fc790028 merge branch 'dt': move datetime utils into separate module
- use 'datetime.fromisoformat()' when possible (#7671)
- return a datetime-compatible object for invalid datetimes
  (instead of a 'str' value)
2025-10-20 09:30:05 +02:00
Mike Fährmann
085616e0a8 [dt] replace 'text.parse_datetime()' & 'text.parse_timestamp()' 2025-10-17 17:43:06 +02:00
Mike Fährmann
36a3fe45e4 [subscribestar] improve 'filename' (#8416) 2025-10-15 11:52:39 +02:00
Mike Fährmann
a097a373a9 simplify if statements by using walrus operators (#7671) 2025-07-22 20:57:54 +02:00
Mike Fährmann
d8ef1d693f rename 'StopExtraction' to 'AbortExtraction'
for cases where StopExtraction was used to report errors
2025-07-09 21:07:28 +02:00
Mike Fährmann
f2a72d8d1e replace 'request(…).json()' with 'request_json(…)' 2025-06-29 17:50:19 +02:00
Mike Fährmann
9dbe33b6de replace old %-formatted and .format(…) strings with f-strings (#7671)
mostly using flynt
https://github.com/ikamensh/flynt
2025-06-29 17:50:19 +02:00
Mike Fährmann
e08ec7e083 update copyright notices 2025-06-13 00:03:41 +02:00
Mike Fährmann
b5c88b3d3e replace standard library 're' uses with 'util.re()' 2025-06-06 13:24:52 +02:00
Mike Fährmann
b81fc5c124 replace text.rextract() with rextr() 2025-05-23 18:28:58 +02:00
Mike Fährmann
311eaf5f11 [subscribestar] fix 'title' extraction for 'trix-attachment' posts (#7526) 2025-05-16 19:09:37 +02:00
Mike Fährmann
98fdcd4d72 [subscribestar] fix 'content' extraction (#7486)
and extract 'tags' metadata

Authored by: prowlguru

Co-authored-by: prowlguru <183935626+prowlguru@users.noreply.github.com>
2025-05-10 21:04:27 +02:00
Mike Fährmann
78b34bbdd7 [subscribestar] fix username & password login 2025-04-25 20:15:00 +02:00
Mike Fährmann
8b7f5eacbb [subscribestar] add warning for missing login cookie
and update expected cookie domains and names
2025-04-25 16:20:02 +02:00
Mike Fährmann
af57ab3233 [subscribestar] detect redirects to '/age_confirmation_warning' pages 2025-03-22 11:42:50 +01:00
Mike Fährmann
4807bc215c [subscribestar] extract 'title' metadata (#7219) 2025-03-22 09:46:08 +01:00
Mike Fährmann
79dc04d87c [subscribestar] fix 'post' extractor (#6582)
https://github.com/mikf/gallery-dl/issues/6582#issuecomment-2675939669
2025-02-22 10:08:59 +01:00
Mike Fährmann
7c96c2368f [subscribestar] detect and handle redirects (#6916) 2025-02-01 21:03:24 +01:00
Mike Fährmann
107798eeab [subscribestar] strip whitespace from 'content' 2025-01-04 16:19:22 +01:00
Wyoh Knott
22d4e84372 [subscribestar] Better extraction of content
The structure of content is like this:

```
<div class="post-content" data-role="post_content-text">
                <div class="trix-content">
                    <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
                    <html>
                        <body>
                            <div>
                                Unspeakable thing are written here<br />
                                <br />
                                haiiiiiiiiiiiiiiii hi hi hiii its meee back againnn, plspls leave a comment if uuuu liked it mwah
                                &lt;3
                            </div>
                        </body>
                    </html>
                </div>
            </div>
            <div class="post-uploads
```

Currently we extract content with:

```
(extr('<div class="post-content', '<div class="post-uploads').partition(">")[2])
```

I propose we just take the body parts:

```
extr('<body>', '</body>')
```

which only occur around the actual content.

It is then easier to use it in a filename with the `!H`
formatter: `{content[:160]!H}`. Otherwise the content currently extracted
can't be decoded with it.
2025-01-03 14:57:12 +01:00
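The difference between the two extraction calls quoted in commit 22d4e84372 can be sketched as follows. The `extr()` function here is a simplified stand-in written for this example (it approximates the behavior of gallery-dl's `text.extr()` but is not copied from it), and `PAGE` is a shortened, hypothetical version of the markup shown in the commit message:

```python
def extr(txt, begin, end, default=""):
    """Return the text between the first 'begin' and the next 'end'."""
    try:
        first = txt.index(begin) + len(begin)
        return txt[first:txt.index(end, first)]
    except ValueError:
        # One of the markers is missing
        return default


# Shortened, hypothetical sample of a post's HTML
PAGE = (
    '<div class="post-content" data-role="post_content-text">'
    '<div class="trix-content"><html><body>'
    '<div>Hello<br />world &lt;3</div>'
    '</body></html></div></div>'
    '<div class="post-uploads'
)

# Previous approach: everything between the two outer <div> markers
old = extr(
    PAGE, '<div class="post-content', '<div class="post-uploads'
).partition(">")[2]

# Proposed approach: only what sits inside <body>...</body>
new = extr(PAGE, "<body>", "</body>")

print(old)
print(new)
```

With this sample, `old` still carries the `trix-content` wrapper and the `<html>`/`<body>` tags, while `new` contains only the inner `<div>` with the post text.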
Mike Fährmann
671297a8cc [subscribestar] extend fix + add test
some attachments are inside an element with an additional class besides
'doc_preview', e.g. 'class="doc_preview for_post"'
2025-01-02 18:22:15 +01:00
Wyoh Knott
a46f7981ee [subscribestar] Fix attachment download and add support for audio type
- We change the text.extr 3rd argument to match current structure
  ('class="post-edit_form"')
- We add support for uploads-audios based on a similar structure as the
  attachment type:
  - id = data-upload-id
  - name = audio_preview-title
  - url = src
  - type = audio

Fix #6721
2025-01-02 15:47:09 +01:00
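The field mapping listed in commit a46f7981ee (id, name, url, type) could look roughly like the sketch below. The HTML in `SAMPLE` is an assumption built only from the attribute and class names mentioned in the commit message; the real page markup may differ, and `extract_audios()` is a hypothetical helper, not the extractor's actual code:

```python
import re
from html import unescape

# Hypothetical markup, assembled from the names in the commit message
SAMPLE = (
    '<div class="uploads-audios">'
    '<div class="audio_preview" data-upload-id="123">'
    '<div class="audio_preview-title">Track &amp; One.mp3</div>'
    '<audio src="https://example.org/a/1.mp3?x=1&amp;y=2"></audio>'
    '</div></div>'
)


def extract_audios(html):
    """Build one metadata dict per audio upload found in 'html'."""
    audios = []
    # Everything after each 'audio_preview' container belongs to one upload
    for block in html.split('class="audio_preview"')[1:]:
        upload_id = re.search(r'data-upload-id="([^"]+)"', block)
        title = re.search(r'audio_preview-title">([^<]*)<', block)
        src = re.search(r'src="([^"]+)"', block)
        if not src:
            continue  # nothing to download for this block
        audios.append({
            "id": upload_id.group(1) if upload_id else "",
            "name": unescape(title.group(1)) if title else "",
            "url": unescape(src.group(1)),
            "type": "audio",
        })
    return audios


audios = extract_audios(SAMPLE)
print(audios)
```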
Arased
03486599af Fix subscribestar date parsing in updated posts 2024-06-24 16:40:59 +02:00
Mike Fährmann
ea434963ae [subscribestar] fix file URLs (#5631) 2024-05-23 19:12:01 +02:00
Mike Fährmann
1b34d5ac40 [subscribestar] fix 'date' metadata 2024-03-22 00:45:09 +01:00
Mike Fährmann
57fc6fcf83 replace '24*3600' with '86400'
and generalize cache maxage values
2023-12-18 23:57:22 +01:00
Mike Fährmann
a453335a9f remove test results in extractor modules
and add generic example URLs
2023-09-11 16:30:55 +02:00
Mike Fährmann
f856987297 [subscribestar] fix preview detection (#4468)
and show a warning message when posts contain previews
2023-09-04 22:21:14 +02:00
Mike Fährmann
d97b8c2fba consistent cookie-related names
- rename every cookie variable or method to 'cookies_*'
- simplify '.session.cookies' to just '.cookies'
- more consistent 'login()' structure
2023-07-22 01:20:50 +02:00
Mike Fährmann
dd884b02ee replace json.loads with direct calls to JSONDecoder.decode 2023-02-09 15:22:00 +01:00
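As a rough illustration of what commit dd884b02ee describes: a single `json.JSONDecoder` instance can be created once and its bound `decode` method reused in place of `json.loads`. The name `json_loads` below is chosen for this sketch and is an assumption about naming, not a quote from the code base:

```python
import json

# Create the decoder once; its bound method parses a JSON document string.
json_loads = json.JSONDecoder().decode

data = json_loads('{"id": 1, "tags": ["a", "b"]}')
print(data)
```

For plain calls without extra keyword arguments, the result matches what `json.loads` returns; the direct call only skips the wrapper's argument handling.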
Mike Fährmann
b0cb4a1b9c replace 'text.extract()' with 'text.extr()' where possible 2022-11-05 01:14:09 +01:00
Mike Fährmann
541a61d344 [subscribestar] fix 'date' metadata (#2642)
Handle instances where the actual datetime information
is preceded by "Updated on "
2022-06-04 12:24:08 +02:00
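The fix described in commit 541a61d344 amounts to dropping the "Updated on " prefix before parsing. A minimal sketch, assuming a date layout like `Jun 04, 2022 10:24 am`; both the format string and the helper name `parse_post_date()` are assumptions made for this example, and the site's actual date format may differ:

```python
from datetime import datetime

PREFIX = "Updated on "


def parse_post_date(text, fmt="%b %d, %Y %I:%M %p"):
    """Parse a post date, ignoring a leading 'Updated on ' marker."""
    text = text.strip()
    if text.startswith(PREFIX):
        text = text[len(PREFIX):]
    return datetime.strptime(text, fmt)


print(parse_post_date("Jun 04, 2022 10:24 am"))
print(parse_post_date("Updated on Jun 04, 2022 10:24 am"))
```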
Mike Fährmann
d50a1ec2cc [subscribestar] unescape attachment URLs (fixes #2370) 2022-03-09 19:06:04 +01:00
Mike Fährmann
522782c09d [subscribestar] emit metadata for posts without media (#1569) 2021-11-18 23:42:17 +01:00
Mike Fährmann
1c8aaf9318 [subscribestar] add 'num' enumeration index (closes #2040) 2021-11-18 23:38:41 +01:00
Mike Fährmann
21c2da454f update extractor test results 2021-07-04 22:00:32 +02:00
Mike Fährmann
d09bc5bd34 [subscribestar] improve attachment filenames (#1609) 2021-06-10 17:09:13 +02:00
Mike Fährmann
968d3e8465 remove '&' from URL patterns
'/?&#' -> '/?#' and '?&#' -> '?#'

According to https://www.ietf.org/rfc/rfc3986.txt, URLs are
"organized hierarchically" by using "the slash ("/"), question
mark ("?"), and number sign ("#") characters to delimit components"
2020-10-22 23:31:25 +02:00
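The effect of the character-class change in commit 968d3e8465 can be seen with a small comparison. The URL and both patterns below are made up for illustration and do not come from any extractor:

```python
import re

# Old style: '&' also terminates a path segment
OLD = re.compile(r"https?://example\.org/([^/?&#]+)")
# New style: only '/', '?' and '#' delimit URL components (RFC 3986)
NEW = re.compile(r"https?://example\.org/([^/?#]+)")

url = "https://example.org/a&b?page=2#top"

old_segment = OLD.match(url).group(1)
new_segment = NEW.match(url).group(1)

print(old_segment)
print(new_segment)
```

With the old class the match stops at the `&`, so only `a` is captured; the new class captures the whole segment `a&b` and still stops at the `?`.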
Mike Fährmann
69e4871005 update extractor test results
- sensescans: replace 404d chapters
- mangapark: replace 404d chapters
- subscribestar: update test for attached files
2020-08-28 22:32:32 +02:00
Mike Fährmann
0d84d3af55 [subscribestar] extract attached media files (#852) 2020-08-03 22:02:42 +02:00
Mike Fährmann
e50c75628c [subscribestar] update 'date' parsing 2020-07-24 22:27:36 +02:00