[subscribestar] Better extraction of content

The post content is structured like this:

```
<div class="post-content" data-role="post_content-text">
  <div class="trix-content">
    <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
    <html>
    <body>
    <div>
    Unspeakable thing are written here<br />
    <br />
    haiiiiiiiiiiiiiiii hi hi hiii its meee back againnn, plspls leave a comment if uuuu liked it mwah <3
    </div>
    </body>
    </html>
  </div>
</div>
<div class="post-uploads
```

Currently we extract the content with:

```
(extr('<div class="post-content', '<div class="post-uploads').partition(">")[2])
```

I propose we take only what is inside the body tags:

```
extr('<body>', '</body>')
```

These tags only appear around the actual content. The result is then easier to use in filenames with the `!H` formatter, for example `{content[:160]!H}`. The content as currently extracted can't be decoded with it.
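The difference between the two extraction strategies can be sketched in a few lines of plain Python. This is only an illustration: `PAGE` is a shortened, made-up version of the markup above, and `extract_between` is a hypothetical stand-in for the extractor's `extr` helper, not the real gallery-dl function.

```python
# Shortened, made-up sample of the markup described above.
PAGE = (
    '<div class="post-content" data-role="post_content-text">'
    '<div class="trix-content">'
    '<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" '
    '"http://www.w3.org/TR/REC-html40/loose.dtd">'
    '<html><body><div>first line<br /><br />second line</div></body></html>'
    '</div></div>'
    '<div class="post-uploads"></div>'
)


def extract_between(txt, begin, end):
    """Return the text between the first 'begin' and the next 'end'.

    Hypothetical stand-in for 'extr'; returns an empty string when
    either marker is missing.
    """
    try:
        first = txt.index(begin) + len(begin)
        last = txt.index(end, first)
    except ValueError:
        return ""
    return txt[first:last]


# Current approach: everything from the post-content wrapper up to the
# uploads section, minus the remainder of the opening tag.
old = extract_between(
    PAGE, '<div class="post-content', '<div class="post-uploads'
).partition(">")[2]

# Proposed approach: only what sits between the body tags.
new = extract_between(PAGE, "<body>", "</body>")

print(repr(old[:60]))
print(repr(new))
```

With the old approach, the extracted string begins with the `trix-content` wrapper and the doctype declaration, so a short slice such as `content[:160]` is mostly markup rather than post text. With the proposed approach, the slice starts at the post's own markup.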
2025-01-03 14:47:59 +01:00
parent 5767c0854c
commit 22d4e84372
1 changed files with 2 additions and 6 deletions
--- a/gallery_dl/extractor/subscribestar.py
+++ b/gallery_dl/extractor/subscribestar.py
@@ -137,9 +137,7 @@ class SubscribestarExtractor(Extractor):
            "author_nick": text.unescape(extr('>', '<')),
            "date"       : self._parse_datetime(extr(
                'class="post-date">', '</').rpartition(">")[2]),
-            "content"    : (extr(
-                '<div class="post-content', '<div class="post-uploads')
-                .partition(">")[2]),
+            "content"    : extr('<body>', '</body>')
        }

    def _parse_datetime(self, dt):
@@ -196,7 +194,5 @@ class SubscribestarPostExtractor(SubscribestarExtractor):
            "author_nick": text.unescape(extr('alt="', '"')),
            "date"       : self._parse_datetime(extr(
                '<span class="star_link-types">', '<')),
-            "content"    : (extr(
-                '<div class="post-content', '<div class="post-uploads')
-                .partition(">")[2]),
+            "content"    : extr('<body>', '</body>')
        }