[subscribestar] Better extraction of content
The post content is structured like this:
```
<div class="post-content" data-role="post_content-text">
<div class="trix-content">
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
<body>
<div>
Unspeakable thing are written here<br />
<br />
haiiiiiiiiiiiiiiii hi hi hiii its meee back againnn, plspls leave a comment if uuuu liked it mwah
<3
</div>
</body>
</html>
</div>
</div>
<div class="post-uploads
```
Currently we extract content with:
```
(extr('<div class="post-content', '<div class="post-uploads').partition(">")[2])
```
I propose we take only what is inside the `<body>` element:
```
extr('<body>', '</body>')
```
The `<body>` tags only appear around the actual content.
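As a rough illustration of the difference, the sketch below runs both expressions against a shortened copy of the markup above. `make_extr` is a hypothetical stand-in for the extractor's `extr` helper (it is not the real implementation), and `PAGE` is made-up sample data.

```python
def make_extr(page):
    """Hypothetical stand-in for the extractor's `extr` helper: return the
    text between `begin` and `end`, searching forward from the previous match."""
    pos = 0

    def extr(begin, end):
        nonlocal pos
        try:
            first = page.index(begin, pos) + len(begin)
            last = page.index(end, first)
        except ValueError:
            return ""
        pos = last + len(end)
        return page[first:last]

    return extr


# Shortened, made-up sample mirroring the structure shown above.
PAGE = (
    '<div class="post-content" data-role="post_content-text">\n'
    '<div class="trix-content">\n'
    '<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" '
    '"http://www.w3.org/TR/REC-html40/loose.dtd">\n'
    '<html>\n<body>\n<div>\nSome text<br />\n</div>\n</body>\n</html>\n'
    '</div>\n</div>\n'
    '<div class="post-uploads">'
)

# Current expression: keeps the wrapper div, DOCTYPE, <html> and <body> tags.
old = make_extr(PAGE)(
    '<div class="post-content', '<div class="post-uploads').partition(">")[2]

# Proposed expression: only the markup between <body> and </body>.
new = make_extr(PAGE)('<body>', '</body>')

print(repr(old))
print(repr(new))
```

With this sample, the proposed expression drops the wrapper markup and keeps only the inner `<div>` with the post text.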
This makes it easier to use the content in filenames with the `!H`
formatter, e.g. `{content[:160]!H}`. The content as currently extracted
can't be decoded with it.
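One possible explanation, sketched under assumptions: if the `[:160]` slice is applied before the `!H` conversion, and `!H` roughly strips tags and unescapes entities, then the first 160 characters of the old content are mostly wrapper markup and little visible text survives. `strip_html` below is a hypothetical approximation written for this example, not the formatter's actual code, and the sample strings are made up.

```python
import html
import re


def strip_html(value):
    """Hypothetical approximation of the `!H` conversion:
    drop anything that looks like a tag, then unescape entities."""
    return html.unescape(re.sub(r"<[^>]*>", "", value)).strip()


# Made-up body markup, as the proposed extraction would return it.
BODY = "\n<div>\nSome text &amp; more<br />\n</div>\n"

# The same body wrapped the way the current extraction returns it.
WRAPPED = (
    '\n<div class="trix-content">\n'
    '<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" '
    '"http://www.w3.org/TR/REC-html40/loose.dtd">\n'
    "<html>\n<body>" + BODY + "</body>\n</html>\n</div>\n</div>\n"
)

# Slice first, then convert (assumed order for `{content[:160]!H}`).
print(repr(strip_html(BODY[:160])))
print(repr(strip_html(WRAPPED[:160])))
```

In this sketch the wrapped variant spends most of its 160 characters on the DOCTYPE and surrounding tags, so far less of the post text reaches the filename.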
@@ -137,9 +137,7 @@ class SubscribestarExtractor(Extractor):
             "author_nick": text.unescape(extr('>', '<')),
             "date" : self._parse_datetime(extr(
                 'class="post-date">', '</').rpartition(">")[2]),
-            "content" : (extr(
-                '<div class="post-content', '<div class="post-uploads')
-                .partition(">")[2]),
+            "content" : extr('<body>', '</body>')
         }
 
     def _parse_datetime(self, dt):
@@ -196,7 +194,5 @@ class SubscribestarPostExtractor(SubscribestarExtractor):
             "author_nick": text.unescape(extr('alt="', '"')),
             "date" : self._parse_datetime(extr(
                 '<span class="star_link-types">', '<')),
-            "content" : (extr(
-                '<div class="post-content', '<div class="post-uploads')
-                .partition(">")[2]),
+            "content" : extr('<body>', '</body>')
         }