In my previous posts, I wrote about Mastodon, so I thought I would expand on this topic and present one of my little projects. Mastodon is gaining popularity every day, but it is not yet a big, recognized medium that attracts the attention of larger media companies, which are most active wherever the audience is largest in quantity, not necessarily in quality. In such situations, you have to take matters into your own hands, which is what I did. This is how the idea of MEWS, short for Mastodon nEWS, was born.
Where to start?
Since news portals do not publish on Mastodon and probably do not have a plan to do so, we need to make a bot that will do it for them!
That idea came to me one day. It turns out that the Mastodon API is quite easy to handle through cURL, and since it is, it is just as easy to write a PHP script that will scrape (retrieve data from) the RSS feed of a given news portal, process the data, and publish it as a toot on Mastodon.
Okay, but which portal would I like to start with? Ideally, the one I miss the most on Mastodon! My favorite Polish-language source of information is Rzeczpospolita, for which I pay a small monthly fee because it is behind a paywall. This arrangement is fully understandable to me, because good journalism should not be free.
Building an RSS -> Mastodon bot
The complete code for the bot that is the hero of this post is available on my GitHub at this link. I mention this because I won’t be posting the entire code line by line here; instead, I’ll describe its most essential parts. I would also like to note that I am not a professional programmer, only a self-taught hobbyist, so my code may not be perfect or compliant with the accepted standards of the dev world. It may also not be fully optimized, but what matters is that it works as it should.
We start by creating two files:
- rzeczpospolita.txt – it will store links to articles that we have already transferred from RSS to Mastodon, so that we do not duplicate toots,
- rzeczpospolita.php – the main script of the bot.
I wanted the bot to be as universal as possible, so that with little effort it could be adapted to another website and easily reused by others. For that reason, at the beginning of the script I extracted certain variables (or rather constants 🤔) that I will use later in the code. So, at the start, we need to determine three things. The first is the token, which is our private access key to the Mastodon API. It is obtained by going to the Settings of the account on which we will publish the automatic toots generated by the bot, then the Development tab, and the New application button. Give the application any name (I used “MEWS bot”) and, in the Scopes section, uncheck everything except “write”, which means the application we are creating will only be allowed to publish on this account. The second thing the script needs is the address of the instance on which the bot account is registered. The third is the character limit in force on this instance (stored below as $instance_rate_limit). By default this is 500, but some instances allow more (for example, on our local instance 101010.pl it is 2048 characters).
$token = "[PASTE TOKEN HERE]";
$instance_url = "[PASTE INSTANCE URL HERE]";
$instance_rate_limit = 500;
Next, we create an array with links to the RSS channels of the portal whose articles we want to publish through the bot. It can be one or several links separated by commas. Rzeczpospolita has one main RSS feed, so for it, this instruction will look like this.
$urls = array(
"https://www.rp.pl/rss_main"
);
But if we wanted to filter content thematically, we can limit ourselves to the thematic RSS feeds, of which there are several, and do it like this:
$urls = array(
"https://moto.rp.pl/rss/2651-motoryzacja",
"https://cyfrowa.rp.pl/rss/2991-cyfrowa",
"https://energia.rp.pl/rss/4351-energetyka"
);
We load the contents of the rzeczpospolita.txt file so that we can later filter out those articles from the RSS channel that we have already shared.
$file = file_get_contents("rzeczpospolita.txt");
Using the foreach loop, we go through all the links to the RSS channels given in the $urls array.
foreach($urls as $url)
{...}
Using the simplexml_load_file() function, we load the content of the RSS channel into a multilevel SimpleXML object called $feeds, which we can traverse like a nested structure.
$feeds = simplexml_load_file($url);
We use the foreach loop again, but this time to step through the individual articles (items) of the RSS feed.
foreach ($feeds->channel->item as $item)
Let’s now take a look at the syntax of an example item in an RSS feed:
<item>
  <guid isPermaLink="true">https://cyfrowa.rp.pl/technologie/art37858421-chinski-robot-jak-terminator-zmienia-ksztalt-i-przelewa-sie-przez-kraty</guid>
  <mainProfile><![CDATA[Technologie]]></mainProfile>
  <title><![CDATA[Chiński robot jak Terminator. Zmienia kształt i przelewa się przez kraty]]></title>
  <link><![CDATA[https://cyfrowa.rp.pl/technologie/art37858421-chinski-robot-jak-terminator-zmienia-ksztalt-i-przelewa-sie-przez-kraty]]></link>
  <description><![CDATA[Zespołowi badaczy z Chin udało się opracować rozwiązanie niczym z filmów science fiction. Stworzyli zmiennokształtnego robota, umieścili go w zminiaturyzowanym modelu więzienia i pokazali, jak potrafi wydostać się on zza krat.]]></description>
  <category>Technologie</category>
  <pubDate>Sat, 28 Jan 2023 11:56:00 +0100</pubDate>
  <enclosure length="0" type="image/jpeg" url="https://i.gremicdn.pl/image/free/497cf5a2a1609a24bd425fe122641ed9/?t=resize:fill:600:300,enlarge:1"/>
  <author>Michał Duszczyk</author>
  <redirectUrl/>
  <pay_status>Preview</pay_status>
</item>
XML files store their information between appropriately named tags, whose names tell us what they contain. Let’s establish what we would like our toot to look like, and therefore what we need in order to construct it. My vision was:
TITLE
SEPARATOR (5 DASHES)
THEMATIC HASHTAGS
SEPARATOR (5 DASHES)
SHORT DESCRIPTION (IF NECESSARY, SHORTENED ACCORDING TO INSTANCE CHARACTER LIMIT)
SEPARATOR (NEWLINE CHARACTER)
LINK
Now that we know what we need, let’s start extracting that information from the XML file. We’ll start with the link. It may seem like we’re starting from the end, but this is deliberate: there is no need to retrieve the rest if it turns out that the link is already in the rzeczpospolita.txt file, which would mean the bot has already processed it and the article it points to has already been posted to Mastodon. The link is located between the <link>…</link> tags, and since we used the simplexml_load_file() function earlier, we can access it with the simple notation $item->link. We still need to convert the retrieved data to a string using the strval() function and remove unnecessary elements from that string.
$link = strval($item->link); // Retrieve the link from the XML file and format it as a string
$link = str_replace("<![CDATA[", "", $link); // Remove "<![CDATA[" from the beginning of the string
$link = str_replace("]]>", "", $link); // Remove "]]>" from the end of the string
This way we have saved, in a variable named $link, the string containing the link to the article. Now we need to check whether it appears in the rzeczpospolita.txt file. We will use the str_contains() function for this, which returns true if the $file string contains the $link string, and false if it does not.
if(str_contains($file, $link))
{
continue; // If it appears, skip this item and continue executing the loop
}
else
{
... // If it doesn't appear, execute the rest of the code, which will be described later in the post
}
Once we know that we haven’t posted a toot about the article yet, we move on to obtaining the remaining items from the RSS feed. We retrieve the article title and description in a similar way to how we did it with the link, while removing unnecessary characters.
$title = strval($item->title);
$title = str_replace("<![CDATA[", "", $title);
$title = str_replace("]]>", "", $title);
$description = strval(strip_tags($item->description));
$description = str_replace("<![CDATA[", "", $description);
$description = str_replace("]]>", "", $description);
We still have the thematic hashtags, which will be the equivalents of the categories the article has been assigned to. With hashtags, the situation is slightly different than with the previously retrieved data: while articles from Rzeczpospolita are usually assigned to only one category, on other portals an article often belongs to more than one, and there is more than one <category>…</category> element to retrieve. While creating the MEWS bot, I decided that hashtags are a fairly important part, because they allow followers to easily filter the topics that interest them or, conversely, do not interest them. They must be unique, so I add MEWS at the end; that way a user can be sure that blocking a given hashtag blocks only toots from the MEWS bot.
We start preparing the hashtags by creating a $hashtag array. Then we use the foreach loop again to collect all the values of the article’s category elements and process each one appropriately: lowercase it, capitalize each word, remove spaces, and add a # prefix and a MEWS suffix. Each result goes into the previously created array. Finally, I append one more hashtag – #MEWS – which is not a category but a common hashtag for all MEWS toots, and join all the elements of the array into one string, separating them with a space.
$hashtag = array();
foreach($item->category as $category)
{
$category = ucwords(strtolower(strval($category)));
$category = str_replace(" ", "", $category);
$category = "#".$category."MEWS";
$hashtag[] = $category;
}
$hashtag[] = "#MEWS";
$hashtags = implode(" ", $hashtag);
This way, I store the string with all hashtags under the $hashtags variable, which I will soon attach to the toot.
Now that we know the length of all components, we need to calculate whether we can fit everything into one toot. However, if it turns out that the message in this form is longer than the limit set at the beginning, we will have to shorten the description saved in the $description variable to fit within the limit. Let’s start by calculating the limit for the description using the formula:
Description limit = Allowable number of characters for one toot – Title length – Two separators of 5 characters each – Six newline characters – Hashtag string length – Link length – Three dots as shortened description ending – 10 reserved characters for safety.
$description_limit = $instance_rate_limit - strlen($title) - 10 - 6 - strlen($hashtags) - strlen($link) - 3 - 10;
Now we just need to check whether the length of the description is greater than the limit, and if so, shorten it to the length of the calculated limit and add three dots at the end. We will use two functions for this: strlen() to calculate the length of a string and substr() to extract a smaller string of a specific length starting from the first character (0).
if(strlen($description) > $description_limit)
{
$description = substr($description,0,$description_limit);
$description .= "...";
}
OK, now we can start composing the content of the toot.
$status_message = $title."\r\n";
$status_message .= "-----"."\r\n";
$status_message .= $hashtags."\r\n";
$status_message .= "-----"."\r\n";
$status_message .= $description."\r\n\r\n";
$status_message .= $link;
The message is ready, so it’s time to set up the cURL request parameters, i.e., communication with the Mastodon API. We will start by defining the data that we will send in the request, which are:
- status – the content of the toot that we have prepared,
- language – language in which it’s written,
- visibility – the visibility of the toot, available options are public, unlisted, private, and direct. I have chosen unlisted because I don’t want to spam people on the local and global timeline, but I also want all published toots to be visible on the bot’s profile.
$status_data = array(
"status" => $status_message,
"language" => "pl",
"visibility" => "unlisted"
);
The header for the API function we are going to use (publishing a status) does not have to be extensive; it only needs to contain the necessary authorization, i.e. the token we defined at the beginning.
$headers = [
"Authorization: Bearer ".$token
];
Everything is ready, so we build and execute a cURL request. First, we initialize the request. Then we specify the URL to which we will send it – for the “publish status (toot)” API function, this is [INSTANCE URL]/api/v1/statuses. Next, we instruct cURL to use the standard HTTP POST method and to return information about the result of the request (success or failure, plus an error code if applicable). Finally, we attach the previously defined header and the main content. The last two lines execute the cURL request, save the decoded result in the $output_status variable (not required, but useful for diagnostics, as shown right after the request), and close the connection.
$ch_status = curl_init();
curl_setopt($ch_status, CURLOPT_URL, $instance_url."/api/v1/statuses");
curl_setopt($ch_status, CURLOPT_POST, 1);
curl_setopt($ch_status, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch_status, CURLOPT_HTTPHEADER, $headers);
curl_setopt($ch_status, CURLOPT_POSTFIELDS, $status_data);
$output_status = json_decode(curl_exec($ch_status));
curl_close ($ch_status);
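Although saving the response is optional, the decoded $output_status makes a quick sanity check possible. Below is a minimal sketch that is not part of the original bot, assuming the typical Mastodon JSON response (a status object with an id field on success, an object with an error field on failure):
if(isset($output_status->error))
{
    echo "Tooting failed: ".$output_status->error."\n"; // The API responded with an error message
}
elseif(isset($output_status->id))
{
    echo "Toot published, id: ".$output_status->id."\n"; // The API returned the newly created status
}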
Finally, at the end of the loop, we add the link that we have processed to the list of processed links.
$file .= $link."\n";
The last line before the end of the script updates the rzeczpospolita.txt file with the content of the $file variable, which now holds the links to previously published articles as well as those published during this particular run of the script.
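For completeness, that final write could be as simple as the line below; this is only a sketch, and the full script on GitHub may phrase it slightly differently.
file_put_contents("rzeczpospolita.txt", $file); // Overwrite rzeczpospolita.txt with the updated list of processed links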
The bot is ready!
Now all that remains is to place the bot code on some hosting or a server (e.g. one running nginx or Apache). It is also a good idea to set up a cron job that will trigger the script at a fixed interval (e.g. every 30 minutes). Most hosting services offer this feature; it will be called cron jobs, recurring tasks, or something similar.
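If you manage the server yourself, a crontab entry along these lines would do the job. This is only a sketch – the PHP binary path and the script location are placeholders to adjust for your setup.
# Run the MEWS bot every 30 minutes and discard its output
*/30 * * * * /usr/bin/php /path/to/rzeczpospolita.php >/dev/null 2>&1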
This post turned out to be a pretty long block of text, but I hope I described everything clearly. For other portals, the bot script will require minor modifications, because RSS channels, or rather their formatting, sometimes differ a bit. However, this is not an obstacle that cannot be overcome; it is enough to review the XML content of the given RSS feed and adjust the code accordingly.
Without further ado, I will just add links to the bots that I have launched myself below. The source code for all three bots is available on my GitHub, so feel free to take a look. They are published under the MIT license, so you can basically do whatever you want with them. I have only one request – if you use my code and create your own bot of this type, let me know; I would be happy to see how it turned out, and I may be interested in following it 😉
- 🇵🇱 Rzeczpospolita – @rzeczpospolita@101010.pl
- 🇵🇱 OKO.press – @oko_press@101010.pl
- 🇬🇧 The Guardian – @guardian@mastodon.world