Atom feed intricacies

Jan 4, 2024

Somehow I'm on a kick of writing a whole lot right now, and specifically on nerdy topics it seems. I can't help myself. It's just what's going on in my life currently. Around two weeks ago, I had a manic (colloquial) moment, wiped my computer, and installed Linux. It isn't the first time I've run Linux as my primary operating system, and, every time, it just consumes my life. I think my brain is very susceptible to getting hooked onto nerdy neuroses. Anyway, I downloaded a command-line feed reader and noticed that my own blog's feed was showing up totally weirdly. This observation triggered a (nearly) full day of reading the RSS/Atom specification and fixing my feed generation script. I'm not going to write anything new here, since RSS/Atom is very well documented, but I just couldn't let this event pass by without noting it somehow.

I am also very aware that posts about a blog/site's backend are always much more interesting for the webmaestro (do we have a better word than webmaster yet???) and extremely dry for everyone else. Recognising that, onto the very indulgent post.

My feed generation process

I know lots of people use some software tool to generate their feeds, whether Hugo or some other generator. I don't. As I noted on my about page, I wrote everything that goes into this website by hand, except the syntax highlighting on pages with code snippets. Even though static site generators are very convenient, I just like making each part of my website myself (see the above reference to my neurotic nature, especially in the context of tech). Anyway, that's a very long preamble to say that I wrote the feed generator for my blog by myself. I recognise that having to sink hours into debugging my feed was a problem 100% of my own making because I choose not to use an off-the-shelf generator.

My feed generating script

My script isn't anything fancy. I wrote it in Julia, the language with which I am the most comfortable. I literally just have a skeleton feed.xml file, into which the script inserts an entry for each blog item.

My main goals with the feed were:

to always have the most updated version of each post
to have the full body of each post, not just a link to the post like some feeds

To those ends, my script basically loops through every single blog post, extracts the necessary information for each (e.g. title, date, body), builds an entry using this information, then inserts the entries into the skeleton file. I know it's inefficient to go through this process every time to generate the feed (for example, why not only modify the entries that were actually updated by checking something like git?), but it's super conceptually straightforward and therefore really easy to code up. Plus, I just don't have that many posts yet, and Julia is a fast language, especially for loops. The script runs in no time at all.
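To make that loop concrete, here's a rough sketch of the same idea. It's in Python rather than Julia, and it is not the actual script: the `Post` type, the skeleton text, the placeholder comments, and the example.com URLs are all made up for illustration. It also assumes every timestamp is an RFC 3339 string in UTC, so that plain string comparison picks the newest one.

```python
from dataclasses import dataclass
from html import escape

# Hypothetical skeleton; the real feed.xml skeleton is not shown in this post.
SKELETON = """<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example blog</title>
  <id>https://example.com/</id>
  <author><name>Example Author</name></author>
  <updated><!-- UPDATED --></updated>
<!-- ENTRIES -->
</feed>
"""


@dataclass
class Post:
    title: str
    url: str
    updated: str    # RFC 3339 timestamp, e.g. "2024-01-04T00:00:00Z"
    body_html: str  # full HTML body of the post


def build_entry(post):
    """Build one Atom <entry> from a post's title, date, and body."""
    # The body is escaped and marked type="html", for the reasons
    # covered further down in this post.
    return (
        "  <entry>\n"
        f"    <title>{escape(post.title)}</title>\n"
        f"    <id>{escape(post.url)}</id>\n"
        f'    <link href="{escape(post.url)}"/>\n'
        f"    <updated>{post.updated}</updated>\n"
        f'    <content type="html">{escape(post.body_html)}</content>\n'
        "  </entry>"
    )


def build_feed(posts):
    """Regenerate the whole feed from scratch on every run."""
    entries = "\n".join(build_entry(p) for p in posts)
    # String comparison is enough here only because of the UTC assumption.
    latest = max((p.updated for p in posts), default="1970-01-01T00:00:00Z")
    return (SKELETON
            .replace("<!-- UPDATED -->", latest)
            .replace("<!-- ENTRIES -->", entries))
```

Rebuilding everything each time is the whole design: there is no state to track between runs, so an edited post shows up in the feed simply because its entry gets rebuilt along with all the others.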

What is it that they say about optimisation having to balance development time vs runtime? Usually I'm so neurotic about optimising runtime that it's comical how much development time I sink in. For once, I think I am striking the right balance on this spectrum.

So what was messing up?

It's outlined in the W3C documentation, but I just was too dense to truly get it. Basically, to have the full content of the post in the feed entry, I was just taking the raw HTML body of the post and sticking it into the feed entry directly.

It turns out that, though lots of RSS/feed aggregators are smart enough to interpret the resulting .xml file properly and render the HTML, you aren't actually supposed to straight up dump HTML into the .xml file's <content> tag. If you do, the W3C validation service will yell at you, and rightfully so. It's against the specification. And, precisely for this reason, my command-line feed aggregator was not displaying my feed entries properly.

To include HTML-marked-up text in the <content> tag, you need to:

tell the .xml file that your content is in the HTML format, by including the type="html" attribute in the content tag
escape any of the HTML markup syntax: for example, a <p> tag should become &lt;p&gt;, so the angle brackets are represented by their character entities rather than the literal characters
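As a small illustration of those two steps, here is a Python sketch (not the Julia script itself) that uses the standard library's `html.escape`. The `content_element` helper and the sample body are invented for the example.

```python
from html import escape
import xml.etree.ElementTree as ET


def content_element(body_html):
    """Return an Atom <content> element carrying escaped HTML."""
    # quote=False: only &, < and > need escaping inside element text.
    return f'<content type="html">{escape(body_html, quote=False)}</content>'


body = '<p>Tea & <a href="/cake">cake</a></p>'
element = content_element(body)
print(element)
# <content type="html">&lt;p&gt;Tea &amp; &lt;a href="/cake"&gt;cake&lt;/a&gt;&lt;/p&gt;</content>

# An XML parser undoes the escaping, so a feed reader gets the original HTML back.
assert ET.fromstring(element).text == body
```

One detail that is easy to trip over: entities already present in the HTML get escaped again, so `&amp;` in a post body becomes `&amp;amp;` in the feed. That is the intended behaviour, since the reader's XML parser strips exactly one layer of escaping before handing the text to the HTML renderer.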

It's one of those things that's dead simple once you figure it out, but for some reason just totally resists comprehension until you do. Typing it out now, I almost feel silly with how straightforward it is. But when I was in the weeds mucking about, I tried so many permutations of escaped/unescaped, type="html"/type="text"/type="text/html", etc.

(As a little meta-comment, I actually had to escape these characters in this blog post itself to get them to render properly. Check out the source code of this post if you want to see what I really wrote in that previous paragraph compared to what you see in your browser.)

I always wondered why some feeds don't have the full article content but instead just a link to the post's page. I just assumed the webmaestro wanted to force traffic to their website. While I'm sure that's part of it, I am also now extremely sympathetic to simply putting a link and not the entry's contents. It's so much easier. That said, I guess most people just use a feed generator, like a sane person, so maybe it's not really that big an ordeal for them to include the post's body in the feed. Come on, people! If I can do it, you really have no excuse.

Thanks for nothing, Linux

I'm writing this post for my future self as reference, and for the minute possibility that someone else is both as stubborn as me about building a feed generator by hand and also stuck on how to actually get the articles to display properly. The likelihood is definitely laughably slim. Even if such a person exists, chances are they haven't noticed that their feed is wonky because, as I previously mentioned, most feed aggregators now are smart enough to parse an incorrectly-structured feed (e.g. with unescaped characters). It's why I, myself, didn't notice there was something wrong with my feed until this week.

I only really became aware of the problem, as I said, because I tried to view my feed with a command-line aggregator that doesn't do the smart parsing. (It's Newsboat, by the way.) I guess I have Linux to thank for that. You certainly don't need to be using Linux to use a command-line aggregator, but something about being on Linux makes command-line tools feel so much more a propos. People always say using Linux is good because it forces you to learn about your computer system more deeply. I definitely think that sentiment is true, and I didn't really need convincing of it, but here's a cute example regardless.
