Extracting specific patterns of text within a RSS feed

  • 14 May 2021
  • 2 replies
  • 420 views

Challenge: Take an RSS feed that delivers HTML as a text blob, parse that HTML for specific tags, drop result into an Airtable.

Obstacle: Parsing the HTML.

Sample entry in the RSS feed:

<item> <title>The Monsters Episode (ep48)</title> <link>https://letterboxd.com/lofitop5/list/the-monsters-episode-ep48/</link> <guid isPermaLink="false">letterboxd-list-17914720</guid> <pubDate>Sat, 15 May 2021 02:39:48 +1200</pubDate> <description><![CDATA[ <p>Lizards and apes and gremlins, oh my! Welcome to our Top 5 Monster Movies episode, where we utterly redefine monsters to include “cute fuzzy CGI things who go through doors”, “weird CGI things with eyes in their hands”, and, of course, BrundleFly. Our lists also include a musical, a monster who lives in a flower pot and sings, a camouflage monster from outer space, and … the wind. Listen and play along to try to guess which monster has excellent “hand-eye coordination”!</p><p><a href="https://www.lofitop5.com/lofi-top-5-48-the-monsters-episode/" rel="nofollow">Listen to this episode here!</a></p> <ul> <li> <a href="https://letterboxd.com/film/king-kong/">King Kong</a> <p>Not just the king of the apes, but hands down, King of the Monster Movies</p> </li> <li> <a href="https://letterboxd.com/film/godzilla/">Godzilla</a> </li> <li> <a href="https://letterboxd.com/film/the-sandlot/">The Sandlot</a> </li> <li> <a href="https://letterboxd.com/film/the-mist/">The Mist</a> </li> <li> <a href="https://letterboxd.com/film/monsters-inc/">Monsters, Inc.</a> <p>By far it has the fewest jump-scares for the genre, but as the title suggests, it's a monster movie!</p> </li> <li> <a href="https://letterboxd.com/film/the-fly-1986/">The Fly</a> </li> <li> <a href="https://letterboxd.com/film/jurassic-park/">Jurassic Park</a> </li> <li> <a href="https://letterboxd.com/film/little-shop-of-horrors/">Little Shop of Horrors</a> <p>Is this the only monster movie ever made with a sadistic singing dentist?</p> </li> <li> <a href="https://letterboxd.com/film/gremlins/">Gremlins</a> <p>After they got wet, the Gremlins became the iconic "micro-monsters" of the 1980s. 
Though we're still wondering when does "after midnight" end?</p> </li> <li> <a href="https://letterboxd.com/film/hellboy/">Hellboy</a> <p>As much as the sequel had issues, we loved Del Toro's vision and Perlman was just born to be Hellboy</p> </li> </ul> <p>...plus 8 more. <a href="https://letterboxd.com/lofitop5/list/the-monsters-episode-ep48/">View the full list on Letterboxd</a>.</p> ]]></description> <dc:creator>lofitop5</dc:creator> </item>

As you can see above, between the <ul> and </ul> tags the HTML follows a repeating pattern that looks like this:

 <li> <a href="URL">TITLE</a> <p>TEXT</p> </li>

What I’d like to do is something like “for each LI, create a row in Airtable with the URL, Title, and Text.”

I’m assuming the way to do this is a JavaScript block of code after I’ve fetched the RSS entry, but I am otherwise blocked. Any help/tips would be greatly appreciated.


2 replies

Hi @jtoeman 

Give this code a try in a JavaScript Code step. It takes one input parameter (HTML) and returns three arrays (URL, Title, Text).

(I’m sure there are more elegant ways to write the JavaScript, but this ought to work)

NOTE: The code assumes every list item has a URL and a Title. The description is optional; when it is missing, an empty string is stored so the three arrays stay the same length.

 

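// Input: inputData.HTML holds the HTML from the RSS item's <description>.
let HTML = inputData.HTML;

// Keep only the markup between the first <ul> and </ul>, then split it into <li> chunks.
let LIs = HTML.split('<ul>')[1].split('</ul>')[0].trim().split('<li>');
let URL = [];
let Title = [];
let Text = [];

// Drop empty or whitespace-only chunks (such as the one before the first <li>).
LIs = LIs.filter(i => i.trim());

for (let i = 0; i < LIs.length; i++) {
  LIs[i] = LIs[i].replace('</li>', '').trim();

  // The href value sits between <a href=" and the next ">.
  const Link = LIs[i].split('<a href="')[1].split('">')[0].trim();
  URL.push(Link);

  // The link text sits between that "> and </a>.
  const Label = LIs[i].split('">')[1].split('</a>')[0].trim();
  Title.push(Label);

  // The <p> description is optional; push "" so the arrays stay aligned.
  if (LIs[i].includes('<p>')) {
    const Description = LIs[i].split('<p>')[1].split('</p>')[0].trim();
    Text.push(Description);
  } else {
    Text.push('');
  }
}

// LIs and HTML are returned alongside the three arrays, which helps when debugging.
output = [{URL, Title, Text, LIs, HTML}];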
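
If the string splitting above ever proves too brittle (for example, if a link inside the list gains an extra attribute such as rel="nofollow"), a regular-expression variant might hold up a little better. This is only a sketch in plain JavaScript, not something tested inside a Zap: `extractItems` and `sample` are made-up names for illustration, and it still assumes the same <li> <a href="URL">TITLE</a> <p>TEXT</p> </li> shape from the question.

```javascript
// Hypothetical helper (not part of the code above): returns one object per <li>.
function extractItems(html) {
  // Only look between the first <ul> and its closing </ul>.
  const list = (html.match(/<ul>([\s\S]*?)<\/ul>/) || [])[1] || '';
  const items = [];
  // Capture the body of each <li>...</li> in turn.
  const liPattern = /<li>([\s\S]*?)<\/li>/g;
  let li;
  while ((li = liPattern.exec(list)) !== null) {
    // href value and link text; [^>]* tolerates extra attributes on the <a> tag.
    const link = li[1].match(/<a\s[^>]*href="([^"]*)"[^>]*>([\s\S]*?)<\/a>/);
    if (!link) continue; // skip list items that have no link at all
    // The <p> description is optional.
    const para = li[1].match(/<p>([\s\S]*?)<\/p>/);
    items.push({
      url: link[1].trim(),
      title: link[2].trim(),
      text: para ? para[1].trim() : '',
    });
  }
  return items;
}

// Shortened sample based on the feed entry in the question.
const sample =
  '<ul> <li> <a href="https://letterboxd.com/film/king-kong/">King Kong</a> ' +
  '<p>Not just the king of the apes</p> </li> ' +
  '<li> <a href="https://letterboxd.com/film/godzilla/">Godzilla</a> </li> </ul>';

console.log(extractItems(sample));
```

Regular expressions are still not a real HTML parser, so this is best treated as a fallback for this one predictable feed rather than a general solution.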

 

After the Code step in your Zap, add a Looping step to create an Airtable record for each derived line item: https://zapier.com/apps/looping/integrations
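
To picture what the loop does with the three arrays: position 0 of URL, Title, and Text belongs to the first record, position 1 to the second, and so on. The sketch below shows that pairing in plain JavaScript. `toRows` is a hypothetical helper used only for illustration; it is not a Zapier or Airtable function, and the Looping step handles this pairing for you.

```javascript
// Hypothetical helper: pair up three parallel arrays by index.
function toRows(urls, titles, texts) {
  return urls.map((url, i) => ({
    URL: url,
    Title: titles[i],
    Text: texts[i] || '', // a missing description becomes an empty string
  }));
}

// Example values taken from the sample feed entry.
const rows = toRows(
  ['https://letterboxd.com/film/king-kong/', 'https://letterboxd.com/film/godzilla/'],
  ['King Kong', 'Godzilla'],
  ['Not just the king of the apes, but hands down, King of the Monster Movies', '']
);

console.log(rows);
```

This is also why the Code step pushes an empty string when a list item has no description: if Text were shorter than URL and Title, the positions would no longer line up.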

 

this worked! thank you very much, SO helpful!!!