The need for more advanced HTML parsing/DOM libraries

  • 16 April 2024

Today I had to build a Zap that parsed an HTML email template to pick out specific data as fields for a CRM system. I first looked at Email Parser by Zapier, but it quickly became obvious that, despite its multiple matching methods and AI functionality, the layout of the email was too complex for it. The service sending the email also appeared to send HTML only, with no plain-text part, which was the main issue, as I think realistically Email Parser by Zapier is designed for basic templates. There are dedicated email parsing services, but they cost extra or require a further subscription, so I looked at other options.

I then went back to a standard inbound email action followed by a Code by Zapier step using JavaScript, and eventually settled on regex to pick out specific parts of the HTML and assign them as fields, with some further parsing. This worked, partly because, by some luck, the HTML elements had unique identifiers for every key piece of data. That meant the regex rules could be fairly specific, with no risk of matching multiple elements and no need for overly complex patterns. While I got a working solution, it raised a potential area for Zapier to consider: DOM parsing tools, specifically for HTML/XML responses.
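A minimal sketch of this kind of id-based regex extraction is shown here. It is not the code from the original Zap: the email fragment, the ids, and the helper names `escapeRegExp` and `textById` are all hypothetical, and it assumes each target element carries a unique `id` and does not contain a nested element with the same tag name.

```javascript
// Hypothetical email fragment; the ids are illustrative, not from the real template.
const html = `
  <table>
    <tr><td id="customer-name">Jane Doe</td></tr>
    <tr><td id="customer-email"> jane@example.com </td></tr>
    <tr><td id="order-total"><strong>42.00</strong></td></tr>
  </table>`;

// Escape characters that have a special meaning in a regular expression.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Return the text inside the first element with the given id, or null.
// Assumption: the element has no nested element with the same tag name,
// otherwise the lazy match stops at the inner closing tag.
function textById(source, id) {
  const pattern = new RegExp(
    `<([a-z][a-z0-9]*)\\b[^>]*\\sid=["']${escapeRegExp(id)}["'][^>]*>([\\s\\S]*?)</\\1>`,
    "i"
  );
  const match = pattern.exec(source);
  if (!match) return null;
  // Strip inner tags and collapse whitespace; HTML entities are not decoded.
  return match[2].replace(/<[^>]+>/g, "").replace(/\s+/g, " ").trim();
}

const fields = {
  name: textById(html, "customer-name"),
  email: textById(html, "customer-email"),
  total: textById(html, "order-total"),
};
```

In a Code by Zapier step the HTML would come from the step's input data rather than a literal string, and `fields` would be handed back as the step's output. The sketch only holds up because the ids are unique; without them the patterns would have to lean on document structure, which is where regex starts to break down.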

Looking around, Zapier does have some utilities related to HTML parsing and can use regex in places, as well as a web scraping option. That option only works with URLs, however, and an HTML email template is basically a big string. Regex will also only get you so far before it becomes very error-prone and slow. While JavaScript in the browser has DOMParser, Code by Zapier runs on Node.js, where DOMParser is not available, and as we know we cannot install additional libraries in Code by Zapier steps. I also had to use a Code by Zapier step because, even if the utilities/formatters could pick out the data, I’d have increased the step count of the overall Zap to 3 or 4 times the current amount.

While I personally would consider what I wrote today a complete last resort and against all best practice and judgement, it did suggest there could be an avenue for DOM parsing options that are specifically designed to pick elements by identifier, class, attribute etc., rather than resorting to regex.
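To make the idea concrete, here is a rough sketch of what picking elements by identifier, class, or attribute could look like using only built-in JavaScript. It is not an existing Zapier feature or a real library API: `parseHtml`, `findAll`, `textOf`, `byId`, `byClass`, and `byAttr` are hypothetical names, and the parser is deliberately naive. It assumes reasonably well-formed markup, does not decode entities, and does not handle `>` inside attribute values or the contents of script/style elements.

```javascript
// Void elements never have a closing tag.
const VOID_TAGS = new Set(["area", "base", "br", "col", "embed", "hr", "img",
  "input", "link", "meta", "source", "track", "wbr"]);

// Parse attribute text such as: id="x" class='a b' disabled
function parseAttributes(text) {
  const attrs = {};
  const pattern = /([^\s=\/]+)(?:\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s"'>]+)))?/g;
  let m;
  while ((m = pattern.exec(text)) !== null) {
    attrs[m[1].toLowerCase()] = m[2] ?? m[3] ?? m[4] ?? "";
  }
  return attrs;
}

// Build a very small element tree: { tag, attrs, children }, text as strings.
function parseHtml(html) {
  const root = { tag: "#root", attrs: {}, children: [] };
  const stack = [root];
  // Alternatives: comment | doctype | closing tag | opening tag | text
  const token = /<!--[\s\S]*?-->|<![^>]*>|<\/([a-zA-Z][^\s>]*)\s*>|<([a-zA-Z][^\s\/>]*)([^>]*)>|([^<]+)/g;
  let m;
  while ((m = token.exec(html)) !== null) {
    const parent = stack[stack.length - 1];
    if (m[1]) {
      // Closing tag: pop back to the nearest matching open element, if any.
      const index = stack.map((n) => n.tag).lastIndexOf(m[1].toLowerCase());
      if (index > 0) stack.length = index;
    } else if (m[2]) {
      const tag = m[2].toLowerCase();
      const node = { tag, attrs: parseAttributes(m[3]), children: [] };
      parent.children.push(node);
      // Void and self-closed elements cannot contain children.
      if (!VOID_TAGS.has(tag) && !/\/\s*$/.test(m[3])) stack.push(node);
    } else if (m[4]) {
      parent.children.push(m[4]);
    }
  }
  return root;
}

// Concatenate all text beneath a node.
function textOf(node) {
  if (typeof node === "string") return node;
  return node.children.map(textOf).join("");
}

// Collect every descendant element for which predicate(element) is true.
function findAll(node, predicate, found = []) {
  for (const child of node.children) {
    if (typeof child === "string") continue;
    if (predicate(child)) found.push(child);
    findAll(child, predicate, found);
  }
  return found;
}

const byId = (id) => (el) => el.attrs.id === id;
const byClass = (name) => (el) => (el.attrs.class || "").split(/\s+/).includes(name);
const byAttr = (key, value) => (el) => el.attrs[key] === value;

// Hypothetical markup for illustration only.
const sample = `<div class="order">
  <span id="ref">A-1001</span>
  <p class="line item">Widget</p>
  <p class="line item">Gadget</p>
  <a data-role="invoice" href="https://example.com/invoice/1">View</a>
</div>`;

const doc = parseHtml(sample);
const ref = textOf(findAll(doc, byId("ref"))[0]).trim();
const items = findAll(doc, byClass("item")).map((el) => textOf(el).trim());
const invoiceUrl = findAll(doc, byAttr("data-role", "invoice"))[0].attrs.href;
```

Even this small version shows why a maintained DOM library or a built-in utility would be preferable: nesting, void elements, and attribute quoting all need explicit handling before any selection can happen.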

It could have multiple use cases: parsing HTML/XML held as a string value (e.g. a response from another step), parsing the response to a GET request to a website, parsing an HTML email template, and no doubt others.

It would be good for Zapier to provide DOM parsing tools for these types of scenarios. It is certainly a more advanced use case, but better than regex rules.

I created this discussion to share a potential scenario and use case. Feel free to chime in if you have also come across the need for DOM parsing.


3 replies

Userlevel 7
Badge +11

Thanks for reaching out about this, @jamesmacwhite 👋

That’s a really great idea! I couldn’t see any existing feature requests for this so I’d recommend reaching out to our Support team to get one officially added. You can do that using the form at the bottom of the page here: https://zapier.com/app/get-help (select the Report a bug or request a feature option).

This will help to get it on the team’s radar and also allow us to start tracking customer interest in this. Then once it’s been created, if any folks chime in on the thread here we can get their vote added to the feature request to help increase its priority! 🙂

Userlevel 3
Badge

Sure thing, I submitted a feature request. On reflection, there are perhaps two angles to this.

The first is DOM parsing utilities or actions for more advanced HTML/XML manipulation, i.e. you can specify DOM selectors. The second is the ability to install additional libraries in either a Node.js (JavaScript) or Python Code by Zapier step. There are plenty of DOM parsing libraries available for both, but the lack of external libraries in a Code by Zapier step is the limitation currently, so maybe that’s the alternative.

I’m not familiar with how Code by Zapier works behind the scenes in great detail, but the code written is likely offloaded to a worker/container which has either the Node.js or Python interpreter available. One option would be to allow this container to have additional libraries present or installed at runtime, then possibly cached in a reusable image for the step until refreshed, rather than pulling the libraries each time. We are venturing into advanced use cases here, but there’s probably more chance of expanding on the Code by Zapier step itself than of a whole new utility.

Userlevel 7
Badge +11

Thanks for submitting that feature request, @jamesmacwhite

Hmm, I’ve just had another look into the existing feature requests for Code by Zapier and, while there aren’t any specifically around the addition of DOM parsing tools, there is an existing feature request for the ability to use external libraries! That could potentially allow access to the desired DOM parsing libraries, so I’ve gone ahead and added your vote to it. Apologies for not picking up on that sooner!

I can’t make any promises or give timelines around when or if that functionality would definitely be added, but we’ll be sure to notify you by email if it is. 🙂

In the meantime, happy Zapping! ⚡