Issue Using Regex to Isolate Strings

  • 2 October 2020

I have a Zap with four consecutive similar steps; two of those steps work and two of them don’t, and I can’t figure out why.

 

This Zap begins by taking an HTML email and converting it to Markdown. The relevant section of that email is as follows (proprietary information changed):

 

Customer: | **ACME Corporation**
Representative:| **John Smith**
Service Location:| **New York City**
Agreement Number:| **G123456**

 

The Customer line has a space between : and |, while the next three lines do not; I account for this in the expressions.

 

Using the Formatter Extract Pattern tool, I set up the following 4 steps, each using the output of the HTML to Markdown step as its input:

Step 5, meant to extract the Customer name:

Customer\: \| \*\*(.*?)\*\*

This works as expected, and returns:

output:
0: ACME Corporation
_end: 326
_matched: true
_start: 271

Step 6, meant to extract the Representative name:

Representative\:\| \*\*(.*?)\*\*

This returns:

output:
_matched: false

If I change the input on this step from the whole Markdown output of the email, to just the single Representative line, it returns John Smith as expected.

Step 7, meant to extract the Service Location name:

Service Location\:\| \*\*(.*?)\*\*

This behaves exactly like Step 6: it returns false unless I input only the single relevant line of the email.

Step 8, meant to extract the Agreement Number without the leading letter:

\d\d\d\d\d\d

This works as expected, and returns:

output: 
0: 123456
_end: 443
_matched: true
_start: 437

EDIT: I just realized if I change the expression here to:

Agreement Number\:\| \*\*G(.*?)\*\*

this step also returns false. It seems like it doesn’t like the lack of the space between : and |, I guess? I can’t imagine why that would matter.

================================

I’ve tested all four expressions using regex101.com and using the full Markdown email as test string, and all four returned the correct string.
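For anyone who wants to reproduce that check outside regex101, here is a minimal Python sketch. It assumes the input is exactly the four lines as pasted above (one space between | and **), which is only what the Zap editor displays; the `sample`, `patterns`, and `results` names are made up for illustration.

```python
import re

# Sample text exactly as pasted in this post (single space after "|").
sample = (
    "Customer: | **ACME Corporation**\n"
    "Representative:| **John Smith**\n"
    "Service Location:| **New York City**\n"
    "Agreement Number:| **G123456**\n"
)

# The four expressions from Steps 5-8, written as raw strings.
patterns = {
    "Customer": r"Customer: \| \*\*(.*?)\*\*",
    "Representative": r"Representative:\| \*\*(.*?)\*\*",
    "Service Location": r"Service Location:\| \*\*(.*?)\*\*",
    "Agreement Number": r"Agreement Number:\| \*\*G(.*?)\*\*",
}

results = {}
for name, pattern in patterns.items():
    match = re.search(pattern, sample)
    # Store the captured text, or None when the pattern does not match.
    results[name] = match.group(1) if match else None

print(results)
```

All four patterns match this pasted text, which suggests the expressions themselves are fine and that any difference lies in the input the Zap step actually receives.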

In case there was some kind of order issue, I tried adding an additional step before the functional step 5, and recreated the Representative step there, and it still returns false.

While I didn’t paste the entire email above due to too much proprietary information, I can say that there is no other place in the email that contains either “Representative” or “Service Location”. Even if there were, that should produce multiple matches rather than no match at all.

I can’t imagine there’s some other content in the email causing the failure because steps 5 and 8 do work, and use the same input.

Is there something else happening here that I’m not seeing?

Try Parser by Zapier: https://parser.zapier.com

Or Mailparser: https://mailparser.io

Both will parse emails for you in a more efficient way than trying to use the Formatter.


While I appreciate your reply, I’m not looking for “use X instead” here. I’m looking for reasons I may have missed that explain why the extract pattern tool isn’t working as expected. I have used Zapier’s mail parser and it’s good, but not quite as accurate as I need this to be. 


Using a service such as Mailparser will allow you to do more advanced email parsing.

Plus, it will save on the number of Steps in the Zaps, and thus save the number of Tasks, which depending on your volume can save on your Zapier monthly paid plan.

NOTE: Mailparser is a paid service.


[advanced] Could also use a Code step using custom JavaScript or Python to do the regex extraction in 1 Step instead of 4: https://zapier.com/apps/code/help


Using Mailparser is a non-starter. Zapier is also a paid service - one that I am paying for currently, and should not require an additional paid service to complete this task. I would prefer that Zapier properly escalate the ticket I opened with them, but their answer was, “RegEx can be a bit tricky to get set up so we can only provide limited support for debugging here. If that expression didn't work, I recommend reaching out in our Community Forum or on Stack Overflow for additional help,” etc. So here I am.

 

Using Code as follows:

If the input is set to the HTML-to-Markdown step output, the Customer line returns the expected result, and the other three lines return None.

If the input is a text copy/paste of the HTML-to-Markdown step output, all four lines return the expected result.

Python used:

import re
email = input_data.get('email', '')

def first_group(pattern):
    # Return capture group 1, or None when the pattern does not match.
    match = re.search(pattern, email)
    return match.group(1) if match else None

output = {
    'Customer': first_group(r'Customer: \| \*\*(.*?)\*\*'),
    'Representative': first_group(r'Representative:\| \*\*(.*?)\*\*'),
    'Service Location': first_group(r'Service Location:\| \*\*(.*?)\*\*'),
    'Agreement Number': first_group(r'Agreement Number:\| \*\*\w(.*?)\*\*'),
}


I received a reply to my original support ticket that shed some light on the problem and offered a workaround.

 

The support tech was able to run a test and send me a more in-depth input/output log for a test step, which revealed that the input has additional spaces that are not visible in the Zap building interface.

 

The input I need to use here is the output of an HTML-to-Markdown conversion step; copying that output into Notepad shows a single space between | and ** in the three lines that fail to match. (This is the same data I pasted into my OP here as well.) However, in the input/output log the support tech sent me, there are clearly two spaces in those three places (but not in the Customer line, which was matching successfully):

 

\nCustomer: | **ACME Corporation**  \nRepresentative:|  **John Smith**  \nService Location:|  **New York City**  \nAgreement Number:|  **G123456**  \n
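One way to surface this kind of hidden whitespace without waiting on a support log is to print the text with `repr()` and list the code points that sit between | and **. The sketch below is only an illustration: the `visible` and `logged` strings are retyped from this thread, `describe_gap` is a hypothetical helper, and it assumes the extra character is an ordinary space (the log does not say whether it might be something else, such as a non-breaking space).

```python
import re

# Text as it appears when copied from the Zap editor (one space after "|").
visible = "Representative:| **John Smith**"
# Text as it appears in the support log (two spaces after "|").
logged = "Representative:|  **John Smith**  "

def describe_gap(text):
    """Return the code points of the whitespace between '|' and '**'."""
    gap = re.search(r"\|(\s*)\*\*", text).group(1)
    return [f"U+{ord(ch):04X}" for ch in gap]

# repr() shows quotes and escapes, so trailing or doubled spaces are visible.
print(repr(logged))
print(describe_gap(visible))
print(describe_gap(logged))
```

In a Code step, the same `repr()` call could be applied to the real step input and returned in `output`, which would show what the step actually receives.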

 

Adjusting the regex in the failing steps to include the extra space returns successful matches for each of those lines.
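Rather than hard-coding one or two spaces, a pattern can allow any amount of whitespace around the | with `\s*`. The following is a rough sketch in Python, assuming the input looks like the log excerpt above; `extract` and its `prefix` parameter are hypothetical names, not part of Zapier. The same `\s*` idea should carry over to the Extract Pattern field (for example `Representative:\s*\|\s*\*\*(.*?)\*\*`), though that is untested here.

```python
import re

# Input as shown in the support log, with the "\n" sequences as real newlines.
logged = (
    "\nCustomer: | **ACME Corporation**  "
    "\nRepresentative:|  **John Smith**  "
    "\nService Location:|  **New York City**  "
    "\nAgreement Number:|  **G123456**  \n"
)

def extract(label, text, prefix=""):
    """Capture the bold value after 'label:', allowing any whitespace around '|'.

    An optional literal prefix (such as the leading 'G') is skipped.
    """
    pattern = (
        re.escape(label) + r":\s*\|\s*\*\*" + re.escape(prefix) + r"(.*?)\*\*"
    )
    match = re.search(pattern, text)
    return match.group(1) if match else None

fields = {
    "Customer": extract("Customer", logged),
    "Representative": extract("Representative", logged),
    "Service Location": extract("Service Location", logged),
    "Agreement Number": extract("Agreement Number", logged, prefix="G"),
}
print(fields)
```

Because `\s*` matches zero or more whitespace characters, the same pattern handles both the Customer line (space before |, one space after) and the other three lines (no space before, two after).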

 

I replied to my support ticket again asking for this to be reviewed as a bug; hopefully Zapier will look into it on their end. It’s hard to write a successful regex to account for input characters I can’t see, and I would like to use this Zap functionality more in the future.