This discussion was created from comments split from: Any Apify / Parsehub / Import.io experts here?.

Question: Does anyone know of any other smart ways to scrape data from websites? I am currently scraping data from our own website and turning that data into personalized, automated emails and newsletters. These, however, are powered not by import.io (which I'm testing right now) but by an RSS feed generator, which continually generates RSS content from the website that is then processed by the newsletter or push-mail system. But I find that workflow (even though it works perfectly) a bit lame, to be honest.



Interesting. Do you mind me asking what platform your website is running on, @davidweiss? Am I understanding correctly that the RSS Feed functionality is not native to the site?



@TheDavidJohnson Hi, indeed, the RSS feed functionality is not native. That's the only reason why I am using an external RSS feed generator. The sole reason for this RSS workaround is that the website runs on a proprietary system built by a small developer who charges much more for adding RSS or API functionality than we pay for an external RSS feed generator. This whole thing is a workaround to avoid a) a long time to deployment, b) high costs, and c) headaches from discussing specifications with the external agency. We're based in a country with very high salary levels and were stupid enough to get locked in by this domestic developer... 😱



What is the actual website, @davidweiss? I ask because the website may be pretty standard when it comes to data, in which case you can easily use JavaScript to get everything you need. I just set up a pretty sweet Zap for a client where I used Code by Zapier to fetch the HTML and then parsed out the data using code and Formatter. I found that when I used Import.io, Apify, etc. I was being blocked, but not when I used the following code:

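// Code by Zapier (JavaScript): fetch the page and return its raw HTML
var url = inputData.url

fetch(url)
  .then(res => res.text())
  .then(body => {
    // Hand the URL and the page source to the next step in the Zap
    var output = { url, rawHTML: body }
    callback(null, output)
  })
  .catch(callback)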


It brought back all of the info I needed. The fun part is then parsing it out with code or formatter.
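For anyone wondering what the "parsing it out with code" part might look like, here is a minimal sketch. It assumes the goal is to pull link targets and link text out of the raw HTML. The `extractLinks` helper and the sample markup are made up for illustration and are not from the Zap described above. In a Code step you would presumably pass in the `rawHTML` value from the previous step instead of `sampleHTML`.

```javascript
// Hypothetical helper: pull href and link text out of raw HTML with a regex.
// Assumes simple, well-formed <a href="..."> tags; a regex is not a full HTML parser.
function extractLinks(rawHTML) {
  const pattern = /<a\s[^>]*href="([^"]*)"[^>]*>([\s\S]*?)<\/a>/gi;
  const links = [];
  for (const match of rawHTML.matchAll(pattern)) {
    links.push({
      href: match[1],
      // Strip any nested tags and collapse whitespace in the link text.
      text: match[2].replace(/<[^>]+>/g, '').replace(/\s+/g, ' ').trim(),
    });
  }
  return links;
}

// Made-up markup for illustration (not taken from any real site).
const sampleHTML = `
  <ul>
    <li><a class="job" href="/jobs/1.html">Project <b>Manager</b></a></li>
    <li><a class="job" href="/jobs/2.html">Data Analyst</a></li>
  </ul>`;

const links = extractLinks(sampleHTML);
console.log(links);
```

A pattern like this tends to break as soon as the site changes its markup, so it is worth keeping it as narrow as possible and checking the output from time to time.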


There are a few things to keep in mind:

  1. If you're running the Zap constantly, you will get temporarily blocked from the site. This will also happen with Apify or Import.io.
  2. It works best on a plan with Paths.
  3. It's tedious, but it works really well!
  4. You'll need to know regex, but if you don't, feel free to PM me. I love this kind of stuff.



Hey @Chelsey, thanks a bunch... that looks more fun than the RSS workaround, which is not only tedious but also very finicky, and costs $$ because the RSS generator is a paid service.


Will have a look. I don't "know" regex (though I have successfully tinkered with it to get custom Google Analytics reporting), but there are also many other things I don't "know" that I can make work nonetheless ☺️

Will take you up on your offer when I get around to re-engineering this particular workflow.


PS: I'm already actively using Paths.



As mentioned above, without knowing the website it is hard to give an answer. That said, I can highly recommend using Google Sheets to scrape data. The built-in IMPORTXML function does just this, using XPath queries (it is easier than it sounds). Then, of course, having the data in a Google Sheet gives you a lot of freedom in how to manipulate it with Zaps etc.
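To give a rough idea of the shape of such a formula: IMPORTXML takes a URL and an XPath query. The sketch below is just a small JavaScript helper that builds the formula text, which could be handy when the same query is needed for many pages. The `importXmlFormula` name, the placeholder URL and the XPath are illustrative assumptions, and it assumes a Sheets locale that uses "," as the argument separator.

```javascript
// Hypothetical helper: build a Google Sheets IMPORTXML formula as text.
// Assumes a locale where "," separates function arguments.
function importXmlFormula(url, xpath) {
  // Inside a Sheets string literal, a double quote is escaped by doubling it.
  const quote = (s) => '"' + s.replace(/"/g, '""') + '"';
  return `=IMPORTXML(${quote(url)}, ${quote(xpath)})`;
}

// Example: all link targets on a page (placeholder URL, illustrative XPath).
const formula = importXmlFormula('https://example.com/jobs', '//a/@href');
console.log(formula);
// prints: =IMPORTXML("https://example.com/jobs", "//a/@href")
```

Pasting the resulting text into a cell should make Sheets fetch the page and fill the cells below with the matched values.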



@cwinhall - Sounds interesting, I want to check this out more. Are there any good tutorials you can recommend as a jumping-off point?



@Andrew_Luhhu There are a few good ones out there. This one is quite extensive.

https://www.distilled.net/blog/distilled/guide-to-google-docs-importxml/

or

https://rozhon.com/sheets-for-marketers/web-scraping-using-importxml-in-google-spreadsheets/



FYI, the website in question from which I generate RSS Feeds to then further process is:  https://www.oebu.ch/de/jobs-191.html  (among others...).



@davidweiss here is a quick example of importXML being used to get data from that URL:

importXML example

I used a Google Sheets add-on called ImportFromWeb in this example, but the theory and functionality are pretty much the same.



Great stuff - thanks, @cwinhall!



Thanks @cwinhall ! I shall rework my workflow!



@cwinhall Thanks again for this tip. I have now reworked the workflow to scrape data via IMPORTXML... regex gave me a headache for half an hour, but now the extraction works like a charm. It is a bit slow since I'm now scraping hundreds of pages at once.



@davidweiss I get a headache just reading about you trying to work with regex for half an hour! 😁



Yes, this happens on my account too. After some time, though, it was resolved automatically with help from Zapier. You can see the solution on this page.


Looks like this has been solved. However, a smart and easy way to scrape data into Zapier is via Simplescraper.

The scraped data is sent via webhook to Zapier, so no code is required. There's a quick guide here: https://simplescraper.io/docs/scraping-data-into-zapier/.

