Workaround: How to extract Page Content from Notion

  • 14 June 2024

Hi friends! 👋 As you might already be aware, the Notion app on Zapier can currently only access page properties, not page content. So if you’ve been wanting to access the page content for a Notion database item or page in a Zap, give the following workaround I’ve discovered a try!

What’s the workaround?

I did some digging into the API documentation for Notion recently and discovered that there’s a Retrieve block children endpoint which will return a list of all the child block objects. That’s nice, but what’s a “block object” got to do with page content I hear you say? Well, a block object represents a chunk of content in Notion like headings, paragraphs, bulleted list items, embedded images etc. - you know, things that are contained in the page content. 

So I thought, perhaps we could use that to pull the page content for a page? Well, I did some testing and yes, it turns out it is possible to use that endpoint to retrieve all of the block objects within the content section of a database item or page in Notion! 😁 Keep reading to find out how you can do the same...

💡 Please note: this workaround involves the use of an API Request (Notion) action and a Run Python (Code by Zapier) action. I’ll provide the code for you to copy and paste, but those two actions will use an additional 2 tasks per Zap run. Not ideal, I know, but I wanted to share this workaround in case anyone is interested in using it anyway!

Set up the API Request action

It’s relatively quick to set this up. All you need to do for the request is:

  1. Add a Notion action to your Zap and select the API Request (Beta) action event.
  2. Select your account, then in the Action section set the HTTP Method field to GET.
  3. Enter the following Retrieve block children endpoint in the URL field: https://api.notion.com/v1/blocks/IDGOESHERE/children
  4. In that URL, replace IDGOESHERE with the field containing the ID for the database item/page. I used Notion’s Updated Database Item as my Zap’s trigger, so the ID was output in the Page Id field from the trigger. It also works if you supply the ID for a page rather than a database item.
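
If it helps to see it as code, the request configured in the steps above boils down to a single GET call. Here's a minimal Python sketch of how the URL is put together. The function names are made up for illustration, and it assumes the API Request action takes care of authentication through your connected Notion account, so no headers are shown.

```python
from urllib.parse import quote

NOTION_API_BASE = "https://api.notion.com/v1"


def block_children_url(page_id: str) -> str:
    """Build the Retrieve block children URL for a page or database item ID."""
    # Strip stray whitespace and escape anything that is not URL-safe
    clean_id = quote(page_id.strip(), safe="")
    return f"{NOTION_API_BASE}/blocks/{clean_id}/children"


def describe_request(page_id: str) -> dict:
    """Mirror the two fields set in the API Request action (hypothetical helper)."""
    return {"method": "GET", "url": block_children_url(page_id)}


request = describe_request("IDGOESHERE")
print(request["method"], request["url"])
```

In the Zap itself you don't need any of this. Mapping the Page Id field into the URL does the same job.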

The setup should look like this:
[Screenshot: the API Request action configured with the GET method and the block children URL]

Next, test the API Request action to make sure it retrieves the content from the necessary item/page:
[Screenshot: test result of the API Request action showing the returned block objects]
Yay! It loaded the page content blocks. But hang on a minute, it didn’t just return a set of nice tidy line items containing only the contents of each block. As you can see in the screenshot above, it returned a whole bunch of unwanted information like the type of block, whether there are any annotations, formatting etc.
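
To give a feel for why the raw response needs parsing, here's a small sketch with a trimmed-down paragraph block. The sample data is made up for illustration and is not a real API response. It also assumes the text segments sit under a "rich_text" key, falling back to "text", which older Notion API versions use.

```python
# Illustrative, trimmed-down paragraph block (not a real API response)
sample_block = {
    "object": "block",
    "type": "paragraph",
    "paragraph": {
        "rich_text": [
            {"type": "text", "plain_text": "Hello ", "annotations": {"bold": False}},
            {"type": "text", "plain_text": "world", "annotations": {"bold": True}},
        ]
    },
}


def plain_text_of(block: dict) -> str:
    """Return only the readable text of a block, ignoring formatting details."""
    payload = block.get(block.get("type")) or {}
    # Assumption: newer API versions use "rich_text", older ones use "text"
    segments = payload.get("rich_text") or payload.get("text", [])
    return "".join(seg.get("plain_text", "") for seg in segments)


print(plain_text_of(sample_block))  # Hello world
```

The only part we actually want from each block is its plain_text. Everything else is formatting metadata.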

So here is where I turned to my good friend AI to help me generate some code to help parse out the information I wanted. If like me, you don’t have the time to be writing Python code from scratch to work some magic in your Zap you can always use our Code with AI feature to do the hard work for you! ✨😎✨ To save even more time, you can copy and paste the code I’ll share in the next section...

Set up the Run Python action

  1. First, add a Code by Zapier action and select the Run Python action event:
    [Screenshot: Code by Zapier action with the Run Python action event selected]
  2. In the Input Data section for that Code action you’ll want to:
    - Type in a Key name of blockObjects on the left.
    - And on the right, select the Response body field from the API Request like so:
    [Screenshot: Input Data with the blockObjects key mapped to the Response body field]
  3. Next, in the Code field you’ll want to copy and paste the following code:
    import json

    def parse_notion_blocks(notion_api_response):
        # The response body may arrive as a JSON string, so convert it to a dictionary
        if isinstance(notion_api_response, str):
            try:
                notion_api_response = json.loads(notion_api_response)
            except ValueError:
                return {"error": "Invalid API response format"}

        # Check if the response has 'results'
        if not isinstance(notion_api_response, dict) or 'results' not in notion_api_response:
            return {"error": "Invalid API response format"}

        blocks = notion_api_response.get("results", [])
        parsed_blocks = []

        text_buffer = []  # adjacent paragraphs waiting to be grouped
        list_buffer = []  # adjacent list items waiting to be grouped
        list_type = None

        def flush_buffers():
            # Write buffered paragraphs, then the buffered list, to the output
            if text_buffer:
                parsed_blocks.append("\n".join(text_buffer))
                text_buffer.clear()
            if list_buffer:
                if list_type == "numbered_list_item":
                    formatted_list = "\n".join(f"{i+1}. {item}" for i, item in enumerate(list_buffer))
                else:
                    formatted_list = "\n".join(f"- {item}" for item in list_buffer)
                parsed_blocks.append(formatted_list)
                list_buffer.clear()

        def extract_text(block_data):
            # Newer Notion API versions call this property "rich_text", older ones "text"
            text_objects = block_data.get("rich_text") or block_data.get("text", [])
            # Each plain_text segment keeps its own spacing, so join without a separator
            return "".join(text_obj.get("plain_text", "") for text_obj in text_objects if isinstance(text_obj, dict))

        for block in blocks:
            if not isinstance(block, dict):
                continue

            block_type = block.get("type")
            block_data = block.get(block_type) or {}
            content = ""

            if block_type == "paragraph":
                content = extract_text(block_data)
                if content:
                    if list_buffer:
                        flush_buffers()
                    text_buffer.append(content)
            elif block_type == "image":
                if "url" in block_data.get("file", {}):
                    content = block_data["file"]["url"]
                elif "url" in block_data.get("external", {}):
                    content = block_data["external"]["url"]
                if content:
                    flush_buffers()
                    parsed_blocks.append(content)
            elif block_type in ["bulleted_list_item", "numbered_list_item"]:
                content = extract_text(block_data)
                if content:
                    if list_type is None:
                        list_type = block_type
                    if block_type != list_type:
                        # The list style changed, so close the previous list first
                        flush_buffers()
                        list_type = block_type
                    list_buffer.append(content)
            elif block_type in ["heading_1", "heading_2", "heading_3"]:
                content = extract_text(block_data)
                if content:
                    flush_buffers()
                    parsed_blocks.append(content)
            else:
                # Unsupported block type: skip it, but end any open group
                flush_buffers()

        flush_buffers()

        return parsed_blocks

    # Read the Response body mapped to the blockObjects key in Input Data
    notion_api_response = input_data.get("blockObjects", "")

    # Parse the Notion blocks
    parsed_blocks = parse_notion_blocks(notion_api_response)

    # Output the parsed blocks
    output = {
        "parsed_blocks": parsed_blocks
    }

Now test the Run Python action step to check the code outputs all of the blocks correctly:
[Screenshot: Run Python test result showing the parsed blocks as line items, with the bulleted list items grouped in a single field]
Sweet! Now we have a set of lovely line items that we can use in subsequent actions! 🎉

A note on lists and paragraphs: In the above screenshot you’ll see that the bulleted list items are grouped together in a single field. The API Request action returns each bulleted and numbered list item as a separate block object, but the Python code groups adjacent list items together and outputs them as a single line item. Similarly, adjacent paragraph blocks, which are also returned as separate block objects, are grouped into a single line item.
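
If you'd like to see the grouping idea in isolation, here's a simplified sketch. It isn't the code from the step above. It uses itertools.groupby on made-up (block type, text) pairs instead of buffers, and it doesn't cover every edge case, but it shows how adjacent blocks of the same type end up in one line item.

```python
from itertools import groupby

# Hypothetical (block type, text) pairs in page order
items = [
    ("paragraph", "Intro line one"),
    ("paragraph", "Intro line two"),
    ("bulleted_list_item", "Apples"),
    ("bulleted_list_item", "Pears"),
    ("heading_1", "Next section"),
]


def group_adjacent(pairs):
    """Merge runs of adjacent paragraphs or list items into single strings."""
    grouped = []
    for block_type, run in groupby(pairs, key=lambda pair: pair[0]):
        texts = [text for _, text in run]
        if block_type == "bulleted_list_item":
            grouped.append("\n".join(f"- {t}" for t in texts))
        elif block_type == "numbered_list_item":
            grouped.append("\n".join(f"{i}. {t}" for i, t in enumerate(texts, 1)))
        elif block_type == "paragraph":
            grouped.append("\n".join(texts))
        else:
            grouped.extend(texts)  # headings and the like stay separate
    return grouped


line_items = group_adjacent(items)
print(line_items)
```

Five blocks go in and three line items come out: the two paragraphs together, the two bullets together, and the heading on its own.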

From here you can select those line items in a subsequent action, assuming it supports line items. If not, you could use a Line-item to Text (Formatter) action to convert the line items into a text field that is supported. Depending on your use case you might even want to use a Create Loop from Line Items (Looping by Zapier) action to run actions for each individual line item. What you do with the extracted page content now is up to you!
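
If all you need is one text field, another option is to join the line items inside the Code step itself. This is just a sketch: blocks_to_text and the page_text key are names I've made up, and the sample list stands in for whatever the parser returns.

```python
def blocks_to_text(parsed_blocks, separator="\n\n"):
    """Join parsed line items into one string (hypothetical helper)."""
    if isinstance(parsed_blocks, dict):
        # The parser returns a dictionary when the response was invalid
        return ""
    return separator.join(str(item) for item in parsed_blocks)


# Stand-in for the list produced by the parsing code
parsed_blocks = ["Intro paragraph", "- Apples\n- Pears", "Closing heading"]

output = {
    "parsed_blocks": parsed_blocks,
    "page_text": blocks_to_text(parsed_blocks),
}
print(output["page_text"])
```

That way the step outputs both the line items and a single combined text field, so you can pick whichever suits the next action.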

Will this handle every kind of content type?

I’ve not tested this workaround with all the possible content types so it might not work for every use case. With images for example it was only able to extract the URLs for embedded images and not images/files that have been uploaded.

But that’s not to say that you can’t use this workaround to extract other types of content blocks in Notion - you just might need to tweak the Python code that’s been shared above to get it working for other types of content as you’d like.
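
As an example of the kind of tweak I mean, here's a rough sketch for two block types the shared code skips: to-do items and quotes. I haven't run this in a Zap. The function name is made up, and it assumes a to_do block carries a checked flag alongside its text.

```python
def extract_extra_block(block: dict) -> str:
    """Return text for to_do and quote blocks, or "" for anything else."""
    block_type = block.get("type")
    payload = block.get(block_type) or {}
    # Assumption: text segments live under "rich_text" (or "text" in older versions)
    segments = payload.get("rich_text") or payload.get("text", [])
    text = "".join(seg.get("plain_text", "") for seg in segments if isinstance(seg, dict))
    if not text:
        return ""
    if block_type == "to_do":
        box = "[x]" if payload.get("checked") else "[ ]"
        return f"{box} {text}"
    if block_type == "quote":
        return f"> {text}"
    return ""


# Made-up sample blocks for illustration
todo = {"type": "to_do", "to_do": {"checked": True, "rich_text": [{"plain_text": "Ship it"}]}}
quote_block = {"type": "quote", "quote": {"rich_text": [{"plain_text": "Stay curious"}]}}
print(extract_extra_block(todo))
print(extract_extra_block(quote_block))
```

You could call something like this from the final else branch of the main loop and append the result whenever it isn't empty.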

And to make this even better you could also get the Code action to automatically apply certain HTML or markdown formatting to headers, and group them with adjacent paragraph content blocks. If you’re not familiar with coding in Python I’d definitely recommend giving our Code with AI feature a try to improve upon the example code!
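
Here's a minimal sketch of that idea, assuming markdown output is what you're after. The names are invented for illustration, and it works on simple (block type, text) pairs rather than the raw API response.

```python
# Markdown prefix for each Notion heading level
HEADING_PREFIXES = {"heading_1": "# ", "heading_2": "## ", "heading_3": "### "}


def to_markdown_sections(pairs):
    """Group each heading with the content blocks that follow it."""
    sections = []
    for block_type, text in pairs:
        line = HEADING_PREFIXES.get(block_type, "") + text
        if block_type in HEADING_PREFIXES or not sections:
            sections.append(line)  # a heading starts a new section
        else:
            sections[-1] += "\n" + line  # content joins the section above it
    return sections


# Made-up sample content in page order
pairs = [
    ("heading_1", "Plan"),
    ("paragraph", "Step one"),
    ("heading_2", "Notes"),
    ("paragraph", "Remember milk"),
]
sections = to_markdown_sections(pairs)
print(sections)
```

Each line item then holds a heading plus the content beneath it, which can be handy if the next action expects one field per section.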

Well that’s it from me. If you find this workaround useful please let me know in the comments below. And if you have any suggestions or requests for other workarounds you’d like me to look into let me know, I’d love to hear from you! 

