Question

Overcoming the Token Limit: A Challenge with Cyrillic Text Transfer from Google Docs to OpenAI

  • 24 June 2023
  • 2 replies
  • 406 views

Hi guys!

I'm trying to make a Zap to transfer text from a Google Doc into OpenAI for subsequent summary preparation. The challenge is that large texts cannot be directly transferred to OpenAI due to a token limit of 4097. Thus, I plan to divide the large text into smaller segments, send each to OpenAI, and then synthesize the results into a single output.
 

This pipeline is well-described in this video (https://youtu.be/mdqr85ZtKYs?t=556). I used it as a guide when creating my Zap, which functions perfectly on the reference model.

The process follows these steps:

  1. Identify a new document in a Google Doc folder.

     
  2. Clear temporary storage with Zapier Storage at each Zap launch.

     
  3. Determine the number of words in the source document with Zapier Formatter. In my case, the document contains 7518 Cyrillic words.

     
  4. Cut the large text into pieces of 2500 words with Zapier Code.

     
  5. Loop the subsequent algorithm for all segments with Zapier Looping.

     
  6. Send the prompt to OpenAI.

     

However, I'm encountering an issue.


If I use a Cyrillic text, Zapier reports an error, indicating that I requested more tokens than the model's maximum context length of 4097. Adjusting the number of words in each text segment doesn't affect the total tokens requested. Even altering the volume of the source document (though it deviates from the task conditions) hasn't resolved the issue.

Interestingly, when I use an English text as input, there are no issues, and the action works fine.

I can't figure out where to look for the error or what I should do to get past this step with a Cyrillic document.

Any help or tip would be much appreciated.


2 replies

Userlevel 7
Badge +14

Hi @Alex Evdokimov 

Good question.

Have you tried using this Zap action?

Formatter > Text > Split Text into Chunks for AI Prompt

Userlevel 1

Hi Alex

 

I have a similar issue as yours, do you have any workarounds for it?

 

By the way, you may find this information useful:

“How words are split into tokens is also language-dependent. For example ‘Cómo estás’ (‘How are you’ in Spanish) contains 5 tokens (for 10 chars). The higher token-to-char ratio can make it more expensive to implement the API for languages other than English.”

(Extracted from https://help.openai.com/en/articles/4936856-what-are-tokens-and-how-to-count-them)

Also, you may want to play around with the Tokenizer tool.

 

 

Reply