
I am trying to get the contents of my files. Unfortunately they come back as bytes, and since I don’t want to store them as files, I am trying to convert them into strings. Even though I tried many encodings, none of them works.



I also tried the text representation, which solves my problem, but per the Box documentation it only applies to documents up to 500 MB.



How can I get the contents of the files even if they are larger than 500 MB?

Hi @MBenny , welcome to the forum.



Can you elaborate on the types of files you’re interested in? I’m assuming they’re not pure text files, or you would just decode the binary to text…



PDF, Docx, something else?



Perhaps we can find some sort of text extractor that can be plugged in between the API and whatever you’re sending the text representations.
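As a rough illustration of that idea, here is a minimal sketch of an in-memory extractor that works on the downloaded bytes without writing anything to disk. It relies only on the fact that a .docx file is a ZIP archive whose body text lives in `word/document.xml`. The function names `docx_bytes_to_text` and `bytes_to_text` are hypothetical, how the bytes are fetched from Box is not shown, and PDFs and the legacy .doc, .xls and .ppt formats would need a dedicated parsing library that is not covered here.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the elements inside word/document.xml
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_bytes_to_text(data: bytes) -> str:
    """Extract paragraph text from .docx content held in memory (hypothetical helper)."""
    # A .docx file is a ZIP archive, so it can be opened from a BytesIO buffer.
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        xml_bytes = archive.read("word/document.xml")

    root = ET.fromstring(xml_bytes)
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # Each paragraph is split into runs; the visible text sits in <w:t> nodes.
        runs = [node.text or "" for node in para.iter(W_NS + "t")]
        paragraphs.append("".join(runs))
    return "\n".join(paragraphs)


def bytes_to_text(data: bytes, filename: str) -> str:
    """Dispatch on the file extension (hypothetical helper, only two formats handled)."""
    name = filename.lower()
    if name.endswith(".docx"):
        return docx_bytes_to_text(data)
    if name.endswith(".txt"):
        # Assumes UTF-8; undecodable bytes are replaced instead of raising.
        return data.decode("utf-8", errors="replace")
    raise ValueError(f"No in-memory extractor configured for {filename!r}")
```

This also explains why trying different encodings on the raw bytes fails for office files: the bytes are a compressed container, not encoded text, so they have to be parsed rather than decoded.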



It would also be interesting to understand a bit more of the use case. What do you do with the text? Do you store it, send it to an LLM, something else?



Cheers


Hi @rbarbosa ,



I’ll be working with office files and PDFs. I’ll store the data and then run a search over it, which will be integrated with my AI model.



The office files will generally be doc, Excel, ppt, and text.


Hi @MBenny ,



I’ve been discussing this internally, and the 500 MB limit applies to extracting the text of a file.



However, folks on my side seem quite open to revisiting this topic, especially from an AI perspective, where we can have big files to process that don’t necessarily yield an equally big text version to send to an LLM.



Having said that, and assuming you are familiar with how product managers work, I would kindly ask you to submit this as an idea on our Box Pulse.



This type of tool is how our PMs track requests and manage the product roadmap. The more requests we have, the better.



Also, if you are or represent a customer, mention which one too.



Cheers

