Question

How does Box indexing work with OCR'ed documents?

Forum|Forum|6 months ago
May 21, 2025
6 replies
8 views

community-manager
Box Employee

We currently have users OCR their documents before uploading them to Box. Is any additional setup needed to make sure that the Box search API searches the text in images inside PDFs? How does Box indexing work with documents that are already OCR'ed? Do we need to build any custom indexes?

community-manager
Author
Box Employee
May 21, 2025

After you OCR a document, are you adding the recognized text to that document in Box? Or are you storing the recognized text in a different way?

community-manager
Author
Box Employee
May 21, 2025

Hi Murtza,

I am working with Ktatt on this. What we basically want is to understand how Box.com reads an already-OCR'd document, extracts the text, and stores it. We are trying to build an API that can reference what Box.com stores for the document and return it in a search response.

Thanks,

Barry

community-manager
Author
Box Employee
May 21, 2025

Curious if this ever got an answer? I've noticed that Box doesn't seem to index PDFs with an OCR layer, so the text won't allow the document to show up in a search. I get that Box doesn't OCR images, but since we've already done that step, we were hopeful it would index the text layer for searching. Interestingly, you can use the Box preview to highlight and copy text just like in any other document, so it is seeing that there is text.

community-manager
Author
Box Employee
May 21, 2025

You can store the recognized text as metadata on the file. Our search service will index the file metadata, so you can search for a file based on the recognized text.
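
As a rough illustration of that suggestion, here is a minimal Python sketch that builds (but does not send) a request to attach recognized text to a file through the Box metadata endpoint, using the free-form `global/properties` template. The key names (`ocr_text_000`, ...), the `max_chars` chunk size, and the helper names are assumptions made for this example, not Box recommendations; check the metadata documentation for the actual size limits before relying on it.

```python
import json
import urllib.request

BOX_API = "https://api.box.com/2.0"


def chunk_text(text, max_chars):
    """Split text into consecutive pieces of at most max_chars characters."""
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]


def build_metadata_request(file_id, ocr_text, access_token, max_chars=2000):
    """Build a POST request that stores OCR text as free-form file metadata.

    Assumptions: max_chars and the ocr_text_NNN key names are hypothetical
    choices for this sketch; long text is spread over several string values.
    """
    chunks = chunk_text(ocr_text, max_chars)
    body = {f"ocr_text_{n:03d}": chunk for n, chunk in enumerate(chunks)}
    return urllib.request.Request(
        url=f"{BOX_API}/files/{file_id}/metadata/global/properties",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {access_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

To actually send it, pass the returned request to `urllib.request.urlopen` with a valid access token. The request is built separately from sending so the payload can be inspected first.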

community-manager
Author
Box Employee
May 21, 2025

Hi Murtza, I think we are looking for something a bit quicker and more hands-off. Basically, we upload the PDF that has an OCR text layer and, like any other document, the text is indexed for search. We are not looking to use the API or add additional editing steps, just a simple upload, Box indexing, and search from the interface. From the interface, it also looks like the metadata option is going to be a difficult sell, since entering entire lengthy documents is not easy. Hopefully this clarifies things, and maybe you have some other thoughts? Or perhaps this is a future feature? Thanks.

community-manager
Author
Box Employee
May 21, 2025

Hi, I've just OCR'd a PDF document (that was previously a scanned document), uploaded it to Box, and the search function found text within the document. However, it did seem to take a few minutes to index.
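
For anyone who wants to check this from a script, here is a minimal Python sketch that queries the Box search endpoint for a phrase and retries until the uploaded file shows up, since indexing seemed to lag by a few minutes. The attempt count, delay, and helper names are assumptions for this example; the `content_types=file_content` filter restricts matching to document text rather than file names.

```python
import json
import time
import urllib.parse
import urllib.request


def build_search_url(query):
    """Build a Box search URL limited to the text content of PDF files."""
    params = {
        "query": query,
        "type": "file",
        "file_extensions": "pdf",
        "content_types": "file_content",
    }
    return "https://api.box.com/2.0/search?" + urllib.parse.urlencode(params)


def box_search(query, access_token):
    """Run the search and return the list of result entries."""
    req = urllib.request.Request(
        build_search_url(query),
        headers={"Authorization": f"Bearer {access_token}"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp).get("entries", [])


def wait_until_indexed(file_id, query, search, attempts=10,
                       delay_seconds=30.0, sleep=time.sleep):
    """Poll search(query) until file_id appears in the results.

    attempts and delay_seconds are hypothetical defaults, not Box guidance.
    Returns True once the file is found, False if every attempt misses.
    """
    for attempt in range(attempts):
        if any(entry.get("id") == file_id for entry in search(query)):
            return True
        if attempt < attempts - 1:
            sleep(delay_seconds)
    return False
```

A call might look like `wait_until_indexed("12345", "some phrase from the scan", lambda q: box_search(q, token))`, where the file ID, phrase, and `token` are placeholders. Passing the search function in as an argument keeps the polling logic testable without network access.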
