Solved: Vertex Search AI, raw byte or pdf?

edudesouza · 11-27-2023 05:10 AM

Hi there,

I am testing several configurations to use in Vertex Search AI. My question: what are the pros and cons of uploading data as raw bytes versus as a PDF file?

My application queries at least 3 documents, 2 of them with more than 15 pages.

Many thanks in advance

quangtrunghuynh

Hi,
I do have some recommendations:

Uploading in raw bytes:

Pros:

Smaller file size: Raw byte data is typically smaller than the corresponding PDF file, which can save storage space and reduce upload times.
More flexibility: Raw byte data can be processed more flexibly than PDF files, allowing for more granular control over text extraction and analysis.
Potential for better accuracy: Depending on the specific application, raw byte data may lead to better search accuracy, since the text is indexed as you supply it, without a PDF extraction step that could introduce formatting errors.

Cons:

Increased complexity: Handling raw byte data requires more complex processing steps compared to PDF files, which may introduce additional overhead.
Potential for errors: If the raw byte data is not processed correctly, it could lead to errors in text extraction and analysis.

Uploading as a PDF file:

Pros:

Ease of use: PDF files are widely supported and can be easily uploaded and processed by Vertex Search AI.
Preservation of formatting: PDF files preserve the original formatting and layout of the document, which is important for some applications.
Reduced complexity: Processing PDF files is generally simpler than handling raw byte data, as the formatting and layout are already defined.

Cons:

Larger file size: PDF files are typically larger than the corresponding raw byte data, which can increase storage requirements and upload times.
Reduced flexibility: PDF files may not provide the same level of flexibility as raw byte data for text extraction and analysis.
Potential for lower accuracy: Depending on the specific application, PDF files may lead to lower search accuracy due to additional processing steps and potential formatting issues.

In general, uploading data in raw bytes is recommended for applications that require high accuracy and flexibility, while uploading PDF files is recommended for applications that prioritize ease of use and preservation of formatting. For your application, where you need to query at least three documents, two of which have more than 15 pages, uploading in raw bytes may be a better choice, as it can provide more efficient processing and potentially better search accuracy for long documents. However, if the formatting and layout of the documents are crucial for your application, uploading as PDF files may be a more suitable option.
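To make the two options concrete, here is a minimal sketch of how the two kinds of document payload might look. It assumes the JSON shape of a document with a `content` object holding `mimeType` plus either `rawBytes` (base64-encoded) or `uri`; the helper names, IDs, and bucket path are hypothetical, and the field names should be checked against the current API reference before use. Note that raw bytes sent inline are base64-encoded, which adds roughly one third to the size in transit.

```python
import base64
import json


def raw_bytes_document(doc_id: str, text: str) -> dict:
    """Inline document: the text travels base64-encoded inside the request."""
    encoded = base64.b64encode(text.encode("utf-8")).decode("ascii")
    # Field names (mimeType, rawBytes) are assumed; verify against the API reference.
    return {"id": doc_id, "content": {"mimeType": "text/plain", "rawBytes": encoded}}


def pdf_document(doc_id: str, gcs_uri: str) -> dict:
    """Referenced document: the service fetches and parses the PDF itself."""
    # Field names (mimeType, uri) are assumed; verify against the API reference.
    return {"id": doc_id, "content": {"mimeType": "application/pdf", "uri": gcs_uri}}


if __name__ == "__main__":
    # Hypothetical IDs and bucket path, for illustration only.
    inline = raw_bytes_document("doc-1", "Section 1. Text extracted from the source.")
    linked = pdf_document("doc-2", "gs://example-bucket/report.pdf")
    print(json.dumps(inline, indent=2))
    print(json.dumps(linked, indent=2))
```

With the inline form you control exactly what text is indexed; with the PDF form the parsing is left to the service.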
Hopefully this helps.

edudesouza

Thanks, your answer is quite accurate. In my tests, especially regarding time, raw data becomes available for querying within seconds.

In my case, the user can update this raw data when they find some extra information.
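A rough sketch of that update step, assuming the document is kept as a payload with a base64-encoded `content.rawBytes` field and is re-submitted under the same `id` afterwards (both are assumptions on my side, and the helper name is hypothetical):

```python
import base64


def append_to_raw_document(document: dict, extra_text: str) -> dict:
    """Return a copy of the payload with extra_text appended to its raw text."""
    content = document["content"]
    # Decode the stored text, append the new information on its own line.
    current = base64.b64decode(content["rawBytes"]).decode("utf-8")
    updated = current.rstrip("\n") + "\n" + extra_text
    encoded = base64.b64encode(updated.encode("utf-8")).decode("ascii")
    # Keep the same id so the re-submitted payload targets the same document.
    return dict(document, content=dict(content, rawBytes=encoded))
```

The original payload is left untouched, so the previous version can be kept if the update needs to be reverted.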

I did not find a way to train the search & conversation algorithm.