
"Search and Conversion" searching a German PDF: Bad results/specifying languages?

We want to add an "AI-supported search feature" to our application: users ask questions in natural language, and the feature returns information from a kind of manual (or a set of web pages, as in Confluence). It should also return some kind of reference to the source, but that is not important here.

I followed the example, created a Cloud Storage bucket, and uploaded a manual as a PDF. The text is mostly in German. The results are very bad, both in the preview and when using the integration code.

The manual has a page which looks like this:

Glossar: CPC (Cost per Click)

Der CPC gibt bei Paid-Media-Kampagnen an, wie viel im Durchschnitt für einen Klick auf ein Werbemittel, also einen Besuch auf der Landingpage, ausgegeben werden musste.

(Manually translated: For paid media campaigns, the CPC indicates how much had to be spent on average for one click on an ad, that is, for one visit to the landing page.)

  • Searching in German frequently returns "There is not enough information", even when the query is the exact phrase used in the PDF.
  • Searching in German for (translated) "Which abbreviation describes the cost per click?" returns the text for a different KPI from the same document.
  • Searching in English for "What is a CPC" surprisingly works, but the answer is translated into English as well.
  • Searching just for "Paid-Media-Kampagnen" returns a result in what appears to be Danish (see below).

Is there a way to improve the results, e.g. by specifying the language of the Data Store or of the underlying data (here: the PDF)? And is there a way to make the search stick to a certain language, e.g. German, for its answers?

Test examples:

Welche Abkürzung beschreibt die Kosten pro Klick?

CPÜ steht für Cost per Überleitung

What is a CPC:

CPC stands for Cost per Click. It is a metric that measures the average amount of money spent per click on a paid media campaign

Paid-Media-Kampagnen:

CPC (Cost per Click) og CPA (Cost per Action) er to begreber, der bruges til at beskrive Paid-Media-kampagner. CPC angiver, hvor meget der i gennemsnit skal bruges for at få en person til at klikke på et reklamemiddel.

 

1 Reply

In my tests with knowledge bases, HTML files in a bucket produced better results than PDF versions of the same content.

We tried both and settled on the HTML versions.
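
For anyone who wants to try the HTML route, here is a rough sketch of how glossary entries could be written out as one small HTML page each before uploading them to the bucket. It is only an illustration under a few assumptions: the text has already been extracted from the PDF (not shown), one file per entry is a sensible split, and the function names `glossary_entry_to_html` and `write_entry` are hypothetical. The `lang="de"` attribute is standard HTML, but whether the search service actually uses it is not something this thread confirms.

```python
from html import escape
from pathlib import Path


def glossary_entry_to_html(title: str, body: str, lang: str = "de") -> str:
    """Render one glossary entry as a small standalone HTML page.

    Paragraphs in `body` are expected to be separated by blank lines.
    """
    paragraphs = [p.strip() for p in body.split("\n\n") if p.strip()]
    # Escape the text so characters like <, > and & cannot break the markup.
    body_html = "\n".join(f"    <p>{escape(p)}</p>" for p in paragraphs)
    return (
        "<!DOCTYPE html>\n"
        # Assumption: declaring the language may help; this is not confirmed.
        f'<html lang="{escape(lang, quote=True)}">\n'
        "  <head>\n"
        '    <meta charset="utf-8">\n'
        f"    <title>{escape(title)}</title>\n"
        "  </head>\n"
        "  <body>\n"
        f"    <h1>{escape(title)}</h1>\n"
        f"{body_html}\n"
        "  </body>\n"
        "</html>\n"
    )


def write_entry(out_dir: Path, slug: str, title: str, body: str) -> Path:
    """Write one entry to `<out_dir>/<slug>.html` as UTF-8 and return the path."""
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"{slug}.html"
    path.write_text(glossary_entry_to_html(title, body), encoding="utf-8")
    return path


if __name__ == "__main__":
    written = write_entry(
        Path("html_out"),
        "cpc",
        "Glossar: CPC (Cost per Click)",
        "Der CPC gibt bei Paid-Media-Kampagnen an, wie viel im Durchschnitt "
        "für einen Klick auf ein Werbemittel ausgegeben werden musste.",
    )
    print(f"wrote {written}")
```

The resulting files can then be uploaded to the bucket in place of the PDF. Keeping each entry in its own page with a clear title and heading is a design choice of this sketch, not something the service is known to require.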