Advanced Topic: Full-Text Search Details

NOTE: For general information on how to use search, see this article.

Full-Text search is an option on the Tetra Data Lake search screen. When Full-Text search is enabled, you can quickly and easily find files in the Tetra Data Lake without learning an additional query language. This article provides more details on how the Full-Text search option works.

Enabling the Full-Text Search Option

To enable this option on the TDP Search page, move the Full-Text slider to the right.

Figure: Enable the Full-Text Search Option

Files Supported

For version 3.1 of TDP, four types of files are indexed for full-text search: TXT, CSV, XML, and JSON.
Content from these files, along with any file metadata, tags, and labels, is extracted. This data is then indexed and broken down into search terms.

An Example of How Full-Text Indexing and Search Operates

To illustrate how Full-Text search works, consider an indexed file named fts-doc.txt with the following content:

Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s.

The indexed metadata is the following:

{
  "fileId": "00cfc49c-2d33-46e3-bd7e-2b0815a4d52f",
  "filePath": "/fts-doc.txt",
  "orgSlug": "tetrascience",
  "traceId": "00cfc49c-2d33-46e3-bd7e-2b0815a4d52f",
  "file": {
    "path": "tetrascience/6f166302-df8a-4044-ab4b-7ddd3eefb50b/RAW/fts-doc.txt",
    "bucket": "ts-platform-dev-datalake",
    "version": "l2alWAAji.f2DgYF7DuFC7Dz_YHC1M.8",
    "size": 128,
    "checksum": "c1de85c917339cc9201f9952749188e7",
    "type": "txt"
  },
  "createdAt": "2021-07-08T15:32:43.000Z",
  "category": "RAW",
  "integration": {
    "type": "api",
    "id": "6f166302-df8a-4044-ab4b-7ddd3eefb50b"
  },
  "source": {
    "type": "unknown",
    "name": "API Upload",
    "id": "6f166302-df8a-4044-ab4b-7ddd3eefb50b",
    "api": {
      "integrationId": "6f166302-df8a-4044-ab4b-7ddd3eefb50b"
    }
  },
  "metadata": {
    "test": "some meta value"
  },
  "tags": [
    "my-tag"
  ],
  "labels": [
    {
      "name": "test_label",
      "value": "some label value"
    }
  ]
}

Based on this metadata and the extracted file content, TDP generates indexed text similar to this:

Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s my-tag 00cfc49c-2d33-46e3-bd7e-2b0815a4d52f /fts-doc.txt tetrascience 00cfc49c-2d33-46e3-bd7e-2b0815a4d52f tetrascience/6f166302-df8a-4044-ab4b-7ddd3eefb50b/RAW/fts-doc.txt ts-platform-dev-datalake l2alWAAji.f2DgYF7DuFC7Dz_YHC1M.8 128 c1de85c917339cc9201f9952749188e7 txt Thu Jul 08 2021 15:32:43 GMT+0000 (Coordinated Universal Time) RAW api 6f166302-df8a-4044-ab4b-7ddd3eefb50b unknown API Upload 6f166302-df8a-4044-ab4b-7ddd3eefb50b 6f166302-df8a-4044-ab4b-7ddd3eefb50b some meta value test_label some label value
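The way the file content and metadata values end up in one block of text can be sketched as follows. This is a minimal illustration, not TDP's actual implementation: the function names flatten_values and build_index_text are hypothetical, the metadata is a shortened version of the example above, and the sketch keeps values in document order, whereas the TDP output shown above places tags first and reformats the createdAt date.

```python
def flatten_values(node):
    """Yield every leaf value of a nested dict/list as a string, in document order."""
    if isinstance(node, dict):
        for value in node.values():
            yield from flatten_values(value)
    elif isinstance(node, list):
        for item in node:
            yield from flatten_values(item)
    else:
        yield str(node)


def build_index_text(content, metadata):
    """Join the extracted file content and all metadata values into one string."""
    return " ".join([content, *flatten_values(metadata)])


# Shortened, illustrative metadata based on the example above.
metadata = {
    "filePath": "/fts-doc.txt",
    "file": {"size": 128, "type": "txt"},
    "tags": ["my-tag"],
    "labels": [{"name": "test_label", "value": "some label value"}],
}

index_text = build_index_text("Lorem Ipsum is simply dummy text.", metadata)
print(index_text)
# Lorem Ipsum is simply dummy text. /fts-doc.txt 128 txt my-tag test_label some label value
```

Note that only the values are kept, except for labels, where both the name and the value are themselves values in the metadata and therefore both become searchable.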

Later, the same person uploads a similar file named fts-doc-2.txt. For that file, TDP generates text that looks like this:

The point of using Lorem Ipsum is that it has a more-or-less normal distribution of letters. firstName lastName your-tag 043c963e-65ce-4550-b362-6386454e3238 /fts-doc-2.txt tetrascience 043c963e-65ce-4550-b362-6386454e3238 tetrascience/6f166302-df8a-4044-ab4b-7ddd3eefb50b/RAW/fts-doc-2.txt ts-platform-dev-datalake 8AgxZhzrxRoHoDKT3wbTO.P2AKhSIW9t 112 11fdc29922fae43649911b421ea5b9b4 txt Thu Jul 08 2021 16:22:03 GMT+0000 (Coordinated Universal Time) RAW api 6f166302-df8a-4044-ab4b-7ddd3eefb50b unknown API Upload 6f166302-df8a-4044-ab4b-7ddd3eefb50b 6f166302-df8a-4044-ab4b-7ddd3eefb50b fts test_label blabla

The text is analyzed into searchable terms by dividing it at word boundaries. The analysis lowercases the terms and removes punctuation and stop words. Stop words are words that are filtered out during or after text processing; they are typically the most commonly used words in a language. When a match query is performed, the entered terms are also analyzed and then searched for in the Full-Text index.
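The analysis step described above can be approximated with a few lines of Python. This is a rough sketch under stated assumptions, not the analyzer TDP uses: the function name analyze is hypothetical, the stop word list is a small made-up sample (the article does not say which list is used), and the regular expression \w+ is only an approximation of word-boundary splitting.

```python
import re

# Hypothetical stop word list; the real list used by TDP is not documented here.
STOP_WORDS = {"a", "an", "and", "is", "the", "that", "it"}


def analyze(text, stop_words=STOP_WORDS):
    """Split on word boundaries, lowercase, and drop punctuation and stop words."""
    # \w+ keeps letters, digits, and underscores together, so "test_label"
    # stays one term while "my-tag" is split into "my" and "tag".
    tokens = re.findall(r"\w+", text.lower())
    return [token for token in tokens if token not in stop_words]


print(analyze("Lorem Ipsum is simply dummy text."))
# ['lorem', 'ipsum', 'simply', 'dummy', 'text']
print(analyze("my-tag tetrascience/RAW/fts-doc.txt test_label"))
# ['my', 'tag', 'tetrascience', 'raw', 'fts', 'doc', 'txt', 'test_label']
```

Because a path such as tetrascience/RAW/fts-doc.txt is broken into separate terms, each path segment becomes searchable on its own, while a partial word such as tetra does not match anything.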

Term Search Results

If you were to search the two documents above using the terms lorem, typesetting, distribution, typesetting distribution, tetrascience, tag, test_label blabla, and tetra, the results would be as shown in the following table.

| Search Term | Result |
| --- | --- |
| lorem | Returns both documents, since both contain this term. |
| typesetting | Returns the first document only, because the second one does not contain the term. |
| distribution | Returns the second document only, because the first one does not contain the term. |
| typesetting distribution | Returns both documents, since the match query is also analyzed and matching is performed for each term. |
| tetrascience | Returns both documents, since the document path indexed in full-text search is analyzed and broken into terms, and one of them is tetrascience. |
| tag | Returns both documents. The tag values are my-tag (in the first document) and your-tag (in the second document). Both are analyzed into terms, and the term tag is present in both documents. |
| test_label blabla | Returns both documents, since this query is analyzed and both terms (test_label and blabla) are used for matching. To get only the second document, where the value of test_label is blabla, use phrase matching (explained below). |
| tetra | Returns no documents, because text analysis does not produce the term tetra (only tetrascience). |
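The results in the table can be reproduced with a small sketch of a match query that returns every document sharing at least one analyzed term with the query. The names analyze, match, DOCS, and INDEX are hypothetical, the stop word list is made up, and the document texts are abbreviated versions of the two indexed texts above; TDP's real search engine is not shown in this article.

```python
import re

STOP_WORDS = {"a", "an", "and", "is", "the"}  # hypothetical stop word list


def analyze(text):
    """Lowercase, split on non-word characters, and drop stop words."""
    return [t for t in re.findall(r"\w+", text.lower()) if t not in STOP_WORDS]


# Abbreviated versions of the two indexed texts from the example above.
DOCS = {
    "fts-doc.txt": (
        "Lorem Ipsum is simply dummy text of the printing and typesetting "
        "industry. my-tag /fts-doc.txt tetrascience test_label some label value"
    ),
    "fts-doc-2.txt": (
        "The point of using Lorem Ipsum is that it has a more-or-less normal "
        "distribution of letters. your-tag /fts-doc-2.txt tetrascience fts "
        "test_label blabla"
    ),
}

# Each document is indexed as the set of its analyzed terms.
INDEX = {name: set(analyze(text)) for name, text in DOCS.items()}


def match(query):
    """Return the documents that contain at least one analyzed query term."""
    terms = set(analyze(query))
    return [name for name, doc_terms in INDEX.items() if terms & doc_terms]


for query in ["lorem", "typesetting", "typesetting distribution", "tag", "tetra"]:
    print(query, "->", match(query))
```

In this sketch, test_label blabla returns both documents because the term test_label alone is enough to match the first one.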

Phrase Searches

Because free-text input allows many terms to be entered, a search can return a large set of results. To narrow the results, Full-Text search supports phrase matching: enclose the search terms in double quotes.

Let's take the two example documents from above. Both have a label named test_label. Its value is some label value in the first document and blabla in the second.

Searching for either test_label blabla or test_label some label value returns both documents, because the query itself is analyzed into terms and matching is performed for each term.

Phrase matching is achieved by enclosing the text in double quotes. When the search recognizes double quotes, it attempts an exact phrase match, so searching for "test_label blabla" returns the second document only, and searching for "test_label some label value" returns the first document only.

A few more examples:
point of - returns both documents (both have the word of)
"point of" - returns the second document only, because it has this phrase
Jul 08 2021 16:22:03 - returns both documents
"Jul 08 2021 16:22:03" - returns the second document only

Only one phrase at a time is supported in Full-Text search. If two phrases are entered, for example "point of" "Jul 08", the input is interpreted as the single phrase point of" "Jul 08.
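The difference between term matching and phrase matching, including the single-phrase behavior, can be sketched as follows. This is an illustration under assumptions rather than TDP's implementation: the names tokenize, contains_phrase, search, and DOCS are hypothetical, the document texts are abbreviated, and stop words are kept so that the point of examples above behave as described.

```python
import re


def tokenize(text):
    """Lowercase and split on non-word characters (stop words are kept here)."""
    return re.findall(r"\w+", text.lower())


def contains_phrase(doc_tokens, phrase_tokens):
    """True if phrase_tokens occur in doc_tokens as one consecutive run."""
    n = len(phrase_tokens)
    return n > 0 and any(
        doc_tokens[i:i + n] == phrase_tokens
        for i in range(len(doc_tokens) - n + 1)
    )


def search(query, docs):
    """Phrase match if the query is wrapped in double quotes, else any-term match."""
    query = query.strip()
    tokens_by_doc = {name: tokenize(text) for name, text in docs.items()}
    if len(query) >= 2 and query.startswith('"') and query.endswith('"'):
        # Everything between the first and last quote is treated as ONE phrase,
        # so inner quotes do not start a second phrase.
        phrase = tokenize(query[1:-1])
        return [n for n, toks in tokens_by_doc.items() if contains_phrase(toks, phrase)]
    terms = set(tokenize(query))
    return [n for n, toks in tokens_by_doc.items() if terms & set(toks)]


# Abbreviated versions of the two indexed texts from the example above.
DOCS = {
    "fts-doc.txt": "Lorem Ipsum is simply dummy text of the printing "
                   "Thu Jul 08 2021 15:32:43 test_label some label value",
    "fts-doc-2.txt": "The point of using Lorem Ipsum "
                     "Thu Jul 08 2021 16:22:03 fts test_label blabla",
}

print(search('test_label blabla', DOCS))    # both documents
print(search('"test_label blabla"', DOCS))  # ['fts-doc-2.txt']
print(search('"point of" "Jul 08"', DOCS))  # []
```

In this sketch, the two-phrase input is read as the single phrase point of" "Jul 08, which appears in neither document, so nothing is returned.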

