Third-party language models supported by Elasticsearch

Supported architectures:

BERT
BART
DPR bi-encoders
DistilBERT
ELECTRA
MobileBERT
RoBERTa
RetriBERT
MPNet
SentenceTransformers bi-encoders with the above transformer architectures
XLM-RoBERTa

Third party fill-mask models

Third party named entity recognition models

Third party question answering models

Third party text embedding models

Text embedding models are designed to work with specific scoring functions that calculate the similarity between the embeddings they produce. Typical scoring functions are cosine, dot product, and Euclidean distance (also known as l2_norm).
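
As a rough illustration of the three scoring functions named above, here is a minimal pure-Python sketch. The vectors are made-up values, and the functions are the plain mathematical definitions; how Elasticsearch turns them into a relevance score is not shown here.

```python
import math

def dot_product(a, b):
    """Sum of element-wise products of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Dot product divided by the product of the two vector lengths."""
    return dot_product(a, b) / (math.sqrt(dot_product(a, a)) * math.sqrt(dot_product(b, b)))

def l2_norm(a, b):
    """Euclidean distance between two vectors (smaller means more similar)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical two-dimensional embeddings, for illustration only.
query = [1.0, 0.0]
doc = [3.0, 4.0]

print(dot_product(query, doc))
print(cosine(query, doc))
print(l2_norm(query, doc))
```

Note that cosine ignores vector length while dot product does not, which is one reason a model trained for one function may rank poorly under another.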

The embeddings produced by these models should be indexed in Elasticsearch using the dense vector field type with an appropriate similarity function chosen for the model.
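
A sketch of what such a mapping might look like, built as a Python dictionary that could be sent as the body of an index creation request. The field name `text_embedding` and the dimension count 384 are assumptions for illustration; use the dimension count your model actually produces. The helper only accepts the three similarity names mentioned in this document.

```python
import json

# Similarity names listed in this document.
KNOWN_SIMILARITIES = {"cosine", "dot_product", "l2_norm"}

def dense_vector_mapping(field, dims, similarity):
    """Build an index mapping with one indexed dense_vector field."""
    if similarity not in KNOWN_SIMILARITIES:
        raise ValueError(f"unknown similarity: {similarity}")
    return {
        "mappings": {
            "properties": {
                field: {
                    "type": "dense_vector",
                    "dims": dims,
                    "index": True,
                    "similarity": similarity,
                }
            }
        }
    }

# Hypothetical field name and dimension count.
mapping = dense_vector_mapping("text_embedding", 384, "cosine")
print(json.dumps(mapping, indent=2))
```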

To find similar embeddings in Elasticsearch, use the efficient approximate k-nearest neighbor (kNN) search API with a text embedding as the query vector. Approximate kNN search uses the similarity function defined in the dense vector field mapping to calculate relevance. For the best results, the function must be one of the suitable similarity functions for the model.
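
The following sketch builds the body of an approximate kNN search request as a Python dictionary. The field name and the short query vector are hypothetical; in practice the vector is the embedding the model produces for the query text, and its length must match the `dims` of the mapped field.

```python
def knn_search_body(field, query_vector, k=10, num_candidates=100):
    """Build a search body with a top-level knn clause."""
    if num_candidates < k:
        raise ValueError("num_candidates must be at least k")
    return {
        "knn": {
            "field": field,
            "query_vector": list(query_vector),
            "k": k,                            # number of nearest neighbors to return
            "num_candidates": num_candidates,  # candidates considered per shard
        }
    }

# Hypothetical field name and a toy three-dimensional query vector.
body = knn_search_body("text_embedding", [0.1, 0.2, 0.3], k=5, num_candidates=50)
print(body)
```

The similarity function does not appear in the request: it is taken from the field mapping, which is why it has to be chosen correctly at index creation time.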

Using SentenceTransformerWrapper:

All DistilRoBERTa v1 (suitable similarity functions: dot_product, cosine, l2_norm)
All MiniLM L12 v2 (suitable similarity functions: dot_product, cosine, l2_norm)
All MPNet base v2 (suitable similarity functions: dot_product, cosine, l2_norm)
Facebook dpr-ctx_encoder multiset base (suitable similarity functions: dot_product)
Facebook dpr-question_encoder single nq base (suitable similarity functions: dot_product)
LaBSE (suitable similarity functions: cosine)
msmarco DistilBERT base tas b (suitable similarity functions: dot_product)
msmarco MiniLM L12 v5 (suitable similarity functions: dot_product, cosine, l2_norm)
paraphrase mpnet base v2 (suitable similarity functions: cosine)

Using DPREncoderWrapper:

Third party text classification models

Third party text similarity models

Third party zero-shot text classification models

Expected model output

Models used for each NLP task type must output tensors of a specific format to be used in the Elasticsearch NLP pipelines.

Here are the expected outputs for each task type.

Fill mask expected model output

Fill mask is a specific kind of token classification; it is the base training task of many transformer models.

For the Elastic Stack’s fill mask NLP task to understand the model output, the output must have a specific format: a float tensor with shape(<number of sequences>, <number of tokens>, <vocab size>).

For example, given the single sequence "The capital of [MASK] is Paris" and the vocabulary ["The", "capital", "of", "is", "Paris", "France", "[MASK]"], the model should output:

 [
   [
     [ 0, 0, 0, 0, 0, 0, 0 ], // The
     [ 0, 0, 0, 0, 0, 0, 0 ], // capital
     [ 0, 0, 0, 0, 0, 0, 0 ], // of
     [ 0.01, 0.01, 0.3, 0.01, 0.2, 1.2, 0.1 ], // [MASK]
     [ 0, 0, 0, 0, 0, 0, 0 ], // is
     [ 0, 0, 0, 0, 0, 0, 0 ] // Paris
   ]
]

The predicted value here for [MASK] is "France" with a score of 1.2.
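
To make the reading of this tensor concrete, the sketch below picks the highest-scoring vocabulary entry at the masked position. It is a simplified illustration in pure Python, not the Elastic Stack's actual post-processing; the function name `predict_mask` is hypothetical.

```python
vocabulary = ["The", "capital", "of", "is", "Paris", "France", "[MASK]"]
tokens = ["The", "capital", "of", "[MASK]", "is", "Paris"]

# Shape: (1 sequence, 6 tokens, 7 vocabulary entries), as in the example above.
output = [
    [
        [0, 0, 0, 0, 0, 0, 0],                   # The
        [0, 0, 0, 0, 0, 0, 0],                   # capital
        [0, 0, 0, 0, 0, 0, 0],                   # of
        [0.01, 0.01, 0.3, 0.01, 0.2, 1.2, 0.1],  # [MASK]
        [0, 0, 0, 0, 0, 0, 0],                   # is
        [0, 0, 0, 0, 0, 0, 0],                   # Paris
    ]
]

def predict_mask(output, tokens, vocabulary, sequence=0):
    """Return the best vocabulary entry and its score at the [MASK] position."""
    mask_position = tokens.index("[MASK]")
    scores = output[sequence][mask_position]
    best = max(range(len(scores)), key=lambda i: scores[i])
    return vocabulary[best], scores[best]

prediction, score = predict_mask(output, tokens, vocabulary)
print(prediction, score)
```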

Named entity recognition expected model output

Named entity recognition is a specific token classification task. Each token in the sequence is scored against a specific set of classification labels. The Elastic Stack uses Inside-Outside-Beginning (IOB) tagging and supports any NER entities as long as they are IOB tagged. The default labels are: "O", "B_MISC", "I_MISC", "B_PER", "I_PER", "B_ORG", "I_ORG", "B_LOC", "I_LOC".

The "O" entity label indicates that the current token is outside any entity. "I" indicates that the token is inside an entity. "B" indicates the beginning of an entity. "MISC" is a miscellaneous entity. "LOC" is a location. "PER" is a person. "ORG" is an organization.

The response format must be a float tensor with shape(<number of sequences>, <number of tokens>, <number of classification labels>).

Here is an example with a single sequence "Waldo is in Paris":

 [
   [
//    "O", "B_MISC", "I_MISC", "B_PER", "I_PER", "B_ORG", "I_ORG", "B_LOC", "I_LOC"
     [ 0,  0,         0,       0.4,     0.5,     0,       0.1,     0,       0 ], // Waldo
     [ 1,  0,         0,       0,       0,       0,       0,       0,       0 ], // is
     [ 1,  0,         0,       0,       0,       0,       0,       0,       0 ], // in
     [ 0,  0,         0,       0,       0,       0,       0,       0,       1.0 ] // Paris
   ]
]
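
One way to read this tensor is to take the highest-scoring label for each token, as in the sketch below. This is a simplification that assumes one token per word; real tokenizers may split words into sub-word pieces, and the Elastic Stack's own grouping of tokens into entities is not reproduced here. The function name `decode_labels` is hypothetical.

```python
labels = ["O", "B_MISC", "I_MISC", "B_PER", "I_PER", "B_ORG", "I_ORG", "B_LOC", "I_LOC"]
tokens = ["Waldo", "is", "in", "Paris"]

# Shape: (1 sequence, 4 tokens, 9 labels), as in the example above.
output = [
    [
        [0, 0, 0, 0.4, 0.5, 0, 0.1, 0, 0],  # Waldo
        [1, 0, 0, 0, 0, 0, 0, 0, 0],        # is
        [1, 0, 0, 0, 0, 0, 0, 0, 0],        # in
        [0, 0, 0, 0, 0, 0, 0, 0, 1.0],      # Paris
    ]
]

def decode_labels(output, tokens, labels, sequence=0):
    """Pair each token with its highest-scoring IOB label."""
    decoded = []
    for token, scores in zip(tokens, output[sequence]):
        best = max(range(len(scores)), key=lambda i: scores[i])
        decoded.append((token, labels[best]))
    return decoded

decoded = decode_labels(output, tokens, labels)
print(decoded)
```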

Text embedding expected model output

Text embedding allows for semantic embedding of text for dense information retrieval.

The model must output the final embedding directly, so that no additional pooling is required.

Eland wraps the aforementioned models so that they produce this output. If you supply your own model, it must output the embedding for each inferred sequence.
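
To illustrate what "pooling" means here, the sketch below reduces per-token vectors to a single sequence embedding by averaging the non-padding tokens. Mean pooling is assumed purely for illustration; it is one common strategy, and this is not a description of what Eland's wrappers do for any particular model. All values are made up.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors whose attention mask entry is 1."""
    kept = [vec for vec, keep in zip(token_embeddings, attention_mask) if keep]
    if not kept:
        raise ValueError("attention mask selects no tokens")
    dims = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dims)]

# Three token vectors of two dimensions; the last token is padding.
token_embeddings = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
attention_mask = [1, 1, 0]

embedding = mean_pool(token_embeddings, attention_mask)
print(embedding)
```

A model supplied to Elasticsearch would need to perform a step like this internally and return only the resulting vector for each sequence.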

Text classification expected model output

With text classification (for example, in tasks like sentiment analysis), the entire sequence is classified. The output of the model must be a float tensor with shape(<number of sequences>, <number of classification labels>).

Here is an example with two sequences for a binary classification model of "happy" and "sad":

 [
   [
//     happy, sad
     [ 0,     1], // first sequence
     [ 1,     0] // second sequence
   ]
]
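
As a small illustration, the sketch below takes the two score rows from the example and picks the highest-scoring label for each sequence. It uses the per-sequence rows only, matching the stated shape (<number of sequences>, <number of classification labels>), and is not the Elastic Stack's actual post-processing. The function name `classify` is hypothetical.

```python
labels = ["happy", "sad"]

# One row of label scores per sequence, taken from the example above.
scores = [
    [0, 1],  # first sequence
    [1, 0],  # second sequence
]

def classify(scores, labels):
    """Return the highest-scoring label for every sequence."""
    results = []
    for row in scores:
        best = max(range(len(row)), key=lambda i: row[i])
        results.append(labels[best])
    return results

predictions = classify(scores, labels)
print(predictions)
```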

Zero-shot text classification expected model output

Zero-shot text classification allows text to be classified against arbitrary labels that were not necessarily part of the original training. Each sequence is combined with each label using a hypothesis template. The model then scores each of these combinations according to [entailment, neutral, contradiction]. The output of the model must be a float tensor with shape(<number of sequences>, <number of labels>, 3).

Here is an example with a single sequence classified against 4 labels:

 [
   [
//     entailment, neutral, contradiction
     [ 0.5,        0.1,     0.4], // first label
     [ 0,          0,       1], // second label
     [ 1,          0,       0], // third label
     [ 0.7,        0.2,     0.1] // fourth label
   ]
]
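
The sketch below ranks the four labels of the example by their raw entailment score. This is a deliberate simplification: the label names are hypothetical placeholders, and any normalization the Elastic Stack applies to the scores is not reproduced.

```python
ENTAILMENT = 0  # column order: entailment, neutral, contradiction

# Hypothetical label names standing in for the four labels in the example.
labels = ["label_1", "label_2", "label_3", "label_4"]

# Shape: (1 sequence, 4 labels, 3), as in the example above.
output = [
    [
        [0.5, 0.1, 0.4],  # first label
        [0, 0, 1],        # second label
        [1, 0, 0],        # third label
        [0.7, 0.2, 0.1],  # fourth label
    ]
]

def rank_labels(output, labels, sequence=0):
    """Sort labels by descending entailment score for one sequence."""
    scored = [(label, row[ENTAILMENT]) for label, row in zip(labels, output[sequence])]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

ranking = rank_labels(output, labels)
print(ranking)
```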

Author: admin  Created: 2023-11-27 23:51
Last edited by: admin  Updated: 2024-07-01 18:08