• paper link: 12-textual-sim-paper.pdf

  • This study aims to develop a general text embedding model through a multi-stage training approach.

  • In the initial stage of unsupervised contrastive learning, we generate weakly supervised text pairs from publicly available data across various sources.

  • In the supervised fine-tuning stage, we use a more varied mixture of training data to further enhance the model's versatility.

  • The training process of our model consists of two stages: unsupervised pre-training and supervised fine-tuning. Both stages employ the learning objective of contrastive learning.
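Both stages share the same objective family. The notes don't reproduce it, but the standard in-batch InfoNCE-style form over n (query, document) pairs looks like the sketch below, where s(·,·) is a similarity score (e.g. cosine) and τ a temperature; the notation is a reconstruction, not copied from the paper:

```latex
\mathcal{L} = -\frac{1}{n} \sum_{i=1}^{n}
\log \frac{\exp\big(s(q_i, d_i)/\tau\big)}
          {\sum_{j=1}^{n} \exp\big(s(q_i, d_j)/\tau\big)}
```

Each query's positive document competes against the other documents in the same batch, which is why large batches help in this setup.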

Model Architecture

  • The backbone of our embedding model is a deep Transformer encoder which can be initialized with pre-trained language models such as BERT (Devlin et al., 2019).
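As a concrete illustration, here is a minimal encoding sketch using Hugging Face transformers; the checkpoint name and mean pooling are my assumptions, since the notes only specify a BERT-initialized Transformer encoder:

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

# Assumption: plain BERT backbone with masked mean pooling;
# the paper may pool differently (e.g. [CLS]).
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

def embed(texts: list[str]) -> torch.Tensor:
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**batch).last_hidden_state       # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1)          # (B, T, 1)
    pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1) # masked mean pooling
    return F.normalize(pooled, dim=-1)                    # unit-norm embeddings

# Cosine similarity between a query and a document embedding.
sim = embed(["how to train embeddings"]) @ embed(["a passage about training"]).T
```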

Unsupervised Pre-training Data

  • Weakly supervised text relevance data is readily available in publicly accessible web sources; for example, the inherent link between questions and answers on QA forums.
  • The model is initially pre-trained on naturally occurring text pairs extracted from diverse sources (illustrated below).
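Hypothetical examples of what such naturally occurring pairs might look like; the actual source mix and examples in the paper may differ:

```python
# Hypothetical (text_a, text_b) pairs; real sources and examples may differ.
weak_pairs = [
    ("How do I reverse a list in Python?",        # QA forum question ...
     "Use list(reversed(xs)) or xs[::-1]."),      # ... paired with its answer
    ("Article headline about climate policy",     # article title paired ...
     "First paragraph of the same article"),      # ... with the article body
    ("anchor text of a hyperlink",                # hyperlink anchor paired ...
     "opening text of the linked page"),          # ... with the target page
]
```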

Supervised Fine-tuning Data

  • In the supervised fine-tuning stage, we use relatively small datasets with human-annotated relevance between two pieces of text, plus optional hard negatives mined by an extra retriever, to form text triples (see the sketch below).
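A sketch of one such triple; the field names and example are illustrative, not taken from the paper:

```python
from dataclasses import dataclass

@dataclass
class TextTriple:
    query: str      # anchor text
    positive: str   # text annotated as relevant to the query
    negative: str   # optional hard negative mined by an extra retriever

example = TextTriple(
    query="capital of France",
    positive="Paris is the capital and largest city of France.",
    negative="France is a country in Western Europe.",  # related, but not an answer
)
```

Mined hard negatives make the task harder than random in-batch negatives alone, which is the point of using a separate retriever to find them.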

Training Details

  • To handle imbalanced data, they employ a multinomial distribution to sample data batches from different data sources, taking their respective sizes into account (see the sampling sketch after this list).
  • They use an improved contrastive loss (ICL), sketched after this list.
    • ICL is bidirectional and enlarges the set of negative samples with both in-batch queries and documents.
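A minimal sketch of the size-proportional batch sampling mentioned above; the source names and sizes are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical source sizes; the paper's actual data mix differs.
source_sizes = {"qa_forums": 30_000_000, "web_titles": 10_000_000, "news": 2_000_000}

names = list(source_sizes)
probs = np.array([source_sizes[n] for n in names], dtype=float)
probs /= probs.sum()  # multinomial weights proportional to source size

def sample_source() -> str:
    """Pick the data source for the next training batch."""
    return names[rng.choice(len(names), p=probs)]
```

Drawing each batch from a single source is a common design choice here, since it keeps in-batch negatives in the same distribution as the positive pair; whether the paper does exactly this is not stated in the notes.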
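The ICL bullet suggests that, for each positive pair, the denominator gathers negatives from both directions and from both the query and document sides of the batch. A PyTorch sketch of one such formalization follows; this is my reading of the description, not code from the paper:

```python
import torch

def improved_contrastive_loss(q: torch.Tensor, d: torch.Tensor,
                              tau: float = 0.05) -> torch.Tensor:
    """q, d: (n, h) unit-normalized embeddings of aligned (query, document) pairs."""
    n = q.size(0)
    qd = q @ d.T / tau                  # query-document similarities
    qq = q @ q.T / tau                  # query-query similarities (negatives)
    dd = d @ d.T / tau                  # document-document similarities (negatives)
    diag = torch.eye(n, dtype=torch.bool, device=q.device)
    neg_inf = torch.finfo(qd.dtype).min
    qq = qq.masked_fill(diag, neg_inf)  # a query is not its own negative
    dd = dd.masked_fill(diag, neg_inf)  # a document is not its own negative
    pos = qd.diagonal()                 # scores of the n positive pairs
    # Bidirectional partition function: in-batch documents (qd), in-batch
    # queries against each document (qd.T), plus query-side (qq) and
    # document-side (dd) negatives, all in one denominator.
    z = torch.cat([qd, qd.T, qq, dd], dim=1).logsumexp(dim=1)
    return (z - pos).mean()
```

Compared with the one-directional loss sketched earlier, this roughly quadruples the number of terms each positive pair competes against at the same batch size.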

Results