How to Use the Hugging Face Tokenizers Library to Preprocess Text Data


If you have studied NLP, you might have heard of the term "tokenization." It is a crucial step in text preprocessing, where we transform our textual data into something that machines can understand. It does so by breaking a sentence down into smaller chunks, known as tokens. These tokens can be words, subwords, or even characters, depending on the tokenization algorithm being used. In this article, we will see how to use the Hugging Face Tokenizers library to preprocess our text data.
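
Before turning to the library, here is a toy, library-free illustration of the idea: the same sentence can be split at the word level or the character level, and subword tokenizers sit between these two extremes.

# A toy illustration of tokenization granularity (plain Python, no library)
sentence = "Tokenization matters"
print(sentence.split())   # word-level tokens: ['Tokenization', 'matters']
print(list(sentence))     # character-level tokens: ['T', 'o', 'k', 'e', ...]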

 

Setting Up Hugging Face Tokenizers Library

 

To start using the Hugging Face Tokenizers library, you will need to install it first. You can do this using pip:
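
pip install tokenizers transformers

The examples below also import the pre-trained BertTokenizer from the Transformers library, so both packages are installed here.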

 

The Hugging Face library supports various tokenization algorithms, but the three main types are:

  • Byte-Pair Encoding (BPE): Iteratively merges the most frequent pairs of characters or subwords, creating a compact vocabulary. It is used by models like GPT-2 (a minimal training sketch follows this list).
  • WordPiece: Similar to BPE, but focuses on probabilistic merges (it doesn't choose the most frequent pair, but the one that will maximize the likelihood of the corpus once merged). It is commonly used by models like BERT.
  • SentencePiece: A more versatile tokenizer that can handle different languages and scripts, often used with models like ALBERT, XLNet, or the Marian framework. It treats spaces as characters rather than word separators.
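
To make the Tokenizers library itself concrete, here is a minimal sketch of training a small BPE tokenizer from scratch; the toy corpus, vocabulary size, and special tokens are arbitrary choices for illustration.

# Train a small BPE tokenizer from scratch on a toy corpus
from tokenizers import Tokenizer, models, trainers, pre_tokenizers

toy_corpus = [
    "Tokenization is a crucial step in NLP.",
    "Hello, how are you?"
]

bpe_tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
bpe_tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.BpeTrainer(vocab_size=100, special_tokens=["[UNK]"])
bpe_tokenizer.train_from_iterator(toy_corpus, trainer)

# Inspect how an unseen sentence is split into subword tokens
print(bpe_tokenizer.encode("Tokenization is crucial.").tokens)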

The Hugging Face Transformers library provides an AutoTokenizer class that can automatically select the best tokenizer for a given pre-trained model. This is a convenient way to use the correct tokenizer for a specific model, and it can be imported from the transformers library. However, for the sake of our discussion regarding the Tokenizers library, we will not follow this approach.
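
For reference, the AutoTokenizer approach mentioned above would look like the following; it loads the tokenizer for the same bert-base-uncased checkpoint used throughout this article (by default, a fast Rust-backed variant).

# Equivalent loading via AutoTokenizer (not used in the rest of this article)
from transformers import AutoTokenizer

auto_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")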

We will use the pre-trained BERT-base-uncased tokenizer. This tokenizer was trained on the same data and using the same techniques as the BERT-base-uncased model, which means it can be used to preprocess text data compatible with BERT models:

# Import the necessary components
from tokenizers import Tokenizer
from transformers import BertTokenizer

# Load the pre-trained BERT-base-uncased tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

 

Single Sentence Tokenization

 

Now, let's encode a simple sentence using this tokenizer:

# Tokenize a single sentence
encoded_input = tokenizer.encode_plus("This is sample text to test tokenization.")
print(encoded_input)

 

Output:

{'input_ids': [101, 2023, 2003, 7099, 3793, 2000, 3231, 19204, 3989, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

 

To ensure correctness, let's decode the tokenized input:

tokenizer.decode(encoded_input["input_ids"])

 

Output:

[CLS] this is sample text to test tokenization. [SEP]

 

In this output, you can see two special tokens. [CLS] marks the start of the input sequence, and [SEP] marks the end, indicating a single sequence of text.
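
If you want to see the individual subword pieces rather than the re-assembled string, you can also map the IDs back to tokens with convert_ids_to_tokens (the same method used later in the truncation example):

print(tokenizer.convert_ids_to_tokens(encoded_input["input_ids"]))
# ['[CLS]', 'this', 'is', 'sample', 'text', 'to', 'test', 'token', '##ization', '.', '[SEP]']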

 

Batch Tokenization

 

Now, let's tokenize a corpus of text instead of a single sentence using batch_encode_plus:

corpus = [
    "Hello, how are you?",
    "I am learning how to use the Hugging Face Tokenizers library.",
    "Tokenization is a crucial step in NLP."
]
encoded_corpus = tokenizer.batch_encode_plus(corpus)
print(encoded_corpus)

 

Output:

{'input_ids': [[101, 7592, 1010, 2129, 2024, 2017, 1029, 102], [101, 1045, 2572, 4083, 2129, 2000, 2224, 1996, 17662, 2227, 19204, 17629, 2015, 3075, 1012, 102], [101, 19204, 3989, 2003, 1037, 10232, 3357, 1999, 17953, 2361, 1012, 102]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}

 

For better understanding, let's decode the batch-encoded corpus as we did in the case of the single sentence. This will show the original sentences, tokenized appropriately.

tokenizer.batch_decode(encoded_corpus["input_ids"])

 

Output:

['[CLS] hello, how are you? [SEP]',
 '[CLS] i am learning how to use the hugging face tokenizers library. [SEP]',
 '[CLS] tokenization is a crucial step in nlp. [SEP]']

 

Padding and Truncation

 

When preparing data for machine learning models, it is often essential to ensure that all input sequences have the same length. Two techniques to accomplish this are:

 

1. Padding

Padding works by adding the special token [PAD] at the end of the shorter sequences to match the length of the longest sequence in the batch, or the maximum length supported by the model if max_length is defined. You can do this by:

encoded_corpus_padded = tokenizer.batch_encode_plus(corpus, padding=True)
print(encoded_corpus_padded)

 

Output:

{'input_ids': [[101, 7592, 1010, 2129, 2024, 2017, 1029, 102, 0, 0, 0, 0, 0, 0, 0, 0], [101, 1045, 2572, 4083, 2129, 2000, 2224, 1996, 17662, 2227, 19204, 17629, 2015, 3075, 1012, 102], [101, 19204, 3989, 2003, 1037, 10232, 3357, 1999, 17953, 2361, 1012, 102, 0, 0, 0, 0]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0]]}

 

Now, you can see that extra 0s have been added, but for better understanding, let's decode to see where the tokenizer has placed the [PAD] tokens:

tokenizer.batch_decode(encoded_corpus_padded["input_ids"], skip_special_tokens=False)

 

Output:

['[CLS] hello, how are you? [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]',
 '[CLS] i am learning how to use the hugging face tokenizers library. [SEP]',
 '[CLS] tokenization is a crucial step in nlp. [SEP] [PAD] [PAD] [PAD] [PAD]']
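
If every batch needs the same fixed length regardless of its longest sentence, you can also pad to an explicit length by passing padding="max_length"; the value 16 below is an arbitrary choice for illustration.

# Pad every sequence to a fixed length of 16 tokens
encoded_fixed = tokenizer.batch_encode_plus(corpus, padding="max_length", max_length=16)
print([len(ids) for ids in encoded_fixed["input_ids"]])   # [16, 16, 16]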

 

2. Truncation

Many NLP models have a maximum input sequence length, and truncation works by cutting off the end of longer sequences to meet this maximum length. It reduces memory usage and prevents the model from being overwhelmed by very large input sequences.

encoded_corpus_truncated = tokenizer.batch_encode_plus(corpus, truncation=True, max_length=5)
print(encoded_corpus_truncated)

 

Output:

{'input_ids': [[101, 7592, 1010, 2129, 102], [101, 1045, 2572, 4083, 102], [101, 19204, 3989, 2003, 102]], 'token_type_ids': [[0, 0, 0, 0, 0], [0, 0, 0, 0, 0], [0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1], [1, 1, 1, 1, 1], [1, 1, 1, 1, 1]]}

 

Now, you could again use the batch_decode method, but for better understanding, let's print this information differently:

for i, sentence in enumerate(corpus):
    print(f"Authentic sentence: {sentence}")
    print(f"Token IDs: {encoded_corpus_truncated['input_ids'][i]}")
    print(f"Tokens: {tokenizer.convert_ids_to_tokens(encoded_corpus_truncated['input_ids'][i])}")
    print()

 

Output:

Original sentence: Hello, how are you?
Token IDs: [101, 7592, 1010, 2129, 102]
Tokens: ['[CLS]', 'hello', ',', 'how', '[SEP]']

Original sentence: I am learning how to use the Hugging Face Tokenizers library.
Token IDs: [101, 1045, 2572, 4083, 102]
Tokens: ['[CLS]', 'i', 'am', 'learning', '[SEP]']

Original sentence: Tokenization is a crucial step in NLP.
Token IDs: [101, 19204, 3989, 2003, 102]
Tokens: ['[CLS]', 'token', '##ization', 'is', '[SEP]']
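
In practice, padding and truncation are usually combined in a single call, often together with return_tensors so the output can be fed straight into a model. The sketch below assumes PyTorch is installed; max_length=10 is an arbitrary choice.

# Combine padding and truncation and return PyTorch tensors
batch = tokenizer.batch_encode_plus(
    corpus,
    padding="max_length",
    truncation=True,
    max_length=10,
    return_tensors="pt"
)
print(batch["input_ids"].shape)   # torch.Size([3, 10])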

 

This article is part of our amazing series on Hugging Face. If you want to explore more about this topic, here are some references to help you out:

 
 

Kanwal Mehreen Kanwal is a machine learning engineer and a technical writer with a profound passion for data science and the intersection of AI with medicine. She co-authored the ebook "Maximizing Productivity with ChatGPT". As a Google Generation Scholar 2022 for APAC, she champions diversity and academic excellence. She's also recognized as a Teradata Diversity in Tech Scholar, Mitacs Globalink Research Scholar, and Harvard WeCode Scholar. Kanwal is an ardent advocate for change, having founded FEMCodes to empower women in STEM fields.
