Description
Hi everyone! Thanks a lot for this nice tutorial and the accompanying code for learning about transformers!
I am trying to recreate the example from the tutorial:
https://peterbloem.nl/blog/transformers
I was able to train and serialize a model on the IMDB dataset.
Now I want to test the model on new validation phrases. However, I cannot find a way to tokenize a phrase into the input shape the model expects. The provided sample does:
```python
# Load dataset
tdata, _ = datasets.IMDB.splits(TEXT, LABEL)
train, test = tdata.split(split_ratio=0.8)

# Preprocess data
TEXT.build_vocab(train, max_size=50_000 - 2)
LABEL.build_vocab(train)

# Create iterators
train_iter, test_iter = data.BucketIterator.splits((train, test), batch_size=4, device=util.d())
```
I see that the tokens are generated somewhere inside the `BucketIterator` (or the dataset itself):

```python
for batch in tqdm.tqdm(test_iter):
    input = batch.text[0]
    label = batch.label - 1
```
Looking at the dataset, I can see the phrases already split into words:

```python
print(test_iter.data()[0].text)
print(test_iter.data()[0].label)
```

which prints:

```
['i', "wouldn't", 'rent', 'this', 'one', 'even', 'on', 'dollar', 'rental', 'night.']
neg
```
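If I understand correctly, these tokens come from the `Field`'s preprocessing step when the dataset is built, not from the iterator. Assuming `TEXT` uses the default whitespace tokenizer with `lower=True` (which would match the output above), that step amounts to something like:

```python
# Sketch of what I think the Field's preprocessing does:
# lower-case the raw string and split on whitespace.
raw = "I wouldn't rent this one even on dollar rental night."
tokens = raw.lower().split()
print(tokens)
# -> ['i', "wouldn't", 'rent', 'this', 'one', 'even', 'on', 'dollar', 'rental', 'night.']
```

That reproduces the list printed above, so the splitting itself seems straightforward; what I'm missing is the step from tokens to tensors.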
So, if I want to test a phrase with the model, like:

```python
# Try the model
input = ["this", "movie", "is", "incredible", "boring"]
```

how can I tokenize it correctly to feed it into the model?
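My best guess so far is a sketch like the one below, assuming the token-to-index mapping built by `TEXT.build_vocab(train)` lives in `TEXT.vocab.stoi` (as in legacy torchtext) and that out-of-vocabulary words should map to the `<unk>` index. The tiny vocabulary here is made up for illustration; the real indices would come from `TEXT.vocab.stoi`:

```python
# Toy stand-in for TEXT.vocab.stoi (the real mapping is built from the
# training data and will have different indices).
toy_stoi = {"<unk>": 0, "<pad>": 1, "this": 2, "movie": 3, "is": 4, "boring": 5}

def numericalize(tokens, stoi, unk_index=0):
    """Map each lower-cased token to its vocab index; unknown words map to <unk>."""
    return [stoi.get(t.lower(), unk_index) for t in tokens]

phrase = ["this", "movie", "is", "incredible", "boring"]
indices = numericalize(phrase, toy_stoi)
print(indices)  # -> [2, 3, 4, 0, 5]  ("incredible" is not in the toy vocab)
```

If that is right, then something like `torch.tensor([numericalize(input, TEXT.vocab.stoi)], device=util.d())` should give a batch of one sequence that the model accepts. I also think the legacy `Field.process` method (e.g. `TEXT.process([input], device=util.d())`) does the padding and numericalizing in one call, but I am not sure that is the intended way.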
Thanks in advance for your response.
Greetings!