Save tensorflow model with StringLookup layer with encoded vocabulary

I'm having trouble saving a trained TensorFlow model that contains a StringLookup layer. I'm required to use TFRecords as input for training. A minimal example to reproduce the issue:

First I define the training data

import numpy as np
import tensorflow as tf
vocabulary = [str(i) for i in range(100, 200)]
X_train = np.random.choice(vocabulary, size=(100,))
y_train = np.random.choice([0,1], size=(100,))

I save it in a file as tfrecords

def _int64_feature(value):
    return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))
def _string_feature(value):
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[str(value).encode('utf-8')]))
with tf.io.TFRecordWriter('train.tfrecords') as writer:
    for i in range(len(X_train)):
        example = tf.train.Example(features=tf.train.Features(feature={
            'user_id': _string_feature(X_train[i]),
            'label': _int64_feature(y_train[i])
        }))
        writer.write(example.SerializeToString())

Then I use the tf.data API to be able to stream the data into training (the original data doesn't fit into memory)

data = tf.data.TFRecordDataset(['train.tfrecords'])
features = {
    'user_id': tf.io.FixedLenFeature([], tf.string),
    'label': tf.io.FixedLenFeature([], tf.int64)
}
def parse(record):
    parsed = tf.io.parse_single_example(record, features)
    return (parsed['user_id'], parsed['label'])
data = data.map(parse)

The data looks like this:

print(list(data.take(5).as_numpy_iterator()))
[(b'166', 1), (b'144', 0), (b'148', 1), (b'180', 0), (b'192', 0)]

The strings of the original dataset were converted to bytes in the process. I have to pass a byte-encoded vocabulary to the StringLookup constructor, since passing strings and training with bytes throws an error:

new_vocab = [w.encode('utf-8') for w in vocabulary]
inp = tf.keras.Input(shape=(1,), dtype=tf.string)
x = tf.keras.layers.StringLookup(vocabulary=new_vocab)(inp)
x = tf.keras.layers.Embedding(len(new_vocab)+1, 32)(x)
out = tf.keras.layers.Dense(1, activation='sigmoid')(x)
model = tf.keras.Model(inputs=[inp], outputs=[out])
model.compile(optimizer='adam', loss='BinaryCrossentropy')
model.fit(data.batch(10), epochs=5)

But when I try to save the model, I get an error because the vocabulary passed to the StringLookup layer is encoded as bytes and can't be dumped to JSON:

model.save('model/')
TypeError: ('Not JSON Serializable:', b'100')
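
The same failure can be reproduced without TensorFlow. The sketch below assumes, as the error message suggests, that saving serializes the layer configuration (including the vocabulary) to JSON. The helper `is_json_serializable` is a hypothetical name introduced only for this illustration.

```python
import json

# A shortened version of the vocabulary from the question.
vocabulary = [str(i) for i in range(100, 103)]
new_vocab = [w.encode("utf-8") for w in vocabulary]


def is_json_serializable(value):
    """Return True if json.dumps accepts the value, False on TypeError."""
    try:
        json.dumps(value)
        return True
    except TypeError:
        return False


# A list of str can be dumped; a list of bytes cannot.
str_ok = is_json_serializable({"vocabulary": vocabulary})
bytes_ok = is_json_serializable({"vocabulary": new_vocab})
print(str_ok, bytes_ok)
```

Python's `json` module has no default encoding for `bytes` objects, so any configuration that holds the byte-encoded vocabulary would fail in the same way.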

I really don't know what to do. I read that TensorFlow recommends using encoded strings instead of normal strings, but that doesn't allow me to save the model. I also tried to preprocess the data by decoding the strings before they are fed to the model, but I wasn't able to do it using only tf.data operations, i.e. without loading all the data into memory.


1 Answer

Using your data and original vocabulary:

import tensorflow as tf
import numpy as np
vocabulary = [str(i) for i in range(100, 200)]
X_train = np.random.choice(vocabulary, size=(100,))
y_train = np.random.choice([0,1], size=(100,))
...
...
data = data.map(parse)

I ran your code (with an additional Flatten layer) and was able to save your model:

inp = tf.keras.Input(shape=(1,), dtype=tf.string)
x = tf.keras.layers.StringLookup(vocabulary=vocabulary)(inp)
x = tf.keras.layers.Embedding(len(vocabulary)+1, 32)(x)
x = tf.keras.layers.Flatten()(x)
out = tf.keras.layers.Dense(1, activation='sigmoid')(x)
model = tf.keras.Model(inputs=[inp], outputs=[out])
model.compile(optimizer='adam', loss='BinaryCrossentropy')
model.fit(data.batch(10), epochs=5)
model.save('model/')
Epoch 1/5
10/10 [==============================] - 1s 8ms/step - loss: 0.6949
Epoch 2/5
10/10 [==============================] - 0s 4ms/step - loss: 0.6864
Epoch 3/5
10/10 [==============================] - 0s 5ms/step - loss: 0.6787
Epoch 4/5
10/10 [==============================] - 0s 5ms/step - loss: 0.6707
Epoch 5/5
10/10 [==============================] - 0s 5ms/step - loss: 0.6620
INFO:tensorflow:Assets written to: model/assets

I do not see why you need new_vocab = [w.encode('utf-8') for w in vocabulary].
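
A quick way to check this is to call the layer directly on byte strings. This is a minimal sketch, assuming a TensorFlow 2.x install where `tf.keras.layers.StringLookup` is available; the sample batch (including the out-of-vocabulary value `b"999"`) is made up for illustration.

```python
import tensorflow as tf

# Plain Python strings, exactly as in the original vocabulary.
vocabulary = [str(i) for i in range(100, 200)]
lookup = tf.keras.layers.StringLookup(vocabulary=vocabulary)

# tf.string tensors hold bytes, which is what parsing a TFRecord yields.
batch = tf.constant([b"100", b"101", b"999"])

# Index 0 is the default out-of-vocabulary slot, so known terms start at 1.
indices = lookup(batch).numpy().tolist()
print(indices)
```

The byte inputs `b"100"` and `b"101"` resolve to the first two vocabulary slots even though the vocabulary was given as `str`, so the layer can be built from the original list.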

If you really need to use new_vocab, you can try setting it during training and afterwards setting vocabulary for saving your model, since the only difference is the encoding:

new_vocab = [w.encode('utf-8') for w in vocabulary]
lookup_layer = tf.keras.layers.StringLookup()
lookup_layer.adapt(new_vocab)
inp = tf.keras.Input(shape=(1,), dtype=tf.string)
x = lookup_layer(inp)
x = tf.keras.layers.Embedding(len(new_vocab)+1, 32)(x)
x = tf.keras.layers.Flatten()(x)
out = tf.keras.layers.Dense(1, activation='sigmoid')(x)
model = tf.keras.Model(inputs=[inp], outputs=[out])
model.compile(optimizer='adam', loss='BinaryCrossentropy')
model.fit(data.batch(10), epochs=5)
model.layers[1].adapt(vocabulary)
model.save('model/')

Admittedly, this is quite hacky.
