text_utils module

class text_utils.CharByCharSynhthetizer(rnn, char_init, encode_lambda, onehot_encoder, decode_lambda, ts, n_step, path_out)

Bases: object

Synthetize text (char-by-char) from a trained RNN using a one-hot encoder.

sample(lenght, p)

Weighted sampling of the next character based on the RNN predictions.
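The weighted sampling described above can be sketched as follows. This is a minimal illustration, not the class's actual method: it assumes that p is a probability vector over character indices and that lenght is its size. The helper name sample_index and the optional rng argument are hypothetical.

```python
import numpy as np


def sample_index(length, p, rng=None):
    """Draw one index in [0, length) with probability p[i] (hypothetical helper)."""
    rng = np.random.default_rng() if rng is None else rng
    p = np.asarray(p, dtype=float).ravel()
    # Assumption: renormalize to absorb small rounding drift in a softmax output.
    p = p / p.sum()
    return int(rng.choice(length, p=p))


# All probability mass sits on index 1, so index 1 is always drawn.
idx = sample_index(3, [0.0, 1.0, 0.0], rng=np.random.default_rng(0))
```

Passing a seeded generator keeps the draw reproducible, which helps when comparing synthesized text across training steps.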

class text_utils.OneHotEncoder(length)

Bases: object

One-hot encoder class.

length

The length of the one-hot encoding.

Type

int

__init__(layers)

Constructor.

__call__(x, encode=True)

Encode a sequence of integers into a sequence of one-hot encoded vectors, or decode a sequence of one-hot encoded vectors into a sequence of integers.
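A rough sketch of the encode and decode directions, assuming one-hot vectors are stored as rows of shape (n, length). The real class may use a different layout; one_hot is a hypothetical stand-in for the call operator.

```python
import numpy as np


def one_hot(x, length, encode=True):
    """Hypothetical stand-in for OneHotEncoder.__call__ (row layout assumed)."""
    x = np.asarray(x)
    if encode:
        # (n,) integers -> (n, length) one-hot rows
        return np.eye(length, dtype=int)[x]
    # (n, length) one-hot rows -> (n,) integers
    return x.argmax(axis=1)


codes = np.array([2, 0, 1])
vectors = one_hot(codes, length=3)
restored = one_hot(vectors, length=3, encode=False)
```

Decoding with argmax also works on soft probability rows, which is one reason to prefer it over searching for an exact 1.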

__repr__()

Return the string representation of the class.

text_utils.add_eol_to_text(text, eol='.')

text_utils.char_to_idx(char, chars)

Convert a char to its index in the chars array.

Parameters
  • char (str) – A char.

  • chars (np.ndarray) – All chars.

Returns

The index representation of char, of shape ().

Return type

np.ndarray

Notes

None
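A minimal sketch of the lookup, assuming chars holds each character exactly once. The name char_to_idx_sketch and the error raised for unknown characters are illustrative assumptions, not documented behaviour.

```python
import numpy as np


def char_to_idx_sketch(char, chars):
    """Return the position of char in chars as a 0-d array (assumed behaviour)."""
    matches = np.flatnonzero(chars == char)
    if matches.size == 0:
        # Assumption: the documented function does not say what happens here.
        raise ValueError(f"{char!r} is not in chars")
    return np.asarray(matches[0])


chars = np.array(["a", "b", "c"])
idx = char_to_idx_sketch("b", chars)
```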

text_utils.decode(encoding, chars)

Decode a sequence of indices into a sequence of chars based on the encoder.

Parameters
  • encoding (np.ndarray) – The sequence of index representations of the chars, of shape (n_chars,).

  • chars (np.ndarray) – All chars.

Returns

decoding – The sequence of chars, of shape (n_chars,)

Return type

np.ndarray

Notes

None
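Decoding can be sketched with NumPy integer indexing. This assumes chars is a 1-D array of single characters; decode_sketch is a hypothetical name, not the module's implementation.

```python
import numpy as np


def decode_sketch(encoding, chars):
    """Map each index to its char via integer array indexing (assumed approach)."""
    return chars[np.asarray(encoding, dtype=int)]


chars = np.array(["a", "b", "c"])
decoding = decode_sketch(np.array([2, 0, 1]), chars)
```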

text_utils.encode(decoding, chars)

Encode a sequence of chars into a sequence of indices based on the encoder.

Parameters
  • decoding (np.ndarray) – The sequence of chars, of shape (n_chars,)

  • chars (np.ndarray) – All chars.

Returns

encoding – The sequence of index representations of the chars, of shape (n_chars,).

Return type

np.ndarray

Notes

None
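One possible way to encode, assuming every char in decoding appears in chars. The dictionary lookup is an illustrative choice and encode_sketch is a hypothetical name.

```python
import numpy as np


def encode_sketch(decoding, chars):
    """Map each char to its index in chars with a dict lookup (assumed approach)."""
    lookup = {c: i for i, c in enumerate(np.asarray(chars).tolist())}
    return np.array([lookup[c] for c in np.asarray(decoding).tolist()], dtype=int)


chars = np.array(["a", "b", "c"])
encoding = encode_sketch(np.array(list("cab")), chars)
```

Building the dictionary once makes each lookup constant time, which matters for long sequences.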

text_utils.give_emoji_free_text(text)

Return text with emoji removed. See https://stackoverflow.com/a/50602709

text_utils.idx_to_char(idx, chars)

Convert an index to the corresponding char in the chars array.

Parameters
  • idx (int) – The index representation of a char.

  • chars (np.ndarray) – All chars.

Returns

The char.

Return type

str

Notes

None
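The inverse of char_to_idx can be sketched in one line. This assumes idx is a valid position in chars; idx_to_char_sketch is a hypothetical name.

```python
import numpy as np


def idx_to_char_sketch(idx, chars):
    """Return the char stored at position idx as a plain str (assumed behaviour)."""
    return str(chars[int(idx)])


chars = np.array(["a", "b", "c"])
char = idx_to_char_sketch(2, chars)
```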

text_utils.limit_text_length(df, col_name, max_length=140)

text_utils.make_decoded_dataset(dataset)

Decode a dataset of strings into a list of character lists.

Parameters

dataset (list) – A list of strings (contexts), possibly of varying length.

Returns

decoded_dataset – A list of lists (contexts) where a context is a list of characters.

Return type

list

Notes

None
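Splitting each context into characters might look like the sketch below; make_decoded_dataset_sketch is a hypothetical name and the list comprehension is an assumed implementation.

```python
def make_decoded_dataset_sketch(dataset):
    """Turn each string (context) into a list of its characters."""
    return [list(context) for context in dataset]


decoded = make_decoded_dataset_sketch(["hi", "abc"])
```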

text_utils.make_encoded_dataset(decoded_dataset, chars)

Encode a dataset of character lists into lists of integers.

Parameters
  • decoded_dataset (list) – A list of lists (contexts) where a context is a list of characters.

  • chars (np.ndarray) – All chars.

Returns

encoded_dataset – A list of lists (contexts) where a context is a list of integers. An integer corresponds to its index in chars.

Return type

list

Notes

None
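A sketch of encoding a whole dataset, assuming every character appears in chars. The name make_encoded_dataset_sketch and the dictionary lookup are illustrative assumptions.

```python
import numpy as np


def make_encoded_dataset_sketch(decoded_dataset, chars):
    """Replace each character by its index in chars, context by context."""
    lookup = {c: i for i, c in enumerate(np.asarray(chars).tolist())}
    return [[lookup[c] for c in context] for context in decoded_dataset]


chars = np.array(["a", "b", "h", "i"])
encoded = make_encoded_dataset_sketch([["h", "i"], ["a", "b", "a"]], chars)
```

Contexts keep their original lengths, so the result stays a ragged list of lists rather than a single array.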

text_utils.make_one_hot_encoded_dataset(encoded_dataset, onehot_encoder)

One-hot encode a dataset of integer lists into a list of one-hot encoded vectors.

Parameters
  • encoded_dataset (list) – A list of lists (contexts) where a context is a list of integers. An integer corresponds to its index in chars.

  • onehot_encoder (OneHotEncoder) – A one-hot encoder initialized with chars (all unique characters in the dataset).

Returns

onehot_encoded_dataset – A list of one-hot encoded vectors (contexts). The index of 1s in the vectors corresponds to the index of the character in chars.

Return type

list

Notes

None
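The per-context encoding can be sketched as below. The stand-in encoder, its row layout of shape (n, n_chars), and the name make_one_hot_encoded_dataset_sketch are assumptions for illustration; the real OneHotEncoder may differ.

```python
import numpy as np

N_CHARS = 4  # assumed vocabulary size for this example


def stand_in_encoder(x):
    """Hypothetical replacement for a OneHotEncoder instance (row layout assumed)."""
    return np.eye(N_CHARS, dtype=int)[np.asarray(x)]


def make_one_hot_encoded_dataset_sketch(encoded_dataset, onehot_encoder):
    """Apply the encoder to each context of integer indices."""
    return [onehot_encoder(np.asarray(context)) for context in encoded_dataset]


onehot = make_one_hot_encoded_dataset_sketch([[2, 3], [0, 1, 0]], stand_in_encoder)
```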

text_utils.synthetize(rnn, eol, chars, onehot_encoder, ts, path_out)

text_utils.unique_characters(data)

Get the unique characters in a dataset.

Parameters

data (list) – A list of strings. The strings may be of different lengths.

Returns

The list of unique characters in all of the strings in data.

Return type

np.ndarray

Notes

None
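Collecting the vocabulary might look like this sketch. The sorted order is an assumption, since the documentation does not state how the characters are ordered; unique_characters_sketch is a hypothetical name.

```python
import numpy as np


def unique_characters_sketch(data):
    """Return the distinct characters across all strings, sorted (order assumed)."""
    return np.array(sorted(set("".join(data))))


chars = unique_characters_sketch(["abc", "cab", "d"])
```

A fixed ordering keeps character indices stable between runs, which matters when a trained model is reloaded.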