Input Pipeline

The input pipeline assumes that each dataset is stored as one or more TFRecord data files somewhere TensorFlow can read them, and that all data files in a dataset share the same feature names and structure. A manifest file for each dataset describes these features and their structure, and must be configured before the dataset can be loaded.

Data files are read, parsed, preprocessed, and batched according to the configuration specified by the user. Features are requested from the dataset and mapped to model inputs or outputs for training and evaluation.

Batch loaders construct a TensorFlow Estimator model’s input_fn. They load requested features from records in the dataset, preprocess them, and construct minibatches to feed into the model.

Dataset Specification

A dataset consists of TFRecord data files together with a manifest file containing global information about the contents of those data files. All data files in a dataset must be consistent with what is described in the manifest.

The manifest file contains a single JSON object with the following keys:

compression (string or null; required)
    The type of compression used for the records. May be "zlib" or "gzip", or null for no compression.

allow_var_len (boolean; required)
    If false, all records are serialized Example protobufs with fixed-length Feature protos. If true, all records are SequenceExample protobufs and may contain variable-length FeatureList protos.

features (list; required)
    A JSON array of objects specifying all features available for loading from the dataset. Only features requested by the running pipeline configuration will be loaded for a model.

Dataset Features

All feature specifiers are JSON objects with the following keys:

name (string; required)
    The name of the feature or feature list in the protobuf.

dtype (string; required)
    The datatype to give the tensor once deserialized, cast if different from what is read from the data file.

shape (list; required)
    The shape to give the tensor once deserialized. For a variable-length sequence, this is the shape of each sequence element instead.

var_len (boolean; required if allow_var_len is true)
    Must be false if allow_var_len is false. Otherwise, specifies whether the feature is a fixed-length context feature (false) or a variable-length feature list (true).

deserialize_type (string; required)
    The type of deserialization used to parse the serialized protobuf into a tensor, as defined below.

deserialize_args (dictionary; optional)
    JSON object of arguments for the specified deserialization type. Defaults to an empty object, but some types require certain arguments to be present.

The types of deserialization available are as follows:

Type: int
Description: The feature is stored in the data files as a TensorFlow int64 list.
Arguments: none

Type: float
Description: The feature is stored in the data files as a TensorFlow float list.
Arguments: none

Type: string
Description: The feature is stored in the data files as a TensorFlow bytes list.
Arguments: none

Type: raw
Description: The feature is stored in the data files as one or more tensors, each encoded as a raw byte string in a TensorFlow bytes list.
Arguments:
    endian (string; required)
        Specifies the endianness of the raw bytes. Must be "little" or "big".
    len (integer; optional)
        The number of raw byte strings in each record. If greater than 1, the tensors are concatenated along a new first axis (before batching). Ignored if var_len is true. Defaults to 1.
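
As a concrete illustration, a manifest for a hypothetical audio dataset might look like the following sketch; the feature names, dtypes, and shapes are invented for illustration and are not part of the specification:

    {
        "compression": "gzip",
        "allow_var_len": true,
        "features": [
            {
                "name": "song_id",
                "dtype": "int64",
                "shape": [],
                "var_len": false,
                "deserialize_type": "int"
            },
            {
                "name": "samples",
                "dtype": "float32",
                "shape": [1],
                "var_len": true,
                "deserialize_type": "float"
            },
            {
                "name": "spectrogram",
                "dtype": "float32",
                "shape": [128],
                "var_len": true,
                "deserialize_type": "raw",
                "deserialize_args": {"endian": "little"}
            }
        ]
    }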

Types of Batch Loaders

Different batch loader types are defined for different use cases. Each type reads files containing TFRecord protobufs either from a specified local directory or from a predefined list of files, optionally shuffles the contents, and outputs minibatches for the computation graph to consume.

The types of batch loaders are:

Type: independent (see Type-Specific Arguments below)
Description: Each data file represents a collection of examples that are independent of all other data files, and each TFRecord inside it represents a single example independent of all other examples.
Use Case Example:
 Each data file is a set of random images, each TFRecord is a single image, and the task is to label each image.

Type: continuous_sequence (see Type-Specific Arguments below)
Description: Each data file represents a contiguous sequence of data, but the data inside each file is independent of the data inside other files. Each TFRecord represents a non-overlapping chunk of the full sequence, such that the concatenation of all records produces a single long sequence with no logical subsequence boundaries. Batch examples are selected by choosing random-length sliding windows from the concatenated full sequence. All selected features must have the same length along the first dimension.
Use Case Example:
 Each data file is a song, each TFRecord is a 10 second segment of audio samples, and the task is to predict the next sample from the previous 3 seconds.

Type: discrete_sequence (see Type-Specific Arguments below)
Description: Each data file again represents a contiguous sequence of data, and again the data inside each file is independent of the data inside other files, but here each full sequence is separable into logical subsequences. Each TFRecord represents a non-overlapping subsequence, such that the concatenation of all records produces a single long sequence with logical subsequence boundaries at the beginning and end of each record. Batch examples are selected by choosing windows containing a random number of subsequences from the concatenated full sequence, aligned at record (subsequence) boundaries.
Use Case Example:
 Each data file is a text document, each TFRecord is a single sentence, and the task is to predict the next word using an LSTM model that resets the hidden and cell states every 3-5 sentences.

Batch Loader Arguments

All batch loaders use the following configuration arguments:

dataset (dictionary; required)
    Specifies the dataset to be loaded. See dataset below.

target_batch_size (integer; required)
    Target size of each batch.

drop_remainder (boolean; required)
    If true, every batch is exactly the same size and unaligned remaining examples are dropped. If false, the final batch may be smaller than the others. Only considered if epochs is not null.

epochs (integer or null; required)
    Number of passes through the dataset. Setting to null means infinite repetitions.

num_parallel_reads (integer; optional)
    Number of parallel threads for reading data. Defaults to 1.

num_parallel_parses (integer; optional)
    Number of parallel threads for parsing data after reading. Defaults to 1.

num_read_buffer_bytes (integer; required)
    Number of bytes in the read buffer. Setting to 0 means no read buffering.

num_prefetch (integer; required)
    Maximum number of batches to prefetch at the end of the input pipeline.

shuffle (boolean; optional)
    Whether to shuffle data records. Defaults to false.

num_shuffle_buffer_elements (integer; required if shuffle is true)
    Number of data records to buffer for shuffling. Ignored if shuffling is disabled.

num_filenames_shuffle_buffer (integer; required if shuffle is true)
    Number of filenames to buffer for shuffling before loading records. Ignored if shuffling is disabled.

num_mix_files (integer; required if shuffle is true)
    Number of files loaded together to shuffle records among. Ignored if shuffling is disabled.

sloppy_interleave (boolean; optional)
    Whether to enable nondeterministic loading for potentially faster batches. Defaults to false.

num_interleave_out_buffer_elements (integer; optional)
    Number of output elements to buffer when interleaving file loading. Defaults to 1.

num_interleave_in_buffer_elements (integer; optional)
    Number of input elements to prefetch when interleaving file loading. Defaults to 1.

primary_features (list; required)
    List of named features read from the dataset. See primary_features below.

secondary_features (list; optional)
    List of named features constructed after the primary features. See secondary_features below. Defaults to an empty list.

processing_steps (list; optional)
    List of processing steps to perform on primary and secondary features. See processing_steps below. Defaults to an empty list.

padding (list or boolean; optional)
    Tensor padding specifications. See padding below. Defaults to false.
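
Putting these together, a complete arguments object for an independent loader might look like the following sketch; the paths and feature names are hypothetical:

    {
        "dataset": {
            "type": "dir",
            "args": {"data_dir": "/data/images"}
        },
        "target_batch_size": 64,
        "drop_remainder": true,
        "epochs": null,
        "num_parallel_reads": 4,
        "num_read_buffer_bytes": 1048576,
        "num_prefetch": 2,
        "shuffle": true,
        "num_shuffle_buffer_elements": 10000,
        "num_filenames_shuffle_buffer": 128,
        "num_mix_files": 4,
        "primary_features": [
            {"from_name": "image", "to_name": "inputs"},
            {"from_name": "label", "to_name": "targets"}
        ]
    }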

Argument Details

dataset

The value of the dataset argument must be a single JSON object with the following keys:

type (string; required)
    The type of dataset specifier, as defined below.

args (dictionary; required)
    Arguments object for the specified type.

The type may be either of the following:

Type: dir
Description: The dataset is loaded recursively from a local directory.
Arguments:
    data_dir (string; required)
        Path to a locally available data directory. The directory must contain a dataset manifest file named __manifest__.json at its top level, and at least one file with a .tfrecords extension storing the TFRecord protobufs. The directory is read recursively and all .tfrecords files are assumed to be data files.

Type: list
Description: The dataset is loaded from an explicit list of files specified by the user.
Arguments:
    manifest_file (string; required)
        Path to a locally available dataset manifest used for all data files.
    list_file (string; required)
        Path to a locally available text file listing an absolute path to each data file, one per line. Paths in the text file may be local paths, cloud locations, URLs, or any file path that TensorFlow can load.
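
For example, the two specifier types might be written as follows; the paths are hypothetical:

    {"type": "dir", "args": {"data_dir": "/data/songs"}}

    {
        "type": "list",
        "args": {
            "manifest_file": "/data/songs/__manifest__.json",
            "list_file": "/data/songs/files.txt"
        }
    }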

primary_features

The value of the primary_features argument must be a list of JSON objects containing the following keys:

from_name (string; required)
    Name of a feature tensor in the dataset, as specified in its manifest.

to_name (string; required)
    Name of the feature tensor once loaded into the input pipeline. Must be one of the model's input or output tensors defined in the model configuration.
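
A sketch of a primary_features list, mapping two hypothetical dataset features onto model tensor names:

    [
        {"from_name": "samples", "to_name": "audio_input"},
        {"from_name": "labels", "to_name": "targets"}
    ]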

Note

Every model input must be satisfied by a primary or secondary feature, or an error will be raised.

Note

Every to_name must be unique or an error will be raised.

Note

Any to_name that is constructed and not used by the model will raise an error.

secondary_features

Secondary features are constructed after all primary features are loaded, and may or may not reference primary features by their to_name. The value of the secondary_features argument must be a list of JSON objects containing the following keys:

to_name (string; required)
    Name of the feature tensor in the input pipeline. Must be one of the model's input or output tensors defined in the model configuration.

type (string; required)
    The type of secondary feature, as defined below.

args (dictionary; optional)
    Arguments object for the specified type. Defaults to an empty object, but some types may require certain arguments to be present.

Secondary features may be any of the following types:

Type: const
Description: Constructs a tensor with all entries set to a constant value.
Arguments:
    shape (list or string; required)
        Either a shape specifier as a JSON array of integers, or the name of a primary feature tensor whose shape to copy at the time of construction. Must not include the batch dimension.
    dtype (string; required)
        Either a TensorFlow datatype specifier or the name of a primary feature tensor whose datatype to copy at the time of construction.
    value (dtype; optional)
        The constant value to populate the tensor with. Defaults to 0, or to the empty string for string tensors.
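
For instance, a const secondary feature that copies its shape from a hypothetical primary feature named audio_input:

    {
        "to_name": "initial_state",
        "type": "const",
        "args": {
            "shape": "audio_input",
            "dtype": "float32",
            "value": 0.0
        }
    }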

Note

Every model input must be satisfied by a primary or secondary feature, or an error will be raised.

Note

Every to_name must be unique or an error will be raised.

Note

Any to_name that is constructed and not used by the model will raise an error.

processing_steps

Processing steps are performed after constructing both primary and secondary feature tensors. They are performed in exactly the order specified. The value of the processing_steps argument must be a list of JSON objects containing the following keys:

tensor (string; required)
    Name of the primary or secondary feature tensor to operate on.

type (string; required)
    The type of processing step, as defined below.

args (dictionary; optional)
    Arguments object for the specified type. Defaults to an empty object, but some types may require certain arguments to be present.

Processing steps may be any of the following types:

Type: slice
Description: Extracts a multidimensional slice from the tensor.
Arguments:
    slice (string; required)
        The slice specifier as a Pythonic multidimensional array slice, with no batch dimension. Only integer indices, negative indices, and colons are allowed, e.g. [3,:,4:,:-2,1:4].
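
For example, a slice step that keeps only the first 100 elements along the first dimension of a hypothetical tensor:

    {
        "tensor": "audio_input",
        "type": "slice",
        "args": {"slice": "[:100]"}
    }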

Note

All tensors are input and output from each step using the names given at construction. Each processing step overwrites the tensor it operates on.

padding

Either all tensors in the pipeline are padded, or none are. For this reason, the value of the padding argument may be true to zero pad all tensors to the batch maximum along each dimension, or false to specify no padding. If more specific padding requirements are needed, the padding argument may be a list of JSON objects with the following keys:

tensor (string; required)
    Name of the feature tensor to pad.

shape (list; optional)
    Shape specifier as a JSON array of integers. Set any dimension to -1 to pad to the batch maximum along that dimension. Must not include the batch dimension. Defaults to the batch maximum for all dimensions.

value (dtype of the padded tensor; optional)
    The constant value to pad the tensor with. Defaults to 0, or to the empty string for string tensors.
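
For example, the following padding list pads a hypothetical audio tensor to a fixed length of 48000 with zeros; per the note below, all other tensors in the pipeline are then padded to the batch maximum along all dimensions:

    [
        {"tensor": "audio_input", "shape": [48000], "value": 0.0}
    ]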

Note

By either setting padding to true or by specifying at least one padding object, all tensors are automatically zero padded to the batch maximum along all dimensions unless a different desired padding specification is explicitly given. Either all tensors are padded, or none are.

Type-Specific Arguments

Each loader type has additional arguments that define its behavior.

Type: independent
Additional Arguments:
    multi_load (boolean; optional)
        If true, records are loaded and parsed in batches. If false, each record is loaded and parsed individually and batched later. Enabling this may boost performance for some input pipelines, but it is only available in very restricted cases: it must be false if the records are variable length or require any secondary features, processing steps, or padding, and it may not be available for all dataset types. Defaults to false.
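
For example, a pipeline of fixed-length records with no secondary features, processing steps, or padding might enable batched parsing:

    {"multi_load": true}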

Type: continuous_sequence
Additional Arguments:
    min_window (integer; required)
        Minimum random size of the sliding window to select.
    max_window (integer; required)
        Maximum random size of the sliding window to select.
    stride (integer or null; optional)
        Sliding window stride, or null to stride by the exact window length with no overlap. Defaults to null.

Note

All selected features must have the same length along the first dimension.
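
Matching the audio use case above, and assuming a hypothetical 16 kHz sample rate, fixed windows of 3 seconds of context plus one target sample could be selected densely with:

    {
        "min_window": 48001,
        "max_window": 48001,
        "stride": 1
    }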


Type: discrete_sequence
Additional Arguments:
    min_window (integer; required)
        Minimum random number of subsequences to include in a selected window.
    max_window (integer; required)
        Maximum random number of subsequences to include in a selected window.

Note

Unlike with the continuous_sequence loader, selected features may have different lengths along their first dimension. They will still be aligned at subsequence boundaries.
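
Matching the text use case above, where the model resets its state every 3-5 sentences:

    {"min_window": 3, "max_window": 5}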