Training a Neural Network Using SPSS

To specify how the network should be trained, use the Training tab. Which training choices are accessible depends on the optimization algorithm and the training type.

Type of Training. The training type determines how the network processes the records. Select one of the following training types:

Batch. Updates the synaptic weights only after passing all of the records in the training dataset; that is, batch training uses information from all records. Batch training is often preferred because it directly minimizes the total error; however, it may need to update the weights many times, until one of the stopping rules is met, and hence may require many data passes. It works well with “smaller” datasets.

Online. Updates the synaptic weights after every single training data record; that is, online training uses information from one record at a time.

Online training repeatedly obtains a record and updates the weights until one of the stopping rules is met. If all of the records are used once and none of the stopping rules is satisfied, the process continues by recycling the data records.

For “bigger” datasets with related predictors, online training is superior to batch training; that is, if there are many records and many inputs, and their values are not independent of one another, online training can produce a reasonable response more quickly than batch training.

Mini-batch. Divides the training data records into groups of approximately equal size, then updates the synaptic weights after passing one group; that is, mini-batch training uses information from a group of records.

The procedure then recycles the data groups if necessary. For “medium-size” datasets, mini-batch training may be the most advantageous option because it offers a compromise between batch and online training.

You can let the procedure automatically determine the number of training records per mini-batch, or you can enter an integer greater than 1 and less than or equal to the maximum number of cases to store in memory. The maximum number of cases to store in memory is specified on the Options tab.
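The practical difference among the three training types is simply when the synaptic weights are updated during a data pass. The following minimal NumPy sketch is illustrative only (not SPSS's internal implementation); it shows the update schedule for each training type using a simple linear model and gradient descent, with made-up data and settings.

```python
# Illustrative sketch (not SPSS's internal code): how often weights are
# updated under batch, mini-batch, and online training for a simple
# linear model trained with squared error.
import numpy as np

def gradient(w, X, y):
    """Gradient of 0.5 * ||Xw - y||^2 with respect to w."""
    return X.T @ (X @ w - y)

def train(X, y, mode="batch", batch_size=32, lr=0.01, epochs=5):
    rng = np.random.default_rng(0)
    w = rng.normal(scale=0.1, size=X.shape[1])
    n = len(y)
    for _ in range(epochs):                               # one epoch = one data pass
        if mode == "batch":
            groups = [np.arange(n)]                       # all records -> one update per pass
        elif mode == "online":
            groups = [np.array([i]) for i in range(n)]    # one record per update
        else:                                             # "mini-batch"
            idx = rng.permutation(n)
            groups = np.array_split(idx, max(n // batch_size, 1))
        for g in groups:
            w -= lr * gradient(w, X[g], y[g]) / len(g)    # update after each group
    return w

# Example: 200 records, 3 predictors (synthetic data)
X = np.random.default_rng(1).normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5])
print(train(X, y, mode="mini-batch", batch_size=20))
```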

Optimization Algorithm. This is the method used to estimate the synaptic weights.

Scaled conjugate gradient. Conjugate gradient methods can only be used for batch training types because the justifications for their use do not apply to online or mini-batch training.

Gradient descent. This technique can be applied to batch training as well as online or mini-batch training.

Training Options. The optimization algorithm can be adjusted using the training options. Unless the network encounters estimation issues, you usually won’t need to adjust these parameters.

The scaled conjugate gradient algorithm has the following training options:

Initial Lambda. The initial value of the lambda parameter for the scaled conjugate gradient algorithm. Enter a value greater than 0 and less than 0.000001.

Initial Sigma. The initial value of the sigma parameter for the scaled conjugate gradient algorithm. Enter a value greater than 0 and less than 0.0001.

Interval Center and Interval Offset. The interval center (a0) and interval offset (a) define the interval [a0 − a, a0 + a], in which weight vectors are randomly generated when simulated annealing is used.

Simulated annealing is used to break out of a local minimum, with the goal of finding the global minimum, during application of the optimization algorithm. It is employed in weight initialization and in automatic architecture selection. Enter any number for the interval center and a number greater than 0 for the interval offset.
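As a hedged illustration of the interval's role, the hypothetical helper below draws a random weight vector uniformly from [a0 − a, a0 + a]; the center, offset, and seed values are made up, and this is not SPSS's internal simulated annealing routine.

```python
# Minimal sketch: draw random weight vectors from the interval
# [a0 - a, a0 + a] defined by the interval center (a0) and offset (a).
# Hypothetical helper for illustration only.
import numpy as np

def random_weights(n_weights, center=0.0, offset=0.5, seed=None):
    """Draw a weight vector uniformly from [center - offset, center + offset]."""
    if offset <= 0:
        raise ValueError("interval offset must be greater than 0")
    rng = np.random.default_rng(seed)
    return rng.uniform(center - offset, center + offset, size=n_weights)

print(random_weights(5, center=0.0, offset=0.5, seed=42))
```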

The gradient descent algorithm has the following training options:

Initial Learning Rate. The initial value of the learning rate for the gradient descent algorithm. A higher learning rate means the network will train faster, but possibly at the cost of becoming unstable. Enter a value greater than 0.

Lower Boundary of Learning Rate. The lower boundary on the learning rate for the gradient descent algorithm. This setting applies only to online and mini-batch training. Enter a value greater than 0 but less than the initial learning rate.

Momentum. The initial momentum parameter for the gradient descent algorithm. The momentum term helps to prevent instabilities caused by a learning rate that is too high. Enter a value greater than 0.

Learning rate reduction, in Epochs. The number of epochs (p), or data passes of the training sample, required to reduce the initial learning rate to the lower boundary of the learning rate when gradient descent is used with online or mini-batch training. This gives you control of the learning rate decay factor β = (1/(pK)) * ln(η0/ηlow), where η0 is the initial learning rate, ηlow is the lower bound on the learning rate, and K is the total number of mini-batches (or the number of training records for online training) in the training dataset. Specify an integer greater than 0.
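A short worked example of the decay factor: assuming the learning rate decays exponentially per update step, η_k = η0·exp(−β·k), which is consistent with the definition of β above, the rate reaches the lower boundary after exactly p·K steps. The numbers used here are arbitrary.

```python
# Worked example of the decay factor beta = (1/(p*K)) * ln(eta0/eta_low),
# assuming an exponential per-step decay eta_k = eta0 * exp(-beta * k).
import math

def decay_factor(eta0, eta_low, p, K):
    """beta = (1 / (p * K)) * ln(eta0 / eta_low)."""
    return (1.0 / (p * K)) * math.log(eta0 / eta_low)

eta0, eta_low = 0.4, 0.001      # initial learning rate and its lower boundary
p, K = 10, 20                   # epochs to reach the lower bound, mini-batches per epoch
beta = decay_factor(eta0, eta_low, p, K)

eta_final = eta0 * math.exp(-beta * p * K)
print(round(beta, 6), round(eta_final, 6))   # rate after p*K steps is (approximately) eta_low
```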

Output

Network Structure. Displays summary information about the neural network.

Description. Displays information about the neural network, including the dependent variables, the number of input and output units, the number of hidden layers and units, and the activation functions.

Diagram. displays the network diagram as a graphic that cannot be modified. Be aware that the diagram becomes more challenging to understand as the number of variables and factor values rises.

Synaptic weights. displays the coefficient estimates that demonstrate the connection between the units in one layer and the subsequent layer. Even if the active dataset is divided into training, testing, and holdout data, the synaptic weights are still based on the training sample. Keep in mind that there can be a lot of synaptic weights and that they are typically not used to evaluate network results.
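To make the role of the synaptic weights concrete, here is a tiny illustrative forward pass through a network with one hidden layer; the weight values and activation functions below are invented for demonstration and are not drawn from any SPSS output.

```python
# Illustrative forward pass of a tiny multilayer perceptron, showing how the
# synaptic weights connect the units in one layer to the units in the next.
import numpy as np

def forward(x, W1, b1, W2, b2):
    hidden = np.tanh(W1 @ x + b1)          # hidden layer with tanh activation
    output = W2 @ hidden + b2              # identity activation on the output layer
    return output

x  = np.array([0.5, -1.2, 2.0])            # 3 input units (predictors)
W1 = np.array([[0.1, -0.3, 0.2],
               [0.4,  0.0, -0.1]])         # weights: input layer -> 2 hidden units
b1 = np.array([0.05, -0.2])
W2 = np.array([[1.5, -0.7]])               # weights: hidden layer -> 1 output unit
b2 = np.array([0.1])

print(forward(x, W1, b1, W2, b2))
```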

Network Performance. Displays the results used to judge the model’s “goodness.” Note: the charts in this group are based on the combined training and testing samples, or on the training sample only if there is no testing sample.

Model summary

Displays a summary of the neural network results by partition and overall, including the error, the relative error or percentage of incorrect predictions, the stopping rule used to stop training, and the training time.

When the identity, sigmoid, or hyperbolic tangent activation functions are applied to the output layer, the error is the sum-of-squares error. When the softmax activation function is applied to the output layer, it is the cross-entropy error.
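For reference, the two error measures can be computed as follows for a single case; the 0.5 factor in the sum-of-squares error is a common convention, and the exact constants SPSS uses may differ.

```python
# Illustrative computation of the two error measures named above:
# sum-of-squares error for a scale target and cross-entropy error for a
# categorical target scored with softmax probabilities.
import numpy as np

def sum_of_squares_error(y_true, y_pred):
    """0.5 * sum of squared differences (a common convention; constants vary)."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return 0.5 * np.sum((y_true - y_pred) ** 2)

def cross_entropy_error(one_hot, probs):
    """-sum over categories of observed indicator * log(predicted probability)."""
    one_hot, probs = np.asarray(one_hot, float), np.asarray(probs, float)
    return -np.sum(one_hot * np.log(probs))

print(sum_of_squares_error([3.2, 1.0], [2.9, 1.4]))        # scale outputs
print(cross_entropy_error([0, 1, 0], [0.2, 0.7, 0.1]))     # softmax outputs
```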

Relative errors or percentages of incorrect predictions are displayed depending on the measurement levels of the dependent variables. If any dependent variable has a scale measurement level, the average overall relative error (relative to the mean model) is displayed. If all dependent variables are categorical, the average percentage of incorrect predictions is displayed. Relative errors or percentages of incorrect predictions are also shown for the individual dependent variables.

Classification results. Displays a classification table for each categorical dependent variable, by partition and overall. Each table shows the number of cases classified correctly and incorrectly for each dependent variable category. The percentage of total cases that were correctly classified is also reported.

ROC curve. Displays a ROC (Receiver Operating Characteristic) curve for each categorical dependent variable. A table giving the area under each curve is also included. For a given dependent variable, the ROC chart shows one curve for each category. If the dependent variable has two categories, each curve treats the category in question as the positive state versus the other category. If the dependent variable has more than two categories, each curve treats the category in question as the positive state versus the aggregate of all other categories.

Cumulative gains chart. Displays a cumulative gains chart for each categorical dependent variable. As with ROC curves, one curve is displayed for each dependent variable category.

Lift chart. Displays a lift chart for each categorical dependent variable. As with ROC curves, one curve is displayed for each dependent variable category.

Predicted by observed chart. Displays a chart of predicted values compared to observed values for each dependent variable. For categorical dependent variables, clustered boxplots of predicted pseudo-probabilities are displayed for each response category, with the observed response category as the cluster variable. For scale dependent variables, a scatterplot is displayed.

Residual by predicted chart. Displays a residual-by-predicted-value chart for each scale dependent variable. There should be no visible patterns between the residuals and the predicted values. This chart is produced only for scale dependent variables.

Case processing summary. Shows the case processing summary table, which lists the total number of cases included and excluded from the analysis as well as the cases included and excluded from the analysis by the training, testing, and holdout samples.

Independent variable importance analysis. Performs a sensitivity analysis, which computes the contribution of each predictor to the neural network. The analysis is based on the combined training and testing samples, or on the training sample only if there is no testing sample. The importance and normalized importance of each predictor are displayed in a table and a chart. Be aware that sensitivity analysis is computationally expensive and time-consuming if there are many predictors or cases.

Save

Predictions are saved as variables in the dataset using the Save tab.

Save predicted value or category for each dependent variable. This saves the predicted value for scale dependent variables and the predicted category for categorical dependent variables.

Save predicted pseudo-probability or category for each dependent variable. This saves the predicted pseudo-probabilities for categorical dependent variables. A separate variable is saved for each of the first n categories, where n is specified in the Categories to Save column.

Names of Saved Variables. Automatic name generation ensures that you keep all of your work. Custom names let you discard or replace results from previous runs without first deleting the saved variables in the Data Editor.

For categorical dependent variables with softmax activation and cross-entropy error, there is a predicted value for each category, where each predicted value is the probability that the case belongs to that category.

For categorical dependent variables with sum-of-squares error, there is a predicted value for each category, but the predicted values cannot be interpreted as probabilities. The procedure saves these predicted pseudo-probabilities even if any are less than 0 or greater than 1, or if the sum for a given dependent variable is not 1.

The ROC, cumulative gains, and lift charts are created based on pseudo-probabilities. If any of the pseudo-probabilities are less than 0 or greater than 1, or if the sum for a given variable is not 1, they are first rescaled to lie between 0 and 1 and to sum to 1.
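The exact rescaling rule is not spelled out here, so the sketch below shows one plausible scheme, shifting negative values up to 0 and then normalizing so the pseudo-probabilities sum to 1; it is purely an illustration, not SPSS's documented procedure.

```python
# One plausible rescaling scheme (illustrative only): shift values so the
# minimum is 0 if any are negative, then normalize so they sum to 1.
import numpy as np

def rescale_pseudo_probabilities(p):
    p = np.asarray(p, dtype=float)
    if p.min() < 0:                 # shift so the smallest value becomes 0
        p = p - p.min()
    total = p.sum()
    return p / total if total > 0 else np.full_like(p, 1.0 / len(p))

print(rescale_pseudo_probabilities([1.2, -0.1, 0.3]))   # values outside [0, 1]
```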

Export

The synaptic weight estimates for each dependent variable are saved to an XML (PMML) file using the Export tab. This model file can be used to score additional data files by applying model information to them. If split files are defined, this option is not available.

Options

User-Missing Values. Factors must have valid values for a case to be included in the analysis. These controls let you decide whether user-missing values are treated as valid among factors and categorical dependent variables.

Stopping Rules. These rules determine when to stop training the neural network. Training proceeds through at least one data pass. Training can then be stopped according to the following criteria, which are checked in the order listed. In the stopping rule definitions that follow, a step corresponds to a data pass for the online and mini-batch methods and to an iteration for the batch method.

Maximum steps without a decrease in error. The number of steps to allow before checking for a decrease in error. If there is no decrease in error after the specified number of steps, training stops. Specify an integer greater than 0. You can also specify which data sample is used to compute the error.

Choose automatically uses the testing sample if it exists and uses the training sample otherwise.

This option only applies to batch training if a testing sample is available, as batch training ensures a reduction in the training sample error with each data pass.

Both training and test data checks the error for each of these samples; this option applies only if a testing sample exists.

Note: After each complete data pass, online and mini-batch training require an additional data pass in order to compute the training error. This extra data pass can slow training considerably, so it is generally recommended that you supply a testing sample and select Choose automatically in any case.

Maximum training time. Choose whether to set a limit on how long the algorithm can run. Enter a value greater than 0.

Maximum Training Epochs. The maximum number of epochs (data passes) allowed. When the allotted number of epochs is reached, training stops. Specify an integer greater than 0.

Minimum relative change in training error. Training stops if the relative change in the training error compared to the previous step is less than the criterion value. Enter a value greater than 0. For online and mini-batch training, this criterion is ignored if only testing data is used to compute the error.

Minimum relative change in training error ratio. Training stops if the ratio of the training error to the error of the null model is less than the criterion value. The null model predicts the average value for each dependent variable. Enter a value greater than 0. For online and mini-batch training, this criterion is ignored if only testing data is used to compute the error.
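The two relative-error stopping rules can be expressed as simple checks. The sketch below is a hedged illustration using a mean-prediction null model and sum-of-squares error; the tolerance values are arbitrary placeholders, not SPSS defaults.

```python
# Hedged sketch of the two relative-error stopping checks described above:
# the relative change in training error between steps, and the ratio of the
# training error to the error of a null (mean-prediction) model.
import numpy as np

def relative_change_small(prev_error, curr_error, tol=1e-4):
    """True when the relative change in error since the previous step is below tol."""
    return abs(prev_error - curr_error) / max(abs(prev_error), 1e-12) < tol

def error_ratio_small(curr_error, y_train, tol=1e-3):
    """True when training error / null-model error is below tol."""
    null_error = 0.5 * np.sum((y_train - y_train.mean()) ** 2)   # mean-prediction model
    return curr_error / max(null_error, 1e-12) < tol

y = np.array([2.0, 3.5, 1.0, 4.2])
print(relative_change_small(10.31, 10.309), error_ratio_small(0.002, y))
```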

Maximum cases to store in memory. This controls the following settings within the multilayer perceptron algorithms. Enter an integer greater than 1.

The size of the sample used in automatic architecture selection is min (1000, memsize), where memsize is the maximum number of cases that may be stored in memory.

The number of mini-batches in mini-batch training using automatically computed mini-batches is min(max(M/10,2),memsize), where M is the number of cases in the training sample.
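A quick worked example of these two settings, using the formulas above with an assumed memsize of 1000 and a training sample of 4,500 cases:

```python
# Worked example of the two memory-related settings above
# (memsize = maximum cases stored in memory, M = training sample size).
def architecture_sample_size(memsize):
    return min(1000, memsize)

def auto_mini_batch_count(M, memsize):
    return min(max(M // 10, 2), memsize)      # integer division of M/10

memsize, M = 1000, 4500
print(architecture_sample_size(memsize))      # 1000
print(auto_mini_batch_count(M, memsize))      # 450 mini-batches
```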

Contact Us

Statisda uses powerful, easy-to-understand data analysis tools and expertise to build the necessary connection between your data and your vision. Contact us with your data analysis questions.