This article covers several topics:

- Why do we need to split data into training and test datasets to build a prediction model?
- Why do many published articles with the title “Factors predicting” not split data into training and test datasets? Is it because the analysis was done in SPSS?
- Are “to identify significant predictors” and “to develop a prediction model” the same aim?

**Why do we need to split data into training and test datasets to build a prediction model?**

The simple answer:

If you are building a model for prediction, you need to split the data into training and test sets to avoid overfitting. If the goal of linear regression is only to study and describe the data at hand, splitting is not required. But keep in mind that if you fit the model to a single dataset, it may overfit that data by overweighting unimportant variables. To build a prediction model, therefore, you need to split your data into training and test sets so that you can obtain a realistic evaluation of the learned model. If you evaluate the model on the training data, you obtain an optimistic measure of its goodness; you should instead use a separate set, one not seen during training, for a realistic evaluation. Without this, your results support only a correlational model (fitted to the training set), not a prediction model.

Long answer:

Splitting data into training and test datasets is a fundamental step in building prediction models to evaluate their performance and assess their generalization capabilities. Here are the main reasons why this separation is necessary:

- Model Development: The training dataset is used to train the prediction model. During the training process, the model learns patterns, relationships, and dependencies within the data. By providing labeled examples, the model adjusts its parameters and optimizes its internal representation to minimize the prediction errors on the training data.
- Model Evaluation: The test dataset serves as an unbiased evaluation set to assess how well the model performs on unseen data. It simulates real-world scenarios where the model encounters new, previously unseen instances. By measuring the model’s performance on the test data, we gain insights into its ability to generalize and make accurate predictions on unseen data points.
- Preventing Overfitting: Overfitting occurs when a model becomes too specialized in fitting the training data to the extent that it performs poorly on new, unseen data. By evaluating the model’s performance on the test dataset, we can detect overfitting. If the model performs significantly worse on the test data compared to the training data, it indicates overfitting. Regularization techniques and model adjustments can be applied to mitigate overfitting issues.
- Hyperparameter Tuning: Models often have hyperparameters that need to be set before training. Hyperparameters control the learning process and affect the model’s performance. By splitting the data, we can use the training set to tune and optimize these hyperparameters, and then evaluate their impact on the model’s performance using the test set. This helps in finding the optimal configuration of hyperparameters that yields the best predictive performance.
- Unbiased Performance Estimation: By keeping the test dataset separate from the training process, we obtain an unbiased estimate of the model’s performance. This estimation provides a more accurate representation of how well the model is expected to perform on new, unseen data in real-world scenarios.
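The workflow described above can be sketched in a few lines of Python. This is an illustrative example only: the data are simulated, and a simple one-predictor least-squares line stands in for whatever model you would actually fit. The point is that parameters are estimated on the training split only, and error is then measured on the held-out test split.

```python
import random

def train_test_split(rows, test_frac=0.3, seed=0):
    """Shuffle a copy of the data and split it into (train, test) lists."""
    rng = random.Random(seed)
    shuffled = rows[:]                      # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    n_test = int(len(rows) * test_frac)
    return shuffled[n_test:], shuffled[:n_test]

def fit_line(pairs):
    """Ordinary least squares for y = a + b*x."""
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    b = sum((x - mx) * (y - my) for x, y in pairs) / sum((x - mx) ** 2 for x, _ in pairs)
    a = my - b * mx
    return a, b

def mse(pairs, a, b):
    """Mean squared prediction error of the fitted line on a dataset."""
    return sum((y - (a + b * x)) ** 2 for x, y in pairs) / len(pairs)

rng = random.Random(1)
data = [(x, 2 * x + 1 + rng.gauss(0, 1)) for x in range(200)]   # simulated data
train, test = train_test_split(data)
a, b = fit_line(train)                       # parameters come from the training set only
print(mse(train, a, b), mse(test, a, b))     # the test MSE is the honest estimate
```

Evaluating on `test`, which the fitting step never saw, is what gives the unbiased performance estimate described above; reporting only the training-set error would be the optimistic measure warned about earlier.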

**Why do many published articles with the title “Factors predicting” not split data into training and test datasets? Is it because the analysis was done in SPSS?**

It is common for research articles to have titles such as “Factors predicting” without explicitly mentioning the use of a training and test dataset split. There could be a few reasons for this:

- Different Methodologies: Different research studies employ different methodologies and statistical techniques. While splitting data into training and test datasets is a common practice in machine learning and predictive modeling, other statistical methods may be used in certain research domains. For example, in traditional statistical analyses, such as regression analysis or analysis of variance (ANOVA), the focus is often on estimating the relationships between variables rather than predicting outcomes on unseen data. In such cases, data splitting may not be necessary or explicitly mentioned.
- Emphasis on Relationships and Associations: Some research studies may be more focused on exploring relationships and associations between variables rather than predicting outcomes on new data. In these cases, the goal is to understand the factors that are associated with certain outcomes or to identify significant predictors. The emphasis is on statistical inference rather than prediction. While it is still important to evaluate the generalizability of findings, the specific use of a training and test dataset split may not be mentioned.
- Limited Reporting Space: Research articles often have limitations in terms of word count or space for reporting methodological details. As a result, some authors may not provide an extensive description of the data splitting process or the specific techniques used. They might assume that the reader is already familiar with common practices, or they may prioritize reporting other aspects of their study, such as the results or interpretation of findings.
- Use of Traditional Statistical Software: The question mentions SPSS, which is a widely used statistical software package. SPSS is primarily designed for traditional statistical analyses, and while it does have some machine learning capabilities, its focus is not primarily on predictive modeling. Therefore, researchers using SPSS may not explicitly mention data splitting because their analysis approach may not involve training and test datasets in the same way as machine learning models.

**So, can we do splitting data in SPSS when we do regression analysis?**

Yes, you can perform data splitting in SPSS when conducting regression analysis or any other analysis. While SPSS is commonly associated with traditional statistical analyses, it does provide options for data splitting and model validation.

In SPSS, you can split your data into training and test datasets using various techniques. One common approach is to randomly assign cases to either the training or test dataset. You can do this by creating a new binary variable (e.g., “Split”) and using SPSS syntax or the Data > Select Cases menu to randomly assign a value of 0 or 1 to each case.

Once you have split your data, you can conduct the regression analysis separately on the training dataset and evaluate its performance on the test dataset. You can examine various regression metrics, such as R-squared, adjusted R-squared, and regression coefficients, to assess the model’s fit and predictive power.
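Of the metrics just mentioned, adjusted R-squared is the one that penalizes model complexity: it shrinks R-squared according to how many predictors were used relative to the sample size. A minimal sketch of the standard formula (the example numbers are arbitrary):

```python
def adjusted_r2(r2, n_cases, n_predictors):
    # Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1),
    # where n is the number of cases and p the number of predictors.
    return 1 - (1 - r2) * (n_cases - 1) / (n_cases - n_predictors - 1)

# With R^2 = 0.50 from 100 cases and 10 predictors:
print(round(adjusted_r2(0.50, 100, 10), 4))  # 0.4438
```

Note that the penalty grows as predictors are added: the same R-squared of 0.50 looks less impressive when it was bought with many predictors and few cases.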

It is worth noting that some SPSS procedures and add-on modules also support cross-validation, such as k-fold cross-validation or leave-one-out cross-validation, which can be used to further validate and assess the model’s performance.

While SPSS is more commonly associated with traditional statistical analyses, it is important to adapt and apply appropriate model validation techniques, such as data splitting, when conducting predictive modeling or regression analysis to ensure robust and reliable results.

**Here are the steps to perform data splitting in SPSS:**

1) Open your dataset: Start by opening your dataset in SPSS. You can do this by going to File > Open > Data.

2) Create a new variable for data splitting: Create a new binary variable that will indicate whether each case belongs to the training or test dataset. For example, you can create a variable named “Split” and assign a value of 0 or 1 to each case.

To create a new variable using SPSS syntax, go to Transform > Compute Variable. In the Compute Variable dialog box, provide a name for the new variable (e.g., “Split”) and use the syntax to randomly assign values of 0 or 1 to each case. For example, you can use the following syntax:

COMPUTE Split = RV.BERNOULLI(0.5).
EXECUTE.

This syntax assigns a value of 0 or 1 with equal probability (50% chance for each value) to each case in the dataset.
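The logic of RV.BERNOULLI(0.5) can be mirrored in a short Python sketch (the case indices here are hypothetical), which also illustrates a caveat worth knowing: an independent coin flip per case gives approximately, not exactly, a 50/50 split.

```python
import random

rng = random.Random(0)
n_cases = 1000

# Each case independently gets 0 or 1 with probability 0.5, like RV.BERNOULLI(0.5).
split = [1 if rng.random() < 0.5 else 0 for _ in range(n_cases)]

train_ids = [i for i, s in enumerate(split) if s == 0]
test_ids = [i for i, s in enumerate(split) if s == 1]
print(len(train_ids), len(test_ids))  # roughly 500/500, rarely exactly equal
```

If exactly equal group sizes matter for your study, shuffle the case order and split at the midpoint instead of flipping a coin per case.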

If you prefer a graphical interface, you can use Data > Select Cases instead. Choose “Random sample of cases” and specify approximately 50% of cases; SPSS creates a filter variable (named filter_$ by default) with values of 0 and 1 that can serve as the split indicator.

3) Split the dataset: Once you have created the variable for data splitting, you need to split the dataset based on the values in this variable.

If you used SPSS syntax to create the “Split” variable, you can use the SELECT IF command to split the dataset. For example, to select the cases for the training dataset (Split = 0) and save them to a new file, you can use the following syntax:

TEMPORARY.
SELECT IF (Split = 0).
SAVE OUTFILE='path_to_save\training_dataset.sav'.

The TEMPORARY command ensures that SELECT IF filters cases only for the SAVE that follows; the full dataset remains active, so you can then select the cases for the test dataset (Split = 1) in the same way and save them to another file.

Alternatively, instead of saving separate files, you can go to Data > Split File, select the split indicator variable, choose “Organize output by groups,” and click OK. Subsequent analyses will then be run separately on the training and test groups.

4) Analyze the datasets separately: After splitting the dataset, you can now perform your desired analyses on the training and test datasets independently. For example, you can conduct regression analysis on the training dataset and assess the model’s performance on the test dataset.

Remember to interpret and compare the results obtained from each dataset separately to evaluate the model’s fit and predictive ability.

By following these steps, you can successfully split your data into training and test datasets in SPSS for further analysis and model validation.

**Are “to identify significant predictors” and “to develop a prediction model” the same aim?**

While “identifying significant predictors” and “developing a prediction model” share similarities, they are not exactly the same aim. Let’s understand the differences between these two objectives:

- Identifying Significant Predictors: When the goal is to identify significant predictors, the focus is on understanding the relationship between the predictor variables and the outcome variable. This typically involves conducting statistical analyses, such as regression analysis or hypothesis testing, to determine which predictors have a statistically significant association with the outcome. The emphasis is on identifying and quantifying the strength and direction of these associations. The purpose is to gain insights into the factors or variables that are related to the outcome of interest.
- Developing a Prediction Model: Developing a prediction model involves constructing a mathematical or statistical model that can predict the outcome variable based on one or more predictor variables. The aim is to create a model that can generalize well to new, unseen data and accurately predict the outcome variable for new instances. This process often entails more than just identifying significant predictors. It also includes techniques such as feature selection, model training, model validation, and evaluating the model’s predictive performance.

While identifying significant predictors can be a preliminary step in developing a prediction model, it is not the sole objective of prediction modeling. Prediction models go beyond determining significant predictors by incorporating various techniques to optimize the model’s predictive accuracy, handle overfitting, and assess the model’s performance on unseen data.
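The difference can be made concrete with a simulation: with a large sample, a predictor can be highly statistically significant yet nearly useless for prediction. In the made-up example below, x has a real but weak effect on y, which yields a tiny R-squared alongside an enormous t-statistic.

```python
import math
import random

rng = random.Random(3)
n = 10_000
x = [rng.gauss(0, 1) for _ in range(n)]
# A real but weak effect: x explains only about 1% of the variance in y.
y = [0.1 * xi + rng.gauss(0, 1) for xi in x]

# Pearson correlation between x and y.
mx, my = sum(x) / n, sum(y) / n
sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
sxx = sum((xi - mx) ** 2 for xi in x)
syy = sum((yi - my) ** 2 for yi in y)
r = sxy / math.sqrt(sxx * syy)

t = r * math.sqrt((n - 2) / (1 - r * r))  # t-statistic for the regression slope
print(round(r * r, 3), round(t, 1))  # R^2 is tiny, yet t is far past any cutoff
```

So x would be reported as a “significant predictor” with a very small p-value, yet a model built on it predicts y barely better than the mean: significance answers the inference question, not the prediction one.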

**But many articles overclaim that, after identifying significant predictors, they have successfully built a prediction model, especially when using stepwise multiple regression.**

Many articles may overclaim the success of building a prediction model based solely on identifying significant predictors, particularly when using techniques like stepwise multiple regression. This overemphasis on identifying significant predictors and relying on stepwise regression can lead to misleading or overly optimistic claims about the predictive performance of the model. Here are a few reasons why this approach can be problematic:

- Inflated Type I Error: Stepwise regression techniques, including forward selection, backward elimination, or both, involve iteratively selecting or removing predictors based on their p-values or other criteria. This process can lead to inflated Type I error rates, meaning that predictors may be included or excluded from the model purely by chance or noise in the data. This can result in an overly complex model that performs well on the training data but fails to generalize to new data.
- Neglecting Model Assumptions: Stepwise regression and similar techniques often disregard important assumptions of regression models, such as linearity, independence, and homoscedasticity. Ignoring these assumptions can lead to biased parameter estimates and unreliable predictions.
- Ignoring Model Complexity and Overfitting: Stepwise regression may introduce unnecessary complexity to the model by including predictors that may not be truly associated with the outcome. This can lead to overfitting, where the model becomes overly tailored to the training data, resulting in poor performance on new, unseen data. Overfitting can occur when models are overly flexible and include too many variables relative to the sample size.
- Lack of External Validation: Over-reliance on stepwise regression for model building can result in a lack of external validation. Without evaluating the model’s performance on an independent test dataset or using cross-validation techniques, it is challenging to ascertain the true predictive accuracy and generalizability of the model.

To build reliable and robust prediction models, it is essential to follow best practices that go beyond simply identifying significant predictors. This includes careful consideration of model assumptions, appropriate feature selection techniques, regularization methods, model evaluation on independent datasets, and the use of alternative model evaluation metrics beyond p-values, such as measures of predictive accuracy (e.g., accuracy, precision, recall, ROC curve, etc.).

It is crucial to critically evaluate research articles and be cautious when claims about prediction models are based solely on identifying significant predictors through stepwise regression or similar techniques.
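One of these pitfalls is easy to simulate. If you screen many candidate predictors and keep whichever looks best in-sample, which is the essence of stepwise selection, even pure noise can look like a meaningful predictor. A minimal sketch, in which every variable is independent random noise by construction:

```python
import math
import random

rng = random.Random(7)
n_cases, n_candidates = 50, 100

# Outcome and all candidate predictors are independent noise:
# by construction, nothing truly predicts y.
y = [rng.gauss(0, 1) for _ in range(n_cases)]
X = [[rng.gauss(0, 1) for _ in range(n_cases)] for _ in range(n_candidates)]

def corr(a, b):
    """Pearson correlation between two equal-length lists."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    num = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))
    den = math.sqrt(sum((ai - ma) ** 2 for ai in a) * sum((bi - mb) ** 2 for bi in b))
    return num / den

# Stepwise-style screening: keep the candidate with the strongest
# in-sample correlation with the outcome.
best = max(range(n_candidates), key=lambda j: abs(corr(X[j], y)))
print(abs(corr(X[best], y)))  # typically sizeable, despite every predictor being noise
```

The selected “predictor” owes its apparent strength entirely to the selection step, which is exactly why its performance collapses on an independent test dataset, and why external validation is indispensable.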

**Can “factors predicting,” in the sense of identifying significant predictors, be considered similar to “factors influencing”?**

“Factors predicting” and “factors influencing” can be similar in the context of research. Both phrases indicate an interest in understanding the variables or factors that have an association or impact on a particular outcome or dependent variable. However, there can be slight nuances in their usage:

- Factors Predicting: When researchers use the phrase “factors predicting,” they often focus on identifying the variables or factors that can be used to build a predictive model. The emphasis is on determining which variables are statistically significant predictors of the outcome of interest. The goal is to develop a model that can accurately predict the outcome based on the identified predictors. This typically involves techniques such as regression analysis or machine learning algorithms.
- Factors Influencing: On the other hand, the phrase “factors influencing” generally implies a broader scope of investigation. It encompasses the identification and understanding of variables that have an impact on the outcome variable, whether or not prediction is the primary objective. The goal is to explore and describe the relationships between the predictor variables and the outcome, without necessarily focusing on developing a predictive model. This can involve various statistical methods, such as correlation analysis, regression analysis, or qualitative research approaches.

In other words, when the main focus is on identifying significant predictors, the aim is similar to identifying the “factors influencing” the outcome. In this context, the primary goal is to assess and understand the relationships between predictor variables and the outcome variable rather than to develop a prediction model.

The emphasis is on exploring and evaluating the influence or impact of different factors on the outcome variable. This can involve analyzing the strength and direction of associations, examining correlations, conducting hypothesis tests, or applying other statistical techniques to determine the significance of these relationships.

While developing a prediction model is not the primary objective in this case, the insights gained from identifying significant predictors or factors influencing the outcome can still be valuable for understanding the underlying mechanisms and informing decision-making processes.

It’s important to consider the specific aims and objectives of a research study when interpreting the terms “identifying significant predictors” and “factors influencing.” The context and research goals play a crucial role in determining the specific focus and methodology employed in the analysis.

**Cite this article as: Gunawan, J. (2023). Why many published articles with the title “Factors predicting” do not split data into training and test datasets? Is it because they did the analysis in SPSS? Available from why-many-published-articles-with-the-title-factors-predicting-do-not-spilt-data-into-training-and-test-dataset-is-it-because-they-did-in-spss-program**