Evaluating Training Data Variance for Neural Network Generalization

Type: BA thesis

Status: open

Supervisors: Linda-Sophie Schneider

Work description
The objective of this project is to investigate the influence of training data variance on the generalization capabilities of neural networks (NNs). Through systematic evaluation and experimentation within a specified domain, such as CAD files, the project aims to derive strategies for assembling training datasets that measurably improve NN performance across various tasks. It will explore several dimensions of data diversity, including feature range, sample complexity, and inter-sample variability, in order to characterize the relationship between data variance and NN generalization.

The following questions should be considered:

  • What metrics can effectively quantify the variance in a training dataset?
  • How does the variance within a training set impact the neural network’s ability to generalize to new, unseen data?
  • What is the optimal balance of diversity and specificity in a training dataset to maximize NN performance?
  • How can training datasets be curated to include a beneficial level of variance without compromising the quality of the neural network’s output?
  • What methodologies can be implemented to systematically adjust the variance in training data and evaluate its impact on NN generalization?
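As a concrete starting point for the first question, simple statistics such as per-feature variance and mean pairwise distance between samples can serve as dataset-variance metrics. The sketch below is illustrative only, not part of the thesis specification: it assumes feature vectors (e.g. descriptors extracted from CAD files) are already available as a PyTorch tensor, and the function name and metric choices are hypothetical.

```python
import torch

def dataset_variance_metrics(features: torch.Tensor) -> dict:
    """Compute two simple diversity metrics for a dataset of feature vectors.

    features: (n_samples, n_features) tensor, e.g. flattened CAD descriptors.
    Both metrics are illustrative starting points, not a prescribed method.
    """
    # Per-feature variance, averaged over dimensions:
    # captures the spread of the data along each feature axis.
    mean_feature_variance = features.var(dim=0, unbiased=True).mean().item()

    # Mean pairwise Euclidean distance:
    # captures inter-sample variability within the dataset.
    pairwise = torch.cdist(features, features)  # (n, n) distance matrix
    n = features.shape[0]
    mean_pairwise_distance = pairwise.sum().item() / (n * (n - 1))

    return {
        "mean_feature_variance": mean_feature_variance,
        "mean_pairwise_distance": mean_pairwise_distance,
    }

# Example: a tightly clustered dataset vs. a widely spread one.
torch.manual_seed(0)
tight = torch.randn(100, 8) * 0.1
spread = torch.randn(100, 8) * 2.0
m_tight = dataset_variance_metrics(tight)
m_spread = dataset_variance_metrics(spread)
```

Metrics like these would let the dataset-curation experiments rank candidate training sets by variance before measuring downstream generalization.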

Applicants should have a solid background in machine learning and deep learning, with strong technical skills in Python and experience with PyTorch. Candidates should also possess the capability to work independently and have a keen interest in exploring the theoretical aspects of neural network training.

For your application, please send your transcript of record.