Author ORCID Identifier
0000-0003-1208-6991
Document Type
Conference Paper
Disciplines
Statistics
Abstract
Analysing big data is a long-standing challenge because its massive volume strains traditional analytical techniques. When the number of instances is extremely large, existing approaches become computationally infeasible owing to the complexity of many algorithms and to the memory and time constraints inherent in processing large datasets. In such cases, using a subset of the data is considered a more practical solution, and analyses are typically performed on a simple random sample drawn from the entire dataset. Various subsampling methods have been proposed to address these issues. However, they often fall short of producing representative subsamples that can be used across different analytical techniques. Within this framework, this work presents a novel approach, based on the so-called Data Nuggets, to obtain a subset that can be used for any further required analysis. Existing techniques for extracting a subset from a large dataset may focus on particular objectives, or they may struggle to capture the structure and statistical properties of the original data. Our method builds on the robustness of the Data Nugget approach, which summarizes a dataset while preserving its structure. This new method, which we name DN-subset selection, samples from each refined data nugget and yields a much smaller dataset that represents the original large sample well. The effectiveness of our method is evaluated through a simulation study.
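The general idea of sampling within groups that summarize the data, rather than from the dataset as a whole, can be illustrated with a minimal sketch. This is not the authors' DN-subset selection algorithm: the abstract does not specify how data nuggets are created or refined, so a simple grid partition stands in for them here. The function names `partition_by_grid` and `sample_within_groups`, the cell width, and the sampling fraction are all hypothetical choices made for illustration.

```python
import random
from collections import defaultdict


def partition_by_grid(points, cell_width):
    """Stand-in for data nugget creation (an assumption, not the paper's method):
    group point indices by the grid cell each point falls into."""
    groups = defaultdict(list)
    for idx, point in enumerate(points):
        key = tuple(int(coord // cell_width) for coord in point)
        groups[key].append(idx)
    return list(groups.values())


def sample_within_groups(groups, fraction, rng):
    """Draw a simple random sample of indices from every group,
    taking at least one index per group so sparse regions stay represented."""
    chosen = []
    for members in groups:
        k = max(1, round(fraction * len(members)))
        chosen.extend(rng.sample(members, k))
    return sorted(chosen)


# Simulated two-dimensional data (illustrative only).
rng = random.Random(0)
points = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(10_000)]

groups = partition_by_grid(points, cell_width=0.5)
subset_idx = sample_within_groups(groups, fraction=0.05, rng=rng)
subset = [points[i] for i in subset_idx]

print(len(points), len(groups), len(subset))
```

Because every group contributes at least one observation, low-density regions of the data remain present in the subset, which a simple random sample of the same size would not guarantee.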
DOI
https://doi.org/10.21427/mnw2-6d25
Recommended Citation
Kumar, Vipin; Balzano, Simona; and Porzio, Giovanni C., "Dealing with large data sets: the Data Nugget Subset Selection approach" (2025). SAML-25 Workshop on Statistical and Machine Learning. 18.
https://arrow.tudublin.ie/saml/18
Creative Commons License

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
Publication Details
Statistical and Machine Learning: Methods and Applications (SAML-25) on June 5th and 6th, 2025 at TU Dublin, Ireland.
doi:10.21427/mnw2-6d25