Document Type
Theses, Ph.D
Disciplines
1.2 COMPUTER AND INFORMATION SCIENCE
Abstract
Sequential data modeling is an important challenge in many fields, and in natural language processing in particular. Building effective sequential models faces a notable challenge in the form of Long-Distance Dependencies (LDDs) within the sequence data, so successfully overcoming this challenge is imperative for developing robust and accurate sequential models across various domains and applications. The first step in tackling this challenge is a detailed analysis of the complexity of the LDDs observed in various sequence datasets, and this thesis offers a thorough exploration and documentation of such an analysis. An important finding of this thesis is the consistent pattern of LDD decay observed in datasets originating from similar domains, processes, and tasks. Furthermore, the analysis of the LDDs of natural language datasets reveals that the dependencies decay following a broken power-law relationship.
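A minimal sketch, in Python, of the kind of dependence-decay analysis described above, assuming mutual information between symbols at a given separation as the dependence measure; the thesis defines the exact estimator and corpora, and the toy sequence below is purely illustrative.

from collections import Counter
import math

def mutual_information_at_distance(seq, d):
    """Empirical mutual information between symbols d positions apart."""
    pairs = list(zip(seq[:-d], seq[d:]))
    n = len(pairs)
    joint = Counter(pairs)
    left = Counter(a for a, _ in pairs)
    right = Counter(b for _, b in pairs)
    mi = 0.0
    for (a, b), c in joint.items():
        p_ab = c / n
        mi += p_ab * math.log2(p_ab / ((left[a] / n) * (right[b] / n)))
    return mi

# Plotting mi against d on log-log axes exposes the decay pattern; for the
# natural language datasets studied in the thesis this decay is reported to
# follow a broken power law.
toy_sequence = list("abracadabra" * 200)
decay_curve = [(d, mutual_information_at_distance(toy_sequence, d))
               for d in range(1, 21)]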
The neural architectures employed to implement Neural Language Models (NLMs), including LSTMs, Recurrent Highway Networks (RHNs), DilatedRNNs, and Transformers, have delivered significant performance improvements in sequential data modeling. Designing such architectures effectively requires a comprehensive understanding of the inherent complexity of the LDDs within NLM benchmark datasets. More importantly, tasks that evaluate the ability of neural architectures to model LDDs in sequence datasets should provide experimental control over the presence and complexity of those LDDs.
One approach to addressing the lack of varied evaluation tasks is to use synthetic datasets generated from the grammars of Strictly k-Piecewise languages. Because the characteristics of Strictly k-Piecewise languages are well understood, the synthesized datasets can be injected with the requisite characteristics. The complexity of the LDDs within these generated datasets can be controlled by adjusting (i) the k parameter, (ii) the length of the generated strings, (iii) the number of unique symbols, and (iv) the choice of forbidden strings. These settings in turn determine properties of the resulting sequence datasets, such as (i) the number of unique symbols in a dataset, (ii) the size of the dataset, (iii) the number of interacting symbols within a given LDD, and (iv) the distance between the interacting symbols. Building on these observations, this work used a variety of grammars of Strictly k-Piecewise languages to generate datasets for evaluating the representational capacity of several state-of-the-art (SoTA) neural architectures. These evaluations indicate that attention-based models deliver better results than recurrent-based models on datasets featuring complex LDDs.
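A minimal sketch, in Python, of how strings of a Strictly k-Piecewise language can be generated; SP-k languages forbid a set of length-k subsequences (not substrings), and the alphabet, the value of k, the string length, and the forbidden subsequences below are illustrative assumptions rather than the settings used in the thesis.

import random

def generate_spk_string(alphabet, forbidden, length, rng):
    """Generate one string that contains none of the forbidden subsequences."""
    # For each forbidden subsequence, track the longest prefix matched so far.
    state = {f: 0 for f in forbidden}
    out = []
    for _ in range(length):
        # Any symbol that would complete a forbidden subsequence is disallowed.
        banned = {f[s] for f, s in state.items() if s == len(f) - 1}
        choices = [c for c in alphabet if c not in banned]
        if not choices:
            break
        c = rng.choice(choices)
        out.append(c)
        # Advance the prefix match for every forbidden subsequence.
        for f in state:
            if state[f] < len(f) and c == f[state[f]]:
                state[f] += 1
    return "".join(out)

# Example: an SP-2 language over {a, b, c, d} forbidding the subsequences
# "ab" and "cd"; longer strings and larger k produce longer-range dependencies.
rng = random.Random(0)
data = [generate_spk_string("abcd", ["ab", "cd"], 30, rng) for _ in range(100)]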
A direct benefit of analyzing the complexity of the LDDs exhibited by sequence data is the ability to guide the selection of optimal hyper-parameters for the neural architectures tasked with modeling a sequence dataset. This thesis proposes such a hyper-parameter selection process and demonstrates it on the DilatedRNN architecture. By studying the inherent LDDs across diverse datasets, it was possible to strategically optimize hyper-parameters of the DilatedRNN, such as (i) the dilations, (ii) the valid sequence length, and (iii) the total sequence length. The findings show that using the identified optimal hyper-parameters leads to superior performance, establishing a strong link between the analysis of LDD complexity and the effectiveness of the DilatedRNN model.
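The following Python sketch illustrates the idea behind LDD-informed hyper-parameter selection for a DilatedRNN: choose exponentially increasing dilations whose largest value covers the longest distance at which the dataset still exhibits appreciable dependence. The dependence threshold and the doubling schedule are assumptions made for illustration, not the procedure prescribed by the thesis.

def choose_dilations(decay_curve, threshold):
    """decay_curve: list of (distance, dependence) pairs, e.g. an MI decay curve."""
    # Longest distance whose dependence is still above the chosen threshold.
    max_dep = max((d for d, dep in decay_curve if dep > threshold), default=1)
    dilations = [1]
    while dilations[-1] < max_dep:
        dilations.append(dilations[-1] * 2)  # one DilatedRNN layer per power of two
    return dilations

# Example: if dependence stays above the threshold out to a distance of ~50,
# this yields dilations [1, 2, 4, 8, 16, 32, 64], i.e. seven layers.
print(choose_dilations([(d, 1.0 / d) for d in range(1, 200)], threshold=0.02))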
Thoroughly examining the complexity of the LDDs within natural language datasets reveals the inherent challenges these datasets pose to NLMs. This thesis explores this complexity by introducing a novel metric, the Long-Short Dependence Ratio (LSDR), tailored explicitly to natural language datasets whose dependencies decay according to a broken power-law relationship. The LSDR computes the proportion of Short-Distance Dependencies (SDDs) and Long-Distance Dependencies (LDDs), offering insight into a dataset's relative difficulty: a dataset with a higher proportion of LDDs is inherently more complex. The LSDR therefore serves as a metric that characterizes dataset complexity based on the presence of LDDs and can be used to assess the performance of NLMs across diverse datasets. By leveraging the LSDR, this analysis aims to evaluate the effectiveness of NLMs and contribute directly to the development of better sequential models. The experiments demonstrate a direct relationship between a dataset's LSDR and the test perplexity of an NLM on that dataset. Furthermore, attention-based models perform well on datasets with low LSDR values, whereas recurrent-based models perform better on datasets with high LSDR values.
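As an illustration only, the Python sketch below computes an LSDR-style ratio from a dependence-decay curve. The precise definition of the metric (the dependence measure, the cut-off separating short from long distances, and the direction of the ratio) is given in the thesis, so every choice here is an assumption.

def lsdr_style_ratio(decay_curve, cutoff):
    """decay_curve: list of (distance, dependence) pairs; cutoff splits short vs. long.

    Assumption: the ratio is taken as long-distance dependence mass over
    short-distance dependence mass, so a higher value indicates a larger share
    of LDDs and hence a harder dataset.
    """
    short_mass = sum(dep for d, dep in decay_curve if d <= cutoff)
    long_mass = sum(dep for d, dep in decay_curve if d > cutoff)
    return long_mass / short_mass if short_mass > 0 else float("inf")

# Example: compare two hypothetical decay curves; the slower-decaying one
# yields the higher ratio, i.e. a higher proportion of long-distance dependence.
fast_decay = [(d, 1.0 / d**2) for d in range(1, 1000)]
slow_decay = [(d, 1.0 / d**0.5) for d in range(1, 1000)]
print(lsdr_style_ratio(fast_decay, cutoff=10), lsdr_style_ratio(slow_decay, cutoff=10))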
DOI
https://doi.org/10.21427/30rk-jh47
Recommended Citation
Mahalunkar, Abhijit Shrikant, "The Complexity of Long-Distance Dependencies and their Impact on Language Models" (2025). Doctoral. 5.
https://arrow.tudublin.ie/compdidadoc/5
Creative Commons License

This work is licensed under a Creative Commons Attribution-NonCommercial-Share Alike 4.0 International License.
Publication Details
This dissertation is submitted for the degree of Doctor of Philosophy - October 2025.