(1 - 1 of 1)
- Learning from noisily connected data
- Yang, Tianbao
- Electronic Theses & Dissertations
Machine learning is a discipline of developing computational algorithms for learning predictive models from data. Traditional analytical learning methods treat the data as independent and identically distributed (i.i.d) samples from unknown distributions. However, this assumption is often violated in many real world applications that leading to the challenge of learning predictive models. For example, in electronic commerce website, customers could purchase a product by the recommendation of...
Show moreMachine learning is a discipline of developing computational algorithms for learning predictive models from data. Traditional analytical learning methods treat the data as independent and identically distributed (i.i.d) samples from unknown distributions. However, this assumption is often violated in many real world applications that leading to the challenge of learning predictive models. For example, in electronic commerce website, customers could purchase a product by the recommendation of their friends. Hence the purchasement records of customers are not i.i.d samples but correlated. Nowadays, data become correlated due to collaborations, interactions, communications, and many other types of connections. Effective learning from these connected data not only provides better understanding of the data but also brings significant economic benefits. How to learn from the connected data also brings unique challenges to both supervised learning and unsupervised learning algorithms because these algorithms are designed for i.i.d data and are often sensitive to the noise in the connected data. In this dissertation, I focus on developing theory and algorithms for learning from connected data. In particular, I consider two types of connections: the first type of connection is naturally formed in real wold networks, while the second type of connection is manually created to facilitate the learning process which is called must-and-cannot link. In the first part of this dissertation, I develop efficient algorithms for detecting communities in the first type of connected data. In the second part of this dissertation, I develop clustering algorithms that effectively utilize both must links and cannot links for the second type of connected dataA common approach toward learning from connected data is to assume that if two data points are connected, they are likely to be assigned to the same class/cluster. This assumption is often violated in real-word applications, leading to the noisy connection problems. One key challenge of learning from connected data is how to model the noisy pairwise connections that indicates the pairwise class-relationship between two data points. In the problem of detecting communities in networked data, I develop Bayesian approaches that explicitly model the noisy pairwise links by introducing additional hidden variables, besides community memberships, to explain potential inconsistency between the pairwise connections and pairwise class-relationship. In clustering must-and-cannot linked data, I will try to model how the noise is added into the pairwise connections in the manually generating process. The main contributions of this dissertation include (i) it introduces
popularityand productivityfor the first time besides the community memberships to model the generation of noisy links in real networks; the effectiveness of these factors is demonstrated through the task of community detection; (ii) it proposes a discriminative model for the first time that combines the content and link analysis together for detecting communities to alleviate the impact of noisy connections in community detection; (iii) it presents a general approach for learning from noisily labeled data, proves the theoretical convergence results for the first time and applies the approach in clustering noisy must-and-cannot linked data.