Signal and Data Processing
پردازش علائم و دادهها
JSDP
Engineering & Technology
http://jsdp.rcisp.ac.ir
1
admin
2538-4201
2538-421X
10.52547/jsdp
1
8888
fa
jalali
1399
6
1
gregorian
2020
9
1
17
2
online
1
fulltext
fa
ترکیب وزندار خوشهبندیها با هدف افزایش صحّت خوشهبندی نهایی
Weighted Ensemble Clustering for Increasing the Accuracy of the Final Clustering
مقالات پردازش دادههای رقمی
Paper
پژوهشي
Research
<div style="text-align: justify;"><strong><span dir="RTL"><span style="color:black;"><span style="font-family:B Nazanin;"><span style="font-size:10.0pt;">با توجه به ماهیت بدون ناظر مسائل خوشهبندی و تأثیرگذاری مؤلفههای مختلف از جمله تعداد خوشهها، معیار فاصله و الگوریتم انتخابی، ترکیب خوشهبندیها برای کاهش تأثیر این مؤلفهها و افزایش صحت خوشهبندی نهایی معرفی شده است. در این مقاله، روشی برای ترکیب وزن</span></span></span></span><em><span style="color:black;"><span style="font-family:B Nazanin;"><span style="font-size:10.0pt;"></span></span></span></em><span dir="RTL"><span style="color:black;"><span style="font-family:B Nazanin;"><span style="font-size:10.0pt;">دار خوشه</span></span></span></span><em><span style="color:black;"><span style="font-family:B Nazanin;"><span style="font-size:10.0pt;"></span></span></span></em><span dir="RTL"><span style="color:black;"><span style="font-family:B Nazanin;"><span style="font-size:10.0pt;">بندی</span></span></span></span><em><span style="color:black;"><span style="font-family:B Nazanin;"><span style="font-size:10.0pt;"></span></span></span></em><span dir="RTL"><span style="color:black;"><span style="font-family:B Nazanin;"><span style="font-size:10.0pt;">های پایه با وزن</span></span></span></span><em><span style="color:black;"><span style="font-family:B Nazanin;"><span style="font-size:10.0pt;"></span></span></span></em><span dir="RTL"><span style="color:black;"><span style="font-family:B Nazanin;"><span style="font-size:10.0pt;">دهی به خوشه</span></span></span></span><em><span style="color:black;"><span style="font-family:B Nazanin;"><span style="font-size:10.0pt;"></span></span></span></em><span dir="RTL"><span style="color:black;"><span style="font-family:B Nazanin;"><span style="font-size:10.0pt;">بندی</span></span></span></span><em><span style="color:black;"><span style="font-family:B Nazanin;"><span style="font-size:10.0pt;"></span></span></span></em><span dir="RTL"><span style="color:black;"><span style="font-family:B Nazanin;"><span style="font-size:10.0pt;">ها بر اساس روش </span></span></span></span><span style="color:black;"><span style="font-family:Times New Roman,serif;"><span style="font-size:8.0pt;">AD</span></span></span> <span dir="RTL"><span style="color:black;"><span style="font-family:B Nazanin;"><span style="font-size:10.0pt;">ارائه شده است. روش </span></span></span></span><span style="color:black;"><span style="font-family:Times New Roman,serif;"><span style="font-size:8.0pt;">AD</span></span></span> <span dir="RTL"><span style="color:black;"><span style="font-family:B Nazanin;"><span style="font-size:10.0pt;">برای برآورد صحّت انسان</span></span></span></span><em><span style="color:black;"><span style="font-family:B Nazanin;"><span style="font-size:10.0pt;"></span></span></span></em><span dir="RTL"><span style="color:black;"><span style="font-family:B Nazanin;"><span style="font-size:10.0pt;">ها در مسائل جمع­سپاری از هماهنگی یا تضاد بین آرای آنها استفاده میکند</span></span></span></span></strong> <strong><span dir="RTL"><span style="color:black;"><span style="font-family:B Nazanin;"><span style="font-size:10.0pt;">و با پیشنهاد مدلی احتمالاتی، فرآیند برآورد صحّت را بهکمک یک فرآیند بهینهسازی انجام می</span></span></span></span><span dir="RTL"><span style="color:black;"><span style="font-family:Cambria,serif;"><span style="font-size:10.0pt;"></span></span></span></span><span dir="RTL"><span style="color:black;"><span style="font-family:B Nazanin;"><span style="font-size:10.0pt;">دهد. نوآوری اصلی این مقاله، تخمین صحت خوشهبندیهای پایه با استفاده از روش </span></span></span></span><span style="color:black;"><span style="font-family:Times New Roman,serif;"><span style="font-size:8.0pt;">AD</span></span></span><span dir="RTL"><span style="color:black;"><span style="font-family:B Nazanin;"><span style="font-size:10.0pt;"> و استفاده از صحتهای تخمین زدهشده در وزندهی به خوشهبندیهای پایه در فرآیند ترکیب است. نحوه تطبیق مسأله خوشه</span></span></span></span><span dir="RTL"><span style="color:black;"><span style="font-family:Cambria,serif;"><span style="font-size:10.0pt;"></span></span></span></span><span dir="RTL"><span style="color:black;"><span style="font-family:B Nazanin;"><span style="font-size:10.0pt;">بندی به روش برآورد صحّت </span></span></span></span><span style="color:black;"><span style="font-family:Times New Roman,serif;"><span style="font-size:8.0pt;">AD</span></span></span> <span dir="RTL"><span style="color:black;"><span style="font-family:B Nazanin;"><span style="font-size:10.0pt;">و نحوه استفاده از صحّتهای برآوردشده در فرآیند ترکیب نهایی خوشه</span></span></span></span><span dir="RTL"><span style="color:black;"><span style="font-family:Cambria,serif;"><span style="font-size:10.0pt;"></span></span></span></span><span dir="RTL"><span style="color:black;"><span style="font-family:B Nazanin;"><span style="font-size:10.0pt;">ها، از چالشهایی است که در این پژوهش به آنها پرداخته شده است. چهار روش برای تولید خوشه</span></span></span></span><em><span style="color:black;"><span style="font-family:B Nazanin;"><span style="font-size:10.0pt;"></span></span></span></em><span dir="RTL"><span style="color:black;"><span style="font-family:B Nazanin;"><span style="font-size:10.0pt;">بندی</span></span></span></span><em><span style="color:black;"><span style="font-family:B Nazanin;"><span style="font-size:10.0pt;"></span></span></span></em><span dir="RTL"><span style="color:black;"><span style="font-family:B Nazanin;"><span style="font-size:10.0pt;">های پایه شامل الگوریتم</span></span></span></span><em><span style="color:black;"><span style="font-family:B Nazanin;"><span style="font-size:10.0pt;"></span></span></span></em><span dir="RTL"><span style="color:black;"><span style="font-family:B Nazanin;"><span style="font-size:10.0pt;">های متفاوت، معیارهای فاصله</span></span></span></span><em><span style="color:black;"><span style="font-family:B Nazanin;"><span style="font-size:10.0pt;"></span></span></span></em><span dir="RTL"><span style="color:black;"><span style="font-family:B Nazanin;"><span style="font-size:10.0pt;">ی متفاوت در اجرای </span></span></span></span><span style="color:black;"><span style="font-family:Times New Roman,serif;"><span style="font-size:8.0pt;">k-means</span></span></span><span dir="RTL"><span style="color:black;"><span style="font-family:B Nazanin;"><span style="font-size:10.0pt;">، ویژگی</span></span></span></span><em><span style="color:black;"><span style="font-family:B Nazanin;"><span style="font-size:10.0pt;"></span></span></span></em><span dir="RTL"><span style="color:black;"><span style="font-family:B Nazanin;"><span style="font-size:10.0pt;">های توزیع</span></span></span></span><em><span style="color:black;"><span style="font-family:B Nazanin;"><span style="font-size:10.0pt;"></span></span></span></em><span dir="RTL"><span style="color:black;"><span style="font-family:B Nazanin;"><span style="font-size:10.0pt;">شده و تعداد خوشه</span></span></span></span><em><span style="color:black;"><span style="font-family:B Nazanin;"><span style="font-size:10.0pt;"></span></span></span></em><span dir="RTL"><span style="color:black;"><span style="font-family:B Nazanin;"><span style="font-size:10.0pt;">های متفاوت بررسی شده است. در فرآیند ترکیب، قابلیت وزن</span></span></span></span><em><span style="color:black;"><span style="font-family:B Nazanin;"><span style="font-size:10.0pt;"></span></span></span></em><span dir="RTL"><span style="color:black;"><span style="font-family:B Nazanin;"><span style="font-size:10.0pt;">دهی به الگوریتمهای خوشه</span></span></span></span><em><span style="color:black;"><span style="font-family:B Nazanin;"><span style="font-size:10.0pt;"></span></span></span></em><span dir="RTL"><span style="color:black;"><span style="font-family:B Nazanin;"><span style="font-size:10.0pt;">بندی ترکیبی </span></span></span></span><span style="color:black;"><span style="font-family:Times New Roman,serif;"><span style="font-size:8.0pt;">CSPA</span></span></span> <span dir="RTL"><span style="color:black;"><span style="font-family:B Nazanin;"><span style="font-size:10.0pt;">و </span></span></span></span><span style="color:black;"><span style="font-family:Times New Roman,serif;"><span style="font-size:8.0pt;">HGPA</span></span></span> <span dir="RTL"><span style="color:black;"><span style="font-family:B Nazanin;"><span style="font-size:10.0pt;">اضافه شده است. نتایج روش پیشنهادی روی سیزده مجموعه داده مصنوعی و واقعی مختلف و بر اساس نُه معیار ارزیابی متفاوت نشان می</span></span></span></span><em><span style="color:black;"><span style="font-family:B Nazanin;"><span style="font-size:10.0pt;"></span></span></span></em><span dir="RTL"><span style="color:black;"><span style="font-family:B Nazanin;"><span style="font-size:10.0pt;">دهد که روش ترکیب وزن</span></span></span></span><em><span style="color:black;"><span style="font-family:B Nazanin;"><span style="font-size:10.0pt;"></span></span></span></em><span dir="RTL"><span style="color:black;"><span style="font-family:B Nazanin;"><span style="font-size:10.0pt;">دار ارائهشده در بیشتر موارد بهتر از روش ترکیب خوشه</span></span></span></span><em><span style="color:black;"><span style="font-family:B Nazanin;"><span style="font-size:10.0pt;"></span></span></span></em><span dir="RTL"><span style="color:black;"><span style="font-family:B Nazanin;"><span style="font-size:10.0pt;">بندی بدون وزن عمل می</span></span></span></span><em><span style="color:black;"><span style="font-family:B Nazanin;"><span style="font-size:10.0pt;"></span></span></span></em><span dir="RTL"><span style="color:black;"><span style="font-family:B Nazanin;"><span style="font-size:10.0pt;">کند که این بهبود برای روش </span></span></span></span><span style="color:black;"><span style="font-family:Times New Roman,serif;"><span style="font-size:8.0pt;">HGPA</span></span></span><span dir="RTL"><span style="color:black;"><span style="font-family:B Nazanin;"><span style="font-size:10.0pt;"> نسبت به </span></span></span></span><span style="color:black;"><span style="font-family:Times New Roman,serif;"><span style="font-size:8.0pt;">CSPA</span></span></span><span dir="RTL"><span style="color:black;"><span style="font-family:B Nazanin;"><span style="font-size:10.0pt;"> بیشتر است.</span></span></span></span></strong></div>
<div style="text-align: justify;"><strong>Clustering algorithms are highly dependent on different factors such as the number of clusters, the specific clustering algorithm, and the used distance measure. Inspired from ensemble classification, one approach to reduce the effect of these factors on the final clustering is ensemble clustering. Since weighting the base classifiers has been a successful idea in ensemble classification, in this paper we propose a method to use weighting in the ensemble clustering problem. The accuracies of base clusterings are estimated using an algorithm from crowdsourcing literature called agreement/disagreement method (AD). This method exploits the agreements or disagreements between different labelers for estimating their accuracies. It assumes different labelers have labeled a set of samples, so each two persons have an agreement ratio in their labeled samples. Under some independence assumptions, there is a closed-form formula for the agreement ratio between two labelers based on their accuracies. The AD method estimates the labelers’ accuracies by minimizing the difference between the parametric agreement ratio from the closed-form formula and the agreement ratio from the labels provided by labelers. To adapt the AD method to the clustering problem, an agreement between two clusterings are defined as having the same opinion about a pair of samples. This agreement can be as either being in the same cluster or being in different clusters. In other words, if two clusterings agree that two samples should be in the same or different clusters, this is considered as an agreement. Then, an optimization problem is solved to obtain the base clusterings’ accuracies such that the difference between their available agreement ratios and the expected agreements based on their accuracies is minimized. To generate the base clusterings, we use four different settings including different clustering algorithms, different distance measures, distributed features, and different number of clusters. The used clustering algorithms are mean shift, k-means, mini-batch k-means, affinity propagation, DBSCAN, spectral, BIRCH, and agglomerative clustering with average and ward metrics. For distance measures, we use correlation, city block, cosine, and Euclidean measures. In distributed features setting, the k-means algorithm is performed for 40%, 50%,…, and 100% of randomly selected features. Finally, for different number of clusters, we run the k-means algorithm by k equals to 2 and also 50%, 75%, 100%, 150%, and 200% of true number of clusters. We add the estimated weights by the AD algorithm to two famous ensemble clustering methods, i.e., Cluster-based Similarity Partitioning Algorithm (CSPA) and Hyper Graph Partitioning Algorithm (HGPA). In CSPA, the similarity matrix is computed by taking a weighted average of the opinions of different clusterings. In HGPA, we propose to weight the hyperedges by different values such as the estimated clustering accuracies, size of clusters, and the silhouette of clusterings. The experiments are performed on 13 real and artificial datasets. The reported evaluation measures include adjusted rand index, Fowlkes-Mallows, mutual index, adjusted mutual index, normalized mutual index, homogeneity, completeness, v-measure, and purity. The results show that in the majority of cases, the proposed weighted-based method outperforms the unweighted ensemble clustering. In addition, the weighting is more effective in improving the HGPA algorithm than CSPA. For different weighting methods proposed for HGPA algorithm, the best average results are obtained when we use the accuracies estimated by the AD method to weight the hyperedges, and the worst results are obtained when using the normalized silhouette measure for weighting. Finally, among different methods for generating base clusterings, the best results in weighted HGPA are obtained when we use different clustering algorithms to come up with different base clusterings.</strong></div>
خوشهبندی ترکیبی وزندار, یادگیری بدون نظارت, HGPA, CSPA, AD
Weighted Ensemble Clustering, Unsupervised Learning, HGPA, CSPA, AD
100
85
http://jsdp.rcisp.ac.ir/browse.php?a_code=A-10-1443-1&slc_lang=fa&sid=1
Sedigheh
Vahidi Ferdosi
صدیقه
وحیدی فردوسی
s.vahidi@stu.qom.ac.ir
10031947532846009101
10031947532846009101
No
University of Qom
دانشگاه قم
Hossein
Amirkhani
حسین
امیرخانی
amirkhani@qom.ac.ir
10031947532846009102
10031947532846009102
Yes
University of Qom
دانشگاه قم