Signal and Data Processing

fa خوشه بندی گروهی طیفی لاپلاسی-p نیمه نظارتی برای داده های با ابعاد بالا Ensembling semi-supervised p-spectral clustering for high dimensional data مقالات گروه علائم حیاتی ( مرتبط با مهندسی پزشکی) Paper پژوهشي Research با توجه به افزایش روزافزون اطلاعات و تحلیل دقیق آنها مسأله خوشه بندی که برای آشکارسازی الگوهای پنهان موجود در داده ها مورد استفاده قرار می گیرد، همچنان از اهمیت بالایی برخوردار است. از طرفی خوشه بندی داده های با ابعاد بالا با استفاده از روشهای سنتی پیشین دارای محدودیت های زیادی است. در مقاله حاضر، یک روش خوشه بندی گروهی نیمه نظارتی برای مجموعه ای از داده های پزشکی با ابعاد بالا پیشنهاد می شود. در فرموله سازی مسأله خوشه بندی اطلاعات نظارتی اندکی به عنوان دانش پیشین با استفاده از اطلاعات مربوط به تشابه و یا عدم تشابه (بصورت تعدادی زوج محدودیت های دوبه دو) در نظر گرفته میشود. در ابتدا با استفاده از خاصیت تراگذری زوج محدودیت های دوبه دو را بر روی تمام داده ها تعمیم می دهیم. سپس با تقسیم فضای ویژگی به صورت تصادفی به چندین زیرفضای نابرابر ابعاد داده ها را کاهش می دهیم. خوشه بندی طیفی نیمه نظارتی مبتنی بر گراف لاپلاسی- <m:r><m:rpr><m:scr m:val="roman"><m:sty m:val="p"></m:sty></m:scr></m:rpr>p</m:r> در هر زیر فضا بطور مستقل انجام می شود. سپس با استفاده از نتایج هر کدام یک ماتریس مجاورت، حاصل از تجمیع نتایج هر کدام (مبتنی بر یادگیری گروهی) ایجاد می شود. در نهایت با استفاده از چند عملگر جستجو روی زیرفضاها، بهترین زیرفضا، یعنی زیرفضایی که بهترین نتیجه خوشه بندی را دارد، می یابیم. نتایج آزمایشات متعدد بر روی چندین داده ی پزشکی با ابعاد بالا نشان می دهد که رویکرد پیشنهادی، عملکرد و کارآیی بهتری نسبت به روشهای پیشین دارد. Due to the increasing information and the detailed analysis of them, the clustering problems that detect the hidden patterns lie in the data are still of great importance. On the other hand, clustering of high-dimensional data using previous traditional methods has many limitations. In this study, a semi-supervised ensemble clustering method is proposed for a set of high-dimensional medical data. In the proposed method of this study, little information is available as prior knowledge using the information on similarity or dissimilarity (as a number of pairwise constraints). Initially using the transitive property, we generalize the pairwise constraints to all data. Then we divide the feature space into a number of sub-spaces, and to find the optimal clustering solution, the feature space is divided into an unequal number of sub-spaces randomly. A semi-supervised spectral clustering based on the p-Laplacian graph is performed at each sub-space independently. Specifically, to increase the accuracy of spectral clustering, we have used the spectral clustering method based on the p-Laplacian graph. The p-Laplacian graph is a nonlinear generalization of the Laplacian graph. The results of any clustering solutions are compared with the pairwise constraints and according to the level of matching, a degree of confidence is assigned to each clustering solution. Based on these degrees of confidence, an ensemble adjacency matrix is formed, which is the result of considering the results of all clustering solutions for each sub-space. This ensemble adjacency matrix is used in the final spectral clustering algorithm to find the clustering solution of the whole sub-space. Since the sub-spaces are generated randomly with an unequal number of features, clustering results are strongly influenced by different initial values. Therefore, it is necessary to find the optimal sub-space set. To this end, a search algorithm is designed to find the optimal sub-space set. The search process is initialized by forming several sets (we call each set an environment) consisting of several numbers of sub-spaces. An optimal environment is the one that has the best clustering results. The search algorithm utilized three search operators to find the optimal environment. The search operators search all the environments and the consequent sub-spaces both locally and globally. These operators combine two environments and/or replace an environment with a newly generated one. Each search operator tries to find the best possible environment in the entire search space or in a local space. We evaluate the performance of our proposed clustering schema on 20 cancer gene datasets. The normalized mutual information (NMI) criterion and the adjusted rand index (ARI) are used to evaluate the performance evaluation. We first examine the effect of a different number of pairwise constraints. As expected, with increasing the number of pairwise constraints, the efficiency of the proposed method also increases. For example, the NMI value increases from 0.6 to 0.9 on the Khan-2001 dataset, when the number of pairwise constraints increases from 20 to 100. More number of pairwise constraints means more information is available, which helps to improve the performance of the clustering algorithm. Furthermore, we examine the effect of the number of random subspaces. It is observed that increasing the number of random subspaces has a positive effect on clustering performance with respect to the NMI value. In most datasets, when the number of sub-spaces reaches 20, the performance of the proposed method does not change much and is stable. Examining the effect of sampling rate for random subspace generation shows that the proposed method has the best performance in most cancer datasets, such as Armstrong-2002-v3, and Bredel-2005 datasets, when the random subspace generation rate is 0.5, and by deviating the rate from 0.5, the level of satisfaction decreases. Then, the results of the proposed idea are compared with the results of the method proposed in the reference [21] according to ARI and we see that our proposed method has performed better in 12 data sets out of 20 data sets than the method proposed in the reference [21]. Finally, the proposed idea is compared with some metric learning approaches with respect to NMI. We have observed that the proposed method obtained the best results compared to other compared methods on 11 datasets out of 20 datasets. It also achieved the second-best result on 6 out of 20 datasets. For example, the value NMI obtained in the proposed method is 0.1042 more than the reference [21] and it is 0.1846 more than RCA and it is 0.4 more than ITML and also it is 0.468 more than DCA on the Bredel-2005 dataset. Utilizing ensemble clustering methods besides the confidence factor improves the ability of the proposed algorithm to achieve better results. Also, utilizing the transitive operators as well as the selection of random subspaces of unequal sizes play an important role in achieving better performance for the proposed algorithm. Using the p-Laplacian spectral clustering method produces a better, more balanced, and normal volume of clusters compared to the standard spectral clustering. Another effective approach to the performance of the proposed method is to use search operators to find the best subspace, which leads to better results.   خوشه بندی, یادگیری زیرفضا, یادگیری گروهی, یادگیری نیمه نظارتی, زوج محدودیت های دوبه دو Clustering, Subspace Learning, Ensemble Learning, Semi-supervised Learning, Pairwise Constraints 39 58 http://jsdp.rcisp.ac.ir/browse.php?a_code=A-10-588-1&slc_lang=fa&sid=1 Sedigheh Safari صدیقه صفری s.safari@eng.uk.ac.ir 100319475328460012397 100319475328460012397 No دانشگاه شهید باهنر کرمان Fatemeh Afsari فاطمه افسری afsari.f@gmail.com 100319475328460012398 100319475328460012398 Yes دانشگاه شهید باهنر کرمان