Due to the increasing number of documents, XML, effectively organize these documents in order to retrieve useful information from them is essential. A possible solution is performed on the clustering of XML documents in order to discover knowledge. Clustering XML documents is a key issue of how to measure the similarity between XML documents. Conventional clustering of text documents using a document similarity measure used in information content, they can cause structural information contained in XML documents is ignored. In this paper, a new model named matrix space model to represent both structural and content features of documents in XML, is proposed. Based on this model, the Jaccard similarity measure is defined and the colonial competitive algorithm for clustering XML documents is used. Experimental results show that the proposed model function in identifying similar documents which closely identified with the same structure and content information are effective. This method can improve the accuracy of clustering, and XML data can be used to increase productivity.
Rights and permissions | |
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License. |