Accepted Papers

  • Identifying Patterns and Anomalies in Delayed Neutron Monitor Data for Maintenance of Nuclear Power Plants
    Durga Toshniwal and Aditya Gupta, Indian Institute of Technology Roorkee, Roorkee, India.
    ABSTRACT
    In nuclear fission, a delayed neutron is a neutron emitted by one of the fission products any time from a few milliseconds to a few minutes after the fission event. The counts of delayed neutrons constitute a time series. The analysis of such time series can prove very significant for the purpose of predictive maintenance in nuclear power plants. In this paper we aim to identify anomalies in neutron counts, which may be generated by possible leaks in the nuclear reactor channel. Real-world case data comprising readings from Delayed Neutron Monitors (DNMs) has been analyzed. The time sequences formed by the delayed neutrons have first been symbolically represented using the Symbolic Aggregate approXimation (SAX) algorithm; anomaly detection and pattern detection algorithms have then been applied to them.
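    A minimal sketch of the SAX step described above (z-normalization, piecewise aggregate approximation, then discretization against Gaussian breakpoints); the segment count, alphabet size and use of scipy are illustrative assumptions, not the authors' exact configuration.

      # Minimal SAX sketch: z-normalize, piecewise-aggregate, then map
      # segment means to letters via equiprobable N(0,1) breakpoints.
      # Segment count and alphabet size are illustrative assumptions.
      import numpy as np
      from scipy.stats import norm

      def sax(series, n_segments=8, alphabet_size=4):
          x = np.asarray(series, dtype=float)
          x = (x - x.mean()) / (x.std() + 1e-12)        # z-normalize
          paa = x.reshape(n_segments, -1).mean(axis=1)  # piecewise aggregate approximation
          breakpoints = norm.ppf(np.linspace(0, 1, alphabet_size + 1)[1:-1])
          symbols = np.searchsorted(breakpoints, paa)
          return ''.join(chr(ord('a') + int(s)) for s in symbols)

      counts = np.sin(np.linspace(0, 6, 64)) + 0.1 * np.random.randn(64)
      print(sax(counts))  # e.g. 'bcddcbaa'

    Anomalies can then be flagged, for instance, as SAX words that occur rarely across sliding windows of the count series.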
  • Improving Predictions of Multiple Binary Models in ILP
    Tarek Abudawood, King Abdulaziz City for Science and Technology, Saudi Arabia.
    ABSTRACT
    Most ILP learners can only handle two-class problems and typically deal with a multi-class problem by reducing it to several two-class problems, which eventually produces multiple models. Since combining crisp multi-model predictions is not straightforward in most situations, we investigate the reliability and consistency of one-vs-rest binary models and illustrate the difference with a proper multi-class model.
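    As a rough illustration of the one-vs-rest reduction discussed above (with scikit-learn standing in for an ILP learner), one binary model is trained per class and predictions are combined by picking the most confident model; all names and data here are illustrative.

      # One-vs-rest sketch: one binary model per class, combined by
      # maximum confidence. Combining crisp 0/1 predictions instead
      # would leave ties and abstentions, the difficulty the paper studies.
      import numpy as np
      from sklearn.datasets import load_iris
      from sklearn.linear_model import LogisticRegression

      X, y = load_iris(return_X_y=True)
      classes = np.unique(y)
      models = {c: LogisticRegression(max_iter=1000).fit(X, (y == c).astype(int))
                for c in classes}
      scores = np.column_stack([models[c].predict_proba(X)[:, 1] for c in classes])
      y_pred = classes[scores.argmax(axis=1)]
      print('training accuracy:', (y_pred == y).mean())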
  • Rough Fuzzy Clustering Algorithm using Fuzzy Rough Correlation Factor
    S. Revathy, Sathyabama University, India, and B. Parvathavarthini, St. Joseph’s College of Engineering, India.
    ABSTRACT
    Both fuzzy set theory and rough set theory have advantages, and combining the two for clustering gives better results. Rough clustering is less restrictive than hard clustering and less descriptive than fuzzy clustering. Rough clustering is an appropriate method since it separates the objects that are definite members of a cluster from the objects that are only possible members of a cluster. In fuzzy clustering, similarities are described by membership degrees, while in rough clustering definite and possible members of a cluster are detected. The Fuzzy Rough Correlation Factor is the threshold for the degree of fuzziness: it determines how low a DFR value may be for an object to still be considered for cluster membership assignment. This paper proposes a new modified rough fuzzy clustering algorithm based on the fuzzy rough correlation factor. Hence rough fuzzy clustering can be derived directly from the results obtained through fuzzy clustering.
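    One plausible reading of the assignment step, sketched below under stated assumptions: objects whose two highest membership degrees differ by less than a threshold (playing the role of the correlation-factor cutoff) go only into upper approximations, while unambiguous objects enter the lower approximation of their best cluster. The rule and the threshold value are illustrative, not the paper's exact formulation.

      # Sketch: derive rough (lower/upper) approximations from fuzzy
      # membership degrees; 'threshold' stands in for the fuzzy rough
      # correlation factor cutoff and its value is illustrative.
      import numpy as np

      def rough_assign(memberships, threshold=0.15):
          n_clusters = memberships.shape[1]
          lower = [[] for _ in range(n_clusters)]
          upper = [[] for _ in range(n_clusters)]
          for i, u in enumerate(memberships):
              order = np.argsort(u)[::-1]
              best, second = order[0], order[1]
              if u[best] - u[second] <= threshold:
                  upper[best].append(i)      # possible member of both clusters
                  upper[second].append(i)
              else:
                  lower[best].append(i)      # definite member
                  upper[best].append(i)
          return lower, upper

      U = np.array([[0.9, 0.1], [0.55, 0.45], [0.2, 0.8]])
      print(rough_assign(U))  # object 1 lands only in upper approximations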
  • Classification of Web Log Data Using CART & Sequential Mining to Identify Interested Users
    Jagriti Chand and Abhishek Singh Chouhan, NIIST, India.
    ABSTRACT
    Web Usage Mining (WUM) is the process of extracting knowledge from Web users, who are actively involved in accessing web data, by exploiting Data Mining techniques. It can be used for different purposes such as personalization, system development and site amendment. Studying interested, actively participating web users provides valuable knowledge that allows web designers to respond quickly to their individual needs. In this paper an efficient technique for the classification of web log data to identify interested users is proposed and implemented. The existing technique for identifying interested users with a Naive Bayes classifier is efficient in terms of time but has a higher error rate [1]. We not only reduce the running time complexity of the system but also reduce the error rate. The technique combines the CART classification algorithm with sequential matching, which is then applied to search for the interested users.
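    A minimal sketch of the two-stage pipeline described above, with toy session features and an illustrative visit pattern (both assumptions, not the paper's data):

      # Stage 1: a CART-style decision tree flags candidate sessions;
      # stage 2: sequential matching over page visits confirms interest.
      from sklearn.tree import DecisionTreeClassifier

      # toy session features: [pages_viewed, total_time_s, downloads]
      X = [[3, 40, 0], [25, 900, 4], [8, 300, 1], [2, 15, 0]]
      y = [0, 1, 1, 0]                        # 1 = interested user
      tree = DecisionTreeClassifier(criterion='gini').fit(X, y)

      def contains_subsequence(visits, pattern):
          it = iter(visits)
          return all(page in it for page in pattern)

      sessions = {'u1': ['home', 'products', 'cart', 'checkout'],
                  'u2': ['home', 'about']}
      pattern = ['products', 'checkout']      # illustrative interest pattern
      flagged = tree.predict([[25, 900, 4], [2, 15, 0]])
      for (user, visits), ok in zip(sessions.items(), flagged):
          if ok and contains_subsequence(visits, pattern):
              print(user, 'is an interested user')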
  • Using detrended fluctuation analysis method to calculate heart rate variability signals
    Hoang ChuDuc and Phyllis K. Stein, Washington University in St. Louis, Missouri, USA.
    ABSTRACT
    Heart rate variability (HRV) is used as a marker of autonomic modulation of heart rate. Nonlinear HRV parameters providing information about the scaling behaviour or the complexity of the cardiac system were included. In addition, chaotic behaviour was quantified by means of the recently developed numerical noise titration technique. Twenty-four-hour recordings of newborn subjects were extracted using the BedmasterEx system at a children's hospital in Saint Louis, Missouri, USA. Numerical titration yielded information similar to that of other nonlinear HRV parameters. In this work, we have calculated long-range correlations using DFA for 10 subjects. Severe OSAS subjects show significantly different long-range correlations (p < 0.02) from normal subjects.
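    A compact sketch of standard first-order DFA as it would be applied to an RR-interval series; the box sizes and the synthetic signal are illustrative assumptions.

      # Detrended fluctuation analysis (DFA) sketch: the scaling
      # exponent alpha is the slope of log F(n) versus log n.
      import numpy as np

      def dfa_alpha(x, box_sizes=(4, 8, 16, 32, 64)):
          y = np.cumsum(x - np.mean(x))                 # integrated profile
          fluctuations = []
          for n in box_sizes:
              f2 = []
              for i in range(len(y) // n):
                  seg = y[i * n:(i + 1) * n]
                  t = np.arange(n)
                  trend = np.polyval(np.polyfit(t, seg, 1), t)  # local linear trend
                  f2.append(np.mean((seg - trend) ** 2))
              fluctuations.append(np.sqrt(np.mean(f2)))
          slope, _ = np.polyfit(np.log(box_sizes), np.log(fluctuations), 1)
          return slope

      rr = 0.8 + 0.01 * np.cumsum(np.random.randn(1024))  # toy RR intervals (s)
      print('alpha =', dfa_alpha(rr))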
  • Cross-domain Scientific Collaborations Prediction Using Citation
    Ying Guo and Xi Chen, Tsinghua University, China.
    ABSTRACT
    Cross-domain scientific collaborations have promoted the rapid development of science and generated many innovative breakthroughs. However, the problem of predicting cross-domain scientific collaborations is rarely studied in academic research. Moreover, collaboration recommendation methods for a single domain cannot be directly applied to cross-domain problems because of the challenges of topic skewness and sparse connections. In this paper, we propose a Hybrid Graph Model, which combines explicit co-author relationships and implicit co-citation relationships to construct a graph, and then uses the Random Walks with Restarts concept to measure and rank the relatedness between nodes. Because co-citations appear in both the source domain and the target domain, they represent topics which can be shared across domains. In this way, the topic skewness problem is solved much more cheaply and effectively than with probabilistic topic models. In addition, the co-citation relationship solves the sparse connection problem by mining more potential connections between authors; notably, few previous works use citation information for scientific collaboration recommendations. Finally, we compare the performance of the Hybrid Graph Model with some baseline approaches on large publication data sets from different domains. The experiments show that the Hybrid Graph Model outperforms the comparison methods on several recommendation metrics, and citation information is demonstrated to be very helpful for scientific collaboration recommendations.
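    A small sketch of the Random Walks with Restarts ranking used on the hybrid graph; the toy adjacency matrix mixing co-author and co-citation edges, and the restart probability, are illustrative assumptions.

      # Random Walks with Restarts (RWR): iterate the walk from a seed
      # node; the steady-state vector ranks relatedness to the seed.
      import numpy as np

      def rwr(adj, seed, restart=0.15, iters=100):
          P = adj / adj.sum(axis=0, keepdims=True)   # column-stochastic transitions
          e = np.zeros(adj.shape[0])
          e[seed] = 1.0
          r = e.copy()
          for _ in range(iters):
              r = (1 - restart) * P @ r + restart * e
          return r

      A = np.array([[0, 1, 1, 0],                    # toy hybrid graph where edges
                    [1, 0, 1, 1],                    # mix co-author and co-citation
                    [1, 1, 0, 0],                    # relationships
                    [0, 1, 0, 0]], dtype=float)
      print(np.argsort(rwr(A, seed=0))[::-1])        # nodes ranked by relatedness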
  • Distance Queries on Large-scale Graphs based on Distributed Computing
    Zhencai Zhao and Jizhou Luo, Harbin Institute of Technology, China.
    ABSTRACT
    With the development of various applications based on social networks and traffic networks, large-scale graph processing has become increasingly popular. Graph problems include graph traversal, shortest path queries, decompositions of graphs and so on. This paper studies shortest path queries on graphs from the perspective of cloud computing and practical applications. First, we propose a novel distributed scheme, referred to as D-Floyd, for answering distance queries on a large graph, and implement it on the Hadoop distributed platform. Second, we optimize the scheme using HaLoop, a distributed system developed from Hadoop, and propose an incremental algorithm for D-Floyd. Finally, we carry out a series of experiments. A detailed experimental evaluation on both synthetic and real datasets demonstrates that D-Floyd performs much better than the existing state-of-the-art serial algorithms NaiveHCL, OptHCL-2 and BSC2Hop, providing better response time, adaptability and flexibility, both theoretically and empirically.
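    The serial baseline that D-Floyd distributes is the classic Floyd-Warshall recurrence; a minimal sketch follows (the Hadoop/HaLoop partitioning itself does not fit in a few lines, and the toy graph is illustrative).

      # Floyd-Warshall: after iteration k, dist[i][j] is the shortest
      # i->j distance using intermediate vertices 0..k only. A
      # distributed scheme can parallelize the i/j loops of each round.
      INF = float('inf')
      dist = [[0,   3,   INF, 7],
              [8,   0,   2,   INF],
              [5,   INF, 0,   1],
              [2,   INF, INF, 0]]
      n = len(dist)
      for k in range(n):
          for i in range(n):
              for j in range(n):
                  if dist[i][k] + dist[k][j] < dist[i][j]:
                      dist[i][j] = dist[i][k] + dist[k][j]
      print(dist[0])   # shortest distances from vertex 0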
  • Supervised Feature Selection for Diagnosis of Coronary Artery Disease Based on Genetic Algorithm
    Sidahmed Mokeddem and Baghdad Atmani, Oran University, Algeria.
    ABSTRACT
    Feature Selection (FS) has become the focus of much research in decision support systems, where datasets with tremendous numbers of variables are analyzed. In this paper we present a new method for the diagnosis of Coronary Artery Disease (CAD) founded on FS with a Genetic Algorithm (GA) wrapped around Naïve Bayes (NB). The CAD dataset contains two classes defined by 13 features. In the GA-NB algorithm, the GA generates in each iteration a subset of attributes that is then evaluated using NB in the second step of the selection procedure. The final attribute set contains the most relevant feature model that increases the accuracy. The algorithm produces 85.50% classification accuracy in the diagnosis of CAD. Its performance is then compared with the use of the Support Vector Machine (SVM), Multi-Layer Perceptron (MLP) and C4.5 decision tree algorithms, whose classification accuracies are 83.5%, 83.16% and 80.85%, respectively. The GA-wrapped NB algorithm is also compared with other FS algorithms. The obtained results show very promising outcomes for the diagnosis of CAD.
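    A hedged sketch of the wrapper loop: a GA population of bit-mask feature subsets scored by cross-validated Naïve Bayes. The bundled breast-cancer dataset stands in for the 13-feature CAD data, and the population size, rates and generation count are illustrative assumptions.

      # GA-wrapped Naive Bayes feature selection: individuals are bit
      # masks over the features; fitness is cross-validated NB accuracy.
      import numpy as np
      from sklearn.datasets import load_breast_cancer
      from sklearn.model_selection import cross_val_score
      from sklearn.naive_bayes import GaussianNB

      X, y = load_breast_cancer(return_X_y=True)
      rng = np.random.default_rng(0)
      n_feat, pop_size, gens = X.shape[1], 20, 10

      def fitness(mask):
          if not mask.any():
              return 0.0
          return cross_val_score(GaussianNB(), X[:, mask], y, cv=3).mean()

      pop = rng.integers(0, 2, (pop_size, n_feat)).astype(bool)
      for _ in range(gens):
          scores = np.array([fitness(m) for m in pop])
          parents = pop[np.argsort(scores)[-pop_size // 2:]]   # selection
          children = parents.copy()
          cut = rng.integers(1, n_feat)                        # one-point crossover
          children[::2, :cut], children[1::2, :cut] = parents[1::2, :cut], parents[::2, :cut]
          children ^= rng.random(children.shape) < 0.05        # bit-flip mutation
          pop = np.vstack([parents, children])
      print('best CV accuracy:', max(fitness(m) for m in pop))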
  • An Efficient Feature Selection Paradigm Using PCA-CFS-Shapley Values Ensemble Applied to Small Medical Data Sets
    S. Sasikala, Anna University, India; Dr. S. Appavu alias Balamurugan, K.L.N. College of Information Technology, India; Dr. S. Geetha, Thiagarajar College of Engineering, India.
    ABSTRACT
    The precise diagnosis of patient profiles into categories, such as the presence or absence of a particular disease along with its level of severity, remains a crucial challenge in the biomedical field. This process relies on the performance of a classifier trained on a supervised training set with labeled samples; based on what it has learned, the classifier is then used to predict the labels of new samples. The presence of irrelevant features makes it difficult for standard classifiers to obtain good detection rates. Hence it is important to select the most relevant features, from which good classifiers can be constructed to obtain good accuracy and efficiency. This study aims to classify medical profiles, and realizes this through feature extraction (FE), feature ranking (FR) and dimension reduction (Shapley Values Analysis) as a hybrid procedure to improve classification efficiency and accuracy. To appraise the success of the proposed method, experiments were conducted across 6 different medical data sets using the J48 decision tree classifier. The experimental results showed that the PCA-CFS-Shapley Values analysis procedure improves classification efficiency and accuracy compared with using each method individually.
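    A hedged sketch of the first two stages of the hybrid (PCA for extraction and a simple correlation-based greedy ranking as a CFS stand-in); the Shapley Values step and the J48 classifier are omitted, and the bundled dataset is an illustrative assumption.

      # Stage 1: PCA feature extraction; stage 2: CFS-flavoured greedy
      # ranking that rewards correlation with the class and penalizes
      # redundancy with the features already selected.
      import numpy as np
      from sklearn.datasets import load_breast_cancer
      from sklearn.decomposition import PCA

      X, y = load_breast_cancer(return_X_y=True)
      Z = PCA(n_components=10).fit_transform(X)   # extracted components

      def cfs_rank(F, y, k=5):
          selected, remaining = [], list(range(F.shape[1]))
          while remaining and len(selected) < k:
              def merit(j):
                  relevance = abs(np.corrcoef(F[:, j], y)[0, 1])
                  redundancy = (np.mean([abs(np.corrcoef(F[:, j], F[:, s])[0, 1])
                                         for s in selected]) if selected else 0.0)
                  return relevance - redundancy
              best = max(remaining, key=merit)
              selected.append(best)
              remaining.remove(best)
          return selected

      print('PCA block shape:', Z.shape)
      print('CFS-style picks:', cfs_rank(X, y))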
  • Dealing with uncertainty in heterogeneous data streams over sliding window
    Houda Hentech, Mohammed Salah Gouider, and Amine Farhat, Université de Tunis, Tunisia.
    ABSTRACT
    Existing methods for clustering uncertain data streams over sliding windows do not handle categorical attributes, yet uncertain mixed data are ubiquitous. In this paper, an algorithm called SWHUClustering is proposed for clustering, over sliding windows, uncertain mixed data streams that contain both continuous and categorical attributes. A Heterogeneous Uncertain Temporal Cluster Feature (HUTCF) is introduced to monitor the distribution statistics of heterogeneous data points. Based on this structure, the Exponential Histogram of Heterogeneous Uncertain Cluster Features (EHHUCF) is presented as a collection of HUTCFs. This structure helps to handle in-cluster evolution and detects temporal changes in the cluster distribution. Our approach has several advantages over existing methods: 1) higher execution efficiency, which benefits from a design that avoids the effects of old data on the final results; 2) k-NN is incorporated into the clustering process in order to reduce the complexity of the algorithm; 3) memory consumption can be managed efficiently by limiting the number of HUTCFs in each EHHUCF. Simulations on real databases show the feasibility of our approach as well as its effectiveness, by comparison with the UMicro algorithm.
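    A hedged sketch of the kind of additive cluster-feature summary the HUTCF suggests: continuous attributes tracked with sums of expected values and squares, categorical attributes with accumulated per-value probability mass. Field names and update rules are illustrative assumptions, not the paper's exact definition.

      # Additive cluster-feature summary in the spirit of HUTCF.
      # Additivity is what lets such summaries be merged inside an
      # exponential-histogram structure like EHHUCF.
      from collections import defaultdict

      class HeterogeneousClusterFeature:
          def __init__(self, n_cont):
              self.n = 0
              self.ls = [0.0] * n_cont    # sums of expected values
              self.ss = [0.0] * n_cont    # sums of squared expected values
              self.cat = defaultdict(lambda: defaultdict(float))  # dim -> value -> mass

          def insert(self, cont_means, cat_probs):
              """cont_means: expected value per continuous dimension of an
              uncertain point; cat_probs: one {value: probability} dict per
              categorical dimension."""
              self.n += 1
              for d, m in enumerate(cont_means):
                  self.ls[d] += m
                  self.ss[d] += m * m
              for d, dist in enumerate(cat_probs):
                  for v, p in dist.items():
                      self.cat[d][v] += p

          def centroid(self):
              cont = [s / self.n for s in self.ls]
              cats = [max(self.cat[d], key=self.cat[d].get) for d in sorted(self.cat)]
              return cont, cats

      cf = HeterogeneousClusterFeature(n_cont=2)
      cf.insert([1.0, 2.0], [{'red': 0.7, 'blue': 0.3}])
      cf.insert([1.2, 1.8], [{'red': 0.6, 'blue': 0.4}])
      print(cf.centroid())   # ([1.1, 1.9], ['red'])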