N-gram-based detection of new malicious code
Current commercial anti-virus software detects a virus only after the virus has appeared and caused damage. Motivated by the standard signature-based technique for detecting viruses, and by a recent successful text classification method, we explore the idea of automatically detecting new malicious code using a collected dataset of benign and malicious code. We obtained an accuracy of 100% on the training data and 98% in 3-fold cross-validation.
Combined mining of Web server logs and web contents for classifying user navigation patterns and predicting users’ future requests
Haibin Liu, Vlado Kešelj|Data & Knowledge Engineering|2006
Language independent authorship attribution using character level language models
We present a method for computer-assisted authorship attribution based on character-level n-gram language models. Our approach is based on simple information-theoretic principles, and achieves improved performance across a variety of languages without requiring extensive pre-processing or feature selection. To demonstrate the effectiveness and language independence of our approach, we present experimental results on Greek, English, and Chinese data. We show that our approach achieves state-of-the-art performance in each of these cases. In particular, we obtain an 18% accuracy improvement over the best published results for a Greek data set, while using a far simpler technique than previous investigations.
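The idea in the authorship attribution abstract above can be illustrated with a minimal sketch: train one character n-gram model per candidate author and attribute a disputed text to the author whose model gives it the lowest cross-entropy. This is not the authors' implementation. The function names, the trigram order, the add-one smoothing, and the toy texts are all assumptions made for illustration; the abstract does not specify any of them.

```python
import math
from collections import Counter


def train_char_ngram_model(text, n=3):
    """Count character n-grams and their (n-1)-character contexts.

    Hypothetical helper: n=3 is an arbitrary illustrative choice.
    """
    padded = " " * (n - 1) + text
    positions = range(len(padded) - n + 1)
    ngrams = Counter(padded[i:i + n] for i in positions)
    contexts = Counter(padded[i:i + n - 1] for i in positions)
    return {"n": n, "ngrams": ngrams, "contexts": contexts, "vocab": set(padded)}


def cross_entropy(model, text, vocab_size):
    """Average bits per character of `text` under `model`.

    Uses add-one smoothing, an assumption chosen for simplicity.
    """
    n = model["n"]
    padded = " " * (n - 1) + text
    total_bits = 0.0
    count = 0
    for i in range(len(padded) - n + 1):
        gram = padded[i:i + n]
        context = gram[:-1]
        prob = (model["ngrams"][gram] + 1) / (model["contexts"][context] + vocab_size)
        total_bits -= math.log2(prob)
        count += 1
    return total_bits / count if count else float("inf")


def attribute(models, text):
    """Return the author whose model yields the lowest cross-entropy."""
    vocab = set(text)
    for model in models.values():
        vocab |= model["vocab"]
    vocab_size = len(vocab)
    return min(models, key=lambda author: cross_entropy(models[author], text, vocab_size))


# Toy usage with made-up training texts (not data from the paper).
models = {
    "author_a": train_char_ngram_model("the cat sat on the mat. the cat ate the rat."),
    "author_b": train_char_ngram_model("zzz qqq xxz zzq qxz zzz qqq xxz"),
}
print(attribute(models, "the cat sat"))
```

Because the sketch works on raw characters, it needs no tokenization or feature selection, which is the property the abstract credits for language independence.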
Automatic detection and rating of dementia of Alzheimer type through lexical analysis of spontaneous speech
Current methods of assessing dementia of Alzheimer type (DAT) in older adults involve structured interviews that attempt to capture the complex nature of the deficits suffered. One of the most significant areas affected by the disease is the capacity for functional communication as linguistic skills break down. These methods often do not capture the true nature of language deficits in spontaneous speech. We address this issue by exploring novel automatic and objective methods for diagnosing patients through analysis of spontaneous speech. We detail several lexical approaches to the problem of detecting and rating DAT. The approaches explored rely on character n-gram-based techniques, shown recently to perform successfully in a different but related task, automatic authorship attribution. We also explore the correlation between the usage frequency of different parts of speech and DAT. We achieve a high 95% accuracy in detecting dementia when compared with a control group, 70% accuracy in rating dementia into two classes, and 50% accuracy in rating dementia into four classes. Our results show that purely computational solutions offer a viable alternative to standard approaches to diagnosing the level of impairment in patients. These results are a significant step forward toward automatic and objective means of identifying early symptoms of DAT in older adults.
n-Gram-based classification and unsupervised hierarchical clustering of genome sequences
Andrija Tomović, Predrag Janičić, Vlado Kešelj|Computer Methods and Programs in Biomedicine|2006