


One of the main challenges for research in the field of Phraseology is to discover how phraseological combinations can be integrated into the grammatical rules they seem to contradict.

The use of multi-element units in computational linguistics has a long tradition that is reflected in both automatic text classification and the related forensic linguistic task of authorship attribution. Frequently performed by means of automated processes, authorship attribution aims to assign an anonymous piece of text to one subject within a list of potential authors. To achieve this goal, research on the task has been characterized for decades by the constant proposal of new classificatory features (Rudman, 1998). More recently, the number of features used in authorship attribution has exploded as all sorts of multi-element units have been introduced in the task, such as character, word and POS n-grams (Rico Sulayes, 2014). Easy to tag with computational tools, these multi-element units can produce long lists of features even in the small corpora (text collections authored by some set of subjects) that are standard in authorship attribution. The present study uses contributions to organized crime-related online forums to randomly create a number of corpora with an increasing number of subjects. These corpora are used to run several hundred experiments that test two different types of classificatory features: a rather short, previously selected list of multi-element function words and a large list of all word unigrams in a given corpus. The combined set of all these features is fed to a suite of the most common and successful machine learning classifiers for this task. As will be reported here, the best results that any classifier averages over all corpora are obtained after the list of features is reduced by statistical techniques. The significantly reduced feature sets that yield the best performance contain only a few of the previously selected multi-element function words. However, further analysis of the features selected from the list of unigrams shows that many of these elements either are part of multi-element units or themselves represent such units from a phraseological point of view (Beck and Mel'čuk, 2011; Corpas, 2013; Shanavas, 1996).
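
As an illustration only, the two feature types described in this abstract (word unigrams and a short list of multi-element function words) could be extracted along the following lines. This is a minimal sketch under stated assumptions, not the procedure used in the study: the function names, the `uni:` and `multi:` prefixes, the use of relative frequencies, and the three English expressions in the list are all hypothetical, and the statistical feature reduction and the classifiers are not shown.

```python
import re
from collections import Counter

# Hypothetical list of multi-element function words (not taken from the study).
MULTI_ELEMENT_FUNCTION_WORDS = ["in order to", "as well as", "such as"]


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def extract_features(text, multi_element_units=MULTI_ELEMENT_FUNCTION_WORDS):
    """Return relative frequencies of word unigrams and multi-element units."""
    tokens = tokenize(text)
    total = len(tokens) or 1  # avoid division by zero on empty text
    features = Counter()

    # Feature type 1: every word unigram in the text.
    for token in tokens:
        features["uni:" + token] += 1

    # Feature type 2: occurrences of each listed multi-element unit.
    for unit in multi_element_units:
        unit_tokens = unit.split()
        n = len(unit_tokens)
        count = sum(
            1
            for i in range(len(tokens) - n + 1)
            if tokens[i:i + n] == unit_tokens
        )
        if count:
            features["multi:" + unit] = count

    # Normalize raw counts by text length (an assumption of this sketch).
    return {name: value / total for name, value in features.items()}


sample = "We met in order to talk, in order to plan."
sample_features = extract_features(sample)
```

Each text authored by a subject would be turned into one such feature dictionary, and the combined dictionaries would then form the input to whatever feature reduction and classification steps are chosen.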
