Saurabh Chakravarty, Edward Fox
People worldwide use Twitter to post updates about the events that concern them directly or indirectly. Study of these posts can help identify global events and trends of importance. Similarly, E-commerce applications organize their products in a way that can facilitate their management and satisfy the needs and expectations of their customers. However, classifying data such as tweets or product descriptions is still a challenge. These data are described by short texts, containing in their vocabulary abbreviations of sentences, emojis, hashtags, implicit codes, and other non-standard usage of written language. Consequently, traditional text classification techniques are not effective on these data. In this paper, we describe our use of the Spark platform to implement two classification strategies to process large data collections, where each datum is a short textual description. One of our solutions uses an associative classifier, while the other is based on a multiclass Logistic Regression classifier using Word2Vec as a feature selection and transformation technique. Our associative classifier captures the relationships among words that uniquely identify each class, and Word2Vec captures the semantic and syntactic context of the words. In our experimental evaluation, we compared our solutions, as well as Spark MLlib classifiers. We assessed effectiveness, efficiency, and memory requirements. The results indicate that our solutions are able to effectively classify the millions of data instances composed of thousands of distinct features and classes, found in our digital libraries.
Eduardo P. S. Castro, Saurabh Chakravarty, Eric Williamson, Denilson Alves Pereira , Edward A. Fox: Classifying Short Unstructured Data Using the Apache Spark Platform. JCDL 2017: 129-138
- Date of publication:
- July 27, 2017
- Joint Conference on Digital Libraries
- Page number(s):