Eman Abdelrahman, Edward Fox
Electronic Theses and Dissertations (ETDs) are documents rich in research information that provide many benefits to students and future generations of scholars in various disciplines. Therefore, research is taking place to extract data from ETDs and make them more accessible. However, much of the related research involved ETDs in the English language, while Arabic ETDs remain an untapped source of data, although the number of Arabic ETDs available digitally is growing. Therefore, the need to make them more browsable and accessible increases. Some ways to achieve this need include data annotation, indexing, translation, and classification. As the size of the data increases, manual subject classification becomes less feasible. Accordingly, automatic subject classification becomes essential for the searchability and management of data. There are two main roadblocks to performing automatic subject classification of Arabic ETDs. The first is the lack of a large public corpus of Arabic ETDs for training purposes, while the second is the Arabic language’s linguistic complexity, especially in academic documents. This research aims to collect key metadata of Arabic ETDs, and apply different automatic subject classification methodologies. The first goal is aided by scraping data from the AskZad Digital Library. The second goal is achieved by exploring different machine learning and deep learning techniques. The experiments’ results show that deep learning using pretrained language models yielded the highest accuracy of approximately 0.83, while classical machine learning techniques yielded approximately 0.41 and 0.70 for multiclass classification one-vs-all classification respectively. This indicates that using pretrained language models assists in understanding languages which is essential for the classification of text.
Eman Abdelrahman, Edward A. Fox: Improving Accessibility to Arabic ETDs Using Automatic Classification. TPDL 2022: 230-242
- Date of publication:
- September 15, 2022
- International Conference on Theory and Practice of Digital Libraries
- Page number(s):