Effectiveness of commercial text embedding models for multilingual, multi-class SaaS software classification: A practical study

Accepted: 2025-07-30

Published: 2025-11-20

DOI: https://doi.org/10.4995/jclr.2025.24200

Keywords:

Natural Language Processing, Text Classification, Text Embedding, Large Language Model, CamemBERT, Software-as-a-Service

Supporting agencies:

This research was not funded

Abstract:

In the rapidly evolving field of Software as a Service (SaaS), accurately categorizing multilingual SaaS applications is a significant challenge due to the inherent linguistic diversity and the continuous growth in the number of available software categories. This study investigates the application of commercial text embedding models, which transform textual data into numerical representations, to multilingual, large-scale, multi-class software classification tasks. We systematically compare text embedding models integrated with classification algorithms, examining their predictive performance and transfer-learning capabilities across multiple languages. Our experiments demonstrate that these embedding models are substantially robust and effective in both monolingual and cross-lingual classification contexts. Notably, a multi-layer perceptron classifier trained on bilingual datasets (French and English) using OpenAI's text-embedding-3-large embedding model achieved high accuracy (0.90) and F1-score (0.78), even when evaluated on languages not represented in the training corpus. This research not only offers valuable insights for professionals and practitioners in the SaaS sector but also lays the groundwork for further work on advanced applications capable of handling the extensive textual data of the contemporary digital marketplace.
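The paper does not reproduce its code here, but the pipeline the abstract describes (embed descriptions with text-embedding-3-large, train a multi-layer perceptron on bilingual French/English data, evaluate on a language absent from training) can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation: it assumes the OpenAI Python SDK and scikit-learn, and all example texts, category labels, hyperparameters, and the choice of Spanish as the held-out language are hypothetical placeholders.

```python
# Minimal sketch of the pipeline described in the abstract: embed SaaS
# descriptions with OpenAI's text-embedding-3-large, fit an MLP classifier
# on bilingual (French/English) data, then evaluate on a language that was
# not in the training corpus. Example data and settings are hypothetical.
from openai import OpenAI
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score, f1_score

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def embed(texts):
    """Return one 3072-dimensional embedding vector per input text."""
    resp = client.embeddings.create(model="text-embedding-3-large", input=texts)
    return [d.embedding for d in resp.data]


# Hypothetical bilingual training corpus: SaaS descriptions -> categories.
train_texts = [
    "Cloud-based CRM for tracking sales pipelines and customer contacts",
    "Logiciel de facturation en ligne pour petites entreprises",
    "Team chat and video conferencing platform for remote work",
    "Plateforme de gestion de projet avec tableaux kanban",
]
train_labels = ["CRM", "Accounting", "Communication", "Project Management"]

# Evaluation set in a language absent from training (here, Spanish),
# illustrating the cross-lingual transfer the study reports.
test_texts = ["Herramienta de gestión de proyectos con diagramas de Gantt"]
test_labels = ["Project Management"]

clf = MLPClassifier(hidden_layer_sizes=(256,), max_iter=500, random_state=0)
clf.fit(embed(train_texts), train_labels)

preds = clf.predict(embed(test_texts))
print("accuracy:", accuracy_score(test_labels, preds))
print("macro F1:", f1_score(test_labels, preds, average="macro"))
```

The toy data above only illustrates the data flow; the reported accuracy (0.90) and F1-score (0.78) come from the study's much larger labeled corpus, and macro-averaged F1 is an assumption on our part.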
