Development of а System of Linguistic Markers for Automated Unloading of Thematic Text Data from а Social Network

Development of а System of Linguistic Markers for Automated Unloading of Thematic Text Data from а Social Network

Authors

Keywords:

Linguistic marker, big data, automated data collection, data upload, text collection, full-text search, social networks, VK, open API, vaccination, COVID-19

Abstract

Automated search and selection of texts on a specific topic in the target source to form a representative thematic text collection (text dataset) of large dimensions, being a special case of obtaining and structuring primary data, remains one of the most demanded applied tasks of natural language processing. The article presents the experience of developing a system of linguistic markers that allows automated extraction of texts related to the topic of vaccination against COVID-19 on the material of the VKontakte social network. A combination of linguistic methods with methods for collecting and processing text data allows forming the final dataset. The test list of markers forms is based on background knowledge, work with dictionaries and special linguistic services. The task was to create a list of words united by a common conceptual feature, to predict the joint occurrence of words in texts about vaccination against COVID-19, or to find specific words that mark this topic: occasionalisms, designations of specific realities. The content of the VKontakte thematic communities uploaded using the test list of markers became the source of automated and expert extraction of the main array of markers (354 units). The procedure for automated filtering of an intermediate text sample (12.8 million texts) is in detail. The technique of formation of stop-words is given. For the period from 01.01.2020 to 03.01.2023, 4.5 million relevant messages were retrieved; the validity of the markers was confirmed by an insignificant amount of noise on the scale of big data. The general principles of preparing linguistic markers for automated unloading of large text data are systematized; the strengths and weaknesses of this tool are noted; recommendations for the formation of a list of linguistic markers are suggested.

Author Biographies

Anna Yu. Sarkisova, Lomonosov Moscow State University

PhD, Associate Professor, Research Associate, School of Public Administration, Lomonosov Moscow State University, Moscow, Russian Federation.

sarkisova@data.tsu.ru

Evgeny Yu. Petrov, National Research Tomsk State University

Technician, Supercomputer Center, National Research Tomsk State University, Tomsk, Russian Federation.

petrov@data.tsu.ru

Daria O. Dunaeva, Lomonosov Moscow State University

Research Associate, School of Public Administration, Lomonosov Moscow State University, Moscow, Russian Federation.

ddo@data.tsu.ru

References

Горностаева Ю.А. Опыт выявления вербальных маркеров психологических и когнитивных процессов в лингвистике: к истории вопроса // Филологические науки. Вопросы теории и практики. 2018. № 8(86). Ч. 1. С. 91–94. DOI: 10.30853/filnauki.2018-8-1.21

Карпова А.Ю., Савельев А.О., Вильнин А.Д., Чайковский Д.В. Изучение процесса онлайн-радикализации молодежи в социальных медиа (междисциплинарный подход) // Мониторинг общественного мнения: экономические и социальные перемены. 2020. № 3. С.159–181. DOI: 10.14515/monitoring.2020.3.1585

Колмогорова А.В., Талдыкина Ю.А., Калинин А.А. Языковые маркеры манипуляции в поляризованном политическом дискурсе: опыт параметризации // Политическая лингвистика. 2016. № 4(58). С. 194–199.

Колмогорова А.В., Калинин А.А., Маликова А.В. Типология и комбинаторика вербальных маркеров различных эмоциональных тональностей в интернет-текстах на русском языке // Вестник Томского государственного университета. 2019. № 448. С. 48–58. DOI: 10.17223/15617793/448/6

Концевой M.P. Онлайновые семантические вычисления на платформе RusVectōrēs в преподавании компьютерной лингвистики// Дистанционное обучение — образовательная среда XXI века: материалы XII Международной научно-методической конференции, Минск, 26 мая 2022 г. Минск: БГУИР, 2022. C. 75.

Мишланов В.А., Каджая Л.А., Кузнецова Ю.М. Лингвистические маркеры эмоционального состояния субъекта речи (к проблеме автоматического мониторинга текстов сетевой коммуникации) // Медиалингвистика. 2020. Т. 7. № 4. С. 428–444. DOI: 10.21638/spbu22.2020.405

Петров Е.Ю., Саркисова А.Ю. Ресурс аналитической платформы PolyAnalyst в социогуманитарных научных исследованиях // Открытые данные — 2021: материалы форума / под ред. А.Ю. Саркисовой. Томск: Издательство Томского государственного университета, 2021. С. 94–104.

Сбоев А.Г., Гудовских Д.В., Молошников И.А., Кукин К.А., Рыбка Р.Б., Иванов И.И., Власов Д.С. Автоматическое выделение психолингвистических характеристик текстов в рамках концепции Big Data // Современные информационные технологии и IT-образование. 2013. № 9. С. 433–438.

Ahmad S., Asghar M.Z., Alotaibi F.M., Awan I. Detection and Classification of Social Media-Based Extremist Affiliations Using Sentiment Analysis Techniques // Human-centric Computing and Information Sciences. 2019. Vol. 9. DOI: 10.1186/s13673-019-0185-6

Cohen K., Johansson F., Kaati L., Clausen Mork J.C. Detecting Linguistic Markers for Radical Violence in Social Media // Terrorism and Political Violence. 2014. Vol. 26. Is. 1. P. 246–256. DOI: 10.1080/09546553.2014.849948

Deng W., Hsu J.-H., Löfgren K., Cho W. Who Is Leading China’s Family Planning Policy Discourse in Weibo? A Social Media Text Mining Analysis // Policy & Internet. 2021. Vol. 13. Is. 4. P. 485–501. DOI: 10.1002/poi3.264

Erseghe T., Badia L., Dzanko L., Suitner C. PLMP: A Method to Map the Linguistic Markers of the Social Discourse onto Its Semantic Network // 2022 IEEE / ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM). November 10–13, 2022, Istanbul, Turkey. Istanbul: Institute of Electrical and Electronics Engineers, 2022. P. 247–251. DOI: 10.1109/ASONAM55673.2022.10068643

Huang F., Ding H., Liu Z., Wu P., Zhu M., Li A., Zhu T. How Fear and Collectivism Influence Public’s Preventive Intention towards COVID-19 Infection: A Study Based on Big Data from the Social Media // BMC Public Health. 2020. Vol. 20. DOI: 10.1186/s12889-020-09674-6

Huh J-H. Big Data Analysis for Personalized Health Activities: Machine Learning Processing for Automatic Keyword Extraction Approach // Symmetry. 2018. Vol. 10. Is. 4. DOI: 10.3390/sym10040093

Kessel R. van, Kyriopoulos I., Wong B.L.H., Mossialos E. The Effect of the COVID-19 Pandemic on Digital Health–Seeking Behavior: Big Data Interrupted Time-Series Analysis of Google Trends // Journal of Medical Internet Research. 2023. Vol. 25. DOI: 10.2196/42401

Liu T., Giorgi S., Yadeta K., Schwarts H.A., Ungar L.H., Curtis B. Linguistic Predictors from Facebook Postings of Substance Use Disorder Treatment Retention versus Discontinuation // The American Journal of Drug and Alcohol Abuse Encompassing. 2022. Vol. 48. Is. 5. P. 573–585. DOI: 10.1080/00952990.2022.2091450

Shchekotin E.V., Goiko V.L., Myagkov M.G., Dunaeva D.O. Assessment of Quality of Life in Regions of Russia Based on Social Media Data // Journal of Eurasian Studies. 2021. Vol. 12. № 2. DOI: 10.1177/18793665211034185

Downloads

Published

2023-10-30

How to Cite

Development of а System of Linguistic Markers for Automated Unloading of Thematic Text Data from а Social Network. (2023). Public Administration. E-Journal (Russia), 97, 70-84. https://doi.org/10.24412/6ne3kx55

Issue

Section

Scientific articles

Categories

How to Cite

Development of а System of Linguistic Markers for Automated Unloading of Thematic Text Data from а Social Network. (2023). Public Administration. E-Journal (Russia), 97, 70-84. https://doi.org/10.24412/6ne3kx55

Most read articles by the same author(s)

<< < 30 31 32 33 34 35 36 > >> 

Similar Articles

1-10 of 215

You may also start an advanced similarity search for this article.

Loading...