Abstract:
The volume of news has grown rapidly in recent years, and people can no longer process it effectively. This is the main reason why automatic methods of news stream analysis have become an important part of modern science. This paper is devoted to the part of news stream analysis known as “event detection”, where an “event” is a group of news items devoted to one real-world event. We study news from Russian news agencies. We treat the task as clustering of news items and compare algorithms using external clustering metrics. The paper introduces a novel approach to detecting events in Russian-language news. We propose a two-stage clustering method that comprises a “rough” clustering algorithm at the first stage and a refining classifier at the second stage. At the first stage, a combination of the shingles method and naive named-entity-based clustering is used. We also present a labeled dataset for news event detection based on the «Yandex News» service. This manually labeled dataset can be used to estimate the performance of event detection methods. Empirical evaluation on these corpora showed the effectiveness of the proposed method for event detection in news texts.
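As an illustration of the shingles idea mentioned in the abstract, the following is a minimal sketch of shingle-based “rough” grouping. It is not the authors' implementation: the word-level shingle size `k`, the Jaccard `threshold`, the single-link merging rule, and the function names `shingles`, `jaccard`, and `rough_clusters` are all assumptions made for the example. The named-entity component and the second-stage classifier are not covered.

```python
from itertools import combinations


def shingles(text, k=3):
    """Return the set of k-word shingles (overlapping word k-grams) of a text."""
    words = text.lower().split()
    if len(words) < k:
        # Texts shorter than k words yield a single shingle (or none if empty).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; defined as 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rough_clusters(texts, k=3, threshold=0.5):
    """Group texts whose shingle sets overlap at or above the threshold.

    Merging is single-link: any sufficiently similar pair joins two groups.
    The threshold value is an assumed example, not taken from the paper.
    """
    sets = [shingles(t, k) for t in texts]
    parent = list(range(len(texts)))  # union-find forest over text indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(texts)), 2):
        if jaccard(sets[i], sets[j]) >= threshold:
            parent[find(j)] = find(i)

    groups = {}
    for i in range(len(texts)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())
```

For example, two near-duplicate headlines about the same rate decision share most of their three-word shingles and fall into one group, while an unrelated sports headline stays separate. A rough first pass of this kind only captures near-verbatim overlap, which is consistent with the abstract's use of a second, refining stage.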
This work was supported by a grant from the Russian Foundation for Basic Research, No. 18-07-01059.
Document Type:
Article
Language: Russian
Citation:
K. A. Skorniakov, A. S. Laskina, D. Yu. Turdakov, “Two step method for grouping news with similar topics”, Proceedings of ISP RAS, 32:4 (2020), 165–174
\Bibitem{SkoLasTur20}
\by K.~A.~Skorniakov, A.~S.~Laskina, D.~Yu.~Turdakov
\paper Two step method for grouping news with similar topics
\jour Proceedings of ISP RAS
\yr 2020
\vol 32
\issue 4
\pages 165--174
\mathnet{http://mi.mathnet.ru/tisp532}
\crossref{https://doi.org/10.15514/ISPRAS-2020-32(4)-12}
Linking options:
https://www.mathnet.ru/eng/tisp532
https://www.mathnet.ru/eng/tisp/v32/i4/p165
This publication is cited in the following article:
D. Yu. Turdakov, S. V. Garbuk, P. V. Khenkin, I. S. Kozlov, A. V. Laguta, M. I. Varlamov, “A Model and Method for Detecting Information Campaigns”, Program Comput Soft, 47:4 (2021), 261