Journal of Guangxi Normal University (Philosophy and Social Sciences Edition) ›› 2019, Vol. 37 ›› Issue (3): 71-78. doi: 10.16088/j.issn.1001-6600.2019.03.008


Topic Discovery in Microblog Based on BTM and Weighting K-Means

CHEN Feng, MENG Zuqiang*

  1. School of Computer, Electronics and Information, Guangxi University, Nanning, Guangxi 530004, China
  • Published: 2019-07-12
  • Corresponding author: MENG Zuqiang (born 1964), male (Zhuang ethnicity), from Luocheng, Guangxi; professor at Guangxi University. E-mail: zqmeng@126.com
  • Funding:
    National Natural Science Foundation of China (61762009)

Abstract: Microblog posts are short, have low word frequency, and carry little semantic context. To cope with these characteristics, improve the accuracy of topic discovery, and help users extract useful information from large volumes of microblog data, this paper proposes a topic discovery method based on the biterm topic model (BTM) and weighted K-Means. First, to address the sparsity of microblog data, the short texts are modeled with BTM to obtain topic words. Second, to address defects of the traditional K-Means algorithm, a weighted K-Means algorithm is proposed to discover microblog topics. Finally, the method is validated experimentally. The results show that combining BTM with weighted K-Means addresses the high dimensionality and sparsity of microblog data and improves the accuracy and effectiveness of hot-topic discovery.

Key words: biterm topic model (BTM), weighted K-Means, microblog data, topic discovery
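
The two stages named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state the weighting scheme, so the per-dimension `weights` used in the distance are an assumption, and the function names `extract_biterms`, `weighted_sq_dist`, and `weighted_kmeans` are hypothetical. The first function shows the biterm idea that BTM builds on (unordered word pairs that co-occur in one short text); BTM topic inference itself is not shown. The second runs Lloyd-style K-Means with a weighted squared Euclidean distance on vectors such as per-document topic distributions.

```python
from itertools import combinations
import random


def extract_biterms(tokens):
    """Return all unordered word pairs (biterms) from one short text."""
    # Sorting each pair makes ("b", "a") and ("a", "b") the same biterm.
    return [tuple(sorted(pair)) for pair in combinations(tokens, 2)]


def weighted_sq_dist(x, c, weights):
    """Squared Euclidean distance with one weight per dimension.

    ASSUMPTION: the per-dimension weighting is illustrative only; the
    paper's actual weighting scheme is not given in the abstract.
    """
    return sum(w * (xi - ci) ** 2 for xi, ci, w in zip(x, c, weights))


def weighted_kmeans(points, k, weights, n_iter=50, seed=0):
    """Lloyd-style K-Means that uses the weighted distance above."""
    rng = random.Random(seed)
    # Initialise the centers with k distinct input points.
    centers = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(n_iter):
        # Assignment step: nearest center under the weighted distance.
        new_labels = [
            min(range(k), key=lambda j: weighted_sq_dist(p, centers[j], weights))
            for p in points
        ]
        # Update step: each center becomes the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == j]
            if members:  # keep the old center if a cluster is empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
    return labels, centers


# Toy usage: biterms from one tokenised post, then clustering of
# made-up two-topic document vectors (not data from the paper).
biterms = extract_biterms(["apple", "phone", "launch"])
doc_topic = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
labels, centers = weighted_kmeans(doc_topic, k=2, weights=[1.0, 1.0])
```

Setting all weights to 1.0 reduces the sketch to ordinary K-Means, which makes it easy to compare a chosen weighting against the unweighted baseline.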

CLC number: TP391