A Fast Regular Expression Set Matching Algorithm Based on Bloom Filter

Journal of South China University of Technology (Natural Science Edition), 2009, Vol. 37, Issue 4: 37-41.

Xu Ke-fu, Qi De-yu, Zheng Wei-ping, Qian Zheng-ping

Research Institute of Computer System, South China University of Technology, Guangzhou 510640, Guangdong, China

Received: 2008-02-26; Revised: 2008-05-21; Online: 2009-04-25; Published: 2009-04-25
Contact: Xu Ke-fu (born 1977), male, Ph.D. candidate, working mainly on network information security and computer system architecture. E-mail: xkfool@163.com
Supported by: the China Postdoctoral Science Foundation (2005037582) and the Guangdong-Hong Kong Key Field Breakthrough Project (2005A10307007)

Abstract

The performance of regular expression searching algorithms is proportional to L_min, the length of the shortest path from the initial state to a final state of the nondeterministic finite automaton (NFA), and inversely proportional to the size of Pref(RE), the prefix set of the language denoted by the regular expression. In general Pref(RE) is large, so locating occurrences of its elements in the target text is difficult. This paper proposes a regular expression set searching algorithm based on the Bloom Filter. Because the time of a Bloom Filter membership query is independent of the size of the set, the algorithm can quickly locate occurrences of Pref(RE), so that the search speed is not affected by Pref(RE); when multiple Bloom Filters are used in parallel, L_min can also be indirectly increased. Analysis and experimental results indicate that the algorithm considerably accelerates regular expression searching, and that the improvement is especially marked for regular expression sets: when L_min is long and Pref(RE) is large, the search speed increases by several times to tens of times. The algorithm is therefore suitable for fast, large-scale searching of multiple regular expressions.

Key words: regular expression matching; Bloom Filter; automaton; pattern matching; fast matching
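
The idea summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `BloomFilter` class, the `find_matches` function, the hash construction, and the hand-listed prefix sets are all assumptions made for illustration, and the parallel multi-filter variant is not shown. The sketch stores every fixed-length prefix of the pattern languages in one Bloom filter, tests each text window with a single membership query whose cost does not depend on how many prefixes are stored, and runs the regular expressions only at positions the filter accepts.

```python
import hashlib
import re


class BloomFilter:
    """Fixed-size Bloom filter over byte strings (illustrative, no deletion)."""

    def __init__(self, num_bits: int = 1 << 16, num_hashes: int = 4) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: bytes):
        # Double hashing: derive k bit positions from two 64-bit halves
        # of one BLAKE2b digest (an assumed choice of hash functions).
        digest = hashlib.blake2b(item, digest_size=16).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:], "big") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: bytes) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: bytes) -> bool:
        # May report false positives, never false negatives.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


def find_matches(rules, text, prefix_len):
    """Return (start, matched_text) pairs for every rule matching in text.

    rules is a list of (regex, prefixes) pairs, where prefixes is the
    hand-listed set of length-prefix_len prefixes of the regex's language
    (assumed given here; the paper derives Pref(RE) from the automaton).
    """
    bloom = BloomFilter()
    for _, prefixes in rules:
        for prefix in prefixes:
            if len(prefix) != prefix_len:
                raise ValueError("every prefix must have length prefix_len")
            bloom.add(prefix.encode())

    compiled = [re.compile(regex) for regex, _ in rules]
    hits = []
    for start in range(len(text) - prefix_len + 1):
        window = text[start:start + prefix_len]
        if window.encode() not in bloom:
            continue  # definitely not the start of any match
        # The filter accepted the window: verify with the real patterns,
        # which also discards Bloom filter false positives.
        for pattern in compiled:
            match = pattern.match(text, start)
            if match:
                hits.append((start, match.group(0)))
    return hits


rules = [
    (r"ab[cd]+e", ["abc", "abd"]),
    (r"xy+z", ["xyy", "xyz"]),
]
print(find_matches(rules, "zzabcdexyyz-abx", 3))
# prints [(2, 'abcde'), (7, 'xyyz')]
```

Because every candidate position is re-checked by the regular expression itself, a false positive in the filter costs only a wasted verification and never produces a wrong match.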

Xu Ke-fu, Qi De-yu, Zheng Wei-ping, Qian Zheng-ping. A Fast Regular Expression Set Matching Algorithm Based on Bloom Filter[J]. Journal of South China University of Technology (Natural Science Edition), 2009, 37(4): 37-41.
