A regex for modifying href attributes

Requirement:

In an HTML document, find every href attribute whose value does not end with '/' and append .html to the original value. For example:

<a href="http://baidu.com/"> stays unchanged
<a href="link"> becomes <a href="link.html">
<a class="n" href="link"> becomes <a class="n" href="link.html">
<a style="margin:0" href="link" target="_this"> becomes <a style="margin:0" href="link.html" target="_this">
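
The rule behind these examples reduces to a check on the attribute value itself. A minimal sketch, assuming a helper named append_html (a hypothetical name, not from the original page):

```ruby
# Hypothetical helper: apply the rule to a single href value.
# Values ending in '/' are left alone; anything else gets '.html' appended.
def append_html(value)
  value.end_with?('/') ? value : "#{value}.html"
end

puts append_html('http://baidu.com/')  # left unchanged
puts append_html('link')               # gains the .html suffix
```

The regular expressions in the methods below have to do the same thing, plus locate the value inside an <a> tag.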

Test page:

http://www.deskcity.com/details/picture/2100.html

Method 1:

Fetch the file:


wget http://www.deskcity.com/details/picture/2100.html

Process it in irb:

irb
>> s = File.read('2100.html')
>> a = s.gsub(/<a.*?href=['"](.*?\w+)['"][^>]*?>/im){|x| x.gsub($1, "#{$1}.html")}
>> File.open('output.html', 'w'){|f| f.write(a)}
>> exit
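
To try Method 1 without downloading the test page, the same expression can be run against the sample tags from the requirement. This is only a sketch: the method name method_one and the choice to process each tag as its own string are assumptions made for illustration.

```ruby
# Method 1 wrapped in a method (the name method_one is hypothetical).
def method_one(html)
  html.gsub(/<a.*?href=['"](.*?\w+)['"][^>]*?>/im) { |tag| tag.gsub($1, "#{$1}.html") }
end

samples = [
  '<a href="http://baidu.com/">',
  '<a href="link">',
  '<a class="n" href="link">',
  '<a style="margin:0" href="link" target="_this">'
]

# Each tag is processed on its own so that one match cannot span two tags.
samples.each { |tag| puts "#{tag}  =>  #{method_one(tag)}" }
```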

Review the changes:


diff 2100.html output.html
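
When reading the diff, it helps to confirm that only href values changed. With the /m flag, . also matches newlines and quotes, so (.*?\w+) in the pattern above can run past the closing quote of an href that ends in '/' and capture a later attribute instead. The following is a sketch of a tighter pattern, not the method used on this page; the name add_html_strict is hypothetical.

```ruby
# Hypothetical stricter variant: the value may not contain quotes or '>',
# and its last character may not be '/'.
def add_html_strict(html)
  # $1: tag text up to href=, $2: the opening quote, $3: the href value.
  pattern = /(<a\b[^>]*?\shref=)(["'])([^"'>]*[^"'>\/])\2/i
  html.gsub(pattern) { "#{$1}#{$2}#{$3}.html#{$2}" }
end

puts add_html_strict('<a href="http://baidu.com/" class="n">')  # left alone
puts add_html_strict('<a class="link" href="link">')            # only href changes
```

Restricting the value to characters other than quotes keeps a match inside one attribute, at the cost of skipping href values that themselves contain a quote character.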

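Method 2:

Capture the text before and after the href value in their own groups and rebuild the tag in a single pass (s is the string read in Method 1):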


a = s.gsub(/(<a.*?href=['"])(.*?\w+)(['"][^>]*?>)/im){|x| "#{$1}#{$2}.html#{$3}"}
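
The two methods are not strictly equivalent. Method 1 re-runs gsub over the whole matched tag, so it rewrites every occurrence of the captured value, while Method 2 only touches the captured position. A sketch showing the difference on a made-up tag (the method names and the sample tag are assumptions, not taken from the test page):

```ruby
# Method 1: second gsub over the matched tag.
def method_one(html)
  html.gsub(/<a.*?href=['"](.*?\w+)['"][^>]*?>/im) { |tag| tag.gsub($1, "#{$1}.html") }
end

# Method 2: rebuild the tag from three capture groups.
def method_two(html)
  html.gsub(/(<a.*?href=['"])(.*?\w+)(['"][^>]*?>)/im) { "#{$1}#{$2}.html#{$3}" }
end

tag = '<a class="link" href="link">'  # class value equals the href value
puts method_one(tag)  # the class attribute is rewritten as well
puts method_two(tag)  # only the href value changes
```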

Comparing the performance of the two methods

Create a file named href.rb:

# Read the test page once; both loops reuse the same string.
s = File.read('2100.html')

# Method 1: match the whole tag, then run a second gsub inside the block.
start_at = Time.now.to_f
10000.times do
  a = s.gsub(/<a.*?href=['"](.*?\w+)['"][^>]*?>/im){|x| x.gsub($1, "#{$1}.html")}
end
end_at = Time.now.to_f
puts "twice gsub used: #{end_at - start_at}"

# Method 2: capture three groups and rebuild the tag in one pass.
start_at = Time.now.to_f
10000.times do
  a = s.gsub(/(<a.*?href=['"])(.*?\w+)(['"][^>]*?>)/im){|x| "#{$1}#{$2}.html#{$3}"}
end
end_at = Time.now.to_f
puts "three $ used: #{end_at - start_at}"

Run it to get the results:

$ ruby href.rb 
twice gsub used: 18.4503409862518
three $ used: 17.6931488513947

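The two methods perform about the same: roughly 18.45 s versus 17.69 s for 10000 iterations, a difference of about 4%.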
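
The same comparison can also be written with Ruby's standard Benchmark module, which takes care of the start and end timestamps. This is a sketch under stated assumptions: the method names are hypothetical, and a short generated string stands in for 2100.html so the script runs without the download.

```ruby
require 'benchmark'

def method_one(html)
  html.gsub(/<a.*?href=['"](.*?\w+)['"][^>]*?>/im) { |tag| tag.gsub($1, "#{$1}.html") }
end

def method_two(html)
  html.gsub(/(<a.*?href=['"])(.*?\w+)(['"][^>]*?>)/im) { "#{$1}#{$2}.html#{$3}" }
end

# Stand-in input (an assumption), not the real test page.
html = %(<a class="n" href="link">text</a>\n) * 100
runs = 200  # far fewer iterations than href.rb, to keep the run short

# Benchmark.realtime returns the elapsed wall-clock seconds as a Float.
one = Benchmark.realtime { runs.times { method_one(html) } }
two = Benchmark.realtime { runs.times { method_two(html) } }

puts format('twice gsub used: %.4f s', one)
puts format('three $ used:    %.4f s', two)
```

Absolute numbers from this sketch are not comparable with the figures above, since both the input and the iteration count differ.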
