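A regex for rewriting href values
Requirement:
In an HTML document, find every href attribute whose value does not end with '/' and append .html to the original value. For example: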
    <a href="http://baidu.com/"> stays unchanged
    <a href="link"> becomes <a href="link.html">
    <a class="n" href="link"> becomes <a class="n" href="link.html">
    <a style="margin:0" href="link" target="_this"> becomes <a style="margin:0" href="link.html" target="_this">
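The requirement above can be written down as a small executable sketch. The helper name `append_html_suffix` and the pattern `HREF_IN_ANCHOR` are hypothetical names introduced here for illustration; they are not the patterns used in the methods on this page. The sketch assumes quoted href values and only looks at `<a>` tags.

```ruby
# Hypothetical helper (not from the original page): appends ".html" to the
# href value of <a> tags unless the value is empty or ends with "/".
# Group 1: everything up to and including the opening quote.
# Group 2: the quote character itself. Group 3: the href value.
HREF_IN_ANCHOR = /(<a\b[^>]*?\shref=(['"]))([^'"]*)\2/i

def append_html_suffix(html)
  html.gsub(HREF_IN_ANCHOR) do
    prefix, quote, url = $1, $2, $3
    url = "#{url}.html" unless url.empty? || url.end_with?('/')
    "#{prefix}#{url}#{quote}"
  end
end

puts append_html_suffix('<a href="http://baidu.com/">')
puts append_html_suffix('<a class="n" href="link">')
puts append_html_suffix('<a style="margin:0" href="link" target="_this">')
```

Because the value is matched with `[^'"]*`, a match cannot run past the closing quote into a later attribute, which is one way to keep a URL ending in '/' from being touched.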
Test page:
http://www.deskcity.com/details/picture/2100.html
Method 1:
Fetch the file:

    wget http://www.deskcity.com/details/picture/2100.html

Process it in irb:

    irb
    >> s = File.read('2100.html')
    >> a = s.gsub(/<a.*?href=['"](.*?\w+)['"][^>]*?>/im){|x| x.gsub($1, "#{$1}.html")}
    >> File.open('output.html', 'w'){|f| f.write(a)}
    >> exit

Review the changes:

    diff 2100.html output.html

Method 2 (capture three groups and rebuild the tag instead of calling gsub a second time):

    a = s.gsub(/(<a.*?href=['"])(.*?\w+)(['"][^>]*?>)/im){|x| "#{$1}#{$2}.html#{$3}"}

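The two methods do not always produce the same output. A small sketch, using a made-up tag whose class name happens to equal its href value, shows the difference: the inner `gsub` in Method 1 replaces every occurrence of the captured text inside the tag, while Method 2 only touches the captured href value.

```ruby
# Made-up input for illustration: the class name equals the href value.
tag = '<a class="link" href="link">'

# Method 1: the inner gsub replaces every occurrence of the captured text.
one = tag.gsub(/<a.*?href=['"](.*?\w+)['"][^>]*?>/im) { |x| x.gsub($1, "#{$1}.html") }

# Method 2: only the captured href value gets the suffix.
two = tag.gsub(/(<a.*?href=['"])(.*?\w+)(['"][^>]*?>)/im) { "#{$1}#{$2}.html#{$3}" }

puts one  # <a class="link.html" href="link.html">
puts two  # <a class="link" href="link.html">
```

On pages where an href value never repeats elsewhere in its own tag, both methods give the same result.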
Comparing the performance of the two methods
Create the file href.rb:

    s = File.read('2100.html')

    # Method 1: nested gsub
    start_at = Time.now.to_f
    10000.times do
      a = s.gsub(/<a.*?href=['"](.*?\w+)['"][^>]*?>/im){|x| x.gsub($1, "#{$1}.html")}
    end
    end_at = Time.now.to_f
    puts "twice gsub used: #{end_at - start_at}"

    # Method 2: three capture groups
    start_at = Time.now.to_f
    10000.times do
      a = s.gsub(/(<a.*?href=['"])(.*?\w+)(['"][^>]*?>)/im){|x| "#{$1}#{$2}.html#{$3}"}
    end
    end_at = Time.now.to_f
    puts "three $ used: #{end_at - start_at}"

Test results:

    $ ruby href.rb
    twice gsub used: 18.4503409862518
    three $ used: 17.6931488513947

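The two methods perform about the same: over 10,000 iterations the difference is about 0.76 seconds, roughly 4%.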
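The same comparison can also be sketched with Ruby's standard `Benchmark` module, which handles the timing and prints a small table. The input here is a made-up stand-in string rather than 2100.html, and the constant and method names are hypothetical, so the numbers it prints are not comparable to the results above.

```ruby
require 'benchmark'

# Made-up stand-in input; the original test read 2100.html instead.
SAMPLE = %(<a style="margin:0" href="link" target="_this">text</a>\n) * 200

TWICE_GSUB   = /<a.*?href=['"](.*?\w+)['"][^>]*?>/im
THREE_GROUPS = /(<a.*?href=['"])(.*?\w+)(['"][^>]*?>)/im

# Method 1: nested gsub on the matched tag.
def twice_gsub(s)
  s.gsub(TWICE_GSUB) { |x| x.gsub($1, "#{$1}.html") }
end

# Method 2: rebuild the tag from three capture groups.
def three_groups(s)
  s.gsub(THREE_GROUPS) { "#{$1}#{$2}.html#{$3}" }
end

# Run each method 100 times and print user, system, and real time.
Benchmark.bm(14) do |bm|
  bm.report('twice gsub:')   { 100.times { twice_gsub(SAMPLE) } }
  bm.report('three groups:') { 100.times { three_groups(SAMPLE) } }
end
```

To measure the real page, replace `SAMPLE` with the contents of 2100.html and raise the iteration count.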