23 May 2017

这个bug不是我发现的,而是同事乙发现的。

当时同事乙在某公司实习。他测试发现x264开启ARM 32位汇编优化后压缩效率明显变差:同样质量下,码率高了20%! 于是,他向x264开发者反映了这个问题。 x264的开发者很快意识到这是某个bug导致的,并找到问题做了修复

这是去年发生的。如今我知道这个事情后,又去详细探究了一番。

这个bug出现在文件common/arm/mc-a.S中的function x264_mbtree_propagate_list_internal_neon函数里。引入这个bug的提交是6f04b146

Shengbins-Mac-mini:x264 shengbin$ git show 6f04b146
commit 6f04b146875c45e6f7845a7bb5fb7fdf8e7534f1
Author: Martin Storsjö <martin@martin.st>
Date:   Thu Sep 3 09:30:44 2015 +0300

	arm: Implement x264_mbtree_propagate_{cost, list}_neon

	The cost function could be simplified to avoid having to clobber
	q4/q5, but this requires reordering instructions which increase
	the total runtime.

	checkasm timing       Cortex-A7      A8      A9
	mbtree_propagate_cost_c      63702   155835  62829
	mbtree_propagate_cost_neon   17199   10454   11106

	mbtree_propagate_list_c      104203  108949  84532
	mbtree_propagate_list_neon   82035   78348   60410
	
diff --git a/common/arm/mc-a.S b/common/arm/mc-a.S
index 5e0c117d..30d1c1ad 100644
--- a/common/arm/mc-a.S
+++ b/common/arm/mc-a.S

...(省略部分内容)

+function x264_mbtree_propagate_list_internal_neon
+    vld2.16         {d4[], d5[]}, [sp]      @ bipred_weight, mb_y
+    movrel          r12, pw_0to15
+    vmov.u16        q10, #0xc000
+    vld1.16         {q0},  [r12, :128]      @h->mb.i_mb_x,h->mb.i_mb_y
+    vmov.u32        q11, #4
+    vmov.u8         q3,  #32
+    vdup.u16        q8,  d5[0]              @ mb_y
+    vzip.u16        q0,  q8
+    ldr             r12, [sp, #8]
+8:

...(省略部分内容)

而修复这个bug的提交则是2ebdb90b

Shengbins-Mac-mini:x264 shengbin$ git show 2ebdb90b
commit 2ebdb90bd32c3d1618b1c5b360bff750b82b1d0b
Author: Martin Storsjö <martin@martin.st>
Date:   Tue Dec 27 00:22:48 2016 +0200

	arm: Load mb_y properly in mbtree_propagate_list_internal_neon

	The previous version, attempting to load two stack parameters at once,
	only would have worked if they were interpreted and loaded as 32 bit
	elements, not when loading them as 16 bit elements.
	
diff --git a/common/arm/mc-a.S b/common/arm/mc-a.S
index 165c1fa9..8c151915 100644
--- a/common/arm/mc-a.S
+++ b/common/arm/mc-a.S
@@ -1818,13 +1818,14 @@ function x264_mbtree_propagate_cost_neon
 endfunc
 
 function x264_mbtree_propagate_list_internal_neon
-    vld2.16         {d4[], d5[]}, [sp]      @ bipred_weight, mb_y
+    vld1.16         {d4[]}, [sp]            @ bipred_weight
     movrel          r12, pw_0to15
     vmov.u16        q10, #0xc000
     vld1.16         {q0},  [r12, :128]      @h->mb.i_mb_x,h->mb.i_mb_y
+    ldrh            r12,  [sp, #4]
     vmov.u32        q11, #4
     vmov.u8         q3,  #32
-    vdup.u16        q8,  d5[0]              @ mb_y
+    vdup.u16        q8,  r12                @ mb_y
     vzip.u16        q0,  q8
     ldr             r12, [sp, #8]
 8:

从2015年9月到2016年12月,这个bug存在了一年多没有被发现。凡是在这期间把x264编译到手机上运行的用户,都会受影响。而这个影响是很严重的,毕竟20%的码率开销呢。

是不是应该赶紧去看一下你用的是哪个版本的x264?

这件事中,让我略感惊讶的是:像x264这样历史悠久应用广泛的软件,也有这么不靠谱的时候!