[Tech.Notepad] Facts behind "Bush hid the facts"
- Open Notepad and type "Bush hid the facts" without quotation marks
- Save it and reopen it.
- Can you read it back?
Googling "Bush hid the facts" turned up a lot of interesting blog posts and forum threads promoting anti-Bush conspiracy theories, but I could not find a satisfactory answer.
So I decided to do my own analysis. Here is what I found.
First of all, a hex viewer showed that the content of the file was not modified, so something must go wrong after those bytes are read back. The displayed characters hinted at a character-encoding problem, and when I tried to re-save the file, the encoding had changed to Unicode. How did Notepad decide that the file is Unicode? It must have used the IsTextUnicode() Win32 API. In a debugger I set a breakpoint at advapi32.dll!IsTextUnicode(). The stack trace, along with the disassembly, showed that:
Notepad's function int __stdcall IsInputTextUnicode(LPVOID lpBuffer, int cb) calls
advapi32.dll's IsTextUnicode(), which ultimately calls ntdll.dll's RtlIsTextUnicode().
This undocumented Native API function holds the logic that “determines whether a buffer is likely to contain a form of Unicode text”. Here is the disassembly of IsInputTextUnicode:
010070B1 >/$ 8BFF MOV EDI,EDI
010070B3 |. 55 PUSH EBP
010070B4 |. 8BEC MOV EBP,ESP
010070B6 |. 51 PUSH ECX
010070B7 |. 834D FC FF OR DWORD PTR SS:[EBP-4],0FFFFFFFF
010070BB |. 8D45 FC LEA EAX,DWORD PTR SS:[EBP-4]
010070BE |. 50 PUSH EAX
010070BF |. FF75 0C PUSH DWORD PTR SS:[EBP+C]
010070C2 |. FF75 08 PUSH DWORD PTR SS:[EBP+8]
010070C5 |. FF15 0C100001 CALL DWORD PTR DS:[77DFD5FD] ; <&ADVAPI32.IsTextUnicode>
010070CB |. C9 LEAVE
010070CC \. C2 0800 RETN 8
We can see that IsInputTextUnicode sets the options value to 0xFFFFFFFF (the OR at 010070B7) and passes a pointer to it, which means the test procedure includes every flag, including 'IS_TEXT_UNICODE_STATISTICS'. As there are no Unicode signature bytes 0xFF, 0xFE [ÿþ] (or 0xFE, 0xFF [þÿ] for big-endian) at the start of the file, it tries to predict the buffer type from byte statistics (this might include the deviation of double-byte values, as each Unicode code page is range constrained). This is essentially a classification problem. Statistical analysis inherently demands a large set of data (here, a large file) and fails to give a satisfactory result on a small one. In the case of “Bush hid the facts”, IsTextUnicode() reports 2, which is the value of 'IS_TEXT_UNICODE_STATISTICS'. According to MSDN, this value means “The text is probably Unicode, with the determination made by applying statistical analysis. Absolute certainty is not guaranteed.”.
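The two preconditions described above can be checked by hand. This is only a small sketch, not Notepad's or RtlIsTextUnicode()'s actual code: has_utf16_signature is a hypothetical helper name of mine, the flag value 0x0002 is the documented one, and 0xFFFFFFFF is the operand of the OR instruction in the disassembly.

```python
import codecs

# Flag value as documented for the Win32 IsTextUnicode() API.
IS_TEXT_UNICODE_STATISTICS = 0x0002


def has_utf16_signature(data: bytes) -> bool:
    """Hypothetical helper: True if the buffer starts with FF FE or FE FF."""
    return data[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)


# Value Notepad ORs into the options variable before the call.
requested_tests = 0xFFFFFFFF
statistics_requested = bool(requested_tests & IS_TEXT_UNICODE_STATISTICS)

sample = b"Bush hid the facts"
print("signature present:", has_utf16_signature(sample))
print("statistics test requested:", statistics_requested)
```

With no signature to settle the question and the statistics bit requested, the guess is all that is left.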
This is neither an anti-Bush conspiracy nor a Notepad-specific bug. Any application that calls the IsTextUnicode() API with the 'IS_TEXT_UNICODE_STATISTICS' flag set is vulnerable to this problem. MSDN clearly states that this API is something of a guessing call when the 'IS_TEXT_UNICODE_STATISTICS' flag is set in the options parameter. A wrong guess by IsTextUnicode() tells Notepad to show non-Unicode text as Unicode, and the result is what you saw (if a Chinese language pack was installed on your system, you probably saw a beautiful Chinese quote; alas, you couldn't read it).
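To see what such a wrong guess produces, the sketch below reinterprets the 18 ASCII bytes as nine little-endian 16-bit units. It assumes that "Unicode" in Notepad means UTF-16LE, and it only mimics the misreading; it does not call any Windows API.

```python
raw = "Bush hid the facts".encode("ascii")  # the 18 bytes saved on disk
misread = raw.decode("utf-16-le")           # same bytes read as 16-bit units

# Show which byte pair becomes which code point.
pairs = [raw[i:i + 2] for i in range(0, len(raw), 2)]
for pair, ch in zip(pairs, misread):
    print(pair, "->", f"U+{ord(ch):04X}")

# U+4E00..U+9FFF is the CJK Unified Ideographs block.
all_cjk = all(0x4E00 <= ord(ch) <= 0x9FFF for ch in misread)
print("all in CJK Unified Ideographs block:", all_cjk)
```

Every pair lands in the CJK block, which is why the text shows up as Chinese characters (or as boxes when no suitable font is installed).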
The ultimate solution is to improve the analysis algorithm. I made a quick patch (12KB) just for fun. You can make your own, or patch the file directly with any hex editor using the following file comparison info.
C:\windows\system32>fc /b notepad.exe notepad_patched.exe
Comparing files NOTEPAD.EXE and NOTEPAD_PATCHED.EXE
000064BA: FF FD
Bush facts are now public!
This patch simply tells Notepad to call IsTextUnicode() with all tests except 'IS_TEXT_UNICODE_STATISTICS'.
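The arithmetic behind that one-byte change can be sketched as follows. I am assuming that the byte at file offset 000064BA is the 8-bit immediate of the OR instruction at 010070B7 (the byte values match), and that the CPU sign-extends it to 32 bits; sign_extend_imm8 is a hypothetical helper name.

```python
# Flag value as documented for the Win32 IsTextUnicode() API.
IS_TEXT_UNICODE_STATISTICS = 0x0002


def sign_extend_imm8(value: int) -> int:
    """Sign-extend an 8-bit immediate to 32 bits."""
    return (value | 0xFFFFFF00) if value & 0x80 else value


original_mask = sign_extend_imm8(0xFF)  # byte before the patch
patched_mask = sign_extend_imm8(0xFD)   # byte after the patch

print(hex(original_mask), hex(patched_mask))
print("bits that differ:", hex(original_mask ^ patched_mask))
```

The only bit that differs is 0x2, the statistics flag. Note that OR can only set bits, so after the patch that bit keeps whatever value the pushed ECX left in the stack slot; the sketch shows the intended mask, not a guarantee about the run-time value.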
Probably the MS developers were tempted to leave it as it is, to make Notepad a kind of auto-localized app. Text editors like PSPad and UltraEdit, which don't use IsTextUnicode(), don't allow Bush to hide facts, but they don't properly handle Unicode files that are missing the signature bytes.
Enjoy reverse engineering the statistical analysis and find your own ‘bush facts’.
"Matrix can not lie"
"What are you doing"
“Osamabin laden leading all terrorist”
“Einstein's thought regarding mathematics motivated Dhilung Kirat thinkin mathematics wonderfully amazing languagez”
(Tips for making your own bush facts) ;-)
When the text is grouped into two-character pairs in sequence, make sure that SPACE is always the first character of its pair, i.e. the first word always has an even number of characters and every other word an odd number. Also keep the ASCII values of the characters in each word as dispersed as possible.
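The layout part of this tip is easy to check mechanically. The sketch below uses a hypothetical helper, follows_tip, that only verifies the pairing rule (even total length, every space at an even offset); it does not predict whether IsTextUnicode() will actually be fooled, and it ignores the "dispersed ASCII values" advice.

```python
def follows_tip(phrase: str) -> bool:
    """Hypothetical check: does every space sit first in its two-byte pair?"""
    raw = phrase.encode("ascii")
    if len(raw) % 2:
        return False  # an odd byte count cannot be split into pairs
    # A space at an even offset is the first byte of its pair.
    return all(i % 2 == 0 for i, b in enumerate(raw) if b == 0x20)


for candidate in ("Bush hid the facts",
                  "Matrix can not lie",
                  "What are you doing",
                  "Bush hide the facts"):
    print(candidate, "->", follows_tip(candidate))
```

The first three phrases follow the layout; the last one breaks it because "hide" has an even number of characters, which pushes the next space to an odd offset.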
Got some interesting bush facts?