What's A Mennonite Doing In Manhattan?!: What Language Does Malware Dream In?

PREFACE: I am not an expert on any of the following. I'm merely sharing my ideas and questions, almost stream-of-consciousness style, cuz sometimes when you ask wacky questions, you glean actionable intel. The following is a learn-as-I-go exercise, not definitive data, and perhaps not even 100% correct. It's simply a work in progress that seemed to make sense to release into the wild, all names removed.

The other day a friend hit me up with a link to a video. He's someone I completely admire and one of the smartest guys I've ever met. I had a ton of meetings that day, so I couldn't immediately put on my Beats and check it out. A few hours later, he IMs me: "Check out that video yet?" "No, but I will." "Best presentation I've ever seen, I'm going to buy every single one of his books!" That was coming from someone who recommends at least one article or video a week, so for him to come back like that, I knew I had to stop what I was doing and make it a top priority.

The video ended up being one of the best I’ve seen as well. Right up there with my two long-time favorites: "Lateral Movement" by Harlan Carvey and "Finding Unknown Malware" by Alissa Torres of "Malware can hide but it must run" fame. The audio on "Finding..." has some issues, but I've watched it more times than I can count, and it's worth putting up with the less-than-perfect audio. Alissa actually did another, similar presentation, "Detecting Persistence Mechanisms", but I digress.

So, after watching the video that my friend recommended, I had a conversation with him. After our discussion, I began to think about some things... If organizations are only “watching” their network traffic for English, could they miss something? In other words, if the Chinese, for example, have infiltrated your network, or are attempting to, they may be writing code or binaries that contain Chinese text encoded as UTF-16, where most characters take 16 bits (2 bytes), which is currently not easily or (out of the box) detectable by most sensors.
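To make the 2-byte point concrete for myself, here's a minimal Python sketch, and definitely not a real sensor. The function names `ascii_strings` and `find_keywords`, the sample bytes, and the keyword are all made up for illustration. It shows an ASCII-only "strings" pass walking right past a Chinese string stored as UTF-16LE, while a search for that same keyword encoded a few different ways finds it.

```python
import re

def ascii_strings(blob: bytes, min_len: int = 4) -> list:
    """Rough stand-in for an ASCII-only `strings` pass over raw bytes."""
    return re.findall(rb"[\x20-\x7e]{%d,}" % min_len, blob)

def find_keywords(blob: bytes, keywords,
                  encodings=("utf-8", "utf-16-le", "gb18030")) -> list:
    """Search for each keyword under several encodings.

    Returns (keyword, encoding, byte offset) tuples. The encoding list is
    an assumption; extend it for whatever you expect to see.
    """
    hits = []
    for word in keywords:
        for enc in encodings:
            needle = word.encode(enc)
            start = blob.find(needle)
            while start != -1:
                hits.append((word, enc, start))
                start = blob.find(needle, start + 1)
    return hits

# Made-up sample: a few header-like bytes, an ASCII string, then a
# Chinese phrase ("connect to server") stored as UTF-16LE.
sample = (b"MZ\x90\x00" + b"cmd.exe\x00"
          + "连接服务器".encode("utf-16-le") + b"\x00\x00")

print(ascii_strings(sample))              # the ASCII pass only sees cmd.exe
print(find_keywords(sample, ["服务器"]))   # the keyword search finds the rest
```

The catch is that you have to know which strings to hunt for. Scanning for "anything that looks like Chinese" is much noisier, because ordinary ASCII text read at the wrong byte alignment can decode to valid-looking Chinese characters.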

So then I started to think about all the hundreds of malware samples I’ve looked at in the past year-and-a-half, and I can count on one hand the number of them that had a Chinese signature.

I've also seen artifacts of chats, in English, from unwanted guests already on networks. So would it also make sense to hunt for very specific Chinese language characters or strings of characters?

Not having all of the answers, and again not being an authority on any of this, I “phoned a friend” and ended up sitting down with two of my favorite Mandarin character experts, which of course led to even more questions :( ...

(1) Speaking only about binaries (not isolated strings or chats), if the binaries go undetected, wouldn't they still need to execute as machine code in the end, and if so, wouldn't you see them then?

(2) Based on (1) above, should one perhaps just be filtering on binary headers and looking at just the signatures?

(3) Would another approach be to search the binary source code for Chinese language characters?

What I learned was that the language of the binary is “usually” defined by the resource section. You have the locale ID and/or the language identifier, which tells you the language. For example, Locale 0x0409 is English (United States), and the Chinese locales all share the primary language ID 0x04 (for example 0x0004, 0x7C04, 0x0404, 0x0804). Or, for example, Lang ID 0x09 is English, 0x0A is Spanish, etc. For YARA it would be something like pe.language(0x09) for English or pe.language(0x04) for Chinese.
Other codes: https://msdn.microsoft.com/en-us/library/windows/desktop/dd318693(v=vs.85).aspx
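The locale IDs and the shorter Lang IDs above are related: in a 16-bit Windows language identifier, the low 10 bits are the primary language and the upper 6 bits are the sublanguage (region or script). Here's a small sketch of that split in Python. The helper names and the three-entry lookup table are my own, for illustration only, and the table only covers the languages mentioned above.

```python
# Hypothetical lookup table: only the primary language IDs named in the text.
PRIMARY_LANGUAGES = {0x04: "Chinese", 0x09: "English", 0x0A: "Spanish"}

def split_langid(langid: int) -> tuple:
    """Split a 16-bit language identifier into (primary, sublanguage)."""
    primary = langid & 0x3FF          # low 10 bits
    sublang = (langid >> 10) & 0x3F   # upper 6 bits
    return primary, sublang

def primary_language_name(locale_id: int) -> str:
    """Name the primary language of a locale ID, or 'unknown'."""
    # The language identifier is the low 16 bits of the locale ID.
    primary, _ = split_langid(locale_id & 0xFFFF)
    return PRIMARY_LANGUAGES.get(primary, "unknown")

for locale_id in (0x0409, 0x0004, 0x0404, 0x0804):
    primary, sublang = split_langid(locale_id)
    print(f"0x{locale_id:04X}: primary=0x{primary:02X} "
          f"sublang=0x{sublang:02X} -> {primary_language_name(locale_id)}")
```

So if you only care about "Chinese, any flavor", matching on the primary language ID catches every regional variant in one shot, while the full locale ID lets you tell the variants apart.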

One challenge could be if you have employees who are Chinese, or offices in China, unless your searches are very specific, they could result in multiple false positives. And of course my inquiry isn't really just about China, that's merely one example. From there you could expand your character searches to Arabic, French, Korean, Portuguese, Russian etc.

Yet another trusted contact, someone I often bounce things off of, had previously advised me that using language as an IOC is not going to generate meaningful data, period. So next I sat down with one more person to discuss all of the above, and quite frankly for a sanity check. My takeaways from that meeting were (a) I wasn't crazy, and (b) there's one more possible angle, and it's region-based. For example, malware written in VB may be seen as elementary and frowned upon by higher-caliber threat actors, such as Russian ones, and generally the more difficult programming languages are more respected in those circles. That doesn't mean that malware written in VB isn't from Russia, for example, but maybe it could help narrow your initial search.

Lastly, a little bird told me that if you're going to find any of the above proactively, before the headlines hit, your answer may lie in hunting for behavioral anomalies, machine learning... and a whole heck of a lot of luck! Because, I was reminded, we have to be lucky all the time; they only have to be lucky once.