Trojan Source is a new kind of source code and supply chain attack in which the source code reviewed by a human differs from what the compiler actually processes, so the behaviour of the software does not match what the source code appears to say.
In 2000, a phishing attack targeted PayPal customers by sending them to a website registered at the domain paypaI (the final character is an upper case i, not a lower case L). This was the first widespread homoglyph attack: using characters that look alike to a human reader in order to deceive them. The introduction of Unicode characters into domain names through Internationalised Domain Names, standardised in 2003 and extended to top-level domains in 2010, further expanded the opportunity for confusion. Domain names could now contain letters that look identical to a human but are different characters to a computer (such as the lower case a in the Cyrillic and Latin scripts).
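The difference between such lookalike letters is easy to confirm programmatically. As a minimal sketch in Python (the helper name and the lookalike label are illustrative assumptions, not taken from any real incident), the standard unicodedata module shows that the two letters are distinct code points:

```python
import unicodedata

latin_a = "a"          # U+0061
cyrillic_a = "\u0430"  # U+0430, drawn like a Latin "a" in most fonts


def describe(ch: str) -> str:
    """Return the code point and official Unicode name of one character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


print(describe(latin_a))     # U+0061 LATIN SMALL LETTER A
print(describe(cyrillic_a))  # U+0430 CYRILLIC SMALL LETTER A

# Hypothetical labels: alike on screen, but different strings to a computer.
genuine = "paypal"
lookalike = "p\u0430yp\u0430l"  # both "a" letters replaced with Cyrillic
print(genuine == lookalike)  # False
```

The Cyrillic letter is written as an escape sequence here so that the example itself contains no ambiguous characters.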
Trojan Source is a further development of this style of attack: it uses Unicode control characters to make the source code a human reads on screen significantly different from what the compiler turns into binary code.
Unicode includes a number of control characters that instruct the display renderer how to display the characters that follow. This can include switching between right-to-left and left-to-right reading order, effectively reversing the flow of the text. It is somewhat analogous to the hidden control characters that turn bold or italic text on and off within a word processor document.
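To make this concrete, the sketch below (Python, using a made-up sample string) lists the characters of a short string that contains a right-to-left override. The stored order is what a program sees; a renderer that honours the control character would typically display the trailing letters reversed.

```python
import unicodedata

# Hypothetical sample: "abc", then RIGHT-TO-LEFT OVERRIDE (U+202E), then "def".
# A bidi-aware renderer would typically show this as "abcfed".
sample = "abc\u202edef"


def logical_order(text: str) -> list[str]:
    """Name each character in the order it is stored, not displayed."""
    return [unicodedata.name(ch) for ch in text]


for name in logical_order(sample):
    print(name)

# The override is invisible on screen but still counts as a character.
print(len(sample))  # 7
```

The fourth name printed is RIGHT-TO-LEFT OVERRIDE, sitting between the letters c and d even though nothing is visibly drawn there.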
By carefully combining these Unicode control characters it is possible to reflow and reformat the source code as seen in a text editor in order to change the apparent flow and logic of the program, while the compiler ignores the display directions and processes the code in the order it is stored. As a result, a human reviewing the source code would not see the true flow of the software.
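One way to guard against this is to scan source text for the control characters before review. The following is a minimal sketch in Python, not the researchers' own tooling: the function name, the set name and the sample snippet are assumptions made for illustration, and it relies on the bidirectional classes reported by the standard unicodedata module.

```python
import unicodedata

# Bidirectional classes of the explicit embedding, override and isolate
# control characters (U+202A..U+202E and U+2066..U+2069).
BIDI_CONTROL_CLASSES = {
    "LRE", "RLE", "LRO", "RLO", "PDF", "LRI", "RLI", "FSI", "PDI",
}


def find_bidi_controls(source: str) -> list[tuple[int, int, str]]:
    """Return (line, column, character name) for each bidi control found."""
    hits = []
    for line_no, line in enumerate(source.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if unicodedata.bidirectional(ch) in BIDI_CONTROL_CLASSES:
                hits.append((line_no, col, unicodedata.name(ch)))
    return hits


# Hypothetical snippet with an override hidden inside a comment.
suspicious = "is_admin = False\n# check \u202e later\n"
for hit in find_bidi_controls(suspicious):
    print(hit)  # (2, 9, 'RIGHT-TO-LEFT OVERRIDE')
```

This sketch only flags the explicit directional controls; it does not flag the implicit left-to-right and right-to-left marks, nor homoglyphs, which would need separate checks.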
This is a particular concern as a vector for supply chain attacks on open source projects, because a malicious update to the source code could hide its true behaviour by using Trojan Source techniques.
These techniques were discovered by a pair of researchers at the University of Cambridge, who detail their findings in a paper that includes several examples for different programming languages.