Compris.com GmbH: TextSign and TextHide / SubiText
by Paul Grosse - August 2000
One of the biggest barriers to companies publishing information on the Internet is plagiarism. It is unfortunate that many people can produce an essay, report, thesis or article simply by going to a web site and copying the text straight from the browser, well beyond any definition of fair use, and then claiming it as their own. To illustrate the problem, work carried out at a UK university, using a simple electronic comparison of pieces of written work handed in by students, showed that many of them were largely identical, carrying not only the same wording but also the same typographical and spelling errors. This work was carried out on a known population with an easily identifiable product - their piece of work. Unfortunately, on the Internet, the problem is not so easy to locate.
Being able to identify the source of a piece of written work, whether it is for the purpose of the copyright owner proving rights infringements or an organisation identifying the source of a security leak, has led to a number of innovations over the last few decades. Unfortunately, many of these markings can be removed simply by processing the text in a completely honest way. For example, repositioning text with slight adjustments to particular words' horizontal and vertical alignment or letter spacing may not be particularly visible to the unsuspecting eye, and it may retain the information when transmitted by fax, but it is completely removed when the document is read into a computer using Optical Character Recognition (OCR). Similarly, extra spaces placed at the ends of lines will not carry through the hard copy transformation, and many grammar checkers question multiple spaces in documents if they do happen to stay in electronic form. Clearly, a method of carrying information within the wording of the text itself is required.
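The fragility of the trailing-space approach can be shown with a minimal sketch. The function names and the one-bit-per-line scheme are assumptions made for illustration only, and the print-and-OCR step is simulated simply by discarding trailing whitespace, which is what any process that keeps only the visible characters would do.

```python
def embed_trailing_spaces(lines, bits):
    """Hide one bit per line: a trailing space means 1, no trailing space means 0."""
    return [line.rstrip() + (" " if bit else "") for line, bit in zip(lines, bits)]


def extract_trailing_spaces(lines):
    """Read the bits back from the line endings."""
    return [1 if line.endswith(" ") else 0 for line in lines]


def simulate_print_and_ocr(lines):
    """Stand-in for a hard-copy round trip: only the visible characters survive."""
    return [line.rstrip() for line in lines]


lines = ["The quarterly report is due", "on the last Friday", "of the month."]
bits = [1, 0, 1]

marked = embed_trailing_spaces(lines, bits)
recovered = extract_trailing_spaces(marked)
after_ocr = extract_trailing_spaces(simulate_print_and_ocr(marked))

print(recovered)   # bits read from the untouched electronic copy
print(after_ocr)   # bits read after the simulated hard-copy round trip
```

The hidden bits are intact in the electronic copy but read back as all zeros after the round trip, even though the visible wording has not changed at all.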
Compris.com's technology hides signature information within texts in a way that will allow the message to survive as long as the text itself does, that is to say that the document may have its typeface and point-size changed, be printed out, faxed, OCRed back into electronic form and so on, and still retain any information that was stored within it. It goes without saying that as this technology is capable of storing information regarding the identity of the originator of the document, it must also be capable of storing other data and doing so without any indication that there is any data there.
Modern cryptographic methods are capable of being extremely resistant to attack, whether by cryptanalysis or by brute force, but unfortunately, no matter how difficult the job of breaking into a piece of encrypted data is made, the encrypted data is always instantly recognisable as such. There are vulnerabilities that are based merely on the frequency of transmission of encrypted messages - this being particularly so in intelligence gathering, whether it is for the police, military intelligence or industrial espionage, with the frequency of encrypted communications usually increasing before an event. With clues to extra activity, preparations for pre-emptive action can be made, whether this is raiding premises or announcing a rival product.
Steganography, the art of hiding messages in such a way that it is far from obvious that there is any sort of data hidden in there at all, makes it possible for information to be stored in a plain looking, carrier text without raising any suspicion.
TextSign uses Compris.com's steganographic technology to hide a digital signature within the text itself. The signature can be a short identification code or a more verbose signature detailing the author, publisher, contact details and so on. The advantage of a short signature is that it can be repeated frequently within the watermarked text so that even if just a paragraph or two is plagiarised, it still carries the watermark. As with other watermarking vendors, web spidering technology allows for the location of plagiarised texts and the identification of the original author. The probability of a text producing a watermarked result innocently is calculated and the offender notified of the infringement and possible remedies to the situation, whether that is court action or the payment of extra fees reflecting the extra use.
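How the probability of an innocent match might be estimated can be sketched under a strong simplifying assumption: that each hidden bit in unmarked text behaves like an independent, fair coin flip. This is not a description of how TextSign itself calculates the figure, and the function name and the 32 bit signature length are hypothetical.

```python
from fractions import Fraction


def chance_match_probability(signature_bits, repeats=1):
    """Probability that unmarked text reproduces the signature at every repeat,
    assuming each hidden bit is an independent fair coin flip (an assumption)."""
    return Fraction(1, 2) ** (signature_bits * repeats)


# A hypothetical 32 bit identification code, found once and found three times.
p_single = chance_match_probability(32)
p_triple = chance_match_probability(32, repeats=3)

print(float(p_single))
print(float(p_triple))
```

Under this model, every further repeat of a short signature found in a suspect text multiplies the chance of coincidence by the same small factor, which is why a frequently repeated short code can be more persuasive than a single long one.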
For such a signature to be able to survive the superficial perturbations that so easily eradicate other text watermarking methods, it has to be hidden using characteristics that are incorporated within normal, plain text. Fortunately, modern languages are complex enough in terms of word redundancy and word order to carry information and still make sense. Additional blank spaces may also be used in justified text if the document is to remain in electronic form (although extra spaces and carriage returns are removed by web browsers) and, with immediate and expressive media such as e-mail, typographical errors may be used to convey information.
In short, the fact that words may be substituted for others with equivalent meaning or that word order may be changed and the text still maintains its original meaning allows TextSign to take normal text and incorporate within it a signature that can only be removed by rewriting the work completely. Both TextHide and TextSign will detect watermarks and refuse to remove them.
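The general idea of carrying bits in synonym choices can be illustrated with a toy sketch. It is not Compris.com's algorithm, which is not published here: the synonym groups, the function names and the one-bit-per-word scheme are all assumptions, and punctuation, inflection and word order are ignored.

```python
# Hypothetical two-word synonym groups; index 0 encodes bit 0, index 1 encodes bit 1.
SYNONYM_GROUPS = [
    ("big", "large"),
    ("quick", "fast"),
    ("begin", "start"),
    ("buy", "purchase"),
]

# Map every word to its group and to the bit value it stands for.
WORD_TO_SLOT = {
    word: (group, bit)
    for group in SYNONYM_GROUPS
    for bit, word in enumerate(group)
}


def embed(text, bits):
    """Rewrite the text so that successive synonym choices spell out the bits."""
    remaining = list(bits)
    out = []
    for word in text.split():
        slot = WORD_TO_SLOT.get(word)
        if slot and remaining:
            group, _ = slot
            word = group[remaining.pop(0)]
        out.append(word)
    if remaining:
        raise ValueError("carrier text has too few synonym sites")
    return " ".join(out)


def extract(text, count):
    """Read back the first `count` bits from the synonym choices."""
    found = [WORD_TO_SLOT[w][1] for w in text.split() if w in WORD_TO_SLOT]
    return found[:count]


carrier = "we begin with a big order and a quick delivery before you buy"
marked = embed(carrier, [1, 0, 1, 1])
print(marked)
print(extract(marked, 4))
```

Because the bits live in the words themselves, re-wrapping the marked text or changing its typeface leaves them readable; only rewording the sentence disturbs them.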
SubiText / TextHide
TextHide takes the process of hiding information one step further and, instead of hiding a watermark repeatedly throughout a document, takes text or other data (including digital images), compresses it, encrypts it and hides it instead. In this way, a normal looking piece of text, the sort of text that would pass as harmless straight through a content checking program on a firewall, can carry any information.
The algorithm that controls the substitution, order changing, error insertion and so on of a carrier text allows a distribution of roughly one byte of hidden text in every 10 to 20 bytes of carrier text. If the text is compressed before it is incorporated into the carrier text then the effective ratio of original text to carrier text is even higher. A compression program such as WinZip will give around 60 percent compression (only 40 percent of the file size remains) on a plain text file containing normal written English.
However, the algorithms used in WinZip are optimised for general applications, that is to say that they will compress any type of file equally well, whether it is text, bitmap images, executable files, sound files and so on. If an algorithm is used that is targeted specifically at a particular data type or types then the compression rate can be increased for those types to around 75 percent (only 25 percent of the file size remaining) although the compression will suffer on other types of data. By compressing the secret text before it is incorporated into the carrier text and then compressing the processed carrier text as well, the final file size may only be around 3 times bigger than the original secret text. Whilst this may sound quite inferior to 12.5 percent file size growth for DES, it must be remembered that if the compressed file is unzipped, there is no indication that it contains any information other than what can be read immediately.
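The size arithmetic can be worked through as a rough sketch using only the figures quoted above. It assumes that the one-in-10-to-20 carrier ratio applies to the compressed secret, that the processed carrier text compresses about as well as ordinary English (40 percent remaining), and that encryption adds no significant overhead. The function name is hypothetical.

```python
def final_size_ratio(secret_kept, carrier_bytes_per_hidden_byte, carrier_kept):
    """Size of the compressed carrier relative to the uncompressed secret.

    secret_kept: fraction of the secret left after compression (0.40 or 0.25).
    carrier_bytes_per_hidden_byte: carrier bytes needed per hidden byte (10 to 20).
    carrier_kept: fraction of the carrier left after compressing it (0.40).
    """
    return secret_kept * carrier_bytes_per_hidden_byte * carrier_kept


for label, secret_kept in (("general-purpose", 0.40), ("targeted", 0.25)):
    low = final_size_ratio(secret_kept, 10, 0.40)
    high = final_size_ratio(secret_kept, 20, 0.40)
    print(f"{label}: {low:.1f}x to {high:.1f}x the original secret")
```

Under these assumptions the final file comes out at between roughly 1 and 3.2 times the size of the original secret text, so the figure of around 3 sits towards the less favourable end of the range.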
The synonym dictionary, including which words should be used more or less frequently, is organised with a key that can be up to 100,000 bytes long giving approximately 10^30,000 (a 1 with 30,000 zeros after it) possible arrangements for rephrasing the text. A brute force attack would use up all of the energy in the Universe many times over and with the prospect of there being no data in the carrier text, this is clearly not a good choice of attack.
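As a rough check on the scale of these numbers, the sketch below converts a key length into a power of ten. It assumes every bit of the key is independent and equally likely, which may not reflect how the key actually maps onto dictionary arrangements, and the function name is hypothetical.

```python
import math


def decimal_exponent(key_bits):
    """Return x such that 2**key_bits is approximately 10**x."""
    return key_bits * math.log10(2)


bits_exponent = decimal_exponent(100_000)        # a key of 100,000 bits
bytes_exponent = decimal_exponent(100_000 * 8)   # a key of 100,000 bytes

print(f"100,000 bits  -> about 10^{bits_exponent:,.0f}")
print(f"100,000 bytes -> about 10^{bytes_exponent:,.0f}")
```

On this simple model, a count of about 10 to the power 30,000 corresponds to roughly 100,000 bits of key, while 100,000 fully random bytes would give a far larger figure. Either way, the number is far beyond the reach of exhaustive search.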
In addition to using long keys for hiding the encrypted text, TextHide uses the 256 bit Twofish secret-key block cypher algorithm, designed by cryptography expert Bruce Schneier and a contender for the replacement of DES, for encrypting the body of the original data with the session key encrypted using RSA public key cryptography of up to 4096 bit key-length. Encryption block headers are not used in the same obvious way that they are in overtly encrypted texts so the output from TextHide, if the correct dictionary key is used, will not reveal the presence of an encrypted message to anyone other than the recipient with the correct key. Both keys are required.
The current version of TextHide is in German but the English and French versions are to be released in the last quarter of 2000. These new versions will have a number of extra features such as: allowing several encrypted messages, meant for different recipients, to be incorporated into the same carrier text; the amount of deviation in meaning of synonyms may be specified; all dictionaries (English, French and German) are to be included in the professional edition; grammar correction; and, rules for punctuation. For the Enterprise edition: cryptography and steganography may be integrated in any network environment or as a component in applications; self-generated word lists or synonym groups may be converted into a TextHide compatible format thus making it more difficult for people to break through the steganography; and, the steganographic method itself may be adapted and combined with data encryption and compression.
For many years, the administrators of networks thought that their network and their company's data were safe from attack from the outside because they had a firewall in place. The firewall can stop viruses and other malicious code from entering the company's network by filtering out data that it is programmed to look for. Then some people realised that their companies were vulnerable to the leak of intellectual property from the inside, so content checkers started to look at information flowing out of the company's network - stopping encrypted data and data containing key words or phrases such as "secret" or "process". Steganography hides the encrypted data so that, without the correct keys, a piece of secret message carrying text that is passed through TextHide produces a text string that is not discernibly different from the text string that is produced from a piece of text that had no message. In this way, the recipient of a message cannot have their keys demanded by anybody as there is no proof that there is any sort of message hidden within the text.
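Why a keyword-based content checker has nothing to catch in an ordinary looking carrier text can be seen from a deliberately naive sketch. The block list and function name are hypothetical and no real firewall product is being modelled.

```python
import re

# Hypothetical block list, using the two example words mentioned above.
BLOCKED_WORDS = {"secret", "process"}


def content_check(message):
    """Return True if the outgoing message may pass, False if it is blocked."""
    words = set(re.findall(r"[a-z]+", message.lower()))
    return not (words & BLOCKED_WORDS)


overt = "Attached is the secret process for the new coating."
carrier = "We had a lovely vacation; the weather was fine and the food was good."

print(content_check(overt))
print(content_check(carrier))
```

The checker can only judge the words it sees, so a banal carrier text passes whether or not its word choices happen to encode anything.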
One problem of using this method to carry large quantities of data on a regular basis is that of finding sufficient quantities of ordinary and uninteresting looking carrier text that is not restricted in its own copyright. Many books are available on the Internet but these are usually famous books with well known texts. A chapter of a book, running from around 20 - 40 kBytes may seem ideal as a carrier text but the user of such a text may well have to find an explanation as to what was so wrong with the original version and why they have decided to start chapter one with, "Name me Ishmael". To counter this problem, Compris.com has supplied a large number of carrier texts of a suitably banal nature covering topics such as: vacation; politics; business; jokes; anecdotes; glosses; news and newspaper articles and so on - a collection that can only grow.
Another problem with the interpretation of text is that general English usage has become so bad that often, people don't know the meaning of words anymore. The ironic sentence, "Gerald has made a quantum contribution to the company's knowledge base" may well be considered a compliment by Gerald if he takes the word quantum to have its meaning in current advertising culture. "Appliance" and "Application" have suffered a similar fate along with many other words that have been misappropriated by advertisers, their repercussions making their way into everyday spoken and written language.
In addition to this, by far the largest group of people who use computers are not professional writers and do not speak the way they spell (how many non-writers pronounce the "d" in Wednesday?) and therefore common errors exist such as confusions between: prestige and prestigious; stationery and stationary; course and coarse; affect and effect; and so on. Any software that looks at and attempts to interpret sentences that contain words such as these needs to be aware that if someone said prestigious, they may well have meant prestige and therefore "this company is the most prestigious that I have ever worked for" may not turn out the way that it was intended once the synonym dictionary has been applied.
This technology is significantly more powerful than using lemon juice to write invisible messages on paper; in fact, it could turn the content aspect of data security on its head. Without a doubt, a complete inability to detect whether or not any ordinary looking text contains secret information is of great concern to the people in charge of information security. This steganography product represents both a powerful watermarking tool that can be used to detect plagiarised works from anywhere on the planet and a tool that could either allow secret data to be transferred between different departments of a company or undermine the security of a company.
Although this company is relatively small and new, Thomas Pötter has received awards for his business plan including one from Volkswagen AG and the technology that Compris.com has is extremely powerful and just as easy to use as any other encryption software such as PGP. There are few players in this part of the market and Compris.com is particularly well set up to take full advantage of the situation.
With a background in Computer Science, Neuro-Linguistic Programming and text understanding technology, Thomas Pötter founded Compris.com in 1998 in order to develop and market the award winning SubiText steganography system which was the official highlight of the CeBIT '99 fair.
Situated in Kaiserslautern in Germany, just 30 miles from the French border, Compris.com currently employs four people but is looking to expand significantly with a view to becoming a joint stock company within the next three to five years. Thomas Pötter's business plan has won several prizes including one from Volkswagen AG.
Compris.com's product and service portfolio includes: TextSign - text watermarking by inserting a digital watermark into any text for copyright protection; TextHide / SubiText - text based steganography; Data Encryption - using public key cryptography or secret key cryptography; Densifier - a set of programs that are optimised for specific data formats so that compression is more effective; Placens - a search engine that identifies plagiarised text in documents on the Internet; Reputas - an online shop system for any digital content, whether downloadable texts, parts of texts or software, so that copyright owners are able to specify limitations on rights that are sold on to customers; FactMind - an automatic query answering system that is able to find the appropriate response based upon collections of answer texts; FirmWatch - systematic analysis and monitoring of competitors' web sites so that new products or price variations can form the basis of alerts; InnoMagic - patent and innovation meta search engine for patent related information including relevant venture capital; Reusable language component modules for integration into applications such as encyclopaedias, word inflection, ambiguity resolution and so on; and, Web Spidering.
At the moment, the company is establishing links to strategic partners in the US and the UK and important customers include the publishing house Norman Rentrop Verlag and euroscript GmbH. With a product technology as powerful as that used in SubiText / TextHide and TextSign, a large part of the data security market concerned with content control is effectively emasculated, thus placing Compris.com in a particularly powerful position. Looking at it from the other side, this technology makes it very difficult for people to plagiarise work because of digital watermarking. Without any doubt, Compris.com is one to watch out for in the years to come.
Tel: +49 6301 70 33 40
Fax: +49 6301 70 31 19
Copyright (c) 2000 P. A. Grosse. All Rights Reserved.