A Comprehensive Bibliography of Linguistic Steganography

by Richard Bergmair

This bibliography resides on http://www.semantilog.org/biblingsteg/ and is also available in PDF and BibTeX formats.


Under our interpretation of the term, linguistic steganography excludes techniques that directly operate on typeset or written text as a graphic, or speech as a waveform. It also excludes techniques that rely on specific file formats such as ASCII or HTML.

This bibliography is divided into three sections. The primary bibliography gives pointers into the literature directly dealing with linguistic steganography, the related bibliography contains literature about more remotely connected subjects, and the implemented systems section lists software prototypes.

The primary bibliography is grouped by research collaborations. The order in which they are listed has no particular significance. Within each group, references are sorted in chronological order.

In order to bootstrap a process of community collaboration on keeping this bibliography relevant and up to date, the following material was collected in an extensive and systematic literature search carried out in March 2005 and reiterated in November 2006. The sources that were searched included the English and German web as perceived by Yahoo, MSN Search, Google, Google Scholar, CiteSeer, and the meta search engine Copernic Agent Pro, as well as the computing literature as indexed in the ACM Guide, the ACM Digital Library, and IEEE Xplore.

This bibliography will be updated and maintained at http://www.semantilog.org/biblingsteg/. The maintainer invites the whole community to participate in keeping this resource relevant and up to date, and will gratefully incorporate any suggestions brought to his attention by e-mail; to deter spam, his address is given here in reverse: gro.macu.golitnames at getsgnilbib .

  • If you spot any omissions in this bibliography, please bring them to the maintainer's attention, preferably by emailing a BibTeX file including abstracts.
  • If you are an author or editor publishing new material on linguistic steganography, please bring it to the maintainer's attention, preferably by emailing a BibTeX file including abstracts.
  • If you are maintaining a personal webpage of a researcher in linguistic steganography or any other webpage related to work in linguistic steganography, please send a link to your webpage to the maintainer and add a link to http://www.semantilog.org/biblingsteg/ to your own webpage. This will help search engines produce more relevant results for queries related to linguistic steganography.

PRIMARY BIBLIOGRAPHY

Peter Wayner

Peter Wayner. Mimic functions. Cryptologia, XVI(3):193-214, July 1992. [ bib ]

A mimic function changes a file A so it assumes the statistical properties of another file B. That is, if p(t,A) is the probability of some substring t occurring in A, then a mimic function f recodes A so that p(t,f(A)) approximates p(t,B) for all strings t of length less than some n. This paper describes the algorithm with its functional inverse, Huffman coding. The paper also provides a description of more robust and more general mimic functions which can be defined using context-free grammars and van Wijngaarden grammars.

Keywords: compression; subliminal channels; context-free grammar
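
The mechanism of a regular mimic function can be sketched in a few lines. The sketch below is illustrative, not Wayner's implementation: it builds a Huffman code from a hypothetical unigram word-frequency model, runs the Huffman decoder over the secret bits to emit mimic text, and re-encodes the words to recover the bits. Real mimic functions condition on context rather than a single frequency table.

```python
# Sketch of a first-order mimic function: hidden bits drive a Huffman
# decoder built from a cover model's word frequencies, so the output
# mimics those frequencies. The toy frequency table is hypothetical.
import heapq
from itertools import count

def huffman_codes(freqs):
    """Build a prefix code (word -> bit string) from word frequencies."""
    tie = count()  # tie-breaker so heapq never compares the dict payloads
    heap = [(f, next(tie), {w: ""}) for w, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)
        f2, _, c2 = heapq.heappop(heap)
        merged = {w: "0" + b for w, b in c1.items()}
        merged.update({w: "1" + b for w, b in c2.items()})
        heapq.heappush(heap, (f1 + f2, next(tie), merged))
    return heap[0][2]

def embed(bits, codes):
    """f: run the Huffman *decoder* over the secret bits, emitting words."""
    decode = {b: w for w, b in codes.items()}
    words, buf = [], ""
    for bit in bits:          # bits must end on a codeword boundary here;
        buf += bit            # a real system pads the final codeword
        if buf in decode:
            words.append(decode[buf])
            buf = ""
    return " ".join(words)

def extract(text, codes):
    """f^-1: Huffman-*encode* the mimic text back into the secret bits."""
    return "".join(codes[w] for w in text.split())

codes = huffman_codes({"the": 5, "cat": 2, "sat": 2, "mat": 1})
cover = embed("0111010", codes)            # -> "the cat the sat"
assert extract(cover, codes) == "0111010"
```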

Peter Wayner. Strong theoretical steganography. Cryptologia, XIX(3):285-299, July 1995. [ bib ]

Hiding the existence of a message can be an important technique in this era of terabit networks. One technique for practicing this obfuscation, Mimic Functions, is derived from Context-Free Grammars and can be as secure as inverting RSA or factoring Blum integers. This paper discusses the implications of the result and presents a practical solution for securely hiding information from inspection.

Keywords: Mimic function; natural language processing; RSA
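
The grammar-driven construction can likewise be sketched; the toy grammar below is invented for illustration (Wayner's grammars are far richer, and decodability requires the grammar to be unambiguous). A nonterminal offering 2^k productions consumes k hidden bits when expanded, and parsing the output sentence recovers those choices.

```python
# Toy CFG mimicry: hidden bits select among a nonterminal's productions.
# The grammar below is a hypothetical example, not one from the paper.
GRAMMAR = {
    "S":  [["NP", "VP", "."]],                               # 1 rule: 0 bits
    "NP": [["Alice"], ["Bob"], ["the spy"], ["the clerk"]],  # 4 rules: 2 bits
    "VP": [["waited"], ["left early"]],                      # 2 rules: 1 bit
}

def generate(symbol, bits):
    """Expand `symbol`, consuming bits from the front to choose rules."""
    rules = GRAMMAR.get(symbol)
    if rules is None:                        # terminal: emit it verbatim
        return symbol, bits
    k = (len(rules) - 1).bit_length()        # bits this choice consumes
    idx = int(bits[:k] or "0", 2)
    parts, rest = [], bits[k:]
    for sym in rules[idx]:
        text, rest = generate(sym, rest)
        parts.append(text)
    return " ".join(parts), rest

text, _ = generate("S", "101")
print(text)   # "the spy left early ." -- encodes bits 10 (NP) and 1 (VP)
```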

Purdue CERIAS

Mikhail J. Atallah, Victor Raskin, Michael Crogan, Christian Hempelmann, Florian Kerschbaum, Dina Mohamed, and Sanket Naik. Natural language watermarking: Design, analysis, and a proof-of-concept implementation. In Ira S. Moskowitz, editor, Information Hiding: Fourth International Workshop, volume 2137 of Lecture Notes in Computer Science, pages 185-199. Springer, April 2001. [ bib | .pdf ]

We describe a scheme for watermarking natural language text. Let n denote the total number of sentences of a text, and let α denote the number of sentences that carry watermark bits. The modifications that an adversary can perform (for the purpose of removing the watermark) are as follows: (i) Meaning-preserving transformations of sentences of the text (e.g. translation to another natural language). This cannot damage the watermark. (ii) Meaning-modifying transformations of sentences of the text. Each such transformation has probability ≤ 3α/n of damaging the watermark. (iii) Insertions of new sentences in the text. Each such insertion has probability ≤ 2α/n of damaging the watermark. (iv) Moving a contiguous block of sentences from one place of the text to another. Each block-motion has probability ≤ 3α/n of damaging the watermark.

Our scheme is keyed, and having the key is all that is required for reading the watermark; it does not require knowledge of the original (pre-watermark) version of the text, or knowledge of the watermark message. The probability of a “false positive”, i.e. that the text spuriously contains any particular w-bit watermark, is 2^(-w).

Mikhail J. Atallah, Victor Raskin, Christian F. Hempelmann, Mercan Karahan, Radu Sion, Umut Topkara, and Katrina E. Triezenberg. Natural language watermarking and tamperproofing. In Fabien A. P. Petitcolas, editor, Information Hiding: Fifth International Workshop, volume 2578 of Lecture Notes in Computer Science, pages 196-212. Springer, October 2002. [ bib | .pdf ]

Two main results in the area of information hiding in natural language text are presented. A semantically-based scheme dramatically improves the information hiding capacity of any text through two techniques: (i) modifying the granularity of meaning of individual sentences, whereas our own previous scheme kept the granularity fixed, and (ii) halving the number of sentences affected by the watermark. No longer a “long text, short watermark” approach, it now makes it possible to watermark short texts like wire agency reports. Using both the above-mentioned semantic marking scheme and our previous syntactically-based method hides information in a way that reveals any non-trivial tampering with the text (while re-formatting is not considered to be tampering - the problem would be solved trivially otherwise by hiding a hash of the text) with a probability 1 - 2^(-β(n+1)), n being its number of sentences and β a small positive integer based on the extent of co-referencing.

Krista Bennett. Linguistic steganography: Survey, analysis, and robustness concerns for hiding information in text. Technical Report TR 2004-13, Purdue CERIAS, May 2004. [ bib | .pdf ]

Steganography is an ancient art. With the advent of computers, we have vast accessible bodies of data in which to hide information, and increasingly sophisticated techniques with which to analyze and recover that information. While much of the recent research in steganography has been centered on hiding data in images, many of the solutions that work for images are more complicated when applied to natural language text as a cover medium. Many approaches to steganalysis attempt to detect statistical anomalies in cover data which predict the presence of hidden information. Natural language cover texts must not only pass the statistical muster of automatic analysis, but also the minds of human readers. Linguistically naive approaches to the problem use statistical frequency of letter combinations or random dictionary words to encode information. More sophisticated approaches use context-free grammars to generate syntactically correct cover text which mimics the syntax of natural text. None of these uses meaning as a basis for generation, and little attention is paid to the semantic cohesiveness of a whole text as a data point for statistical attack. This paper provides a basic introduction to steganography and steganalysis, with a particular focus on text steganography. Text-based information hiding techniques are discussed, providing motivation for moving toward linguistic steganography and steganalysis. We highlight some of the problems inherent in text steganography as well as issues with existing solutions, and describe linguistic problems with character-based, lexical, and syntactic approaches. Finally, the paper explores how a semantic and rhetorical generation approach suggests solutions for creating more believable cover texts, presenting some current and future issues in analysis and generation. The paper is intended to be both general enough that linguists without training in information security and computer science can understand the material, and specific enough that the linguistic and computational problems are described in adequate detail to justify the conclusions suggested.

Cuneyt M. Taskiran, Umut Topkara, Mercan Topkara, and Edward J. Delp. Attacks on lexical natural language steganography systems. In Proceedings of the SPIE International Conference on Security, Steganography, and Watermarking of Multimedia Contents, January 2006. [ bib | .pdf ]

Text data forms the largest bulk of digital data that people encounter and exchange daily. For this reason the potential usage of text data as a covert channel for secret communication is an imminent concern. Even though information hiding in natural language text has started to attract great interest, there has been no study on attacks against these applications. In this paper we examine the robustness of lexical steganography systems. We used a universal steganalysis method based on language models and support vector machines to differentiate sentences modified by a lexical steganography algorithm from unmodified sentences. The experimental accuracy of our method on classification of steganographically modified sentences was 84.9 percent. On classification of isolated sentences we obtained a high recall rate whereas the precision was low.
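
The classification setup the authors describe follows a standard supervised pattern, sketched below with scikit-learn; word n-gram tf-idf features stand in for the paper's language-model scores, and the training sentences are invented placeholders.

```python
# Sketch of supervised steganalysis: train an SVM to separate sentences
# altered by lexical substitution from pristine ones. The features and
# the four training sentences below are hypothetical stand-ins.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

unmodified = ["the president arrived on tuesday",
              "markets fell sharply after the report"]
modified   = ["the president reached on tuesday",
              "markets descended sharply after the report"]

clf = make_pipeline(
    TfidfVectorizer(analyzer="word", ngram_range=(1, 2)),
    LinearSVC(),
)
clf.fit(unmodified + modified, [0, 0, 1, 1])
print(clf.predict(["stocks descended sharply after the report"]))  # likely [1]
```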

Mercan Topkara, Giuseppe Riccardi, Dilek Hakkani-Tur, and Mikhail J. Atallah. Natural language watermarking: Challenges in building a practical system. In Proceedings of the SPIE International Conference on Security, Steganography, and Watermarking of Multimedia Contents, January 2006. [ bib | .pdf ]

This paper gives an overview of the research and implementation challenges we encountered in building an end-to-end natural language processing based watermarking system. By natural language watermarking we mean embedding the watermark into a text document, using the natural language components as the carrier, in such a way that the modifications are imperceptible to the readers and the embedded information is robust against possible attacks. Of particular interest is using the structure of the sentences in natural language text in order to insert the watermark. We evaluated the quality of the watermarked text using an objective evaluation metric, the BLEU score, which is commonly used in the statistical machine translation community. Our current system prototype achieves a BLEU score of 0.45 on the [0,1] scale.
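
For readers unfamiliar with the metric, the following snippet illustrates BLEU scoring with NLTK on an invented original/watermarked sentence pair (not data from the paper):

```python
# Scoring a watermarked sentence against the original with BLEU;
# the sentence pair is a made-up example.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

original    = [["the", "treaty", "was", "signed", "in", "geneva"]]
watermarked =  ["the", "accord", "was", "signed", "in", "geneva"]

score = sentence_bleu(original, watermarked,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU on the [0,1] scale: {score:.2f}")
```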

Mercan Topkara, Cuneyt M. Taskiran, and Edward J. Delp. Natural language watermarking. In Edward J. Delp and Ping W. Wong, editors, Proceedings of the SPIE International Conference on Security, Steganography, and Watermarking of Multimedia Contents, volume 5681, January 2005. [ bib | .pdf ]

In this paper we discuss natural language watermarking, which uses the structure of the sentence constituents in natural language text in order to insert a watermark. This approach is different from techniques, collectively referred to as text watermarking, which embed information by modifying the appearance of text elements, such as lines, words, or characters. We provide a survey of the current state of the art in natural language watermarking and introduce terminology, techniques, and tools for text processing. We also examine the parallels and differences of the two watermarking domains and outline how techniques from the image watermarking domain may be applicable to the natural language watermarking domain.

Keywords: text watermarking, natural language processing, text steganography

Mercan Topkara, Umut Topkara, and Mikhail J. Atallah. Words are not enough: Sentence level natural language watermarking. In Proceedings of the ACM Workshop on Content Protection and Security (in conjunction with ACM Multimedia), October 2006. [ bib ]

Mercan Topkara, Umut Topkara, and Mikhail J. Atallah. Information hiding through errors: A confusing approach. In Proceedings of the SPIE International Conference on Security, Steganography, and Watermarking of Multimedia Contents, January 2007. [ bib ]

Umut Topkara, Mercan Topkara, and Mikhail J. Atallah. The hiding virtues of ambiguity: quantifiably resilient watermarking of natural language text through synonym substitutions. In MM&Sec '06: Proceedings of the 8th Workshop on Multimedia and Security, pages 164-174, New York, NY, USA, 2006. ACM Press. [ bib | DOI ]

Information-hiding in natural language text has mainly consisted of carrying out approximately meaning-preserving modifications on the given cover text until it encodes the intended mark. A major technique for doing so has been synonym-substitution. In these previous schemes, synonym substitutions were done until the text confessed, i.e., carried the intended mark message. We propose here a better way to use synonym substitution, one that is no longer entirely guided by the mark-insertion process: It is also guided by a resilience requirement, subject to a maximum allowed distortion constraint. Previous schemes for information hiding in natural language text did not use numeric quantification of the distortions introduced by transformations, they mainly used heuristic measures of quality based on conformity to a language model (and not in reference to the original cover text). When there are many alternatives to carry out a substitution on a word, we prioritize these alternatives according to a quantitative resilience criterion and use them in that order. In a nutshell, we favor the more ambiguous alternatives. In fact not only do we attempt to achieve the maximum ambiguity, but we want to simultaneously be as close as possible to the above-mentioned distortion limit, as that prevents the adversary from doing further transformations without exceeding the damage threshold; that is, we continue to modify the document even after the text has confessed to the mark, for the dual purpose of maximizing ambiguity while deliberately getting as close as possible to the distortion limit. The quantification we use makes possible an application of the existing information-theoretic framework, to the natural language domain, which has unique challenges not present in the image or audio domains. The resilience stems from both (i) the fact that the adversary does not know where the changes were made, and (ii) the fact that automated disambiguation is a major difficulty faced by any natural language processing system (what is bad news for the natural language processing area, is good news for our scheme's resilience). In addition to the above mentioned design and analysis, another contribution of this paper is the description of the implementation of the scheme and of the experimental data obtained.
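
Stripped of the resilience and distortion machinery, the underlying synonym-substitution primitive is easy to sketch. The toy synonym table below is a hypothetical miniature of the WordNet-scale resources such systems rely on:

```python
# Bare-bones synonym-substitution embedding: each word that belongs to a
# two-member synonym set carries one bit, namely its index in the set.
SYNSETS = {
    "big": ["big", "large"], "large": ["big", "large"],
    "fast": ["fast", "quick"], "quick": ["fast", "quick"],
}

def embed(words, bits):
    out, queue = list(words), list(bits)
    for i, w in enumerate(out):
        if queue and w in SYNSETS:
            out[i] = SYNSETS[w][int(queue.pop(0))]  # pick member 0 or 1
    return out

def extract(words):
    return "".join(str(SYNSETS[w].index(w)) for w in words if w in SYNSETS)

stego = embed("a big dog ran fast".split(), "10")
assert extract(stego) == "10"            # -> "a large dog ran fast"
```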

Grothoff et al.

Christian Grothoff, Krista Grothoff, Ludmila Alkhutova, Ryan Stutsman, and Mikhail Atallah. Translation-based steganography. Technical Report TR 2005-39, Purdue CERIAS, 2005. [ bib | .pdf ]

This paper investigates the possibilities of steganographically embedding information in the “noise” created by automatic translation of natural language documents. Because the inherent redundancy of natural language creates plenty of room for variation in translation, machine translation is ideal for steganographic applications. Also, because there are frequent errors in legitimate automatic text translations, additional errors inserted by an information hiding mechanism are plausibly undetectable and would appear to be part of the normal noise associated with translation. Significantly, it should be extremely difficult for an adversary to determine if inaccuracies in the translation are caused by the use of steganography or by deficiencies of the translation software.

Christian Grothoff, Krista Grothoff, Ludmila Alkhutova, Ryan Stutsman, and Mikhail Atallah. Translation-based steganography. In Proceedings of Information Hiding Workshop (IH 2005), pages 213-233. Springer, 2005. [ bib | .pdf ]

This paper investigates the possibilities of steganographically embedding information in the “noise” created by automatic translation of natural language documents. Because the inherent redundancy of natural language creates plenty of room for variation in translation, machine translation is ideal for steganographic applications. Also, because there are frequent errors in legitimate automatic text translations, additional errors inserted by an information hiding mechanism are plausibly undetectable and would appear to be part of the normal noise associated with translation. Significantly, it should be extremely difficult for an adversary to determine if inaccuracies in the translation are caused by the use of steganography or by deficiencies of the translation software.

Ryan Stutsman, Mikhail Atallah, Christian Grothoff, and Krista Grothoff. Lost in just the translation. In Proceedings of the 21st Annual ACM Symposium on Applied Computing (SAC 2006), April 2006. [ bib | .pdf ]

This paper describes the design and implementation of a scheme for hiding information in translated natural language text, and presents experimental results using the implemented system. Unlike the previous work, which required the presence of both the source and the translation, the protocol presented in this paper requires only the translated text for recovering the hidden message. This is a significant improvement, as transmitting the source text was both wasteful of resources and less secure. The security of the system is now improved not only because the source text is no longer available to the adversary, but also because a broader repertoire of defenses (such as mixing human and machine translation) can now be used.
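
The flavour of such a headerless protocol can be conveyed in a short sketch. Word-count parity below is a deliberately crude stand-in for the keyed hash a real system would use, and the candidate translations are invented; a real protocol must also handle sentences where no candidate carries the wanted bit.

```python
# Toy headerless embedding in translation choice: pick, per sentence, a
# candidate translation whose feature bit equals the next hidden bit;
# the receiver recomputes the bit from the stego text alone.

def sentence_bit(sentence):
    return len(sentence.split()) % 2     # stand-in for a keyed hash bit

def embed(candidate_lists, bits):
    return [next(c for c in cands if sentence_bit(c) == b)
            for cands, b in zip(candidate_lists, bits)]

def extract(sentences):
    return [sentence_bit(s) for s in sentences]

candidates = [
    ["He arrived late.", "He got there late.", "He was late to arrive."],
    ["It often rains here.", "Rain is frequent here."],
]
stego = embed(candidates, [1, 0])
assert extract(stego) == [1, 0]          # recovered without the source text
```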

Chapman et al.

Mark T. Chapman. Hiding the hidden: A software system for concealing ciphertext as innocuous text. Master's thesis, University of Wisconsin-Milwaukee, May 1997. [ bib | .ps ]

In this thesis we present a system for protecting the privacy of cryptograms to avoid detection by censors. The system transforms ciphertext into innocuous text which can be transformed back into the original ciphertext. The expandable set of tools allows experimentation with custom dictionaries, automatic simulation of writing style, and the use of Context-Free Grammars to control text generation.

Keywords: ciphertext, privacy, information hiding

Mark T. Chapman and George I. Davida. Hiding the hidden: A software system for concealing ciphertext as innocuous text. In Yongfei Han, Tatsuaki Okamoto, and Sihan Qing, editors, Information and Communications Security: First International Conference, volume 1334 of Lecture Notes in Computer Science. Springer, August 1997. [ bib | .ps ]

In this paper we present a system for protecting the privacy of cryptograms to avoid detection by censors. The system transforms ciphertext into innocuous text which can be transformed back into the original ciphertext. The expandable set of tools allows experimentation with custom dictionaries, automatic simulation of writing style, and the use of Context-Free Grammars to control text generation.

Mark T. Chapman and George I. Davida. Plausible deniability using automated linguistic steganography. In George I. Davida and Yair Frankel, editors, Infrastructure Security: International Conference, volume 2437 of Lecture Notes in Computer Science, pages 276-287. Springer, October 2002. [ bib | .ps ]

Information hiding has several applications, one of which is to hide the use of cryptography. The Nicetext system introduced a method for hiding cryptographic information by converting cryptographic strings (random-looking) into nice text (namely innocuous looking). The system retains the ability to recover the original ciphertext from the generated text. Nicetext can hide both plaintext and cryptographic text.

The purpose of such transformations is to mask ciphertext from anyone who wants to detect or censor encrypted communication, such as a corporation that may monitor, or censor, its employees' private mail. Even if the message is identified as the output of Nicetext, the sender might claim that the input was simply a pseudo-random number source rather than ciphertext.

This paper extends the Nicetext protocol to enable deniable cryptography/messaging using the concepts of plausible deniability [1]. Deniability is derived from the fact that even if one is forced to reveal a key to the random string that nice text reverts to, the real cryptographic/plaintext messages may be stored within additional required sources of randomness in the extended protocol.

Mark T. Chapman, George I. Davida, and Marc Rennhard. A practical and effective approach to large-scale automated linguistic steganography. In George I. Davida and Yair Frankel, editors, Information Security: Fourth International Conference, volume 2200 of Lecture Notes in Computer Science, page 156ff. Springer, October 2001. [ bib | .pdf ]

Several automated techniques exist to transform ciphertext into text that looks like natural-language text while retaining the ability to recover the original ciphertext. This transformation changes the ciphertext so that it doesn't attract undue attention from, for example, attackers or agencies or organizations that might want to detect or censor encrypted communication. Although it is relatively easy to generate a small sample of quality text, it is challenging to be able to generate large texts that are meaningful to a human reader and which appear innocuous.

This paper expands on a previous approach that used sentence models and large dictionaries of words classified by part-of-speech. By using an extensible contextual template approach combined with a synonym-based replacement strategy, much more realistic text is generated than was possible with NICETEXT.
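
The dictionary-and-template mechanism that NICETEXT pioneered can be illustrated in miniature; the word tables and the single template below are invented, whereas NICETEXT uses large part-of-speech dictionaries and style-simulating template sets.

```python
# Ciphertext bits select words from part-of-speech tables slotted into a
# sentence template; re-reading the slots inverts the mapping.
TABLES = {
    "NOUN": ["analyst", "auditor", "banker", "broker"],   # 4 words: 2 bits
    "VERB": ["approved", "reviewed"],                     # 2 words: 1 bit
}
TEMPLATE = ["NOUN", "VERB", "NOUN"]

def embed(bits):
    words, pos = [], 0
    for slot in TEMPLATE:
        table = TABLES[slot]
        k = (len(table) - 1).bit_length()
        words.append(table[int(bits[pos:pos + k], 2)])
        pos += k
    return "The {} {} the {}.".format(*words)

def extract(sentence):
    tokens = sentence.rstrip(".").split()
    picks = [tokens[1], tokens[2], tokens[4]]   # words in the filled slots
    bits = ""
    for slot, w in zip(TEMPLATE, picks):
        table = TABLES[slot]
        k = (len(table) - 1).bit_length()
        bits += format(table.index(w), "0{}b".format(k))
    return bits

assert extract(embed("10110")) == "10110"  # "The banker reviewed the banker."
```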

Richard Bergmair

Richard Bergmair. Natural language steganography and an “AI-complete” security primitive. Talk held at the 21st Chaos Communication Congress, December 2004. [ bib ]

There is no question that we have a long way to go before we can encode our favourite MP3 files as T-shirt slogans and distribute them by wearing them on the streets, with the music industry unable to prove that anything like an information exchange is taking place; but hopefully this article shows why research in natural language steganography is worth the effort. Some major ideas from steganography and computational linguistics are introduced and it is shown how they can be drawn together for security purposes. We present our technique of content-aware linguistic steganography, which is based on the general idea of using “AI-complete” problems as security primitives, and hope to inspire the hacker community to come up with new creative security technologies.

Richard Bergmair. Towards linguistic steganography: A systematic investigation of approaches, systems, and issues. Final year thesis, April 2004. Submitted in partial fulfillment of the requirements for the degree of “B.Sc. (Hons.) in Computer Studies” to the University of Derby. [ bib | .ps.gz ]

Steganographic systems provide a secure medium to covertly transmit information in the presence of an arbitrator. In linguistic steganography, in particular, machine-readable data is to be encoded to innocuous natural language text, thereby providing security against any arbitrator tolerating natural language as a communication medium.

So far, there has been no systematic literature available on this topic, a gap the present report attempts to fill. This report presents necessary background information from steganography and from natural language processing. A detailed description is given of the systems built so far. The ideas and approaches they are based on are systematically presented. Objectives for the functionality of natural language stegosystems are proposed and design considerations for their construction and evaluation are given. Based on these principles current systems are compared and evaluated.

A coding scheme that provides for some degree of security and robustness is described and approaches towards generating steganograms that are more adequate, from a linguistic point of view, than any of the systems built so far, are outlined.

Bolshakov et al.

Igor A. Bolshakov. A method of linguistic steganography based on collocationally-verified synonymy. In Jessica J. Fridrich, editor, Information Hiding: 6th International Workshop, volume 3200 of Lecture Notes in Computer Science, pages 180-191. Springer, May 2004. [ bib | DOI ]

A method is proposed for the automatic concealment of digital information in rather long orthographically and semantically correct texts. The method does not change the meaning of the source text; it only replaces some words by their synonyms. Groups of absolute synonyms are used in a context-independent manner, while groups of relative synonyms are tested beforehand for semantic compatibility with the collocations containing the word to be replaced. A specific replacement is determined by the hidden information. The collocations are syntactically connected and semantically compatible pairs of content words; they are massively gathered beforehand, with a wide diversity in their stability and idiomaticity. Thus the necessary linguistic resources are a specific synonymy dictionary and a very large database of collocations. The steganographic algorithm is informally outlined. An example of hiding binary information in a Russian text fragment is manually traced, with a rough evaluation of the steganographic bandwidth.

Hiram Calvo and Igor A. Bolshakov. Using selectional preferences for extending a synonymous paraphrasing method in steganography. In J. H. Sossa Azuela, editor, Avances en Ciencias de la Computacion e Ingenieria de Computo - CIC'2004: XIII Congreso Internacional de Computacion, pages 231-242, October 2004. [ bib ]

Linguistic steganography allows hiding information in a text. The resulting text must be grammatically correct and semantically coherent to be unsuspicious. Among several methods of linguistic steganography we adhere to previous approaches which use synonymous paraphrasing, i.e., substituting content words by their equivalents. Context must be considered to avoid substitutions that break coherence (for example spicy dog instead of hot dog). We base our method on previous work in linguistic steganography that uses collocations for verifying context. We propose using selectional preferences instead of collocations because selectional preferences can be collected automatically from large corpora in a reliable manner, thus allowing our method to be applied to any language. The steganographic algorithm is informally outlined and an example of hiding binary information in a Spanish text fragment is presented, with a rough evaluation of the ratio of hidden information size to the necessary size of the original text.
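
Both entries in this group rest on the same contextual filter: a synonym is admitted as a substitution alternative only if it remains compatible with its neighbours, so that substitutions like spicy dog for hot dog are rejected. A tiny sketch with hypothetical data:

```python
# Only synonyms that still form an attested collocation with a
# neighbouring word are offered as substitution alternatives. The
# collocation set and synonym groups are hypothetical stand-ins for
# the large resources these papers assume.
COLLOCATIONS = {("hot", "dog"), ("hot", "curry"), ("spicy", "curry")}
SYNONYMS = {"hot": ["hot", "spicy"], "spicy": ["hot", "spicy"]}

def safe_alternatives(word, next_word):
    """Synonyms of `word` that collocate with the word that follows it."""
    return [s for s in SYNONYMS.get(word, [word])
            if (s, next_word) in COLLOCATIONS]

print(safe_alternatives("hot", "dog"))    # ['hot']: no bit can be hidden here
print(safe_alternatives("hot", "curry"))  # ['hot', 'spicy']: one bit of capacity
```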

Murphy et al.

B. Murphy. Syntactic information hiding in plain text. Master's thesis, Department of Computer Science, Trinity College Dublin, 2001. [ bib ]

B. Murphy and C. Vogel. Statistically constrained shallow text marking: techniques, evaluation paradigm, and results. In Proceedings of the SPIE International Conference on Security, Steganography, and Watermarking of Multimedia Contents, January 2007. [ bib ]

B. Murphy and C. Vogel. The syntax of concealment: reliable methods for plain text information hiding. In Proceedings of the SPIE International Conference on Security, Steganography, and Watermarking of Multimedia Contents, January 2007. [ bib ]

Meral et al.

H. M. Meral, B. Sankur, and A. S. Ozsoy. Watermarking tools for Turkish texts. In Proceedings of the 14th IEEE Conference on Signal Processing and Communications Applications, pages 1-4. IEEE, April 2006. [ bib | DOI ]

Text watermarking is a recent subject of natural language processing aimed at the content security and authentication of text documents. This study explores possible text watermarking tools for the Turkish language. Various watermarking tools, such as changes to morphological and syntactic structures and the swapping of synonyms and punctuation, are investigated and their relative performance is measured.

H. M. Meral, B. Sankur, and S. Ozsoy. Syntactic tools for natural language watermarking. In Proceedings of the SPIE International Conference on Security, Steganography, and Watermarking of Multimedia Contents, January 2007. [ bib ]

Others

V. Chand and C. O. Orgun. Exploiting linguistic features in lexical steganography: Design and proof-of-concept implementation. In Proceedings of the 39th Annual Hawaii International Conference on System Sciences (HICSS '06), volume 6, page 126b. IEEE, January 2006. [ bib | DOI ]

This paper develops a linguistically robust encryption, LUNABEL, which converts a message into semantically innocuous text. Drawing upon linguistic criteria, LUNABEL uses word replacement, with substitution classes based on traditional word replacement features (syntactic categories and sub-categories), as well as features under-exploited in earlier works: semantic criteria, graphotactic structure, inflectional class and frequency statistics. The original message is further hidden through the use of cover texts — within these, LUNABEL retains all function words and targets specific classes of content words for replacement, creating text which preserves the syntactic structure and semantic context of the original cover text. LUNABEL takes advantage of cover text styles which are not expected to be necessarily comprehensible to the general public, making any semantic anomalies more opaque. This line of work has the promise of creating encrypted texts which are less detectable than earlier steganographic efforts.

Yuei-Lin Chiang, Lu-Ping Chang, Wen-Tai Hsieh, and Wen-Chih Chen. Natural language watermarking using semantic substitution for Chinese text. In Ton Kalker, Ingemar J. Cox, and Yong Man Ro, editors, Digital Watermarking: Second International Workshop, IWDW 2003, volume 2939 of Lecture Notes in Computer Science, pages 129-140. Springer, October 2003. [ bib | DOI ]

Numerous schemes have been designed for watermarking multimedia contents. Many of these schemes are vulnerable to watermark erasing attacks. Naturally, such methods are ineffective on text unless the text is represented as a bitmap image, but in that case, the watermark can be erased easily by using Optical Character Recognition (OCR) to change the representation of the text from a bitmap to ASCII or EBCDIC. This study attempts to develop a method for embedding a watermark in text that is as successful as the frequency-domain methods have been for images and audio. The novel method embeds the watermark in the original text, creating ciphertext which preserves the meaning of the original text via various semantic replacements.

B. Macq and O. Vybornova. A method of text watermarking using presuppositions. In Proceedings of the SPIE International Conference on Security, Steganography, and Watermarking of Multimedia Contents, January 2007. [ bib ]

Hiroshi Nakagawa, Kouji Sampei, Tsutomu Matsumoto, Shuji Kawaguchi, Kyoto Makino, and Ichiro Murase. Text information hiding with preserved meaning - a case for Japanese documents. IPSJ Transactions, 42(9):2339-2350, 2001. Originally published in Japanese. A similar paper was disseminated by the first author in English and is kept available for download from http://www.r.dl.itc.u-tokyo.ac.jp/~nakagawa/academic-res/finpri02.pdf. [ bib ]

Digital fingerprinting is being paid growing attention as a technology for resolving copyright problems. Previously, researchers have only been interested in image-based digital fingerprinting, where secret information is hidden in images, as opposed to the method we put forward herein, which uses text. It is based on a paraphrasing method that is supposed to preserve the meaning of the original contents. We experimentally evaluated the proposed method with Japanese manuals and user agreement forms of software, and found that the paraphrased text preserves the meaning of the original contents and closely mimics natural language.

Michiharu Niimi, Sayaka Minewaki, Hideki Noda, and Eiji Kawaguchi. A framework of text-based steganography using the SD-Form semantics model. IPSJ Journal, 44(8), August 2003. [ bib | .pdf ]

This paper describes a framework of text-based steganography that takes the meaning of natural language sentences into consideration. To deal with the meaning of sentences, the method uses the SD-Form Semantics Model developed by the authors. In the model, sentences are described by a form named SD-Form, and each SD-Form is assigned an amount of semantic information. This amount of meaning is used to carry secret information in text data: in embedding, sentences are transformed to SD-Forms and the amount of semantic information of the SD-Forms is decreased or increased to coincide with the value of the secret information. We show methods to decrease or increase the amount of meaning of SD-Forms.

M. Hassan Shirali-Shahreza and Mohammad Shirali-Shahreza. A new approach to Persian/Arabic text steganography. In Proceedings of the 5th IEEE/ACIS International Conference on Computer and Information Science, pages 310-315, Washington, DC, USA, 2006. IEEE Computer Society. [ bib | DOI ]

Conveying information secretly and establishing hidden relationships have long been of interest. Text documents have been in wide use for a very long time, and many methods of hiding information in text (text steganography) have accordingly been devised. In this paper we introduce a new approach for steganography in Persian and Arabic texts. Exploiting the large number of points (dots) in Persian and Arabic letters, this approach hides information in a text by vertically displacing those points. The approach can be categorized under feature coding methods and can also be used for Persian/Arabic watermarking. Our method has been implemented in the Java programming language.

Xingming Sun, Gang Luo, and Huajun Huang. Component-based digital watermarking of Chinese texts. In InfoSecu '04: Proceedings of the 3rd international conference on Information security, pages 76-81. ACM Press, 2004. [ bib | DOI ]

According to the types of the host media, digital watermarking may be classified mainly as image watermarking, video watermarking, audio watermarking, and text watermarking. The principles of the first three research fields are similar in that they make use of the redundant information of their host media and the characteristics of the human visual or auditory system. Unfortunately, text has no such redundant information, so text watermarking techniques are quite different, and it is very difficult for a text watermarking algorithm to satisfy the requirements of transparency and robustness. In this paper, a novel text watermarking algorithm based on the idea of the mathematical expression of characters is presented. Since watermarking signals are embedded into Chinese characters that can be divided into left and right components, the algorithm is entirely content-based. Therefore, it breaks through the difficulties of text watermarking. Experiments also show that the component-based text watermarking technique is relatively robust and transparent. It will play an important role in protecting the security of Chinese documents on the Internet.

Adam J. Tenenbaum. Linguistic steganography: Passing covert data using text-based mimicry. Final year thesis, April 2002. Submitted in partial fulfillment of the requirements for the degree of “Bachelor of Applied Science” to the University of Toronto. [ bib | .pdf ]

The goal of linguistic steganography systems is to transmit a secret message over an open communication channel while concealing the presence of the secret message altogether. The secret message is hidden by encoding its bits within a “cover” message that mimics natural language. Existing text mimicry algorithms are flawed in that there exists a tradeoff between the quality of the output text and the resources required to manually design an appropriate grammar for the content of the cover message.

In Peter Wayner's basic mimicry algorithm, the system learns from frequency analysis of a “training source” in order to attempt to mimic the source. This thesis improves upon Wayner's algorithm by changing the “atom” in frequency analysis from a single character to a single word. The resulting linguistic steganography algorithm generates a cover text that more closely resembles the style of the training source but also mimics the grammar of the source text in a dynamic, automated fashion.

Ozlem Uzuner. Natural language processing with linguistic information for digital fingerprinting and watermarking. In Proceedings of the SPIE International Conference on Security, Steganography, and Watermarking of Multimedia Contents, January 2006. [ bib ]

Keith Winstein. Lexical steganography through adaptive modulation of the word choice hash, January 1999. Written while the author was a secondary-school student at the Illinois Mathematics and Science Academy; the paper won third prize in the 2000 Intel Science Talent Search. [ bib | .ps ]

Steganography provides for the embedding of information in a block of host data in conditions where perceptible modification of the host data is intolerable. Steganographic techniques are highly dependent on the character of the host data; a technique for embedding information in images might make subtle changes in hue, while a method for embedding information in audio data could exploit the limitations of the human ear by encoding the encapsulated information in inaudible frequency ranges. Current implementations of textual steganography exploit tolerances in typesetting by making minute changes in line placement and kerning in order to encapsulate hidden information, making them vulnerable to simple retypesetting attacks. This paper defines a framework for lexical steganography and discusses the details of an implementation.

RELATED BIBLIOGRAPHY

Mikhail J. Atallah, Craig J. McDonough, Victor Raskin, and Sergei Nirenburg. Natural language processing for information assurance and security: an overview and implementations. In Mary Ellen Zurko and Steven J. Greenwald, editors, NSPW '00: Proceedings of the 2000 workshop on New security paradigms, pages 51-65. ACM Press, September 2000. [ bib | DOI ]

The paper introduces and advocates an ontological semantic approach to information security. Both the approach and its resources, the ontology and lexicons, are borrowed from the field of natural language processing and adjusted to the needs of the new domain. The approach pursues the ultimate dual goals of inclusion of natural language data sources as an integral part of the overall data sources in information security applications, and formal specification of the information security community know-how for the support of routine and time-efficient measures to prevent and counteract computer attacks. As the first order of the day, the approach is seen by the information security community as a powerful means to organize and unify the terminology and nomenclature of the field.

Richard Bergmair and Stefan Katzenbeisser. Towards human interactive proofs in the text-domain. In Kan Zhang and Yuliang Zheng, editors, Proceedings of the 7th Information Security Conference, volume 3225 of Lecture Notes in Computer Science, pages 257-267. Springer Verlag, September 2004. [ bib | .ps.gz ]

We outline the linguistic problem of word-sense ambiguity and demonstrate its relevance to current computer security applications in the context of Human Interactive Proofs (HIPs). Such proofs enable a machine to automatically determine whether it is interacting with another machine or a human. HIPs were recently proposed to fight abuse of web services, denial-of-service attacks and spam. We describe the construction of an HIP that relies solely on natural language and draws its security from the problem of word-sense ambiguity, i.e., the linguistic phenomenon that a word can have different meanings dependent on the context it is used in.

Richard Bergmair and Stefan Katzenbeisser. Content-aware steganography: About lazy prisoners and narrow-minded wardens. Technical Report fki-252-05, Technische Universität München, Institut für Informatik AI/Cognition Group, December 2005. [ bib | .pdf ]

We introduce content-aware steganography as a new paradigm of steganography stemming from a shift in perspectives towards the objects of steganography. In particular, we abandon the point of view that steganographic objects can be considered pieces of data, suggesting that they should rather be considered pieces of information. We provide some evidence to suggest that this shift in perspectives is in fact necessary, and pinpoint a semantic problem that has not received sufficient attention in the past. We also propose a solution to this problem, by putting forward a new kind of steganography that employs human interactive proofs as a security primitive.

Richard Bergmair and Stefan Katzenbeisser. Content-aware steganography: About lazy prisoners and narrow-minded wardens. In Proceedings of the 8th Information Hiding Workshop, Lecture Notes in Computer Science. Springer Verlag, 2006. in print. [ bib ]

We introduce content-aware steganography as a new paradigm of steganography stemming from a shift in perspectives towards the objects of steganography. In particular, we abandon the point of view that steganographic objects can be considered pieces of data, suggesting that they should rather be considered pieces of information. We provide some evidence to suggest that this shift in perspectives is in fact necessary, and pinpoint a semantic problem that has not received sufficient attention in the past. We also propose a solution to this problem, by putting forward a new kind of steganography that employs human interactive proofs as a security primitive.

Igor A. Bolshakov and Alexander Gelbukh. Synonymous paraphrasing using WordNet and Internet. In Farid Meziane and Elisabeth Metais, editors, Natural Language Processing and Information Systems: 9th International Conference on Applications of Natural Language to Information Systems, NLDB 2004, volume 3136 of Lecture Notes in Computer Science, pages 312-323. Springer, June 2004. [ bib | DOI ]

We propose a method of synonymous paraphrasing of a text based on WordNet synonymy data and Internet statistics of stable word combinations (collocations). Given a text, we look for words or expressions in it for which WordNet provides synonyms, and substitute them with such synonyms only if the latter form valid collocations with the surrounding words according to the statistics gathered from Internet. We present two important applications of such synonymous paraphrasing: (1) style-checking and correction: automatic evaluation and computer-aided improvement of writing style with regard to various aspects (increasing vs. decreasing synonymous variation, conformistic vs. individualistic selection of synonyms, etc.) and (2) steganography: hiding of additional information in the text by special selection of synonyms. A basic interactive algorithm of style improvement is outlined and an example of its application to editing of newswire text fragment in English is traced. Algorithms of style evaluation and information hiding are also proposed.

Victor Raskin, Christian F. Hempelmann, Katrina E. Triezenberg, and Sergei Nirenburg. Ontology in information security: a useful theoretical foundation and methodological tool. In Victor Raskin and Steven J. Greenwald, editors, NSPW '01: Proceedings of the 2001 workshop on New security paradigms, pages 53-59. ACM Press, September 2001. [ bib | DOI | .pdf ]

The paper introduces and advocates an ontological semantic approach to information security. Both the approach and its resources, the ontology and lexicons, are borrowed from the field of natural language processing and adjusted to the needs of the new domain. The approach pursues the ultimate dual goals of inclusion of natural language data sources as an integral part of the overall data sources in information security applications, and formal specification of the information security community know-how for the support of routine and time-efficient measures to prevent and counteract computer attacks. As the first order of the day, the approach is seen by the information security community as a powerful means to organize and unify the terminology and nomenclature of the field.

Peter Wayner. Disappearing Cryptography - Information Hiding: Steganography & Watermarking. Morgan Kaufmann Publishers, Los Altos, CA 94022, USA, second edition, 2002. Chapters 6 and 7 serve as good introductions to mimic functions. [ bib ]

Disappearing Cryptography, Second Edition describes how to take words, sounds, or images and hide them in digital data so that they look like other words, sounds, or images. When used properly, this powerful technique makes it almost impossible to trace the author or the recipient of a message. Conversations can be submerged in the flow of information through the Internet so that no one can know if a conversation exists at all.

This full revision of the best-selling first edition describes a number of different techniques to hide information. These techniques include encryption (making data incomprehensible), steganography (embedding information into video, audio, or graphic files), watermarking (hiding data in the noise of image or sound files), mimicry (dressing up data and making it appear to be other data), and others. This second edition also includes an expanded discussion on hiding information with spread-spectrum algorithms, shuffling tricks, and synthetic worlds. Each chapter is divided into sections, first providing an introduction and high-level summary for those who want to understand the concepts without wading through technical explanations, and then presenting greater detail for those who want to write their own programs.

IMPLEMENTED SYSTEMS

Mark T. Chapman and George I. Davida. Nicetext. Website, August 1997. http://www.nicetext.com/, accessed 2005-03-09. [ bib | http ]

NICETEXT is a package that converts any file into pseudo-natural-language text. It also has the ability to recover the original file from the text! The expandable set of tools allows experimentation with custom dictionaries, automatic simulation of writing style, and the use of Context-Free-Grammars to control text generation.

Texthide. Website. http://www.texthide.com/, accessed 2005-03-20. [ bib | http ]

Compris Intelligence GmbH has developed a program which can automatically reformulate text. Nevertheless, the meaning of the text is retained completely. This feature can also be used to hide data in normal text. The technical term for this is steganography.

Textsign. Website. Accessed 2005-03-20. [ bib ]

Scanning, speech recognition, Internet downloading, intelligent text processing systems: text processing is becoming increasingly simple, but writing good texts remains difficult. With TextMark you protect your intellectual property. This innovation distinguishes itself by its broad applicability to all kinds of textual documents and its tamper-proof characteristics.

Steven E. Hugg. Stegparty. Website, November 1999. http://www.fasterlight.com/hugg/projects/stegparty.html, accessed 2005-03-25. [ bib | .html ]

StegParty is a system for hiding information inside of plain-text files. Unlike similar tools currently available it does not use random gibberish to encode data - it relies on small alterations to the message, like changes to spelling and punctuation. Because of this you can use any plain-text file as your carrier, and it will be more-or-less understandable after the secret message is embedded.
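
The alteration-based embedding StegParty describes can be illustrated with a toy variant table (hypothetical, not StegParty's actual rule set):

```python
# Each spelling/punctuation variant pair carries one bit: form `a`
# encodes 0, form `b` encodes 1. A real system tracks positions and
# capacity rather than relying on simple substring tests.
VARIANTS = [("cannot", "can not"), ("does not", "doesn't")]

def embed(text, bits):
    for (a, b), bit in zip(VARIANTS, bits):
        text = text.replace(a, b) if bit else text  # keep form `a` for 0
    return text

def extract(text):
    return [1 if b in text else 0 for a, b in VARIANTS]

carrier = "I cannot say; it does not matter."
stego = embed(carrier, [0, 1])   # -> "I cannot say; it doesn't matter."
assert extract(stego) == [0, 1]
```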

Kevin Maher. Texto. circulating on the web, February 1995. http://www.ecn.org/crypto/soft/texto.zip, accessed 2005-03-22. [ bib | http ]

Texto is a rudimentary text steganography program which transforms uuencoded or PGP ASCII-armoured data into English sentences. This program was written to facilitate the exchange of binary data, especially encrypted data. Why is this necessary? People or programs may be reading your mail. Recent events in the US Congress may _require_ Internet Service Providers to monitor incoming mail and determine whether or not it is obscene or lives up to particular parochial moral standards. Since they can't scan the contents of an encrypted message, and probably don't have time to manually look at each uuencoded message, such emails will probably go into the bit bucket. This program's output is hopefully close enough to normal English text that it will slip by any kind of automated scanning.

David McKellar. Spammimic. Website, June 2000. http://www.spammimic.com/, accessed 2004-04-12. [ bib | http ]

There is tons of spam flying around the Internet. Most people can't delete it fast enough. It's virtually invisible. This site gives you access to a program that will encrypt a short message into spam. Basically, the sentences it outputs vary depending on the message you are encoding. Real spam is so stupidly written it's sometimes hard to tell the machine written spam from the genuine article.

Paul Shields. Stegano. circulating on the web, November 2001. http://zooid.org/~paul/crypto/natlang/stegano-1.02.tar.gz, accessed 2005-03-25. [ bib | http ]

This is a small set of heuristic tools intended for use in steganographic writings. How you use these tools is up to you.

John Walker. Steganosaurus. circulating on the web, December 1994. http://www.fourmilab.ch/stego/stego.shar.gz, accessed 2005-03-25. [ bib | http ]

Steganosaurus is a plain text steganography (secret writing) utility which encodes a (usually encrypted) binary file as gibberish text, based on either a spelling dictionary or words taken from a text document. In portable C; public domain.

Peter Wayner. Mimicry applet. Website, August 1997. http://www.wayner.org/texts/mimic/, accessed 2004-04-12. [ bib | http ]

This applet shows how data can be mutated into innocent-sounding plaintext with the push of a button. In this case, the destination is the voiceover from a hypothetical baseball game between two teams named the Blogs and the Whappers. The information is encoded by choosing the words, the players, and the action in the game. In some cases, one message will lead to a string of home runs and in other cases a different message will strike out three players in a row.

Keith Winstein. Tyrannosaurus lex. Website, January 1999. http://alumni.imsa.edu/~keithw/tlex/, accessed 2005-03-09. [ bib | http ]

Steganography is a field concerned with hiding information, typically within some unsuspicious carrier. For instance, an online news site might use steganographic watermarking to encode their images with some copyright notice, allowing them to easily search for copies of the same images on another web site by searching for images containing the watermark. Schemes for hiding data in blocks of text exist, but are usually dependent on being able to modify the physical appearance of the text - usually by subtly moving lines up and down, etc. Lexical steganography is the encoding of data in blocks of text on the lexical, or word, level.

Michal Zalewski. snowdrop. freshmeat entry, September 2002. http://freshmeat.net/projects/snowdrop/, accessed 2005-03-20. [ bib | http ]

snowdrop is a steganographic text document and C code watermarking tool that uses redundant, tamper-evident and modification-proof information embedded in the content itself, instead of the medium, to simplify tracking of proprietary code leaks, sensitive information disclosure, etc.

(c) Copyright 2007-2009