Textual Analysis for Network Attack Recognition
Textual Analysis Tools

Textual Analysis — Background

Author attribution, the analysis of written texts in an attempt to measure distinctive patterns of words or word sequence use, is an area that has gotten quite a bit of attention. Maybe this area can help solve this problem.

Consider the problem of comparing various English language texts. There will be similarities and differences, and these might suggest common authorship, or different authorship on the same theme. To an even greater degree than the logged attack sequences, the similarity will be imperfect. No author uses identical distributions of words or multi-word phrases in two works (or at least no author would be popular or probably even published given such drearily repetitive output).

For example, Mark Twain, Rudyard Kipling and Joseph Conrad wrote at approximately the same time, so there will be similarities in their use of English. They also had some themes in common — both Twain's Huckleberry Finn and Kipling's Kim were tales of a boy's grand adventure. Similarly, both Kipling's Kim and Conrad's Heart of Darkness feature adventure and intrigue in exotic settings. Kipling's Captains Courageous and Conrad's Lord Jim are both nautical stories. But there will also be distinctive differences:

First, Twain's fiction was largely set in south-central or far western America, Kipling's were largely set in greater India, and Conrad's Heart of Darkness was set in equatorial Africa. So, each author's set of works will include distinctive place names and terminology.

Second, both Twain and Kipling wrote at least partly in local vernacular, using literal transcriptions of the pronunciations and grammars of south-central American and the Cockney of British soldiers stationed in India, so their writing styles are very distinct from strictly English-language authors of that period. (and they are orthographic nightmares to automatically analyze, leading to my not using their vernacular works here!)

Third, people study literature and earn degrees for work analyzing the differences in vocabulary and phrasing that would still be present when two authors wrote about the same thing. Their vocabularies will vary, and so will their sentence structures. The problem of author attribution is the attempt to decide who wrote a previously unknown work. This is what we are trying to accomplish with log analysis — does this attack sequence appear "similar enough" to a known cluster of general attack forms to suggest attribution to some previously seen attacker?

However, we are looking for tools to automate the process. The previous page presented some examples of similar or identical network attacks, but they were based on skilled human analysis as were the above (unskilled) human comments on similarities and differences among 19th century literature.

Some potentially useful measures of the similarity of two texts have been defined. They are based on breaking each text into subsequences and then measuring the similarity between the subsequence sets.

The texts are broken into sets called n-grams, n-tuples, or more meaningfully, w-shingles, overlapping subsequences of w tokens. For example, Kipling's The Man Who Would Be King begins:

The Law, as quoted, lays down a fair conduct of life, and one not easy to follow. I have been fellow to a beggar again and again under circumstances which prevented either of us finding out whether the other was worthy.

Punctuation and capitalization are removed and the shingles formed as overlapping subsequences. Let's say we are forming 6-shingles:

the law as quoted lays down
    law as quoted lays down a
        as quoted lays down a fair
           quoted lays down a fair conduct
	          ....

This is continued to form the list of shingles. The set formed by the first sentence contains:

the law as quoted lays down
law as quoted lays down a
as quoted lays down a fair
quoted lays down a fair conduct
lays down a fair conduct of
down a fair conduct of life
a fair conduct of life and
fair conduct of life and one
conduct of life and one not
of life and one not easy
life and one not easy to
and one not easy to follow

It seems inappropriate to form a shingle spanning the end of one sentence and the beginning of the next. They are different thoughts, and further into the text you would be forming shingles spanning from one paragraph to the next, or even one chapter to the next. It isn't fair to claim that Kipling wrote the non-grammatical phrase "one not easy to follow I" or the Yoda-like utterance "easy to follow I have been" just because he happened to use the first few words to finish one sentence and then the remainder to start some later sentence (or paragraph, or even chapter). So, the subsequent 6-shingles in the set should skip to the start of the next sentence:

i have been fellow to a beggar
have been fellow to a beggar again
been fellow to a beggar again and
... and so on ...

So, with English texts we really should do some pre-processing to break the text into sentences. This leads to difficulty caused by orthography and uncertainty about just how to deal with quoted text. To create a contrived example, how would you most meaningfully deal with nested quotes of this form:

Bagheera warned Mowgli, "Shere Khan has often said 'I will kill the man-cub.'"

Furthermore, is "man-cub" a single token or is it two tokens joined by a dash? Are the nominative "Bagheera" and the possessive "Bagheera's" to be treated as distinct tokens, or as equivalent because they are inflected forms of the same word? If they are treated as equivalent, then what about the nominative "I" versus the accusative or objective "me" versus the possessive "my"?

Those are questions of linguistics, not statistics. My approach in the initial development and testing based on literary example texts has been to attempt to preprocess the texts to do something fairly reasonable and say "close enough". Removing all punctuation after splitting into sentences means:

Bagheera's becomes bagheeras, distinct from bagheera.
man-cub becomes two tokens man and cub.
i, me, and my are, of course, distinct.
That contrived example becomes the "sentence":
bagheera warned mowgli shere khan has often said i will kill the man cub

The good news is that we don't have to worry about any of this when analyzing login guess sequences, as each sequence is treated as one long "sentence".

The Street Called Straight in Damascus, Syria, a subject of some of Twain's non-fiction writing.

Textual Analysis — Measures of Similarity

Resemblance, or Jaccard's Coefficient

Let's say we have two documents, and A and B are their two sets of unique w-shingles for some shingle size w. The resemblance, also called Jaccard's Similarity Coefficent, can be estimated as the ratio of the magnitudes of the intersection and union of their w-shingle sets. To what extent are they built from the same w-shingles? This will range from 0 (entirely disjoint) to 1 (identical shingle sets).

		\| A ∩ B \|
r(A,B)	=	----------------------
		\| A ∪ B \|

The Sørensen-Dice Similarity Coefficient

Sørensen's Similarity Coefficient is also known as the Czekanowski Index. Multiplied by two you get Dice's Coefficient, a value in the range 0 through 1:

		\| A ∩ B \|
S(A,B)	=	----------------------
		\| A \| + \| B \|

		2 \| A ∩ B \|
D(A,B)	=	----------------------
		\| A \| + \| B \|

Containment

The Containment of one shingle set within another indicates the inclusion of set A within set B, or the extent to which A is composed of elements of B.

		\| A \|
C(A,B)	=	----------------------
		\| A ∩ B \|

Hamming Distance

The Hamming Distance is the minimum number of substitutions required to change one string into a string of equal length. I cannot use the Hamming Distance as the login sequences are not of equal length. However....

Levenshtein distance

The Levenshtein distance is also called the edit distance, is the minimum number of operations needed to transform one sequence into the other, where the operations are insertion, deletion, and substitution. The Unix diff command produces editor-like output similar to a description of the Levenshtein distance.

Vector Space Model

One problem with the above measures is that they are based on sets of unique w-shingles within documents. All shingles have equal weight regardless of how many times they appear in a document. A Vector Space Model, or term vector model, provides a way of modeling the distribution of shingle uses.

Start with a corpus, a collection of input texts. I used a number of classic works from Project Gutenberg, edited to extract the text from the boilerplate header and final notes, and processed to regularize the orthography and split it into one sentence per line (at least approximately, given the above discussion of orthographic difficulties). Simple shell scripts and awk programs can then form the shingle sets and calculate the resemblance, Dice's coefficient, and containment.

For the vector space model, form the list of all unique shingles found in the corpus. This list can be in any order, alphabetical or lexicographical is fine and makes this set the output of the fairly trivial sort -u command.

Let's say there are n unique shingles. Then for document number i we form a vector of n floating point numbers:
V_i = { v₁, v₂, ..., v_n }
where v_j is the percentage of the shingles of document i identical to shingle j from the union set. The vector is then normalized so its magnitude is 1.

A vector space similarity measure is then the dot product of the vectors for two documents.

Similiarity Measure Results on English Text

The corpus from Project Gutenberg is as follows. For these examples of English prose, increased shingle size quickly leads to sets of mostly unique shingles — that is, shingles that appear just once within the text. Roughly half the 2-shingles and over 70% of the 3-shingles within each work appear just once, This causes most of the similarity measures to quickly drop to near zero. The vector space model is the only one that seems to be of much use for discovering textual similarities.

Author	Work	Words	Percentage of w-shingles appearing once only
Author	Work	Words	w=1	w=2	w=3	w=4	w=5
Joseph Conrad	Heart of Darkness	38533	15.61	64.07	92.57	98.72	99.69
	Lord Jim	130220	9.11	51.85	86.88	97.35	99.39
	The Secret Sharer	16659	18.16	67.09	93.30	98.64	99.51
	— total —	185412	7.67	48.39	84.88	96.89	99.30
Charles Dickens	Bleak House	355382	5.01	39.94	79.36	95.11	98.60
	A Christmas Carol	29201	17.28	66.34	92.89	98.03	98.88
	The Cricket on the Hearth	31634	15.61	65.10	92.84	98.28	99.33
	David Copperfield	357111	5.05	38.14	77.61	94.67	98.67
	Great Expectations	185259	6.75	43.92	81.79	95.88	98.95
	A Tale of Two Cities	136306	8.27	50.42	86.62	97.05	98.99
	— total —	1094893	2.92	30.87	71.21	92.38	98.08
Rudyard Kipling	The Second Jungle Book	64301	10.53	56.72	88.98	97.24	99.07
	Captains Courageous	52926	15.15	66.37	94.02	98.92	99.70
	Just-So Stories	29012	12.88	54.79	82.80	91.29	94.46
	Kim	106110	11.41	56.19	89.25	97.70	99.33
	The Light That Failed	71581	12.26	57.55	90.30	98.06	99.49
	The Jungle Book	51090	10.59	56.58	88.75	97.22	98.96
	The Man Who Would Be King	14238	20.05	71.03	95.16	98.96	99.72
	— total —	389258	6.51	44.41	82.44	95.77	98.65
W. Somerset Maugham	The Moon and Sixpence	74974	10.03	51.75	86.25	96.84	99.18
	Of Human Bondage	259615	5.25	38.09	76.24	93.25	98.20
	— total —	334589	4.73	36.40	75.06	92.92	98.12
H.G. Wells	The First Men in the Moon	68675	11.61	57.37	89.50	97.96	99.63
	The Invisible Man	48890	13.37	59.77	90.04	97.74	99.36
	The Island of Doctor Moreau	43743	13.44	59.57	90.19	97.78	99.35
	The Time Machine	32473	15.53	62.89	91.38	98.25	99.55
	The War of the Worlds	60213	12.31	58.74	89.92	98.02	99.56
	— total —	253994	6.64	45.85	83.05	96.25	99.18

Within-Author Results

Measures for shingles of just one or two tokens tend to be nearly 1 for comparisons within a single author. Below are the vector space measures for 3-shingles and 5-shingles.

Joseph Conrad

                   Heart  LordJi Secret   3-shingles
Heart of Darkness  1.0000 0.8445 0.9035
Lord Jim           0.8445 1.0000 0.8760
The Secret Sharer  0.9035 0.8760 1.0000

                   Heart  LordJi Secret   5-shingles
Heart of Darkness  1.0000 0.1012 0.1566
Lord Jim           0.1012 1.0000 0.1406
The Secret Sharer  0.1566 0.1406 1.0000

Charles Dickens

                           BleakH CCarol Cricke DavidC GreatE TaleOf   3-shingles
Bleak House                1.0000 0.9767 0.9764 0.9515 0.9616 0.7210
A Christmas Carol          0.9767 1.0000 0.9947 0.9779 0.9839 0.7306
The Cricket on the Hearth  0.9764 0.9947 1.0000 0.9779 0.9836 0.7307
David Copperfield          0.9515 0.9779 0.9779 1.0000 0.9648 0.7212
Great Expectations         0.9616 0.9839 0.9836 0.9648 1.0000 0.7208
A Tale of Two Cities       0.7210 0.7306 0.7307 0.7212 0.7208 1.0000

                           BleakH CCarol Cricke DavidC GreatE TaleOf   5-shingles
Bleak House                1.0000 0.3853 0.3795 0.2170 0.2477 0.2390
A Christmas Carol          0.3853 1.0000 0.6325 0.3815 0.4245 0.3128
The Cricket on the Hearth  0.3795 0.6325 1.0000 0.3760 0.4179 0.3095
David Copperfield          0.2170 0.3815 0.3760 1.0000 0.2474 0.2404
Great Expectations         0.2477 0.4245 0.4179 0.2474 1.0000 0.2454
A Tale of Two Cities       0.2390 0.3128 0.3095 0.2404 0.2454 1.0000

Rudyard Kipling

                           2ndJun CptnsC JustSo Kim    LightT Jungle MWWBKg   3-shingles
The Second Jungle Book     1.0000 0.9071 0.9121 0.8806 0.8955 0.9160 0.9384
Captains Courageous        0.9071 1.0000 0.9220 0.8909 0.9095 0.9193 0.9450
Just-So Stories            0.9121 0.9220 1.0000 0.8942 0.9136 0.9254 0.9530
Kim                        0.8806 0.8909 0.8942 1.0000 0.8805 0.8915 0.9216
The Light That Failed      0.8955 0.9095 0.9136 0.8805 1.0000 0.9104 0.9378
The Jungle Book            0.9160 0.9193 0.9254 0.8915 0.9104 1.0000 0.9497
The Man Who Would Be King  0.9384 0.9450 0.9530 0.9216 0.9378 0.9497 1.0000

                           2ndJun CptnsC JustSo Kim    LightT Jungle MWWBKg   5-shingles
The Second Jungle Book     1.0000 0.2308 0.3162 0.1974 0.2128 0.2477 0.3401
Captains Courageous        0.2308 1.0000 0.3257 0.1982 0.2109 0.2439 0.3332
Just-So Stories            0.3162 0.3257 1.0000 0.2712 0.2989 0.3390 0.4829
Kim                        0.1974 0.1982 0.2712 1.0000 0.1828 0.2096 0.2920
The Light That Failed      0.2128 0.2109 0.2989 0.1828 1.0000 0.2256 0.3098
The Jungle Book            0.2477 0.2439 0.3390 0.2096 0.2256 1.0000 0.3589
The Man Who Would Be King  0.3401 0.3332 0.4829 0.2920 0.3098 0.3589 1.0000

W. Somerset Maugham

                       MoonSx HumanB
The Moon and Sixpence  1.0000 0.9298   3-shingles
Of Human Bondage       0.9298 1.0000

                       MoonSx HumanB
The Moon and Sixpence  1.0000 0.2216   5-shingles
Of Human Bondage       0.2216 1.0000

H. G. Wells

                           1stMen Invisi Island TimeMa WarWor   3-shingles
The First Men in the Moon  1.0000 0.8835 0.8935 0.9034 0.8843
The Invisible Man          0.8835 1.0000 0.9019 0.9074 0.8921
The Island of Dr Moreau    0.8935 0.9019 1.0000 0.9187 0.9000
The Time Machine           0.9034 0.9074 0.9187 1.0000 0.9077
The War of the Worlds      0.8843 0.8921 0.9000 0.9077 1.0000

                           1stMen Invisi Island TimeMa WarWor   5-shingles
The First Men in the Moon  1.0000 0.1003 0.1035 0.1092 0.0903
The Invisible Man          0.1003 1.0000 0.1225 0.1289 0.1062
The Island of Dr Moreau    0.1035 0.1225 1.0000 0.1345 0.1095
The Time Machine           0.1092 0.1289 0.1345 1.0000 0.1164
The War of the Worlds      0.0903 0.1062 0.1095 0.1164 1.0000

The results for the works of Charles Dickens best show what was expected — his Christmas stories A Christmas Carol and Cricket on the Hearth are the most similar to each other. Their 5-shingle vector space similarity was 0.6325. His A Tale of Two Cities, with an entirely different setting and topic, the French Revolution, seems to differ the most from his other works.

The vector space similarity measure seems to be the only measure useful for English prose similarity, but the next page tries all the measures on the network attack signature data.

Given the common early termination of individual attack threads, the resemblance and even the containment measures may be of use.

To The Security Page