Textual Analysis for Network Attack Recognition
Textual Analysis Tools
Textual Analysis — Background
Author attribution, the analysis of written texts in an attempt to measure distinctive patterns of words or word sequence use, is an area that has gotten quite a bit of attention. Maybe this area can help solve this problem.
Consider the problem of comparing various English language texts. There will be similarities and differences, and these might suggest common authorship, or different authorship on the same theme. To an even greater degree than the logged attack sequences, the similarity will be imperfect. No author uses identical distributions of words or multi-word phrases in two works (or at least no author would be popular or probably even published given such drearily repetitive output).
For example, Mark Twain, Rudyard Kipling and Joseph Conrad wrote at approximately the same time, so there will be similarities in their use of English. They also had some themes in common — both Twain's Huckleberry Finn and Kipling's Kim were tales of a boy's grand adventure. Similarly, both Kipling's Kim and Conrad's Heart of Darkness feature adventure and intrigue in exotic settings. Kipling's Captains Courageous and Conrad's Lord Jim are both nautical stories. But there will also be distinctive differences:
First, Twain's fiction was largely set in south-central or far western America, Kipling's were largely set in greater India, and Conrad's Heart of Darkness was set in equatorial Africa. So, each author's set of works will include distinctive place names and terminology.
Second, both Twain and Kipling wrote at least partly in local vernacular, using literal transcriptions of the pronunciations and grammars of south-central American and the Cockney of British soldiers stationed in India, so their writing styles are very distinct from strictly English-language authors of that period. (and they are orthographic nightmares to automatically analyze, leading to my not using their vernacular works here!)
Third, people study literature and earn degrees for work analyzing the differences in vocabulary and phrasing that would still be present when two authors wrote about the same thing. Their vocabularies will vary, and so will their sentence structures. The problem of author attribution is the attempt to decide who wrote a previously unknown work. This is what we are trying to accomplish with log analysis — does this attack sequence appear "similar enough" to a known cluster of general attack forms to suggest attribution to some previously seen attacker?
However, we are looking for tools to automate the process. The previous page presented some examples of similar or identical network attacks, but they were based on skilled human analysis as were the above (unskilled) human comments on similarities and differences among 19th century literature.
Some potentially useful measures of the similarity of two texts have been defined. They are based on breaking each text into subsequences and then measuring the similarity between the subsequence sets.
The texts are broken into sets called n-grams, n-tuples, or more meaningfully, w-shingles, overlapping subsequences of w tokens. For example, Kipling's The Man Who Would Be King begins:
The Law, as quoted, lays down a fair conduct
of life, and one not easy to follow.
I have been fellow to a beggar again and
again under circumstances which prevented
either of us finding out whether the other
was worthy.
Punctuation and capitalization are removed and the shingles formed as overlapping subsequences. Let's say we are forming 6-shingles:
the law as quoted lays down law as quoted lays down a as quoted lays down a fair quoted lays down a fair conduct ....
This is continued to form the list of shingles. The set formed by the first sentence contains:
the law as quoted lays down
law as quoted lays down a
as quoted lays down a fair
quoted lays down a fair conduct
lays down a fair conduct of
down a fair conduct of life
a fair conduct of life and
fair conduct of life and one
conduct of life and one not
of life and one not easy
life and one not easy to
and one not easy to follow
It seems inappropriate to form a shingle spanning the end of one sentence and the beginning of the next. They are different thoughts, and further into the text you would be forming shingles spanning from one paragraph to the next, or even one chapter to the next. It isn't fair to claim that Kipling wrote the non-grammatical phrase "one not easy to follow I" or the Yoda-like utterance "easy to follow I have been" just because he happened to use the first few words to finish one sentence and then the remainder to start some later sentence (or paragraph, or even chapter). So, the subsequent 6-shingles in the set should skip to the start of the next sentence:
i have been fellow to a beggar
have been fellow to a beggar again
been fellow to a beggar again and
... and so on ...
So, with English texts we really should do some pre-processing to break the text into sentences. This leads to difficulty caused by orthography and uncertainty about just how to deal with quoted text. To create a contrived example, how would you most meaningfully deal with nested quotes of this form:
Bagheera warned Mowgli,
"Shere Khan has often said
'I will kill the man-cub.'"
Furthermore, is "man-cub" a single token or is it two tokens joined by a dash? Are the nominative "Bagheera" and the possessive "Bagheera's" to be treated as distinct tokens, or as equivalent because they are inflected forms of the same word? If they are treated as equivalent, then what about the nominative "I" versus the accusative or objective "me" versus the possessive "my"?
Those are questions of linguistics, not statistics. My approach in the initial development and testing based on literary example texts has been to attempt to preprocess the texts to do something fairly reasonable and say "close enough". Removing all punctuation after splitting into sentences means:
- Bagheera's becomes bagheeras, distinct from bagheera.
- man-cub becomes two tokens man and cub.
- i, me, and my are, of course, distinct.
-
That contrived example becomes the "sentence":
bagheera warned mowgli shere khan has often said i will kill the man cub
The good news is that we don't have to worry about any of this when analyzing login guess sequences, as each sequence is treated as one long "sentence".
Textual Analysis — Measures of Similarity
Resemblance, or Jaccard's Coefficient
Let's say we have two documents, and A and B are their two sets of unique w-shingles for some shingle size w. The resemblance, also called Jaccard's Similarity Coefficent, can be estimated as the ratio of the magnitudes of the intersection and union of their w-shingle sets. To what extent are they built from the same w-shingles? This will range from 0 (entirely disjoint) to 1 (identical shingle sets).
| A ∩ B | | ||
r(A,B) | = | ---------------------- |
| A ∪ B | |
The Sørensen-Dice Similarity Coefficient
Sørensen's Similarity Coefficient is also known as the Czekanowski Index. Multiplied by two you get Dice's Coefficient, a value in the range 0 through 1:
| A ∩ B | | ||
S(A,B) | = | ---------------------- |
| A | + | B | | ||
2 | A ∩ B | | ||
D(A,B) | = | ---------------------- |
| A | + | B | |
Containment
The Containment of one shingle set within another indicates the inclusion of set A within set B, or the extent to which A is composed of elements of B.
| A | | ||
C(A,B) | = | ---------------------- |
| A ∩ B | |
Hamming Distance
The Hamming Distance is the minimum number of substitutions required to change one string into a string of equal length. I cannot use the Hamming Distance as the login sequences are not of equal length. However....
Levenshtein distance
The
Levenshtein distance
is also called the edit distance,
is the minimum number of operations needed to transform
one sequence into the other,
where the operations are insertion, deletion, and substitution.
The Unix
diff
command produces editor-like output similar to a description
of the Levenshtein distance.
Vector Space Model
One problem with the above measures is that they are based on sets of unique w-shingles within documents. All shingles have equal weight regardless of how many times they appear in a document. A Vector Space Model, or term vector model, provides a way of modeling the distribution of shingle uses.
Start with a corpus, a collection of input texts.
I used a number of classic works from
Project Gutenberg,
edited to extract the text from the boilerplate
header and final notes,
and processed to regularize the orthography and split
it into one sentence per line (at least approximately,
given the above discussion of orthographic difficulties).
Simple
shell scripts and
awk
programs can then form the shingle sets and calculate
the resemblance, Dice's coefficient, and containment.
For the vector space model,
form the list of all unique shingles found in the corpus.
This list can be in any order, alphabetical or lexicographical
is fine and makes this set the output of the fairly trivial
sort -u
command.
Let's say there are n unique shingles.
Then for document number i we form a vector of n
floating point numbers:
Vi = { v1, v2, ..., vn }
where vj is the percentage of the shingles of
document i identical to shingle j from the union set.
The vector is then normalized so its magnitude is 1.
A vector space similarity measure is then the dot product of the vectors for two documents.
Similiarity Measure Results on English Text
The corpus from Project Gutenberg is as follows. For these examples of English prose, increased shingle size quickly leads to sets of mostly unique shingles — that is, shingles that appear just once within the text. Roughly half the 2-shingles and over 70% of the 3-shingles within each work appear just once, This causes most of the similarity measures to quickly drop to near zero. The vector space model is the only one that seems to be of much use for discovering textual similarities.
Author | Work | Words | Percentage of w-shingles appearing once only | ||||
w=1 | w=2 | w=3 | w=4 | w=5 | |||
Joseph Conrad | Heart of Darkness | 38533 | 15.61 | 64.07 | 92.57 | 98.72 | 99.69 |
Lord Jim | 130220 | 9.11 | 51.85 | 86.88 | 97.35 | 99.39 | |
The Secret Sharer | 16659 | 18.16 | 67.09 | 93.30 | 98.64 | 99.51 | |
— total — | 185412 | 7.67 | 48.39 | 84.88 | 96.89 | 99.30 | |
Charles Dickens | Bleak House | 355382 | 5.01 | 39.94 | 79.36 | 95.11 | 98.60 |
A Christmas Carol | 29201 | 17.28 | 66.34 | 92.89 | 98.03 | 98.88 | |
The Cricket on the Hearth | 31634 | 15.61 | 65.10 | 92.84 | 98.28 | 99.33 | |
David Copperfield | 357111 | 5.05 | 38.14 | 77.61 | 94.67 | 98.67 | |
Great Expectations | 185259 | 6.75 | 43.92 | 81.79 | 95.88 | 98.95 | |
A Tale of Two Cities | 136306 | 8.27 | 50.42 | 86.62 | 97.05 | 98.99 | |
— total — | 1094893 | 2.92 | 30.87 | 71.21 | 92.38 | 98.08 | |
Rudyard Kipling | The Second Jungle Book | 64301 | 10.53 | 56.72 | 88.98 | 97.24 | 99.07 |
Captains Courageous | 52926 | 15.15 | 66.37 | 94.02 | 98.92 | 99.70 | |
Just-So Stories | 29012 | 12.88 | 54.79 | 82.80 | 91.29 | 94.46 | |
Kim | 106110 | 11.41 | 56.19 | 89.25 | 97.70 | 99.33 | |
The Light That Failed | 71581 | 12.26 | 57.55 | 90.30 | 98.06 | 99.49 | |
The Jungle Book | 51090 | 10.59 | 56.58 | 88.75 | 97.22 | 98.96 | |
The Man Who Would Be King | 14238 | 20.05 | 71.03 | 95.16 | 98.96 | 99.72 | |
— total — | 389258 | 6.51 | 44.41 | 82.44 | 95.77 | 98.65 | |
W. Somerset Maugham | The Moon and Sixpence | 74974 | 10.03 | 51.75 | 86.25 | 96.84 | 99.18 |
Of Human Bondage | 259615 | 5.25 | 38.09 | 76.24 | 93.25 | 98.20 | |
— total — | 334589 | 4.73 | 36.40 | 75.06 | 92.92 | 98.12 | |
H.G. Wells | The First Men in the Moon | 68675 | 11.61 | 57.37 | 89.50 | 97.96 | 99.63 |
The Invisible Man | 48890 | 13.37 | 59.77 | 90.04 | 97.74 | 99.36 | |
The Island of Doctor Moreau | 43743 | 13.44 | 59.57 | 90.19 | 97.78 | 99.35 | |
The Time Machine | 32473 | 15.53 | 62.89 | 91.38 | 98.25 | 99.55 | |
The War of the Worlds | 60213 | 12.31 | 58.74 | 89.92 | 98.02 | 99.56 | |
— total — | 253994 | 6.64 | 45.85 | 83.05 | 96.25 | 99.18 |
Within-Author Results
Measures for shingles of just one or two tokens tend to be nearly 1 for comparisons within a single author. Below are the vector space measures for 3-shingles and 5-shingles.
Joseph Conrad
Heart LordJi Secret 3-shingles Heart of Darkness 1.0000 0.8445 0.9035 Lord Jim 0.8445 1.0000 0.8760 The Secret Sharer 0.9035 0.8760 1.0000 Heart LordJi Secret 5-shingles Heart of Darkness 1.0000 0.1012 0.1566 Lord Jim 0.1012 1.0000 0.1406 The Secret Sharer 0.1566 0.1406 1.0000
Charles Dickens
BleakH CCarol Cricke DavidC GreatE TaleOf 3-shingles Bleak House 1.0000 0.9767 0.9764 0.9515 0.9616 0.7210 A Christmas Carol 0.9767 1.0000 0.9947 0.9779 0.9839 0.7306 The Cricket on the Hearth 0.9764 0.9947 1.0000 0.9779 0.9836 0.7307 David Copperfield 0.9515 0.9779 0.9779 1.0000 0.9648 0.7212 Great Expectations 0.9616 0.9839 0.9836 0.9648 1.0000 0.7208 A Tale of Two Cities 0.7210 0.7306 0.7307 0.7212 0.7208 1.0000 BleakH CCarol Cricke DavidC GreatE TaleOf 5-shingles Bleak House 1.0000 0.3853 0.3795 0.2170 0.2477 0.2390 A Christmas Carol 0.3853 1.0000 0.6325 0.3815 0.4245 0.3128 The Cricket on the Hearth 0.3795 0.6325 1.0000 0.3760 0.4179 0.3095 David Copperfield 0.2170 0.3815 0.3760 1.0000 0.2474 0.2404 Great Expectations 0.2477 0.4245 0.4179 0.2474 1.0000 0.2454 A Tale of Two Cities 0.2390 0.3128 0.3095 0.2404 0.2454 1.0000
Rudyard Kipling
2ndJun CptnsC JustSo Kim LightT Jungle MWWBKg 3-shingles The Second Jungle Book 1.0000 0.9071 0.9121 0.8806 0.8955 0.9160 0.9384 Captains Courageous 0.9071 1.0000 0.9220 0.8909 0.9095 0.9193 0.9450 Just-So Stories 0.9121 0.9220 1.0000 0.8942 0.9136 0.9254 0.9530 Kim 0.8806 0.8909 0.8942 1.0000 0.8805 0.8915 0.9216 The Light That Failed 0.8955 0.9095 0.9136 0.8805 1.0000 0.9104 0.9378 The Jungle Book 0.9160 0.9193 0.9254 0.8915 0.9104 1.0000 0.9497 The Man Who Would Be King 0.9384 0.9450 0.9530 0.9216 0.9378 0.9497 1.0000 2ndJun CptnsC JustSo Kim LightT Jungle MWWBKg 5-shingles The Second Jungle Book 1.0000 0.2308 0.3162 0.1974 0.2128 0.2477 0.3401 Captains Courageous 0.2308 1.0000 0.3257 0.1982 0.2109 0.2439 0.3332 Just-So Stories 0.3162 0.3257 1.0000 0.2712 0.2989 0.3390 0.4829 Kim 0.1974 0.1982 0.2712 1.0000 0.1828 0.2096 0.2920 The Light That Failed 0.2128 0.2109 0.2989 0.1828 1.0000 0.2256 0.3098 The Jungle Book 0.2477 0.2439 0.3390 0.2096 0.2256 1.0000 0.3589 The Man Who Would Be King 0.3401 0.3332 0.4829 0.2920 0.3098 0.3589 1.0000
W. Somerset Maugham
MoonSx HumanB The Moon and Sixpence 1.0000 0.9298 3-shingles Of Human Bondage 0.9298 1.0000 MoonSx HumanB The Moon and Sixpence 1.0000 0.2216 5-shingles Of Human Bondage 0.2216 1.0000
H. G. Wells
1stMen Invisi Island TimeMa WarWor 3-shingles The First Men in the Moon 1.0000 0.8835 0.8935 0.9034 0.8843 The Invisible Man 0.8835 1.0000 0.9019 0.9074 0.8921 The Island of Dr Moreau 0.8935 0.9019 1.0000 0.9187 0.9000 The Time Machine 0.9034 0.9074 0.9187 1.0000 0.9077 The War of the Worlds 0.8843 0.8921 0.9000 0.9077 1.0000 1stMen Invisi Island TimeMa WarWor 5-shingles The First Men in the Moon 1.0000 0.1003 0.1035 0.1092 0.0903 The Invisible Man 0.1003 1.0000 0.1225 0.1289 0.1062 The Island of Dr Moreau 0.1035 0.1225 1.0000 0.1345 0.1095 The Time Machine 0.1092 0.1289 0.1345 1.0000 0.1164 The War of the Worlds 0.0903 0.1062 0.1095 0.1164 1.0000
The results for the works of Charles Dickens best show what was expected — his Christmas stories A Christmas Carol and Cricket on the Hearth are the most similar to each other. Their 5-shingle vector space similarity was 0.6325. His A Tale of Two Cities, with an entirely different setting and topic, the French Revolution, seems to differ the most from his other works.
The vector space similarity measure seems to be the only measure useful for English prose similarity, but the next page tries all the measures on the network attack signature data.
Given the common early termination of individual attack threads, the resemblance and even the containment measures may be of use.
To The Security Page