| Set | Number of 7-Shingles | Attacker | Target |
| A | 316 | 136.199.196.40 | c193 |
| B | 19 | 136.199.196.40 | i192 |
| C | 258 | 136.199.196.40 | x053 |
| D | 162 | 194.204.212.6 | x053 |
| E | 400 | 200.32.73.4 | c193 |
| F | 826 | 200.32.73.4 | i192 |
| G | 207 | 200.32.73.4 | x053 |
| H | 62 | 202.131.2.167 | i192 |
| I | 17 | 202.131.2.167 | x053 |
| J | 3042 | 211.96.64.124 | c193 |
| K | 3042 | 211.96.64.124 | i192 |
| L | 1088 | 211.96.64.124 | x053 |
| M | 902 | 217.9.0.196 | c193 |
| N | 9391 | 217.9.0.196 | i192 |
| O | 76 | 217.9.0.196 | x053 |
Here are some preliminary results.
Somewhat surprisingly, with English prose the measures of resemblance, containment, and Dice's coefficient were of little use but the vector space model worked reasonably well.
The opposite is true with the syslog data: the vector space model was of little use, and its results are not shown here.
A collection of log data was selected, including 15 attack sequences from 6 attacking hosts:
Below are the results of the resemblance measure. The following sets of attack sequences are from the same attacking host, where sequences J and K are identical:
Picking an arbitrary threshold of 0.3, the values colored in green show correct classification of similar attack form, while values in yellow show errors — within-source measures below the cutoff (false-negative error) or between-source measures above it (false-positive error).
Resemblance table:
A B C D E F G H I J K L M N O
A 1.00000 0.06485 0.80205 0.00000 0.00296 0.00193 0.00203 0.00000 0.00000 0.00066 0.00066 0.00161 0.00152 0.00030 0.00000
B 0.06485 1.00000 0.08085 0.00000 0.00248 0.00131 0.00459 0.00000 0.00000 0.00036 0.00036 0.00103 0.00260 0.00015 0.00000
C 0.80205 0.08085 1.00000 0.00000 0.00324 0.00205 0.00230 0.00000 0.00000 0.00067 0.00067 0.00168 0.00167 0.00030 0.00000
D 0.00000 0.00000 0.00000 1.00000 0.03795 0.02257 0.05848 0.37423 0.09816 0.05925 0.05925 0.16963 0.00000 0.01469 0.00000
E 0.00296 0.00248 0.00324 0.03795 1.00000 0.51747 0.51948 0.02288 0.00000 0.01299 0.01299 0.02056 0.00133 0.00175 0.00000
F 0.00193 0.00131 0.00205 0.02257 0.51747 1.00000 0.26882 0.01256 0.00000 0.12230 0.12230 0.17741 0.00090 0.00181 0.00000
G 0.00203 0.00459 0.00230 0.05848 0.51948 0.26882 1.00000 0.03968 0.00000 0.01172 0.01172 0.01852 0.00177 0.00165 0.00000
H 0.00000 0.00000 0.00000 0.37423 0.02288 0.01256 0.03968 1.00000 0.27419 0.02230 0.02230 0.06381 0.00000 0.00431 0.00000
I 0.00000 0.00000 0.00000 0.09816 0.00000 0.00000 0.00000 0.27419 1.00000 0.00585 0.00585 0.01674 0.00000 0.00247 0.00000
J 0.00066 0.00036 0.00067 0.05925 0.01299 0.12230 0.01172 0.02230 0.00585 1.00000 1.00000 0.34931 0.00032 0.01121 0.00000
K 0.00066 0.00036 0.00067 0.05925 0.01299 0.12230 0.01172 0.02230 0.00585 1.00000 1.00000 0.34931 0.00032 0.01121 0.00000
L 0.00161 0.00103 0.00168 0.16963 0.02056 0.17741 0.01852 0.06381 0.01674 0.34931 0.34931 1.00000 0.00076 0.01393 0.00000
M 0.00152 0.00260 0.00167 0.00000 0.00133 0.00090 0.00177 0.00000 0.00000 0.00032 0.00032 0.00076 1.00000 0.05658 0.20765
N 0.00030 0.00015 0.00030 0.01469 0.00175 0.00181 0.00165 0.00431 0.00247 0.01121 0.01121 0.01393 0.05658 1.00000 0.01175
O 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.20765 0.01175 1.00000
Containment table:
A B C D E F G H I J K L M N O
A 1.00000 1.00000 1.00000 0.00000 0.00520 0.00269 0.00500 0.00000 0.00000 0.00073 0.00073 0.00209 0.00273 0.00031 0.00000
B 0.06485 1.00000 0.08085 0.00000 0.00260 0.00134 0.00500 0.00000 0.00000 0.00037 0.00037 0.00105 0.00273 0.00016 0.00000
C 0.80205 1.00000 1.00000 0.00000 0.00520 0.00269 0.00500 0.00000 0.00000 0.00073 0.00073 0.00209 0.00273 0.00031 0.00000
D 0.00000 0.00000 0.00000 1.00000 0.05195 0.02688 0.10000 0.98387 0.94118 0.05925 0.05925 0.16963 0.00000 0.01484 0.00000
E 0.00683 0.05263 0.00851 0.12346 1.00000 0.51747 1.00000 0.16129 0.00000 0.01463 0.01463 0.02827 0.00273 0.00186 0.00000
F 0.00683 0.05263 0.00851 0.12346 1.00000 1.00000 1.00000 0.16129 0.00000 0.13863 0.13863 0.26806 0.00273 0.00201 0.00000
G 0.00341 0.05263 0.00426 0.12346 0.51948 0.26882 1.00000 0.16129 0.00000 0.01244 0.01244 0.02199 0.00273 0.00170 0.00000
H 0.00000 0.00000 0.00000 0.37654 0.02597 0.01344 0.05000 1.00000 1.00000 0.02231 0.02231 0.06387 0.00000 0.00433 0.00000
I 0.00000 0.00000 0.00000 0.09877 0.00000 0.00000 0.00000 0.27419 1.00000 0.00585 0.00585 0.01675 0.00000 0.00247 0.00000
J 0.00683 0.05263 0.00851 1.00000 0.10390 0.50941 0.17000 0.98387 0.94118 1.00000 1.00000 1.00000 0.00273 0.01577 0.00000
K 0.00683 0.05263 0.00851 1.00000 0.10390 0.50941 0.17000 0.98387 0.94118 1.00000 1.00000 1.00000 0.00273 0.01577 0.00000
L 0.00683 0.05263 0.00851 1.00000 0.07013 0.34409 0.10500 0.98387 0.94118 0.34931 0.34931 1.00000 0.00273 0.01577 0.00000
M 0.00341 0.05263 0.00426 0.00000 0.00260 0.00134 0.00500 0.00000 0.00000 0.00037 0.00037 0.00105 1.00000 0.05658 1.00000
N 0.00683 0.05263 0.00851 0.59259 0.03117 0.01747 0.05500 0.45161 0.94118 0.03731 0.03731 0.10681 1.00000 1.00000 1.00000
O 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.20765 0.01175 1.00000
Dice's coefficient table:
A B C D E F G H I J K L M N O
A 1.00000 0.12180 0.89015 0.00000 0.00590 0.00386 0.00406 0.00000 0.00000 0.00132 0.00132 0.00321 0.00304 0.00059 0.00000
B 0.12180 1.00000 0.14961 0.00000 0.00495 0.00262 0.00913 0.00000 0.00000 0.00073 0.00073 0.00205 0.00520 0.00031 0.00000
C 0.89015 0.14961 1.00000 0.00000 0.00645 0.00409 0.00460 0.00000 0.00000 0.00135 0.00135 0.00336 0.00333 0.00060 0.00000
D 0.00000 0.00000 0.00000 1.00000 0.07313 0.04415 0.11050 0.54464 0.17877 0.11188 0.11188 0.29006 0.00000 0.02896 0.00000
E 0.00590 0.00495 0.00645 0.07313 1.00000 0.68202 0.68376 0.04474 0.00000 0.02565 0.02565 0.04030 0.00266 0.00350 0.00000
F 0.00386 0.00262 0.00409 0.04415 0.68202 1.00000 0.42373 0.02481 0.00000 0.21794 0.21794 0.30135 0.00180 0.00361 0.00000
G 0.00406 0.00913 0.00460 0.11050 0.68376 0.42373 1.00000 0.07634 0.00000 0.02318 0.02318 0.03636 0.00353 0.00330 0.00000
H 0.00000 0.00000 0.00000 0.54464 0.04474 0.02481 0.07634 1.00000 0.43038 0.04363 0.04363 0.11996 0.00000 0.00857 0.00000
I 0.00000 0.00000 0.00000 0.17877 0.00000 0.00000 0.00000 0.43038 1.00000 0.01163 0.01163 0.03292 0.00000 0.00493 0.00000
J 0.00132 0.00073 0.00135 0.11188 0.02565 0.21794 0.02318 0.04363 0.01163 1.00000 1.00000 0.51776 0.00064 0.02217 0.00000
K 0.00132 0.00073 0.00135 0.11188 0.02565 0.21794 0.02318 0.04363 0.01163 1.00000 1.00000 0.51776 0.00065 0.02217 0.00000
L 0.00321 0.00205 0.00336 0.29006 0.04030 0.30135 0.03636 0.11996 0.03292 0.51776 0.51776 1.00000 0.00151 0.02748 0.00000
M 0.00304 0.00520 0.00333 0.00000 0.00266 0.00180 0.00353 0.00000 0.00000 0.00064 0.00065 0.00151 1.00000 0.10710 0.34389
N 0.00059 0.00031 0.00060 0.02896 0.00350 0.00361 0.00330 0.00857 0.00493 0.02217 0.02217 0.02748 0.10710 1.00000 0.02322
O 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.34389 0.02322 1.00000
Resemblance and Dice's coefficient seem most useful. Dice's coefficient performed slightly better on this data set.
The containment measure was less useful due to false-positive matches, as might be expected. Small sequences (e.g., H and I) are contained within large sequences (D, J, K, L, and N in this case).
| Set | H | I | D | J | K | L | N |
| Shingles | 62 | 17 | 162 | 3042 | 3042 | 1088 | 9391 |
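The three measures compared above can be sketched in a few lines. This is only an illustration, not the analysis script that produced these tables: it assumes each attack sequence has been reduced to a list of tokens (login names here), it treats the shingles as a set of distinct 7-token windows, and the `long_seq` and `short_seq` data are made up. It shows why a short sequence that terminated early scores high on containment but low on resemblance and Dice's coefficient.

```python
def shingles(tokens, w=7):
    """Return the set of distinct contiguous w-token windows in a sequence."""
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}


def resemblance(a, b):
    """Shared shingles divided by all shingles seen in either set."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def containment(a, b):
    """Fraction of a's shingles that also appear in b."""
    return len(a & b) / len(a) if a else 0.0


def dice(a, b):
    """Twice the shared shingles divided by the sum of the two set sizes."""
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 0.0


# Hypothetical data: the short sequence is the long one, terminated early.
long_seq = [f"user{i}" for i in range(100)]
short_seq = long_seq[:12]

s_long = shingles(long_seq)
s_short = shingles(short_seq)

print("shingles:", len(s_long), len(s_short))
print("containment of short in long:", containment(s_short, s_long))
print("resemblance:", round(resemblance(s_short, s_long), 5))
print("Dice:", round(dice(s_short, s_long), 5))
```

With every shingle of the short sequence present in the long one, containment is 1.0 regardless of how different the sizes are, which is the false-positive pattern described above.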
Classification was better on sets of similar sizes than on sets where one sequence terminated much earlier. Again, this makes sense:
| Classification | Sets | Number of 7-shingles |
| Better | E, F, G | 400, 826, 207 |
| Better | H, I | 62, 17 |
| Better | J, K, L | 3042, 3042, 1088 |
| Worse | A, B, C | 316, 19, 258 |
| Worse | M, N, O | 902, 9391, 76 |
Looking just at resemblance and Dice's coefficient, we can first rule out those measures not of interest:
So, the eleven true positive matches of interest are:
AB, AC, BC,
EF, EG, FG,
HI,
JL,
MN, MO, NO
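and the true negative matches of interest are the remaining 80 pairs. With K dropped as a duplicate of J, 14 sequences remain, giving 14 × 13 / 2 = 91 distinct pairs in all.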
The thresholds could be adjusted to optimize the error rates for the resemblance and Dice's coefficient classification. Study of the tables shows the following performances with varying thresholds. The maximum total correct classifications and the crossover points are highlighted.
Resemblance:
| Threshold | Total Correct | Matches correctly classified | Non-matches correctly classified |
| 0.80206 — 1.00000 | 80/91 88% | 0/11 0% | 80/80 100% |
| 0.51949 — 0.80204 | 81/91 89% | 1/11 9% | 80/80 100% |
| 0.51748 — 0.51947 | 82/91 90% | 2/11 18% | 80/80 100% |
| 0.37424 — 0.51746 | 83/91 91% | 3/11 27% | 80/80 100% |
| 0.34932 — 0.37422 | 82/91 90% | 3/11 27% | 79/80 99% |
| 0.27420 — 0.34930 | 83/91 91% | 4/11 36% | 79/80 99% |
| 0.26883 — 0.27418 | 84/91 92% | 5/11 45% | 79/80 99% |
| 0.20766 — 0.26881 | 85/91 93% | 6/11 55% | 79/80 99% |
| 0.17742 — 0.20764 | 86/91 95% | 7/11 64% | 79/80 99% |
| 0.16964 — 0.17740 | 85/91 93% | 7/11 64% | 78/80 98% |
| 0.12231 — 0.16962 | 84/91 92% | 7/11 64% | 77/80 96% |
| 0.09817 — 0.12229 | 83/91 91% | 7/11 64% | 76/80 95% |
| 0.08086 — 0.09815 | 82/91 90% | 7/11 64% | 75/80 94% |
| 0.06486 — 0.08084 | 83/91 91% | 8/11 73% | 75/80 94% |
| 0.06382 — 0.06484 | 84/91 92% | 9/11 82% | 75/80 94% |
| 0.05926 — 0.06380 | 83/91 91% | 9/11 82% | 74/80 93% |
| 0.05849 — 0.05924 | 82/91 90% | 9/11 82% | 73/80 91% |
| 0.05659 — 0.05847 | 81/91 89% | 9/11 82% | 72/80 90% |
| 0.03969 — 0.05657 | 82/91 90% | 10/11 91% | 72/80 90% |
| 0.03796 — 0.03967 | 81/91 89% | 10/11 91% | 71/80 89% |
| 0.02289 — 0.03794 | 80/91 88% | 10/11 91% | 70/80 88% |
| 0.02258 — 0.02287 | 79/91 87% | 10/11 91% | 69/80 86% |
| 0.02231 — 0.02256 | 78/91 86% | 10/11 91% | 68/80 85% |
| 0.02057 — 0.02229 | 77/91 85% | 10/11 91% | 67/80 84% |
| 0.01853 — 0.02055 | 76/91 84% | 10/11 91% | 66/80 83% |
| 0.01675 — 0.01851 | 75/91 82% | 10/11 91% | 65/80 81% |
| 0.01470 — 0.01673 | 74/91 81% | 10/11 91% | 64/80 80% |
| 0.01394 — 0.01468 | 73/91 80% | 10/11 91% | 63/80 79% |
| 0.01300 — 0.01392 | 72/91 79% | 10/11 91% | 62/80 78% |
| 0.01176 — 0.01298 | 71/91 78% | 10/11 91% | 61/80 76% |
| 0.01173 — 0.01174 | 72/91 79% | 11/11 100% | 61/80 76% |
| 0.00001 — 0.01171 | 65-71/91 71-78% | 11/11 100% | 54-60/80 68-75% |
Dice's Coefficient:
| Threshold | Total Correct | Matches correctly classified | Non-matches correctly classified |
| 0.89016 — 1.00000 | 80/91 88% | 0/11 0% | 80/80 100% |
| 0.68377 — 0.89014 | 81/91 89% | 1/11 9% | 80/80 100% |
| 0.68203 — 0.68375 | 82/91 90% | 2/11 18% | 80/80 100% |
| 0.54465 — 0.68201 | 83/91 91% | 3/11 27% | 80/80 100% |
| 0.51777 — 0.54463 | 82/91 90% | 3/11 27% | 79/80 99% |
| 0.43039 — 0.51775 | 83/91 91% | 4/11 36% | 79/80 99% |
| 0.42374 — 0.43037 | 84/91 92% | 5/11 45% | 79/80 99% |
| 0.34390 — 0.42372 | 85/91 93% | 6/11 55% | 79/80 99% |
| 0.30136 — 0.34388 | 86/91 95% | 7/11 64% | 79/80 99% |
| 0.29007 — 0.30134 | 85/91 93% | 7/11 64% | 78/80 98% |
| 0.21795 — 0.29005 | 84/91 92% | 7/11 64% | 77/80 96% |
| 0.17878 — 0.21793 | 83/91 91% | 7/11 64% | 76/80 95% |
| 0.14962 — 0.17876 | 82/91 90% | 7/11 64% | 75/80 94% |
| 0.12181 — 0.14960 | 83/91 91% | 8/11 73% | 75/80 94% |
| 0.11997 — 0.12179 | 84/91 92% | 9/11 82% | 75/80 94% |
| 0.11189 — 0.11995 | 83/91 91% | 9/11 82% | 74/80 93% |
| 0.11051 — 0.11187 | 82/91 90% | 9/11 82% | 73/80 91% |
| 0.10711 — 0.11049 | 81/91 89% | 9/11 82% | 72/80 90% |
| 0.07635 — 0.10709 | 82/91 90% | 10/11 91% | 72/80 90% |
| 0.07314 — 0.07633 | 81/91 89% | 10/11 91% | 71/80 89% |
| 0.04475 — 0.07312 | 80/91 88% | 10/11 91% | 70/80 88% |
| 0.04416 — 0.04473 | 79/91 87% | 10/11 91% | 69/80 86% |
| 0.04364 — 0.04414 | 78/91 86% | 10/11 91% | 68/80 85% |
| 0.04031 — 0.04362 | 77/91 85% | 10/11 91% | 67/80 84% |
| 0.03637 — 0.04029 | 76/91 84% | 10/11 91% | 66/80 83% |
| 0.03294 — 0.03635 | 75/91 82% | 10/11 91% | 65/80 81% |
| 0.02897 — 0.03292 | 74/91 81% | 10/11 91% | 64/80 80% |
| 0.02749 — 0.02895 | 73/91 80% | 10/11 91% | 63/80 79% |
| 0.02566 — 0.02747 | 72/91 79% | 10/11 91% | 62/80 78% |
| 0.02482 — 0.02564 | 71/91 78% | 10/11 91% | 61/80 76% |
| 0.02323 — 0.02480 | 70/91 77% | 10/11 91% | 60/80 75% |
| 0.02319 — 0.02321 | 71/91 78% | 11/11 100% | 60/80 75% |
| 0.00001 — 0.02317 | 64-70/91 70-77% | 11/11 100% | 53-59/80 66-74% |
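The threshold tables above can be reproduced by counting, for each candidate cutoff, how many same-source pairs score at or above it and how many different-source pairs score below it. The sketch below does that for a handful of resemblance values copied from the resemblance table; the `classify` helper and the choice of six pairs are illustrative assumptions, not the original procedure.

```python
# Resemblance values for six pairs, copied from the resemblance table.
scores = {
    ("A", "C"): 0.80205,
    ("E", "G"): 0.51948,
    ("E", "F"): 0.51747,
    ("D", "H"): 0.37423,
    ("J", "L"): 0.34931,
    ("H", "I"): 0.27419,
}

# Pairs whose two sequences came from the same attacking host.
same_source = {("A", "C"), ("E", "G"), ("E", "F"), ("J", "L"), ("H", "I")}


def classify(scores, same_source, threshold):
    """Return (matches correctly classified, non-matches correctly classified)."""
    matches_ok = sum(
        1 for pair, s in scores.items() if pair in same_source and s >= threshold
    )
    non_matches_ok = sum(
        1 for pair, s in scores.items() if pair not in same_source and s < threshold
    )
    return matches_ok, non_matches_ok


for threshold in (0.6, 0.4, 0.3):
    matches_ok, non_matches_ok = classify(scores, same_source, threshold)
    print(f"threshold {threshold}: {matches_ok} matches, "
          f"{non_matches_ok} non-matches correctly classified")
```

Lowering the cutoff from 0.4 to 0.3 picks up the J-L match but turns the D-H pair into a false positive, which is the trade-off visible between adjacent rows of the tables.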
The first problem that comes to mind is the computational complexity of applying this to large collections of attack sequences. The eventual goal would be to answer, for a newly observed sequence, "Does this sequence strongly resemble anything seen so far?" That would require an approach based on a feature vector, so that the only new computation would be the calculation of the new feature vector and its comparison to an existing catalog of observed attack signatures. Analysis using the other inter-document similarity measures would require computation that grows with the size of the existing data collection.
An unexpected problem arose when applying this same sort of sequence extraction and analysis to a new month's data. That data included an attack by a botnet of 707 hosts in which only guesses for the root password were attempted.
Start: Nov 19 15:00:42 End: Nov 21 03:18:36 Duration: 36:17:54 Guesses: 2666
Typical guesses were separated by 20 to 80 seconds from the one before. With two exceptions, no botnet members made two consecutive guesses. A gnuplot histogram of the inter-guess times shows the distribution.
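As a quick consistency check on those figures, the duration and the average spacing between guesses can be computed from the start and end timestamps. This is a small sketch, not part of the original analysis: syslog timestamps carry no year, so the `YEAR` value is an arbitrary placeholder, and `parse_syslog_time` is a hypothetical helper.

```python
from datetime import datetime

YEAR = 2000  # placeholder only: syslog timestamps omit the year


def parse_syslog_time(stamp, year=YEAR):
    """Parse a syslog-style 'Mon DD HH:MM:SS' timestamp."""
    return datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")


start = parse_syslog_time("Nov 19 15:00:42")
end = parse_syslog_time("Nov 21 03:18:36")
guesses = 2666

duration = int((end - start).total_seconds())
mean_gap = duration / (guesses - 1)  # 2666 guesses leave 2665 gaps

hours, remainder = divmod(duration, 3600)
minutes, seconds = divmod(remainder, 60)
print(f"duration: {hours}:{minutes:02d}:{seconds:02d}")
print(f"mean time between guesses: {mean_gap:.1f} seconds")
```

The mean comes out at roughly 49 seconds, in the middle of the 20 to 80 second range quoted above.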
Click here to see the list of botnet members and the inter-attack timing.
The SSH daemon and the PAM modules it uses log only
the fact that an authentication failed, not the password that was tried.
Sniffing packets can show the host-to-host
handshaking, in which every client identified itself as:
SSH-2.0-libssh-0.2,
immediately suggesting that this was a botnet, all the
members running C code compiled against the libssh API.
However, once the hosts attempt host-to-host authentication
(which, of course, fails in this case) and then negotiate
ciphers and a session key, you see nothing else useful
in the raw network traffic.
The trick is to attach to the SSH server process with strace and observe the I/O at the application level.
First, find the process ID of the listening SSH server.
Run this command:
lsof -i tcp:ssh
and look for the PID of the process marked LISTEN.
Second, if that PID were 12345, run this command as
root:
strace -f -e 'read,write' -p12345
The page http://www.whatsmypass.com/?p=415 claims to list "The Top 500 Worst Passwords of All Time", but there is no explanation of where they got that data. Since admin isn't even on the list despite being the default password on lots of network gear, I don't think the list is very authoritative. But it's kind of interesting.
Coming in mid-attack, I saw an alphabetical list of names
and words being used as root password guesses:
dominique
domino
dontknow
doogie
doors
dork
doudou
doug
downtown
dragon1
driver
and so on.
Some further analysis showed that this sequence
had been used as target logins for password guesses
within two earlier attacks.
Click here to see the UNIX command-line
trick to easily find these.
Those attacks were separated by 77 days,
and the two attacking hosts were in Brazil and Germany:
Attacker: 200.213.105.90, c8d5695a.static.cps.virtua.com.br. (inetnum: 200.213.105/24, owner: TV CABO DE PORTO ALEGRE LTDA, responsible: Grupo de Segurança da Informação Vírtua, country: BR)
| Target | Start | End | Duration | Password guesses by category (root, non-root, invalid users) | Password guesses for all users | Rate |
| x053 | Mar 16 10:32:36 | Mar 16 13:27:24 | 10488 seconds | 5 / 5, 2807 / 2806 | 2812 / 2811 | 3.73 sec/guess |
| i192 | Mar 16 10:32:37 | Mar 16 13:28:57 | 10580 seconds | 4 / 4, 2850 / 2849 | 2854 / 2853 | 3.71 sec/guess |
| c193 | Mar 16 10:32:38 | Mar 16 13:28:05 | 10527 seconds | 4 / 4, 2810 / 2809 | 2814 / 2813 | 3.74 sec/guess |
| a201 | Mar 16 10:32:39 | Mar 16 13:26:45 | 10446 seconds | 4 / 4, 2816 / 2815 | 2820 / 2819 | 3.71 sec/guess |
| a202 | Mar 16 10:32:41 | Mar 16 13:26:45 | 10444 seconds | 4 / 4, 2811 / 2810 | 2815 / 2814 | 3.71 sec/guess |
| a204 | Mar 16 10:32:51 | Mar 16 13:26:45 | 10434 seconds | 4 / 4, 2807 / 2806 | 2811 / 2810 | 3.71 sec/guess |
| a203 | Mar 16 10:32:57 | Mar 16 13:26:42 | 10425 seconds | 4 / 4, 2762 / 2761 | 2766 / 2765 | 3.77 sec/guess |
| a205 | Mar 16 10:32:58 | Mar 16 13:26:40 | 10422 seconds | 4 / 4, 2763 / 2762 | 2767 / 2766 | 3.77 sec/guess |
| total: 8 targets, 22459 probes | Mar 16 10:32:36 | Mar 16 13:28:57 | 10581 seconds | 33 / 6, 22426 / 2850 | 22459 / 2853 | 0.47 sec/guess |
Attacker: 85.114.130.49, kd14.ab-webspace.de. (inetnum: 85.114.128.0 - 85.114.135.255, netname: FASTIT-DE-DUS1-COLO4, descr: fast IT Colocation, country: DE, address: fast IT GmbH, Am Gatherhof 44, 40472 Duesseldorf, DE, address: fibre one networks GmbH, Network Operations & Services, descr: DE-FIBRE1-85-114-128-0---slash-19, descr: DE-FIBRE1-85-114-128-0---slash-20)
| Target | Start | End | Duration | Password guesses by category (root, non-root, invalid users) | Password guesses for all users | Rate |
| s120 | Jun 1 15:49:25 | Jun 1 18:03:18 | 8033 seconds | 4 / 4, 2394 / 2393 | 2398 / 2397 | 3.35 sec/guess |
| x053 | Jun 1 15:49:28 | Jun 1 15:49:28 | | 1 / 1 | 1 / 1 | |
| t121 | Jun 1 15:54:37 | Jun 1 18:03:21 | 7724 seconds | 4 / 4, 2303 / 2302 | 2307 / 2306 | 3.35 sec/guess |
| a201 | Jun 1 15:56:06 | Jun 1 18:03:18 | 7632 seconds | 4 / 4, 2264 / 2263 | 2268 / 2267 | 3.37 sec/guess |
| c193 | Jun 1 15:56:07 | Jun 1 15:56:07 | | 1 / 1 | 1 / 1 | |
| i192 | Jun 1 15:56:08 | Jun 1 15:56:08 | | 1 / 1 | 1 / 1 | |
| a203 | Jun 1 15:56:22 | Jun 1 16:15:02 | 1120 seconds | 334 / 334 | 334 / 334 | 3.36 sec/guess |
| a204 | Jun 1 16:05:33 | Jun 1 18:03:19 | 7066 seconds | 4 / 4, 2092 / 2091 | 2096 / 2095 | 3.37 sec/guess |
| a202 | Jun 1 16:29:10 | Jun 1 18:03:18 | 5648 seconds | 3 / 3, 1662 / 1661 | 1665 / 1664 | 3.39 sec/guess |
| total: 9 targets, 11071 probes | Jun 1 15:49:25 | Jun 1 18:03:21 | 8036 seconds | 19 / 4, 11052 / 2396 | 11071 / 2400 | 0.73 sec/guess |
The only difference among the sequences from 200.213.105.90,
in Brazil, on March 16, was early termination.
That is, compared to the 200.213.105.90->i192 sequence
of 2,854 logins, the others were identical until they
terminated slightly before the end.
And that longest sequence probably terminated itself early,
stopping at virago within one of several alphabetical
subsequences:
..., vicki, vicky, victor1, vikram, vincent1,
violet, violin, virago
Comparing the 200.213.105.90->i192 sequence to the sequences from the attack from 85.114.130.49, in Germany, on June 1, we see that they are basically the same sequences. The differences are early termination plus deletion of a few entries mid-list, probably due to timeouts during that attack.
That 200.213.105.90->i192 sequence is the most complete version of this list seen in two attack sequences widely separated in time and space. Click here to see that list.
That sequence appears to be a list of likely passwords
rather than logins.
Just look at the first 10:
12345 abc123 password computer 123456
tigger 1234 a1b2c3 qwerty 123
Users usually aren't assigned identities like those,
but they are the sorts of passwords users
would prefer!
| Attacking host | Target | Changes required to derive the observed sequence from the 200.213.105.90->i192 sequence |
| 85.114.130.49 | s120 | Delete entries #1177, 2088, 2110 (lisa, gigi, greta); terminate early |
| 85.114.130.49 | t121 | Delete entries #1019, 1086, 1475 (creative, gasman, 181818); terminate early |
| 85.114.130.49 | a201 | Delete entries #639, 659, 1073, 1698 (godzilla, imagine, florida, april); terminate early |
| 85.114.130.49 | a203 | Delete entry #310 (morgan); terminate early |
| 85.114.130.49 | a204 | Delete entries #471, 491, 1270 (rosebud, sunny, penny); terminate early |
| 85.114.130.49 | a202 | Delete entries #48, 132, 391, 1360, 1445 (bear, biteme, explorer, snowflake, xcountry); terminate early |
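A comparison of this kind, listing the entries dropped mid-list and the point of early termination, can be sketched with Python's standard `difflib` module. This is an assumption about one way to do it, not the method used to build the table above. The `describe_changes` helper is hypothetical, the reference list is the first ten entries quoted earlier, and the observed list is invented for the example.

```python
import difflib


def describe_changes(reference, observed):
    """Return (deleted entries as (1-based index, value), entries kept before termination)."""
    matcher = difflib.SequenceMatcher(a=reference, b=observed, autojunk=False)
    deleted = []
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if tag == "delete":  # present in the reference, absent from the observed list
            deleted.extend(range(i1, i2))

    # A run of deletions reaching the end of the reference is early termination,
    # not a set of mid-list deletions.
    terminated_after = len(reference)
    while deleted and deleted[-1] == terminated_after - 1:
        terminated_after = deleted.pop()

    return [(i + 1, reference[i]) for i in deleted], terminated_after


reference = ["12345", "abc123", "password", "computer", "123456",
             "tigger", "1234", "a1b2c3", "qwerty", "123"]
# Hypothetical observed sequence: one entry missing, then an early stop.
observed = ["12345", "abc123", "password", "123456", "tigger", "1234"]

dropped, terminated_after = describe_changes(reference, observed)
print("deleted entries:", dropped)
print("terminated after reference entry #", terminated_after)
```

Only pure deletions are reported here; a sequence with substituted or inserted entries would need the other opcode types handled as well.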
Also recall that the target hosts are shown here as the first letter of the host name followed by a 3-digit representation of the last octet of the IP address. The first attack sequence started in numerical order if we disregard one entry transposed by one position, suggesting that the attacks were driven by a list or range of IP addresses or a CIDR representation. That makes sense, as none of the hosts are public web servers or other prominent hosts. The starting order of the second attack makes less sense, as the target hosts are ordered by neither host name nor IP address. The organization's DNS servers do not allow zone transfers, so an attacker could not have obtained a list of host names that way. Also, some of the systems involved are used only as servers, and so they would not have shown up as clients in some web server's logs. The only thing that really makes sense is for these attacks to have been controlled by ranges of IP addresses.
| Attacker | IP address sequence |
| 200.213.105.90 | xxx.yyy.zzz.053, 192, 193, 201, 202, 204, 203, 205 |
| 85.114.130.49 | xxx.yyy.zzz.120, 053, 121, 201, 193, 192, 203, 204, 202 |
The timing between the starts of the individual sequences also varied between the two attacks. Maybe the order and start times were randomized by the attack software.
| Attacker | Seconds between attack sequence starts |
| 200.213.105.90 | 1, 1, 1, 2, 10, 6, 1 |
| 85.114.130.49 | 3, 309, 89, 1, 1, 14, 551, 1417 |
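The gaps in that table follow directly from the start times listed in the two attack tables. The short sketch below recomputes them; the `start_gaps` helper name is just for illustration, and it assumes all starts within one attack fall on the same day, as they do here.

```python
from datetime import datetime


def start_gaps(start_times):
    """Seconds between consecutive 'HH:MM:SS' start times on the same day."""
    parsed = [datetime.strptime(t, "%H:%M:%S") for t in start_times]
    return [int((b - a).total_seconds()) for a, b in zip(parsed, parsed[1:])]


# Start times of the per-target sequences, in order of first appearance.
march_16 = ["10:32:36", "10:32:37", "10:32:38", "10:32:39",
            "10:32:41", "10:32:51", "10:32:57", "10:32:58"]
june_1 = ["15:49:25", "15:49:28", "15:54:37", "15:56:06", "15:56:07",
          "15:56:08", "15:56:22", "16:05:33", "16:29:10"]

print("200.213.105.90:", start_gaps(march_16))
print("85.114.130.49: ", start_gaps(june_1))
```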
A similar attack from a larger botnet was seen when analyzing data from two servers at a web hosting company.
These attacks show the impracticality of purely automated analysis. My analysis script can generate an HTML file describing a month's probes against a system. But the file grows proportionally when one botnet attack appears to be hundreds of one-guess attacks. What had been an HTML file of a few hundred kilobytes grows into the tens of megabytes with the addition of all those tables.
It seems that human judgement is needed to tell the difference between an unaggressive botnet and random one-guess attacks. More difficult yet is the situation where a more aggressive single-attacker sequence happens to fall in the middle of a botnet attack. Several did, in this case, but they were spotted by seeing how many consecutive guesses were made by each host. This can be done with use of a simple uniq | sort -n command sequence. Given an average inter-guess timing in the tens of seconds for the botnet versus two to four seconds for a typical single host, the non-botnet hosts stood out and could be removed from consideration.
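The same run-length idea can be sketched in Python, roughly what a `uniq -c | sort -n` pipeline reports when fed the attacking host of each guess in time order. The addresses below are documentation-range placeholders rather than real botnet members, and the cutoff of more than two consecutive guesses is an assumed value chosen for this example.

```python
from itertools import groupby


def longest_runs(hosts_in_order):
    """Map each host to its longest run of consecutive guesses."""
    longest = {}
    for host, run in groupby(hosts_in_order):
        length = sum(1 for _ in run)
        longest[host] = max(longest.get(host, 0), length)
    return longest


# Hypothetical stream: the attacking host of each guess, in time order.
guess_sources = [
    "192.0.2.10", "192.0.2.11",
    "198.51.100.7", "198.51.100.7", "198.51.100.7",
    "192.0.2.12", "192.0.2.10",
]

runs = longest_runs(guess_sources)
# Hosts making many guesses in a row look like single attackers, not botnet members.
suspects = sorted(host for host, length in runs.items() if length > 2)
print(runs)
print("probable single-host attackers:", suspects)
```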
| Botnet probes by country | Attack #1 | Attack #2 | ||
| c193 | i192 | c193 | i192 | |
| US, United States | 124 | 124 | 20 | 20 |
| DE, Germany | 122 | 121 | 10 | 10 |
| IT, Italy | 67 | 65 | 9 | 9 |
| PL, Poland | 58 | 58 | 8 | 9 |
| BR, Brazil | 51 | 49 | 11 | 11 |
| IE, Ireland | 40 | 40 | 1 | 1 |
| FR, France | 37 | 37 | 11 | 11 |
| ES, Spain | 34 | 33 | 6 | 6 |
| RO, Romania | 34 | 33 | 5 | 5 |
| CZ, Czech Republic | 27 | 24 | 5 | 5 |
| AT, Austria | 25 | 25 | 1 | 1 |
| GB, United Kingdom | 25 | 25 | 8 | 9 |
| NL, Netherlands | 21 | 20 | 1 | 1 |
| AR, Argentina | 18 | 18 | 2 | 2 |
| MX, Mexico | 18 | 18 | 5 | 5 |
| HU, Hungary | 17 | 17 | 2 | 2 |
| RU, Russian Federation | 17 | 15 | 2 | 2 |
| CO, Colombia | 15 | 15 | 2 | 2 |
| BE, Belgium | 16 | 15 | 3 | 3 |
| SE, Sweden | 15 | 15 | 0 | 0 |
| CH, Switzerland | 14 | 13 | 1 | 1 |
| CL, Chile | 12 | 12 | 4 | 4 |
| UA, Ukraine | 12 | 12 | 0 | 0 |
| DK, Denmark | 11 | 11 | 2 | 2 |
| TR, Turkey | 9 | 9 | 1 | 1 |
| SK, Slovakia | 7 | 7 | 0 | 0 |
| PE, Peru | 6 | 6 | 4 | 4 |
| SV, El Salvador | 6 | 6 | 0 | 0 |
| GT, Guatemala | 5 | 5 | 0 | 0 |
| HK, Hong Kong | 5 | 5 | 0 | 0 |
| IL, Israel | 5 | 5 | 2 | 2 |
| LT, Lithuania | 5 | 5 | 0 | 0 |
| TW, Taiwan | 5 | 5 | 2 | 2 |
| CA, Canada | 4 | 4 | 1 | 1 |
| AU, Australia | 3 | 3 | 2 | 2 |
| BG, Bulgaria | 3 | 3 | 0 | 0 |
| CN, China | 3 | 2 | 1 | 1 |
| EE, Estonia | 3 | 3 | 2 | 2 |
| ZA, South Africa | 3 | 3 | 1 | 1 |
| cannot resolve | 3 | 3 | 0 | 0 |
| CI, Cote D'Ivoire | 2 | 4 | 1 | 1 |
| IN, India | 2 | 2 | 0 | 0 |
| JP, Japan | 2 | 2 | 0 | 0 |
| PT, Portugal | 2 | 2 | 0 | 0 |
| ID, Indonesia | 1 | 2 | 1 | 1 |
| KR, Korea, Republic of | 1 | 1 | 2 | 2 |
| LA, Lao People's Democratic Republic | 1 | 1 | 0 | 0 |
| LK, Sri Lanka | 1 | 1 | 0 | 0 |
| MY, Malaysia | 1 | 1 | 0 | 0 |
| RS, Serbia | 1 | 1 | 0 | 0 |
| UZ, Uzbekistan | 1 | 2 | 0 | 0 |
| VE, Venezuela | 1 | 1 | 0 | 0 |
| | Attack #1 | | Attack #2 | |
| Target host | c193 | i192 | c193 | i192 |
| Start | Nov 14 18:35:24 | Nov 14 18:35:24 | Nov 18 07:32:06 | Nov 18 07:32:06 |
| End | Nov 16 20:52:18 | Nov 16 20:52:16 | Nov 18 18:41:57 | Nov 18 18:41:57 |
| Duration | 50:16:54 | 50:16:52 | 03:09:51 | 03:09:51 |
| Guesses made | 882 | 872 | 142 | 144 |
| Unique logins | 85 | 83 | 42 | 42 |
| Botnet members | 306 | 309 | 101 | 102 |
| Attacking hosts in common | 303 | | 101 | |
Once motivated to look for signs of botnets, I spotted two more attacks against other hosts during that same month! Those attacks seem to have been perpetrated by the same botnet, as there is a large overlap in their member hosts. The attacks were rather different from those of the first botnet, as they attempted guesses for a number of accounts in addition to root.
Logins attacked in sequence #1:
admin allison amanda andy at backup backup1 cathy cecilia charles christoph cookie cpanel crystal cs data ed frances ftp ftpuser games glenn irc jacki james jan jennifer joe john jon justin kevin kim knoppix ldap leonard lillian majordom mark martin matt melissa michel murphy mysql nagios nikki office oracle paul porter postgres postgress qmail rex richard robby robert root samba security steve test test1 test2 test3 test4 todd tomcat ts ts2 web0 web1 web2 web3 web4 web5 web6 web7 webmaster webmin wilson yolanda zach zope
Logins attacked in sequence #2:
amavis amavisd apache clamav contacts cyrus demo demo1 demo3 dhcpd fetchmail games gnats guest lp mail news spam squid student stunnel suse-ncc tcpdump test3 user user1 vscan web0 web1 web10 web11 web12 web13 web2 web3 web4 web5 web7 web8 web9 www wwwrun
Comparing the sets of attacking hosts (the botnet members), 94 were seen in both attack sequences. As seen in the table of botnet probes by country, the distribution of activity varied between the two attacks. The numbers in that table record the number of password guesses (individual attacks or probes) and do not necessarily reflect the number of unique botnet members in that country.
Some stereotypes are reinforced in that table: the high rankings of the U.S. and Europe. Others are dashed: the surprisingly low rankings of China, Hong Kong, and Japan.
Within each of the two attacks, the distributions of probe sources seen on the two target hosts were almost identical. The slight differences were probably due to timeouts leading to abandoned attempts.
For more details, extracts of the logs are available. These show the full sequence of guesses made by each botnet, showing the timestamp, login guessed, and botnet member making that guess:
The timing was less aggressive in these smaller attacks, as shown by the inter-attack timing histograms below. Attacks in the first and second sequences were spaced by 100-200 seconds and about 200 seconds, respectively.
To generate those histograms of inter-attack timing, start with a file "botnet-times" containing one inter-attack time per line:
% awk '{print $1}' botnet-list > botnet-times
% cat > histo.gnuplot << 'EOF'
set style data line
set boxwidth 3
set xlabel "Seconds between guesses"
set ylabel "Number of events with this delay"
set xrange [0:150]
set key off
set term png
set output 'timing-histogram.png'
bw = 1
bin(x,width) = width*floor(x/width)
plot 'botnet-times' using (bin($1,bw)):(1.0) smooth freq with boxes
EOF
% gnuplot histo.gnuplot
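The `bin(x,width)` function in that gnuplot script rounds each delay down to the start of its bin, and `smooth freq` then counts how many values land in each bin. The same computation can be sketched in Python for checking the numbers without plotting; the delay values here are made up for the example.

```python
import math
from collections import Counter


def bin_value(x, width):
    """Round x down to the start of its bin, like the gnuplot bin() function."""
    return width * math.floor(x / width)


def histogram(delays, width=1):
    """Count how many delays fall in each bin of the given width."""
    return Counter(bin_value(d, width) for d in delays)


# Hypothetical inter-guess delays in seconds.
delays = [21.4, 22.9, 47.0, 47.3, 79.9]
counts = histogram(delays, width=5)

for bin_start in sorted(counts):
    print(f"{bin_start:4d} s: {counts[bin_start]}")
```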
"Entropy estimation of symbol sequences",
Thomas Schuermann and Peter Grassberger
Abstract:
"We discuss algorithms for estimating the Shannon entropy h
of finite symbol sequences with long range correlations.
In particular, we consider algorithms which estimate h
from the code lengths produced by some
compression algorithm.
Our interest is in describing their convergence with
sequence length, assuming no limits for the space and
time complexities of the compression algorithms.
A scaling law is proposed for extrapolation from
finite sample lengths.
This is applied to sequences of dynamical systems
in non-trivial chaotic regimes, a 1-D cellular
automaton, and to written English texts."
"Syntactic Clustering of the Web",
Andrei Z. Broder, Steven C. Glassman, Mark S. Manasse,
and Geoffrey Zweig,
SRC Technical Note 1997-015
Abstract:
"We have developed an efficient way to determine the
syntactic similarity of files and have applied it
to every document on the World Wide Web.
Using this mechanism, we built a clustering of all
the documents that are syntactically similar.
Possible applications include a "Lost and Found"
service, filtering the results of Web searches,
updating widely distributed web-pages,
and identifying violations of
intellectual property rights."
"Finding Similar Files in
a Large File System",
Udi Manber, TR 93-33, University of Arizona
Abstract:
"We present a tool, called sif, for finding all
similar files in a large file system.
Files are considered similar if they have
significant number of common pieces, even if they
are very different otherwise.
For example, one file may be contained,
possibly with some changes, in another file,
or a file may be a reorganization of another file.
[....]"
"Clustering and Categorization
Applied to Cryptanalysis",
Claudia Oliveira,
José Antônio Xexéo,
Carlos André Carvalho,
Cryptologia
v30, n3, pp 266-280, 2006
A nice discussion of a generalized clustering approach
and its application to cryptanalysis.
"Stylistic Text Classification Using
Functional Lexical Features",
Shlomo Argamon, Casey Whitelaw, Paul Chase,
Sobhan Raj Hota, Navendu Garg, and Shlomo Levitan,
Journal of the American Society for Information Science
and Technology
vol 58, no 6, pp 802-822, 2007
Abstract:
"Most text analysis and retrieval work to date has focused
on the topic of a text; that is, what
it is about.
However, a text also contains much useful information
in its style, or how it is written.
This includes information about its author, its
purpose, feelings it is meant to evoke, and more.
This article develops a new type of lexical feature
for use in stylistic text classification, based
on taxonomies of various semantic functions
of certain choice words or phrases.
We demonstrate the usefulness of such features for
the stylistic text classification tasks of
determining author identity and nationality,
the gender of literary characters, a text's
sentiment (positive/negative evaluation),
and the rhetorical character of scientific
journal articles. [....]"
In addition to referencing
a paper I co-authored,
"Evaluating Internet Resources: Identity, Affiliation,
and Cognitive Authority in a Networked World",
J. W. Fritch and R. L. Cromwell,
JASIST (Journal of the American Society of
Information Science and Technology),
vol. 52, no. 6 (2001), pp 499-507,
they also list a number of other papers
describing work on author attribution and profiling:
| © Bob Cromwell May 2012. Created with /bin/vi and ImageMagick, hosted on OpenBSD with Apache. Root password available here, privacy policy here. |