Linux keyboard.

Bob's Blog

Learn How To Write A Shell Script To Analyze Logs

I recently wrote an article showing how to write your first shell script. It had very simple goal, displaying the current date and time in multiple formats.

Now let's create a script with a useful function!

The web server process maintains a log file that records all the requests made by clients. We want to use that to create a report listing the twenty most popular pages with their rank, the number of requests, and the path to that page.

This report could be useful to the web site's owner, telling them which pages are getting attention. For me, it's one page about building an Active Directory server on a Raspberry Pi running FreeBSD, and a series of pages about Linux performance tuning.

There are also cybersecurity applications, as you could use this to see what exploits the attackers are trying to use against your server. And, this same approach could help you spot interesting or alarming attack and abuse patterns in other log files.

The result will start something like this, where all of these URLs are on my site but it will instead show this for yours:

1  51578 /open-source/samba-active-directory/freebsd-raspberry-pi.html
2  35999 /open-source/performance-tuning/tcp.html
3  35811 /technical/dsl/
4  30288 /open-source/performance-tuning/nfs.html
5  21761 /open-source/performance-tuning/disks.html
6  18940 /travel/usa/new-york-mcgees/
...

This piece digs into some issues of web site hosting, web client-server communication, and cybersecurity. If you're interested in writing shell scripts but this specific application of scripting isn't your cup of tea, check out one of the other scripting articles I'll be putting together soon. In one, I'll dig deeper into the web server log to see the automated attacks being launched against my server. In another, I'll look for remote authentication attacks in a different log file.

Where Is The Data?

Well, that depends! It depends on which distribution you're using, whether you're using Nginx or the Apache httpd web server, and what customizations you may have made to the configuration.

Figure out the process IDs for your web server process by using whichever of these is appropriate:

$ pgrep nginx		# If you're using Nginx's nginx
  -- or --
$ pgrep httpd		# If you're using Apache's httpd

One of those will return a short list of process ID numbers. Use the first number from the list you get, and run this command with 12345 changed to match your number. The grep -v removes a lot of otherwise confusing details about shared libraries:

$ lsof -p 12345 | grep -vw REG

If there's no output despite having typed everything correctly, then lsof on your system only shows unprivileged non-root users information about their own processes. If that's the case, run it as root:

$ sudo lsof -p 12345 | grep -vw REG

There should be several lines of detailed output. You're looking for something near the bottom of the list that looks like a path to a likely log file. Something like /var/www/logs/access.log, perhaps.

In the following examples and the resulting shell script I will use "access" to represent the path to the web server access log. You can either:

What Does The Log Data Look Like?

Each line of the log file represents a single client request for some page on the site. Take a look at the first few lines of the file:

$ head -5 arrival
75.85.165.186 - - [01/Jan/2022:00:00:04 +0000] "GET /travel/romania/bucovina-gura-humorului/ HTTP/1.1" 200 25623 "-" TLSv1.3 X25519 TLS_AES_256_GCM_SHA384
167.114.101.27 - - [01/Jan/2022:00:00:09 +0000] "GET /travel/japan/kofun/emperor-seimu.html HTTP/1.1" 200 11330 "https://t.co/" TLSv1.2 secp521r1 ECDHE-ECDSA-AES256-GCM-SHA384
71.17.173.104 - - [01/Jan/2022:00:01:20 +0000] "GET /networking/routing.html HTTP/2.0" 200 10110 "-" TLSv1.3 X25519 TLS_AES_256_GCM_SHA384
35.240.26.48 - - [01/Jan/2022:00:01:21 +0000] "GET /cybersecurity/crypto/diffie-hellman.html HTTP/1.1" 200 11339 "-" TLSv1.2 secp521r1 ECDHE-ECDSA-AES128-GCM-SHA256
66.249.79.13 - - [01/Jan/2022:00:02:58 +0000] "GET /travel/usa/new-york-nighthawks/ HTTP/1.1" 200 16228 "-" TLSv1.3 X25519 TLS_AES_256_GCM_SHA384

That's a lot of technical detail! Let's look at just the first line, and I'll split it into fields at the blank spaces:

  1. 75.85.165.186 — that's the client IP address
  2. - — just a dash
  3. - — just another dash
  4. [01/Jan/2022:00:00:04 — the date and time of the request, with a starting "["
  5. +0000] — the timezone offset from UTC on the server, plus an ending "]"
  6. "GET — the specific request used by the client, with a starting "
  7. /travel/romania/bucovina-gura-humorului/ — what the client requests, this is what we're especially interested in
  8. HTTP/1.1" — the protocol used by the client, plus an ending ""
  9. 200 — the return code sent by the server along with the data, with 200 meaning success, 404 meaning there's no such page, 301 meaning that the page still exists but it's been moved, and many others
  10. 25623 — the number of bytes of data in that page
  11. "-" — ideally, this would be the referring page, the page on which someone clicked a link to go to request my page, although few browsers send this now and it's usually empty as we see in this entry (but see the second entry in the file, shown above, a referral from Twitter's link shortener)
  12. TLSv1.3 — the version of TLS used for the connection — starting here, these last three fields are logging customizations of my own
  13. X25519 — the key-agreement method used to set up the session key, this means Elliptic-Curve Diffie-Hellman Ephemeral using the x25519 elliptic curve
  14. TLS_AES_256_GCM_SHA384 — the cipher used to protect the data in transit

What Needs To Happen?

We saw above that we're interested in the 7th field of each line. Those are the page requests.

Some of the pages will appear many times, some not quite so many, some just a few times, and a few pages may appear just once. They will be all jumbled in order, some times with a given page a few or several times in a row. But we're looking at what everyone across the Internet is requesting, so there will be no organization to the order.

There's going to be a set of pages that are asked for one or more times each. We would like to get a list that has one line for each page, with the number of requests and the full path of the page.

Then, make sure that they're in descending order, largest first then counting down, based on the request numbers.

We said the twenty most popular, so the twenty largest request numbers, which will be the first twenty when we have them in descending order.

Finally, we said that we wanted each line in the list to also show the rank — 1 for the most popular, 2 for the second most popular, and so on.

Notice that I have described a pipeline. Start with the original data as stored in a file, do something with that content, sending the resulting output on for a second transformation, then sending that into a third transformation, and so on.

The Alaska Pipeline running southeast from Fairbanks.

The Alaska Pipeline running southeast from Fairbanks toward the oil terminal at Valdez. Each pipe segment simply carries crude oil, the same as all the others. We're building a command pipeline, with different processes applied at each segment.

But How Do We Turn That Into Commands?

As you become familiar with commands, you will come to associate the command with what it does, or how it fits into a story like the above. This becomes easy as you become familiar with the command-line environment. One of the needed steps becomes a pair of commands often used as you'll see it here.

Description Command
Select the Nth field. awk
Count how many of each, which becomes:
Sort them into order, and then count how many are in each block of identical strings.
sort | uniq
Sort that into some specific order sort
Keep just the first twenty head
Number the resulting lines. cat

So We Just Run Those Commands?

We run those commands, but most of them need options specifying the details of what we want to do:

Description Command
Select the 7th field. awk '{print $7}'
Count how many of each. sort | uniq -c
Sort that into reverse numerical order sort -nr
Keep just the first twenty head -20
Number the resulting lines. cat -n

We're Ready To Try This!

Here's what happens I try the above on the access log for cromwell-intl.com for the entire year of 2022:

$ awk '{print $7}' access |
			sort | uniq -c |
			sort -nr |
			head -20 |
			cat -n
     1   253885 /
     2   227872 /robots.txt
     3    53616 /ads.txt
     4    52506 /open-source/samba-active-directory/freebsd-raspberry-pi.html
     5    44206 /pictures/denied.png
     6    36565 /open-source/performance-tuning/tcp.html
     7    36107 /technical/dsl/
     8    30780 /open-source/performance-tuning/nfs.html
     9    22064 /open-source/performance-tuning/disks.html
    10    18975 /travel/usa/new-york-mcgees/
    11    18671 /open-source/pdf-not-authorized.html
    12    18302 /travel/
    13    17338 /turkish/
    14    16915 /open-source/performance-tuning/ethernet.html
    15    16756 /open-source/openbsd-qemu-windows-howto.html
    16    16307 /travel
    17    16177 /open-source/sendmail-ssl.html
    18    14510 /open-source/
    19    14125 /cybersecurity/
    20    14023 /cybersecurity/isc2-ccsp/

I think that's pretty good for a first attempt at this!

But, we can do even better!

What you will notice, starting here, is that this page doesn't seem to have anything to do with the design and creation of a file containing a shell script. (At least not until the very end)

Exactly! That's an excellent observation that I'm sure you were about to make! I am using the interactive command-line environment as an experimental development environment. That's the entire point of how to do this.

Popular 'meme' image of Schwartzenegger-like Simpsons character saying 'That's the joke.'

I will have an initial idea for a command pipeline, and I will try it. I will see what worked, and what didn't, and slightly modify it. I will continue that process until I have what I want. Only then will I copy and paste that into a file and turn that into a new shell script.

Fixing A Few Obvious Issues

Four things in the above output stick out. First, the home page or the root of the web site, simply "/", seems to be the most popular page on the site. However, that page on this site has very little content. It's just a very short, high-level list of choices to start exploring this web site, which I admit is oddly diverse. Why is that URL so overwhelmingly popular? Because there are many bots or automated processes out there reconnecting to web servers to see if they're still there.

Second and third, the /robots.txt and /ads.txt URLs appear very high for similar and additional reasons. The /robots.txt file is intended to tell indexing bots (think Google, Bing, Baidu, Yandex, etc) what to ignore. Their legitimate bots check that file quite frequently.

But also, bots operated by hackers search the Internet for lists of pages that web site administrators don't want to appear in search results. Sometimes that's because they contain sensitive information and those webmasters naïvely thought that listing them in /robots.txt would help instead of making things far, far worse. All the legitimate and hacker bots add up to a lot of /robots.txt requests, none of which do we want to measure. Let's ignore those.

As for /ads.txt, you need to use it when you include Google AdSense and Infolinks ads on your pages. Look, sorry but not sorry, those bring me a significant amount of money each year. The Google AdSense ad content gets to be pretty lurid, but not quite so lurid that I'm going to turn down their money.

Fourth, /pictures/denied.png is frequently requested for an annoying reason. People will see a picture on one of my pages that they want to include on their page. Then, they can't be bothered to steal it in a straightforward fashion by saving the image and actually putting it on their page. They will include a little snippet of code that says "Insert here a picture that the client should download from cromwell-intl.com, incurring a little added bandwidth expense for that site."

Well, I'm wise to that trick. If a browser says it wants just a picture from my site, and if it wasn't referred to get that image file by one of google.com, or printfriendly.com, or yahoo.com, or baidu.com, what ends up within the page trying to leech off my bandwidth is this obnoxiously pink and rather small 5,561 byte PNG image.

Bob's 'This page tries to leech traffic' PNG image.

Let's insert one egrep command early in the pipeline, discarding all lines that contain nothing but "/", or "/robots.txt", or "ads.txt", or "denied.png". In regular expressions, ^ is the beginning of a line and $ is the end of a line, and so ^/$ matches a line with nothing but "/". The -v option means "accept only lines without this pattern". The egrep command supports a 'this|that|the_other' pattern. The log file contains 722 MB of data. It's good to accomplish four tasks with one egrep process instead of one process for each task.

$ awk '{print $7}' access |
			egrep -v '^/$|/robots.txt|/ads.txt|denied.png' |
			sort | uniq -c |
			sort -nr |
			head -20 |
			cat -n
     1    52506 /open-source/samba-active-directory/freebsd-raspberry-pi.html
     2    36565 /open-source/performance-tuning/tcp.html
     3    36107 /technical/dsl/
     4    30780 /open-source/performance-tuning/nfs.html
     5    22064 /open-source/performance-tuning/disks.html
     6    18975 /travel/usa/new-york-mcgees/
     7    18671 /open-source/pdf-not-authorized.html
     8    18302 /travel/
     9    17338 /turkish/
    10    16915 /open-source/performance-tuning/ethernet.html
    11    16756 /open-source/openbsd-qemu-windows-howto.html
    12    16307 /travel
    13    16177 /open-source/sendmail-ssl.html
    14    14510 /open-source/
    15    14125 /cybersecurity/
    16    14023 /cybersecurity/isc2-ccsp/
    17    13390 /turkish/verbs.html
    18    13088 /open-source/tar-and-ssh.html
    19    12856 /open-source/raspberry-pi/sdr-getting-started.html
    20    12775 /.env

I Don't Log Requests For Images And Other Non-HTML Data, But Maybe You Do

I have configured my log server to not log requests for images, the associated *.ico icons, CSS or Cascading Style Sheet content, JavaScript content, or font content. In my nginx.conf that looks like this:

# Don't log bulky data, even HTTP requests that will be redirected.
location ~* \.(jpg|jpeg|png|gif|ico|css|js|ttf)$ {
	access_log off;
} 

Let's add a pipeline stage to ignore that content for our report:

$ awk '{print $7}' access |
			egrep -v '^/$|/robots.txt|/ads.txt|denied.png' |
			egrep -v '\.jpg|\.jpeg|\.gif|\.ico|\.css|\.js|\.ttf' |
			sort | uniq -c |
			sort -nr |
			head -20 |
			cat -n
     1    52506 /open-source/samba-active-directory/freebsd-raspberry-pi.html
     2    36565 /open-source/performance-tuning/tcp.html
     3    36107 /technical/dsl/
     4    30780 /open-source/performance-tuning/nfs.html
     5    22064 /open-source/performance-tuning/disks.html
     6    18975 /travel/usa/new-york-mcgees/
     7    18671 /open-source/pdf-not-authorized.html
     8    18302 /travel/
     9    17338 /turkish/
    10    16915 /open-source/performance-tuning/ethernet.html
    11    16756 /open-source/openbsd-qemu-windows-howto.html
    12    16307 /travel
    13    16177 /open-source/sendmail-ssl.html
    14    14510 /open-source/
    15    14125 /cybersecurity/
    16    14023 /cybersecurity/isc2-ccsp/
    17    13390 /turkish/verbs.html
    18    13088 /open-source/tar-and-ssh.html
    19    12856 /open-source/raspberry-pi/sdr-getting-started.html
    20    12775 /.env

Yes, you could combine the two consecutive egrep commands into one. Feel free to do so!

Fixing An Issue That Wasn't So Obvious

The last thing reasonably obvious in the above is that 18,302 clients asked for /travel/ while 16,307 clients asked for /travel, exactly the same thing except without the trailing "/". Those should have been combined into one entry, moving it up in the list.

How to fix this? An easy way to is insert, before the first sort, commands to first add "/" to every line, then change all resulting ".html/" to just ".html", and finally change any remaining "//" to simply "/".

Changing patterns should make you think of sed. We need to use something other than "/" as a delimiter. And we'll make the purge of doubled slash be g for global, changing all instances on a line, cleaning up the clumsy results caused by people inserting extra slashes in links to my pages. I'll gladly take the barely-functional client referrals, but I'll clean it up to see what's popular.

There's another thing to fix while we're fixing some related issues — some clients ask for /travel/Index.html, which is the same as just /travel/, and so on for all other directories. So let's also change the patterns to delete all trailing "Index.html" strings.

Here's the result, which has led to several count and order changes:

$ awk '{print $7}' access |
			egrep -v '^/$|/robots.txt|/ads.txt|denied.png' |
			egrep -v '\.jpg|\.jpeg|\.gif|\.ico|\.css|\.js|\.ttf' |
			sed 's@$@/@' |
			sed 's@.html/@.html@' |
			sed 's@//@/@g' |
			sed 's@/Index.html@/@' |
			sort | uniq -c |
			sort -nr |
			head -20 |
			cat -n
     1    52516 /open-source/samba-active-directory/freebsd-raspberry-pi.html
     2    42100 /technical/dsl/
     3    36586 /open-source/performance-tuning/tcp.html
     4    34824 /travel/
     5    30812 /open-source/performance-tuning/nfs.html
     6    29984 /turkish/
     7    26345 /open-source/
     8    25313 /cybersecurity/
     9    23044 /networking/
    10    22561 /travel/usa/new-york-mcgees/
    11    22343 /radio/
    12    22117 /open-source/performance-tuning/disks.html
    13    20006 /technical/
    14    18685 /open-source/pdf-not-authorized.html
    15    17756 /cybersecurity/isc2-ccsp/
    16    17484 /3d/
    17    17307 /fun/
    18    17259 /russian/
    19    16920 /open-source/performance-tuning/ethernet.html
    20    16757 /open-source/openbsd-qemu-windows-howto.html

Fixing A Few Issues That Aren't At All Obvious

I have dug much further into the details than just this Top 20 list, and I know that there some other issues we should fix right away.

About 94% of the requests for that year used the GET command. Those are the normal page requests we were thinking about when we started this project. However, some of them will be hackers fishing for various vulnerabilities. WordPress has been a dumpster fire of vulnerabilities. There are so many pointless requests for various things under /wp-admin/ in case my server was vulnerable.

Looking at field #9 as shown above, the return code, the failed attacks based on typical WordPress pages would get a 404 or Not Found, because those pages don't exist on my server.

There will be some requests for pages I have moved and for which I have set up redirects. Those will get a 301 or Permanently Moved return code.

Friedrich Nietzsche, from https://commons.wikimedia.org/wiki/File:Nietzsche187a.jpg

In Beyond Good and Evil Friedrich Nietzsche wrote: "Beware that, when fighting monsters, you yourself do not become a monster... for when you gaze long into the abyss, the abyss gazes also into you." That's very relevant to cybersecurity work.

In this project we're interested in ranking the popularity of existing web pages. In a later article, I'll show you how to find attacks by analyzing rather different log entries.

We can make our report in this project more meaningful by limiting our analysis to legitimate, successful web requests. Those are the ones that do a GET request and receive a 200 return code.

But meanwhile, to look briefly into the abyss...

About 4.8% of the requests over that year use HEAD. Those include legitimate requests by indexing bots which are just checking metadata to see whether a page still exists with no changes, without downloading the content itself. They aren't attacks, but they're periodic checks. We aren't interested in these.

A little over 0.5% of the requests use POST. Those are clearly nothing but hacking attempts since my pages have no forms to fill out and submit.

Then it devolves further into OPTIONS, DEBUG, CONNECT, and various other strings only supported by certain known-to-be-buggy web server implementations.

I'll come back in some later article to wade into that morass. But for now, we can clean up our report by basing it on just those requests containing GET followed by 200, both as isolated words. The regular expression '\<GET\>.*\<200\>' matches those requests. \< means "start of a word", \> means "end of a word", and .* means "zero or more of any character".

I could do this:

$ awk '{print $7}' access |
		grep '\<GET\>.*\<200\>' |
		[... more pipeline stages ...]
		

However, let's save a pipeline stage and thus a process by doing both the line (or record) selection and the field selection in the awk command that already starts the pipeline:

$ awk '/\<GET\>.*\<200\>/ {print $7}' access |
			egrep -v '^/$|/robots.txt|/ads.txt|denied.png' |
			egrep -v '\.jpg|\.jpeg|\.gif|\.ico|\.css|\.js|\.ttf' |
			sed 's@$@/@' |
			sed 's@.html/@.html@' |
			sed 's@//@/@g' |
			sed 's@/Index.html@/@' |
			sort | uniq -c |
			sort -nr |
			head -20 |
			cat -n
     1    51368 /open-source/samba-active-directory/freebsd-raspberry-pi.html
     2    35431 /open-source/performance-tuning/tcp.html
     3    35135 /technical/dsl/
     4    29913 /open-source/performance-tuning/nfs.html
     5    21496 /open-source/performance-tuning/disks.html
     6    18505 /travel/usa/new-york-mcgees/
     7    18019 /open-source/pdf-not-authorized.html
     8    16308 /open-source/performance-tuning/ethernet.html
     9    16267 /open-source/openbsd-qemu-windows-howto.html
    10    15698 /open-source/sendmail-ssl.html
    11    14840 /turkish/
    12    14250 /open-source/
    13    13979 /cybersecurity/isc2-ccsp/
    14    13772 /cybersecurity/
    15    12649 /open-source/tar-and-ssh.html
    16    12583 /open-source/raspberry-pi/sdr-getting-started.html
    17    12354 /open-source/grub-vga-modes.html
    18    12184 /travel/
    19    11762 /travel/france/school-lunch-menus/
    20    11682 /turkish/verbs.html

Good Enough For Now!

I will come back later and look at some of the oddities tried by hacking bots in fields #6 and #8. For now, let's call this good enough to turn into a shell script.

Bring that last command back up in your command history, then use your mouse to highlight it and right-click or press Ctrl-Shift-C to copy it. Or, highlight what's shown on this page in your browser and press Ctrl-C.

Then start an editor session for a new file, perhaps web-analysis, and paste the command into it with Ctrl-Shift-V.

Now, add the initial line #!/bin/bash, clean up the indentation, change the access file name as needed, and add any comments. Experiment with inserting a line of column headers with echo. Something like:

#!/bin/bash

# Analyze the Nginx or Apache web server log to report
# the rank, count, and URL of the top 20 pages.

echo '  Rank    Loads URL'
awk '/\<GET\>.*\<200\>/ {print $7}' access |
	egrep -v '^/$|/robots.txt|/ads.txt|denied.png' |
	egrep -v '\.jpg|\.jpeg|\.gif|\.ico|\.css|\.js|\.ttf' |
	sed 's@$@/@' |
	sed 's@.html/@.html@' |
	sed 's@//@/@g' |
	sed 's@/Index.html@/@' |
	sort | uniq -c |
	sort -nr |
	head -20 |
	cat -n

Make it executable, move it to your private executable directory in your PATH, and try it. Something like:

$ chmod +x web-analysis
$ mv web-analysis ~/bin/
$ web-analysis
  Rank    Loads URL
     1    51368 /open-source/samba-active-directory/freebsd-raspberry-pi.html
     2    35431 /open-source/performance-tuning/tcp.html
     3    35135 /technical/dsl/
     4    29913 /open-source/performance-tuning/nfs.html
     5    21496 /open-source/performance-tuning/disks.html
     6    18505 /travel/usa/new-york-mcgees/
     7    18019 /open-source/pdf-not-authorized.html
     8    16308 /open-source/performance-tuning/ethernet.html
     9    16267 /open-source/openbsd-qemu-windows-howto.html
    10    15698 /open-source/sendmail-ssl.html
    11    14840 /turkish/
    12    14250 /open-source/
    13    13979 /cybersecurity/isc2-ccsp/
    14    13772 /cybersecurity/
    15    12649 /open-source/tar-and-ssh.html
    16    12583 /open-source/raspberry-pi/sdr-getting-started.html
    17    12354 /open-source/grub-vga-modes.html
    18    12184 /travel/
    19    11762 /travel/france/school-lunch-menus/
    20    11682 /turkish/verbs.html
[... except, of course, the 20 most popular on YOUR site! ...]

Now you have a working shell script that generates a useful report.

Next:

What's the Point of Asymmetric Encryption?
Asymmetric encryption is often described as useful for "small messages", but that's misleading. They're absolutely vital in cryptographic protocols such as key agreement and authentication.

Latest:

Routing Through Starlink
By the mid 2020s, Internet connections in remote areas frequently used Starlink, the satellite system owned by the pro-fascist eugenicist Elon Musk. Let's see how Starlink works.

Previous:

How to Start Writing Scripts
Someone asked me, "How can I learn scripting?" It's easy to get started! Bash or Python or whatever!

Why the Command Line Rules
Many tasks are much easier to accomplish from the command line. Some tasks can't be done any other way.

Which Programming Language Should I Learn?
Someone asked me, "Which programming language should I learn?" It depends on what you want to do.

Cybersecurity Certifications are Unfair
Cybersecurity certifications are not a fair test of knowledge, let alone skill. They have an illusion of relevance and meaning, making more money for the certifying companies.