Here is the thing he doesn't seem to understand: all of us who are sysadmins absolutely understand the value of placing large, complex log files into a database so that we can query them efficiently. We also understand why having multi-terabyte text log files is not useful.
But what we find totally unacceptable is log files being shoved into binary repositories as the primary storage location. Because, you know what, everyone has their own idea of what that primary storage location should be, and those ideas are mostly incompatible with each other.
The nice thing about text - for the last 40 years it's been universally readable, and will be for the next 40 years. Many of these binary repositories will be unreadable within a short period, and will be immediately unreadable to those people who don't know the magic tool to open them.
Uh, I don't know what world you live in but I'd like the address because mine sucks in comparison.
Text logs are definitely not a "universal format". Easily accessible, sure. Human readable most of the time? Okay. Universal? Ten times nope.
I'll give you an example: uwsgi logs don't even have timestamps, and contain whatever crap the program's stdout outputs, so you often end up with three different types of your "universal format" in there. I'm not giving this example because it's contrived, but because I was dealing with it the very moment I read your comment.
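If you're stuck with timestamp-less stdout like that, one stopgap (a sketch, not uwsgi-specific; uwsgi also has its own logging options) is to stamp each line as it's captured:

```shell
# Prepend a UTC ISO-8601 timestamp to every line read on stdin, so the
# resulting file at least sorts and filters by time. Slow (one date(1)
# call per line) but portable; a stand-in for proper logger config.
printf 'worker spawned\nGET /\n' |
while IFS= read -r line; do
  printf '%s %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$line"
done
```

The input lines here are invented; in practice you'd pipe the daemon's stdout through the loop (or through a real tool like `ts` from moreutils, if available).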
Originally, you had a problem - the data wasn't formatted in a manner that you could parse cleanly.
Now, you have a new problem - not only is the data not formatted properly, it's now in some opaque binary file.
Saying that there are poorly formatted text files isn't a hit against text files, it's a hit against poor formatting. The exact same problem exists if the file is in binary form, and not formatted properly.
I think it actually solves most of the problems text logs have that binary don't (inability to easily present structured data, etc.) yet keeps the advantages of a text log (human readable, resistant to file corruption, future-proof).
Despite that, I can be pretty sure when I walk in to a foreign system there will be nginx logs, just where I expect them, almost certainly in the format I'm used to. And even if the format differs, it's not much of a problem. Binary logs, big problem.
The way I read his article, he's not really opposed to additionally keeping your logs around as text. But you make a good point about using text as the primary storage location, since you can always easily feed it to some binary system for further analysis.
Would the best practice then be to keep your logs around as (compressed) text, but additionally feed it to your log analysis system of choice for greater querying capabilities?
But don't cripple me by shoving your primary log files into binary format so I can't quickly pull data out of them with awk/grep/sed when I need to quickly diagnose a local issue.
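That kind of quick local diagnosis looks like this; the log lines below are made up, but the shape is ordinary combined-format access logging:

```shell
# Count HTTP 500s per request path straight out of a text access log.
# In combined log format, $9 is the status code and $7 the path.
printf '%s\n' \
  '203.0.113.7 - - [10/Oct/2015:13:55:36 +0000] "GET /a HTTP/1.1" 500 0' \
  '203.0.113.7 - - [10/Oct/2015:13:55:37 +0000] "GET /a HTTP/1.1" 500 0' \
  '203.0.113.7 - - [10/Oct/2015:13:55:38 +0000] "GET /b HTTP/1.1" 200 0' \
  > access.log
awk '$9 == 500 { n[$7]++ } END { for (p in n) print n[p], p }' access.log | sort -rn
# → 2 /a
```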
Our product stores all the logs raw in flat files on the file system; we don't use databases for keeping the logs, which allows you to scale massively (the ingestion limit is that of the correlation engine and disk bandwidth). You then just need an efficient search crawler and good use of metadata so search performance is good too.
The issue is that if you ever need to pull the logs for court and you have messed with them (i.e. normalized them and stuffed them into a DB), then your chain of custody is broken.
Best of both worlds means parsed-out normalisation, so I don't have to remember that Juniper calls the source IP srcIP and Cisco calls it SourceIP, but with the original logs under the covers for grepping if you need it.
Then a punch in the face is a universal form of communication, too. Also, EBCDIC is the only encoding the future will recognize!
Should I submit patches to jawstats so that it'll support google-log-format 1.0 beta, or the newer Amazon Cloud Storage 5 format? Or both? Or just go with the older Microsoft Log Storage Format? Or wait until Gruber releases Fireball Format? Has he decided yet whether to store dates as little-endian Unix 64 bit int timestamps, or is he still thinking about going with the Visual FoxPro date format, y'know, where the first 4 bytes are a 32-bit little-endian integer representation of the Julian date (so Oct. 15, 1582 = 2299161) and the last 4 bytes are the little-endian integer time of day represented as milliseconds since midnight? (True story, I had to figure that one out once. Without documentation.)
Should I write a new plugin for Sublime Text to handle the binary log formats? Or write something that will read the binary storage format and spit out text? Or is that too inefficient? Or should I give up on reading logs in a text form at all and write a GUI for it (maybe in Visual Basic)?
Do you know when I should expect suexec to start writing the same binary log format as Apache, or should I give up waiting on that and just write a daemon to read the suexec binary logs and translate them to the Apache binary logs?
Should I take the time to write a natural language parsing search engine for my custom binary log format? Do you think that's worth the time investment? I would really like to be able to search for common misspellings when users ask about a missing email, you know, like "/[^\s]+@domain.com/" does now.
I look forward to your guidance. I've been eagerly awaiting the day that I can have an urgent situation on my hands and I can dig through server logs with all of the ease and convenience of the Windows system logs.
I doubt my Linux (including webOS & Android), FreeBSD, and OS X boxes are going to settle on a single binary format in the next couple of decades or even a single API & toolset. In your brave new world the very first thing I'm going to need to do if I have to combine logs across them is to extract data from at least three formats and the most convenient format is often going to be text - i.e. right back where we started, but with extra work for each OS. More likely you'll get a mix of things using the system APIs, custom binary formats, custom text formats, and syslog. Adding more steps to get at the same data doesn't help.
More importantly, binary logs are unreliable when you're dealing with a system that's completely trashed. You can often get usable text logs off a disk that's throwing I/O errors every few dozen bytes or even from a corrupted raw disk image. They may not be cryptographically "sealed", but I'd rather have them than an error message about the binary format being corrupt. That should be an implementation detail, but I haven't seen much interest from the binary logs camp in making the file formats resilient.
And that's the nub of it: text logs are for when you may have many varied, complex reader use-cases, and you don't understand all those cases well enough yet to lock them down forever, and you have a thousand excellent tools at your disposal that you would like to be able to continue to use.
Recent log spelunking for me included: cat log.? | grep fail | sed 's/^.worker_id$//g' | awk '{ print $5, $4 }' | sort -n -r | sed 30q
There's no analogue in any binary logging system I've ever found.
This is really the important point here. For small systems, grep works fine. The number of people administering small systems is much greater than the number of people administering large systems. The systemd controversy has caused people to fear that change they don't want will be imposed on them and their objections insultingly dismissed: a consequence of incredibly bad social "change management" by its proponents.
They are therefore deploying pre-emptive rhetorical covering fire against the day when greppable logs will be removed from the popular Linux distributions. Plain text is the lingua franca; binary formats bind you to their tools with a particular set of design choices, bugs and disadvantages. My adhoc log grepping workflow has a different set of bugs and disadvantages, but they're mine.
That's really the key for me. My go-to example is searching for IP addresses across different logs. If I have just one machine, and I want to find an IP in the SSH, web and mail logs, I shouldn't have to use multiple tools to get that data.
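With text, one tool is exactly what you get. A sketch, with made-up file names and a documentation-range IP:

```shell
# Same grep across three different services' logs: -F for a literal
# match (the dots in an IP are regex metacharacters), -H to prefix each
# hit with the log file it came from.
printf 'sshd: Accepted password for root from 203.0.113.7\n' > auth.log
printf '203.0.113.7 - - "GET / HTTP/1.1" 200\n'              > access.log
printf 'connect from unknown[203.0.113.7]\n'                 > mail.log
grep -FH '203.0.113.7' auth.log access.log mail.log
```

One command, three unrelated formats, and the -H prefix tells you which subsystem saw the address.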
Logstash, Splunk and other tools store stuff in binary, as he writes, and that's perfectly valid, the only solution in fact. But I don't want to be forced to run a centralized logging server if I have just one or two servers.
If it's okay to claim that binary logging is the only way to go, because you have hundreds of servers, it's also okay to claim that text files are the only solution, because I just have one server.
Finally, aren't those binary logs (the ones that come from individual services) going to be transformed into text when I transmit them to something like Splunk, only to be transformed back into some internal binary format when received? It seems we could save a transformation in that process.
(If you're shunting _all_ of your log data off at that scale, you're crazy, and you'll melt your switches if you aren't careful.)
The name of the game is to think of the problems you're solving and how they relate to the business bottom line. No more, no less. What's most troubling is that we've turned this exercise into an emotional one, not one with any sort of scientific perspective.
I can personally say with conviction that I'd like to sit down and actually collect data on, e.g., how many instructions it takes to store logs to disk in plain text versus a binary format, how many it takes to retrieve logs from disk in both situations, and how much search latency I incur when trying to retrieve said logs from disk in the same. At scale, which is where most of my attention lies these days, that's the kind of thing that matters because those effects get amplified automatically—often to operators' and capacity planners' horrors—by the number of machines you have.
If you're dealing with smaller systems, it won't matter as much, but at that point, you're probably dealing with the other side of this, which is having information on how many requests you get for historical log data and what sort of criteria were used in that search. If you're getting requests less frequently than, say, once per quarter, it likely wouldn't be worth your time to invest in what Mr. Nagy is evangelizing.
tl;dr: Continue using your ad hoc grep-fu, but be mindful of how much time it takes you to get the data you're looking for. That alone will be your decision criterion for adopting something like this.
But even still - I like to have the text files as journals of original entry - so I can occasionally do a tail -f incoming.log | egrep -i "somedevice".
And having the original files in text format is zero impediment to getting them into handy binary database form.
Do you have any evidence for this statement? Because it sounds all kinds of wrong.
There are a lot of hobbyists, a vast number of people with a Linux box in the corner of the office or a few cloud instances, a smaller number of people running IT for multinationals and one or two people who have whole datacenters to themselves. The larger the system, the lower the computer/human ratio.
If I read it correctly there are about 250 million active sites (roughly). It seems unlikely that they are all massive corporate sites.
As an aside, the idea that systemd is a good thing is hilarious to me, not least because it is so brash about making an important change to a huge chunk of the system. Yes, the bugs will eventually get ironed out, but in the meantime? Count me out! I have work to do and am not interested in being a free tester for Red Hat on my live systems.
The downside to this is that now you don't have a set of global tools which can easily operate across these separate datasets without writing code against an API. I hear PowerShell tackles this; I don't know how well. The general principle, though, harms velocity at getting something simple done, to the benefit of being able to do extremely complex things more easily. See Event Viewer for a good example of this.
Logs don't exist in isolation. I want to use generally global tooling to access and manipulate everything. I don't want to have to write (non-shell) code, recall a logging-specific API or to have to take the extra step of converting my logs back to the text domain in order to manipulate data from them against text files I have exported from elsewhere for a one-off job. An example might be if I have a bunch of mbox files and need to process them against log files that have message IDs in them. I could have an API to read the emails, and an API to read the logs, or I could just use textutils because I know an exact, validating regexp is not necessary and log format injection would have no consequence in this particular task.
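The mbox-versus-logs job from that paragraph really is a one-liner in the text domain. Everything below (file names, formats) is invented for illustration:

```shell
# Extract Message-IDs from a mail dump, then find the log lines that
# mention any of them. grep -o emits each <...> token on its own line;
# grep -F -f - treats those tokens as literal patterns read from stdin.
printf 'Message-ID: <1@example>\nMessage-ID: <2@example>\n' > mbox.txt
printf 'delivered <1@example>\nbounced <3@example>\n'       > mail.log
grep -o '<[^>]*>' mbox.txt | grep -F -f - mail.log
```

As the comment says, an exact, validating parser would be overkill here; the sloppy token match is good enough for a one-off job.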
I do see the benefits of having logs be better structured data, but I also see downsides of taking plain text logs away. Claiming that there are no downsides, and therefore no trade-off to be made, is futile. It's like playing whack-a-mole, because nobody is capable of covering every single use case.
Actually any non-UNIX OS clone out there, including mainframe and embedded OSes.
If you run any sort of distributed system, this is vital. And while that counts as binary logs, I would argue that on the local boxes it should stay text.
I would agree: if you are running any sort of complex queries on your data, go to logstash and do it there - it's much nicer than regexes.
If, on the other hand, you just want to see how a development environment is getting on, or to troubleshoot a known bad component, tailing into grep (or just tailing, depending on the verbosity of your logs) is fine.
I don't have to remember some weird incantation to see the local logs, worry about corruption etc.
One problem I will point out with the setup described is that syslog-ng can be blocking. If the user is disconnected from the central logstash and their local one dies, then as soon as the FIFO queue in syslog-ng fills, good luck writing to /dev/log, which means things like 'sudo' and 'login' have... issues.
Instead, if you have text files being written out, and something like beaver collecting them and sending them to logstash, you have the best of both worlds.
For administering Unix like systems, the ability to use a variety of tools to process streams of text is an advantage and valuable capability.
That said, your needs do change when you're talking about managing 10 vs 10,000 vs 100,000 hosts. I think what you're really seeing here is a movement to "industrialize" the operations of these systems and push capabilities from paid management tools into the OS.
Freeform text logs usually contain more detail as to what exactly happened.
For me, the main reason to access plaintext logs is that they seldom fail, and they are simple. They are a bore to analyse, but they CAN be analysed.
Anyway, this discussion only makes sense if the task at hand involves heavy log analysis, don't complicate what is simple when it isn't needed.
As for the razor analogy, you're right, however I wouldn't change my beard to be "razor compatible only". In the software world I'd say it is still not uncommon to find yourself "stranded in a desert island".
* log file corruption - text parsing would still work,
* tooling gets deleted - there's a million ways you can still render plain text even when you've lost half your POSIX/GNU userland,
* network connection problems, breaking push to a centralised database - local text copies would still be readable.
In his previous blog post he commented that there's no point running both a local text version and a binary version, but since the entirety of his rant is really about tooling rather than log file format, I'm yet to see a convincing argument against running the two paradigms in parallel. So this really is dependent on the file format of your log data, rather than an inherent difference between text and binary logging.
You do have a bigger concern, but one that needs to be addressed by consulting the log files.
I fully accept that most of the situations I gave as examples are rare fringe cases, but log files are the go-to when all else fails, and thus there needs to be a copy that's readable if and when everything else does fail.
The more likely situation would be that the logs are stored on a shared storage server, and the machine you are using to look at the logs doesn't have the logging system installed.
That’s a straw man. If you’re grepping logs, you don’t need a regular expression that matches only valid dates because you can assume that the timestamps on the log records are valid dates. But I suppose
2013-12-(2[4-9]|3.)|2014-..-..|2015-0([123]-..|4-(0.|1[01]))
doesn’t look so bad. The whole thing is similarly exaggerated.
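For what it's worth, when timestamps are ISO-8601 they sort lexically, so the range check needs no regex at all. A sketch, assuming the date is the first whitespace-delimited field:

```shell
# Plain string comparison on ISO dates covers the 2013-12-24 .. 2015-04-11
# range from the regex above, with no date arithmetic and no regex.
printf '%s\n' '2013-11-01 too old' '2014-06-01 in range' '2015-05-01 too new' > app.log
awk '$1 >= "2013-12-24" && $1 <= "2015-04-11"' app.log
# → 2014-06-01 in range
```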
That's the thing about having simple text log files - the cognitive load required to pull data out of them, often into a format that can then be manipulated by another tool (awk, being one of the more well known), is so low that you can perform them without a context switch.
If you have a problem, you can reach into the log files, pull out the data you need, possibly massage/sum/count particular records with awk, all without missing a beat.
This is particularly important for sysadmins who may be managing dozens of different applications and subsystems. Text files pull all of them together.
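A typical no-context-switch massage looks like this, over an invented three-field transfer log:

```shell
# Sum the bytes column of a transfer log and count records, in one awk pass.
printf '%s\n' 'GET /a 512' 'GET /b 2048' 'GET /a 100' > xfer.log
awk '{ bytes += $3 } END { print NR " requests, " bytes " bytes" }' xfer.log
# → 3 requests, 2660 bytes
```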
But, and here is the most important thing that people need to realize - for scenarios in which complex searching is required, by all means move it into a binary format - that just makes sense if you really need to do so.
The argument isn't all text instead of binary, it is at least text and then use binary where it makes sense.
Even _if_ I agreed with your assumption[1], are you actually suggesting that
2013-12-(2[4-9]|3.)|2014-..-..|2015-0([123]-..|4-(0.|1[01]))
is a serious solution? I admit that it is shorter than the author's solution, _but it still proves his point_. And then what about multi-line log lines? `grep` can't tell where the next line is; sure, I can -A, but there's no number I can plug in that's going to just work: I need to guess, and if I get a truncated result or too much output, adjust. Worse, I get too much output _and_ a truncated record where I need it…
log-cat --from 2013-12-24 --to 2015-04-11 | grep <further processing>
[1] most log file formats I've run across do not guarantee the date to appear in a given location.

The example with the timestamps is also strange. No matter how you store the timestamps, parsing a humanly reasonable query like "give me 10 hours starting from last Friday 2am" into an actual filter is a complex problem. The problem is complex no matter how you store your timestamp. You can choose to do the complexity up front and create complex index structures. You can choose to have complex algorithms to parse simple timestamps in binary or text form, or you can build complex regexes. But something needs to be complex, because the problem space is. Just being binary doesn't help you.
And that's really the point here, isn't it? Just being binary in itself is not an advantage. It doesn't even mean by itself that it will save disk space. But text in itself is an advantage, always, because text can be read by humans without help (and in some instances without any training or IT education), binary not.
Yesterday I was thinking there might be something to binary logs. Now I'm convinced there isn't. The only disadvantage of text seems to be that you lose disk space if you store it in the clear. But disk space isn't an issue in most situations (and in many situations where it is an issue, you might have resources and tools at hand to handle that as well). Binary is added complexity for no real advantage. Thanks for clearing that up.
When applied widely throughout a system, this leads to the internationalisation of log messages. Thus lessening the anglocentric bias in systems software. Windows has done this for years, at least with its own system logging (other applications can still put free-form text into the event logs if they wish.)
About what you put in the log message: You can also put different fields in a line of text. Not getting the advantage or trade-off here.
About the internationalisation: as non-English developers, we force all our systems that have internationalised logging to use English as the system language, so we have a common ground for the messages. Understanding the English message is hardly a burden. Log messages are event triggers, either in code or in a developer's/admin's mind. If I get a log message in my native language, I don't know which event it triggers, which actually makes it harder.
Really, I don't know any non-English person who considers log internationalisation a good thing. Fighting anglocentrism is a very anglocentric topic; outside of the UK/US it's a non-topic. We (non-English people) are happy that there is a language we can use to talk to each other, and we don't really care how it came to be so widely known.
And even if you don't speak English, I don't see the advantage of parsing \x03 instead of "Error:.*". Both are strings that have a meaning which is rather independent of its encoding.
journalctl --since="$(date -d'last friday 2am' '+%F %X')" --until="$(date -d'last friday 2am + 10 hours' '+%F %X')"
Now I'm no systemd apologist but maybe some of the hate towards systemd, journald and pals is unwarranted. If one gives these newer tools a chance, they actually have some nice features. Despite the Internet's opinion, seems like they were not actually created to make Linux users' lives difficult.
If binary logs turn out to be the wrong technological decision, I'm sure we'll figure that out and change over to text logs again. All it would take is a few key savvy users losing their logs to journald corruption and the change in the wider "ecosystem" would be made. But if all goes well... then what's to complain about? :-D
Text logs can be corrupted, text logs can be made unusable, you need a ton of domain-specific knowledge to even begin to make sense of text logs, etc.
But there's always a sense that, if you had the time, you could still personally extract meaning from them. With binary logs, you couldn't personally sit there and read them out line by line.
The issue is psychology, not pragmatism, and that's why text logs have been so sticky for so long.
Again, if the binary log is simply better-compressed data, well, we already have ways of compressing text as an afterthought. This really, fundamentally, seems to be a conflict in how people want to administer their systems and, for the most part, it seems to be about creating a "tool" that people then have to pay money for to better understand.
This guy is a first class idiot who knows enough to reformulate a decided issue into yet another troll article. "a database (which then goes and stores the data in a binary format)". How about a text file IS a database. It's encoded 1s and 0s in a universal format instead of the binary DB format which can be corrupted with the slightest modification or hardware failure.
* Journal is just terrible.
* Some text logs are perfectly fine.
* When you are in rescue mode, you want text logs.
* Some people use text logs as a way to compile metrics.
I think the most annoying thing for me about journald is that it forces you to do things their way. However, it's optional, and in CentOS 7 it's turned off, or it's beaten into such a shape that I haven't noticed it's there... (If that is the case, I've not really bothered to look; I poked about to see if logs still live in /var/log/, they did, and that was the end of it. Yes, I know that if this is the case, I've just undermined my case. Shhhhh.)
/var/log/messages for kernel oopses, auth for logins, and all the traditional systemy type things are good for text logs. Mainly because 99.9% of the time you get less than 10 lines a minute.
Being able to sed, grep, tee and pipe text files is brilliant on a slow connection with limited time/mental capacity, i.e. a rescue situation. I'm sure a multitude of stable tools will pop up to deal with a standardised binary log format, in about ten years.
The last point is the big kicker here. This is where, quite correctly, it's time to question the use of grep. Regex is terrible. It's a force/problem amplifier. If you get it correct, well done. Wrong? You might not even know.
Unless you don't have a choice, you need to make sure that your app emits metrics directly, or as close to directly as possible. Failing that, you need to use something like Elasticsearch. However, because you're getting the metrics as an afterthought, you have to do much more work to make sure they are correct. (Although forcing metrics into an app is often non-trivial.)
If you're starting from scratch, writing custom software, and think that log diving is a great way to collect metrics, you've failed.
If you are using off-the-shelf parts, it's worth spending the time interrogating the API to gather stats directly. You never know, collectd might have already done the hard work for you.
The basic argument he puts forth is this: text logs are a terrible way to interchange and store metrics. And yes, he is correct.
Just type journalctl and you should see the data there.
log-cat <binary-log-file>
… that just outputs it in text. Then you can attack the problem with whatever text-based tools you want. But to me, having a utility that could do things like get a range of log lines in sorted order, or grep on just the message, would be amazing. These are all things that proponents of grep will surely say "you can!" do with grep… but you can't.
The dates example was a good one. I'd much rather:
log-cat <bin-log> --from 2014-12-14 --to 2015-01-27
Also, my log files are not "sorted". They are, but they're sorted _per-process_, and I might have multiple instances of some daemon running (perhaps on this VM, perhaps across many VMs), and it's really useful to see their logs merged together[2]. For this, you need to understand the notion of where a record starts and ends, because you need to re-order whole records. (And log records' messages are _going_ to contain newlines. I'm not logging a backtrace on one line.) grep doesn't sort. |sort doesn't know enough about a text log to adequately sort, but:

$ log-cat logs/*.log --from 2014-12-14 --to 2015-01-27
<sorted output!>
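To be fair to the text side: for single-line records with a leading sortable timestamp, stock sort(1) can do the merge; it's multi-line records that sink it, which is exactly the complaint above. A sketch with invented per-process logs:

```shell
# Merge two already-sorted per-process logs by timestamp. -m merges
# without re-sorting; -k1,1 keys on the leading ISO timestamp field.
# This breaks as soon as one record spans multiple lines.
printf '%s\n' '2015-01-01T00:00:01Z a-first' '2015-01-01T00:00:03Z a-second' > proc-a.log
printf '%s\n' '2015-01-01T00:00:02Z b-first' > proc-b.log
sort -m -k1,1 proc-a.log proc-b.log
```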
Binary files offer the opportunity for structured data. It's really annoying to try to find all the 5xx's in a log when your grep matches the process ID, the line number, the time of day… I've seen some well-meaning attempts at JSON logs, s.t. each line is a JSON object[1]. (I've also seen it attempted where all that was available was a rudimentary format string, and the first " broke everything.)
Lastly, log files sometimes go into metrics (I don't really think this is a good idea, personally, but we need better libraries here too…). Is your log format even parseable? I've yet to run across one that had an unambiguous grammar: a newline in the middle of a log message, with the right text on the second line, can easily get picked up as a date, and suddenly it's a new record. Every log file "parser" I've seen was a heuristic matcher, and I've seen almost all of them make mistakes. With the simple "log-cat" above, you can instantly turn a binary log into a text one. The reverse, if possible at all, is likely to be a "best-effort" transformation.
[1]: the log writer is forbidden to output a newline inside the object. This doesn't diminish what you can output in JSON, and allows newline to be the record separator.
[2]: I get requests from mobile developers telling me that the server isn't acting correctly all the time. In order to debug the situation, I first need to _find_ their request in the log. I don't know what process on what VM handled their request, but I often have a _very_ narrow time-range that it occurred in.
Not that the log files on Linux are all entirely text-based anyway. The wtmp and btmp files are in a binary format, with specialised tools (last, lastb) for querying. I don't see anyone complaining about these and insisting that they be converted to a text-only format.