Python determine file line count
I have worked on several projects where counting lines was the core function of the software, and where processing a huge number of files as fast as possible was of paramount importance. The main bottleneck is I/O access. The second potential bottleneck is memory management: the more you load at once, the faster you can process, but this bottleneck is negligible compared to the first.
Hence, there are 3 major ways to reduce the processing time of a line count function, apart from tiny optimizations such as disabling garbage collection and other micro-managing tricks. By far, this is how you can get the biggest speed boosts. If you need variable-length lines, you can always pad the shorter ones to a fixed length. This way, you can calculate the number of lines instantly from the total file size, which is much faster to access. Often, the best solution to a problem is to pre-process it so that it better fits your end purpose.
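As a minimal sketch of the fixed-length idea, assuming every line has been padded to the same number of bytes (newline included): the function names and the record_length parameter below are hypothetical, chosen for illustration only.

```python
import os


def pad_line(text, record_length, fill=b" "):
    """Encode text and pad it so that line plus newline is exactly record_length bytes."""
    encoded = text.encode("utf-8")
    padding = record_length - 1 - len(encoded)  # 1 byte is reserved for "\n"
    if padding < 0:
        raise ValueError("line is longer than the record length")
    return encoded + fill * padding + b"\n"


def count_fixed_length_lines(path, record_length):
    """Derive the line count from the file size alone, without reading the file."""
    size = os.path.getsize(path)
    if size % record_length != 0:
        raise ValueError("file size is not a multiple of the record length")
    return size // record_length
```

The count costs a single metadata lookup, however large the file is. The price is paid at write time, and in disk space if line lengths vary a lot.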
Then, you can expect a speed-up roughly proportional to the number of disks you have. If buying multiple disks is not an option for you, then parallelization likely won't help, unless your disk has multiple read heads like some professional-grade disks. Even then, the disk's internal cache memory and PCB circuitry will likely be a bottleneck and prevent you from fully using all heads in parallel. You would also have to write code specific to the hard drive you use, because you need to know the exact cluster mapping in order to store your files on clusters under different heads and read them back with different heads afterwards.
Indeed, it's commonly known that sequential reading is almost always faster than random reading, and parallelization on a single disk will perform more like random reading than sequential reading (you can test your hard drive's speed in both respects using CrystalDiskMark, for example).
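A rough sketch of counting several files in parallel, under the assumption that each path lives on a different physical disk. The function names are hypothetical. On a single disk, per the reasoning above, this is likely to be no faster than a plain loop.

```python
from concurrent.futures import ThreadPoolExecutor


def count_lines_in_file(path):
    """Count lines in one file by iterating over it in binary mode."""
    with open(path, "rb") as f:
        return sum(1 for _ in f)


def count_lines_across_disks(paths, max_workers=None):
    """Count lines of several files concurrently; returns {path: line_count}.

    Assumption: the paths are spread over separate physical disks, so the
    reads do not compete for the same device.
    """
    paths = list(paths)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        counts = list(pool.map(count_lines_in_file, paths))
    return dict(zip(paths, counts))
```

Threads are used here rather than processes because the work is I/O-bound: a blocking file read releases the interpreter lock, so other threads can proceed while one waits on its disk.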
If none of those are an option, then you can only rely on micro-managing tricks to improve the speed of your line counting function by a few percent, but don't expect anything really significant. Rather, you can expect the time you spend tweaking to be disproportionate to the speed improvement you'll see.
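For reference, one common trick of this kind is to read the file in large binary chunks and count newline bytes, which avoids building a Python object for every line. This is a sketch only; the function name and the 1 MiB chunk size are arbitrary choices, not something the answer prescribes.

```python
def count_newlines_chunked(path, chunk_size=1024 * 1024):
    """Count b"\\n" bytes by reading the file in fixed-size binary chunks."""
    count = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:  # empty bytes object means end of file
                break
            count += chunk.count(b"\n")
    return count
```

Note that this counts newline characters, so a final line without a trailing newline is not included in the result.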
How to get line count of a large file cheaply in Python?
Do you need an exact line count, or will an approximation suffice?
Legend: I bet pico is thinking: get the file size (with seek(0, 2) or equivalent) and divide by the approximate line length. You could read a few lines at the beginning to guess the average line length.
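The approximation suggested in that comment could look roughly like the following sketch. The function name and the sample_lines default are hypothetical, and os.path.getsize is used here in place of seek(0, 2) to obtain the file size. The estimate is only as good as the assumption that the first lines are representative of the rest.

```python
import os


def estimate_line_count(path, sample_lines=1000):
    """Estimate the line count from the file size and the average length of the first lines."""
    size = os.path.getsize(path)
    if size == 0:
        return 0
    sampled = 0
    sampled_bytes = 0
    with open(path, "rb") as f:
        for line in f:
            sampled += 1
            sampled_bytes += len(line)
            if sampled >= sample_lines:
                break
    if sampled_bytes >= size:
        # The sample covered the whole file, so the count is exact.
        return sampled
    # Scale the sampled line count up to the full file size.
    return round(size * sampled / sampled_bytes)
```

Only the sampled lines are read, so the cost does not grow with file size.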
IanMackinnon: Works for empty files, but you have to initialize i to 0 before the for-loop.
Is it required to 'close()'? I think we cannot use 'with open()' in this short statement, right?
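Both of these comments can be illustrated with one sketch: an enumerate-based count whose counter is initialized before the loop, inside a with-block that takes care of closing the file. The function name is hypothetical and this is not necessarily the code the commenters were looking at.

```python
def count_lines_enumerate(path):
    """Count lines with enumerate; returns 0 for an empty file."""
    count = 0  # stays 0 if the file is empty and the loop body never runs
    with open(path, "rb") as f:  # the with-block closes the file on exit
        for count, _ in enumerate(f, start=1):
            pass
    return count
```

Without the initialization, an empty file would leave the loop variable unbound and the return statement would raise an error.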
It's not any faster than the other solutions, see stackoverflow.
You can't get any better than that.
Exactly, even wc is reading through the file, but in C, and it's probably pretty optimized.
Tomalak: That's a red herring. While Python and wc might be issuing the same syscalls, Python has opcode dispatch overhead that wc doesn't have.
You can approximate a line count by sampling. It can be thousands of times faster.
There must be something wrong in my code, but I don't know fileinput well enough to find it. Or is it normal behaviour for fileinput?
If so, for a very big file, the size of the file may exceed the capacity of the RAM. I am not sure of this point: I tried to test with a file of 1.5 GB, but it took rather long and I dropped this point for the moment. If this point is right, it constitutes an argument for using the other solution with enumerate.
Messing around with a similar problem recently, I came up with this class-based solution.
Could you maybe elaborate on your answer a bit more?
Actually I did, but I'm not sure why it was left out. Added again, thanks for reminding.
Here is how I am implementing these options in my code.
This does not provide an answer to the question. To critique or request clarification from an author, leave a comment below their post - you can always comment on your own posts, and once you have sufficient reputation you will be able to comment on any post.
Prune - thanks for the comment and I have included a code snippet from my learning to add clarity to what I was suggesting. Please note that "above" has no context among answers.
Answer votes change and answers can be sorted in a number of different ways. Better to link to the answer to which you are referring.