I usually rely on a quick wc -l (line count) as a sanity check on the number of rows in a file (e.g., dataset metadata). Yesterday, I lost an unhealthy amount of time debugging a weird discrepancy between the wc -l output on a CSV file and the number of rows after reading it into a pandas DataFrame. And I learned something new in the process.
I guess I was one of 10,000 lucky people yesterday.
The discrepancy
I was working with a large CSV file (~680k lines), but:
- wc -l reported 680k lines. Opening the file in a text editor confirmed that it had 680k lines.
- pandas.read_csv() only loaded ~530k rows.
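In code, the mismatch looked roughly like this (the path is a placeholder; wc -l counts physical lines, while pandas reports parsed rows):

import pandas as pd

path = "path/to/file.csv"  # placeholder

# What wc -l counts: physical lines in the file.
with open(path, "rb") as f:
    physical_lines = sum(1 for _ in f)

# What pandas parses: logical rows (excluding the header).
logical_rows = len(pd.read_csv(path))

print(physical_lines, logical_rows)  # for me: ~680k vs ~530k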
To make things worse, the file was large enough that I couldn’t simply load it into a spreadsheet application to inspect it manually: {LibreOffice, OnlyOffice, Excel Online} all crashed trying to open a file that big.
The culprit: “dirty” newlines
After a lot of debugging, I finally realized that the issue wasn’t missing data but rather “dirty” formatting. The CSV file contained fields enclosed in quotes that had accidental newlines (\n) inside them. Something like this:
ISIC_2592579,"Title: ISIC 2024 Basel
Source: University Hospital of Basel
License: Creative Commons Attribution-NonCommercial (CC BY-NC)",CC-BY-NC,,65,lower extremity,,benign,4.10,,,Benign,,,,,,,,,TBP tile: close-up,,,,,,,,,IP_0941046,,119,119,,male,3D: XP
ISIC_2178158,"Title: ISIC 2024 Basel
Source: University Hospital of Basel
License: Creative Commons Attribution-NonCommercial (CC BY-NC)",CC-BY-NC,,65,posterior torso,,benign,7.60,,,Benign,,,,,,,,,TBP tile: close-up,IL_2433748,,,,,,,,IP_0941046,,117,117,,male,3D: XP
This is an actual snippet from the file I was working with (ISIC Archive metadata). What looks like 6 lines here is actually only 2 rows of data: each row is split across 3 lines by the newlines hidden inside its quoted fields.
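A quick way to confirm this diagnosis is to let Python’s standard csv module do the counting: unlike wc -l, it respects quoted fields, so it yields logical rows, and you can check which of them contain embedded newlines. A minimal sketch, with a placeholder path:

import csv

path = "path/to/file.csv"  # placeholder

logical_rows = 0
dirty_rows = 0
with open(path, newline="") as f:
    for row in csv.reader(f):
        logical_rows += 1
        # Fields parsed from quoted values keep their embedded newlines.
        if any("\n" in field for field in row):
            dirty_rows += 1

print(logical_rows, dirty_rows)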
The fix
Apparently, the default C-based parsing engine in pandas is fast, but it can struggle with such complex(?) edge cases. The suggested solution is to use the Python parsing engine instead, which is “more feature-complete”.
The C and pyarrow engines are faster, while the python engine is currently more feature-complete. Multithreading is currently only supported by the pyarrow engine. [pandas.read_csv documentation]
I also needed to use the quotechar parameter to specify that double quotes (") enclose fields.
import pandas as pd

# The default engine might miss rows with "dirty" newlines in fields.
# df = pd.read_csv("path/to/file.csv", sep=",", header="infer")

# The fix: use the Python engine and specify the quotechar.
df = pd.read_csv(
    "path/to/file.csv",
    sep=",",
    header="infer",
    engine="python",
    quotechar='"',
)
Although the documentation implies that the Python engine might be slower, I didn’t notice any difference for my CSV file.
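If you’re curious whether the engine switch costs anything on your own data, a rough timing comparison is easy to sketch (placeholder path; the numbers will obviously depend on the file):

import time

import pandas as pd

path = "path/to/file.csv"  # placeholder

for engine in ("c", "python"):
    start = time.perf_counter()
    df = pd.read_csv(path, engine=engine, quotechar='"')
    print(f"{engine}: {len(df)} rows in {time.perf_counter() - start:.2f}s")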