cardinal_pythonlib.spreadsheets


Original code copyright (C) 2009-2022 Rudolf Cardinal (rudolf@pobox.com).

This file is part of cardinal_pythonlib.

Licensed under the Apache License, Version 2.0 (the “License”); you may not use this file except in compliance with the License. You may obtain a copy of the License at

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an “AS IS” BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.


Manipulate spreadsheets.

Note:

  • openpyxl is dreadfully slow. Its results are picklable, but not sensibly so (e.g. generating a >500Mb picklefile from a 12Mb spreadsheet.

  • xlrd is much faster, but we can’t pickle its results.

class cardinal_pythonlib.spreadsheets.RowHolder(sheetholder: SheetHolder, row: int)[source]

Class to read from a single row of a spreadsheet.

The intended use is to create something like a dataclass, but one that knows its spreadsheet structure. Like this:

from cardinal_pythonlib.spreadsheets import RowHolder, SheetHolder

class ReferralSheetHolder(SheetHolder):
    SHEET_NAME = "Patient Referrals 2018-19"

    def gen_referral_rows(self) -> Iterable["ReferralRow"]:
        for rownum in self.gen_row_numbers_excluding_header_row():
            yield ReferralRow(self, rownum)

class ReferralRow(RowHolder):
    def __init__(self, sheetholder: SheetHolder, row: int) -> None:
        super().__init__(sheetholder, row)

        self.inc_next_col()  # column 0: query period; ignore
        self.patient_id = self.str_int_pp()
        self.referral_id_within_patient = self.int_pp()
        self.age_at_referral_int = self.int_pp()
        self.ethnicity = self.str_pp()
        self.gender = self.str_pp(check_header="Gender")

def import_referrals(book: Book) -> None:
    sheet = ReferralSheetHolder(book)
    for referral in sheet.gen_referral_rows():
        pass  # do something useful here
bool_pp(default: bool | None = None, true_values_lowercase: List[Any] | None = None, false_values_lowercase: List[Any] | None = None, unknown_values_lowercase: List[Any] | None = None, check_header: str | Sequence[str] | None = None) bool | None[source]

Reads a boolean value, then increments the “current” column. Optionally, checks that the header for this column is as expected.

date_pp(default: date | None = None, check_header: str | Sequence[str] | None = None) date | None[source]

Reads a date, then increments the “current” column. Optionally, checks that the header for this column is as expected.

datetime_pp(default: datetime | None = None, check_header: str | Sequence[str] | None = None) datetime | None[source]

Reads a datetime, then increments the “current” column. Optionally, checks that the header for this column is as expected.

decimal_pp(default: float | None = None, check_header: str | Sequence[str] | None = None, dp: int | None = None, rounding: str = 'ROUND_HALF_UP') Decimal | None[source]

Reads a Decimal, then increments the “current” column. Optionally, checks that the header for this column is as expected.

ensure_header(col: int, header: str | Sequence[str]) None[source]

Ensures the column has an appropriate heading value, or raises ValueError.

ensure_heading(col: int, header: str | Sequence[str]) None[source]

Synonym for ensure_header().

ensure_next_col_header(header: str | Sequence[str]) None[source]

Ensures the next column has an appropriate heading value, or raises ValueError.

ensure_next_col_heading(header: str | Sequence[str]) None[source]

Synonym for ensure_next_col_header().

float_pp(default: float | None = None, check_header: str | Sequence[str] | None = None) float | None[source]

Reads a float, then increments the “current” column. Optionally, checks that the header for this column is as expected.

inc_next_col() None[source]

Increments the next column to be read.

int_pp(default: int | None = None, check_header: str | Sequence[str] | None = None) int | None[source]

Reads an int, then increments the “current” column. Optionally, checks that the header for this column is as expected.

property next_col: int

Returns the column number (0-based) that will be used by the next automatic read operation.

none_pp(check_header: str | Sequence[str] | None = None) None[source]

Reads a null value, and ensures that it is null; then increments the “current” column. Optionally, checks that the header for this column is as expected.

property row_one_based: int

One-based row number.

property row_zero_based: int

Zero-based row number.

set_next_col(col: int) None[source]

Resets the next column to be read automatically.

str_int_pp(default: str | None = None, check_header: str | Sequence[str] | None = None) str | None[source]

Reads an integer as a string, then increments the “current” column. Optionally, checks that the header for this column is as expected.

str_nonfloat_pp(default: str | None = None, check_header: str | Sequence[str] | None = None) str | None[source]

Reads something that may be a string or numeric, but if it’s numeric, it’s integer (not float). Then increments the “current” column. Optionally, checks that the header for this column is as expected.

str_pp(default: str | None = None, check_header: str | Sequence[str] | None = None) str | None[source]

Reads a string, then increments the “current” column. Optionally, checks that the header for this column is as expected.

value_pp(check_header: str | Sequence[str] | None = None) Any[source]

Reads a value, then increments the “current” column. Optionally, checks that the header for this column is as expected.

class cardinal_pythonlib.spreadsheets.SheetHolder(book: Book | None = None, sheet_name: str | None = None, sheet_index: int | None = None, sheet: Sheet | None = None, header_row_zero_based: int | None = None, first_data_row_zero_based: int | None = None, null_values: List[Any] | None = None, bool_true_values_lowercase: List[Any] | None = None, bool_false_values_lowercase: List[Any] | None = None, bool_unknown_values_lowercase: List[Any] | None = None, debug_max_rows_per_sheet: int | None = None)[source]

Class to read from an Excel spreadsheet.

There are two ways to specify the sheet:

  1. Provide a workbook via book and…

    1. a sheet number, or

    2. a sheet name.

  2. Provide a worksheet directly via sheet.

You can specify the following as _init__ parameters or (via their capitalized versions) by subclassing:

  • sheet_name

  • header_row_zero_based

  • null_values

  • bool_true_values_lowercase

  • bool_false_values_lowercase

  • bool_unknown_values_lowercase

Initialization parameters take priority over subclassed values.

Parameters:
  • book – Workbook, from which a worksheet should be selected.

  • sheet_name – Name of a sheet to select from within book.

  • sheet_index – Index (zero-based) of a sheet to select from within book.

  • sheet – Worksheet, provided directly.

  • header_row_zero_based – Row number (zero-based) of the header row.

  • first_data_row_zero_based – Row number (zero-based) of the first row containing data.

  • null_values – Values to treat as null (blank) values, converted to Python None.

  • bool_true_values_lowercase – Values to treat, by default, as True in Boolean columns.

  • bool_false_values_lowercase – Values to treat, by default, as False in Boolean columns.

  • bool_unknown_values_lowercase – Values to treat, by default, as missing/unknown in Boolean columns.

  • debug_max_rows_per_sheet – Debugging option: the maximum number of data rows to process.

ensure_header(col: int, header: str | Sequence[str]) None[source]

Ensures that the header is correct for a specified column, or raise ValueError.

You can specify a single correct heading or a sequence (e.g. list) of them.

ensure_heading(col: int, heading: str | Sequence[str]) None[source]

Synonym for ensure_header().

gen_row_numbers_excluding_header_row(with_counter: bool = True) Iterable[int][source]

Generates row numbers.

xlrd uses 0-based numbering, so row 1 is the first beyond a header row.

gen_rows_excluding_header_row(with_counter: bool = True) Iterable[Sequence[Cell]][source]

Generates rows. AVOID; index-based access is faster.

xlrd uses 0-based numbering, so row 1 is the first beyond a header row.

property headers: List[str]

Returns all headings.

property headings: List[str]

Synonym for headers.

property n_data_rows: int

Total number of data rows (below any header row).

property n_rows: int

Total number of rows.

read_bool(row: int, col: int, default: bool | None = None, true_values_lowercase: List[Any] | None = None, false_values_lowercase: List[Any] | None = None, unknown_values_lowercase: List[Any] | None = None, check_header: str | Sequence[str] | None = None) bool | None[source]

Reads a boolean value.

read_date(row: int, col: int, default: date | None = None, check_header: str | Sequence[str] | None = None) date | None[source]

Reads a date from an Excel spreadsheet

https://stackoverflow.com/questions/32430679/how-to-read-dates-using-xlrd

read_datetime(row: int, col: int, default: datetime | None = None, check_header: str | Sequence[str] | None = None) datetime | None[source]

Reads a datetime from an Excel spreadsheet via xlrd.

https://stackoverflow.com/questions/32430679/how-to-read-dates-using-xlrd

read_decimal(row: int, col: int, default: Decimal | None = None, check_header: str | Sequence[str] | None = None, dp: int | None = None, rounding: str = 'ROUND_HALF_UP') Decimal | None[source]

Reads a Decimal from the spreadsheet.

If dp is not None, force the result to a specified number of decimal places, using the specified rounding method.

read_float(row: int, col: int, default: float | None = None, check_header: str | Sequence[str] | None = None) float | None[source]

Reads a float from the spreadsheet.

read_int(row: int, col: int, default: int | None = None, check_header: str | Sequence[str] | None = None) int | None[source]

Reads an integer from a spreadsheet.

read_none(row: int, col: int, check_header: str | Sequence[str] | None = None) None[source]

Reads a value, and checks that it is a none/null value

read_str(row: int, col: int, default: str | None = None, check_header: str | Sequence[str] | None = None) str | None[source]

Reads a string from a spreadsheet.

read_str_int(row: int, col: int, default: str | None = None, check_header: str | Sequence[str] | None = None) str | None[source]

Reads a string version of an integer. (This prevents e.g. “2” being read as a floating-point value of “2.0” then converted to a string.)

read_str_nonfloat(row: int, col: int, default: str | None = None, check_header: str | Sequence[str] | None = None) str | None[source]

Reads something that may be a string or numeric, but if it’s numeric, it’s integer (not float). (This prevents e.g. “2” being read as a floating-point value of “2.0” then converted to a string.)

read_value(row: int, col: int, check_header: str | Sequence[str] | None = None) Any[source]

Retrieves a value from a cell of a spreadsheet.

Optionally, check that the heading for this column is correct (see ensure_header()).

property sheet_name: str

Name of the sheet within the workbook (file).

cardinal_pythonlib.spreadsheets.all_same(items: Iterable[Any]) bool[source]

Are all the items the same?

https://stackoverflow.com/questions/3787908/python-determine-if-all-items-of-a-list-are-the-same-item

… though we will also allow “no items” to pass the test.

cardinal_pythonlib.spreadsheets.attr_all_same(items: Sequence[Any], attr: str) bool[source]

Returns whether the value of an attribute is the same across a collection of items.

Parameters:
  • items – Items to check

  • attr – Name of attribute whose value should be compared across items.

cardinal_pythonlib.spreadsheets.check_attr_all_same(items: Sequence[Any], attr: str, id_attr: str | None = None, fail_if_different: bool = True, ignore_none: bool = False) None[source]

Checks if the value of an attribute is the same across a collection of items, and takes some action if not.

Parameters:
  • items – Items to check

  • attr – Name of attribute whose value should be compared across items.

  • id_attr – If the attributes are not all the same, use the value of this attribute from the first item to give some identifying context to the failure message.

  • fail_if_different – If true, raises ValueError on failure; otherwise, prints a warning to the log.

  • ignore_none – Ignore None values?

cardinal_pythonlib.spreadsheets.colnum_zb_from_alphacol(alphacol: str) int[source]

Reverses column_lettering(), generating a zero-based column index from an alphabetical name (A to Z, AA to AZ, etc.).

cardinal_pythonlib.spreadsheets.column_lettering(colnum: int) str[source]

Converts a zero-based column index into a spreadsheet-style column name (A[0] to Z[25], then AA[26] to AZ[51], etc). Basically, it’s almost base 26, but without a proper sense of zero (in that A is zero, but AA is 26).

cardinal_pythonlib.spreadsheets.load_workbook(spreadsheet_filename: str) Book[source]

Load a workbook.

cardinal_pythonlib.spreadsheets.none_or_blank_string(x: Any) bool[source]

Is x either None or a string that is empty or contains nothing but whitespace?

cardinal_pythonlib.spreadsheets.prefer_attr_all_same(items: Sequence[Any], attr: str, id_attr: str, ignore_none: bool = False) None[source]

Warn if the attr attribute of each item in items is not the same. See check_attr_all_same().

cardinal_pythonlib.spreadsheets.read_int_row(row: Sequence[Cell], colnum: int) int | None[source]

Reads an integer from a spreadsheet, given a row.

AVOID: slower than index access (see SheetHolder, RowHolder).

cardinal_pythonlib.spreadsheets.read_value_row(row: Sequence[Cell], colnum: int) Any[source]

Retrieves a value from a cell of a spreadsheet, given a row.

AVOID: slower than index access (see SheetHolder, RowHolder).

cardinal_pythonlib.spreadsheets.require_attr_all_same(items: Sequence[Any], attr: str, id_attr: str, ignore_none: bool = False) None[source]

Raise if the attr attribute of each item in items is not the same. See check_attr_all_same().

cardinal_pythonlib.spreadsheets.values_by_attr(items: Sequence[Any], attr: str) List[Any][source]

Returns the values of a given attribute for each of the items.

Parameters:
  • items – Items to check

  • attr – Name of attribute whose value should be taken across items.