File utils¶

The module provides convenience functions for dealing with files, directories and paths.

Overview¶

Reading¶

`read_file_lines`(f[, strip, remove_empty, nlines])	Reads all lines of a given file.
`read_file_first_line`(f)	Reads the first line of a given file f.
`read_pickle_file`(f)	Reads a pickled object from a file f.

Writing¶

`write_lines_file`(lns, f[, linedel])	Writes lines into a file.
`write_pickle_file`(e, f)	Writes an object e as a pickle into a file f.

File extensions¶

`file_ext`(f)	Returns the extension of a file f.
`set_ext`(f[, ext])	Changes the file extension of path f to ext by replacing the text starting at the last period.
`file_complete_ext`(f)	Returns the complete extension of a file f.
`set_complete_ext`(f[, ext])	Changes the file extension of path f by changing the text after the first period in its basename to ext.
`remove_complete_ext`(f)	Removes the complete extension of a path f.

Directory listing¶

`list_dir`(d[, match, sort])	Lists files and directories that match a given pattern.
`walk_dir`(d[, match, follow_symlinks])	Walks a given directory and returns files/dirs whose basename matches a given pattern.

Paths¶

`abs_path`(path, d)	Returns an absolute POSIX path of path given a working directory d.
`prepend_dir`(path, d)	Prepends a directory d just before the basename of the path path.

Directories¶

`safe_create_dir`(d)	Creates a directory, if it does not already exist.

Sizes¶

`file_size`(f)	Returns the size of a file f, in bytes.
`dir_size`(d[, recursive])	Returns the total size of the files in a directory d, in bytes.

Symlinks¶

`rel_symlink`(d, src, dst)	Creates a relative symlink, starting at a given directory.

Utilities¶

`file_frame`(fs[, bf, bfne, size, stat])	Constructs a `DataFrame` from file names, having selected columns.
`random_hash`([n])	Generates n random bytes and represents them in a hexadecimal format.

daul.file_utils.abs_path(path, d)¶

Returns an absolute POSIX path of path given a working directory d.

>>> abs_path('dir/file', '/base')
'/base/dir/file'

daul.file_utils.as_posix_path(path)¶

Returns the path path as a POSIX path.

>>> as_posix_path('C:\Windows')
'C:/Windows'
>>> as_posix_path('/home/user')
'/home/user'
>>> as_posix_path('C:\Windows/system32')
'C:/Windows/system32'

daul.file_utils.dir_size(d, recursive=False)¶

Returns the total size of the files in a directory d, in bytes.

recursive : bool: Whether to recurse into subdirectories.
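
As an illustration of what the recursive flag changes, here is a minimal standard-library sketch. The name dir_size_sketch is hypothetical and this is not daul's implementation; in particular, how daul treats symlinks is not stated above.

```python
import os


def dir_size_sketch(d, recursive=False):
    """Sum the sizes (in bytes) of the files in d; hypothetical stand-in."""
    total = 0
    for root, dirs, files in os.walk(d):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
        if not recursive:
            break  # stop after the top-level directory
    return total
```

With recursive=False only the files directly inside d are counted; with recursive=True the files of all subdirectories are added as well.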

daul.file_utils.file_complete_ext(f)¶

Returns the complete extension of a file f.

>>> file_complete_ext('abc.txt.gz')
'.txt.gz'

daul.file_utils.file_ext(f)¶

Returns the extension of a file f.

>>> file_ext('abc.txt.gz')
'.gz'
>>> file_ext('abc')
''

daul.file_utils.file_frame(fs, bf=True, bfne=False, size=False, stat=False)¶

Constructs a DataFrame from file names, having selected columns.

fs : list of str: a list of file names.
bf : bool: whether to include the basename of a file (as “bf” column).
bfne : bool: whether to include the basename of a file without complete extension (as “bfne” column).
size : bool: whether to include the size of a file (as “size” column).
stat : bool: whether to include the stat of a file (as “stat” column).

daul.file_utils.file_size(f)¶

Returns the size of a file f, in bytes.

daul.file_utils.list_dir(d, match='*', sort=False)¶

Lists files and directories that match a given pattern.

The pattern is checked against the basename of the file/dir.

d : str: Directory to list.
match : str: Name pattern that the files/dirs must match to be included.
sort : bool: Indicates whether to sort files/dirs by name.
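
The basename matching described above can be sketched with the standard library. This assumes that match is a glob-style pattern (the default '*' suggests so) and that bare names are returned; neither is confirmed by the text, and list_dir_sketch is a hypothetical name.

```python
import fnmatch
import os


def list_dir_sketch(d, match="*", sort=False):
    """Return the entries of d whose basename matches a glob-style pattern."""
    names = [n for n in os.listdir(d) if fnmatch.fnmatch(n, match)]
    if sort:
        names.sort()  # os.listdir() returns entries in arbitrary order
    return names
```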

daul.file_utils.prepend_dir(path, d)¶

Prepends a directory d just before the basename of the path path.

>>> prepend_dir('file', 'dir')
'dir/file'
>>> prepend_dir('path/file', 'dir')
'path/dir/file'
>>> prepend_dir('/home/user/file', 'dir')
'/home/user/dir/file'
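
The behavior in the examples above can be reproduced with posixpath. This is a sketch under the assumption that only POSIX-style paths are passed; prepend_dir_sketch is a hypothetical name, not daul's code.

```python
import posixpath


def prepend_dir_sketch(path, d):
    """Insert directory d between the directory part and the basename of path."""
    head, tail = posixpath.split(path)
    return posixpath.join(head, d, tail)
```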

daul.file_utils.random_hash(n=8)¶

Generates n random bytes and represents them in a hexadecimal format.

n : int: Number of bytes to generate.

Let us generate a random hash from 8 bytes:

>>> h = random_hash(n=8)
>>> len(h) # 16: two hex letters for a byte
16

daul.file_utils.read_file_first_line(f)¶

Reads the first line of a given file f.

daul.file_utils.read_file_lines(f, strip=False, remove_empty=False, nlines=None)¶

Reads all lines of a given file.

f : str: Path to the file.
strip : bool: Indicates whether to strip lines of white space.
remove_empty : bool: Indicates whether to remove empty lines.
nlines : int or None: If None, reads all the lines. If int, reads the specified number of lines.
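
How the three options might interact can be sketched as follows. The order of operations is an assumption (nlines limits the lines read before any filtering, and a whitespace-only line counts as empty); the text above does not specify it, and read_file_lines_sketch is a hypothetical name.

```python
from itertools import islice


def read_file_lines_sketch(f, strip=False, remove_empty=False, nlines=None):
    """Read lines of a text file; hypothetical stand-in for illustration."""
    with open(f) as fh:
        # islice(fh, None) yields every line; islice(fh, n) stops after n lines
        lines = list(islice(fh, nlines))
    if strip:
        lines = [ln.strip() for ln in lines]
    if remove_empty:
        lines = [ln for ln in lines if ln.strip()]
    return lines
```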

daul.file_utils.read_pickle_file(f)¶

Reads a pickled object from a file f.

daul.file_utils.rel_symlink(d, src, dst)¶

Creates a relative symlink, starting at a given directory.

d : str: Directory to temporarily change the current directory to.
src : str: Name of the source file.
dst : str: Name of the destination file.

daul.file_utils.remove_complete_ext(f)¶

Removes the complete extension of a path f.

>>> remove_complete_ext('abc.txt.gz')
'abc'

daul.file_utils.safe_create_dir(d)¶

Creates a directory, if it does not already exist.

d : str: Path to the directory to create.

Raises:	OSError – If d exists and is not a directory.
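
A minimal sketch of the documented contract, using only the standard library. Whether daul also creates missing parent directories is not stated; the sketch does so through os.makedirs(), which is an assumption, and safe_create_dir_sketch is a hypothetical name.

```python
import os


def safe_create_dir_sketch(d):
    """Create d unless it already exists as a directory."""
    if not os.path.isdir(d):
        # Raises FileExistsError (a subclass of OSError) if d is an existing file.
        os.makedirs(d)
```

Calling the function twice on the same path is harmless, while calling it on the path of an existing regular file raises OSError.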

daul.file_utils.set_complete_ext(f, ext='')¶

Changes the file extension of path f by changing the text after the first period in its basename to ext.

>>> set_complete_ext('abc.txt', '.tex')
'abc.tex'
>>> set_complete_ext('abc.txt.gz', '.ext')
'abc.ext'
>>> set_complete_ext('abc.txt.gz', '')
'abc'

See also

remove_complete_ext() for removing the complete extension.
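
The "first period in the basename" rule can be sketched with posixpath. This is an illustration only: set_complete_ext_sketch is a hypothetical name, and how daul treats names that start with a period (such as '.bashrc') is not documented above.

```python
import posixpath


def set_complete_ext_sketch(f, ext=""):
    """Replace everything from the first period of the basename with ext."""
    head, base = posixpath.split(f)
    stem = base.split(".", 1)[0]  # text before the first period
    return posixpath.join(head, stem + ext)
```

Periods in directory names are left alone because only the basename is split.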

daul.file_utils.set_ext(f, ext='')¶

Changes the file extension of path f to ext by replacing the text starting at the last period.

>>> set_ext('abc.txt', '.tex')
'abc.tex'
>>> set_ext('abc', '.txt')
'abc.txt'
>>> set_ext('/home/user/file', '.cache')
'/home/user/file.cache'

The changes are applied only to the basename:

>>> set_ext('/home/user/.conf/file', '.cache')
'/home/user/.conf/file.cache'

If multiple periods are present in the basename, it changes only the last one:

>>> set_ext('file.txt.cache', '.ch')
'file.txt.ch'

See also

set_complete_ext() for changing the complete extension.

daul.file_utils.walk_dir(d, match='*', follow_symlinks=False)¶

Walks a given directory and returns files/dirs whose basename matches a given pattern.

d : str: Path to the directory to walk.
match : str: The pattern of files to return.
follow_symlinks : bool: Whether to follow links as in os.walk().
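
A sketch of such a walk built on os.walk() and fnmatch. It assumes a glob-style pattern and returns paths joined with the directory in which each entry was found; the actual return format of walk_dir() is not specified above, and walk_dir_sketch is a hypothetical name.

```python
import fnmatch
import os


def walk_dir_sketch(d, match="*", follow_symlinks=False):
    """Collect files and directories under d whose basename matches match."""
    found = []
    for root, dirs, files in os.walk(d, followlinks=follow_symlinks):
        for name in dirs + files:
            if fnmatch.fnmatch(name, match):
                found.append(os.path.join(root, name))
    return found
```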

daul.file_utils.write_lines_file(lns, f, linedel='')¶

Writes lines into a file.

lns : list: List of lines.
f : str: Path to the file.
linedel : str: Line delimiter.
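
The role of linedel can be illustrated with a short sketch. It assumes that the delimiter is written after every line, so that the default '' leaves lines exactly as given; this is a guess about the semantics, not daul's documented behavior, and write_lines_file_sketch is a hypothetical name.

```python
def write_lines_file_sketch(lns, f, linedel=""):
    """Write each line followed by linedel (assumed semantics)."""
    with open(f, "w") as fh:
        for ln in lns:
            fh.write(ln + linedel)
```

Under this assumption, lines that do not already end in a newline need linedel='\n' to end up on separate lines in the file.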

daul.file_utils.write_pickle_file(e, f)¶

Writes an object e as a pickle into a file f.
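
A round trip through the pickle helpers could look like the following sketch, written directly against the standard pickle module. The function names are hypothetical stand-ins, and the pickle protocol used by daul is not documented here.

```python
import pickle


def write_pickle_file_sketch(e, f):
    """Serialize object e into file f."""
    with open(f, "wb") as fh:
        pickle.dump(e, fh)


def read_pickle_file_sketch(f):
    """Load a pickled object from file f."""
    with open(f, "rb") as fh:
        return pickle.load(fh)
```

As with any use of pickle, files from untrusted sources should not be loaded, because unpickling can execute arbitrary code.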

Pandas utilities¶

The module collects utilities for dealing with DataFrames.

Overview¶

Columns¶

`rename_cols`(df, d)	Returns a copy of a frame df with renamed columns as defined by the dictionary d.
`set_cols`(df, cols)	Returns a copy of a frame df with its labels being cols.
`reorder_cols`(df[, first_cols, last_cols])	Returns a copy of frame df with reordered columns.
`prefix_cols`(df, pref)	Returns a copy of a frame df with prefix pref applied to column labels.
`drop_cols`(df, cols)	Returns a copy of a frame df with columns cols dropped.
`safe_drop_cols`(df, cols)	Returns a copy of a frame df with columns cols dropped, but without raising exceptions if they do not exist.

Column updates¶

`update_column`(df, col, values[, copy])	Returns a copy of a frame df with column col set to values.
`update_tuple_col`(df, cola, colb, values[, copy])	Returns a copy of a frame df and sets values of two columns (cola, colb) from a list of 2-len tuples (values).
`update_fixed_column`(df, col, value[, copy])	Returns a copy of a frame df with a column col set to a fixed value value.
`column_apply`(df, col, f[, store_as, copy])	Returns a copy of a frame with a function f being applied over a column col and stored as the same column col.

Transformations¶

`expand`(df, col[, exp_col, remove_col])	Expands the values of a list-like column into separate rows, while keeping the rest of the columns fixed.
`expand_dicts_as_cols`(df, col[, remove_col])	Returns a frame in which the values of a dict-based column col in frame df are expanded into separate columns.

Inner frames¶

`groupby_as_frame`(df, col[, df_col])	Groups a frame by a column and stores the frames as inner frames.
`extract_inner_col`(df, df_col, inndf_col[, aggf])	Extracts values from an inner frame’s column and applies a function over them.
`include_outer_col`(df, inndf_col, col)	Includes column from the outer frame as a fixed column in the inner frame.
`attach_inner_col`(df, col, df_col, inndf_col)	Attaches a column to the outer frame by aggregating the values in the inner frame.
`attach_inner_cols`(df, cols, df_col, aggf[, …])	Attaches selected columns from the inner frame.
`rename_inner_cols`(df, df_col[, d])	Renames columns of inner frames.

Row splitting¶

`row_split_ngroups`(df, n[, empty])	Row-splits the frame into n approximately equal-length frames.
`row_split_group_size`(df, sz[, empty])	Row-splits the frame into frames having sz rows.
`row_split_sizes`(df, lens)	Splits a frame df according to a list lens of lengths.
`row_split_bool`(df, arr)	Row-splits a frame df into two parts: the rows for which the arr is True and those for which it is False.

Joins¶

`left_join`(ldf, rdf, jcol[, rcols])	Returns a copy of the left frame with columns joined from the right frame on a specified column.
`left_join_def`(df, jdf, jcol[, cols, default])	Returns a copy of the left frame with columns joined from the right frame, providing a default value.

Conversion to dictionaries¶

`twocol_dict`(df)	Creates a dictionary from the first two columns of a frame df (keys — the first column, values — the second column).
`twocol_dictf`(df)	Creates a function that maps values from the first column of df to the values in the corresponding rows of the second column.
`twocol_dictfd`(df, default)	Creates a function that maps values from the first column of df to the values in the corresponding rows of the second column, with a default value default.
`twocol_listvaldict`(df)	Creates a dictionary from the first two columns of a frame df, (keys — the first column, values — list-aggregated values from the second column for the same key).
`twocol_listvaldictf`(df)	Creates a function that maps keys from the first column of df into a list of values from the second column corresponding to the same key.
`twocol_listvaldictfd`(df, default)	Creates a function that maps keys from the first column of df into a list of values from the second column corresponding to the same key (with a default value default, if no such key is found).

Selectors¶

`nodup`(df[, cols])	Returns non-duplicated rows of a frame df.
`take_middle_n_rows`(df, n[, error_if_less])	Selects n middle rows of a frame.
`ifst`(x)	Returns the first element of x using iloc of a frame, or a series.

Generic¶

`empty_frame`([cols])	Creates an empty frame with the given column labels cols.
`pdmap`(f, df)	Maps a function f over rows of a frame df, with the function application being done over all columns as positional arguments.
`renumber_index`(df[, copy])	Returns a copy of a frame df, with the index having consecutive values in the range from `0` to `len(df) - 1`.
`values_list`(x)	Returns the x.values of x as a list.
`values_set`(x)	Returns the x.values of x as a set.

Empty-aware utilities¶

`eaw_row_concat`(dfs, cols)	Performs a row-wise concatenation of frames dfs.

Compatibility¶

`sort`(df, col[, ascending])	Sorts frame by given col(s).
`row_concat`(dfs)	Concatenates frames dfs, row-wise.

daul.pandas_utils.attach_inner_col(df, col, df_col, inndf_col, aggf=list)¶

Attaches a column to the outer frame by aggregating the values in the inner frame.

df : DataFrame: The frame to attach the column to.
col : str: The label of the column that will be attached to the outer frame.
df_col : str: The label of the column where the inner frames are stored.
inndf_col : str: The label of the column in the inner frame, which will be aggregated.
aggf : function: The function that aggregates the values.

daul.pandas_utils.attach_inner_cols(df, cols, df_col, aggf, namef=<function idf>)¶

Attaches selected columns from the inner frame.

df : DataFrame

The frame to attach the columns to.

cols : list

The labels of the columns in the inner frame to attach to the outer frame. See also namef parameter.

df_col : str

The label of the column where the inner frames are stored.

aggf : function

The function that aggregates the values from the inner frames.

namef : function

The function that maps labels of the columns from the inner frame to the labels in the outer frame.

Let us create a simple frame that we will group to obtain inner frames:

>>> df = pd.DataFrame({'a': [0, 0, 1, 2], 'b': ['a', 'b', 'c', 'd']})
>>> df
   a  b
0  0  a
1  0  b
2  1  c
3  2  d

Group it:

>>> gdf = groupby_as_frame(df, 'a') # inner frame is in 'df' column

Now we can attach the inner columns as outer ones:

>>> gdf = attach_inner_cols(gdf, ['b'], 'df', list)
>>> gdf[['a', 'b']]
   a       b
0  0  [a, b]
1  1     [c]
2  2     [d]

daul.pandas_utils.column_apply(df, col, f, store_as=None, copy=True)¶

Returns a copy of a frame with a function f being applied over a column col and stored as the same column col.

df : DataFrame: The frame.
col : column-label: The column over which to apply f.
f : function: The function to apply over col.
store_as : column-label or None: The label of the column to store results of applying f. If None, it is the same as col.
copy : bool: Whether to return a copy of the frame.

>>> df = pd.DataFrame({'a': [0, 1]})
>>> df = column_apply(df, 'a', lambda x: x + 1)
>>> df
   a
0  1
1  2

daul.pandas_utils.drop_cols(df, cols)¶

Returns a copy of a frame df with columns cols dropped.

>>> df = empty_frame(['a', 'b', 'c'])  

We can use it for one column:

>>> list(drop_cols(df, 'a').columns)
['b', 'c']

And also for multiple columns:

>>> list(drop_cols(df, ['a', 'b']).columns)
['c']

daul.pandas_utils.eaw_groupby_agg(df, groupby, aggd)¶

Groups a frame and aggregates values, even for an empty frame.

df : DataFrame: Frame to group.
groupby : str: Label of the column to group by.
aggd : dict: Dictionary of column labels and aggregation functions, as in pandas.DataFrame.agg().

>>> df = pd.DataFrame({'x': [0, 0, 1], 'y': [1, 2, 3]})
>>> df
   x  y
0  0  1
1  0  2
2  1  3
>>> gdf = eaw_groupby_agg(df, 'x', {'y': np.max}).reset_index()
>>> gdf
   x  y
0  0  2
1  1  3
>>> df = empty_frame(['x', 'y'])
>>> gdf = eaw_groupby_agg(df, 'x', {'y': np.max}).reset_index()
>>> gdf
Empty DataFrame
Columns: [x, y]
Index: []

daul.pandas_utils.eaw_row_concat(dfs, cols)¶

Performs a row-wise concatenation of frames dfs. If an empty list is given, returns an empty frame with the specified columns cols.
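
The empty-list case matters because pd.concat() raises a ValueError when it is given no frames. A minimal sketch of the described behavior follows; eaw_row_concat_sketch is a hypothetical name, not daul's code.

```python
import pandas as pd


def eaw_row_concat_sketch(dfs, cols):
    """Row-concatenate dfs, falling back to an empty frame with columns cols."""
    if len(dfs) == 0:
        return pd.DataFrame(columns=cols)
    return pd.concat(dfs)
```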

daul.pandas_utils.empty_frame(cols=[])¶

Creates an empty frame with the given column labels cols.

>>> df = empty_frame(['a', 'b'])  
>>> len(df)
0
>>> list(df.columns)
['a', 'b']

daul.pandas_utils.expand(df, col, exp_col=None, remove_col=False)¶

Expands the values of a list-like column into separate rows, while keeping the rest of the columns fixed.

df : DataFrame: The frame to expand.
col : str: A column of df to expand.
exp_col : str: The label to store the expanded column as. If None, the name of col is used.
remove_col : bool: Indicates whether to remove the non-expanded column.

>>> df = pd.DataFrame({'a': [1, 2], 
...                    'fruits': [['orange', 'apple'], 
...                               ['kiwi']]})
>>> df
   a           fruits
0  1  [orange, apple]
1  2           [kiwi]
>>> df = expand(df, 'fruits', 'fruit')
>>> df
   a           fruits   fruit
0  1  [orange, apple]  orange
0  1  [orange, apple]   apple
1  2           [kiwi]    kiwi

daul.pandas_utils.expand_dicts_as_cols(df, col, remove_col=False)¶

Returns a frame in which the values of a dict-based column col in frame df are expanded into separate columns.

remove_col : bool: Whether to remove the dict-based column.

Suppose we have a frame with a dict-like column v.

>>> df = pd.DataFrame({'i': [0, 1], 
...                    'v': [{'a': 1, 'b': 2}, 
...                          {'a': 2, 'b': 3}]})
>>> df
   i                   v
0  0  {u'a': 1, u'b': 2}
1  1  {u'a': 2, u'b': 3}

Now let us expand the dictionaries in v.

>>> df = expand_dicts_as_cols(df, 'v', remove_col=True)
>>> df[['i', 'a', 'b']]
   i  a  b
0  0  1  2
1  1  2  3

daul.pandas_utils.extract_inner_col(df, df_col, inndf_col, aggf=list)¶

Extracts values from an inner frame’s column and applies a function over them.

df : DataFrame: The frame.
df_col : str: The label of column in df that holds the inner frames.
inndf_col : str: The label of column in the inner frames.
aggf : function: The function to aggregate the columns.

Let us first create a frame and then group it (see groupby_as_frame()), so that it has an inner frame:

>>> df = pd.DataFrame({'a': [0, 0, 1, 2], 'b': [1, 3, 5, 7]})
>>> gdf = groupby_as_frame(df, 'a') # inner frame in `df`

Now let us extract the values of the column b from the inner frames:

>>> extract_inner_col(gdf, 'df', 'b')
0    [1, 3]
1       [5]
2       [7]
Name: df, dtype: object

We can also apply some function, e.g., to sum the values:

>>> extract_inner_col(gdf, 'df', 'b', sum)
0    4
1    5
2    7
Name: df, dtype: int64

daul.pandas_utils.groupby_as_frame(df, col, df_col='df')¶

Groups a frame by a column and stores the frames as inner frames.

df : DataFrame: The frame to group.
col : label: The column by which to group.
df_col : label: The label that will hold the resulting subframes.

>>> df = pd.DataFrame({'a': [0, 0, 1], 'b': ['a', 'b', 'c']})
>>> df
   a  b
0  0  a
1  0  b
2  1  c

Let us group the frame on the a column:

>>> gdf = groupby_as_frame(df, 'a')

Let us see the groups:

>>> gdf[['a']]
   a
0  0
1  1

The first group (with a = 0):

>>> gdf['df'].iloc[0]
   a  b
0  0  a
1  0  b

The second group (with a = 1):

>>> gdf['df'].iloc[1]
   a  b
2  1  c

daul.pandas_utils.ifst(x)¶

Returns the first element of x using iloc of a frame, or a series.

daul.pandas_utils.include_outer_col(df, inndf_col, col)¶

Includes column from the outer frame as a fixed column in the inner frame.

df : DataFrame: Frame to update.
inndf_col : label: The label of the column that holds the inner frames.
col : label: The label of the column of df to include in the inner frames.

daul.pandas_utils.left_join(ldf, rdf, jcol, rcols=[])¶

Returns a copy of the left frame with columns joined from the right frame on a specified column.

ldf : DataFrame

The left frame.

rdf : DataFrame

The right frame.

jcol : str

The label of the column on which to perform the join.

rcols : list

The columns from rdf to join.

Note

left_join() defines no default for a value in jcol that is missing from the right frame; see left_join_def() for a variant that provides one.

>>> ldf = pd.DataFrame({'a': [0, 0, 1, 2]})
>>> rdf = pd.DataFrame({'a': [0, 1, 2], 'b': ['zero', 'one', 'two']})
>>> left_join(ldf, rdf, 'a', ['b'])
   a     b
0  0  zero
1  0  zero
2  1   one
3  2   two

daul.pandas_utils.left_join_def(df, jdf, jcol, cols=[], default=None)¶

Returns a copy of the left frame with columns joined from the right frame, providing a default value.

Note

See left_join() for explanation of parameters.
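
The idea of a left join with a default can be sketched in plain pandas. This is an assumption about the semantics (the default replaces values that are missing after the join), not daul's implementation, and left_join_def_sketch is a hypothetical name. Note that fillna() also replaces missing values that were already present in the right frame.

```python
import pandas as pd


def left_join_def_sketch(df, jdf, jcol, cols, default=None):
    """Left-join cols of jdf onto df by jcol, filling unmatched rows with default."""
    out = df.merge(jdf[[jcol] + list(cols)], on=jcol, how="left")
    if default is not None:
        out[cols] = out[cols].fillna(default)
    return out


ldf = pd.DataFrame({"a": [0, 1, 3]})
rdf = pd.DataFrame({"a": [0, 1], "b": ["zero", "one"]})
joined = left_join_def_sketch(ldf, rdf, "a", ["b"], default="unknown")
```

Here the row with a = 3 has no match in rdf, so its b value becomes 'unknown'.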

daul.pandas_utils.nodup(df, cols=None)¶

Returns non-duplicated rows of a frame df.

If cols is None, then all columns are used.

daul.pandas_utils.pdmap(f, df)¶

Maps a function f over rows of a frame df, with the function application being done over all columns as positional arguments.

>>> df = pd.DataFrame({'a': [0, 1]})
>>> df = update_column(df, 'b', df['a'] + 10)
>>> df
   a   b
0  0  10
1  1  11
>>> pdmap(lambda x, y: x + y, df)
[10, 12]

daul.pandas_utils.prefix_cols(df, pref)¶

Returns a copy of a frame df with prefix pref applied to column labels.

>>> df = pd.DataFrame({'a': [], 'b': []})
>>> list(prefix_cols(df, 'l:').columns)
['l:a', 'l:b']

daul.pandas_utils.rename_cols(df, d)¶

Returns a copy of a frame df with renamed columns as defined by the dictionary d.

Let us create an empty frame with two columns:

>>> adf = pd.DataFrame({'a': [], 'b': []})

Rename a copy of the frame:

>>> bdf = rename_cols(adf, {'b': 'c'})
>>> list(bdf.columns)
['a', 'c']

Note that the column labels of the previous frame are unchanged:

>>> list(adf.columns)
['a', 'b']

daul.pandas_utils.rename_inner_cols(df, df_col, d={})¶

Renames columns of inner frames.

df : DataFrame: The frame to transform.
df_col : str: The label of the column where the inner frames are stored.
d : dict: The renaming dictionary.

daul.pandas_utils.renumber_index(df, copy=True)¶

Returns a copy of a frame df, with the index having consecutive values in the range from 0 to len(df) - 1.

copy : bool: Whether to return a copy of the frame.

daul.pandas_utils.reorder_cols(df, first_cols=[], last_cols=[])¶

Returns a copy of frame df with reordered columns.

first_cols specifies the columns that will be put first, and last_cols the ones that will be put last.

>>> adf = empty_frame(['a', 'b', 'c', 'd'])
>>> bdf = reorder_cols(adf, ['b'], ['a'])
>>> list(bdf.columns)
['b', 'c', 'd', 'a']

Note that the order remains unchanged in the original frame.

>>> list(adf.columns)
['a', 'b', 'c', 'd']

daul.pandas_utils.row_concat(dfs)¶

Concatenates frames dfs, row-wise.

daul.pandas_utils.row_split_bool(df, arr)¶

Row-splits a frame df into two parts: the rows for which the arr is True and those for which it is False.

Returns:	tdf, fdf : tuple: The frame of rows for which arr is True and the frame of rows for which it is False.

>>> df = pd.DataFrame({'a': np.arange(0, 10)})

Split the frame into the rows with even numbers and the rows with odd numbers:

>>> adf, bdf = row_split_bool(df, df['a'] % 2 == 0)
>>> adf  
   a
0  0
2  2
4  4
6  6
8  8

daul.pandas_utils.row_split_group_size(df, sz, empty='zerolengroup')¶

Row-splits the frame into frames having sz rows.

df : DataFrame: The frame to row-split.
sz : int: The number of rows in a group.
empty : str: Applies only if the frame to split is of zero length. Using ‘zerolengroup’ returns one group of zero length (with the column labels preserved). Using ‘nogroup’ returns an empty list.

>>> df = pd.DataFrame({'x': np.arange(0, 1000)})
>>> dfs = row_split_group_size(df, 5)  
>>> len(dfs)
200
>>> len(dfs[0])
5

See also

row_split_ngroups() for examples when the frame to split has zero rows.

daul.pandas_utils.row_split_ngroups(df, n, empty='zerolengroup')¶

Row-splits the frame into n approximately equal-length frames.

df : DataFrame: The frame to row-split.
n : int: The number of groups.
empty : str: Applies only if the frame to split is of zero length. Using ‘zerolengroup’ returns one group of zero length (with the column labels preserved). Using ‘nogroup’ returns an empty list.

Note

Uses NumPy array_split() for splitting the frame.

Note

If the number of rows r in the frame is less than n, returns r groups.

>>> df = pd.DataFrame({'x': np.arange(0, 1000)})
>>> n = 10  
>>> dfs = row_split_ngroups(df, n)

The total number of frames:

>>> len(dfs)
10

The size of the first frame:

>>> len(dfs[0])
100

Note the behavior if the number of rows is less than the number of groups:

>>> df = pd.DataFrame({'x': np.arange(0, 5)})
>>> dfs = row_split_ngroups(df, 10)
>>> len(dfs)
5

In case of an empty frame, the result is, by default, one group holding an empty frame:

>>> df = empty_frame(['a', 'b'])
>>> dfs = row_split_ngroups(df, 5, empty='zerolengroup')
>>> len(dfs)
1
>>> len(dfs[0])
0

In this case the shape of the frame is preserved and thus further processing of the frame will likely succeed:

>>> list(dfs[0].columns)
['a', 'b']

In case of ‘nogroup’, returns an empty list:

>>> row_split_ngroups(df, 5, empty='nogroup')  
[]

daul.pandas_utils.row_split_sizes(df, lens)¶

Splits a frame df according to a list lens of lengths.

>>> df = pd.DataFrame({'x': np.arange(0, 10)})  
>>> lens = [1, 3, 2, 4]

Split and check whether the lengths correspond.

>>> dfs = row_split_sizes(df, lens)
>>> list(map(len, dfs))
[1, 3, 2, 4]

Let us take a look at the last frame:

>>> dfs[-1]
   x
6  6
7  7
8  8
9  9

daul.pandas_utils.safe_drop_cols(df, cols)¶

Returns a copy of a frame df with columns cols dropped, but without raising exceptions if they do not exist.

>>> df = empty_frame(['a', 'b', 'c'])

>>> list(safe_drop_cols(df, ['a', 'd']).columns)
['b', 'c']

>>> list(safe_drop_cols(df, 'a').columns)
['b', 'c']

daul.pandas_utils.set_cols(df, cols)¶

Returns a copy of a frame df with its labels being cols.

>>> adf = empty_frame(['a', 'b'])  
>>> bdf = set_cols(adf, ['c', 'd'])
>>> list(bdf.columns)
['c', 'd']

Note that the columns of adf are not changed:

>>> list(adf.columns)
['a', 'b']

daul.pandas_utils.sort(df, col, ascending=True)¶

Sorts frame by given col(s).

df : DataFrame: The frame to sort.
col : str or list: Columns by which to sort.
ascending : bool or list: Whether to sort in ascending order.

Note

Internally uses pd.DataFrame.sort_values(), or pd.DataFrame.sort() if sort_values() is not available.

daul.pandas_utils.take_middle_n_rows(df, n, error_if_less=False)¶

Selects n middle rows of a frame.

Raises:

ValueError

If df has fewer than n rows and error_if_less is set.

>>> df = pd.DataFrame({'col': ['a', 'b', 'c']})  
>>> take_middle_n_rows(df, 1)
  col
1   b
>>> take_middle_n_rows(df, 2)
  col
0   a
1   b

daul.pandas_utils.twocol_dict(df)¶

Creates a dictionary from the first two columns of a frame df (keys — the first column, values — the second column).

See also

twocol_listvaldict() for creating a list-valued dictionary.

>>> df = pd.DataFrame({'k': ['a', 'b'], 'v': ['x', 'y']})[['k', 'v']]
>>> df
   k  v
0  a  x
1  b  y
>>> twocol_dict(df)
{'a': 'x', 'b': 'y'}

daul.pandas_utils.twocol_dictf(df)¶

Creates a function that maps values from the first column of df to the values in the corresponding rows of the second column.

Note

Creates a dictionary using twocol_dict().

daul.pandas_utils.twocol_dictfd(df, default)¶

Creates a function that maps values from the first column of df to the values in the corresponding rows of the second column, with a default value default.
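
A sketch of how such a mapping function could be built from a two-column frame. The helper names are hypothetical and the code is an illustration of the idea, not daul's implementation.

```python
import pandas as pd


def twocol_dict_sketch(df):
    """Map values of the first column to values of the second column."""
    return dict(zip(df.iloc[:, 0], df.iloc[:, 1]))


def twocol_dictfd_sketch(df, default):
    """Return a lookup function that falls back to default for unknown keys."""
    d = twocol_dict_sketch(df)
    return lambda key: d.get(key, default)


df = pd.DataFrame({"k": ["a", "b"], "v": ["x", "y"]})[["k", "v"]]
lookup = twocol_dictfd_sketch(df, "?")
```

A function of this kind is convenient to pass to apply-style helpers, since it never raises KeyError.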

daul.pandas_utils.twocol_listvaldict(df)¶

Creates a dictionary from the first two columns of a frame df, (keys — the first column, values — list-aggregated values from the second column for the same key).

See also

twocol_dict().

>>> df = pd.DataFrame({'k': ['a', 'a', 'b'], 'v': ['c', 'd', 'e']})[['k', 'v']]
>>> df
   k  v
0  a  c
1  a  d
2  b  e
>>> twocol_listvaldict(df)
{'a': ['c', 'd'], 'b': ['e']}

daul.pandas_utils.twocol_listvaldictf(df)¶

Creates a function that maps keys from the first column of df into a list of values from the second column corresponding to the same key.

See also

The function is a wrapper over twocol_listvaldict().

daul.pandas_utils.twocol_listvaldictfd(df, default)¶

Creates a function that maps keys from the first column of df into a list of values from the second column corresponding to the same key (with a default value default, if no such key is found).

See also

The function is a wrapper over twocol_listvaldict().

daul.pandas_utils.update_column(df, col, values, copy=True)¶

Returns a copy of a frame df with column col set to values.

copy : bool: Whether to return a copy.

>>> adf = pd.DataFrame({'v': [0, 1, 2]})

If a column with the desired label does not exist, a new one is created:

>>> bdf = update_column(adf, 'e', ['a', 'b', 'c'])
>>> list(bdf['e'])
['a', 'b', 'c']

If it does, the column will have new values:

>>> cdf = update_column(bdf, 'e', ['d', 'e', 'f'])
>>> list(cdf['e'])
['d', 'e', 'f']

Note that the previous frame remains the same.

>>> list(bdf['e'])
['a', 'b', 'c']

daul.pandas_utils.update_fixed_column(df, col, value, copy=True)¶

Returns a copy of a frame df with a column col set to a fixed value value.

See also daul.shortcuts.pdufc() for a shorter form.

copy : bool: Whether to return a copy.

>>> df = pd.DataFrame({'a': [0, 1]})
>>> df = update_fixed_column(df, 'b', 'text')
>>> list(df['b'])
['text', 'text']

Warning

Be aware that the very same object is assigned to every row. If the object is mutable, changing it through one row changes it in all the other rows as well.

>>> df = pd.DataFrame({'a': [0, 1]})
>>> df = update_fixed_column(df, 'b', [])
>>> list(df['b'])
[[], []]

>>> df['b'].iloc[0].append(1) # change the first object
>>> list(df['b'])
[[1], [1]]
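
The same aliasing can be reproduced in plain Python, which also suggests a possible workaround: build one fresh object per row and pass the resulting list as values to update_column(). This is a suggestion sketched under that assumption, not a documented daul feature.

```python
n_rows = 2

# One list object repeated n_rows times: every slot is the same object.
shared = [[]] * n_rows
shared[0].append(1)

# A distinct list per row: changing one leaves the others untouched.
fresh = [[] for _ in range(n_rows)]
fresh[0].append(1)
```

After running this, shared holds the appended value in every slot, whereas fresh holds it only in the first one.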

daul.pandas_utils.update_tuple_col(df, cola, colb, values, copy=True)¶

Returns a copy of a frame df and sets values of two columns (cola, colb) from a list of 2-len tuples (values).

copy : bool: Whether to return a copy.

>>> df = pd.DataFrame({'id': [0, 1]})
>>> values = [('green', '#00ff00'), ('blue', '#0000ff')]
>>> df = update_tuple_col(df, 'name', 'hex', values)  
>>> df
   id   name      hex
0   0  green  #00ff00
1   1   blue  #0000ff

daul.pandas_utils.values_list(x)¶

Returns the x.values of x as a list.

>>> df = pd.DataFrame({'x': np.arange(5)})  
>>> values_list(df['x'])
[0, 1, 2, 3, 4]

daul.pandas_utils.values_set(x)¶

Returns the x.values of x as a set.

>>> df = pd.DataFrame({'x': ['a', 'a', 'a']})  
>>> values_set(df['x'])
set(['a'])

NumPy utilities¶

Dealing with NumPy arrays.

Overview¶

Selectors¶

`first_nrows`(arr, n)	Selects first n rows of a 2-dimensional array arr.
`each_nth`(arr[, n])	Selects each n-th element of an array arr.

Run-length encoding¶

`rle`(l)	Run-length encodes a given list l.

Normalization¶

`onesum_norm`(x)	Normalizes an array x to sum to one.

Rounding¶

`floor_decimals`(x[, dec])	Floors a value x to a given number dec of decimal places.

Length groups¶

`lengroup_select`(arrs, inds)	Selects elements from a list of arrays using absolute indices as if the arrays were concatenated.
`lengroup_starts_ends`(lens)	Obtains starting and ending positions of individual length-groups with lengths specified by lens.

Array construction¶

`nonzeros_at`(l, pos, val, dtype)	Creates an array of a particular length (l), by specifying all non-zero values (val) and their positions (pos).

Cumulative sums¶

`zero_cumsum_nolast`(arr)	Returns the cumulative sum of an array arr, but starting at zero and without the last element.

Zero/Empty-aware operations¶

`zaw_1d_concatenate`(arrs[, default, dtype])	Concatenates 1-d arrays arrs, or returns default if an empty list was provided.
`eaw_loc`(arr, loc)	Returns the elements of an array arr specified by indices loc, and returns an empty array if the indices are empty.

daul.numpy_utils.each_nth(arr, n=1)¶

Selects each n-th element of an array arr.

>>> list(each_nth(np.arange(10), 3))
[0, 3, 6, 9]

daul.numpy_utils.eaw_loc(arr, loc)¶

Returns the elements of an array arr specified by indices loc, and returns an empty array if the indices are empty.

>>> list(eaw_loc(np.arange(0, 10), []))
[]
>>> list(eaw_loc([10, 11, 12], [1, 2]))
[11, 12]

daul.numpy_utils.first_nrows(arr, n)¶

Selects first n rows of a 2-dimensional array arr.

>>> arr = np.array([[1, 2], [2, 3], [3, 4]])
>>> arr  
array([[1, 2],
       [2, 3],
       [3, 4]])
>>> first_nrows(arr, 1)
array([[1, 2]])

daul.numpy_utils.floor_decimals(x, dec=0)¶

Floors a value x to a given number dec of decimal places.

>>> x = 123.456

>>> floor_decimals(x, 0)
123.0
>>> floor_decimals(x, 1)
123.4
>>> floor_decimals(x, 2)
123.45

Note the behavior for a negative number of decimals.

>>> floor_decimals(x, -1)
120.0
>>> floor_decimals(x, -2)
100.0
>>> floor_decimals(x, -3)
0.0

daul.numpy_utils.lengroup_outer_inner_indices(lens, lpos)¶

Returns the outer and inner indices given lengths lens of length-grouped arrays and the desired position lpos.

Suppose we have a length-group array with 2, 4, and 3 elements.

>>> lens = np.array([2, 4, 3])

For convenience, let us label the function with a shorter name.

>>> f = lengroup_outer_inner_indices

Now, let us obtain all possible absolute indices.

>>> lpos = np.arange(sum(lens))
>>> o, i = f(lens, lpos)

Let us take a look at outer indices:

>>> list(o)
[0, 0, 1, 1, 1, 1, 2, 2, 2]

We see that we are getting the proper outer index.

Let us take a look at the inner (within-group) indices:

>>> list(i)
[0, 1, 0, 1, 2, 3, 0, 1, 2]

Note that an exception is raised if any index is out of range.

>>> f(lens, -1)
Traceback (most recent call last):
...
IndexError: The absolute index is not within limits.

>>> f(lens, sum(lens))
Traceback (most recent call last):
...
IndexError: The absolute index is not within limits.

If we provide empty indices, we also get empty ones.

>>> o, i = f(lens, [])
>>> len(o) == 0 and len(i) == 0
True
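
The mapping from absolute indices to (outer, inner) pairs can be sketched with a binary search over the cumulative group ends. This is an illustration under that assumption, not the library's code; `outer_inner_sketch` is a hypothetical name, and only the error message is taken from the examples above:

```python
import numpy as np


def outer_inner_sketch(lens, lpos):
    """Hypothetical stand-in for lengroup_outer_inner_indices."""
    lens = np.asarray(lens, dtype=np.intp)
    lpos = np.atleast_1d(np.asarray(lpos, dtype=np.intp))
    ends = np.cumsum(lens)   # exclusive end of each group
    starts = ends - lens     # absolute start of each group
    total = lens.sum()
    if lpos.size and (lpos.min() < 0 or lpos.max() >= total):
        raise IndexError("The absolute index is not within limits.")
    # side="right" sends an index equal to a group's end into the next group.
    outer = np.searchsorted(ends, lpos, side="right")
    inner = lpos - starts[outer]
    return outer, inner
```

With `side="right"`, zero-length groups are skipped naturally, because their end coincides with the end of the preceding group.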

daul.numpy_utils.lengroup_select(arrs, inds)¶

Selects elements from a list of arrays using absolute indices as if the arrays were concatenated.

arrs : list of arrays: The arrays from which to select elements.
inds : array_like: Indices to choose.

>>> arrs = [np.array([10, 11, 12]), np.array([21, 22])]
>>> inds = [0, 3]  
>>> lengroup_select(arrs, inds)
[10, 21]

>>> arrs_2d = [np.array([[0, 1], [2, 3]]), np.array([[4, 5, 6], [7, 8, 9]])]
>>> lengroup_select(arrs_2d, inds)
[array([0, 1]), array([7, 8, 9])]

>>> lengroup_select(arrs_2d, [])
[]
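
For illustration, the same selection can be written without NumPy by bisecting over the running group ends. This is a sketch only; `lengroup_select_sketch` is a hypothetical name, and the bounds check and its message are assumptions rather than documented behavior:

```python
from bisect import bisect_right
from itertools import accumulate


def lengroup_select_sketch(arrs, inds):
    """Hypothetical stand-in for lengroup_select, working on any sequences."""
    lens = [len(a) for a in arrs]
    ends = list(accumulate(lens))  # exclusive end of each group
    total = ends[-1] if ends else 0
    selected = []
    for ind in inds:
        if not 0 <= ind < total:
            raise IndexError("absolute index out of range")
        outer = bisect_right(ends, ind)            # which array
        inner = ind - (ends[outer] - lens[outer])  # position within it
        selected.append(arrs[outer][inner])
    return selected
```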

daul.numpy_utils.lengroup_starts_ends(lens)¶

Obtains starting and ending positions of individual length-groups with lengths specified by lens.

>>> lens = [2, 4, 3]
>>> s, e = lengroup_starts_ends(lens)  
>>> s 
array([0, 2, 6])
>>> e
array([2, 6, 9])

daul.numpy_utils.nonzeros_at(l, pos, val, dtype)¶

Creates an array of a particular length (l), by specifying all non-zero values (val) and their positions (pos).

dtype : dtype: The dtype of the array.

>>> nonzeros_at(l=5, pos=[2, 3], val=1, dtype=int)
array([0, 0, 1, 1, 0])
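
A minimal sketch of the construction, assuming it amounts to allocating zeros and assigning at the given positions; `nonzeros_at_sketch` is a hypothetical name, not the library's implementation:

```python
import numpy as np


def nonzeros_at_sketch(l, pos, val, dtype):
    """Hypothetical stand-in for nonzeros_at."""
    out = np.zeros(l, dtype=dtype)
    # val may be a scalar (broadcast to every position) or one value per position.
    out[np.asarray(pos, dtype=np.intp)] = val
    return out
```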

daul.numpy_utils.onesum_norm(x)¶

Normalizes an array x to sum to one.

>>> onesum_norm(np.array([1, 3]))
array([0.25, 0.75])

daul.numpy_utils.rle(l)¶

Run-length encodes a given list l.

>>> l = ['a', 'a', 'a', 'b', 'c', 'c', 'a']
>>> rle(l)
[('a', 3), ('b', 1), ('c', 2), ('a', 1)]
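
An equivalent encoding can be sketched with `itertools.groupby` from the standard library, which groups consecutive equal items. `rle_sketch` is a hypothetical name, and this is not necessarily how `rle` is implemented:

```python
from itertools import groupby


def rle_sketch(values):
    """Hypothetical stand-in for rle: (value, run length) for each consecutive run."""
    return [(key, sum(1 for _ in run)) for key, run in groupby(values)]
```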

daul.numpy_utils.zaw_1d_concatenate(arrs, default=array([], dtype=float64))¶

Concatenates 1-d arrays arrs, or returns default if an empty list was provided.

>>> default = np.array([], dtype=np.int32)
>>> zaw_1d_concatenate([], default=default)
array([], dtype=int32)

>>> zaw_1d_concatenate([[1, 2], [3]])
array([1, 2, 3])
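
The default is needed because `np.concatenate` raises a `ValueError` when given an empty list. A minimal sketch of the guard, assuming nothing beyond that; `zaw_concat_sketch` is a hypothetical name, and the `None` sentinel is a choice made for this sketch:

```python
import numpy as np


def zaw_concat_sketch(arrs, default=None):
    """Hypothetical stand-in for zaw_1d_concatenate."""
    if len(arrs) == 0:
        # Fall back to an empty float64 array when no default is supplied.
        return np.array([], dtype=np.float64) if default is None else default
    return np.concatenate([np.asarray(a) for a in arrs])
```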

daul.numpy_utils.zero_cumsum_nolast(arr)¶

Returns the cumulative sum of an array arr, but starting at zero and without the last element.

Note

The function is useful when dealing with a list of arrays (of varying lengths). Applying the function to the list of lengths gives the starting positions in absolute indices; np.cumsum() over the same lengths gives the ending positions.

For the sake of illustration, let us show a practical example using a list of strings.

>>> seqs = ['abc', 'defgh', 'ij', 'klmnop']  

Let us obtain their lengths:

>>> arr = list(map(len, seqs))

Now we get their starting and ending positions:

>>> starts = zero_cumsum_nolast(arr)
>>> ends = np.cumsum(arr)  

Suppose the strings are concatenated:

>>> useq = "".join(seqs)  
>>> useq  
'abcdefghijklmnop'

We can select the corresponding elements like this:

>>> i = 1
>>> s, e = starts[i], ends[i]
>>> useq[s:e]
'defgh'
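
The same start and end positions can be reproduced with plain NumPy, since an exclusive cumulative sum equals the inclusive one minus the input. This is a sketch under that observation; `zero_cumsum_nolast_sketch` is a hypothetical name, not the library's implementation:

```python
import numpy as np


def zero_cumsum_nolast_sketch(arr):
    """Hypothetical stand-in for zero_cumsum_nolast (exclusive cumulative sum)."""
    arr = np.asarray(list(arr))
    # cumsum includes each element; subtracting it back shifts the sum by one.
    return np.cumsum(arr) - arr


seqs = ['abc', 'defgh', 'ij', 'klmnop']
lens = [len(s) for s in seqs]
starts = zero_cumsum_nolast_sketch(lens)
ends = np.cumsum(lens)
useq = "".join(seqs)
# Slice every original string back out of the concatenation.
pieces = [useq[s:e] for s, e in zip(starts, ends)]
```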

Shortcuts¶

Abbreviated names of common functions from pandas_utils.

Overview¶

Shortcuts: `DataFrame` columns¶

`pduc`(df, col, values[, copy])	Returns a copy of a frame df with column col set to values.
`pdutc`(df, cola, colb, values[, copy])	Returns a copy of a frame df and sets values of two columns (cola, colb) from a list of 2-len tuples (values).
`pdufc`(df, col, value[, copy])	Returns a copy of a frame df with a column col set to a fixed value value.
`pdca`(df, col, f[, store_as, copy])	Returns a copy of a frame with a function f being applied over a column col and stored as the same column col.

Shortcuts: Conversion of `DataFrame`s to dictionaries¶

`pdtcd`(df)	Creates a dictionary from the first two columns of a frame df (keys — the first column, values — the second column).
`pdtcdf`(df)	Creates a function that maps values from the first column of df to the values in the corresponding rows of the second column.
`pdtcdfd`(df, default)	Creates a function that maps values from the first column of df to the values in the corresponding rows of the second column, with a default value default.
`pdtclvd`(df)	Creates a dictionary from the first two columns of a frame df, (keys — the first column, values — list-aggregated values from the second column for the same key).
`pdtclvdf`(df)	Creates a function that maps keys from the first column of df into a list of values from the second column corresponding to the same key.
`pdtclvdfd`(df, default)	Creates a function that maps keys from the first column of df into a list of values from the second column corresponding to the same key (with a default value default, if no such key is found).

Shortcuts: Inner frames¶

`pdioc`(df, inndf_col, col)	Includes a column from the outer frame as a fixed column in the inner frame.

Shortcuts: Other¶

`pdri`(df[, copy])	Returns a copy of a frame df, with the index having consecutive values in the range from 0 to len(df) - 1.
