Python File Handling

DevOps 2022-09-30 Escape

What you learn from paper always feels shallow; to truly understand a thing, you have to do it yourself.

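1. Opening Files

File handling is probably the most common task in day-to-day work. Reading and writing files goes through the open function, and because that function takes quite a few parameters, understanding it is the key to working with files.

open: the core function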

open(file, mode='r',
    buffering=-1, encoding=None, errors=None,
    newline=None, closefd=True, opener=None)

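# Open a file with the open function.
# open() is called before the try block so that f is always defined in finally.
f = open('note.txt', 'r')
try:
    print(f.read())
finally:
    f.close()


# Use the with statement (file context manager), which closes the file automatically
with open('note.txt', 'r') as f:
    print(f.read())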

(1) mode: specifies how the file is opened

Character	Meaning	Description
r	open for reading (default)	read-only (default)
w	open for writing, truncating the file first	write-only; existing contents are erased
x	create a new file and open it for writing	exclusive creation; fails if the file already exists
a	open for writing, appending to the end of the file if it exists	append
b	binary mode	binary mode
t	text mode (default)	text mode (default)
+	open a disk file for updating (reading and writing)	read and write
U	universal newline mode (deprecated)	deprecated; removed in Python 3.11

                     | r   r+   w   w+   a   a+   x   x+
---------------------|----------------------------------
allow read           | ✓   ✓        ✓        ✓        ✓
allow write          |     ✓    ✓   ✓    ✓   ✓    ✓   ✓
create new file      |          ✓   ✓    ✓   ✓    ✓   ✓
open existing file   | ✓   ✓    ✓   ✓    ✓   ✓
erase file contents  |          ✓   ✓
write after seek     |     ✓    ✓   ✓             ✓   ✓
position at start    | ✓   ✓    ✓   ✓             ✓   ✓
position at end      |                   ✓   ✓

In [1]: f = open('update_chis.sh', 'r')

In [2]: f
Out[2]: <_io.TextIOWrapper name='update_chis.sh' mode='r' encoding='UTF-8'>
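The differences between 'w', 'a' and 'x' in the tables above are easiest to see by running them against the same file. The following is a minimal sketch; the temporary directory and the file name modes_demo.txt are made up for the example.

```python
import os
import tempfile

# Work in a throwaway directory so no real files are touched.
workdir = tempfile.mkdtemp()
path = os.path.join(workdir, 'modes_demo.txt')  # hypothetical file name

# 'w' creates the file (or truncates an existing one).
with open(path, 'w') as f:
    f.write('first\n')

# 'a' keeps the existing contents and writes at the end.
with open(path, 'a') as f:
    f.write('second\n')

with open(path, 'r') as f:
    after_append = f.read()

# 'x' refuses to open a file that already exists.
try:
    open(path, 'x')
    exclusive_failed = False
except FileExistsError:
    exclusive_failed = True

# 'w' again: the previous contents are erased before writing.
with open(path, 'w') as f:
    f.write('third\n')

with open(path, 'r') as f:
    after_truncate = f.read()

print(repr(after_append))    # 'first\nsecond\n'
print(exclusive_failed)      # True
print(repr(after_truncate))  # 'third\n'
```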

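(2) buffering: sets the buffering policy
- Left unset (-1) in text mode: interactive files (those for which isatty() returns True) are line-buffered; other text files use a fixed-size buffer, as in binary mode
- Left unset (-1) in binary mode: a fixed-size buffer is used, with the size chosen from the underlying device's block size (falling back to io.DEFAULT_BUFFER_SIZE)
- 0 turns buffering off (binary mode only); 1 selects line buffering (text mode only); a value greater than 1 sets the buffer size in bytes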

In [4]: f = open('update_chis.sh', buffering=1024)
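To make the effect of buffering visible, the sketch below compares the on-disk file size right after a write with buffering=0 and with a 4096-byte buffer. The file name buffering_demo.bin and the buffer size are arbitrary choices for illustration.

```python
import os
import tempfile

workdir = tempfile.mkdtemp()
path = os.path.join(workdir, 'buffering_demo.bin')  # hypothetical file name

# buffering=0 (binary mode only): every write goes straight to the OS.
unbuffered = open(path, 'wb', buffering=0)
unbuffered.write(b'abc')
size_unbuffered = os.path.getsize(path)  # data is already in the file
unbuffered.close()

# With a buffer larger than the data, nothing reaches the file until flush() or close().
buffered = open(path, 'wb', buffering=4096)
buffered.write(b'abc')
size_before_flush = os.path.getsize(path)
buffered.flush()
size_after_flush = os.path.getsize(path)
buffered.close()

# Text mode does not accept buffering=0.
try:
    open(path, 'w', buffering=0)
    text_unbuffered_ok = True
except ValueError:
    text_unbuffered_ok = False

print(size_unbuffered, size_before_flush, size_after_flush)  # 3 0 3
print(text_unbuffered_ok)                                    # False
```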

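(3) encoding: specifies the text encoding
- The encoding parameter only applies in text mode
- If it is omitted, the default depends on the platform's locale, so passing it explicitly (for example 'utf-8') is safer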

In [7]: f = open('update_chis.sh', encoding='utf-8')
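A short sketch of why the encoding matters: the same file reads back correctly with the encoding it was written in and fails with a mismatched one. The sample string and the file name encoding_demo.txt are assumptions made for the example.

```python
import os
import tempfile

workdir = tempfile.mkdtemp()
path = os.path.join(workdir, 'encoding_demo.txt')  # hypothetical file name

text = '文件处理'  # four non-ASCII characters

# Write and read back with the same explicit encoding.
with open(path, 'w', encoding='utf-8') as f:
    f.write(text)

with open(path, 'r', encoding='utf-8') as f:
    round_trip = f.read()

# In binary mode no decoding happens; each of these characters is 3 bytes in UTF-8.
with open(path, 'rb') as f:
    raw = f.read()

# Reading with an encoding that cannot represent the data raises an error.
try:
    with open(path, 'r', encoding='ascii') as f:
        f.read()
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

print(round_trip == text)  # True
print(len(raw))            # 12
print(ascii_ok)            # False
```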

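(4) errors: specifies how encoding and decoding errors are handled
- errors only applies in text mode
- 'strict' (also the behavior when errors is left as None) raises an exception when data cannot be decoded
- 'ignore' silently skips data that cannot be decoded, which can lose data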

In [20]: with open('xxx.txt', errors='ignore') as f:
    ...:     pass
    ...:
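The sketch below reads one file containing an invalid UTF-8 byte under three error handlers. 'replace' is not covered in the list above; it is another standard handler, included here only for comparison. The helper read_with and the file name are hypothetical.

```python
import os
import tempfile

workdir = tempfile.mkdtemp()
path = os.path.join(workdir, 'errors_demo.txt')  # hypothetical file name

# The byte 0xff can never appear in valid UTF-8.
with open(path, 'wb') as f:
    f.write(b'ab\xffcd')


def read_with(errors):
    """Read the demo file as UTF-8 text using the given error handler."""
    with open(path, 'r', encoding='utf-8', errors=errors) as f:
        return f.read()


ignored = read_with('ignore')    # the bad byte is dropped
replaced = read_with('replace')  # the bad byte becomes U+FFFD

try:
    read_with('strict')
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

print(repr(ignored))   # 'abcd'
print(repr(replaced))  # 'ab\ufffdcd'
print(strict_ok)       # False
```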

(5) newline: controls how line endings are handled
- newline can be None, '', '\n', '\r', or '\r\n'

In [21]: !printf 'a.\nb.\nc.\n' > note.txt

In [23]: cat note.txt
a.
b.
c.

In [24]: f = open('note.txt', newline='\n')

In [25]: f.readlines()
Out[25]: ['a.\n', 'b.\n', 'c.\n']

In [26]: f.close()
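The session above uses a file with only '\n' endings, where every newline setting gives the same result. A file with mixed line endings shows the difference more clearly. This is a small sketch with a made-up file name and contents.

```python
import os
import tempfile

workdir = tempfile.mkdtemp()
path = os.path.join(workdir, 'newline_demo.txt')  # hypothetical file name

# Write raw bytes so the three kinds of line ending are stored exactly as given.
with open(path, 'wb') as f:
    f.write(b'a.\r\nb.\rc.\n')

# newline=None (default): all endings are recognized and translated to '\n'.
with open(path, newline=None) as f:
    translated = f.readlines()

# newline='': all endings are recognized but returned untranslated.
with open(path, newline='') as f:
    untranslated = f.readlines()

# newline='\n': only '\n' ends a line; '\r' is treated as ordinary data.
with open(path, newline='\n') as f:
    only_lf = f.readlines()

print(translated)    # ['a.\n', 'b.\n', 'c.\n']
print(untranslated)  # ['a.\r\n', 'b.\r', 'c.\n']
print(only_lf)       # ['a.\r\n', 'b.\rc.\n']
```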

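2. File Objects

This section covers the most commonly used file object methods.

close method
- Flushes the buffer and closes the file object
- The closed attribute tells you whether the file has been closed
- Prefer with open(...) in real code, since it closes the file automatically

In [27]: f = open('note.txt', mode='rt')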

In [28]: f.close()
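As a quick illustration of the closed attribute and of what the with statement does on exit, consider the sketch below. The file name close_demo.txt is an assumption.

```python
import os
import tempfile

workdir = tempfile.mkdtemp()
path = os.path.join(workdir, 'close_demo.txt')  # hypothetical file name

with open(path, 'w') as f:
    f.write('abc')
    closed_inside = f.closed  # still open inside the block

closed_outside = f.closed     # the with statement has closed it

# Any further I/O on a closed file raises ValueError.
try:
    f.write('more')
    write_after_close_ok = True
except ValueError:
    write_after_close_ok = False

print(closed_inside, closed_outside)  # False True
print(write_after_close_ok)           # False
```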

fileno method
- Returns the underlying file descriptor (an integer)
- The name attribute holds the file name

In [29]: f = open('note.txt', mode='rt')

In [30]: f.fileno
Out[30]: <function TextIOWrapper.fileno>

In [31]: f.fileno()
Out[31]: 11

In [33]: f.name
Out[33]: 'note.txt'

flush method
- Forces the write buffer to be flushed
- Anything still sitting in the buffer is written to the file immediately

In [45]: f = open('note.txt', mode='w')

In [47]: f.write('abc')
Out[47]: 3

In [48]: !cat note.txt

In [49]: f.flush()

In [50]: !cat note.txt
abc

In [51]: f.close()

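read method
- The returned data includes newline characters
- With a size argument, reads at most that many bytes (binary mode) or characters (text mode)
- Without a size argument, reads to the end of the file; passing -1 does the same

In [51]: !printf 'a.\nb.\nc.\n' > note.txt

In [52]: f = open('note.txt', 'r+')

In [53]: f.read(2)
Out[53]: 'a.'

In [54]: f.read(2)
Out[54]: '\nb'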

In [55]: f.close()
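Reading with a size argument is typically used in a loop so that a large file never has to fit in memory at once. The helper below, read_in_chunks, is a hypothetical name and only a sketch of that pattern; it relies on read returning an empty string at end of file.

```python
import os
import tempfile


def read_in_chunks(path, chunk_size=4):
    """Return the file's text as a list of pieces of at most chunk_size characters."""
    chunks = []
    with open(path, 'r') as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:  # '' means end of file
                break
            chunks.append(chunk)
    return chunks


workdir = tempfile.mkdtemp()
path = os.path.join(workdir, 'read_demo.txt')  # hypothetical file name
with open(path, 'w') as f:
    f.write('a.\nb.\nc.\n')

chunks = read_in_chunks(path)
print(chunks)  # ['a.\nb', '.\nc.', '\n']
```

Joining the pieces with ''.join(chunks) gives back the original text, newlines included.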

readline method
- readline reads one line at a time, including the trailing newline, and is often combined with strip to remove it
- readlines reads all lines at once into a list, with newlines included
- At end of file, read and readline return an empty str (or empty bytes), while readlines returns an empty list

In [56]: f = open('note.txt', 'r+')

In [57]: f.readline()
Out[57]: 'a.\n'

In [58]: f.readline()
Out[58]: 'b.\n'

In [59]: f.close()

In [60]: f = open('note.txt', 'r+')

In [61]: f.readlines()
Out[61]: ['a.\n', 'b.\n', 'c.\n']

In [62]: f.close()
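In practice a file object is usually iterated directly, which yields one line at a time just like repeated readline calls. The sketch below strips the trailing newline from each line and then checks the end-of-file return values described above. The file name is made up.

```python
import os
import tempfile

workdir = tempfile.mkdtemp()
path = os.path.join(workdir, 'readline_demo.txt')  # hypothetical file name
with open(path, 'w') as f:
    f.write('a.\nb.\nc.\n')

with open(path) as f:
    # Iterating over the file yields lines lazily, newline included.
    stripped = [line.rstrip('\n') for line in f]
    # The position is now at end of file.
    at_eof = (f.read(), f.readline(), f.readlines())

print(stripped)  # ['a.', 'b.', 'c.']
print(at_eof)    # ('', '', [])
```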

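seek method
- Moves the file position
- The whence parameter says where the offset is measured from
- In text mode, a nonzero offset is only allowed with whence=0; seek(0, 1) and seek(0, 2) are the only other valid calls
- In binary mode, all three whence values can be used with any offset

whence	Description
0	Measure from the start of the file; offset should be zero or positive
1	Measure from the current position; offset may be negative
2	Measure from the end of the file; offset is usually negative

In [63]: f = open('note.txt', 'r+')

In [64]: f.readline()
Out[64]: 'a.\n'

In [65]: f.seek(0)
Out[65]: 0

In [66]: f.readline()
Out[66]: 'a.\n'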

In [67]: f.tell()
Out[67]: 3

In [68]: f.seekable()
Out[68]: True

In [70]: f.closed
Out[70]: False

In [72]: f.close()
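The session above only seeks to the start of a text file. The following sketch, using a made-up ten-byte file, shows whence values 1 and 2 in binary mode and the text-mode restriction on end-relative seeks.

```python
import io
import os
import tempfile

workdir = tempfile.mkdtemp()
path = os.path.join(workdir, 'seek_demo.bin')  # hypothetical file name
with open(path, 'wb') as f:
    f.write(b'0123456789')

with open(path, 'rb') as f:
    f.seek(-3, 2)       # 3 bytes before the end
    tail = f.read()
    f.seek(2)           # absolute position 2 (whence defaults to 0)
    f.seek(3, 1)        # 3 bytes forward from the current position
    middle = f.read(2)
    position = f.tell()

with open(path, 'r') as f:
    end = f.seek(0, 2)  # jumping to the very end is allowed in text mode
    try:
        f.seek(-3, 2)   # a nonzero end-relative offset is not
        text_relative_ok = True
    except io.UnsupportedOperation:
        text_relative_ok = False

print(tail, middle, position)  # b'789' b'56' 7
print(end, text_relative_ok)   # 10 False
```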

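write method
- write never adds a newline for you; it always has to be passed explicitly
- Each call writes exactly the string it is given, so multi-line output needs a '\n' at the end of each line
- writelines also accepts a single string and then behaves like write, because a string is itself an iterable
- writelines takes an iterable and writes each element in turn without adding a newline after it, so '\n' has to be added by hand

In [73]: f = open('note.txt', 'a+')

In [74]: f.writable()
Out[74]: True

In [75]: f.write('d\n')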
Out[75]: 2

In [76]: f.writelines(['e', 'f'])

In [77]: f.flush()

In [78]: !cat note.txt
a.
b.
c.
d
ef
In [79]: f.close()
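Because writelines adds no separators, a common pattern is to append the newline in a generator expression. A minimal sketch, with a hypothetical file name and sample data:

```python
import os
import tempfile

workdir = tempfile.mkdtemp()
path = os.path.join(workdir, 'write_demo.txt')  # hypothetical file name
lines = ['e', 'f', 'g']

# Without newlines the elements run together.
with open(path, 'w') as f:
    f.writelines(lines)
with open(path) as f:
    joined = f.read()

# Add '\n' to each element to get one item per line.
with open(path, 'w') as f:
    f.writelines(line + '\n' for line in lines)
with open(path) as f:
    separated = f.read()

# A plain string is an iterable of characters, so this behaves like write('xyz').
with open(path, 'w') as f:
    f.writelines('xyz')
with open(path) as f:
    from_string = f.read()

print(repr(joined))       # 'efg'
print(repr(separated))    # 'e\nf\ng\n'
print(repr(from_string))  # 'xyz'
```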

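3. Serialization and Deserialization

In a distributed system a lot of data has to travel between machines, so it first has to be converted into a byte stream that can be transmitted. Once the stream arrives at the target machine it has to be converted back into the original data. These two steps are serialization and deserialization.

pickle

Scope
- Python's built-in serialization and deserialization tool
Parameters
- object: the object to persist, i.e. the data to be transmitted.
- file: an object with a write() method that accepts a bytes argument. It can be a file opened in binary write mode, an io.BytesIO object, or any custom object that meets this requirement.
- protocol: optional. Protocol 0 is the human-readable ASCII format, and higher numbers are more compact binary formats. In Python 3 the default is pickle.DEFAULT_PROTOCOL, which is a binary format.

# From the official documentation (in Python 3, dumps returns bytes and loads takes bytes)
Functions:
    dump(object, file[, protocol])
    dumps(object[, protocol]) -> bytes
    load(file) -> object
    loads(bytes) -> object

Example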

In [80]: import pickle

In [81]: dct = { 'a': 1, 'b': 2, 'c':3 }

In [82]: pickle_data = pickle.dumps(dct)

In [83]: pickle_data
Out[83]: b'\x80\x03}q\x00(X\x01\x00\x00\x00aq\x01K\x01X\x01\x00\x00\x00bq\x02K\x02X\x01\x00\x00\x00cq\x03K\x03u.'

In [85]: with open('data.pickle', 'wb') as f:
    ...:     f.write(pickle_data)
    ...:

In [87]: with open('data.pickle', 'rb') as f:
    ...:     print(pickle.loads(f.read()))
    ...:
{'a': 1, 'b': 2, 'c': 3}
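The session above goes through dumps and loads and handles the file by hand. pickle.dump and pickle.load can work on the file object directly, as in the sketch below. The temporary path is an assumption, and the protocol comparison at the end is only there to illustrate the protocol parameter.

```python
import os
import pickle
import tempfile

workdir = tempfile.mkdtemp()
path = os.path.join(workdir, 'data.pickle')  # hypothetical location

data = {'a': 1, 'b': 2, 'c': 3}

# dump writes bytes, so the file must be opened in binary mode.
with open(path, 'wb') as f:
    pickle.dump(data, f)

with open(path, 'rb') as f:
    restored = pickle.load(f)

# Protocol 0 is the ASCII-based format; the highest protocol is a compact binary one.
ascii_form = pickle.dumps(data, protocol=0)
binary_form = pickle.dumps(data, protocol=pickle.HIGHEST_PROTOCOL)

print(restored)                       # {'a': 1, 'b': 2, 'c': 3}
print(restored is data)               # False: loading builds a new object
print(pickle.loads(ascii_form) == pickle.loads(binary_form))  # True
```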

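json

Scope
- pickle and json use different serialization formats, so the serialized output looks different
- JSON is usually the first choice: it is a text format that works across platforms and languages, which makes it convenient to use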

In [89]: import json

In [90]: dct = { 'a': 1, 'b': 2, 'c':3 }

In [91]: json_data = json.dumps(dct)

In [92]: json_data
Out[92]: '{"a": 1, "b": 2, "c": 3}'

In [93]: with open('json_data', 'w') as f:
    ...:     f.write(json_data)
    ...:

In [94]: with open('json_data') as f:
    ...:     data = json.loads(f.read())
    ...:     print(data)
    ...:
{'a': 1, 'b': 2, 'c': 3}
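As with pickle, the json module has dump and load functions that take a file object, which avoids the separate write and read calls. The sketch below uses an assumed temporary path and also shows that a JSON round trip does not preserve every Python type.

```python
import json
import os
import tempfile

workdir = tempfile.mkdtemp()
path = os.path.join(workdir, 'data.json')  # hypothetical location

data = {'a': 1, 'b': 2, 'c': 3}

# JSON is text, so the file is opened in text mode with an explicit encoding.
with open(path, 'w', encoding='utf-8') as f:
    json.dump(data, f)

with open(path, encoding='utf-8') as f:
    restored = json.load(f)

# Tuples come back as lists and non-string keys come back as strings.
round_trip = json.loads(json.dumps({1: ('x', 'y')}))

print(restored)    # {'a': 1, 'b': 2, 'c': 3}
print(round_trip)  # {'1': ['x', 'y']}
```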

Examples

Encoding basic Python object hierarchies::
    >>> import json
    >>> json.dumps(['foo', {'bar': ('baz', None, 1.0, 2)}])
    '["foo", {"bar": ["baz", null, 1.0, 2]}]'
    >>> print(json.dumps("\"foo\bar"))
    "\"foo\bar"
    >>> print(json.dumps('\u1234'))
    "\u1234"
    >>> print(json.dumps('\\'))
    "\\"
    >>> print(json.dumps({"c": 0, "b": 0, "a": 0}, sort_keys=True))
    {"a": 0, "b": 0, "c": 0}
    >>> from io import StringIO
    >>> io = StringIO()
    >>> json.dump(['streaming API'], io)
    >>> io.getvalue()
    '["streaming API"]'

Compact encoding::
    >>> import json
    >>> from collections import OrderedDict
    >>> mydict = OrderedDict([('4', 5), ('6', 7)])
    >>> json.dumps([1,2,3,mydict], separators=(',', ':'))
    '[1,2,3,{"4":5,"6":7}]'

Pretty printing::
    >>> import json
    >>> print(json.dumps({'4': 5, '6': 7}, sort_keys=True, indent=4))
    {
        "4": 5,
        "6": 7
    }

Decoding JSON::
    >>> import json
    >>> obj = ['foo', {'bar': ['baz', None, 1.0, 2]}]
    >>> json.loads('["foo", {"bar":["baz", null, 1.0, 2]}]') == obj
    True
    >>> json.loads('"\\"foo\\bar"') == '"foo\x08ar'
    True
    >>> from io import StringIO
    >>> io = StringIO('["streaming API"]')
    >>> json.load(io)[0] == 'streaming API'
    True

Specializing JSON object decoding::
    >>> import json
    >>> def as_complex(dct):
    ...     if '__complex__' in dct:
    ...         return complex(dct['real'], dct['imag'])
    ...     return dct
    ...
    >>> json.loads('{"__complex__": true, "real": 1, "imag": 2}',
    ...     object_hook=as_complex)
    (1+2j)
    >>> from decimal import Decimal
    >>> json.loads('1.1', parse_float=Decimal) == Decimal('1.1')
    True

Specializing JSON object encoding::
    >>> import json
    >>> def encode_complex(obj):
    ...     if isinstance(obj, complex):
    ...         return [obj.real, obj.imag]
    ...     raise TypeError(repr(obj) + " is not JSON serializable")
    ...
    >>> json.dumps(2 + 1j, default=encode_complex)
    '[2.0, 1.0]'
    >>> json.JSONEncoder(default=encode_complex).encode(2 + 1j)
    '[2.0, 1.0]'
    >>> ''.join(json.JSONEncoder(default=encode_complex).iterencode(2 + 1j))
    '[2.0, 1.0]'


Using json.tool from the shell to validate and pretty-print::
    $ echo '{"json":"obj"}' | python -m json.tool
    {
        "json": "obj"
    }
    $ echo '{ 1.2:3.4}' | python -m json.tool
    Expecting property name enclosed in double quotes: line 1 column 3 (char 2)