line = line.replace(' ', '')
line_len = len(enc.encode(line))
if line_len > max_token_len:
    # If a single line already exceeds the limit, split it into multiple chunks
    num_chunks = (line_len + token_len - 1) // token_len
    for i in range(num_chunks):
        start = i * token_len
        end = start + token_len
        # Avoid splitting in the middle of a word
        while not line[start:end].rstrip().isspace():
            start += 1
            end += 1
            if start >= line_len:
                break
        curr_chunk = curr_chunk[-cover_content:] + line[start:end]
        chunk_text.append(curr_chunk)
    # Handle the last chunk
    start = (num_chunks - 1) * token_len
    curr_chunk = curr_chunk[-cover_content:] + line[start:end]
    chunk_text.append(curr_chunk)

Is there a problem with the logic above for handling a single line that exceeds the limit? Any help would be appreciated 🙏
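Since line has already been through replace, which removes the spaces inside it, won't line[start:end].rstrip().isspace() never be true in the while loop, so that the loop only ends at the break once start >= line_len? That would leave line[start:end] as an empty string.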
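The condition can also be checked on its own. As far as I can tell, s.rstrip().isspace() is False for every string s, whether or not the spaces were removed first: rstrip() drops trailing whitespace, so a whitespace-only slice becomes '' and ''.isspace() is False, while any other slice still contains a non-space character. A minimal check, where the sample strings are made up for illustration:

```python
# Sample slices, made up for illustration: empty, whitespace-only, mixed, plain text.
samples = ["", " ", "   \t", "ab cd", "ab ", "Thisisav"]

# Evaluate the same expression the while loop negates.
results = {s: s.rstrip().isspace() for s in samples}

for s, value in results.items():
    print(repr(s), "->", value)
```

If that holds in general, the while loop can never stop on its own condition and always runs until the break.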
Example code:
line = "Thisisaverylongsentence"
start = 0
end = 8
line_len = len(line)
while not line[start:end].rstrip().isspace():
    print(f"start: {start}, end: {end}, line[start:end]: '{line[start:end]}'")
    start += 1
    end += 1
    if start >= line_len:
        break
print(f"After the loop, line[start:end] is: '{line[start:end]}'")
"""Output
start: 0, end: 8, line[start:end]: 'Thisisav'
start: 1, end: 9, line[start:end]: 'hisisave'
start: 2, end: 10, line[start:end]: 'isisaver'
start: 3, end: 11, line[start:end]: 'sisavery'
start: 4, end: 12, line[start:end]: 'isaveryl'
start: 5, end: 13, line[start:end]: 'saverylo'
start: 6, end: 14, line[start:end]: 'averylon'
start: 7, end: 15, line[start:end]: 'verylong'
start: 8, end: 16, line[start:end]: 'erylongs'
start: 9, end: 17, line[start:end]: 'rylongse'
start: 10, end: 18, line[start:end]: 'ylongsen'
start: 11, end: 19, line[start:end]: 'longsent'
start: 12, end: 20, line[start:end]: 'ongsente'
start: 13, end: 21, line[start:end]: 'ngsenten'
start: 14, end: 22, line[start:end]: 'gsentenc'
start: 15, end: 23, line[start:end]: 'sentence'
start: 16, end: 24, line[start:end]: 'entence'
start: 17, end: 25, line[start:end]: 'ntence'
start: 18, end: 26, line[start:end]: 'tence'
start: 19, end: 27, line[start:end]: 'ence'
start: 20, end: 28, line[start:end]: 'nce'
start: 21, end: 29, line[start:end]: 'ce'
start: 22, end: 30, line[start:end]: 'e'
After the loop, line[start:end] is: ''
"""
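For comparison, here is a rough sketch of one way the split could work without the sliding while loop. It is only an illustration, not the repository's code: split_long_line and chunk_len are hypothetical names, lengths are counted in characters rather than tokens, and the overlap from cover_content is left out.

```python
def split_long_line(line: str, chunk_len: int) -> list[str]:
    """Hypothetical helper: cut `line` into pieces of at most `chunk_len` characters."""
    if chunk_len <= 0:
        raise ValueError("chunk_len must be positive")
    chunks = []
    start = 0
    while start < len(line):
        end = min(start + chunk_len, len(line))
        if end < len(line):
            # Prefer to cut just after the last space or tab inside the window.
            cut = max(line.rfind(" ", start, end), line.rfind("\t", start, end))
            if cut > start:
                end = cut + 1
        # With no usable whitespace in the window, fall back to a hard cut.
        chunks.append(line[start:end])
        start = end
    return chunks


no_spaces = split_long_line("Thisisaverylongsentence", 8)
with_spaces = split_long_line("This is a very long sentence", 8)
print(no_spaces)
print(with_spaces)
```

Each step moves start forward by at least one character, so the loop always terminates, and joining the pieces gives back the original line.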