I want to process a text file that contains a lot of HTML and uuencoded data.
For example, see the .txt file at the following link:
https://www.sec.gov/Archives/edgar/data/1522690/000121390016011794/0001213900-16-011794.txt
I am using the following code:
from bs4 import BeautifulSoup

def strip_non_ascii(string):
    '''Returns the string without non-ASCII characters.'''
    stripped = (c for c in string if 0 < ord(c) < 127)
    return ''.join(stripped)

with open("C:/EDGAR/forms_to_process/10K/20160322_10-K_edgar_data_1522690_0001213900-16-011794_1.txt") as f:
    lines = f.readlines()

# Both files are closed automatically by their `with` blocks.
with open("PROCESSED.txt", 'w', encoding='utf-8') as f1:
    # Parse each line on its own, drop the markup, then drop non-ASCII characters.
    for i, line in enumerate(lines, start=1):
        soup = BeautifulSoup(line, "lxml")
        print(i, "Initial line: ", line)
        print(i, "Soup get text line: ", soup.get_text())
        bs_line = soup.get_text()
        ascii_line = strip_non_ascii(bs_line)
        print(i, "Ascii line: ", ascii_line)
        f1.write(ascii_line)
This reduces the file from 8.5 MB to 2.5 MB, but it still contains many elements I don't need, such as:
</tr>
<tr style="vertical-align: bottom; background-color: #cceeff;">
<td
style="padding: 0px 0px 0px 10pt; text-indent: -10pt;"><font style="font-family: 'times new roman', times, serif;"> </font></td>
<td><font style="font-family: 'times new roman', times, serif;"> </font></td>
<td style="text-align: left;"><font style="font-family: 'times new roman', times, serif;"> </font></td>
<td style="text-align: right;"><font style="font-family: 'times new roman', times, serif;"> </font></td>
<td style="text-align: left;"><font style="font-family: 'times new roman', times, serif;"> </font></td>
<td><font style="font-family: 'times new roman', times, serif;"> </font></td>
<td style="text-align: left;"><font style="font-family: 'times new roman', times, serif;"> </font></td>
<td style="text-align: right;"><font style="font-family: 'times new roman', times, serif;"> </font></td>
<td style="text-align: left;"><font style="font-family: 'times new roman', times, serif;"> </font></td>
and
EXCEL
86
Financial_Report.xlsx
IDEA: XBRL DOCUMENT
begin 644 Financial_Report.xlsx
M4$L#!!0 ( J%=D@6'2-4(0( $8I 3 6T-O;G1E;G1?5'EP97-=
M+GAM;,W:2V[;,! &X*L8VA86S62DZ(U
MW")I8^#?6):'G!EII&_EJV\/@=+BX(8QK:LNY_"!L=1TY&RJ?:"Q1#8^.IO+
M:=RR8)N=W1(3JY5AC1\SC7F9IQS5]=67/<78M[3X> Q,N=>5#6'H&YM[/[+]
MV)YD7?K-IF^H]M31U1=D.=\L- Z5S]8^2I\@UM[-V07U3X\=[5D89Y3>KZ\%CJTZ%D2>6W=56B
MZ5D53C?^K;/>34,+X_:W'=/Y/U[+R4WM[KY[OWO-QX2FJVJI7898%L;M([5?MWH+T\0ZDC_M D56@2*K0)%5H,@J4&05*+(*%%D%BJP215:)(JM$D56BR"I19)4HLDH4626*
MK!)%5HDBJT*15:'(JE!D52BR*A19%8JL"D56A2*K0I%5HMBJP:15:-(JM&D56CR*I19-4HLAH460V*K 9%5H,BJT&1U:#(:E!D-2BR&A19
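Is there a way to remove these and keep only the relevant textual information contained in the text file?
Edit: From the link I provided, an example of the text I want to keep is: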
<P STYLE="font: 10pt/normal Times New Roman,serif; margin: 0; text-align: justify">The table above indicates the current yields
to maturity (YTM) for the senior bonds of selected life insurance carriers with durations, on average, that our similar to our
life insurance portfolio. The average yield to maturity of these bonds was 3.02% which, we believe, reflects in part the
financial market’s judgement that credit risk is low with regard to these carriers’ financial obligations. It should
be noted that the obligations of life insurance carriers to pay life insurance policy benefits is senior in rank to any other obligation.
This “super senior” priority is not reflected in the yield to maturity in the table and, if considered, would result
in a lower yield to maturity all else being equal. As such, as long as the respective premium payments have been made, it is highly
likely that the owner of the insurance policy will collect the insurance policy benefit upon the mortality of the insured.</P>
That is, I want to remove all HTML tags and uuencoded binary data and keep only the text.
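To make the goal concrete for the uuencoded part, here is a minimal sketch of one possible filter. It assumes every uuencoded attachment starts with a `begin <mode> <filename>` line (like `begin 644 Financial_Report.xlsx` above) and stops at a line containing only `end`. The name `drop_uu_blocks` is hypothetical, and the filter is meant to run on the raw lines before any HTML stripping.

```python
import re

# Assumed header shape: "begin", an octal mode such as 644, then a file name.
BEGIN_RE = re.compile(r"^begin [0-7]{3,4} \S")

def drop_uu_blocks(lines):
    """Yield only the lines that lie outside a `begin ... end` uuencoded block."""
    inside = False
    for line in lines:
        stripped = line.rstrip("\r\n")
        if not inside and BEGIN_RE.match(stripped):
            inside = True          # header line: start skipping
            continue
        if inside:
            if stripped == "end":  # terminator line: stop skipping
                inside = False
            continue
        yield line
```

A caveat of this sketch: if the closing `end` line is missing, everything after the `begin` line is dropped.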
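Edit 2:
Gerrit's answer below is definitely very close to what I want to achieve, at least for the .txt file under consideration. However, it still leaves the following part at the end of the file: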
Actuarial Pricing Systems, LP Model Actuarial
Pricing Systems, LP 33(Q7.U=JG''<]S7/R,ZG4BCJ0V3TKG/'&I;?V=X:N-K;9;C]RA^O4_EFG
M:==/<^*KESYJ(^GP2")\_*26SQV-%M9T2^ER$N(E=_96.&'X J:]=&,<=*\L\2V
MWB>ZTU9M7LH$M[;D-$5!4'CL3QTKH]*\07E[I&CVUFT;(NYU=))9E+!!&!G@$
M9)RO?O6N(3G%3OKL88:2IRE"SMNCL=X]*7--R3Z'/J"VI>Y=WC\L,/)7RB<9
MSR?>CD8O:)['4%@!D\#UKE_'K!O",S @CS(R#G_:%5AKUS=23VDLUO<03V<[
MI)#"Z!2HZ MPXP>HJ'Q!@?#*UQ_SRM^G_ :TI1:J1OW1E6FG3E;LS)70=)?X
M>KJDR>7>>4S"7>?F8,0!CH<]*W_AU<3/X==)22D<[)%GLN G1LGELK%Y,Q;NN>.3R>^>
MU;5IIIPO=W,*$&I1G;:RM]YUV[C.* V:YBPU'4KQX[33Q:0I;6L#R"56;<77(
M5<'@ #KS4"ZI=P2R0V,5LDL^JR6Y+[B.$SN//7CH./I7+RL[>=;G79J.;?Y;
M>7]_:=OUQQ7+2>([Z&W6;*>2TAG%Z]I)<%28U"KNW;_N9M#%]J
M6U6(=SLC*@("<'!YY S^-)Q:!33T/-/#;Z,NHW2>)(B97; >7.U7R=V[N#[F
MO3=$TO3]+LW73&+6\SF4'?O'(QP?3BN?U:#PGX@L)+\7=M%-LW"='VOTXW+W
M^A%8O@W4;^QT'5)X4\R"V>.0HP) &?W@'OMYKLJIU(.:NMM'M\CBHVI3Y&D;]
M]5^IZAFDWO7[&]EL8U>TAFCA$PC:0J-NYWV@Y8#*C ]Z;%>ZC=Z_I
MC07]M);2V;2OY<;;'PR@D#/7GCTYZURM1VNNZ[=)IKJM@HU'>D8*M^Z*Y.X\\\ \<N>*ZN*020QOO1M
MR@[D/!]Q[5+BUN4I)G@##YC]:3%.;[Q^M)7T9\M<3%:WAO\ Y#D'T;^1K*K4
M\.?\AR#Z-_(UE7_AR-B23R64-RK3D%]RNW0D\9Z=33+?1-'MM9?5HK>X%TQ9B2C[
M06ZD#'^/\ 9%.TO6[IY[:QDMY9B$03W&[.
M&9-^>@&.0*J\M7W8NRTI8_P"BQ(%V,N\*#N'S @L,@_A4
M=QX@U18YW2TAAV6;3[)F.X,'*YZ=.,]J/>[A[BOH6;K1-)NWN#(EZ([EM\T2
M&14=O[Q4=^E33Z=837+W*-?V\LBA9&MS;(GF <#=CJ??K5.;Q#=V=U/$8/M,K
M2X2)"<*!$C, 0.>3QFGR^)+M))!'IZLB-(H+SX/[M0S9&..#^='O#O THH;.
M&^>\1+CSWB6%BR.:J0Z1I4.MR:NL5R;Q\Y9E<@9&#
This appears to be a uuencoded binary section. Any idea how to get rid of it?
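For leftovers like these, where no `begin` header survives, one option is a per-line heuristic based on the uuencode line format: every character lies in the range 0x20 to 0x60, and the first character encodes how many bytes the line carries, which fixes the line length (a full line starts with `M` and is 61 characters long). The sketch below is only an assumption about what might work here. The names `looks_uuencoded` and `strip_uu_lines` and the `min_bytes` parameter are hypothetical, and the check must run on the raw lines, since it fails once whitespace inside a line has been collapsed or trimmed.

```python
import math

def looks_uuencoded(line, min_bytes=45):
    """Heuristic: True if `line` has the shape of a uuencoded data line."""
    line = line.rstrip("\r\n")
    if not line:
        return False
    # uuencoded data only uses characters from 0x20 (space) to 0x60 (backtick).
    if any(not (32 <= ord(c) <= 96) for c in line):
        return False
    # The first character encodes the number of payload bytes on the line.
    n_bytes = (ord(line[0]) - 32) & 63
    if not (min_bytes <= n_bytes <= 45):
        return False
    # Each group of 3 payload bytes becomes 4 characters, plus the length character.
    return len(line) == 1 + 4 * math.ceil(n_bytes / 3)

def strip_uu_lines(lines, min_bytes=45):
    """Return the lines that do not look like uuencoded data."""
    return [ln for ln in lines if not looks_uuencoded(ln, min_bytes)]
```

With the default `min_bytes=45` only full-length lines are removed, so the shorter last line of each block survives. Lowering `min_bytes` catches those too, at the cost of more false positives on short all-uppercase lines.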