A Handy Tool for English Translation

231469242
June 19, 2015
3 min read

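If an English text is highly specialized, translating it takes a lot of work.

To make translation easier, I wrote a program that automatically counts the core vocabulary of an article.

An article typically has only a dozen or so core words, and each of them may appear 10 to 150 times. Once you have found them, the translation workload can drop by about 80 percent, by my estimate.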
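As a rough illustration of this idea, the sketch below counts word frequencies in a short text and reports what share of all word occurrences the most frequent words account for. The function name `coverage_of_top_words`, the sample sentence, and the value of `top_n` are assumptions made for this example; they are not part of the original program.

```python
from collections import Counter
import string


def coverage_of_top_words(text, top_n):
    """Return (top_words, share), where share is the fraction of all
    word occurrences accounted for by the top_n most frequent words."""
    # Lowercase, split on whitespace, strip surrounding punctuation
    words = [w.strip(string.punctuation) for w in text.lower().split()]
    words = [w for w in words if w]  # drop tokens that were pure punctuation
    counts = Counter(words)
    top_words = counts.most_common(top_n)
    covered = sum(count for _, count in top_words)
    share = covered / len(words) if words else 0.0
    return top_words, share


if __name__ == "__main__":
    # Made-up sample sentence, for illustration only
    sample = "the lung and the heart; the lung is an organ, the heart is an organ"
    top_words, share = coverage_of_top_words(sample, 3)
    print(top_words)
    print("share of occurrences covered: %.2f" % share)
```

The higher the share covered by a handful of words, the more a reader gains from learning those words first.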

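The program comes in a basic version and an advanced version.

The basic version is meant for teaching students.

The advanced version is meant for professional translation and reading.

Demo of the basic version:

Take gettysburg (Lincoln's Gettysburg Address) as an example: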

Four score and seven years ago, our fathers brought forth on this continent a new nation, conceived in Liberty, and dedicated to the proposition that all men are created equal. Now we are engaged in a great civil war, testing whether that nation, or any nation so conceived and so dedicated, can long endure. We are met on a great battle field of that war. We have come to dedicate a portion of that field as a final resting place for those who here gave their lives that that nation might live. It is altogether fitting and proper that we should do this. But, in a larger sense, we can not dedicate, we cannot consecrate, we cannot hallow this ground. The brave men, living and dead, who struggled here, have consecrated it far above our poor power to add or detract. The world will little note, nor long remember what we say here, but it can never forget what they did here. It is for us, the living, rather, to be dedicated here to the unfinished work which they who fought here have thus far so nobly advanced. It is rather for us to be here dedicated to the great task remaining before us, that from these honored dead we take increased devotion to that cause for which they gave the last full measure of devotion, that we here highly resolve that these dead shall not have died in vain, that this nation, under God, shall have a new birth of freedom, and that government of the people, by the people, for the people, shall not perish from the earth.

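Source code of the basic version:

It uses six functions and three libraries. Later extensions will add more features, for example having the program pick out the meaningful high-frequency verbs and nouns and display them graphically.

# Iterate over the words in each line, clean each word by stripping punctuation and odd symbols, then count how many times each word occurs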

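# Check whether a word is in the stop-word list
def none_sense_words_list_judeg(word1):
    # Return word1 if it is NOT a stop word, otherwise return None.
    # nonesense_words.txt holds the stop words, separated by whitespace.
    list_noneSense_words = []
    with open("nonesense_words.txt", 'r') as file_noneSense_words:
        for line in file_noneSense_words:
            line_list = line.split()
            for word in line_list:
                list_noneSense_words.append(word)
    if word1 not in list_noneSense_words:
        return word1
    return None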
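The function above rereads the stop-word file every time it is called. One possible alternative, sketched here on the assumption that the file holds whitespace-separated words, is to load the words once into a set and test membership against it. The names `load_stop_words` and `is_meaningful` are hypothetical, and lowercasing the stop words is my own choice rather than something the original code does.

```python
def load_stop_words(lines):
    """Build a set of stop words from an iterable of text lines
    (for example an open file); words are separated by whitespace."""
    stop_words = set()
    for line in lines:
        stop_words.update(line.lower().split())
    return stop_words


def is_meaningful(word, stop_words):
    """Return True if word is non-empty and not a stop word."""
    return bool(word) and word not in stop_words


if __name__ == "__main__":
    # With a real file one could write:
    #     with open("nonesense_words.txt") as f:
    #         stop_words = load_stop_words(f)
    # Here a small made-up list stands in for the file.
    stop_words = load_stop_words(["the a an", "and or but"])
    print(is_meaningful("nation", stop_words))
    print(is_meaningful("the", stop_words))
```

A set gives constant-time lookups on average, which matters once the stop-word list or the article grows.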

import string

def processLine(line, wcDict):
    # Split a line into words, clean each word, and count it
    line = line.strip()
    wordList = line.split()
    for word in wordList:
        if word != '--':
            word = word.lower()
            word = word.strip()
            word = word.strip(string.punctuation)
            # Skip empty strings and stop words
            if word and word == none_sense_words_list_judeg(word):
                addWord(word, wcDict)

# Count the occurrences of each word
def addWord(w, wcDict):
    if w in wcDict:
        wcDict[w] += 1
    else:
        wcDict[w] = 1

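# Graphical display
import numpy
import pylab

def barGragph(wcDict):
    # Plot only words that occur more than twice and are longer than three letters
    wordList = []
    for key, val in wcDict.items():
        if val > 2 and len(key) > 3:
            wordList.append((key, val))
    wordList.sort()
    keyList = [key for key, val in wordList]
    valList = [val for key, val in wordList]
    # Tested: with barWidth=2 the bars overlap; with barWidth=1 they touch
    # with no gap; below 0.5 they get narrower. Adjust barWidth to the
    # number of bars shown; 0.5 or less works best.
    barWidth = 0.5
    xVals = numpy.arange(len(keyList))
    # With xVals+barWidth/2 each tick label sits at the center of its bar
    pylab.xticks(xVals + barWidth / 2.0, keyList, rotation=45)
    # align='edge' puts each bar's left edge at xVals so the tick offset above holds
    pylab.bar(xVals, valList, width=barWidth, color='r', align='edge')
    pylab.title("gettysburg's valuable words")
    pylab.show()
    # A pie chart is more concise, just three lines (optional):
    # pylab.pie(valList, labels=keyList)
    # pylab.title("gettysburg's valuable words")
    # pylab.show()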
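The chart code filters words by count and length and then sorts them alphabetically. If you would rather see the most frequent words first, a selection step like the sketch below could run before plotting. The function name `select_for_chart` and its parameters are hypothetical; the default thresholds mirror the `val>2` and `len(key)>3` filter used above, and `top_n` is an extra limit I added.

```python
def select_for_chart(wc_dict, min_count=3, min_length=4, top_n=10):
    """Pick (word, count) pairs worth plotting, most frequent first."""
    selected = [(word, count) for word, count in wc_dict.items()
                if count >= min_count and len(word) >= min_length]
    # Sort by descending count, then alphabetically to break ties
    selected.sort(key=lambda pair: (-pair[1], pair[0]))
    return selected[:top_n]


if __name__ == "__main__":
    # Made-up counts, for illustration only
    counts = {"nation": 5, "we": 10, "dead": 3, "war": 2, "here": 8}
    print(select_for_chart(counts))
```

The returned list can be unpacked into labels and values in the same way `keyList` and `valList` are built above.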

# Pretty-print the results
def prettyPrint(wcDict):
    valKeyList = []
    for key, val in wcDict.items():
        # Store (val, key) instead of (key, val) so the new list sorts by count
        valKeyList.append((val, key))
    valKeyList.sort(reverse=True)  # sort(reverse=True) orders counts from high to low
    print('%-10s%10s' % ('word', 'count'))
    print('-' * 21)
    for val, key in valKeyList:
        # Swap back to key, val order for display
        print("%-12s %3d" % (key, val))
    barGragph(wcDict)

def main():
    wcDict = {}  # a dictionary is mutable, so it does not need to be declared again
    with open('gettysburg2.txt', 'r') as fObj:
        for line in fObj:
            processLine(line, wcDict)
    prettyPrint(wcDict)

if __name__ == '__main__':
    main()

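Features of the advanced version:

1. The program automatically filters out word variants, for example goes, gone, and going as variants of go. This shrinks the vocabulary and shortens the run time.

2. It compares different articles to find the words they share and the words that differ.

3. It automatically analyzes how difficult a sample article is to translate.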
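The first two features can be sketched roughly as follows. The post does not show how the advanced version detects variants, so this example simply assumes a small hand-written map called `VARIANTS`; the function names `base_form`, `vocabulary`, and `compare_vocabularies` are hypothetical as well. A real program would more likely rely on a stemming or lemmatization library.

```python
# Hypothetical hand-written map from inflected form to base form.
# The original post does not show how variants are detected.
VARIANTS = {"goes": "go", "gone": "go", "going": "go", "went": "go"}


def base_form(word):
    """Map a known variant to its base form; leave other words unchanged."""
    return VARIANTS.get(word, word)


def vocabulary(words):
    """Return the set of base forms that occur in a list of words."""
    return {base_form(w.lower()) for w in words}


def compare_vocabularies(words_a, words_b):
    """Return (shared, only_in_a, only_in_b) as sets of base forms."""
    vocab_a = vocabulary(words_a)
    vocab_b = vocabulary(words_b)
    return vocab_a & vocab_b, vocab_a - vocab_b, vocab_b - vocab_a


if __name__ == "__main__":
    shared, only_a, only_b = compare_vocabularies(
        ["She", "goes", "home"], ["they", "went", "home", "early"])
    print(sorted(shared), sorted(only_a), sorted(only_b))
```

Folding variants before comparing means that goes in one article and went in another count as the same word.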

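Demo of the advanced version:

The sample is a specialized English article about pneumonia:

https://en.wikipedia.org/wiki/Pneumonia

The program automatically extracts the core words of this article. Memorize them and you can read it without difficulty.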

Toby

Biology English Translation / Data Analysis
