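This chapter explains how to extract simple text data from a Word document using Java. If you want to extract metadata from a Word document, use Apache Tika.
For .docx files, we use the class org.apache.poi.xwpf.extractor.XWPFWordExtractor, which extracts and returns the simple text of a Word file. In the same way, there are separate methods for extracting headings, footnotes, table data, and so on from a Word file.
The following code shows how to extract simple text from a Word file: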
import java.io.FileInputStream;
import org.apache.poi.xwpf.extractor.XWPFWordExtractor;
import org.apache.poi.xwpf.usermodel.XWPFDocument;

public class WordExtractor {

   public static void main(String[] args) throws Exception {

      XWPFDocument docx = new XWPFDocument(
         new FileInputStream("create_paragraph.docx"));

      //using XWPFWordExtractor Class
      XWPFWordExtractor we = new XWPFWordExtractor(docx);
      System.out.println(we.getText());
   }
}
Save the above code as WordExtractor.java. Compile and run it from the command prompt as follows:
$javac WordExtractor.java
$java WordExtractor
It generates the following output:
At tutorialspoint.com, we strive hard to provide quality tutorials for self-learning purpose in the domains of Academics, Information Technology, Management and Computer Programming Languages.
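To make it clearer what "simple text" means here, the following is a rough sketch of the idea using only the JDK, without Apache POI. It relies on the fact that a .docx file is a ZIP archive whose body text lives in word/document.xml, inside w:t elements grouped into w:p paragraphs. The class name DocxPlainText and the helper sampleDocx are hypothetical names made up for this illustration, and this is not how XWPFWordExtractor is actually implemented; it ignores headers, footnotes, tabs, line breaks, and nested paragraphs such as text boxes.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical illustration class; not part of Apache POI.
public class DocxPlainText {

   // Namespace of WordprocessingML elements such as w:p and w:t.
   static final String W_NS =
      "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

   // Finds word/document.xml in the archive and returns one line per paragraph.
   public static String extractText(InputStream docx) throws Exception {
      try (ZipInputStream zip = new ZipInputStream(docx)) {
         ZipEntry entry;
         while ((entry = zip.getNextEntry()) != null) {
            if (entry.getName().equals("word/document.xml")) {
               // readAllBytes (Java 9+) stops at the end of the current entry.
               return textOf(zip.readAllBytes());
            }
         }
      }
      throw new IOException("word/document.xml not found");
   }

   // Concatenates the w:t text of each w:p paragraph, adding a newline after each.
   private static String textOf(byte[] xml) throws Exception {
      DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
      factory.setNamespaceAware(true);
      Document dom = factory.newDocumentBuilder()
         .parse(new ByteArrayInputStream(xml));

      StringBuilder out = new StringBuilder();
      NodeList paragraphs = dom.getElementsByTagNameNS(W_NS, "p");
      for (int i = 0; i < paragraphs.getLength(); i++) {
         Element paragraph = (Element) paragraphs.item(i);
         NodeList texts = paragraph.getElementsByTagNameNS(W_NS, "t");
         for (int j = 0; j < texts.getLength(); j++) {
            out.append(texts.item(j).getTextContent());
         }
         out.append('\n');
      }
      return out.toString();
   }

   // Builds a tiny in-memory archive holding only word/document.xml.
   // It is enough for this demo but is not a complete, valid .docx file.
   // The paragraph strings must not contain XML special characters.
   public static byte[] sampleDocx(String... paragraphs) throws IOException {
      StringBuilder xml = new StringBuilder();
      xml.append("<w:document xmlns:w=\"").append(W_NS).append("\"><w:body>");
      for (String p : paragraphs) {
         xml.append("<w:p><w:r><w:t>").append(p).append("</w:t></w:r></w:p>");
      }
      xml.append("</w:body></w:document>");

      ByteArrayOutputStream buffer = new ByteArrayOutputStream();
      try (ZipOutputStream zip = new ZipOutputStream(buffer)) {
         zip.putNextEntry(new ZipEntry("word/document.xml"));
         zip.write(xml.toString().getBytes(StandardCharsets.UTF_8));
         zip.closeEntry();
      }
      return buffer.toByteArray();
   }

   public static void main(String[] args) throws Exception {
      byte[] docx = sampleDocx("Hello", "World");
      System.out.print(extractText(new ByteArrayInputStream(docx)));
   }
}
```

For real documents the POI extractor shown earlier remains the better choice, since it deals with the parts of the format that this sketch skips.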